2021-11-02 Meeting notes
Date
Nov 2, 2021
Participants
@Edmond Chuc
Uwe Schindler
@Javier Sanchez
@TERN Data
@Gerhard Weis
Goals
Discussion topics
| Time | Item | Presenter | Notes |
|---|---|---|---|
Meeting notes
Javi went through his report with Uwe and discussed his findings
Javi to check the mapping and the dynamic field types inferred by Elasticsearch. Uwe advises using the inferred types with caution
Javi - a question about caching - performance is hard to test since Elasticsearch caches query results
Uwe - for proper testing, restart Elasticsearch or clear the filesystem cache between runs
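A minimal sketch of what the cache handling around a benchmark run could look like with the Python Elasticsearch client (8.x-style keyword arguments); the index name ecoplots and the cluster address are placeholders:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder address

# Clear Elasticsearch's own caches (query, request, and fielddata) for the index.
# Note this does not touch the OS filesystem cache; clearing that requires
# dropping it at the OS level or restarting the node, as Uwe suggests.
es.indices.clear_cache(index="ecoplots")

# While benchmarking, individual searches can also bypass the shard request cache.
es.search(index="ecoplots", query={"match_all": {}}, request_cache=False)
```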
Uwe - do we do just terms aggregations? Javi - Yes, mostly terms aggregations
Uwe - Do we do range queries? Yes, for a few things
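For reference, a sketch of the two query shapes discussed - a terms aggregation and a range query; the field names region and date are illustrative, not the real schema:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder address

# Terms aggregation: bucket documents by an exact (keyword) field value.
resp = es.search(
    index="ecoplots",
    size=0,  # aggregation only, no hits needed
    aggs={"by_region": {"terms": {"field": "region", "size": 50}}},
)

# Range query: match documents whose date falls inside an interval.
resp = es.search(
    index="ecoplots",
    query={"range": {"date": {"gte": "2020-01-01", "lte": "2021-12-31"}}},
)
```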
Javi's findings - most things are slightly faster, though it is hard to fully rule out caching effects in the tests
Javi - with no nested documents, the number of indexed Lucene documents equals the actual number of records (nested fields create hidden Lucene documents). This is good as our data grows.
Javi - happy with the results and will implement for real
Uwe - regex for regions - can be improved
Uwe - How should we proceed now?
Schema looks fine, not much more to improve
Field type guessing in the mapping - should be improved (see the mapping sketch below)
Javi - worth trying to denormalise the regions
Uwe - a very small improvement: remove the labels that are only used by the user-interface
The datatype field is not really used in the user-interface currently
Date-time values are checked so they can be formatted
It will be used in the future for filtering values
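A minimal sketch of an explicit mapping that avoids dynamic field-type guessing, assuming illustrative field names rather than the real Ecoplots schema:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder address

# Declare the field types up front instead of letting Elasticsearch guess them.
es.indices.create(
    index="ecoplots",
    mappings={
        "dynamic": "strict",  # reject any field that is not declared below
        "properties": {
            "region": {"type": "keyword"},  # exact values, good for terms aggregations
            "date": {"type": "date"},       # enables range queries and date formatting
            "value": {"type": "double"},
        },
    },
)
```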
Have separate indexes for data types behind an alias
Javi - question about splitting into separate indexes
What is the performance of querying an alias with multiple indexes versus querying just one index?
Uwe - it might get faster with an alias because the underlying indexes are queried in parallel (as with sharding)
Good idea not to create too many indexes; group them by similar attributes
Use just one shard per index, not multiple
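A sketch of the split-indexes-behind-an-alias idea, with single-shard indexes grouped by similar attributes; all index and alias names are placeholders:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder address

# One index per group of similar attributes, each with a single shard,
# all reachable through the same alias.
for name in ["ecoplots-soil", "ecoplots-vegetation"]:
    es.indices.create(
        index=name,
        settings={"number_of_shards": 1, "number_of_replicas": 1},
        aliases={"ecoplots": {}},
    )

# Searching the alias fans the query out to all underlying indexes in parallel.
es.search(index="ecoplots", query={"match_all": {}})
```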
Fact - there are approximately 15 small Lucene indexes (segments) per Elasticsearch index
Uwe - question, are the indexes static after creation?
Yes - we are not performing incremental updates
Improvement - merge indexes into one large Lucene segment
Only do this if the index doesn't change afterwards
Why do this? Each small write creates small Lucene segments. As indexing goes on, Elasticsearch merges them into a larger segment once enough accumulate, so segment sizes run from large down to small (in logarithmic fashion).
Combining them improves performance by reducing the number of Lucene segments. See the Elasticsearch Force merge API
Using the Elasticsearch Force merge API at the end of ingest will merge the Lucene indexes into one big segment.
Good to merge when the index becomes read-only
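A sketch of the force-merge step, run only once ingest is finished and the index is read-only (the index name is a placeholder):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder address

# Collapse all the small Lucene segments into a single large one
# (see the max_num_segments note below).
es.indices.forcemerge(index="ecoplots", max_num_segments=1)
```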
Uwe - switch off the refresh interval. We don't query the index while we are indexing, and refreshing is expensive, so switch it off. See Unset or increase the refresh interval
Call the Refresh API at the end once ingest is done.
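A sketch of the refresh-interval handling around a bulk ingest; the index name and the restored interval are placeholders:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder address

# Disable refreshes while bulk indexing; the index is not queried during ingest.
es.indices.put_settings(index="ecoplots", settings={"refresh_interval": "-1"})

# ... bulk indexing happens here ...

# Restore a normal refresh interval and make the new documents searchable.
es.indices.put_settings(index="ecoplots", settings={"refresh_interval": "1s"})
es.indices.refresh(index="ecoplots")
```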
Uwe - Test if queries are faster after performing force merge
max_num_segments should be set to 1. This ensures that, after the force merge, the index contains a single Lucene segment.
The number of files, and the number of open file handles, will be dramatically lower
Javi - what triggers merge?
Uwe - only during indexing; it depends on the segment sizes
Why doesn't Elasticsearch perform this automatically? Because it usually expects further incremental updates to an index; it only merges when segments meet some size threshold.
Next week Uwe is visiting his parents, so the next meeting is in two weeks.
Send Uwe the dynamic mapping once it’s updated with the changes
Javi - will implement the changes and suggestions
The max and min heap are both set to half the machine's RAM (8 GB)
2 data nodes, 3 masters, 2 clients
Our hard disks are very slow (no SSD)
Uwe - recommends reviewing this:
Never go past 30 GB of heap (above that, the JVM loses compressed object pointers)
Use as little heap space as possible, leaving as much RAM as possible for the filesystem (disk) cache
Avoid swap memory
If you don't get out-of-memory exceptions, don't change the heap size
For each node, only run Elasticsearch, nothing else
Data nodes should have as much memory as possible outside of the heap
Add more nodes for more parallelism during queries
Use iotop to see IO usage
Possibly provide Uwe access to Elasticsearch or the virtual machines.
More details:
TERN Ecoplots Elasticsearch Mapping Recommendations - Google Docs