2021-11-02 Meeting notes
Date
Nov 2, 2021
Participants
@Edmond Chuc
Uwe Schindler
@Javier Sanchez
@TERN Data
@Gerhard Weis
Goals
Discussion topics
| Time | Item | Presenter | Notes |
|---|---|---|---|
Meeting notes
Javi went through his report with Uwe and discussed his findings
Javi to check the mapping and the dynamic field types inferred by Elasticsearch. Uwe advises using the inferred types with caution
Javi - a question about caching - performance is hard to test since Elasticsearch caches query results
Uwe - for proper testing, restart Elasticsearch or clear the filesystem cache between runs
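A minimal sketch of what the cache handling around a benchmark run could look like with the Python Elasticsearch client (8.x-style keyword arguments); the index name ecoplots and the cluster address are placeholders:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder address

# Clear Elasticsearch's own caches (query, request, and fielddata) for the index.
# Note this does not touch the OS filesystem cache; clearing that requires
# dropping it at the OS level or restarting the node, as Uwe suggests.
es.indices.clear_cache(index="ecoplots")

# While benchmarking, individual searches can also bypass the shard request cache.
es.search(index="ecoplots", query={"match_all": {}}, request_cache=False)
```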
Uwe - do we do just terms aggregations? Javi - Yes, mostly terms aggregations
Uwe - Do we do range queries? Yes, for a few things
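For reference, a sketch of the two query shapes discussed - a terms aggregation and a range query; the field names region and date are illustrative, not the real schema:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder address

# Terms aggregation: bucket documents by an exact (keyword) field value.
resp = es.search(
    index="ecoplots",
    size=0,  # aggregation only, no hits needed
    aggs={"by_region": {"terms": {"field": "region", "size": 50}}},
)

# Range query: match documents whose date falls inside an interval.
resp = es.search(
    index="ecoplots",
    query={"range": {"date": {"gte": "2020-01-01", "lte": "2021-12-31"}}},
)
```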
Javi's findings - most things are slightly faster, though it is hard to fully rule out caching effects in the tests
Javi - with no nested documents, the number of indexed Lucene documents equals the actual number of records (nested fields create hidden Lucene documents). This is good as our data grows.
Javi - happy with the results and will implement for real
Uwe - regex for regions - can be improved
Uwe - How should we proceed now?
Schema looks fine, not much more to improve
Field type guessing in the mapping - should be improved (see the mapping sketch below)
Javi - worth trying to denormalise the regions
Uwe - a very small improvement: remove the labels that are only used by the user-interface
The datatype field is not really used in the user-interface currently
Date-time values are checked so they can be formatted
It will be used in the future for filtering values
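A minimal sketch of an explicit mapping that avoids dynamic field-type guessing, assuming illustrative field names rather than the real Ecoplots schema:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder address

# Declare the field types up front instead of letting Elasticsearch guess them.
es.indices.create(
    index="ecoplots",
    mappings={
        "dynamic": "strict",  # reject any field that is not declared below
        "properties": {
            "region": {"type": "keyword"},  # exact values, good for terms aggregations
            "date": {"type": "date"},       # enables range queries and date formatting
            "value": {"type": "double"},
        },
    },
)
```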
Have separate indexes for data types behind an alias
Javi - question about splitting into separate indexes
What is the performance of querying an alias with multiple indexes versus querying just one index?
Uwe - it might get faster with an alias because the underlying indexes are queried in parallel (as with sharding)
Good idea not to create too many indexes; group them by similar attributes
Use just one shard per index, not multiple
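A sketch of the split-indexes-behind-an-alias idea, with single-shard indexes grouped by similar attributes; all index and alias names are placeholders:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder address

# One index per group of similar attributes, each with a single shard,
# all reachable through the same alias.
for name in ["ecoplots-soil", "ecoplots-vegetation"]:
    es.indices.create(
        index=name,
        settings={"number_of_shards": 1, "number_of_replicas": 1},
        aliases={"ecoplots": {}},
    )

# Searching the alias fans the query out to all underlying indexes in parallel.
es.search(index="ecoplots", query={"match_all": {}})
```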
Fact - there are approximately 15 small Lucene indexes (segments) per Elasticsearch index
Uwe - question, are the indexes static after creation?
Yes - we are not performing incremental updates
Improvement - merge indexes into one large Lucene segment
Only do this if the index doesn't change afterwards
Why do this? Each small write creates small Lucene segments. As indexing goes on, Elasticsearch merges them into a larger segment once enough accumulate, so segment sizes run from large down to small (in logarithmic fashion).
Combining them improves performance by reducing the number of Lucene segments. See the Elasticsearch Force merge API
Using the Elasticsearch Force merge API at the end of ingest will merge the Lucene indexes into one big segment.
Good to merge when the index becomes read-only
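A sketch of the force-merge step, run only once ingest is finished and the index is read-only (the index name is a placeholder):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder address

# Collapse all the small Lucene segments into a single large one
# (see the max_num_segments note below).
es.indices.forcemerge(index="ecoplots", max_num_segments=1)
```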
Uwe - switch off the refresh interval. We don't query the index while we are indexing, and refreshing is expensive, so switch it off. See Unset or increase the refresh interval
Call the Refresh API at the end once ingest is done.
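A sketch of the refresh-interval handling around a bulk ingest; the index name and the restored interval are placeholders:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder address

# Disable refreshes while bulk indexing; the index is not queried during ingest.
es.indices.put_settings(index="ecoplots", settings={"refresh_interval": "-1"})

# ... bulk indexing happens here ...

# Restore a normal refresh interval and make the new documents searchable.
es.indices.put_settings(index="ecoplots", settings={"refresh_interval": "1s"})
es.indices.refresh(index="ecoplots")
```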
Uwe - Test if queries are faster after performing force merge
max_num_segments should be set to 1. This ensures that, after the force merge, the index contains a single Lucene segment.
The number of files, and the number of open file handles, will be dramatically lower
Javi - what triggers merge?
Uwe - only during indexing; it depends on the segment sizes
Why doesn't Elasticsearch perform this automatically? Because it usually expects further incremental updates to an index; it only merges when segments meet some size threshold.
Next week Uwe is visiting his parents, so the next meeting is in two weeks.
Send Uwe the dynamic mapping once it’s updated with the changes
Javi - will implement the changes and suggestions
The max and min heap are both set to half the machine's RAM (8 GB)
2 data nodes, 3 masters, 2 clients
Our hard disks are very slow (no SSD)
Uwe - recommends reviewing this:
Never go past 30 GB of heap (above that, the JVM loses compressed object pointers)
Use as little heap space as possible, leaving as much RAM as possible for the filesystem (disk) cache
Avoid swap memory
If you don't get out-of-memory exceptions, don't change the heap size
For each node, only run Elasticsearch, nothing else
Data nodes should have as much memory as possible outside of the heap
Add more nodes for more parallelism during queries
Use iotop to see IO usage
Possibly provide Uwe access to Elasticsearch or the virtual machines.
More details:
TERN Ecoplots Elasticsearch Mapping Recommendations - Google Docs