2021-11-02 Meeting notes

Date

Nov 2, 2021

Participants

  • @Edmond Chuc

  • Uwe Schindler

  • @Javier Sanchez

  • @TERN Data

  • @Gerhard Weis

Meeting notes

  • Javi went through his report with Uwe and discussed his findings

  • Javi to check the mapping and the field types that Elasticsearch infers through dynamic mapping. Uwe says to use dynamic mapping with caution.

  • Javi - question about caching: performance is hard to test because Elasticsearch performs caching.

  • Uwe - for proper testing, restart Elasticsearch or clear the filesystem cache used by Elasticsearch

  • Uwe - do we do just terms aggregations? Javi - Yes, mostly terms aggregations

  • Uwe - Do we do range queries? Javi - Yes, for a few things

  • Javi's findings - most things are slightly faster, though he is still trying to rule out caching effects in the tests

  • Javi - Without nested documents, the number of documents in the index is the same as the actual number of documents, because nested objects are no longer stored as extra hidden documents. This is good as our data grows.

  • Javi - happy with the results and will implement for real

  • Uwe - regex for regions - can be improved

  • Uwe - How should we proceed now?

    • Schema looks fine, not much more to improve

      • Field type guessing in the mapping - should be improved

    • Javi - worth trying to denormalise the regions

    • Uwe - very small improvement, remove labels used by the user-interface

  • Datatype is not really used currently in the user-interface

    • Date time is checked to format it

    • Will be used in the future for filtering the values

    • Have separate indexes for data types behind an alias

  • Javi - question about splitting into separate indexes

    • How does querying an alias over multiple indexes perform compared with querying just one index?

      • Uwe - It might get faster with aliases due to executing the query in parallel (plus with sharding)

      • Good idea to not create too many indexes but group them by similar attributes

      • Use just one shard per index, not multiple

        • Fact - approximately 15 small Lucene indexes (segments) per Elasticsearch index

  • Uwe - question, are the indexes static after creation?

    • Yes - we are not performing incremental updates

    • Improvement - merge each index into one large Lucene segment

      • Only do this if the index doesn't change afterwards

      • Why do this? Each small put creates a small Lucene segment. As indexing goes on, Elasticsearch merges segments into a larger one once enough have accumulated, so segment sizes end up ranging from large to small (in a logarithmic fashion).

      • Combining them improves things by reducing the number of Lucene segments. See the Elasticsearch Force merge API.

      • Calling the Elasticsearch Force merge API at the end of ingest will reduce the many Lucene segments to one big segment.

      • Good to merge when the index becomes read-only

  • Uwe - switch off the refresh interval while indexing. We don't query the index while we are indexing, and refreshing is expensive. See "Unset or increase the refresh interval" in the Elasticsearch documentation.

  • Uwe - Test if queries are faster after performing force merge

    • max_num_segments should be set to 1. This ensures that, after the force merge, the index contains a single Lucene segment.

    • The number of files, and of open file handles, will be dramatically lower

  • Javi - what triggers merge?

    • Uwe - only during indexing

    • depends on the size

    • Why doesn’t Elasticsearch perform this automatically? Because it usually expects more incremental updates to an index. It only merges incrementally when it meets some threshold.

  • Next week Uwe is meeting his parents. Meet in two weeks.

    • Send Uwe the dynamic mapping once it’s updated with the changes

    • Javi - will implement the changes and suggestions

  • Max and min heap are both set to half the RAM of the machine (8GB)

    • 2 data nodes, 3 masters, 2 clients

    • Our hard disks are very slow (no SSD)

    • Uwe - recommend reviewing this

    • Never go past 30GB of heap

    • Use as little heap space as possible and leave as much memory as possible for disk caching

      • Avoid swap memory

      • If you don’t get out of memory exceptions, then don’t change heap space

      • For each node, only run Elasticsearch, nothing else

      • Data nodes should have as much space as possible outside of heap

      • Add more nodes for more parallelism during queries

    • Use iotop to see IO usage

  • Possibly provide Uwe access to Elasticsearch or the virtual machines.
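
The ingest-related suggestions above (one shard per index, refresh switched off while indexing, a force merge down to one segment once the index is static, and grouping indexes behind an alias) could be sketched roughly as below. This is only an illustration, not our actual ingest code: the index name "observations-text", the alias "observations", the cluster URL and the "1s" refresh interval restored afterwards are all assumptions. The functions only build the REST calls; send() is a thin helper for issuing them.

```python
import json
from urllib import request

ES_URL = "http://localhost:9200"  # assumption: local cluster without authentication


def pre_ingest_calls(index):
    """REST calls to make before bulk indexing starts."""
    # One shard per index, and no periodic refresh while we are indexing.
    settings = {"index": {"number_of_shards": 1, "refresh_interval": "-1"}}
    return [("PUT", f"/{index}", {"settings": settings})]


def post_ingest_calls(index, alias):
    """REST calls to make once indexing has finished and the index is static."""
    return [
        # Switch refreshing back on ("1s" is an assumed value).
        ("PUT", f"/{index}/_settings", {"index": {"refresh_interval": "1s"}}),
        # Merge everything down to a single Lucene segment.
        ("POST", f"/{index}/_forcemerge?max_num_segments=1", None),
        # Expose the index through the shared alias.
        ("POST", "/_aliases",
         {"actions": [{"add": {"index": index, "alias": alias}}]}),
    ]


def send(method, path, body=None, base_url=ES_URL):
    """Issue one call against the cluster and return the decoded JSON reply."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = request.Request(
        base_url + path,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)
```

In use, each call from pre_ingest_calls("observations-text") would go through send(*call) before the bulk load, and each call from post_ingest_calls("observations-text", "observations") after it. The force merge only makes sense because the indexes are not updated after creation.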

 

More details:
TERN Ecoplots Elasticsearch Mapping Recommendations - Google Docs
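
On the point that field type guessing in the mapping should be improved: one possible approach is to declare the field types explicitly and reject unmapped fields instead of relying on dynamic mapping. The sketch below assumes Elasticsearch 7 or later (typeless mappings); the field names are hypothetical and are not taken from our schema, and the helper explicit_mapping is made up for illustration.

```python
import json

# Small subset of Elasticsearch field types, enough for this sketch.
ALLOWED_TYPES = {"keyword", "text", "date", "double", "long", "boolean"}


def explicit_mapping(field_types):
    """Build an index-creation body with explicit types and no type guessing."""
    unknown = sorted({t for t in field_types.values() if t not in ALLOWED_TYPES})
    if unknown:
        raise ValueError(f"unsupported field types: {unknown}")
    return {
        "mappings": {
            # "strict" makes Elasticsearch reject documents with unmapped fields.
            "dynamic": "strict",
            "properties": {name: {"type": t} for name, t in field_types.items()},
        }
    }


# Hypothetical fields: keyword for terms aggregations, date and double
# for formatting and range queries.
body = explicit_mapping({
    "region": "keyword",
    "observed_at": "date",
    "value": "double",
})
print(json.dumps(body, indent=2))
```

The resulting body would be sent when creating the index (PUT on the index name), so a wrongly typed field shows up as an indexing error rather than as a silently guessed type.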
