2021-11-16 Meeting notes

Date

Nov 16, 2021

Participants

@Edmond Chuc
Uwe Schindler
@Javier Sanchez
@Gerhard Weis
@TERN Data

Goals

Discussion topics

Time	Item	Presenter	Notes


Meeting notes

Uwe - Javi, you already implemented the suggestions from last time?
Javi - Yes, and it's working pretty fast and well.
Javi - Summary of recommendations/changes in Mapping
Javi - Removed regular expressions
Javi - using the Clear cache API to test performance
- Before performing aggregations, I clear the cache
Javi - Most of the mappings are the same for the observation indices
Javi - Once the cache is cleared, each query takes around 1 second or more.
Javi - Doesn't matter if it's cached or not, it's fast due to the new improvements
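A minimal sketch of the cold-cache test described above: clear the cache, then run the aggregation. It only builds the requests and does not send them. The index alias `observations-alias` and the field `feature_type` are hypothetical names, not taken from the project.

```python
def clear_cache_request(index):
    """Method and path for the Clear cache API on one index or alias."""
    return ("POST", f"/{index}/_cache/clear")


def terms_aggregation_request(index, field, size=10):
    """Method, path and body for a plain terms aggregation."""
    body = {
        "size": 0,  # no hits, only the aggregation buckets
        "aggs": {"by_value": {"terms": {"field": field, "size": size}}},
    }
    return ("POST", f"/{index}/_search", body)


def cold_cache_benchmark(index, field):
    """Requests to send, in order, for one cold-cache timing run."""
    return [clear_cache_request(index), terms_aggregation_request(index, field)]


# Hypothetical alias and field names, for illustration only.
plan = cold_cache_benchmark("observations-alias", "feature_type")
```

Timing the second request with and without the first one sent gives the cached versus uncached comparison.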
Uwe - Did you do a test with one big index or not?
- Javi - No, since the final product will have different indices, I didn't test it
Uwe - Did you force merge?
- Javi - Yes, all the results I'm showing were taken after clearing the cache and performing a force merge.
Javi - I also disabled refresh during indexing, which made ingestion 3-4 times faster.
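One possible sequence for the ingestion setup mentioned above, sketched as a list of requests rather than live calls. Merging down to a single segment (`max_num_segments=1`) and the index name are assumptions; the notes do not say which values were used.

```python
def bulk_load_plan(index):
    """Ordered (method, path, body) steps around a bulk ingestion run."""
    return [
        # Disable periodic refresh while documents are being indexed.
        ("PUT", f"/{index}/_settings", {"index": {"refresh_interval": "-1"}}),
        # ... bulk indexing requests go here ...
        # Setting the value to null restores the default refresh interval.
        ("PUT", f"/{index}/_settings", {"index": {"refresh_interval": None}}),
        # Assumption: merge down to one segment once the index is static.
        ("POST", f"/{index}/_forcemerge?max_num_segments=1", None),
    ]


# Hypothetical index name, for illustration only.
steps = bulk_load_plan("observations-demo")
```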
Uwe - Are we still using dynamic mapping?
- Javi - for these new ones I am using dynamic templates
- Javi - E.g., for regions, I am using dynamic templates region:*
  - And we aren't using the full-text search
Uwe - How did you do it for the numeric and non-numeric fields?
- Javi - in the mapping using value_label_field, value_type_field and value_value_field with the path_match and type fields.
- Uwe - Why did you use ignore_malformed for an integer field?
- Javi - can't remember why we have this enabled. We use it for dates and date times.
- Javi - also not sure why I mapped a boolean as a keyword
- Uwe - terms aggregations sometimes return a numeric value instead of a boolean, just beware
- Uwe - otherwise looks fine, that's how I would have done it too.
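A rough sketch of what a dynamic-template mapping along these lines could look like. The template names `value_type_field` and `value_value_field` and the `region:*` pattern come from the notes; the other `path_match` patterns, the `date_fields` template and the chosen field types are assumptions for illustration, not the actual Ecoplots mapping.

```python
import json


def build_mapping():
    """Mapping body with dynamic templates matched by field path."""
    return {
        "dynamic_templates": [
            # Region fields are exact-match keywords (no full-text search).
            {"region_fields": {
                "path_match": "region:*",
                "mapping": {"type": "keyword"}}},
            # Hypothetical pattern: non-numeric values stored as keywords.
            {"value_type_field": {
                "path_match": "*.value_type",
                "mapping": {"type": "keyword"}}},
            # Hypothetical pattern: numeric values stored as doubles.
            {"value_value_field": {
                "path_match": "*.value_numeric",
                "mapping": {"type": "double"}}},
            # ignore_malformed kept for dates, as mentioned in the notes.
            {"date_fields": {
                "path_match": "*.date",
                "mapping": {"type": "date", "ignore_malformed": True}}},
        ]
    }


mapping_json = json.dumps(build_mapping(), indent=2)
```

Templates are evaluated in order, so more specific patterns should come before broader ones.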
Uwe - what is the number of fields?
- Javi - Right now for the full alias it's 733 fields.
- Javi - Let me check just one index only...
  - 87 for plotdata_ecoplots3-data-observations-ausplots2-plant-occurrence
  - 71 for plotdata_ecoplots3-data-observations-ausplots2-land-surface-substrate
Uwe - does the UI feel smoother and faster?
- Javi - Yes, I've removed all the labels from the index and the index size is much smaller.
- Uwe - the real reason I asked to remove the labels is to not get field number explosions.
- Uwe - the other option is to set index to false on those fields so that they are not indexed, and set doc values to false as well. This way, the labels are still there when you retrieve the data, but they are not indexed.
- Javi - Right now I've removed all the labels and created a few more indices just for the labels as lookups, and now it's pretty fast.
- Uwe - Yup perfect.
- Javi - showing demo
- Uwe - Oh it's really fast!
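Both label options discussed above can be sketched briefly: a keyword field that is kept in `_source` but neither indexed nor given doc values, and a client-side join of aggregation buckets against a label lookup. The field name `feature_type_label` and the example URIs are hypothetical.

```python
def stored_only_keyword():
    """Keyword field returned with the document but not searchable or aggregatable."""
    return {"type": "keyword", "index": False, "doc_values": False}


def mapping_with_label(field_name):
    """Mapping fragment declaring one retrieve-only label field."""
    return {"properties": {field_name: stored_only_keyword()}}


def attach_labels(buckets, labels):
    """Add a label to each terms bucket, falling back to the key itself."""
    return [
        {"key": b["key"], "doc_count": b["doc_count"],
         "label": labels.get(b["key"], b["key"])}
        for b in buckets
    ]


# Hypothetical data: bucket keys are URIs, labels come from a lookup index.
buckets = [{"key": "http://example.org/ft/1", "doc_count": 12},
           {"key": "http://example.org/ft/2", "doc_count": 3}]
labels = {"http://example.org/ft/1": "plant occurrence"}
labelled = attach_labels(buckets, labels)
```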
Uwe - How many indices will you have at the end?
- Javi - Around 10-20 indices or more per dataset on average (probably)
- Uwe - you will see a little slowdown as you add more indices but should be fine for now. You can always group some datasets together into one index as well.
- Uwe - if you see some slowdown and it gets too slow, then you can merge static datasets that no longer change into one index to reduce the number of indices.
  - Or use the merge API to combine indices into one
- Javi - when do you think we may run into this problem?
- Uwe - Not really a problem; it's a question of non-functional requirements and what performance is acceptable
- Uwe - but this I see more as a maintenance task rather than a development task, something to keep in mind in the future.
- Javi - that's good to know, we do have some one-off datasets
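One way to combine several static indices into one is the Reindex API with multiple source indices. This is an assumption about what "merge API" refers to in the notes, and the index names below are hypothetical.

```python
def combine_indices_request(source_indices, target_index):
    """Method, path and body for copying several indices into one target index."""
    body = {
        "source": {"index": list(source_indices)},
        "dest": {"index": target_index},
    }
    return ("POST", "/_reindex", body)


# Hypothetical index names, for illustration only.
request = combine_indices_request(
    ["dataset-a-observations", "dataset-b-observations"],
    "static-observations",
)
```

The source indices would still need to be removed afterwards (and any alias repointed) for the index count to actually drop.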

Edmond got disconnected from Zoom. About a minute of the conversation was missed.

Uwe - because you have multiple indices, the queries run in parallel and the results are merged in the response
- Javi - We are using the scroll API to retrieve around 20 million documents; since the indices are smaller overall, that is slightly faster as well.
- Javi - for the UI we are using pagination
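A minimal sketch of a scroll loop for bulk export. The `send(method, path, body)` callable is a hypothetical transport that returns the decoded JSON response; `make_fake_send` is an in-memory stand-in so the loop can be followed without a cluster.

```python
def scroll_all(send, index, query, page_size=1000, keep_alive="1m"):
    """Yield every hit for `query`, one scroll page at a time."""
    resp = send("POST", f"/{index}/_search?scroll={keep_alive}",
                {"size": page_size, "query": query})
    try:
        while resp["hits"]["hits"]:
            yield from resp["hits"]["hits"]
            resp = send("POST", "/_search/scroll",
                        {"scroll": keep_alive, "scroll_id": resp["_scroll_id"]})
    finally:
        # Free the server-side search context once we are done.
        send("DELETE", "/_search/scroll", {"scroll_id": resp["_scroll_id"]})


def make_fake_send(docs, page_size):
    """In-memory stand-in for a cluster, for demonstration only."""
    state = {"pos": 0}

    def send(method, path, body):
        if method == "DELETE":
            return {"succeeded": True}
        page = docs[state["pos"]:state["pos"] + page_size]
        state["pos"] += len(page)
        return {"_scroll_id": "demo", "hits": {"hits": page}}

    return send


docs = [{"_id": str(i)} for i in range(5)]
hits = list(scroll_all(make_fake_send(docs, 2), "observations-alias",
                       {"match_all": {}}, page_size=2))
```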
Uwe - is date range working too?
- Javi - yes, and the queries now are simpler too.
Javi - We no longer need nested aggregations either
- Uwe - Yes, just terms aggregations and nothing more. That's good. Really a great success in my opinion.
- Javi - Yes I am quite happy with the improvement
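As a rough illustration of the simpler query shape, a date range filter combined with a single terms aggregation might be built like this. The field names `result_time` and `feature_type` are hypothetical.

```python
def date_filtered_terms_request(index, date_field, start, end, term_field, size=10):
    """Method, path and body for a terms aggregation restricted to a date range."""
    body = {
        "size": 0,  # only buckets are needed
        "query": {"bool": {"filter": [
            {"range": {date_field: {"gte": start, "lte": end}}}
        ]}},
        "aggs": {"by_value": {"terms": {"field": term_field, "size": size}}},
    }
    return ("POST", f"/{index}/_search", body)


# Hypothetical alias and field names, for illustration only.
request = date_filtered_terms_request(
    "observations-alias", "result_time", "2015-01-01", "2020-12-31", "feature_type"
)
```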
Javi - Gerhard made a few changes to the cluster, so maybe he wants to update Uwe.
- Gerhard - I changed the heap settings to 25% of the maximum RAM and added a few more CPU cores and a bit more RAM.
- Uwe - Last time we said I'll get access to the machines via SSH, is that still required? For now, seems fine.
- Gerhard - using the readonlyrest extension for authorization
  - Uwe - be careful and disable scripting if you're not using the script API
    - Not really sure if the read-only will affect this but if you're not using it, then disable it to be safer.
  - Gerhard - it's not even read-only from the outside, and it will eventually be completely blocked from the outside
Uwe - Have you thought about using OpenSearch?
- Gerhard - thought about it but haven't tried it
- Uwe - I think most of it is the same; the Elasticsearch parts are identical but authorization handling is different
Uwe - you're using Postman and not Kibana?
- Javi - I use Postman in my day-to-day work and it can save queries
Uwe - How many code changes were needed in the UI and API?
- Javi - Project is split into 3
  - API
  - UI
  - Indexer
- Javi - Most of the changes were happening in the indexer
- Javi - The API project was just changing the queries.
- Javi - the UI now looks up the labels
- Uwe - Oh, I thought the API would look for the labels.
Uwe - Will the API be used by external users?
- Javi - Yes
- Uwe - will they also have to use URIs?
- Javi - For now yes
- Gerhard - we can also have R or Python wrappers around the APIs to improve the UX.
Javi - We have some issues with the map and were wondering if you can help us, Uwe.
- Javi - We are doing geo aggregations.
- Uwe - So are the coordinates saved here in the documents?
- Javi - shows Kibana
- Javi - using geo_point
- Uwe - unfortunately I don't have much experience with the geo aggregations
- Javi - shows the clustering of sites and how zooming in and out calculates the aggregations to return the response, a cluster of sites.
- Javi - we will have tens of thousands of sites in the coming datasets
  - Javi - we like square data grids like the ones DataOne uses (shows the DataOne portal)
  - Javi - Looking at the geotile API in Elasticsearch
  - Uwe - unfortunately I have no idea and haven't tried this before
  - Javi - currently we are using the geohash API but I want to use the geotile API.
- Uwe - At PANGAEA we are assigning names of the regions but not anything like this with a map with rectangle clustering
- Uwe - I prefer full-text search instead of arbitrary maps broken into clusters of things
- Uwe - I prefer the current clustering that you have instead of the rectangular one.
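For reference, a square-grid clustering request could be sketched with the `geotile_grid` aggregation, restricted to the visible map area. The field name `location`, the alias, and the rule `precision = zoom + 2` are assumptions; `geotile_grid` accepts precisions from 0 to 29.

```python
def geotile_grid_request(index, field, zoom, top_left, bottom_right):
    """Method, path and body for a tile-grid aggregation over the visible map area.

    `top_left` and `bottom_right` are (lat, lon) tuples.
    """
    # Assumption: slightly finer tiles than the map zoom, clamped to 0..29.
    precision = max(0, min(29, zoom + 2))
    body = {
        "size": 0,
        "query": {"bool": {"filter": [{"geo_bounding_box": {field: {
            "top_left": {"lat": top_left[0], "lon": top_left[1]},
            "bottom_right": {"lat": bottom_right[0], "lon": bottom_right[1]},
        }}}]}},
        "aggs": {"grid": {"geotile_grid": {"field": field, "precision": precision}}},
    }
    return ("POST", f"/{index}/_search", body)


# Hypothetical alias, field and viewport (roughly Australia).
request = geotile_grid_request(
    "sites-alias", "location", 4, (-9.0, 112.0), (-44.0, 154.0)
)
```

A `geohash_grid` aggregation takes the same shape, with a precision between 1 and 12.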
Uwe - probably worthwhile to think about utilising full-text search in combination with the current search options.
Guru - Javi is currently doing aggregations in real time, which causes some slowness. Is there a better way to do this, such as indexing it in advance?
- Uwe - A bit strange why it's slow for around 800 sites.
- Javi - Every time you zoom in or out, it performs a new aggregation. So it's not really slow, just inefficient because it's performing a lot of aggregations.
- Uwe - At PANGAEA we are adding full-text search and everything during indexing time and have it as a grid and index it.
- Gerhard - not really slow currently and the animation with the zoom in and out is blocking the UI.
- Uwe - currently I don't think there's much to do, as the aggregations are fast. You can precalculate them in the future.
- Javi - I read many people are moving to Elasticsearch for geo instead of using something like PostGIS
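If the grid is precalculated at indexing time, one option is to store a tile key per site for each zoom level of interest. This sketch uses the standard Web Mercator tile formula and the `zoom/x/y` key format, which, as far as I know, matches the bucket keys that `geotile_grid` returns. The function name is hypothetical.

```python
import math

# Latitude limit of the Web Mercator projection.
MAX_LAT = 85.05112878


def geotile_key(lat, lon, zoom):
    """Return the 'zoom/x/y' map tile containing the given point."""
    n = 2 ** zoom  # number of tiles along each axis at this zoom
    lat = max(-MAX_LAT, min(MAX_LAT, lat))
    x = int((lon + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n)
    # Clamp so that lon = 180 and the latitude limits stay inside the grid.
    x = max(0, min(n - 1, x))
    y = max(0, min(n - 1, y))
    return f"{zoom}/{x}/{y}"


# Example: a site near Adelaide at zoom level 2.
key = geotile_key(-34.9285, 138.6007, 2)
```

Storing such keys as keyword fields would turn map clustering into a plain terms aggregation, at the cost of fixing the supported zoom levels in advance.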
Uwe - https://wiki.pangaea.de/wiki/Topic
