2021-11-16 Meeting notes
Date
Nov 16, 2021
Participants
@Edmond Chuc (Unlicensed)
Uwe Schindler
@Javier Sanchez
@Gerhard Weis
@TERN Data
Goals
Discussion topics
Meeting notes
Uwe - Javi, you already implemented the suggestions from last time?
Javi - Yes, and it's working pretty fast and well.
Javi - Summary of recommendations: see "Summary of recommendations/changes in Mapping"
Javi - Removed regular expressions
Javi - Using the clear cache API to test performance. Before performing aggregations, I clear the cache.
Javi - Most of the mappings are the same for the observation indices
Javi - Once cache is cleared, each query takes around 1 second or more.
Javi - Doesn't matter if it's cached or not, it's fast due to the new improvements
Uwe - Did you do a test with one big index or not?
Javi - No, since the final product will have different indices, I didn't test it
Uwe - Did you force merge?
Javi - Yes, the results I'm showing all have cache cleared and performed force merge.
Javi - I also disabled refresh during indexing which improved the ingestion by 3-4 times.
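The benchmarking routine described above (refresh disabled during ingestion, then force merge and clear caches before timing aggregations) could be sketched as the REST calls below. This is only an illustration: the index name, `max_num_segments=1`, and the helper names are assumptions, not what was actually run. The functions only build the requests; sending them to the cluster is left to whatever HTTP client is in use.

```python
from typing import Any, Dict, List, Optional, Tuple

# (HTTP method, path, JSON body or None)
Request = Tuple[str, str, Optional[Dict[str, Any]]]


def refresh_setting_request(index: str, interval: Optional[str]) -> Request:
    """Build the settings update that changes the refresh interval.

    "-1" disables periodic refresh during bulk indexing; passing None
    (JSON null) resets the setting to the default afterwards.
    """
    return ("PUT", f"/{index}/_settings", {"index": {"refresh_interval": interval}})


def benchmark_prep_requests(index: str) -> List[Request]:
    """Requests to issue after ingestion and before timing aggregations."""
    return [
        # Merge segments (a single segment per shard is an assumption here).
        ("POST", f"/{index}/_forcemerge?max_num_segments=1", None),
        # Drop cached results so timings reflect uncached queries.
        ("POST", f"/{index}/_cache/clear", None),
    ]


if __name__ == "__main__":
    idx = "my-observations-index"  # hypothetical index name
    print(refresh_setting_request(idx, "-1"))
    for request in benchmark_prep_requests(idx):
        print(request)
    print(refresh_setting_request(idx, None))
```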
Uwe - Are we still using dynamic mapping?
Javi - For these new ones I am using dynamic templates
Javi - E.g., for regions, I am using dynamic templates for `region:*`
And we aren't using the full-text search
Uwe - How did you do it for the numeric and non-numeric fields?
Javi - In the mapping, using `value_label_field`, `value_type_field` and `value_value_field` with the `path_match` and `type` fields.
Uwe - Why did you use `ignore_malformed` for an integer field?
Javi - Can't remember why we have this enabled. We use it for dates and date times.
Javi - also not sure why I mapped a boolean as a keyword
Uwe - terms aggregations sometimes return a numeric value instead of a boolean, just beware
Uwe - otherwise looks fine, that's how I would have done it too.
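For reference, a mapping that uses dynamic templates with `path_match` and a mapped `type` could look roughly like the sketch below. The template names and the `region:*` pattern come from the notes; the other path patterns (`*.type`, `*.value`, `*.date`) and the chosen field types are hypothetical placeholders, since the real mapping was only shown on screen.

```python
import json


def observation_mapping() -> dict:
    """Build a mapping body with dynamic templates (illustrative only)."""
    return {
        "dynamic_templates": [
            # Region fields are exact-match values, so no full-text analysis.
            {"region_field": {
                "path_match": "region:*",
                "mapping": {"type": "keyword"},
            }},
            # Hypothetical pattern: non-numeric values kept as keywords.
            {"value_type_field": {
                "path_match": "*.type",
                "mapping": {"type": "keyword"},
            }},
            # Hypothetical pattern: numeric values.
            {"value_value_field": {
                "path_match": "*.value",
                "mapping": {"type": "double"},
            }},
            # ignore_malformed kept only where the notes say it is used: dates.
            {"date_field": {
                "path_match": "*.date",
                "mapping": {"type": "date", "ignore_malformed": True},
            }},
        ]
    }


if __name__ == "__main__":
    print(json.dumps({"mappings": observation_mapping()}, indent=2))
```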
Uwe - what are the number of fields?
Javi - Right now for the full alias it's 733 fields.
Javi - Let me check just one index only...
87 for `plotdata_ecoplots3-data-observations-ausplots2-plant-occurrence`
71 for `plotdata_ecoplots3-data-observations-ausplots2-land-surface-substrate`
Uwe - Does the UI feel smoother and faster?
Javi - Yes, I've removed all the labels from the index and the index size is much smaller.
Uwe - the real reason I asked to remove the labels is to not get field number explosions.
Uwe - The other thing to do is to set index to 0 so that those fields are not indexed: indexed no, doc values also no. This way you can still use them when you retrieve the data, but they are not indexed.
Javi - Right now I've removed all the labels and created a few more indices just for the labels as look ups and now it's pretty fast.
Uwe - Yup perfect.
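Uwe's alternative (keep the labels in the documents but switch off indexing and doc values) could be expressed in a mapping roughly as follows. This is a sketch under assumptions: the field names are hypothetical, and `keyword` is assumed as the field type. Such fields would still come back in `_source` when documents are retrieved, but could not be searched or aggregated on.

```python
from typing import Dict, Iterable


def retrieve_only_keyword() -> Dict[str, object]:
    """A keyword field that is neither indexed nor stored as doc values."""
    return {"type": "keyword", "index": False, "doc_values": False}


def mapping_with_unindexed_labels(label_fields: Iterable[str]) -> Dict[str, object]:
    """Build a properties block where every label field is retrieve-only."""
    return {"properties": {name: retrieve_only_keyword() for name in label_fields}}


if __name__ == "__main__":
    # Hypothetical label field names, for illustration only.
    print(mapping_with_unindexed_labels(["feature_type_label", "parameter_label"]))
```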
Javi - showing demo
Uwe - Oh it's really fast!
Uwe - How many indices will you have at the end?
Javi - Around 10-20 indices or more per dataset on average (probably)
Uwe - you will see a little slowdown as you add more indices but should be fine for now. You can always group some datasets together into one index as well.
Uwe - If you see some slowdown and it gets too slow, you can merge static datasets that no longer change into one index to reduce the number of indices.
Or use the merge API to combine indices into one
Javi - when do you think we may run into this problem?
Uwe - Not really a problem; it's a question of non-functional requirements and what performance is acceptable
Uwe - but this I see more as a maintenance task rather than a development task, something to keep in mind in the future.
Javi - that's good to know, we do have some one-off datasets
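Combining several static indices into one could be sketched as below. The notes mention a "merge API"; this sketch assumes the `_reindex` API is what would be used in practice, since it accepts a list of source indices and copies their documents into one destination. The destination index name is hypothetical, and the function only builds the request.

```python
from typing import Any, Dict, Sequence, Tuple


def combine_indices_request(sources: Sequence[str], dest: str) -> Tuple[str, str, Dict[str, Any]]:
    """Build a _reindex request copying several source indices into one."""
    if not sources:
        raise ValueError("at least one source index is required")
    if dest in sources:
        raise ValueError("destination must differ from the source indices")
    body = {"source": {"index": list(sources)}, "dest": {"index": dest}}
    return ("POST", "/_reindex", body)


if __name__ == "__main__":
    request = combine_indices_request(
        [
            "plotdata_ecoplots3-data-observations-ausplots2-plant-occurrence",
            "plotdata_ecoplots3-data-observations-ausplots2-land-surface-substrate",
        ],
        "ausplots2-static-combined",  # hypothetical destination index
    )
    print(request)
```

The old indices would then be removed (or swapped out behind an alias) once the combined index has been checked.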
Edmond got disconnected from Zoom. Some of the conversation was missed for a minute.
Uwe - Because you have multiple indices, the queries run in parallel and the results are merged in the response
Javi - We are using the scroll API to get like 20 million documents, since the indices are overall smaller, it is a slight improvement as well.
Javi - for the UI we are using pagination
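For the bulk export case, a scroll loop could look like the sketch below. It is a minimal illustration, not the project's code: `send` is a hypothetical callable that performs the HTTP request and returns the parsed JSON response, and the page size and keep-alive values are assumptions.

```python
from typing import Any, Callable, Dict, Iterator, Optional

# Hypothetical transport: (method, path, body) -> parsed JSON response.
Send = Callable[[str, str, Optional[Dict[str, Any]]], Dict[str, Any]]


def scroll_hits(
    send: Send,
    index: str,
    query: Dict[str, Any],
    page_size: int = 1000,
    keep_alive: str = "1m",
) -> Iterator[Dict[str, Any]]:
    """Yield every hit for `query`, fetching one scroll page at a time."""
    page = send(
        "POST",
        f"/{index}/_search?scroll={keep_alive}",
        # Sorting by _doc avoids scoring work when order does not matter.
        {"size": page_size, "query": query, "sort": ["_doc"]},
    )
    scroll_id = page.get("_scroll_id")
    try:
        while page["hits"]["hits"]:
            yield from page["hits"]["hits"]
            page = send(
                "POST",
                "/_search/scroll",
                {"scroll": keep_alive, "scroll_id": scroll_id},
            )
            scroll_id = page.get("_scroll_id", scroll_id)
    finally:
        # Release the server-side scroll context once done or abandoned.
        if scroll_id:
            send("DELETE", "/_search/scroll", {"scroll_id": scroll_id})
```

Passing the transport in as a function keeps the loop easy to test without a running cluster.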
Uwe - is date range working too?
Javi - yes, and the queries now are simpler too.
Javi - No longer need to use nested aggregations as well
Uwe - Yes, just terms aggregations and nothing more. That's good. Really a great success in my opinion.
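The simplified query shape discussed here (a date range filter plus a plain terms aggregation, with no nested aggregation) could be sketched as follows. The field names in the usage example are hypothetical, and the bucket size is an assumption.

```python
from typing import Any, Dict


def terms_with_date_range(
    term_field: str,
    date_field: str,
    start: str,
    end: str,
    size: int = 10,
) -> Dict[str, Any]:
    """Build a search body: filter by date range, then count terms."""
    return {
        "size": 0,  # only aggregation buckets are needed, not the hits
        "query": {
            "bool": {
                "filter": [
                    {"range": {date_field: {"gte": start, "lte": end}}}
                ]
            }
        },
        "aggs": {
            "by_term": {"terms": {"field": term_field, "size": size}}
        },
    }


if __name__ == "__main__":
    # Hypothetical field names, for illustration only.
    print(terms_with_date_range("feature_type", "result_time", "2015-01-01", "2020-12-31"))
```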
Javi - Yes I am quite happy with the improvement
Javi - Gerhard did a few changes to the cluster, so maybe he wants to update Uwe.
Gerhard - I changed the heap settings to 25% of the maximum ram and added a few more cpu cores and a bit more ram.
Uwe - Last time we said I'll get access to the machines via SSH, is that still required? For now, seems fine.
Gerhard - using the readonlyrest extension for authorization
Uwe - be careful and disable scripting if you're not using the script API
Not really sure if the read-only will affect this but if you're not using it, then disable it to be safer.
Gerhard - it's not even read-only from the outside, and it will eventually be completely blocked from the outside
Uwe - Have you thought about using OpenSearch?
Gerhard - Thought about it but haven't tried it
Uwe - I think most is the same, elasticsearch stuff is identical but authorization handling is different
Uwe - you're using postman and not kibana?
Javi - I use postman in my day-to-day work and it can save queries
Uwe - How many code changes were needed in the UI and API?
Javi - Project is split into 3
API
UI
Indexer
Javi - Most of the changes were happening in the indexer
Javi - The API project was just changing the queries.
Javi - the UI now looks up the labels
Uwe - Oh, I thought the API would look for the labels.
Uwe - Will the API be used by external users?
Javi - Yes
Uwe - will they also have to use URIs?
Javi - For now yes
Gerhard - we can also have R or Python wrappers around the APIs to improve the UX.
Javi - We have some issues with the map and were wondering if you can help us, Uwe.
Javi - We are doing geo aggregations.
Uwe - So are the coordinates saved here in the documents?
Javi - shows kibana
Javi - using `geopoint`
Uwe - unfortunately I don't have much experience with the geo aggregations
Javi - shows the clustering of sites and how zooming in and out calculates the aggregations to return the response, a cluster of sites.
Javi - we will have 10s of thousands of sites in the coming datasets
Javi - We like square data grids like DataOne uses (shows DataOne portal)
Javi - Looking at geotile API in elasticsearch
Uwe - unfortunately I have no idea and haven't tried this before
Javi - currently we are using the geohash API but I want to use the geotile API.
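The difference between the two grid aggregations mentioned here could be sketched as below: both take a geo field and a precision, but `geohash_grid` precision is a geohash length (1 to 12) while `geotile_grid` precision is a map zoom level (0 to 29), which gives the square map-tile cells. The field name and bounding box in the usage example are hypothetical.

```python
from typing import Any, Dict, Optional, Tuple

# (top, left, bottom, right) in degrees
BBox = Tuple[float, float, float, float]

_MAX_PRECISION = {"geohash_grid": 12, "geotile_grid": 29}
_MIN_PRECISION = {"geohash_grid": 1, "geotile_grid": 0}


def grid_aggregation(
    kind: str, field: str, precision: int, bbox: Optional[BBox] = None
) -> Dict[str, Any]:
    """Build a search body that buckets points into grid cells."""
    if kind not in _MAX_PRECISION:
        raise ValueError(f"unsupported grid aggregation: {kind}")
    if not _MIN_PRECISION[kind] <= precision <= _MAX_PRECISION[kind]:
        raise ValueError(f"precision {precision} out of range for {kind}")
    body: Dict[str, Any] = {
        "size": 0,
        "aggs": {"cells": {kind: {"field": field, "precision": precision}}},
    }
    if bbox is not None:
        top, left, bottom, right = bbox
        # Restrict the aggregation to the visible map area.
        body["query"] = {
            "geo_bounding_box": {
                field: {
                    "top_left": {"lat": top, "lon": left},
                    "bottom_right": {"lat": bottom, "lon": right},
                }
            }
        }
    return body


if __name__ == "__main__":
    australia = (-9.0, 112.0, -44.0, 154.0)  # rough, illustrative bounds
    print(grid_aggregation("geohash_grid", "location", 4, australia))
    print(grid_aggregation("geotile_grid", "location", 6, australia))
```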
Uwe - At PANGAEA we are assigning names of the regions but not anything like this with a map with rectangle clustering
Uwe - I prefer full-text search instead of arbitrary maps broken into clusters of things
Uwe - I prefer the current clustering that you have instead of the rectangular one.
Uwe - probably worthwhile to think about utilising full-text search in combination with the current search options.
Guru - Javi is currently doing aggregations in real time, which causes some slowness. Is there a better way to do this, such as indexing it in advance?
Uwe - A bit strange why it's slow for around 800 sites.
Javi - Every time you zoom in or out is performing a new aggregation. So it's not really slow, just inefficient because it's performing a lot of aggregations.
Uwe - At PANGAEA we are adding full-text search and everything during indexing time and have it as a grid and index it.
Gerhard - It's not really slow currently; the animation with the zoom in and out is blocking the UI.
Uwe - Currently I don't think there's much to do, as the aggregations are fast. You can precalculate it in the future.
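If precalculation is taken up later, one possible approach is to compute each site's grid cell per zoom level at indexing time and store it as a keyword, so the map can use a plain terms aggregation instead of a geo aggregation on every zoom. The sketch below uses the standard Web Mercator "zoom/x/y" tile formula; whether its cells line up exactly with Elasticsearch's `geotile_grid` buckets is an assumption that would need checking, and the `tiles` field layout is hypothetical.

```python
import math
from typing import Any, Dict, Iterable

_MAX_MERCATOR_LAT = 85.05112878  # latitude limit of the Web Mercator projection


def geotile_key(lat: float, lon: float, zoom: int) -> str:
    """Return the "zoom/x/y" map tile containing the given point."""
    lat = max(min(lat, _MAX_MERCATOR_LAT), -_MAX_MERCATOR_LAT)
    n = 1 << zoom  # number of tiles along each axis at this zoom
    x = int((lon + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n)
    # Clamp so points on the far edge still fall inside the last tile.
    x = min(max(x, 0), n - 1)
    y = min(max(y, 0), n - 1)
    return f"{zoom}/{x}/{y}"


def add_tile_keys(
    doc: Dict[str, Any], lat: float, lon: float, zooms: Iterable[int]
) -> Dict[str, Any]:
    """Return a copy of `doc` with one precomputed tile key per zoom level."""
    enriched = dict(doc)
    enriched["tiles"] = {f"z{z}": geotile_key(lat, lon, z) for z in zooms}
    return enriched


if __name__ == "__main__":
    site = {"site_id": "example-site"}  # hypothetical document
    print(add_tile_keys(site, -25.0, 135.0, range(0, 5)))
```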
Javi - I read many people are moving to Elasticsearch for geo instead of using something like PostGIS