2021-11-16 Meeting notes

Date

Nov 16, 2021

Participants

  • @Edmond Chuc (Unlicensed)

  • Uwe Schindler

  • @Javier Sanchez

  • @Gerhard Weis

  • @TERN Data

Goals

  •  

Discussion topics

Time

Item

Presenter

Notes

Meeting notes

  • Uwe - Javi, you already implemented the suggestions from last time?

  • Javi - Yes, and it's working pretty fast and well.

  • Javi - Summary of recommendations/changes in Mapping

  • Javi - Removed regular expressions

  • Javi - using Clear cache API to do testing on performance

    • Before performing aggregations, I clear the cache

  • Javi - Most of the mappings are the same for the observation indices

  • Javi - Once cache is cleared, each query takes around 1 second or more.

  • Javi - Doesn't matter if it's cached or not, it's fast due to the new improvements

  • Uwe - Did you do a test with one big index or not?

    • Javi - No, since the final product will have different indices, I didn't test it

  • Uwe - Did you force merge?

    • Javi - Yes, all the results I'm showing were taken with the cache cleared and after a force merge.

  • Javi - I also disabled refresh during indexing, which sped up ingestion by 3-4 times.

  • Uwe - Are we still using dynamic mapping?

    • Javi - for these new ones I am using dynamic templates

    • Javi - E.g., for regions, I am using dynamic templates region:*

      • And we aren't using the full-text search

  • Uwe - How did you do it for the numeric and non-numeric fields?

    • Javi - in the mapping using value_label_field, value_type_field and value_value_field with the path_match and type fields.

    • Uwe - Why did you use ignore_malformed for an integer field?

    • Javi - can't remember why we have this enabled. We use it for dates and date times.

    • Javi - also not sure why I mapped a boolean as a keyword

    • Uwe - terms aggregations sometimes return a numeric value instead of a boolean, just beware

    • Uwe - otherwise looks fine, that's how I would have done it too.

  • Uwe - what is the number of fields?

    • Javi - Right now for the full alias it's 733 fields.

    • Javi - Let me check just one index only...

      • 87 for plotdata_ecoplots3-data-observations-ausplots2-plant-occurrence

      • 71 for plotdata_ecoplots3-data-observations-ausplots2-land-surface-substrate

  • Uwe - does the UI feel smoother and faster?

    • Javi - Yes, I've removed all the labels from the index and the index size is much smaller.

    • Uwe - the real reason I asked to remove the labels is to not get field number explosions.

    • Uwe - the other thing to do is to set index to false so that those fields are not indexed: index no, doc values also no. This way, you can still use the values when you retrieve the data, but they are not indexed.

    • Javi - Right now I've removed all the labels and created a few more indices just for the labels as look ups and now it's pretty fast.

    • Uwe - Yup perfect.

    • Javi - showing demo

    • Uwe - Oh it's really fast!

  • Uwe - How many indices will you have at the end?

    • Javi - Around 10-20 indices or more per dataset on average (probably)

    • Uwe - you will see a little slowdown as you add more indices but should be fine for now. You can always group some datasets together into one index as well.

    • Uwe - if you see some slowdown and it gets too slow, then you can merge static datasets that no longer change into one index to reduce the number of indices.

      • Or use the merge API to combine indices into one

    • Javi - when do you think we may run into this problem?

    • Uwe - Not really a problem; it's a question of non-functional requirements and what performance is acceptable

    • Uwe - but this I see more as a maintenance task rather than a development task, something to keep in mind in the future.

    • Javi - that's good to know, we do have some one-off datasets
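
The mapping points above (dynamic templates matched with path_match, label fields that are neither indexed nor given doc values, and refresh disabled during ingestion) can be sketched as plain request bodies. This is a minimal illustration in Python, not the project's actual mapping: the template names, the `*_label` pattern, and the `1s` restore value are assumptions made for the example. Only the `region:*` pattern comes from the discussion.

```python
import json


def build_index_body():
    """Build a hypothetical create-index body using dynamic templates."""
    return {
        "mappings": {
            "dynamic_templates": [
                {
                    # Hypothetical template name; "region:*" is the pattern
                    # mentioned in the meeting. Keyword only, no full-text.
                    "region_keyword": {
                        "path_match": "region:*",
                        "mapping": {"type": "keyword"},
                    }
                },
                {
                    # Assumed "*_label" naming. The value stays in _source
                    # for retrieval but is not searchable or aggregatable.
                    "label_stored_only": {
                        "path_match": "*_label",
                        "mapping": {
                            "type": "keyword",
                            "index": False,
                            "doc_values": False,
                        },
                    }
                },
            ]
        }
    }


def refresh_settings(enabled):
    """Settings body to switch refresh off for bulk indexing, or back on.

    "-1" disables refresh; "1s" is assumed here as the value to restore.
    """
    interval = "1s" if enabled else "-1"
    return {"index": {"refresh_interval": interval}}


if __name__ == "__main__":
    print(json.dumps(build_index_body(), indent=2))
    print(json.dumps(refresh_settings(False)))
```

The bodies would be sent with whatever HTTP client is already in use (Postman in this case): the first when creating an index, the second to the index settings endpoint before and after a bulk load.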

Edmond got disconnected from Zoom. About a minute of the conversation was missed.

  • Uwe - because you have multiple indices, the queries run in parallel and the results are merged in the response

    • Javi - We are using the scroll API to get around 20 million documents. Since the indices are smaller overall, there is a slight improvement there as well.

    • Javi - for the UI we are using pagination

  • Uwe - is date range working too?

    • Javi - yes, and the queries now are simpler too.

  • Javi - No longer need to use nested aggregations as well

    • Uwe - Yes, just terms aggregations and nothing more. That's good. Really a great success in my opinion.

    • Javi - Yes I am quite happy with the improvement

  • Javi - Gerhard did a few changes to the cluster, so maybe he wants to update Uwe.

    • Gerhard - I changed the heap settings to 25% of the maximum RAM and added a few more CPU cores and a bit more RAM.

    • Uwe - Last time we said I'll get access to the machines via SSH, is that still required? For now, seems fine.

    • Gerhard - using the readonlyrest extension for authorization

      • Uwe - be careful and disable scripting if you're not using the script API

        • Not really sure if the read-only will affect this but if you're not using it, then disable it to be safer.

      • Gerhard - it's not even read-only from the outside, and it will eventually be completely blocked from the outside

  • Uwe - thought about using OpenSearch?

    • Gerhard - thought about it but haven't tried it

    • Uwe - I think most is the same, the Elasticsearch stuff is identical but authorization handling is different

  • Uwe - you're using Postman and not Kibana?

    • Javi - I use Postman in my day-to-day work and it can save queries

  • Uwe - How many code changes were needed in the UI and API?

    • Javi - Project is split into 3

      • API

      • UI

      • Indexer

    • Javi - Most of the changes were happening in the indexer

    • Javi - The API project was just changing the queries.

    • Javi - the UI now looks up the labels

    • Uwe - Oh, I thought the API would look for the labels.

  • Uwe - Will the API be used by external users?

    • Javi - Yes

    • Uwe - will they also have to use URIs?

    • Javi - For now yes

    • Gerhard - we can also have R or Python wrappers around the APIs to improve the UX.

  • Javi - We have some issues with the map and were wondering if you can help us, Uwe.

    • Javi - We are doing geo aggregations.

    • Uwe - So are the coordinates saved here in the documents?

    • Javi - shows Kibana

    • Javi - using geo_point

    • Uwe - unfortunately I don't have much experience with the geo aggregations

    • Javi - shows the clustering of sites and how zooming in and out calculates the aggregations to return the response, a cluster of sites.

    • Javi - we will have 10s of thousands of sites in the coming datasets

      • Javi - we like square data grids like DataOne has (shows the DataOne portal)

      • Javi - Looking at the geotile API in Elasticsearch

      • Uwe - unfortunately I have no idea and haven't tried this before

      • Javi - currently we are using the geohash API but I want to use the geotile API.

    • Uwe - At PANGAEA we are assigning names of the regions but not anything like this with a map with rectangle clustering

    • Uwe - I prefer full-text search instead of arbitrary maps broken into clusters of things

    • Uwe - I prefer the current clustering that you have instead of the rectangular one.

  • Uwe - probably worthwhile to think about utilising full-text search in combination with the current search options.

  • Guru - Javi is currently doing aggregations in real time, which causes some slowness. Is there a better way to do this, such as indexing it in advance?

    • Uwe - A bit strange why it's slow for around 800 sites.

    • Javi - Every time you zoom in or out, it performs a new aggregation. So it's not really slow, just inefficient because it's performing a lot of aggregations.

    • Uwe - At PANGAEA we are adding full-text search and everything during indexing time and have it as a grid and index it.

    • Gerhard - it's not really slow currently; the animation with the zoom in and out is blocking the UI.

    • Uwe - I don't think there's much to do now as the aggregations are fast. You can precalculate them in the future.

    • Javi - I read many people are moving to Elasticsearch for geo instead of using something like PostGIS

  • Uwe - https://wiki.pangaea.de/wiki/Topic
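
For the square-grid clustering discussed above, a request using the geotile_grid aggregation might look roughly like the following. This is a sketch under assumptions, not the project's query: the field name `location`, the aggregation name `site_tiles`, and the "map zoom plus 2" precision rule are all hypothetical. The bucket keys of a geotile grid have the form `zoom/x/y`, so the helper at the end converts a key into a rectangle using the standard Web Mercator tile formula, which is what the UI would draw.

```python
import math

MAX_GEOTILE_PRECISION = 29  # upper limit of the geotile_grid precision


def geotile_precision(map_zoom, extra_detail=2):
    """Pick a grid precision from the map zoom (assumed rule: zoom + 2)."""
    return max(0, min(MAX_GEOTILE_PRECISION, int(map_zoom) + extra_detail))


def build_geotile_query(map_zoom, top_left, bottom_right, field="location"):
    """Build a search body that buckets sites in the viewport into tiles.

    top_left and bottom_right are (lat, lon) tuples of the visible map area.
    The field name is hypothetical and must be mapped as geo_point.
    """
    return {
        "size": 0,  # only the buckets are needed, not the documents
        "query": {
            "bool": {
                "filter": [
                    {
                        "geo_bounding_box": {
                            field: {
                                "top_left": {"lat": top_left[0], "lon": top_left[1]},
                                "bottom_right": {
                                    "lat": bottom_right[0],
                                    "lon": bottom_right[1],
                                },
                            }
                        }
                    }
                ]
            }
        },
        "aggs": {
            "site_tiles": {
                "geotile_grid": {
                    "field": field,
                    "precision": geotile_precision(map_zoom),
                }
            }
        },
    }


def tile_bounds(key):
    """Convert a "zoom/x/y" bucket key into a lat/lon rectangle."""
    zoom, x, y = (int(part) for part in key.split("/"))
    n = 2 ** zoom

    def row_to_lat(row):
        return math.degrees(math.atan(math.sinh(math.pi * (1 - 2 * row / n))))

    return {
        "north": row_to_lat(y),
        "south": row_to_lat(y + 1),
        "west": x / n * 360.0 - 180.0,
        "east": (x + 1) / n * 360.0 - 180.0,
    }
```

Because the precision depends only on the zoom level, responses could be cached per zoom level and viewport, or precalculated later as Uwe suggested, rather than recomputed on every zoom step.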

Action items

Decisions
