Date

Participants

Goals

Discussion topics

Meeting notes

  • Uwe - Javi, have you already implemented the suggestions from last time?

  • Javi - Yes, and it's working pretty fast and well.

  • Javi - Summary of recommendations/changes in Mapping

  • Javi - Removed regular expressions

  • Javi - using the clear cache API for performance testing

    • Before performing aggregations, I clear the cache

  • Javi - Most of the mappings are the same for the observation indices

  • Javi - Once cache is cleared, each query takes around 1 second or more.

  • Javi - Doesn't matter if it's cached or not, it's fast due to the new improvements

  • Uwe - Did you do a test with one big index or not?

    • Javi - No, since the final product will have different indices, I didn't test it

  • Uwe - Did you force merge?

    • Javi - Yes, all the results I'm showing were taken with the cache cleared and after a force merge.

  • Javi - I also disabled refresh during indexing, which made ingestion 3-4 times faster.

  • Uwe - Are we still using dynamic mapping?

    • Javi - For these new ones I am using dynamic templates

    • Javi - E.g., for regions, I am using dynamic templates region:*

      • And we aren't using the full-text search

  • Uwe - How did you do it for the numeric and non-numeric fields?

    • Javi - in the mapping using value_label_field, value_type_field and value_value_field with the path_match and type fields.

    • Uwe - Why did you use ignore_malformed for an integer field?

    • Javi - can't remember why we have this enabled. We use it for dates and date times.

    • Javi - also not sure why I mapped a boolean as a keyword

    • Uwe - terms aggregations sometimes return a numeric value instead of a boolean, just beware

    • Uwe - otherwise looks fine, that's how I would have done it too.

  • Uwe - What is the number of fields?

    • Javi - Right now for the full alias it's 733 fields.

    • Javi - Let me check just one index only...

      • 87 for plotdata_ecoplots3-data-observations-ausplots2-plant-occurrence

      • 71 for plotdata_ecoplots3-data-observations-ausplots2-land-surface-substrate

  • Uwe - Does the UI feel smoother and faster?

    • Javi - Yes, I've removed all the labels from the index and the index size is much smaller.

    • Uwe - the real reason I asked to remove the labels is to not get field number explosions.

    • Uwe - the other thing to do is to set index to 0 for those fields so that they are not indexed: index is no, and doc values is also no. This way, you can still use them when you retrieve the data, but they are not indexed.

    • Javi - Right now I've removed all the labels and created a few more indices just for the labels as lookups, and now it's pretty fast.

    • Uwe - Yup perfect.

    • Javi - showing demo

    • Uwe - Oh it's really fast!

  • Uwe - How many indices will you have at the end?

    • Javi - Around 10-20 indices or more per dataset on average (probably)

    • Uwe - you will see a little slowdown as you add more indices but should be fine for now. You can always group some datasets together into one index as well.

    • Uwe - if you see some slowdown and it gets too slow, then you can merge static datasets that no longer change into one index to reduce the number of indices.

      • Or use the merge API to combine indices into one

    • Javi - when do you think we may run into this problem?

    • Uwe - Not really a problem, more a question of non-functional requirements and what performance is acceptable

    • Uwe - but this I see more as a maintenance task rather than a development task, something to keep in mind in the future.

    • Javi - that's good to know, we do have some one-off datasets
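
A minimal sketch of the index-side changes discussed above, written as plain Python dictionaries so that it needs no client library. The `region:*` pattern comes from the notes; the `label:*` pattern, the template names, the index name and the helper function names are hypothetical, and the real mapping may well differ. The second template illustrates Uwe's suggestion of keeping a field retrievable from `_source` while switching off both `index` and `doc_values`. The request list follows the benchmark routine Javi described (refresh disabled during ingestion, force merge, cache cleared before timing); the exact order and the step that restores the refresh interval are assumptions.

```python
import json


def build_mappings():
    """Return a mappings body with two dynamic templates (names are hypothetical)."""
    return {
        "dynamic_templates": [
            {
                # Any field whose path matches "region:*" becomes a keyword,
                # which suits terms aggregations without full-text search.
                "regions_as_keyword": {
                    "path_match": "region:*",
                    "mapping": {"type": "keyword"},
                }
            },
            {
                # Assumed pattern for label-like fields: still returned in
                # _source, but neither searchable nor aggregatable.
                "labels_retrieval_only": {
                    "path_match": "label:*",
                    "mapping": {
                        "type": "keyword",
                        "index": False,
                        "doc_values": False,
                    },
                }
            },
        ]
    }


def benchmark_requests(index):
    """Return (method, path, body) tuples for ingestion and benchmark prep."""
    return [
        # Disable refresh before bulk ingestion.
        ("PUT", f"/{index}/_settings", {"index": {"refresh_interval": "-1"}}),
        # Bulk indexing would happen here.
        # Reset the refresh interval to its default (null) afterwards.
        ("PUT", f"/{index}/_settings", {"index": {"refresh_interval": None}}),
        # Merge segments, then clear caches so timings start cold.
        ("POST", f"/{index}/_forcemerge", None),
        ("POST", f"/{index}/_cache/clear", None),
    ]


if __name__ == "__main__":
    print(json.dumps({"mappings": build_mappings()}, indent=2))
    for method, path, body in benchmark_requests("my-observations"):
        print(method, path, json.dumps(body) if body is not None else "")
```

The tuples can be sent with whatever HTTP tool is at hand (Postman, for instance); keeping them as data makes the benchmark routine easy to repeat per index.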

Edmond got disconnected from Zoom. About a minute of the conversation was missed.

  • Uwe - because you have multiple indices, the queries run in parallel and the results are merged in the response

    • Javi - We are using the scroll API to get around 20 million documents; since the indices are smaller overall, there is a slight improvement there as well.

    • Javi - for the UI we are using pagination

  • Uwe - is date range working too?

    • Javi - yes, and the queries now are simpler too.

  • Javi - No longer need to use nested aggregations as well

    • Uwe - Yes, just terms aggregations and nothing more. That's good. Really a great success in my opinion.

    • Javi - Yes I am quite happy with the improvement

  • Javi - Gerhard did a few changes to the cluster, so maybe he wants to update Uwe.

    • Gerhard - I changed the heap settings to 25% of the maximum RAM and added a few more CPU cores and a bit more RAM.

    • Uwe - Last time we said I'll get access to the machines via SSH, is that still required? For now, seems fine.

    • Gerhard - using the readonlyrest extension for authorization

      • Uwe - be careful and disable scripting if you're not using the script API

        • Not really sure if the read-only will affect this but if you're not using it, then disable it to be safer.

      • Gerhard - it's not even read-only from the outside, and it will eventually be completely blocked from the outside

  • Uwe - Have you thought about using OpenSearch?

    • Gerhard - Thought about it but haven't tried it

    • Uwe - I think most of it is the same; the Elasticsearch stuff is identical but the authorization handling is different

  • Uwe - you're using Postman and not Kibana?

    • Javi - I use Postman in my day-to-day work and it can save queries

  • Uwe - How many code changes were needed in the UI and API?

    • Javi - Project is split into 3

      • API

      • UI

      • Indexer

    • Javi - Most of the changes were happening in the indexer

    • Javi - The API project was just changing the queries.

    • Javi - the UI now looks up the labels

    • Uwe - Oh, I thought the API would look for the labels.

  • Uwe - Will the API be used by external users?

    • Javi - Yes

    • Uwe - will they also have to use URIs?

    • Javi - For now yes

    • Gerhard - we can also have R or Python wrappers around the APIs to improve the UX.

  • Javi - We have some issues with the map and were wondering if you can help us, Uwe.

    • Javi - We are doing geo aggregations.

    • Uwe - So are the coordinates saved here in the documents?

    • Javi - shows Kibana

    • Javi - using geo_point

    • Uwe - unfortunately I don't have much experience with the geo aggregations

    • Javi - shows the clustering of sites and how zooming in and out calculates the aggregations to return the response, a cluster of sites.

    • Javi - we will have 10s of thousands of sites in the coming datasets

      • Javi - we like square data grids like the ones DataOne uses (shows the DataOne portal)

      • Javi - Looking at the geotile API in Elasticsearch

      • Uwe - unfortunately I have no idea and haven't tried this before

      • Javi - currently we are using the geohash API but I want to use the geotile API.

    • Uwe - At PANGAEA we are assigning names of the regions but not anything like this with a map with rectangle clustering

    • Uwe - I prefer full-text search instead of arbitrary maps broken into clusters of things

    • Uwe - I prefer the current clustering that you have instead of the rectangular one.

  • Uwe - probably worthwhile to think about utilising full-text search in combination with the current search options.

  • Guru - Javi is currently doing aggregations in real time, which causes some slowness. Is there a better way to do this, such as indexing it in advance?

    • Uwe - A bit strange that it's slow for around 800 sites.

    • Javi - Every zoom in or out performs a new aggregation. So it's not really slow, just inefficient because it's performing a lot of aggregations.

    • Uwe - At PANGAEA we are adding full-text search and everything during indexing time and have it as a grid and index it.

    • Gerhard - not really slow currently and the animation with the zoom in and out is blocking the UI.

    • Uwe - I don't think there's much to do right now, as the aggregations are fast. You can precalculate them in the future.

    • Javi - I read many people are moving to Elasticsearch for geo instead of using something like PostGIS
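
A rough sketch of what the square-grid map request discussed above might look like with Elasticsearch's `geotile_grid` aggregation, limited to the visible viewport with a `geo_bounding_box` query. The field name `location`, the `extra_zoom` offset, the example coordinates and both helper names are assumptions made for illustration and are not taken from the project. `geotile_grid` bucket keys have the form `zoom/x/y`, so the second helper turns a key back into the corners of its tile using the standard Web Mercator tile formulas, which is what a UI would need to draw squares.

```python
import math

MAX_GEOTILE_PRECISION = 29  # highest precision geotile_grid accepts


def geotile_request(field, map_zoom, bounds, extra_zoom=2):
    """Build a search body that buckets documents into map tiles.

    extra_zoom (hypothetical tuning knob) makes tiles finer than the map zoom.
    """
    precision = max(0, min(map_zoom + extra_zoom, MAX_GEOTILE_PRECISION))
    return {
        "size": 0,  # only the buckets are needed, not the documents
        "query": {"geo_bounding_box": {field: bounds}},
        "aggs": {
            "sites": {"geotile_grid": {"field": field, "precision": precision}}
        },
    }


def tile_bounds(key):
    """Convert a 'zoom/x/y' bucket key into the tile's corner coordinates."""
    zoom, x, y = (int(part) for part in key.split("/"))
    n = 2 ** zoom  # number of tiles per axis at this zoom level

    def lon(col):
        return col / n * 360.0 - 180.0

    def lat(row):
        return math.degrees(math.atan(math.sinh(math.pi * (1 - 2 * row / n))))

    return {
        "top_left": {"lat": lat(y), "lon": lon(x)},
        "bottom_right": {"lat": lat(y + 1), "lon": lon(x + 1)},
    }


if __name__ == "__main__":
    # Example viewport only; the coordinates are illustrative.
    viewport = {
        "top_left": {"lat": -10.0, "lon": 112.0},
        "bottom_right": {"lat": -44.0, "lon": 154.0},
    }
    print(geotile_request("location", map_zoom=4, bounds=viewport))
    print(tile_bounds("1/1/0"))
```

Because the request depends only on the zoom level and the viewport, responses could be cached per zoom level or precalculated later, in line with Uwe's remark that precalculation can wait.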

Action items

Decisions