Following some of the recommendations presented on Monday 25th, here are some stats to help decide how much each change improves ES performance.
https://docs.google.com/document/d/11rdMT-ZoFpOmJY4ND1ujb5kwyiSZCMro7_47QbkiY9U
Region field
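Change the mapping for the “regions” field: drop the “nested” type in favour of a plain “keyword” field whose values concatenate the region type and the region URI (separated by “|”), and then use a regex “include” filter in the terms aggregation to select the regions of a given type.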
Data changes
New mapping for regions:
"regions": {"type": "keyword"}, "region_types": {"type": "keyword"},
Data before:
"regions": [ { "uri": "http://linked.data.gov.au/dataset/asgs2016/stateorterritory/5", "label": "Western Australia", "dataset": { "uri": "http://linked.data.gov.au/dataset/asgs2016/stateorterritory", "label": "States and territories" } }, { "uri": "http://linked.data.gov.au/dataset/wwf-terr-ecoregions/14110", "label": "Southwest Australia savanna", "dataset": { "uri": "http://linked.data.gov.au/dataset/wwf-terr-ecoregions", "label": "WWF ecoregions" } }, { "uri": "http://linked.data.gov.au/dataset/local-gov-areas-2011/56790", "label": "Northampton (S)", "dataset": { "uri": "http://linked.data.gov.au/dataset/local-gov-areas-2011", "label": "Local government areas" } }, { "uri": "http://linked.data.gov.au/dataset/nrm-2017/5010", "label": "Northern Agricultural Region", "dataset": { "uri": "http://linked.data.gov.au/dataset/nrm-2017", "label": "NRM regions" } }, { "uri": "http://linked.data.gov.au/dataset/capad-2018-terrestrial/BHA_26", "label": "Eurardy", "dataset": { "uri": "http://linked.data.gov.au/dataset/capad-2018-terrestrial", "label": "Terrestrial CAPAD regions" } }, { "uri": "http://linked.data.gov.au/dataset/bioregion/GES01", "label": "Geraldton Hills", "dataset": { "uri": "http://linked.data.gov.au/dataset/bioregion", "label": "Subregions" } }, { "uri": "http://linked.data.gov.au/dataset/bioregion/GES", "label": "Geraldton Sandplains", "dataset": { "uri": "http://linked.data.gov.au/dataset/bioregion/IBRA7", "label": "Bioregions" } } ],
Data after:
"regions": [ "http://linked.data.gov.au/dataset/asgs2016/stateorterritory|http://linked.data.gov.au/dataset/asgs2016/stateorterritory/5", "http://linked.data.gov.au/dataset/wwf-terr-ecoregions|http://linked.data.gov.au/dataset/wwf-terr-ecoregions/14110", "http://linked.data.gov.au/dataset/local-gov-areas-2011|http://linked.data.gov.au/dataset/local-gov-areas-2011/56790", "http://linked.data.gov.au/dataset/nrm-2017|http://linked.data.gov.au/dataset/nrm-2017/5010", "http://linked.data.gov.au/dataset/capad-2018-terrestrial|http://linked.data.gov.au/dataset/capad-2018-terrestrial/BHA_26", "http://linked.data.gov.au/dataset/bioregion|http://linked.data.gov.au/dataset/bioregion/GES01", "http://linked.data.gov.au/dataset/bioregion/IBRA7|http://linked.data.gov.au/dataset/bioregion/GES" ], "region_types": [ "http://linked.data.gov.au/dataset/asgs2016/stateorterritory", "http://linked.data.gov.au/dataset/wwf-terr-ecoregions", "http://linked.data.gov.au/dataset/local-gov-areas-2011", "http://linked.data.gov.au/dataset/nrm-2017", "http://linked.data.gov.au/dataset/capad-2018-terrestrial", "http://linked.data.gov.au/dataset/bioregion", "http://linked.data.gov.au/dataset/bioregion/IBRA7" ]
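The transformation from the "before" to the "after" shape can be sketched as below. This is only an illustration, not the actual indexing code: the function name `flatten_regions` and the de-duplication of region types are assumptions; the "|" separator and the field names are taken from the samples above.

```python
from typing import Any

SEPARATOR = "|"  # separator used in the "Data after" sample


def flatten_regions(nested_regions: list[dict[str, Any]]) -> dict[str, list[str]]:
    """Turn nested region objects into two plain keyword arrays."""
    regions: list[str] = []
    region_types: list[str] = []
    for region in nested_regions:
        type_uri = region["dataset"]["uri"]
        # Each value is "<region type URI>|<region URI>".
        regions.append(f"{type_uri}{SEPARATOR}{region['uri']}")
        # Assumption: each region type is stored once, in first-seen order.
        if type_uri not in region_types:
            region_types.append(type_uri)
    return {"regions": regions, "region_types": region_types}


# Three entries copied from the "Data before" sample (labels are dropped by the conversion).
sample = [
    {
        "uri": "http://linked.data.gov.au/dataset/asgs2016/stateorterritory/5",
        "label": "Western Australia",
        "dataset": {
            "uri": "http://linked.data.gov.au/dataset/asgs2016/stateorterritory",
            "label": "States and territories",
        },
    },
    {
        "uri": "http://linked.data.gov.au/dataset/bioregion/GES01",
        "label": "Geraldton Hills",
        "dataset": {"uri": "http://linked.data.gov.au/dataset/bioregion", "label": "Subregions"},
    },
    {
        "uri": "http://linked.data.gov.au/dataset/bioregion/GES",
        "label": "Geraldton Sandplains",
        "dataset": {"uri": "http://linked.data.gov.au/dataset/bioregion/IBRA7", "label": "Bioregions"},
    },
]

flat = flatten_regions(sample)
for value in flat["regions"]:
    print(value)
```

Note that the labels are not carried over into the keyword values, so they would need to be resolved from elsewhere when rendering results.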
Index size stats
Approach | Number of docs | Number of docs (incl. hidden docs) | Docs increase |
---|---|---|---|
Nested docs | 2,563,630 | 27,284,158 | x10.64278 |
Keyword | 2,563,630 | 10,292,406 | x4.014778 |
The old (nested) index contains ~2.65 times as many documents as the new (keyword) index, i.e. ~165% more, counting hidden documents.
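For reference, the figures in the table can be re-derived with a few lines of arithmetic. The variable names are illustrative; the counts are the ones reported above.

```python
source_docs = 2_563_630      # documents as seen by the application
nested_total = 27_284_158    # nested mapping, including hidden documents
keyword_total = 10_292_406   # keyword mapping, including hidden documents

# "Docs increase" column: total Lucene documents per source document.
nested_increase = nested_total / source_docs
keyword_increase = keyword_total / source_docs

# Old index relative to the new one.
ratio = nested_total / keyword_total   # "times as many"
extra = ratio - 1                      # "times more"

print(f"nested x{nested_increase:.2f}, keyword x{keyword_increase:.2f}")
print(f"old/new = {ratio:.2f} (i.e. {extra:.0%} more documents)")
```

The two readings differ by one: a ratio of about 2.65 "times as many" is the same as about 1.65 "times more".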
ES Queries
Old queries for the regions aggregation (first by region type, then by region within one type):
{ "aggs": { "nested_agg": { "nested": { "path": "regions" }, "aggs": { "value": { "terms": { "field": "regions.dataset.uri", "size": 1000 } } } } }, "size": 0 }
{ "aggs": { "nested_agg": { "nested": { "path": "regions" }, "aggs": { "filtering": { "filter": { "term": { "regions.dataset.uri": "http://linked.data.gov.au/dataset/asgs2016/stateorterritory" } }, "aggs": { "value": { "terms": { "field": "regions.uri", "size": 1000 } } } } } } }, "size": 0 }
New queries:
{ "aggs": { "regions": { "terms": { "field": "region_types" } } }, "size": 0, "track_total_hits": true }
{ "aggs": { "regions": { "terms": { "field": "regions", "include": "http://linked.data.gov.au/dataset/bioregion\\|.*" } } }, "size": 0, "track_total_hits": true }
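A minimal sketch of how the second new query could be built for any region type. The helper names are hypothetical. Two details differ from the query above and are assumptions rather than what was tested: the sketch escapes every Lucene regex reserved character (including the dots in the URI), and it sets an explicit "size" of 1000 to match the old queries, since a terms aggregation otherwise returns only the top 10 buckets.

```python
import json

# Characters with a special meaning in Lucene regular expressions,
# which is the syntax used by the terms aggregation "include" option.
LUCENE_REGEX_RESERVED = set('.?+*|{}[]()"\\#@&<>~')


def escape_lucene_regex(text: str) -> str:
    """Backslash-escape reserved characters so the text matches literally."""
    return "".join("\\" + ch if ch in LUCENE_REGEX_RESERVED else ch for ch in text)


def regions_of_type_query(region_type_uri: str, size: int = 1000) -> dict:
    """Terms aggregation over "regions", restricted to one region type."""
    # Including the "|" separator in the literal prefix keeps ".../bioregion"
    # from also matching values of type ".../bioregion/IBRA7".
    pattern = escape_lucene_regex(region_type_uri + "|") + ".*"
    return {
        "aggs": {
            "regions": {
                "terms": {"field": "regions", "include": pattern, "size": size}
            }
        },
        "size": 0,
        "track_total_hits": True,
    }


query = regions_of_type_query("http://linked.data.gov.au/dataset/bioregion")
print(json.dumps(query, indent=2))
```

When comparing request times, it may be worth checking that the old and new queries ask for the same number of buckets, as that can affect the timings.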
Requests time stats
Summary Excel:
JSONs with results of tests:
POSTMAN queries (collection) → Importable to Postman by anyone.
The non-nested (keyword) regions approach is slightly faster in most of the executions.
Denormalise attributes
The aim of this recommendation is to get rid of nested fields (which create “hidden” Lucene documents) in order to prevent uncontrolled growth in the size of the index.
Each attribute in the “nested” field will be modelled as its own field in the document, instead of being nested in an array/list of documents.
This approach should:
Reduce the final number of documents.
Make aggregations by attribute value simpler and possibly faster (to be confirmed).
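The idea can be illustrated with the sketch below. The attribute structure is not shown on this page, so everything about the document shape here is hypothetical: the nested field name "attributes", the "name"/"value" keys, the "attr_" prefix and the sample values are assumptions made only for the example.

```python
def denormalise_attributes(doc: dict, nested_field: str = "attributes") -> dict:
    """Move each nested attribute into its own top-level field."""
    # Copy everything except the nested array.
    flat = {key: value for key, value in doc.items() if key != nested_field}
    for attribute in doc.get(nested_field, []):
        field_name = f"attr_{attribute['name']}"  # hypothetical naming scheme
        # An attribute may occur more than once, so values are kept as a list.
        flat.setdefault(field_name, []).append(attribute["value"])
    return flat


# Hypothetical document, only to show the shape change.
nested_doc = {
    "id": "obs-1",
    "attributes": [
        {"name": "height", "value": "1.2"},
        {"name": "life_stage", "value": "adult"},
        {"name": "height", "value": "1.4"},
    ],
}

print(denormalise_attributes(nested_doc))
```

With this shape, an aggregation by attribute value becomes a plain terms aggregation on the attribute's own field, without a "nested" wrapper.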
Index size stats
Approach | Number of docs | Number of docs (incl. hidden docs) | Docs increase |
---|---|---|---|
Nested docs | 2,563,630 | 27,284,158 | x10.64 |
Denormalised | 2,563,630 | 19,580,736 | x7.64 |
The old (nested) index contains ~1.39 times as many documents as the new (denormalised) index, i.e. ~39% more, counting hidden documents.