Summary of recommendations/changes in Mapping

Following some of the recommendations presented on Monday 25th, here are some statistics to help decide how much each change improves ES performance.

https://docs.google.com/document/d/11rdMT-ZoFpOmJY4ND1ujb5kwyiSZCMro7_47QbkiY9U

1 Region field
- 1.1 Data changes
  - 1.1.1 Index size stats
- 1.2 ES Queries
  - 1.2.1 Requests time stats
2 Denormalise attributes
- 2.1 Index size stats
3 No nested fields at all
- 3.1 Total number of fields in alias
  - 3.1.1 Using wc from mapping
  - 3.1.2 Using _field_caps API
4 Denormalise regions
5 Force Merge API
6 Disable refresh during indexing
7 Dynamic mapping
- 7.1 Data indices Mapping

Region field

Change the mapping for the “regions” field: drop the “nested” type in favour of a plain “keyword” field (concatenating region_type and region), then use a “regex” in aggregations.
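As an illustration of the regex idea, the sketch below builds a terms aggregation that only returns buckets for one region type. It is not the query used in the tests. It assumes values of the form "<region_type>|<region>" in the “regions” field, and the helper names (escape_lucene_regex, regions_aggregation) and the bucket size are made up for the example. The “include” parameter of the terms aggregation takes a Lucene regular expression, so the dots and the pipe have to be escaped.

```python
import json

# Characters with special meaning in Lucene regular expressions,
# the syntax Elasticsearch uses for "include" patterns.
LUCENE_REGEX_RESERVED = set('.?+*|{}[]()"\\#@&<>~')


def escape_lucene_regex(text: str) -> str:
    """Backslash-escape every reserved character so it matches literally."""
    return "".join("\\" + ch if ch in LUCENE_REGEX_RESERVED else ch for ch in text)


def regions_aggregation(region_type: str, size: int = 100) -> dict:
    """Build a terms aggregation limited to one region type (hypothetical helper).

    Values are assumed to look like "<region_type>|<region>", so the pattern
    is the escaped region type, an escaped pipe, then ".*".
    """
    pattern = escape_lucene_regex(region_type + "|") + ".*"
    return {
        "size": 0,  # only the aggregation buckets are wanted, not the hits
        "aggs": {
            "regions": {
                "terms": {"field": "regions", "include": pattern, "size": size}
            }
        },
    }


body = regions_aggregation("http://linked.data.gov.au/dataset/nrm-2017")
print(json.dumps(body, indent=2))
```

Because the pipe follows the region type directly, a pattern built for one region type should not pick up values of another type whose URI merely starts with the same prefix.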

Data changes

New mapping for regions:

"regions": {"type": "keyword"},
"region_type": {"type": "keyword"},

Data before:

"regions": [
      {
        "uri": "http://linked.data.gov.au/dataset/asgs2016/stateorterritory/5",
        "label": "Western Australia",
        "dataset": {
          "uri": "http://linked.data.gov.au/dataset/asgs2016/stateorterritory",
          "label": "States and territories"
        }
      },
      {
        "uri": "http://linked.data.gov.au/dataset/wwf-terr-ecoregions/14110",
        "label": "Southwest Australia savanna",
        "dataset": {
          "uri": "http://linked.data.gov.au/dataset/wwf-terr-ecoregions",
          "label": "WWF ecoregions"
        }
      },
      {
        "uri": "http://linked.data.gov.au/dataset/local-gov-areas-2011/56790",
        "label": "Northampton (S)",
        "dataset": {
          "uri": "http://linked.data.gov.au/dataset/local-gov-areas-2011",
          "label": "Local government areas"
        }
      },
      {
        "uri": "http://linked.data.gov.au/dataset/nrm-2017/5010",
        "label": "Northern Agricultural Region",
        "dataset": {
          "uri": "http://linked.data.gov.au/dataset/nrm-2017",
          "label": "NRM regions"
        }
      },
      {
        "uri": "http://linked.data.gov.au/dataset/capad-2018-terrestrial/BHA_26",
        "label": "Eurardy",
        "dataset": {
          "uri": "http://linked.data.gov.au/dataset/capad-2018-terrestrial",
          "label": "Terrestrial CAPAD regions"
        }
      },
      {
        "uri": "http://linked.data.gov.au/dataset/bioregion/GES01",
        "label": "Geraldton Hills",
        "dataset": {
          "uri": "http://linked.data.gov.au/dataset/bioregion",
          "label": "Subregions"
        }
      },
      {
        "uri": "http://linked.data.gov.au/dataset/bioregion/GES",
        "label": "Geraldton Sandplains",
        "dataset": {
          "uri": "http://linked.data.gov.au/dataset/bioregion/IBRA7",
          "label": "Bioregions"
        }
      }
],

Data after:

"regions": [
      "http://linked.data.gov.au/dataset/asgs2016/stateorterritory|http://linked.data.gov.au/dataset/asgs2016/stateorterritory/5",
      "http://linked.data.gov.au/dataset/wwf-terr-ecoregions|http://linked.data.gov.au/dataset/wwf-terr-ecoregions/14110",
      "http://linked.data.gov.au/dataset/local-gov-areas-2011|http://linked.data.gov.au/dataset/local-gov-areas-2011/56790",
      "http://linked.data.gov.au/dataset/nrm-2017|http://linked.data.gov.au/dataset/nrm-2017/5010",
      "http://linked.data.gov.au/dataset/capad-2018-terrestrial|http://linked.data.gov.au/dataset/capad-2018-terrestrial/BHA_26",
      "http://linked.data.gov.au/dataset/bioregion|http://linked.data.gov.au/dataset/bioregion/GES01",
      "http://linked.data.gov.au/dataset/bioregion/IBRA7|http://linked.data.gov.au/dataset/bioregion/GES"
],
"region_types": [
      "http://linked.data.gov.au/dataset/asgs2016/stateorterritory",
      "http://linked.data.gov.au/dataset/wwf-terr-ecoregions",
      "http://linked.data.gov.au/dataset/local-gov-areas-2011",
      "http://linked.data.gov.au/dataset/nrm-2017",
      "http://linked.data.gov.au/dataset/capad-2018-terrestrial",
      "http://linked.data.gov.au/dataset/bioregion",
      "http://linked.data.gov.au/dataset/bioregion/IBRA7"
]
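The transformation from the “before” to the “after” shape can be sketched as follows. This is a minimal example rather than the actual ingestion code: the function name flatten_regions is hypothetical, and the sketch assumes that labels are simply dropped, as they do not appear in the “after” data.

```python
def flatten_regions(nested_regions):
    """Turn nested region objects into the two keyword arrays."""
    regions = []
    region_types = []
    for region in nested_regions:
        type_uri = region["dataset"]["uri"]
        # Concatenate region type and region with a pipe, as in "Data after".
        regions.append(f"{type_uri}|{region['uri']}")
        if type_uri not in region_types:  # keep each type once, in order
            region_types.append(type_uri)
    return {"regions": regions, "region_types": region_types}


nested = [
    {
        "uri": "http://linked.data.gov.au/dataset/asgs2016/stateorterritory/5",
        "label": "Western Australia",
        "dataset": {
            "uri": "http://linked.data.gov.au/dataset/asgs2016/stateorterritory",
            "label": "States and territories",
        },
    },
    {
        "uri": "http://linked.data.gov.au/dataset/bioregion/GES",
        "label": "Geraldton Sandplains",
        "dataset": {
            "uri": "http://linked.data.gov.au/dataset/bioregion/IBRA7",
            "label": "Bioregions",
        },
    },
]

flat = flatten_regions(nested)
print(flat["regions"])
print(flat["region_types"])
```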

Index size stats

Approach	No. of docs	No. of docs (incl. hidden docs)	Docs increase
Nested docs	2,563,630	27,284,158	x10.64278
Keyword	2,563,630	10,292,406	x4.014778

In terms of number of documents (including hidden ones), the old index is ~2.65 times the size of the new index, i.e. ~1.65 times greater.
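The figures above can be reproduced from the document counts in the table. The short calculation below assumes that “Docs increase” is the total number of Lucene documents divided by the number of top-level documents; the variable names are only for illustration.

```python
TOP_LEVEL_DOCS = 2_563_630   # documents visible to ES
NESTED_TOTAL = 27_284_158    # incl. hidden docs, nested regions
KEYWORD_TOTAL = 10_292_406   # incl. hidden docs, keyword regions

# "Docs increase": Lucene documents per top-level document.
nested_increase = NESTED_TOTAL / TOP_LEVEL_DOCS
keyword_increase = KEYWORD_TOTAL / TOP_LEVEL_DOCS

# Old index compared with new index.
old_vs_new = NESTED_TOTAL / KEYWORD_TOTAL

print(f"nested:  x{nested_increase:.5f}")
print(f"keyword: x{keyword_increase:.6f}")
print(f"old/new: {old_vs_new:.2f} ({old_vs_new - 1:.2f} times greater)")
```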

ES Queries

Old query for regions aggregation:

New queries:

Requests time stats

Summary Excel:

JSONs with results of tests:

POSTMAN queries (collection) → Importable to Postman by anyone.

Non-nested regions are slightly faster in most of the executions.

Denormalise attributes

The aim of this recommendation is to get rid of nested fields (which create “hidden” Lucene documents) in order to prevent uncontrolled growth in the size of the index.

Each attribute in the “nested” field will be modelled as a specific field in the document, instead of being nested in an “array/list” of documents.

This approach should:

- Reduce the final number of documents.
- Make aggregations by attribute value simpler and faster?
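A rough sketch of the idea follows. The shape of the nested attributes is not shown on this page, so everything about the input here is an assumption: the entry keys ("attribute", "value"), the "<nested_field>_<attribute>" naming scheme and the function name denormalise_attributes are all hypothetical.

```python
def denormalise_attributes(doc, nested_field, name_key="attribute", value_key="value"):
    """Return a copy of doc with one top-level field per nested attribute.

    Assumed input shape (hypothetical):
        {"obs_attributes": [{"attribute": "height", "value": 1.2}, ...]}
    """
    # Copy everything except the nested field itself.
    flat = {key: value for key, value in doc.items() if key != nested_field}
    for entry in doc.get(nested_field, []):
        field_name = f"{nested_field}_{entry[name_key]}"
        # Values are collected in a list, so an attribute that occurs more
        # than once becomes a multi-valued field.
        flat.setdefault(field_name, []).append(entry[value_key])
    return flat


doc = {
    "id": "obs-1",
    "obs_attributes": [
        {"attribute": "height", "value": 1.2},
        {"attribute": "cover", "value": 35},
        {"attribute": "height", "value": 1.4},
    ],
}
print(denormalise_attributes(doc, "obs_attributes"))
```

With this layout no hidden Lucene documents are created for the attributes, and an aggregation by attribute value can target the attribute's field directly.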

Index size stats

Approach	No. of docs	No. of docs (incl. hidden docs)	Docs increase
Nested docs	2,563,630	27,284,158	x10.64
Denormalised	2,563,630	19,580,736	x7.64

In terms of number of documents (including hidden ones), the old index is ~1.39 times the size of the new index, i.e. ~0.39 times greater.

Data before denormalisation:

Data after denormalisation:

No nested fields at all

Following the above practices (regions and attributes), we have got rid of all nested fields in the mapping (that is, regions, foi_attributes, obs_attributes and instr_attributes).

Total number of fields in alias

Denormalisation means having 1 field (which has sub-fields) per attribute.

To keep the number of fields under control (ES imposes limits, mainly to avoid “mapping explosion”, and Lucene has internal limits of its own), the number of fields in an index can be counted with a simple command.

Using wc from mapping

As of 1/11/2021, with 1 full dataset ingested, the total number of fields is 261.

For this test, all original indices were transformed and merged into a single index; with real data there would be 1 index per FOI and dataset.
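As an alternative to counting lines with wc, the fields can be counted by walking the mapping JSON. The sketch below is one possible way to do it, not the command used for the figure above. It counts object fields and multi-fields as well as leaf fields, which is an assumption about what should be counted, and the sample mapping (including the "site" object) is hypothetical. For reference, the index.mapping.total_fields.limit setting defaults to 1000.

```python
def count_fields(properties: dict) -> int:
    """Count every field, object and multi-field under a "properties" block."""
    total = 0
    for definition in properties.values():
        total += 1  # the field (or object) itself
        total += count_fields(definition.get("properties", {}))  # object children
        total += len(definition.get("fields", {}))  # multi-fields such as ".raw"
    return total


# Hypothetical mapping fragment, in the shape returned by GET <index>/_mapping.
mapping = {
    "properties": {
        "regions": {"type": "keyword"},
        "region_types": {"type": "keyword"},
        "site": {
            "properties": {
                "name": {"type": "text", "fields": {"raw": {"type": "keyword"}}}
            }
        },
    }
}

print(count_fields(mapping["properties"]))
```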

Using _field_caps API

As of 1/11/2021, with 1 full dataset ingested, the total number of fields is 290.

Denormalise regions

ES document mapping:

Mapping generated dynamically using https://www.elastic.co/guide/en/elasticsearch/reference/current/dynamic-templates.html
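For readers unfamiliar with dynamic templates, a minimal example of an index body that uses them is sketched below. It is illustrative only and is not the mapping used in this project: the template names and both rules (strings stored as keyword, whole numbers stored as double) are assumptions chosen for the example.

```python
import json

# Illustrative index body; template names and rules are hypothetical.
index_body = {
    "mappings": {
        "dynamic_templates": [
            {
                # Any new string field is mapped as keyword instead of text.
                "strings_as_keywords": {
                    "match_mapping_type": "string",
                    "mapping": {"type": "keyword"},
                }
            },
            {
                # Whole numbers are mapped as double, so a field first seen
                # with an integer value can later hold decimals too.
                "whole_numbers_as_double": {
                    "match_mapping_type": "long",
                    "mapping": {"type": "double"},
                }
            },
        ]
    }
}

print(json.dumps(index_body, indent=2))
```

Templates are evaluated in order and the first match wins, so more specific rules would normally go before more general ones.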

Force Merge API

Index segments are merged after every indexing run.

Disable refresh during indexing

Disabling index refresh makes indexing notably faster (throughput: ~1000 every two seconds).

A single refresh is performed manually after indexing. The index segments are then merged (force-merge).
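The sequence described above could be expressed as the following list of REST calls. This is a sketch under assumptions rather than the actual indexing script: the function name bulk_load_plan and the index name are hypothetical, merging down to a single segment (max_num_segments=1) is an assumed choice, and restoring the refresh interval at the end is an extra step not mentioned above. No request is sent; the plan is only built and printed.

```python
def bulk_load_plan(index: str, max_num_segments: int = 1):
    """Return the (method, path, body) calls for a refresh-free bulk load."""
    return [
        # 1. Disable automatic refresh while indexing.
        ("PUT", f"/{index}/_settings", {"index": {"refresh_interval": "-1"}}),
        # 2. Index the documents (bulk body omitted in this sketch).
        ("POST", f"/{index}/_bulk", None),
        # 3. One manual refresh so the new documents become searchable.
        ("POST", f"/{index}/_refresh", None),
        # 4. Merge the index segments.
        ("POST", f"/{index}/_forcemerge?max_num_segments={max_num_segments}", None),
        # 5. Assumed extra step: a null value resets the interval to its default.
        ("PUT", f"/{index}/_settings", {"index": {"refresh_interval": None}}),
    ]


for method, path, body in bulk_load_plan("example-index"):
    print(method, path, body)
```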

Dynamic mapping

In order to ensure that the correct datatype is stored in ES for each attribute value, dynamic templating is performed during indexing following the defined rules:
