Questions and Errors in the Ausplots Rangelands Dataset
Questions
QUESTION 1 | There are a lot of sites where PLOT_DIMENSIONS are not explicitly recorded (values are NULL). This should not be a problem because we have PLOT_IS_100M_BY_100M field, and for most of them, the value is TRUE, so they actually are 100x100. But there are some cases where PLOT_DIMENSIONS is NULL and PLOT_IS_100M_BY_100M is FALSE, so we don’t know what is the real size of the plot. Do you have any more information about those sites? I have used this query: SELECT * FROM site_location WHERE "PLOT_DIMENSIONS" IS NULL AND "PLOT_IS_100M_BY_100M" IN (false) |
ANSWER | Email Emrys (12/05/2020): “I have fixed the issues with question 2 in SWARM.” Ben 15/5/20: “All of our plots are designed where ever possible to be 100x100 oriented to the grid (Will explain below regarding coords), but on occasions it is not possible to fit a 100 x 100 plot within a homogeneous patch of vegetation in the field, and so (Rarely) plot dimensions can be changed to eg. 200 x 50m (They always need to be 10,000M^2). This field is designed to quickly filter out those plots that “are not normal” in regards their dimensions. It looks from the comment above that Emrys has now populated the critical info!” |
ACTION / FIX | No action needed, in future ingestions data should be fixed. PENDING FUTURE REVIEW |
QUESTION 2 | We have noticed that PHOTO_STORAGE_LOCATION field in ‘photopoints’ table, from sometime in 2015, is always NULL. In all previous rows it has the value “ All photos are stored at the TERN AusPlots facility, Level 12 Schultz building, Adelaide University North Terrace Campus. Contact christina.pahl@adelaide.edu.au for access requests.” We think that even if the value is NULL, photos are still stored in TERN AusPlots facility, and we should show that message as well. Is this correct? |
ANSWER | Ben 15/5/20: “Presumably an old question as i think you’re now working on getting the photos into the new system…. But those contact details are old (by about three years) so best to update if necessary to put contact details.” |
ACTION / FIX | PENDING Link to the actual photo file or ask for update the contact details in DB. |
QUESTION 3 | There is a column for a single latitude and longitude value per site_location row. There is also latitude and longitude values from the site_location_point table as app_lat and app_long. What is the difference? Is there a difference? Can the latitude and longitude value be treated the same as the values in the site_location_point table? |
ANSWER | Emrys: “They are and should be the same. Should try and provide both as some users may prefer to work in the UTM coordinate system over the latitude and longitude. site_location lat/long is always a south-west point from the site_location_point. Can probably drop UTM coordinates but need to double check.“ Ben 15/5/20: “A and B - in reference to app_lat and App_long - Given the ease of making the tablet collect the location of the plot whilst collecting the data at the site we chose to collect that data as an ultimate back up - If site name was screwed up, or DGPS data corrupt we would still have an approximate location of the plot ( It’s only GPS not DGPS, and i don’t remember exactly what action in the app triggers it so it would not always be collected from the same location within the plot….. It is for internal use only and should not be exposed to the public” |
ACTION / FIX | We are using the SW point as location of the site, we don’t use these coordinates. |
QUESTION 3 | There are some cases where app_lat and app_long is not the same as the SW site_location_point: SELECT s.SITE_LOCATION_ID, app_lat, app_long, latitude, longitude FROM SITE_LOCATION_POINT p RIGHT JOIN SITE_LOCATION s ON p.SITE_LOCATION_ID = s.SITE_LOCATION_ID WHERE p.point='SW' AND (s.app_long is not null) AND (s.app_lat is not null) AND (s.app_long<>'No Location Data') AND (s.app_lat <>'No Location Data') AND (s.app_long<>'NC') AND (s.app_lat <>'NC') AND (round(p.latitude,3) <> round(s.app_lat,3)) AND (round(p.longitude,3) <> round(s.app_long,3)) |
ANSWER | (Emrys): Ignore the APP Lat Long- Do not use it to give you the site coordinates. It is really just a backup if everything else turns to shit |
ACTION / FIX | We are using the SW point as location of the site, we don’t use these coordinates. |
QUESTION 4 | The tables CHARACTERISATION_COARSE_FRAGS_LITHOLOGY and CHARACTERISATION_COARSE_FRAGS_shape, are supposed to be related 1 to 1 with the soil_characterisation sample, but in a few cases this is not true. Probably it is some kind of error introducing the data. select SOIL_CHARACTERISATION_ID , count(*) as c from CHARACTERISATION_COARSE_FRAGS_LITHOLOGY group by SOIL_CHARACTERISATION_ID order by c desc
select SOIL_CHARACTERISATION_ID , count(*) as c from CHARACTERISATION_COARSE_FRAGS_shape group by SOIL_CHARACTERISATION_ID order by c desc
select SOIL_CHARACTERISATION_ID , count(*) as c from CHARACTERISATION_COARSE_FRAGS_size group by SOIL_CHARACTERISATION_ID order by c desc
select SOIL_CHARACTERISATION_ID , count(*) as c from CHARACTERISATION_COARSE_FRAGS_abundance group by SOIL_CHARACTERISATION_ID order by c desc
|
ANSWER | (Emrys): “I just had a look to work out what was going on here. It seems this is collected in the soil data sheet but doesn’t appear to be going into the database, at least not anywhere I can see it.” Ben 15/5/20 - “Andrew may be able to re-run our import scripts for this data - Worth running this Question past him!” |
ACTION / FIX | Ask them for fix the data. PENDING |
QUESTION 5 | There are 3 site visits with 6 bulk density samples, apparently duplicated rows: select SITE_LOCATION_VISIT_ID, count(*) as count from SOIL_BULK_DENSITY group by SITE_LOCATION_VISIT_ID order by count desc
SITE_LOCATION_VISIT_ID COUNT
57617 6
57087 6
58009 6
select * from SOIL_BULK_DENSITY WHERE SITE_LOCATION_VISIT_ID = 57617 |
ANSWER | (Emrys): “As you point out, rows that have somehow been duplicated. We will work on getting rid of the spurious ones from our end” |
ACTION / FIX | Nothing in our side. Review data has been fixed in source. PENDING FUTURE REVIEW |
QUESTION 6 | Do we want all of the points in the site_location_point table for a site_location_id? |
ANSWER | Yes, the points help create the plot bounds. Ben 15/5/20: “Just to add to the previous comment it is important to keep and make available all of the site location info - Including the actual coordinates for the 4 plot corners collected with the dgps. Given it is for monitoring it is essential to sample exactly the same plot through time. These define the plot (incase any of our plot markers in the field are missing, and are commonly used by remote sensors to “cookie cut” imagery that they are able to directly relate to our field data!” |
ACTION / FIX | We are storing SW as site location coordinates, 4 corners as a Polygon and linking all samples and observations to the place they were taken (using the rest of the points, if apply). |
QUESTION 7 | With CORVEG, we only saw 1:1 relationships between an observation collection and an observation. AusPlots Rangelands has 1:M relationship between an observation collection and observation. When we present the integrated data to users in its tabular form, how should it be presented? So far each “theme” is a table of data, and to represent this data accurately, we would have to have multiple tables for a theme. This questions is more for the data team to review once all the data has been integrated. We will have to review this question to decide how to present the data to the users. |
ANSWER | Ben 15/5/20: “Given the complexity of parts of our data collection ( Particularly the different intercepts through the vertical projection of each point intercept) it was not possible to avoid one to many relationships, but Andrew and i were clear that it was necessary and why it was done when we designed and built the database - When you have that discussion as a team you might find it useful to include both Andrew and Myself!” |
ACTION / FIX | All parameters of a theme will be offered to the user as a table (if no additional filter are applied) |
QUESTION 8 | In the Plot Ontology, it only has properties to capture latitude and longitude, and not properties from the UTM system. Do we extend the model, or do we convert all UTM coordinates to latitude and longitude? This question is for the data team to decide whether we extend the information model or not. We can always keep it simple and only support latitude and longitude points since it can be converted to UTM. |
ANSWER | Ben 15/5/20: “We purposely designed the plot locations to be laid out using the UTM…. That meant that if anything went wrong with the dgps then we could lay out the plot (More roughly) with a hand held notebook and a hand held GPS….. Tale SW corner coord, Add 100 to easting to get SE, 100 to northing to get NW, and 100 to both to get NE. Delivery is probably an issue for you guys to discuss, but it would be nice if a user wants to use UTM that they get the same coords as we collected rather than some slight error induced from the conversion from UTM to LAT/Long and then back to UTM!” |
ACTION / FIX | REVIEW Should we include as well the UTM in the model? |
QUESTION 9 | In table herbarium_determination, what is state_accession_number? |
ANSWER | Ben 15/05/20: “When we get the herbaria to ID specimens we give them the option of keeping specimens that they think will make a valuable contribution to their collection on the understanding that they will access them into their collection - I.e. make them fully included into their Library and databasing etc. ( This cuts down the number of samples that we need to keep and gives the samples wider usage) - The accession number is essentially their primary key in their database - A unique number for every sample. it’s essential that this stays with our data and is easily publicly accessible so that anyone needing access to the sample can easily find it and arrange access.” |
ACTION / FIX | Already storing the State Accession Number in the tern-ssn:StateIdentification as dwct:identificationID |
QUESTION 10 | The Plot Ontology only supports recording location data at the site level. In AusPlots Rangelands, there are location data at the site level and sample level. Example tables: |
ANSWER | The Plot Ontology supports sites and sub-sites. Each sub-site can then have a relationship to a Location instance, which holds the location data. Ben 17/05/20: “Looks like you’ve solved this - I’ll also skip the two crossed out questions” Updated (05/06/2020): Specific location of samples are recorded as plot:SamplingPoint, which has the actual coordinates. |
ACTION / FIX | Modelled as plot:SamplingPoint |
QUESTION | The table species_growth_form is missing. |
ANSWER | This is a data team problem. Mosheh will enable this in the “datarunner” script. |
ACTION / FIX | Solved |
QUESTION | The table lut_structure_size is missing. |
ANSWER | This is a data team problem. Mosheh will enable this in the “datarunner” script. |
ACTION / FIX | Solved |
QUESTION 11 | Leaf area index data missing. Relationship to the stratum is also missing. Is it possible to provide this information? Capture this as an observation with result as a boolean. |
ANSWER | The information is stored in Adelaide. Potentially in the future, the schema will change to accommodate for the information in one of the tables, or in a new table.
|
ACTION / FIX | PENDING FUTURE REVIEW Review in late June/July 2020 if LAI data will be continue collected. |
QUESTION 12 | Is AusPlots Rangelands released annually? What is the release schedule? |
ANSWER | Ben 17/05/20: “The old AEKOS used to set up Specified ingests. We’re pretty convinced that that is not the best past forward. We would prefer you guys to interact with our data via the API so that users can access the most up to data data. It is important though if things happen this way that only sites with the “o.k. to publish” flag set to “yes” or “true” (Sorry i cant remember the exact format of the field in the DB) are made publicly available. Other data (that flag not set) is in various states of upload / curation / Waiting for herbarium id’s etc….. There is probably potential to publish some of this data in future, but we probably need to have a longer discussion of the implications / benefits of doing so…. It would be nice to have a “collected, but data not yet available” category so that we can fairly rapidly make the locations of our sample sites known!” |
ACTION / FIX | Currently we getting rid of all the data related to sites with “o.k. to publish”=FALSE. |
QUESTION 13 | Where is AusPlots Forests? Does it have a database? |
ANSWER | Ben 17/05/20: “This was initially ingested into Aekos despite my recommendation not to bother,,,, At this stage it is essentially a legacy dataset with no ability for ongoing monitoring - It is completely separate to APR and is probably a lower priority. It is a lot of work for 48 plots!” |
ACTION / FIX | Is is not been ingested as part of Ausplots Reangelads dataset. |
QUESTION 14 | Where do you source the species names from? |
ANSWER | Species name are mostly sourced from the herbarium for each state. For some, it is faster and more efficient to determine by AusPlot staff - this is true for some stuff south or west of NSW where the NSW herbarium are not familiar with the species. Ben 17/05/20: “The answer you already have is mostly right - They are indeed provided by the state and territory herbaria, with details of the id’er and accessions if relevant stored. Any work on species names etc. should be discussed with Andrew T as we are likely to make a few changes in this part of our Database to make it easier to search and rapidly exclude incomplete ids are the species or genus level. At this stage it is important that the herbarium species id is the species name used - The field name (often recorded in the form of a scientific name, but not always) is just a field name by staff while collecting in the field - It is superseded (but not replaced) by an official herbarium determination.“ |
ACTION / FIX | Nothing to fix |
QUESTION 15 | In the table point_intercept, what is the in_canopy_sky field? |
ANSWER | From within a cover, can the observer look up and see the sky. Ben 17/05/20: “The answer that you have is essentially correct, but it is a little more nuanced than that. There are two widely used measures of vegetation cover. “Opaque canopy cover” records the cover of vegetation canopy across a site. “Foliage Projected Cover” records the cover of foliage across a site. The difference being the first essentially makes a boundary around the outside of the tree canopy and calls everything inside that boundary as cover. The second FPC only classes areas inside that boundary as cover if the densiometer actually intersects with foliage. If it is within that area, but the densiometer does not intersect vegetation then it is not classed as cover. Our solution to the user being able to use either form of cover as they desire is to create the category “in Canopy Sky”. So we have three categories. No intercept = When the user sees sky through the densiometer and that is outside of any canopy boundary. a “Hit” or “Intercept” is recorded when the densiometer intercept with foliage. the In canopy Sky category is where the densiometer does not intercept with foliage, but is within the boundary of that canopy - Hence “in Canopy Sky”. If users are after Opaque canopy cover they include any In Canopy sky record with actual hits to determine cover. if the user wants FPC then they exclude in canopy sky from their cover calculations!” |
ACTION / FIX | Nothing to fix |
QUESTION 16 | In soil_bulk_density table, field sample_id defines the depth where the sample was taken ([0-0.1],[0.1,0.2],[0.2,0.3]). As the field is a Free Text Field, we can see multiple ways to fill that information (even showing metres and centrimetres, etc.) We can also see 3 site visits whose sample_id are (1,2,3) instead of (0,1,2), the most common pattern. This would crash any attempt to automatise the cleaning and reformatting of the field. If would be very helpful to clean this field in the source and maybe create some controls to validate what is introduced in DB. |
ANSWER | Email Emrys (Fri 5/06/2020): “Yes, the sample Id field for bulk densities is a mess. We will figure out the best way to homogenise this field and get back to you.” |
ACTION / FIX | Until data is fixed in the source, we will clean and format the field in the ETL. PENDING FUTURE REVIEW |
QUESTION 17 | In soil_characterisation table, there are 3 site_visits with incomplete sample depth information: Not sure if you could fix those 4 values. |
ANSWER | PENDING |
ACTION / FIX | PENDING |
QUESTION 18 | We have identified one site_visit where there is no “soil_observation_type” defined (Not Collected in DB) but there are 4 soil characterisation samples related to the site_visit which we don’t know the soil_observation_type. Query: |
ANSWER | PENDING |
ACTION / FIX | PENDING |
QUESTION | In table site_location_point, what is the field |
ANSWER | The accuracy of the location recorded. what is the measurement? - not used - will be replaced in the future. It’s to say to within a metre, how accurate the recording is. New data mostly are entered as latitude and longitude from the DGPS on the tablets. |
ACTION / FIX | Nothing to fix |
QUESTION | What are the pit markers in the site_location_visit table? |
ANSWER | Pit markers mark the soil pit outside the main plot site. Usually within 5 metres of the main plot site. The reason it is outside the main plot is because they don’t want to make a big whole in the main plot site. |
ACTION / FIX | Nothing to fix |
QUESTION | What part of the site is the soil_characterisation table recording? |
ANSWER | This table contains information about the soil pit that is outside the main plot site. It contains soil chemistry data about the main plot site. |
ACTION / FIX | Nothing to fix |
QUESTION | What is the soil_subsite_observations table? |
ANSWER | The sub-sites of the main plot site. The actual samples are collected in this sub-site. In the future, soil chemistry data may be recorded in these tables, but for now, they are recorded in soil_characterisation table. |
ACTION / FIX | Nothing to fix |
QUESTION | What is |
ANSWER | CSV added to the database during the datatransfer runner execution. Not used. |
ACTION / FIX | Nothing to fix |
QUESTION | What is |
ANSWER | CSV added to the database during the datatransfer runner execution. Not used. |
ACTION / FIX | Nothing to fix |
QUESTION | In Soil Characterisation , the values for |
ANSWER |
|
ACTION / FIX |
|
QUESTION | There are Sites (site_location) without latitude/longitude coordinates. In addition, there are Sites with no site_location_point, so there are site without any location information. |
ANSWER | (Emrys 04/02/2020): I get 48 ( not 101) that don’t have any points at all and these are forest plots ( the third letter of the site code will be an F). The data for these is stored elsewhere but we are the custodians of the physical samples- tea bags, soil etc. Other sites sometimes have less than the standard 25 points for a range of different reasons- DGPS malfunction, other groups set up the sites and only provided us with corner markings etc. |
ACTION / FIX | After fixing the sites filtering, no more sites without location have been found. |
QUESTION | The table In |
ANSWER | (Emrys 05/02/2020): “Probably makes sense to change the lookup table to fungi rather than change all the entries for each species?“ |
ACTION / FIX | Review if data have been fixed in the source PENDING FUTURE REVIEW |
Errors
ERROR | Typos in horizon classification column in soilcharacterisation table:
|
|
ANSWER |
|
|
ACTION / FIX |
|
|
TEST |
|
|
ERROR | misspelled transect identifiers |
|
ANSWER |
|
|
ACTION / FIX | Review if data have been fixed in source |
|
TEST |
|
|
ERROR | Missing transect across datatables |
|
ANSWER |
|
|
ACTION / FIX | Review if data have been fixed in source |
|
TEST |
|
|
ERROR | Negative values for basal area and basal area factor in basal area table. |
|
ANSWER |
|
|
ACTION / FIX |
|
|
TEST | select * from basal_area where basal_area_factor < 0 |
|
ERROR | text in slope and aspect columns |
|
ANSWER |
|
|
ACTION / FIX | get rid of strings, get rid of non-numeric charcacters (e.g, °) |
|
TEST | expect_numeric(site_location$slope & site_location$aspect) == TRUE |
|
In table soil_characterisation the column mottles_colour is not linked to the look-up table lut_mottle_colour.
In table soil_bulk_density the field sample_id is inputted as free text. Ask if AusPlots data entry software can be a controlled drop-down list to only enter a consistent set of text. Ideally a new look-up table should be created for this.
The table photograph is not linked in the schema. A: photograph table is a generated table from site_location in postprocessing.
The table species_distinct_per_site is missing in the schema.
In the site_location table, both site_aspect and site_slope need to be fixed. They contain free text, numbers and percentage.