Manipulating AusPlots data I: Subset data frames
The 'get_ausplots' function extracts and compiles AusPlots data allowing substantial flexibility in the selection of the required data. Up to 8 different types of data can be retrieved into distinct data frames (i.e. data on sampling sites, vegetation structure, vegetation point intercept, vegetation vouchers, vegetation basal wedge, soil characterization, soil bulk density, and soil & soil metagenomics samples). In addition, data can be filtered for particular sets of plots and/or genus/species, as well as geographically using a rectangular bounding box.
However, in some situations we are only interested in a subset of the data retrieved by 'get_ausplots'. To subset AusPlots data we use the variables in the retrieved data frames that correspond to the concept by which we would like to filter the data. On some occasions, sub-setting a single data frame (i.e. one type of data) is all we need. The data retrieved by the 'get_ausplots' function can be manipulated like any other R data. However, the 'deep' structure of the data (a list of multiple data frames) and the interrelation of the data frames (via a common link variable) can make manipulating the data a bit more daunting (see below).
Variables in the 'site.info' data frame contain information that affects all other data frames; so typically, after sub-setting 'site.info' on the variable(s) of interest, we will also subset the remaining data frames using one of the variables common to all data frames. Common variables include 'site_location_name', 'site_location_visit_id', and 'site_unique'. Usually 'site_unique' is the best option to 'connect' AusPlots data frames, as it is the most specific variable, representing a single visit to a particular site.
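The two-step logic described above can be sketched on a toy list of data frames. This is only an illustration: 'toy.AP' and all of its values are invented stand-ins for the list returned by 'get_ausplots', and only the column names 'site_unique', 'state' and 'growth_form' are taken from the real data.

```r
# Toy stand-in for the list returned by get_ausplots(); all values are invented
toy.AP <- list(
  site.info = data.frame(
    site_unique = c("A-1", "A-2", "B-1"),
    state       = c("QLD", "QLD", "SA"),
    stringsAsFactors = FALSE
  ),
  veg.PI = data.frame(
    site_unique = c("A-1", "A-1", "A-2", "B-1", "B-1"),
    growth_form = c("Shrub", "Forb", "Shrub", "Tree/Palm", "Forb"),
    stringsAsFactors = FALSE
  )
)

# Step 1: filter 'site.info' on the variable of interest
toy.sub <- toy.AP
toy.sub$site.info <- toy.sub$site.info[toy.sub$site.info$state == "QLD", ]

# Step 2: keep only the 'veg.PI' rows whose 'site_unique' survived step 1
keep <- toy.sub$veg.PI$site_unique %in% toy.sub$site.info$site_unique
toy.sub$veg.PI <- toy.sub$veg.PI[keep, ]

nrow(toy.sub$site.info)
nrow(toy.sub$veg.PI)
```

Working on a copy ('toy.sub') leaves the full list untouched, which mirrors the approach used in the examples of this guide.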
To subset a data frame we filter its data by querying the variable(s) of interest using operators. The variables of interest are typically factors, numerical, or boolean variables. Many variables retrieved by 'get_ausplots' have a 'character' class, despite conceptually falling in one of these 3 categories. Therefore, before using a variable to filter a data frame we must inspect its contents and class, and if required change its class to an adequate one. We use relational operators to filter individual variables, and logical (and occasionally arithmetic) operators to combine more than one variable in our filtering operations (see R Operators).
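As a minimal sketch of this inspect-convert-filter sequence, consider the toy data frame below. The data frame 'toy.site' and its values are invented; the column names and the '.n' / '.f' suffixes follow the conventions used later in this guide.

```r
# Invented data; both variables arrive as 'character'
toy.site <- data.frame(
  site_slope = c("0", "5", "25", NA),
  state      = c("QLD", "SA", "QLD", "NT"),
  stringsAsFactors = FALSE
)

# Inspect the class before filtering
class(toy.site$site_slope)

# Change to an adequate class, keeping the original variable
toy.site$site_slope.n <- as.numeric(toy.site$site_slope)
toy.site$state.f      <- factor(toy.site$state)

# Relational operators (>=, ==) query single variables;
# the logical operator & combines the two conditions
steep.qld <- toy.site$site_slope.n >= 20 & toy.site$state.f == "QLD"
toy.site[which(steep.qld), ]
```

Wrapping the condition in 'which()' is one way to make sure that any NA produced by a comparison is not carried into the subset.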
EXAMPLES
Multiple examples covering various types of sub-setting are presented below. The examples cover sub-setting a single data frame and all data frames, both with and without a change of variable class. All examples start by loading the 'ausplotsR' library and extracting AusPlots data using the 'get_ausplots' function. In the examples we use the 'AP.data' list of data frames, which contains information for all the currently available AusPlots sites. This list was previously created in the 'Obtaining AusPlots data: 'get_ausplots' function' Step-by-Step Guide (we use the list created in Example 4).
Boxes with grey background contain code snippets, and boxes with white background contain code (text) outputs.
I. SUB-SETTING A SINGLE DATA FRAME
We might be, for example, interested in point intercept data only for vegetation of a particular height, a particular growth form, growing on a particular substrate type, or found in a particular set of transects. In these examples, we use the variables in the 'veg.PI' data frame to filter the retrieved AusPlots data in this data frame. We do not need to subset any other data frames.
Examples
Example 1: Height
Height is 'numeric', so there is no need to change its class.
.
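A minimal sketch of filtering point intercept records by height, using invented data. The column name 'height', the toy values, and the 2 m threshold are assumptions made for illustration only; check the variable name and units in your own 'veg.PI' data frame before adapting it.

```r
# Invented point intercept records; 'height' is already numeric
toy.PI <- data.frame(
  site_unique = c("A-1", "A-1", "A-2", "B-1", "B-1"),
  height      = c(0.3, 2.5, NA, 12.0, 6.1),
  stringsAsFactors = FALSE
)
class(toy.PI$height)

# Keep records taller than a hypothetical 2 m threshold;
# which() returns only positions where the test is TRUE, so NA heights drop out
tall.PI <- toy.PI[which(toy.PI$height > 2), ]
dim(tall.PI)
summary(tall.PI$height)
```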
.
Example 2: Transect
Transect is a 'factor', so there is no need to change its class.
.
# Transect
# ========
# Explore Variable Type (already a Factor)
# ----------------------------------------
class(AP.data$veg.PI$transect)
.
## [1] "factor"
.
.
summary(AP.data$veg.PI$transect)
.
## E1-W1 E2-W2 E3-W3 E4-W4 E5-W5 N1-S1 N2-S2 N3-S3 N4-S4 N5-S5 S1-N1 S2-N2
## 17360 53886 18164 54046 18900 24573 44070 26962 44845 27579 49639 29566
## S3-N3 S4-N4 S5-N5 W1-E1 W2-E2 W2-S2 W3-E3 W4-E4 W5-E5
## 46476 28582 46296 55976 19447 102 55135 18983 53877
.
.
# Subset a specific Transect (E1-W1) in 'veg.PI' data frame
# ---------------------------------------------------------
E1W1Tr.AP.data = AP.data
dim(E1W1Tr.AP.data$veg.PI)
.
## [1] 734464 13
.
.
#summary(AP.data$veg.PI$transect)
E1W1Tr.AP.data$veg.PI = E1W1Tr.AP.data$veg.PI[E1W1Tr.AP.data$veg.PI$transect == "E1-W1",]
levels(E1W1Tr.AP.data$veg.PI$transect)
.
## [1] "E1-W1" "E2-W2" "E3-W3" "E4-W4" "E5-W5" "N1-S1" "N2-S2" "N3-S3"
## [9] "N4-S4" "N5-S5" "S1-N1" "S2-N2" "S3-N3" "S4-N4" "S5-N5" "W1-E1"
## [17] "W2-E2" "W2-S2" "W3-E3" "W4-E4" "W5-E5"
.
.
E1W1Tr.AP.data$veg.PI$transect = droplevels(E1W1Tr.AP.data$veg.PI$transect)
levels(E1W1Tr.AP.data$veg.PI$transect)
.
## [1] "E1-W1"
.
.
dim(E1W1Tr.AP.data$veg.PI)
.
## [1] 17360 13
.
.
summary(E1W1Tr.AP.data$veg.PI$transect)
.
## E1-W1
## 17360
.
.
.
Example 3: Growth Form
Growth Form is a 'character' variable, so we need to change its class to 'factor'.
.
# Growth Form
# ===========
# Explore Variable Type and Change to Factor in 'veg.PI' data frame
# -----------------------------------------------------------------
class(AP.data$veg.PI$growth_form)
.
## [1] "character"
.
.
summary(AP.data$veg.PI$growth_form)
.
## Length Class Mode
## 734464 character character
.
.
AP.data$veg.PI$growth_form.f = factor(AP.data$veg.PI$growth_form)
summary(AP.data$veg.PI$growth_form.f)
.
## Aquatic Bryophyte Chenopod Cycad Epiphyte
## 3 784 19425 21 275
## Fern Forb Fungus Grass-tree Heath-shrub
## 1418 37544 25 2311 5463
## Hummock grass NC Rush Sedge Shrub
## 26311 1095 725 11663 83251
## Shrub Mallee Tree Mallee Tree/Palm Tussock grass Vine
## 1897 13811 84733 83361 2023
## NA's
## 358325
.
.
# Subset to the Tree/Palm Growth Form
# -----------------------------------
TreePalm.AP.data = AP.data
dim(TreePalm.AP.data$veg.PI)
.
## [1] 734464 14
.
.
#summary(AP.data$veg.PI$growth_form.f)
TreePalm.AP.data$veg.PI = TreePalm.AP.data$veg.PI[TreePalm.AP.data$veg.PI$growth_form.f == "Tree/Palm",]
levels(TreePalm.AP.data$veg.PI$growth_form.f)
.
## [1] "Aquatic" "Bryophyte" "Chenopod" "Cycad"
## [5] "Epiphyte" "Fern" "Forb" "Fungus"
## [9] "Grass-tree" "Heath-shrub" "Hummock grass" "NC"
## [13] "Rush" "Sedge" "Shrub" "Shrub Mallee"
## [17] "Tree Mallee" "Tree/Palm" "Tussock grass" "Vine"
.
.
TreePalm.AP.data$veg.PI$growth_form.f = droplevels(TreePalm.AP.data$veg.PI$growth_form.f)
levels(TreePalm.AP.data$veg.PI$growth_form.f)
.
## [1] "Tree/Palm"
.
.
dim(TreePalm.AP.data$veg.PI)
.
## [1] 443058 14
.
.
summary(TreePalm.AP.data$veg.PI$growth_form.f)
.
## Tree/Palm NA's
## 84733 358325
.
.
.
.
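In the output above, the 'Tree/Palm' subset has 443058 rows, of which 358325 are NA's and only 84733 are Tree/Palm records. This happens because '==' returns NA when the growth form is missing, and indexing a data frame with NA yields a row filled with NA's. The sketch below reproduces the effect on invented data and shows one possible way to avoid it, using '%in%', which never returns NA. The data frame 'toy.PI' and its values are hypothetical.

```r
# Invented records with missing growth forms
toy.PI <- data.frame(
  point_number = 1:6,
  growth_form  = c("Tree/Palm", NA, "Shrub", "Tree/Palm", NA, "Forb"),
  stringsAsFactors = FALSE
)
toy.PI$growth_form.f <- factor(toy.PI$growth_form)

# '==' propagates NA, so the two NA records come back as all-NA rows
with.na <- toy.PI[toy.PI$growth_form.f == "Tree/Palm", ]
nrow(with.na)

# '%in%' returns FALSE for NA, so only genuine Tree/Palm records are kept
no.na <- toy.PI[toy.PI$growth_form.f %in% "Tree/Palm", ]
no.na$growth_form.f <- droplevels(no.na$growth_form.f)
nrow(no.na)
levels(no.na$growth_form.f)
```

Whether the missing records should be dropped or kept depends on the analysis; the point of the sketch is only that the choice should be made explicitly.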
II. SUB-SETTING ALL DATA FRAMES
On some occasions, we are interested in sites located in particular states or bioregions. Alternatively, we might be only interested in data obtained at sites on steep slopes and/or with a south-facing slope (i.e. aspect). In these examples, we use the variables in the 'site.info' data frame to filter the sites of interest. In this case, we also need to subset the data in the remaining data frames, as we are only interested in data collected at sites with particular characteristics. Therefore, we then filter the other data frames by site, keeping only the sites retained by our first sub-setting operation on the 'site.info' data frame. To do so we use one of the variables containing a site identifier that are present in all data frames (i.e. 'site_location_name', 'site_location_visit_id', or 'site_unique'; see above).
.
Examples
.
Example 1: Site Slope
Site Slope is a 'character' variable, so we need to change its class to 'numeric'.
.
# Site Slope
# ==========
# Explore Variable Type and Change to Numeric
# -------------------------------------------
class(AP.data$site.info$site_slope)
.
## [1] "character"
.
.
summary(AP.data$site.info$site_slope)
.
## Length Class Mode
## 662 character character
.
.
AP.data$site.info$site_slope.n = as.numeric(AP.data$site.info$site_slope)
summary(AP.data$site.info$site_slope.n)
.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 1.000 2.564 2.000 50.000 111
.
.
# Subset to Plots with Steep Slopes (>= 20 degrees) in 'site.info' data frame
# ---------------------------------------------------------------------------
slope.AP.data = AP.data
dim(slope.AP.data$site.info)
.
## [1] 662 44
.
.
#summary(AP.data$site.info$site_slope.n)
slope.AP.data$site.info = slope.AP.data$site.info[slope.AP.data$site.info$site_slope.n >= 20,]
dim(slope.AP.data$site.info)
.
## [1] 123 44
.
.
summary(slope.AP.data$site.info$site_slope.n)
.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 20.00 20.75 25.00 28.92 35.00 50.00 111
.
.
# Subset to Plots with Steep Slopes in other Data Frames
# ------------------------------------------------------
# To do so we use the common variable 'site_unique'
# Subset in 'veg.PI' Data Frame
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
dim(slope.AP.data$veg.PI)
.
## [1] 734464 14
.
.
slope.AP.data$veg.PI = slope.AP.data$veg.PI[slope.AP.data$veg.PI$site_unique %in% slope.AP.data$site.info$site_unique, ]
dim(slope.AP.data$veg.PI)
.
## [1] 15146 14
.
.
# Subset in 'veg.basal' Data Frame
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
dim(slope.AP.data$veg.basal)
.
## [1] 8291 10
.
.
slope.AP.data$veg.basal = slope.AP.data$veg.basal[slope.AP.data$veg.basal$site_unique %in% slope.AP.data$site.info$site_unique, ]
dim(slope.AP.data$veg.basal)
.
## [1] 215 10
.
.
.
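The example above subsets 'veg.PI' and 'veg.basal' one at a time. When many data frames are involved, the same step could be wrapped in a small helper and applied to the whole list. The function 'subset_by_site' below is hypothetical (it is not part of 'ausplotsR'), and the toy list, its values, and the 'basal_area' column are invented for illustration.

```r
# Hypothetical helper (not part of ausplotsR): in every data frame of the list
# that has a 'site_unique' column, keep only the rows for the given site visits
subset_by_site <- function(ap.list, keep.sites) {
  lapply(ap.list, function(x) {
    if (is.data.frame(x) && "site_unique" %in% names(x)) {
      x[x$site_unique %in% keep.sites, , drop = FALSE]
    } else {
      x  # leave any other element untouched
    }
  })
}

# Invented stand-in for the list returned by get_ausplots()
toy.AP <- list(
  site.info = data.frame(site_unique  = c("A-1", "B-1", "C-1"),
                         site_slope.n = c(2, 25, NA),
                         stringsAsFactors = FALSE),
  veg.PI    = data.frame(site_unique  = c("A-1", "B-1", "B-1", "C-1"),
                         point_number = 1:4,
                         stringsAsFactors = FALSE),
  veg.basal = data.frame(site_unique  = c("A-1", "B-1"),
                         basal_area   = c(1.2, 3.4),
                         stringsAsFactors = FALSE),
  note      = "elements that are not data frames are passed through"
)

# Identify the steep sites first, then filter every data frame in one call
steep       <- which(toy.AP$site.info$site_slope.n >= 20)
steep.sites <- toy.AP$site.info$site_unique[steep]
toy.steep   <- subset_by_site(toy.AP, steep.sites)

sapply(toy.steep[c("site.info", "veg.PI", "veg.basal")], nrow)
```

Because 'lapply' preserves the names of the list, the result can be used in the same way as the original list.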
Example 2: Site Aspect
Site Aspect is a 'character' variable, so we need to change its class to 'numeric'.
.
# Site Aspect
# ===========
# Explore Variable Type and Change to Numeric
# -------------------------------------------
class(AP.data$site.info$site_aspect)
.
## [1] "character"
.
.
summary(AP.data$site.info$site_aspect)
.
## Length Class Mode
## 662 character character
.
.
AP.data$site.info$site_aspect.n = as.numeric(AP.data$site.info$site_aspect)
summary(AP.data$site.info$site_aspect.n)
.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 90.0 180.0 183.7 270.0 360.0 399
.
.
# Subset to Plots with a South (SE to SW; i.e. 135 to 225) Aspect in 'site.info' data frame
# -----------------------------------------------------------------------------------------
aspect.AP.data = AP.data
dim(aspect.AP.data$site.info)
.
## [1] 662 45
.
.
#summary(AP.data$site.info$site_aspect.n)
aspect.AP.data$site.info = aspect.AP.data$site.info[(aspect.AP.data$site.info$site_aspect.n > 135 &
aspect.AP.data$site.info$site_aspect.n <= 225),]
dim(aspect.AP.data$site.info)
.
## [1] 488 45
.
.
summary(aspect.AP.data$site.info$site_aspect.n)
.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 150.0 180.0 180.0 194.8 225.0 225.0 399
.
.
# Subset to Plots with a South (SE to SW) Aspect in other Data Frames
# -------------------------------------------------------------------
# To do so we use the common variable 'site_unique'
# Subset in 'veg.PI' Data Frame
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
dim(aspect.AP.data$veg.PI)
.
## [1] 734464 14
.
.
aspect.AP.data$veg.PI = aspect.AP.data$veg.PI[aspect.AP.data$veg.PI$site_unique %in% aspect.AP.data$site.info$site_unique, ]
dim(aspect.AP.data$veg.PI)
.
## [1] 100649 14
.
.
# Subset in 'veg.basal' Data Frame
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
dim(aspect.AP.data$veg.basal)
.
## [1] 8291 10
.
.
aspect.AP.data$veg.basal = aspect.AP.data$veg.basal[aspect.AP.data$veg.basal$site_unique %in% aspect.AP.data$site.info$site_unique, ]
dim(aspect.AP.data$veg.basal)
.
## [1] 8291 10
.
.
.
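Slope and aspect conditions can also be combined in a single filter with logical operators, as in the "steep slopes and/or south-facing" case mentioned at the start of this section. The sketch below uses invented data; the thresholds follow the examples above, with both aspect bounds taken as inclusive here, which is an assumption.

```r
# Invented site data, already converted to numeric
toy.site <- data.frame(
  site_unique   = c("A-1", "B-1", "C-1", "D-1", "E-1", "F-1"),
  site_slope.n  = c(25, 30, 5, NA, 22, 3),
  site_aspect.n = c(180, 90, 200, 170, NA, 20),
  stringsAsFactors = FALSE
)

# One logical vector per condition
steep <- toy.site$site_slope.n >= 20
south <- toy.site$site_aspect.n >= 135 & toy.site$site_aspect.n <= 225

# '&' keeps sites meeting both conditions, '|' sites meeting at least one;
# which() discards positions where the combined test is NA
steep.and.south <- toy.site[which(steep & south), ]
steep.or.south  <- toy.site[which(steep | south), ]

steep.and.south$site_unique
steep.or.south$site_unique
```

The resulting 'site_unique' values can then be used with '%in%' to subset the remaining data frames, exactly as in the examples above.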
Example 3: State
State is a 'character' variable, so we need to change its class to 'factor'.
.
# State
# =====
# Explore Variable Type and Change to Factor
# ------------------------------------------
class(AP.data$site.info$state)
.
## [1] "character"
.
.
summary(AP.data$site.info$state)
.
## Length Class Mode
## 662 character character
.
.
AP.data$site.info$state.f = factor(AP.data$site.info$state)
summary(AP.data$site.info$state.f)
.
## NSW NT QLD SA VIC WA
## 87 138 127 171 18 121
.
.
# Subset to Plots in the State of Queensland in 'site.info' data frame
# --------------------------------------------------------------------
# Subset to "QLD"
QLD.AP.data = AP.data
dim(QLD.AP.data$site.info)
.
## [1] 662 46
.
.
#summary(AP.data$site.info$state.f)
QLD.AP.data$site.info = QLD.AP.data$site.info[QLD.AP.data$site.info$state.f == "QLD",]
levels(QLD.AP.data$site.info$state.f)
.
## [1] "NSW" "NT" "QLD" "SA" "VIC" "WA"
.
.
QLD.AP.data$site.info$state.f = droplevels(QLD.AP.data$site.info$state.f)
levels(QLD.AP.data$site.info$state.f)
.
## [1] "QLD"
.
.
dim(QLD.AP.data$site.info)
.
## [1] 127 46
.
.
summary(QLD.AP.data$site.info$state.f)
.
## QLD
## 127
.
.
# Subset to Plots in the State of Queensland in other Data Frames
# ---------------------------------------------------------------
# To do so we use the common variable 'site_unique'
# Subset in 'veg.PI' Data Frame
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
dim(QLD.AP.data$veg.PI)
.
## [1] 734464 14
.
.
QLD.AP.data$veg.PI = QLD.AP.data$veg.PI[QLD.AP.data$veg.PI$site_unique %in% QLD.AP.data$site.info$site_unique, ]
dim(QLD.AP.data$veg.PI)
.
## [1] 138489 14
.
.
# Subset in 'veg.basal' Data Frame
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
dim(QLD.AP.data$veg.basal)
.
## [1] 8291 10
.
.
QLD.AP.data$veg.basal = QLD.AP.data$veg.basal[QLD.AP.data$veg.basal$site_unique %in% QLD.AP.data$site.info$site_unique, ]
dim(QLD.AP.data$veg.basal)
.
## [1] 1846 10
.
.
.
Example 4: Bioregion name
Bioregion name is a 'character' variable, so we need to change its class to 'factor'.
.
# Bioregion name
# ==============
# Explore Variable Type and Change to Factor
# ------------------------------------------
class(AP.data$site.info$bioregion_name)
.
## [1] "character"
.
.
summary(AP.data$site.info$bioregion_name)
.
## Length Class Mode
## 662 character character
.
.
AP.data$site.info$bioregion_name.f = factor(AP.data$site.info$bioregion_name)
summary(AP.data$site.info$bioregion_name.f)
.
## ARP AUA AVW BBS BHC BRT CEK CHC COO COP CYP DAB DAC DAL DMR EIU ESP EYB
## 3 15 4 2 34 6 3 13 32 2 19 1 3 1 3 7 1 3
## FIN FLB GAS GAW GES GFU GSD GUP GVD HAM JAF KAN LSD MAC MAL MDD MGD MII
## 18 50 2 3 3 41 1 33 5 6 3 11 3 28 3 52 34 2
## MUL MUR NAN NSS NUL PCK PIL RIV SSD STP STU SWA SYB VIB
## 7 6 2 3 13 3 35 32 48 40 6 4 9 4
.
.
# Subset to Plots in Bioregions in the Eastern (~ Qld) Gulf of Carpentaria in 'site.info' data frame
# --------------------------------------------------------------------------------------------------
# Subset to "CYP" (Cape York Peninsula) and "GUP" (Gulf Plains)
EGCBioregs.AP.data = AP.data
dim(EGCBioregs.AP.data$site.info)
.
## [1] 662 47
.
.
#summary(AP.data$site.info$bioregion_name.f)
EastCarpGulf.Bioreg = c("CYP", "GUP")
EGCBioregs.AP.data$site.info = EGCBioregs.AP.data$site.info[EGCBioregs.AP.data$site.info$bioregion_name.f %in% EastCarpGulf.Bioreg,]
levels(EGCBioregs.AP.data$site.info$bioregion_name.f)
.
## [1] "ARP" "AUA" "AVW" "BBS" "BHC" "BRT" "CEK" "CHC" "COO" "COP" "CYP"
## [12] "DAB" "DAC" "DAL" "DMR" "EIU" "ESP" "EYB" "FIN" "FLB" "GAS" "GAW"
## [23] "GES" "GFU" "GSD" "GUP" "GVD" "HAM" "JAF" "KAN" "LSD" "MAC" "MAL"
## [34] "MDD" "MGD" "MII" "MUL" "MUR" "NAN" "NSS" "NUL" "PCK" "PIL" "RIV"
## [45] "SSD" "STP" "STU" "SWA" "SYB" "VIB"
.
.
EGCBioregs.AP.data$site.info$bioregion_name.f = droplevels(EGCBioregs.AP.data$site.info$bioregion_name.f)
levels(EGCBioregs.AP.data$site.info$bioregion_name.f)
.
## [1] "CYP" "GUP"
.
.
dim(EGCBioregs.AP.data$site.info)
.
## [1] 52 47
.
.
summary(EGCBioregs.AP.data$site.info$bioregion_name.f)
.
## CYP GUP
## 19 33
.
.
# Subset to Plots in Bioregions in the Eastern (~ Qld) Gulf of Carpentaria in other Data Frames
# ---------------------------------------------------------------------------------------------
# To do so we use the common variable 'site_unique'
# Subset in 'veg.PI' Data Frame
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
dim(EGCBioregs.AP.data$veg.PI)
.
## [1] 734464 14
.
.
EGCBioregs.AP.data$veg.PI = EGCBioregs.AP.data$veg.PI[EGCBioregs.AP.data$veg.PI$site_unique %in% EGCBioregs.AP.data$site.info$site_unique, ]
dim(EGCBioregs.AP.data$veg.PI)
.
## [1] 61426 14
.
.
# Subset in 'veg.basal' Data Frame
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
dim(EGCBioregs.AP.data$veg.basal)
.
## [1] 8291 10
.
.
EGCBioregs.AP.data$veg.basal = EGCBioregs.AP.data$veg.basal[EGCBioregs.AP.data$veg.basal$site_unique %in% EGCBioregs.AP.data$site.info$site_unique, ]
dim(EGCBioregs.AP.data$veg.basal)
.
## [1] 1222 10
.
.
.