Manipulating AusPlots data I: Subset data frames


The 'get_ausplots' function extracts and compiles AusPlots data allowing substantial flexibility in the selection of the required data. Up to 8 different types of data can be retrieved into distinct data frames (i.e. data on sampling sites, vegetation structure, vegetation point intercept, vegetation vouchers, vegetation basal wedge, soil characterization, soil bulk density, and soil & soil metagenomics samples). In addition, data can be filtered for particular sets of plots and/or genus/species, as well as geographically using a rectangular bounding box.


However, in some situations we are only interested in a subset of the data retrieved by 'get_ausplots'. To subset AusPlots data we use the variables in the retrieved data frames that correspond to the concept by which we would like to filter the data. On some occasions, sub-setting a single data frame (i.e. one type of data) is all we need. The data retrieved by 'get_ausplots' can be manipulated like any other R data. However, the 'deep' structure of the data (a list of multiple data frames) and the interrelation of the data frames (via a common link variable) can make manipulating the data a bit more daunting (see below).
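
The 'list of data frames' structure described above can be pictured with a small stand-in. This is only a sketch: 'toy.data', its site identifiers, and its values are invented for illustration and are not real AusPlots records.

```r
# Toy stand-in for the list returned by 'get_ausplots' (all values are invented)
toy.data <- list(
  site.info = data.frame(
    site_unique = c("SITE0001-1", "SITE0002-1"),
    state = c("QLD", "SA"),
    stringsAsFactors = FALSE
  ),
  veg.PI = data.frame(
    site_unique = c("SITE0001-1", "SITE0001-1", "SITE0002-1"),
    height = c(0.5, 12.0, 3.2),
    stringsAsFactors = FALSE
  )
)

names(toy.data)            # names of the data frames held in the list
class(toy.data$veg.PI)     # each element is an ordinary data frame
dim(toy.data$veg.PI)       # rows and columns of one element
toy.data$site.info$state   # one variable: list -> data frame -> column
```

Each variable is therefore reached in two steps, first the data frame within the list and then the column within the data frame, which is the pattern used throughout the examples (e.g. AP.data$site.info$site_slope).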


Variables in the 'site.info' data frame contain information that affects all other data frames; so typically, after sub-setting the 'site.info' data frame on the variable of interest, we also subset the remaining data frames using one of the variables common to all of them. Common variables among data frames include 'site_location_name', 'site_location_visit_id', and 'site_unique'. Commonly 'site_unique' is the best option to 'connect' AusPlots data frames, as it is the most specific variable, representing a single visit to a particular site.
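
To illustrate why the most specific identifier matters, the sketch below links two toy data frames first by site name and then by 'site_unique'. The identifiers, their format, and the values are invented for illustration; only the three column names come from the text above.

```r
# Toy data: SITE0001 was visited twice, SITE0002 once (all values are invented)
site.info <- data.frame(
  site_location_name     = c("SITE0001", "SITE0001", "SITE0002"),
  site_location_visit_id = c(101, 102, 103),
  site_unique            = c("SITE0001-101", "SITE0001-102", "SITE0002-103"),
  stringsAsFactors = FALSE
)
veg.PI <- data.frame(
  site_location_name = c("SITE0001", "SITE0001", "SITE0001", "SITE0002"),
  site_unique        = c("SITE0001-101", "SITE0001-102", "SITE0001-102", "SITE0002-103"),
  height             = c(1.2, 0.4, 7.5, 3.0),
  stringsAsFactors = FALSE
)

# Keep a single visit in 'site.info'
one.visit <- site.info[site.info$site_unique == "SITE0001-102", ]

# Linking by site name keeps records from every visit to that site
by.name <- veg.PI[veg.PI$site_location_name %in% one.visit$site_location_name, ]

# Linking by 'site_unique' keeps only the records of the selected visit
by.unique <- veg.PI[veg.PI$site_unique %in% one.visit$site_unique, ]

nrow(by.name)
nrow(by.unique)
```

In this toy case the name-based link returns one record more than the visit-based link, because it also picks up the other visit to the same site.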


To subset a data frame we filter its data by querying the variable(s) of interest using operators. The variables of interest are typically factors, numerical, or boolean variables. Many variables retrieved by 'get_ausplots' have a 'character' class, despite conceptually falling into one of these three categories. Therefore, before using a variable to filter a data frame we must inspect its contents and class and, if required, change its class to an adequate one. We use relational operators to filter individual variables, and logical (and occasionally arithmetic) operators to combine more than one variable in our filtering operations (R Operators).
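
The inspect, convert, and filter sequence described above can be sketched on a toy data frame. The data frame 'plots' and its values are invented for illustration; the '.n' and '.f' suffixes simply mirror the naming used in the examples.

```r
# Toy data frame whose variables are all stored as text (values are invented)
plots <- data.frame(
  site_slope = c("0", "5", "25", "12", "40"),
  state      = c("QLD", "SA", "QLD", "NT", "QLD"),
  stringsAsFactors = FALSE
)

# 1. Inspect the class
class(plots$site_slope)

# 2. Convert to an adequate class, keeping the original variable untouched
plots$site_slope.n <- as.numeric(plots$site_slope)
plots$state.f      <- factor(plots$state)

# 3. Relational operators return one TRUE/FALSE per row
steep  <- plots$site_slope.n >= 20
in.qld <- plots$state.f == "QLD"

# 4. Logical operators combine the tests: '&' (both) and '|' (either)
steep.and.qld <- plots[steep & in.qld, ]
steep.or.qld  <- plots[steep | in.qld, ]

nrow(steep.and.qld)
nrow(steep.or.qld)
```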




EXAMPLES

Multiple examples including various types of sub-setting are presented below. The examples cover sub-setting a single data frame and all data frames, with variables that do and do not require a class transformation. All examples start by loading the 'ausplotsR' library and extracting AusPlots data using the 'get_ausplots' function. In the examples we use the 'AP.data' list of data frames, which contains information for all the currently available AusPlots sites. This list was previously created in the 'Obtaining AusPlots data: 'get_ausplots' function' Step-by-Step Guide (we use the list created in Example 4).

Boxes with a grey background contain code snippets, and boxes with a white background contain code (text) outputs.



I. SUB-SETTING A SINGLE DATA FRAME

We might be, for example, interested in point intercept data only for vegetation of a particular height, of a particular growth form, growing on a particular substrate type, or found in a particular set of transects. In these examples, we use the variables in the 'veg.PI' data frame to filter the retrieved AusPlots data in this data frame. We do not need to subset any other data frames.


Examples


Example 1: Height

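Height is 'numeric', so there is no need to change its class. The listing below, however, works on 'site_slope' from the 'site.info' data frame, a 'character' variable that is first converted to 'numeric'; the same filtering pattern applies to a numeric height variable.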

.

# Site Slope
# ==========

# Explore Variable Type and Change to Numeric
# -------------------------------------------
class(AP.data$site.info$site_slope)

.

## [1] "character"

.

.

summary(AP.data$site.info$site_slope)

.

##    Length     Class      Mode 
##       662 character character

.

.

AP.data$site.info$site_slope.n = as.numeric(AP.data$site.info$site_slope)
summary(AP.data$site.info$site_slope.n)

.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   1.000   2.564   2.000  50.000     111

.

.

# Subset to Plots with Steep Slopes (>= 20 degrees) in 'site.info' data frame
# ---------------------------------------------------------------------------
slope.AP.data = AP.data
dim(slope.AP.data$site.info)

.

## [1] 662  44

.

.

#summary(AP.data$site.info$site_slope.n)
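# Note: rows whose site_slope.n is NA are returned as all-NA rows (hence the 111 NA's
# reported below); wrapping the condition in which() would drop them
slope.AP.data$site.info = slope.AP.data$site.info[slope.AP.data$site.info$site_slope.n >= 20,] 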
dim(slope.AP.data$site.info)

.

## [1] 123  44

.

.

summary(slope.AP.data$site.info$site_slope.n)

.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   20.00   20.75   25.00   28.92   35.00   50.00     111

.

.

.
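
A filter on a variable that is already numeric, such as height, needs no conversion step. The sketch below is a minimal illustration on a toy data frame; it assumes the point intercept column is named 'height', and the 5 m threshold and all values are invented.

```r
# Toy point intercept records (values are invented; 'height' is an assumed column name)
veg.PI <- data.frame(
  site_unique = c("A-1", "A-1", "A-1", "B-1", "B-1"),
  height      = c(0.3, 2.5, 14.0, 0.8, 6.1),
  stringsAsFactors = FALSE
)

# The variable is already numeric, so it can be queried directly
class(veg.PI$height)

# Keep only the records taller than an illustrative threshold of 5 m
tall.veg.PI <- veg.PI[veg.PI$height > 5, ]

dim(tall.veg.PI)
summary(tall.veg.PI$height)
```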

Example 2: Transect

Transect is a 'factor', so there is no need to change its class.

.

# Transect
# ========

# Explore Variable Type and Change to Factor
# ------------------------------------------
class(AP.data$veg.PI$transect)

.

## [1] "factor"

.

.

summary(AP.data$veg.PI$transect)

.

## E1-W1 E2-W2 E3-W3 E4-W4 E5-W5 N1-S1 N2-S2 N3-S3 N4-S4 N5-S5 S1-N1 S2-N2 
## 17360 53886 18164 54046 18900 24573 44070 26962 44845 27579 49639 29566 
## S3-N3 S4-N4 S5-N5 W1-E1 W2-E2 W2-S2 W3-E3 W4-E4 W5-E5 
## 46476 28582 46296 55976 19447   102 55135 18983 53877

.

.

# Subset to a specific Transect (E1-W1) in 'veg.PI' data frame
# ------------------------------------------------------------
E1W1Tr.AP.data = AP.data
dim(E1W1Tr.AP.data$veg.PI)

.

## [1] 734464     13

.

.

#summary(AP.data$veg.PI$transect)
E1W1Tr.AP.data$veg.PI = E1W1Tr.AP.data$veg.PI[E1W1Tr.AP.data$veg.PI$transect == "E1-W1",] 
levels(E1W1Tr.AP.data$veg.PI$transect)

.

##  [1] "E1-W1" "E2-W2" "E3-W3" "E4-W4" "E5-W5" "N1-S1" "N2-S2" "N3-S3"
##  [9] "N4-S4" "N5-S5" "S1-N1" "S2-N2" "S3-N3" "S4-N4" "S5-N5" "W1-E1"
## [17] "W2-E2" "W2-S2" "W3-E3" "W4-E4" "W5-E5"

.

.

E1W1Tr.AP.data$veg.PI$transect = droplevels(E1W1Tr.AP.data$veg.PI$transect)
levels(E1W1Tr.AP.data$veg.PI$transect)

.

## [1] "E1-W1"

.

.


dim(E1W1Tr.AP.data$veg.PI)

.

## [1] 17360    13

.

.

summary(E1W1Tr.AP.data$veg.PI$transect)

.

## E1-W1 
## 17360

.

.

.
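
When more than one transect is of interest, the same approach works with '%in%' instead of '=='. The sketch below uses a toy data frame with invented values; it also shows that 'droplevels' can be applied to the whole data frame rather than to a single variable.

```r
# Toy point intercept records with a factor 'transect' (values are invented)
veg.PI <- data.frame(
  transect = factor(c("E1-W1", "W1-E1", "N1-S1", "E1-W1", "S1-N1")),
  height   = c(1.0, 0.2, 3.5, 8.0, 0.6)
)

# Keep the records belonging to any of several transects
keep.transects <- c("E1-W1", "W1-E1")
sel.veg.PI <- veg.PI[veg.PI$transect %in% keep.transects, ]

# The unused levels are still attached to the factor after sub-setting
nlevels(sel.veg.PI$transect)

# Drop the unused levels of every factor in the data frame in one call
sel.veg.PI <- droplevels(sel.veg.PI)
levels(sel.veg.PI$transect)
```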

Example 3: Growth Form

Growth Form is a 'character' variable, so we need to change its class to 'factor'.

.

# Growth Form
# ===========

# Explore Variable Type and Change to Factor in 'veg.PI' data frame
# -----------------------------------------------------------------
class(AP.data$veg.PI$growth_form)

.

## [1] "character"

.

.

summary(AP.data$veg.PI$growth_form)

.

##    Length     Class      Mode 
##    734464 character character

.

.

AP.data$veg.PI$growth_form.f = factor(AP.data$veg.PI$growth_form)
summary(AP.data$veg.PI$growth_form.f)

.

##       Aquatic     Bryophyte      Chenopod         Cycad      Epiphyte 
##             3           784         19425            21           275 
##          Fern          Forb        Fungus    Grass-tree   Heath-shrub 
##          1418         37544            25          2311          5463 
## Hummock grass            NC          Rush         Sedge         Shrub 
##         26311          1095           725         11663         83251 
##  Shrub Mallee   Tree Mallee     Tree/Palm Tussock grass          Vine 
##          1897         13811         84733         83361          2023 
##          NA's 
##        358325

.

.

# Subset to the Tree/Palm Growth Form
# -----------------------------------
TreePalm.AP.data = AP.data
dim(TreePalm.AP.data$veg.PI)

.

## [1] 734464     14

.

.

#summary(AP.data$veg.PI$growth_form.f)
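# Note: rows whose growth_form.f is NA are returned as all-NA rows (hence the 358325 NA's
# reported below); wrapping the condition in which() would drop them
TreePalm.AP.data$veg.PI = TreePalm.AP.data$veg.PI[TreePalm.AP.data$veg.PI$growth_form.f == "Tree/Palm",] 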
levels(TreePalm.AP.data$veg.PI$growth_form.f)

.

##  [1] "Aquatic"       "Bryophyte"     "Chenopod"      "Cycad"        
##  [5] "Epiphyte"      "Fern"          "Forb"          "Fungus"       
##  [9] "Grass-tree"    "Heath-shrub"   "Hummock grass" "NC"           
## [13] "Rush"          "Sedge"         "Shrub"         "Shrub Mallee" 
## [17] "Tree Mallee"   "Tree/Palm"     "Tussock grass" "Vine"

.

.

TreePalm.AP.data$veg.PI$growth_form.f = droplevels(TreePalm.AP.data$veg.PI$growth_form.f)
levels(TreePalm.AP.data$veg.PI$growth_form.f)

.

## [1] "Tree/Palm"

.

.

dim(TreePalm.AP.data$veg.PI)

.

## [1] 443058     14

.

.

summary(TreePalm.AP.data$veg.PI$growth_form.f)

.

## Tree/Palm      NA's 
##     84733    358325

.

.

.

.
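
In the listing above the final summary reports 358325 NA's next to the 84733 Tree/Palm records. This is standard R behaviour: a comparison with a missing value yields NA, and an NA in a logical row index returns a row filled with NA. The sketch below reproduces the effect on a toy data frame (invented values) and shows one way to avoid it, using '%in%', which treats missing values as non-matches.

```r
# Toy records in which some growth forms are missing (values are invented)
veg.PI <- data.frame(
  growth_form = c("Tree/Palm", NA, "Shrub", "Tree/Palm", NA),
  hits        = 1:5,
  stringsAsFactors = FALSE
)

# '==' returns NA for missing values, and each NA index yields an all-NA row
with.equals <- veg.PI[veg.PI$growth_form == "Tree/Palm", ]

# '%in%' returns FALSE for missing values, so only true matches are kept
with.in <- veg.PI[veg.PI$growth_form %in% "Tree/Palm", ]

nrow(with.equals)            # matches plus one all-NA row per missing value
sum(is.na(with.equals$hits)) # number of all-NA rows
nrow(with.in)                # matches only
```

The same effect occurs with numeric comparisons such as '>=' on a variable containing NA; there, wrapping the condition in 'which()' or adding '!is.na(...)' to it keeps only the rows that truly satisfy the test.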

II. SUB-SETTING ALL DATA FRAMES

On some occasions, we are interested in sites located in particular states or bioregions. Alternatively, we might be interested only in data obtained at sites on steep slopes and/or with a slope facing (i.e. aspect) south. In these examples, we can use the variables in the 'site.info' data frame to filter the sites of interest. In this case, we also need to subset the data in the remaining data frames, as we are only interested in data collected at sites with particular characteristics. Therefore, we then filter the other data frames by site, keeping the sites retained by our first sub-setting operation on the 'site.info' data frame. To do so we use one of the variables containing a site identifier that are present in all data frames (i.e. 'site_location_name', 'site_location_visit_id', or 'site_unique'; see above).
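
The two-step procedure described above (filter 'site.info', then filter every other data frame by site) can be wrapped in a small helper. This is only a sketch: 'subset_by_site' is a hypothetical function that is not part of 'ausplotsR', and the toy list and its values are invented.

```r
# Hypothetical helper (not part of 'ausplotsR'): keeps, in every data frame of the
# list that has a 'site_unique' column, only the rows of the requested sites.
# Elements without that column are returned unchanged.
subset_by_site <- function(ap.list, keep.sites) {
  lapply(ap.list, function(x) {
    if (is.data.frame(x) && "site_unique" %in% names(x)) {
      x[x$site_unique %in% keep.sites, , drop = FALSE]
    } else {
      x
    }
  })
}

# Toy list of data frames (values are invented)
toy.data <- list(
  site.info = data.frame(site_unique = c("A-1", "B-1", "C-1"),
                         state = c("QLD", "SA", "QLD"),
                         stringsAsFactors = FALSE),
  veg.PI    = data.frame(site_unique = c("A-1", "A-1", "B-1", "C-1"),
                         height = c(1.0, 4.0, 0.5, 9.0),
                         stringsAsFactors = FALSE),
  veg.basal = data.frame(site_unique = c("A-1", "B-1"),
                         basal_area = c(3.1, 0.7),
                         stringsAsFactors = FALSE)
)

# Step 1: choose the sites of interest from 'site.info'
qld.sites <- toy.data$site.info$site_unique[toy.data$site.info$state %in% "QLD"]

# Step 2: apply the same site selection to every data frame in the list
QLD.toy.data <- subset_by_site(toy.data, qld.sites)

sapply(QLD.toy.data, nrow)
```

The examples below perform step 2 one data frame at a time, which keeps each operation visible; a helper of this kind only saves repetition when many data frames are involved.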

.

Examples

.

Example 1: Site Slope

Site Slope is a 'character' variable, so we need to change its class to 'numeric'.

.

# Site Slope
# ==========

# Explore Variable Type and Change to Numeric
# -------------------------------------------
class(AP.data$site.info$site_slope)

.

## [1] "character"

.

.

summary(AP.data$site.info$site_slope)

.

##    Length     Class      Mode 
##       662 character character

.

.

AP.data$site.info$site_slope.n = as.numeric(AP.data$site.info$site_slope)
summary(AP.data$site.info$site_slope.n)

.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   1.000   2.564   2.000  50.000     111

.

.

# Subset to Plots with Steep Slopes (>= 20 degrees) in 'site.info' data frame
# ---------------------------------------------------------------------------
slope.AP.data = AP.data
dim(slope.AP.data$site.info)

.

## [1] 662  44

.

.

#summary(AP.data$site.info$site_slope.n)
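# Note: rows whose site_slope.n is NA are returned as all-NA rows (hence the 111 NA's
# reported below); wrapping the condition in which() would drop them
slope.AP.data$site.info = slope.AP.data$site.info[slope.AP.data$site.info$site_slope.n >= 20,] 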
dim(slope.AP.data$site.info)

.

## [1] 123  44

.

.

summary(slope.AP.data$site.info$site_slope.n)

.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   20.00   20.75   25.00   28.92   35.00   50.00     111

.

.

# Subset to Plots with Steep Slopes in other Data Frames
# ------------------------------------------------------
# To do so we use the common variable 'site_unique'

# Subset in 'veg.PI' Data Frame
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
dim(slope.AP.data$veg.PI)

.

## [1] 734464     14

.

.

slope.AP.data$veg.PI = slope.AP.data$veg.PI[slope.AP.data$veg.PI$site_unique %in% slope.AP.data$site.info$site_unique, ]
dim(slope.AP.data$veg.PI)

.

## [1] 15146    14

.

.

# Subset in 'veg.basal' Data Frame
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
dim(slope.AP.data$veg.basal)

.

## [1] 8291   10

.

.

slope.AP.data$veg.basal = slope.AP.data$veg.basal[slope.AP.data$veg.basal$site_unique %in% slope.AP.data$site.info$site_unique, ]
dim(slope.AP.data$veg.basal)

.

## [1] 215  10

.

.

.
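
After sub-setting several data frames by site it can be worth checking that they remain consistent with 'site.info'. The sketch below shows two such checks on toy data frames (invented values): that no record from an unselected site survives, and which selected sites have no records in a given data frame.

```r
# Toy data frames (values are invented); site C-1 has no basal wedge records
site.info <- data.frame(site_unique = c("A-1", "B-1", "C-1"),
                        site_slope.n = c(25, 3, 40),
                        stringsAsFactors = FALSE)
veg.PI    <- data.frame(site_unique = c("A-1", "A-1", "B-1", "C-1"),
                        height = c(1.0, 4.0, 0.5, 9.0),
                        stringsAsFactors = FALSE)
veg.basal <- data.frame(site_unique = c("A-1", "B-1"),
                        basal_area = c(3.1, 0.7),
                        stringsAsFactors = FALSE)

# Sites on steep slopes, then the linked subsets
steep.sites     <- site.info$site_unique[site.info$site_slope.n >= 20]
steep.veg.PI    <- veg.PI[veg.PI$site_unique %in% steep.sites, ]
steep.veg.basal <- veg.basal[veg.basal$site_unique %in% steep.sites, ]

# Check 1: every remaining record belongs to a selected site
all(steep.veg.PI$site_unique %in% steep.sites)

# Check 2: selected sites with no records in a data frame
setdiff(steep.sites, steep.veg.basal$site_unique)

# Number of distinct sites left in each data frame
length(unique(steep.veg.PI$site_unique))
length(unique(steep.veg.basal$site_unique))
```

Not every data type is necessarily recorded at every site, so a data frame can contain fewer sites than 'site.info' after sub-setting without anything having gone wrong.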

Example 2: Site Aspect

Site Aspect is a 'character' variable, so we need to change its class to 'numeric'.

.

# Site Aspect
# ===========

# Explore Variable Type and Change to Numeric
# -------------------------------------------
class(AP.data$site.info$site_aspect)

.

## [1] "character"

.

.

summary(AP.data$site.info$site_aspect)

.

##    Length     Class      Mode 
##       662 character character

.

.

AP.data$site.info$site_aspect.n = as.numeric(AP.data$site.info$site_aspect)
summary(AP.data$site.info$site_aspect.n)

.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0    90.0   180.0   183.7   270.0   360.0     399

.

.

# Subset to Plots with a South (SE to SW; i.e. 135 to 225) Aspect in 'site.info' data frame
# -----------------------------------------------------------------------------------------
aspect.AP.data = AP.data
dim(aspect.AP.data$site.info)

.

## [1] 662  45

.

.

#summary(AP.data$site.info$site_aspect.n)
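# Note: rows whose site_aspect.n is NA are returned as all-NA rows (hence the 399 NA's
# reported below); wrapping the condition in which() would drop them
aspect.AP.data$site.info = aspect.AP.data$site.info[(aspect.AP.data$site.info$site_aspect.n > 135 &
aspect.AP.data$site.info$site_aspect.n <= 225),] 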
dim(aspect.AP.data$site.info)

.

## [1] 488  45

.

.

summary(aspect.AP.data$site.info$site_aspect.n)

.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   150.0   180.0   180.0   194.8   225.0   225.0     399

.

.

# Subset to Plots with a South (SE to SW) Aspect in other Data Frames
# -------------------------------------------------------------------
# To do so we use the common variable 'site_unique'

# Subset in 'veg.PI' Data Frame
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
dim(aspect.AP.data$veg.PI)

.

## [1] 734464     14

.

.

aspect.AP.data$veg.PI = aspect.AP.data$veg.PI[aspect.AP.data$veg.PI$site_unique %in% aspect.AP.data$site.info$site_unique, ]
dim(aspect.AP.data$veg.PI)

.

## [1] 100649     14

.

.

# Subset in 'veg.basal' Data Frame
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
dim(aspect.AP.data$veg.basal)

.

## [1] 8291   10

.

.

aspect.AP.data$veg.basal = aspect.AP.data$veg.basal[aspect.AP.data$veg.basal$site_unique %in% aspect.AP.data$site.info$site_unique, ]
dim(aspect.AP.data$veg.basal)

.

## [1] 8291   10

.

.

.
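
The south-facing filter above joins two relational tests with '&' because the range 135 to 225 is continuous. A range that wraps through north (0/360 degrees) is not continuous, so its two halves have to be joined with '|' instead. The sketch below contrasts both cases on a toy data frame; the values and the 315 and 45 degree limits for 'north' are illustrative assumptions.

```r
# Toy site records with aspect in degrees (values are invented)
site.info <- data.frame(
  site_unique   = paste0("S", 1:6),
  site_aspect.n = c(0, 45, 180, 200, 315, 350),
  stringsAsFactors = FALSE
)

# South-facing: one continuous interval, both conditions must hold ('&')
south <- site.info[site.info$site_aspect.n > 135 &
                   site.info$site_aspect.n <= 225, ]

# North-facing: the interval wraps through 0/360, either condition may hold ('|')
north <- site.info[site.info$site_aspect.n >= 315 |
                   site.info$site_aspect.n <= 45, ]

south$site_aspect.n
north$site_aspect.n
```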

Example 3: State

State is a 'character' variable, so we need to change its class to 'factor'.

.

# State
# =====

# Explore Variable Type and Change to Factor
# ------------------------------------------
class(AP.data$site.info$state)

.

## [1] "character"

.

.

summary(AP.data$site.info$state)

.

##    Length     Class      Mode 
##       662 character character

.

.

AP.data$site.info$state.f = factor(AP.data$site.info$state)
summary(AP.data$site.info$state.f)

.

## NSW  NT QLD  SA VIC  WA 
##  87 138 127 171  18 121

.

.

# Subset to Plots in the State of Queensland in 'site.info' data frame
# --------------------------------------------------------------------
# Subset to "QLD"
QLD.AP.data = AP.data
dim(QLD.AP.data$site.info)

.

## [1] 662  46

.

.

#summary(AP.data$site.info$state.f)
QLD.AP.data$site.info = QLD.AP.data$site.info[QLD.AP.data$site.info$state.f == "QLD",] 
levels(QLD.AP.data$site.info$state.f)

.

## [1] "NSW" "NT"  "QLD" "SA"  "VIC" "WA"

.

.

QLD.AP.data$site.info$state.f = droplevels(QLD.AP.data$site.info$state.f)
levels(QLD.AP.data$site.info$state.f)

.

## [1] "QLD"

.

.

dim(QLD.AP.data$site.info)

.

## [1] 127  46

.

.

summary(QLD.AP.data$site.info$state.f)

.

## QLD 
## 127

.

.

# Subset to Plots in the State of Queensland in other Data Frames
# ---------------------------------------------------------------
# To do so we use the common variable 'site_unique'

# Subset in 'veg.PI' Data Frame
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
dim(QLD.AP.data$veg.PI)

.

## [1] 734464     14

.

.

QLD.AP.data$veg.PI = QLD.AP.data$veg.PI[QLD.AP.data$veg.PI$site_unique %in% QLD.AP.data$site.info$site_unique, ]
dim(QLD.AP.data$veg.PI)

.

## [1] 138489     14

.

.

# Subset in 'veg.basal' Data Frame
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
dim(QLD.AP.data$veg.basal)

.

## [1] 8291   10

.

.

QLD.AP.data$veg.basal = QLD.AP.data$veg.basal[QLD.AP.data$veg.basal$site_unique %in% QLD.AP.data$site.info$site_unique, ]
dim(QLD.AP.data$veg.basal)

.

## [1] 1846   10

.

.

.

Example 4: Bioregion name

Bioregion name is a 'character' variable, so we need to change its class to 'factor'.

.

# Bioregion name
# ==============

# Explore Variable Type and Change to Factor
# ------------------------------------------
class(AP.data$site.info$bioregion_name)

.

## [1] "character"

.

.

summary(AP.data$site.info$bioregion_name)

.

##    Length     Class      Mode 
##       662 character character

.

.

AP.data$site.info$bioregion_name.f = factor(AP.data$site.info$bioregion_name)
summary(AP.data$site.info$bioregion_name.f)

.

## ARP AUA AVW BBS BHC BRT CEK CHC COO COP CYP DAB DAC DAL DMR EIU ESP EYB 
##   3  15   4   2  34   6   3  13  32   2  19   1   3   1   3   7   1   3 
## FIN FLB GAS GAW GES GFU GSD GUP GVD HAM JAF KAN LSD MAC MAL MDD MGD MII 
##  18  50   2   3   3  41   1  33   5   6   3  11   3  28   3  52  34   2 
## MUL MUR NAN NSS NUL PCK PIL RIV SSD STP STU SWA SYB VIB 
##   7   6   2   3  13   3  35  32  48  40   6   4   9   4

.

.

# Subset to Plots in Bioregions in the Eastern (~ Qld) Gulf of Carpentaria in 'site.info' data frame
# --------------------------------------------------------------------------------------------------
# Subset to "CYP" (Cape York Peninsula) and "GUP" (Gulf Plains) 
EGCBioregs.AP.data = AP.data
dim(EGCBioregs.AP.data$site.info)

.

## [1] 662  47

.

.

#summary(AP.data$site.info$bioregion_name.f)
EastCarpGulf.Bioreg = c("CYP", "GUP")
EGCBioregs.AP.data$site.info = EGCBioregs.AP.data$site.info[EGCBioregs.AP.data$site.info$bioregion_name.f %in% EastCarpGulf.Bioreg,] 
levels(EGCBioregs.AP.data$site.info$bioregion_name.f)

.

##  [1] "ARP" "AUA" "AVW" "BBS" "BHC" "BRT" "CEK" "CHC" "COO" "COP" "CYP"
## [12] "DAB" "DAC" "DAL" "DMR" "EIU" "ESP" "EYB" "FIN" "FLB" "GAS" "GAW"
## [23] "GES" "GFU" "GSD" "GUP" "GVD" "HAM" "JAF" "KAN" "LSD" "MAC" "MAL"
## [34] "MDD" "MGD" "MII" "MUL" "MUR" "NAN" "NSS" "NUL" "PCK" "PIL" "RIV"
## [45] "SSD" "STP" "STU" "SWA" "SYB" "VIB"

.

.

EGCBioregs.AP.data$site.info$bioregion_name.f = droplevels(EGCBioregs.AP.data$site.info$bioregion_name.f)
levels(EGCBioregs.AP.data$site.info$bioregion_name.f)

.

## [1] "CYP" "GUP"

.

.

dim(EGCBioregs.AP.data$site.info)

.

## [1] 52 47

.

.

summary(EGCBioregs.AP.data$site.info$bioregion_name.f)

.

## CYP GUP 
##  19  33

.

.

# Subset to Plots in Bioregions in the Eastern (~ Qld) Gulf of Carpentaria in other Data Frames
# ---------------------------------------------------------------------------------------------
# To do so we use the common variable 'site_unique'

# Subset in 'veg.PI' Data Frame
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
dim(EGCBioregs.AP.data$veg.PI)

.

## [1] 734464     14

.

.

EGCBioregs.AP.data$veg.PI = EGCBioregs.AP.data$veg.PI[EGCBioregs.AP.data$veg.PI$site_unique %in% EGCBioregs.AP.data$site.info$site_unique, ]
dim(EGCBioregs.AP.data$veg.PI)

.

## [1] 61426    14

.

.

# Subset in 'veg.basal' Data Frame
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
dim(EGCBioregs.AP.data$veg.basal)

.

## [1] 8291   10

.

.

EGCBioregs.AP.data$veg.basal = EGCBioregs.AP.data$veg.basal[EGCBioregs.AP.data$veg.basal$site_unique %in% EGCBioregs.AP.data$site.info$site_unique, ]
dim(EGCBioregs.AP.data$veg.basal)

.

## [1] 1222   10

.

.

.