Pretreatment of Environmental Data

Last updated: 2022-01-18

Checks: 6 1

Knit directory: Barley1kGBS_Proj/

This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it's best to always run the code in an empty environment.

Seed: set.seed(20210831)

The command set.seed(20210831) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: detected

The following chunks had caches available:

unnamed-chunk-15
unnamed-chunk-18
unnamed-chunk-5
unnamed-chunk-6

To ensure reproducibility of the results, delete the cache directory PreprocessEnvData_cache and re-run the analysis. To have workflowr automatically delete the cache directory prior to building the file, set delete_cache = TRUE when running wflow_build() or wflow_publish().

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: da73ff9

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version da73ff9. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    analysis/GenomeEnvAssociation_cache/
    Ignored:    analysis/GenomicVarPartitioning_cache/
    Ignored:    analysis/PreprocessEnvData_cache/

Untracked files:
    Untracked:  VennDiagram2022-01-18_16-08-05.log
    Untracked:  VennDiagram2022-01-18_16-10-17.log

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/PreprocessEnvData.Rmd) and HTML (public/PreprocessEnvData.html) files. If you've configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
Rmd	970d65b	Che-Wei Chang	2022-01-18	update website
html	970d65b	Che-Wei Chang	2022-01-18	update website
html	a85e611	Che-Wei Chang	2021-11-30	Build site.
html	bc726c2	Che-Wei Chang	2021-10-22	Build site.
html	5f82e26	Che-Wei Chang	2021-10-22	Build site.
html	f1ac360	Che-Wei Chang	2021-10-22	Build site.
html	1202876	Che-Wei Chang	2021-10-22	Host with GitLab.
Rmd	7c6b41a	Che-Wei Chang	2021-10-22	wflow_publish(list.files("./analysis", pattern = "*.Rmd", full.names = T))

Introduction

This file shows how the raw environmental data processed and generating the synthetic environmental variables. The raw environmental data was extracted from tif files in advance according to the coordinates of collection sites by using the raster package. The tif files are available in the databases as described in our publication. The coordinate data is available in the supplementary files of our publication or ./data/geo_coordinates_244accessions.tsv.

Load raw environmental data

rm(list = ls())
env <- read.table("./data/RawEnv_244accessions.tsv" , header = T, stringsAsFactors = F)

Get data of growing seasons

Because the growing season of wild barley in the Southern Levant is roughly between October and April of the next year, we select the climate data for this time interval. Bioclimatic variables are further calculated accordingly.

## generate bioclimatic variables using growing season data
## extract data of growing season (October ~ April)
gs.prec <- env[,grep(colnames(env), pattern = "prec_[01][01234]")]
gs.tavg <- env[,grep(colnames(env), pattern = "tavg_[01][01234]")]
gs.tmax <- env[,grep(colnames(env), pattern = "tmax_[01][01234]")]
gs.tmin <- env[,grep(colnames(env), pattern = "tmin_[01][01234]")]
gs.srad <- env[,grep(colnames(env), pattern = "srad_[01][01234]")]

## generate 'bioclimatic' variables
gsmt <- apply(gs.tavg,1,mean) # mean temperature of growing season
gstsd <- apply(gs.tavg,1,sd) # standard deviation of monthly mean temperature of growing season
gstmax <- apply(gs.tmax,1, max) # maximal temperature of growing season
gstmin <- apply(gs.tmin,1, min) # minimal temperature of growing season
gsmdr <- apply(gs.tmax - gs.tmin,1,mean) #  mean diurnal range (mean of monthly (max temp - min temp))
gsiso <- gsmdr/(gstmax - gstmin) # isothermality of growing season

gsprec <-apply(gs.prec,1,sum) # total precipitation of growing season
gscvprec <- apply(gs.prec,1, function(x){100*(sd(x)/mean(x))}) # coefficient of variation of precipitation of off season
gsmsrad <- apply(gs.srad,1,mean) # average solar radiation of of growing season

bio <- data.frame(gsmt, gstsd,gstmax,gstmin,gsmdr, gsiso,gsprec,gscvprec,gsmsrad)

envdata <- cbind.data.frame(env[,!grepl(colnames(env), pattern = "prec_|tavg_|tmax_|tmin_|srad_|bio_")], bio)

Generate synthetic environmental variables

For those highly collinear environmental variables, we would like to combine them into a synthetic variable. This synthetic environmental varaible is defined as the first dimension of principal components (PCs) of highly collinear environmental variable.

About the customized index

To identify clusters of highly collinear variables, we designed a customized index based on the sum of squares (SS) by considering SS as "the amount of information". The index is written as

\[\frac{\Sigma SS_{PC1,n}}{SS_{Total}-\Sigma SS_{PC1,n}}\] In the equation, \(SS_{PC1,n}\) is the SS of the first PCs of \(n\) groups of standardized variables, and \(SS_{Total}\) is the SS of all standardized variables. If variables with little collinearity are grouped together, \(\Sigma SS_{PC1,n}\) will decrease and \(SS_{Total}-\Sigma SS_{PC1,n}\) will increase ,so the index will decrease. In contrast, if highly collinear variables are grouped, the index will increase. Different grouping combinations are considered for calculation by cutting a tree of hierarchical clustering based on \(1-R^2\). Please note that if there is a group containing only one variable, then this group will be excluded from the calculation of \(SS_{PC1,n}\) (so if we cut a tree at \(1-R^2=0\) and there is not any pair of variables with \(R^2=1\), we should have the index of 0).

## Define a function to carry out clustering with the customized index
collinear.clust <- function(X, res = 0.001, hc.method = 'complete', eigen.d = 1){
  require(dendextend)
  
  # 1. construct distance matrix
  R2.dist <- 1-(cor(X)^2)
  
  # 2. make a clustering tree
  hc <- hclust(as.dist(R2.dist), method = hc.method) 
  
  dd <- as.dendrogram(hc)
  
  thres = seq(0, 1, res)
  tmp.save <- list()
  
  for(i in 1:length(thres)){
    
    grpindx <- cutree(hc, h = thres[i])
    expvar <- sapply(unique(grpindx), function(x){
      if(sum(grpindx == x) > 1){                       # exlude the clusters having only one variable
        d <- svd(scale(env)[,names(grpindx)[grpindx == x]])$d^2
        return(d[1:eigen.d])
      }else{return(0)}
    })
    tmp.save <- c(tmp.save, list(expvar))
    
  }
  SST <- sum(svd(X)$d^2)
  SSr <- sapply(tmp.save, function(x){sum(x)})/(SST - sapply(tmp.save, function(x){sum(x)}))

  out <- 
    list("clusters" = cutree(hc, h = thres[which.max(SSr)]),
         "Cluster_Index" = data.frame("Threshold" = thres, "CI" = SSr))
  return(out)
  
} # 'collinear.clust' end

Identify collinear groups

library(dendextend)
env <- envdata[,-(1:3)]
env$Aspect <- cos(env$Aspect) # express aspect by cos(x)

R2.dist <- 1-(cor(env)^2)
hc <- hclust(as.dist(R2.dist), method = "complete") 

rownames(env) <- envdata$vcfID
grp <- collinear.clust(X = scale(env))

A plot showing the optimal split point of \(1-R^2\)

Version	Author	Date
bc726c2	Che-Wei Chang	2021-10-22

A dendrogram showing the clustering results

Version	Author	Date
bc726c2	Che-Wei Chang	2021-10-22

Extract the first PCs

After identifying the collinear groups, we extract their first PCs as the synthetic environmental variables. For the convenience, I use some abbreviations to name the variables. Later on, they would be replace with intelligible names as those used in our publication.

lapply(1:max(grp$clusters), function(x){(names(grp$clusters)[grp$clusters == x])}) # check grouping result again

[[1]]
[1] "Latitude" "gsprec"   "gsmsrad" 

[[2]]
[1] "Longitude"

[[3]]
[1] "Aspect"

[[4]]
[1] "Slope"

[[5]]
[1] "Elevation" "gsmt"      "gstmax"    "gstmin"   

[[6]]
[1] "BLDFIE_sl1" "BLDFIE_sl2"

[[7]]
[1] "BLDFIE_sl3" "BLDFIE_sl4"

[[8]]
[1] "CECSOL_sl1" "CECSOL_sl2" "CECSOL_sl3" "CECSOL_sl4" "WWP_sl1"   
[6] "WWP_sl2"    "WWP_sl3"    "WWP_sl4"   

[[9]]
[1] "CLYPPT_sl1" "CLYPPT_sl2" "CLYPPT_sl3" "CLYPPT_sl4" "SNDPPT_sl1"
[6] "SNDPPT_sl2" "SNDPPT_sl3" "SNDPPT_sl4"

[[10]]
[1] "AWCh1_sl1" "AWCh2_sl1" "AWCh3_sl1"

[[11]]
[1] "AWCh1_sl2" "AWCh1_sl3" "AWCh1_sl4" "AWCh2_sl2" "AWCh2_sl3" "AWCh2_sl4"
[7] "AWCh3_sl2" "AWCh3_sl3" "AWCh3_sl4"

[[12]]
[1] "ORCDRC_sl1" "ORCDRC_sl2" "ORCDRC_sl3"

[[13]]
[1] "ORCDRC_sl4"

[[14]]
[1] "PHIHOX_sl1" "PHIHOX_sl2" "PHIHOX_sl3" "PHIHOX_sl4"

[[15]]
[1] "SLTPPT_sl1" "SLTPPT_sl2" "SLTPPT_sl3" "SLTPPT_sl4"

[[16]]
[1] "gstsd"

[[17]]
[1] "gsmdr" "gsiso"

[[18]]
[1] "gscvprec"

# name the new environmental variables (including synthetic variables)
new.var.name <- c("Lat_gsprec_msrad",
                  "Longitude",
                  "Aspect",
                  "Slope",
                  "Elev_gsmt_tmax_tmin",
                  "BD_sl12",
                  "BD_sl34",
                  "CEC_WWP_sl1234",
                  "CL_SND_sl1234",
                  "AWCh123_sl1",
                  "AWCh123_sl234",
                  "OCC_sl123",
                  "OCC_sl4",
                  "pH_sl1234",
                  "SLT_sl1234",
                  "gstsd",
                  "gsmdr_iso",
                  "gscvprec"
)

# generate synthetic variables
grpindx <- split(names(grp$clusters), f = grp$clusters) 
syn.env <- matrix(NA, ncol = length(grpindx), nrow = nrow(env))

for(i in 1:length(grpindx)){
  if(length(grpindx[[i]]) >1){
    pc <- prcomp((scale(env, center = T, scale = T)[,grpindx[[i]]]))
    syn.env[,i] <- pc$x[,1]
  }else{
    syn.env[,i] <- scale(env)[,grpindx[[i]]]
  }
} # for i loop end

# get the rotations and explained variation of the first PCs
syn.env.info <- 
  lapply(1:length(grpindx),function(i){
    
    if(length(grpindx[[i]]) >1){
      pc <- prcomp((scale(env, center = T, scale = T)[,grpindx[[i]]]))
      out <- list(rotation = pc$rotation[,"PC1"],
           exp.var = pc$sdev[1]^2/sum(pc$sdev^2))
      return(out)
    }else{
      return(NULL)
    }
    
  })
colnames(syn.env) <- names(syn.env.info) <- new.var.name
rownames(syn.env) <- rownames(env)

variable selection according to VIF

In the last step, we conduct variable selection according to the variance inflation factor (VIF).

Define a function to calculate VIF

We define an R function cal.VIF to calculate VIF, which is written as \[ VIF=\frac{1}{1-R^2} \]

cal.VIF <- function(X){
  if(!is.data.frame(X)){stop("The input data format must be 'data.frame'.")}
  library(car)
  out <- 
    lapply( 1:ncol(X), function(i){
      overall.vif <- 1/(1-summary(lm(eval(parse(text = paste(colnames(X)[i], "~.", sep = ""))), data = X))$r.squared) # calculate the overall VIF -> 1/(1-R^2)
      vifout <- overall.vif
      return(vifout)
    }) # lapply end 
  names(out) <- colnames(X)
  return(out)
} # function 'cal.VIF' end

Variable selection

We carry out variable selection by removing the variable with the largest VIF in each iteration. Because the precipitation and temperature are regarded as the most important selective gradients (Hubner et al. 2009), we manually keep them in the selection process by putting them in a whitelist.

syn.env <- as.data.frame(syn.env)

VIF <- unlist(cal.VIF(syn.env))
white.list <- c("Lat_gsprec_msrad", "Elev_gsmt_tmax_tmin")
todrop <- c()
while(any(VIF > 5)){
  tmp <- VIF[-grep(names(VIF), pattern = paste(white.list, collapse = "|"))] # artificially exclude the variables related to mean precipitation and temperature from the selection process
  todrop <- c(todrop, names(tmp)[which.max(tmp)]
              ) # add the variable with the largest VIF to a list
  VIF <- unlist(cal.VIF(syn.env[,-which(colnames(syn.env) %in% todrop)]))
}

Check results of varaible selection

The variable selection removed six variables, including five synthetic variables:
* AWCh123_sl234: Soil water capacity (depth 5cm-30cm)
* CEC_WWP_sl1234: Soil cation exchange capacity + water capacity until wilting point (depth 0cm-30cm)
* Longitude * CL_SND_sl1234: Soil clay content + sand content (depth 0cm-30cm)
* gsmdr_iso: Mean diurnal range + Isothermality in the growing season
* BD_sl12: Soil bulk density (depth 0cm-5cm)

BD_sl12 was the last variable to be removed in the selection process. However, the BD_sl34 (soil bulk density at a depth of 15-30cm) was kept. In general, we may expect the characters of a shallow soil layer be more important than a deep soil layer because the deep layer supposedly influcence the plant growth only if roots reach it. Besides, in the VIFs shown below, the VIF of BD_sl12 is just 5.03, which is close to our threshold.

todrop

[1] "AWCh123_sl234"  "CEC_WWP_sl1234" "Longitude"      "CL_SND_sl1234" 
[5] "gsmdr_iso"      "BD_sl12"

sapply(cal.VIF(syn.env[,-which(colnames(syn.env) %in% todrop[-length(todrop)])]), function(x){x[length(x)]})

   Lat_gsprec_msrad              Aspect               Slope Elev_gsmt_tmax_tmin 
           4.015157            1.181480            1.568493            2.642213 
            BD_sl12             BD_sl34         AWCh123_sl1           OCC_sl123 
           5.032382            3.887125            2.540763            4.676382 
            OCC_sl4           pH_sl1234          SLT_sl1234               gstsd 
           2.731634            2.674157            2.071673            1.538495 
           gscvprec 
           3.160113

In addition, if we droop BD_sl34 instead, we can also make all VIF < 5. Thus, I arbitrarily decided to remove BD_sl34 instead of BD_sl12

todrop <- gsub(todrop, pattern = "BD_sl12", replacement = "BD_sl34")
sapply(cal.VIF(syn.env[,-which(colnames(syn.env) %in% todrop)]), function(x){x[length(x)]})

   Lat_gsprec_msrad              Aspect               Slope Elev_gsmt_tmax_tmin 
           3.921094            1.177832            1.473049            2.297442 
            BD_sl12         AWCh123_sl1           OCC_sl123             OCC_sl4 
           2.605299            2.208383            3.979243            2.624242 
          pH_sl1234          SLT_sl1234               gstsd            gscvprec 
           2.618653            2.066102            1.518079            3.101438

Rename variables with intelligible names

Lastly, we rename the selected environmental varaibles with intelligible names as the names used in our publication.

orig.lab <- c("Lat_gsprec_msrad","Aspect","Slope","Elev_gsmt_tmax_tmin","BD_sl12","AWCh123_sl1","OCC_sl123",
              "OCC_sl4","pH_sl1234","SLT_sl1234","gstsd","gscvprec")
new.lab <- c("Latitude+Rain+Solar_rad", "Aspect", "Slope", "Elevation+Temperature", "Soil_bulk_density", "Soil_water_capacity",
             "Soil_carbon_content(0-15cm)", "Soil_carbon_content(30cm)", "Soil_pH", "Soil_silt_content", "StdDev_Temperature", "CoefVar_Rain")

selected.env <- syn.env[,-which(colnames(syn.env) %in% todrop)]
colnames(selected.env) <- new.lab[match(colnames(selected.env),orig.lab)]

selected.env.info <- syn.env.info[-which(colnames(syn.env) %in% todrop)]
names(selected.env.info) <- new.lab[match(names(selected.env.info),orig.lab)]

Summary of environmental data

Number of synthetic variables

sum(!(sapply(selected.env.info, is.null)))

[1] 7

Percentage of variance explained by the first PCs

Since we take the first PCs as synthetic environmental variables, it is necessary to know how many percentages of variation of a collinear group is captured by its first PC. We can find that most of the synthetic variables explain more than 90% of variation of the corresponding collinear group.

Env	Exp_Var
Latitude+Rain+Solar_rad	0.9100658
Elevation+Temperature	0.9314568
Soil_bulk_density	0.9468696
Soil_water_capacity	0.9615446
Soil_carbon_content(0-15cm)	0.8834362
Soil_pH	0.9689760
Soil_silt_content	0.9695062

Rotations of the first PCs

We also look at the rotations that used to generate the synthetic environmental variables.

sapply(selected.env.info, function(x){
  if(is.null(x)){return(NULL)}
  return(x$rotation)
})

$`Latitude+Rain+Solar_rad`
  Latitude     gsprec    gsmsrad 
 0.5884235  0.5565450 -0.5865283 

$Aspect
NULL

$Slope
NULL

$`Elevation+Temperature`
 Elevation       gsmt     gstmax     gstmin 
 0.5084846 -0.5130144 -0.4907933 -0.4872182 

$Soil_bulk_density
BLDFIE_sl1 BLDFIE_sl2 
-0.7071068 -0.7071068 

$Soil_water_capacity
 AWCh1_sl1  AWCh2_sl1  AWCh3_sl1 
-0.5748436 -0.5819211 -0.5752588 

$`Soil_carbon_content(0-15cm)`
ORCDRC_sl1 ORCDRC_sl2 ORCDRC_sl3 
 0.5568221  0.5897761  0.5849046 

$`Soil_carbon_content(30cm)`
NULL

$Soil_pH
PHIHOX_sl1 PHIHOX_sl2 PHIHOX_sl3 PHIHOX_sl4 
 0.4951438  0.5019901  0.5027992  0.5000314 

$Soil_silt_content
SLTPPT_sl1 SLTPPT_sl2 SLTPPT_sl3 SLTPPT_sl4 
-0.4994935 -0.5021920 -0.5030756 -0.4952014 

$StdDev_Temperature
NULL

$CoefVar_Rain
NULL

Correlation between environmental variables

Finally, we look at the correlation between all selected environmental variables with a heatmap.

Version	Author	Date
bc726c2	Che-Wei Chang	2021-10-22

sessionInfo()

R version 3.6.0 (2019-04-26)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.6 LTS

Matrix products: default
BLAS:   /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] ggplot2_3.3.5     car_3.0-6         carData_3.0-3     dendextend_1.13.2
[5] workflowr_1.6.2  

loaded via a namespace (and not attached):
 [1] tidyselect_1.1.1  xfun_0.27         reshape2_1.4.3    purrr_0.3.4      
 [5] haven_2.3.1       colorspace_2.0-2  vctrs_0.3.8       generics_0.1.0   
 [9] htmltools_0.5.2   viridisLite_0.4.0 yaml_2.2.1        utf8_1.2.2       
[13] rlang_0.4.12      later_1.3.0       pillar_1.6.4      withr_2.4.2      
[17] foreign_0.8-71    glue_1.4.2        DBI_1.1.1         readxl_1.3.1     
[21] plyr_1.8.6        lifecycle_1.0.1   stringr_1.4.0     cellranger_1.1.0 
[25] munsell_0.5.0     gtable_0.3.0      zip_2.1.1         evaluate_0.14    
[29] forcats_0.5.1     knitr_1.36        rio_0.5.16        fastmap_1.1.0    
[33] httpuv_1.6.3      curl_4.3          fansi_0.5.0       Rcpp_1.0.7       
[37] promises_1.2.0.1  scales_1.1.1      backports_1.1.5   abind_1.4-5      
[41] farver_2.1.0      fs_1.5.0          gridExtra_2.3     hms_1.1.1        
[45] digest_0.6.28     openxlsx_4.1.4    stringi_1.7.5     dplyr_1.0.7      
[49] grid_3.6.0        rprojroot_1.3-2   tools_3.6.0       magrittr_2.0.1   
[53] tibble_3.1.5      crayon_1.4.1      whisker_0.4       pkgconfig_2.0.3  
[57] ellipsis_0.3.2    data.table_1.12.8 assertthat_0.2.1  rmarkdown_2.3    
[61] viridis_0.5.1     R6_2.5.1          git2r_0.26.1      compiler_3.6.0