Last updated: 2022-01-18
Checks: 6 1
Knit directory: Barley1kGBS_Proj/
This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it's best to always run the code in an empty environment.
The command set.seed(20210831)
was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
To ensure reproducibility of the results, delete the cache directory PreprocessEnvData_cache
and re-run the analysis. To have workflowr automatically delete the cache directory prior to building the file, set delete_cache = TRUE
when running wflow_build()
or wflow_publish()
.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version da73ff9. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish
or wflow_git_commit
). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: analysis/GenomeEnvAssociation_cache/
Ignored: analysis/GenomicVarPartitioning_cache/
Ignored: analysis/PreprocessEnvData_cache/
Untracked files:
Untracked: VennDiagram2022-01-18_16-08-05.log
Untracked: VennDiagram2022-01-18_16-10-17.log
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were made to the R Markdown (analysis/PreprocessEnvData.Rmd
) and HTML (public/PreprocessEnvData.html
) files. If you've configured a remote Git repository (see ?wflow_git_remote
), click on the hyperlinks in the table below to view the files as they were in that past version.
File | Version | Author | Date | Message |
---|---|---|---|---|
Rmd | 970d65b | Che-Wei Chang | 2022-01-18 | update website |
html | 970d65b | Che-Wei Chang | 2022-01-18 | update website |
html | a85e611 | Che-Wei Chang | 2021-11-30 | Build site. |
html | bc726c2 | Che-Wei Chang | 2021-10-22 | Build site. |
html | 5f82e26 | Che-Wei Chang | 2021-10-22 | Build site. |
html | f1ac360 | Che-Wei Chang | 2021-10-22 | Build site. |
html | 1202876 | Che-Wei Chang | 2021-10-22 | Host with GitLab. |
Rmd | 7c6b41a | Che-Wei Chang | 2021-10-22 | wflow_publish(list.files("./analysis", pattern = "*.Rmd", full.names = T)) |
This file shows how the raw environmental data processed and generating the synthetic environmental variables. The raw environmental data was extracted from tif
files in advance according to the coordinates of collection sites by using the raster
package. The tif
files are available in the databases as described in our publication. The coordinate data is available in the supplementary files of our publication or ./data/geo_coordinates_244accessions.tsv
.
rm(list = ls())
env <- read.table("./data/RawEnv_244accessions.tsv" , header = T, stringsAsFactors = F)
Because the growing season of wild barley in the Southern Levant is roughly between October and April of the next year, we select the climate data for this time interval. Bioclimatic variables are further calculated accordingly.
## generate bioclimatic variables using growing season data
## extract data of growing season (October ~ April)
gs.prec <- env[,grep(colnames(env), pattern = "prec_[01][01234]")]
gs.tavg <- env[,grep(colnames(env), pattern = "tavg_[01][01234]")]
gs.tmax <- env[,grep(colnames(env), pattern = "tmax_[01][01234]")]
gs.tmin <- env[,grep(colnames(env), pattern = "tmin_[01][01234]")]
gs.srad <- env[,grep(colnames(env), pattern = "srad_[01][01234]")]
## generate 'bioclimatic' variables
gsmt <- apply(gs.tavg,1,mean) # mean temperature of growing season
gstsd <- apply(gs.tavg,1,sd) # standard deviation of monthly mean temperature of growing season
gstmax <- apply(gs.tmax,1, max) # maximal temperature of growing season
gstmin <- apply(gs.tmin,1, min) # minimal temperature of growing season
gsmdr <- apply(gs.tmax - gs.tmin,1,mean) # mean diurnal range (mean of monthly (max temp - min temp))
gsiso <- gsmdr/(gstmax - gstmin) # isothermality of growing season
gsprec <-apply(gs.prec,1,sum) # total precipitation of growing season
gscvprec <- apply(gs.prec,1, function(x){100*(sd(x)/mean(x))}) # coefficient of variation of precipitation of off season
gsmsrad <- apply(gs.srad,1,mean) # average solar radiation of of growing season
bio <- data.frame(gsmt, gstsd,gstmax,gstmin,gsmdr, gsiso,gsprec,gscvprec,gsmsrad)
envdata <- cbind.data.frame(env[,!grepl(colnames(env), pattern = "prec_|tavg_|tmax_|tmin_|srad_|bio_")], bio)
For those highly collinear environmental variables, we would like to combine them into a synthetic variable. This synthetic environmental varaible is defined as the first dimension of principal components (PCs) of highly collinear environmental variable.
To identify clusters of highly collinear variables, we designed a customized index based on the sum of squares (SS) by considering SS as "the amount of information". The index is written as
\[\frac{\Sigma SS_{PC1,n}}{SS_{Total}-\Sigma SS_{PC1,n}}\] In the equation, \(SS_{PC1,n}\) is the SS of the first PCs of \(n\) groups of standardized variables, and \(SS_{Total}\) is the SS of all standardized variables. If variables with little collinearity are grouped together, \(\Sigma SS_{PC1,n}\) will decrease and \(SS_{Total}-\Sigma SS_{PC1,n}\) will increase ,so the index will decrease. In contrast, if highly collinear variables are grouped, the index will increase. Different grouping combinations are considered for calculation by cutting a tree of hierarchical clustering based on \(1-R^2\). Please note that if there is a group containing only one variable, then this group will be excluded from the calculation of \(SS_{PC1,n}\) (so if we cut a tree at \(1-R^2=0\) and there is not any pair of variables with \(R^2=1\), we should have the index of 0).
## Define a function to carry out clustering with the customized index
collinear.clust <- function(X, res = 0.001, hc.method = 'complete', eigen.d = 1){
require(dendextend)
# 1. construct distance matrix
R2.dist <- 1-(cor(X)^2)
# 2. make a clustering tree
hc <- hclust(as.dist(R2.dist), method = hc.method)
dd <- as.dendrogram(hc)
thres = seq(0, 1, res)
tmp.save <- list()
for(i in 1:length(thres)){
grpindx <- cutree(hc, h = thres[i])
expvar <- sapply(unique(grpindx), function(x){
if(sum(grpindx == x) > 1){ # exlude the clusters having only one variable
d <- svd(scale(env)[,names(grpindx)[grpindx == x]])$d^2
return(d[1:eigen.d])
}else{return(0)}
})
tmp.save <- c(tmp.save, list(expvar))
}
SST <- sum(svd(X)$d^2)
SSr <- sapply(tmp.save, function(x){sum(x)})/(SST - sapply(tmp.save, function(x){sum(x)}))
out <-
list("clusters" = cutree(hc, h = thres[which.max(SSr)]),
"Cluster_Index" = data.frame("Threshold" = thres, "CI" = SSr))
return(out)
} # 'collinear.clust' end
library(dendextend)
env <- envdata[,-(1:3)]
env$Aspect <- cos(env$Aspect) # express aspect by cos(x)
R2.dist <- 1-(cor(env)^2)
hc <- hclust(as.dist(R2.dist), method = "complete")
rownames(env) <- envdata$vcfID
grp <- collinear.clust(X = scale(env))
Version | Author | Date |
---|---|---|
bc726c2 | Che-Wei Chang | 2021-10-22 |
After identifying the collinear groups, we extract their first PCs as the synthetic environmental variables. For the convenience, I use some abbreviations to name the variables. Later on, they would be replace with intelligible names as those used in our publication.
lapply(1:max(grp$clusters), function(x){(names(grp$clusters)[grp$clusters == x])}) # check grouping result again
[[1]]
[1] "Latitude" "gsprec" "gsmsrad"
[[2]]
[1] "Longitude"
[[3]]
[1] "Aspect"
[[4]]
[1] "Slope"
[[5]]
[1] "Elevation" "gsmt" "gstmax" "gstmin"
[[6]]
[1] "BLDFIE_sl1" "BLDFIE_sl2"
[[7]]
[1] "BLDFIE_sl3" "BLDFIE_sl4"
[[8]]
[1] "CECSOL_sl1" "CECSOL_sl2" "CECSOL_sl3" "CECSOL_sl4" "WWP_sl1"
[6] "WWP_sl2" "WWP_sl3" "WWP_sl4"
[[9]]
[1] "CLYPPT_sl1" "CLYPPT_sl2" "CLYPPT_sl3" "CLYPPT_sl4" "SNDPPT_sl1"
[6] "SNDPPT_sl2" "SNDPPT_sl3" "SNDPPT_sl4"
[[10]]
[1] "AWCh1_sl1" "AWCh2_sl1" "AWCh3_sl1"
[[11]]
[1] "AWCh1_sl2" "AWCh1_sl3" "AWCh1_sl4" "AWCh2_sl2" "AWCh2_sl3" "AWCh2_sl4"
[7] "AWCh3_sl2" "AWCh3_sl3" "AWCh3_sl4"
[[12]]
[1] "ORCDRC_sl1" "ORCDRC_sl2" "ORCDRC_sl3"
[[13]]
[1] "ORCDRC_sl4"
[[14]]
[1] "PHIHOX_sl1" "PHIHOX_sl2" "PHIHOX_sl3" "PHIHOX_sl4"
[[15]]
[1] "SLTPPT_sl1" "SLTPPT_sl2" "SLTPPT_sl3" "SLTPPT_sl4"
[[16]]
[1] "gstsd"
[[17]]
[1] "gsmdr" "gsiso"
[[18]]
[1] "gscvprec"
# name the new environmental variables (including synthetic variables)
new.var.name <- c("Lat_gsprec_msrad",
"Longitude",
"Aspect",
"Slope",
"Elev_gsmt_tmax_tmin",
"BD_sl12",
"BD_sl34",
"CEC_WWP_sl1234",
"CL_SND_sl1234",
"AWCh123_sl1",
"AWCh123_sl234",
"OCC_sl123",
"OCC_sl4",
"pH_sl1234",
"SLT_sl1234",
"gstsd",
"gsmdr_iso",
"gscvprec"
)
# generate synthetic variables
grpindx <- split(names(grp$clusters), f = grp$clusters)
syn.env <- matrix(NA, ncol = length(grpindx), nrow = nrow(env))
for(i in 1:length(grpindx)){
if(length(grpindx[[i]]) >1){
pc <- prcomp((scale(env, center = T, scale = T)[,grpindx[[i]]]))
syn.env[,i] <- pc$x[,1]
}else{
syn.env[,i] <- scale(env)[,grpindx[[i]]]
}
} # for i loop end
# get the rotations and explained variation of the first PCs
syn.env.info <-
lapply(1:length(grpindx),function(i){
if(length(grpindx[[i]]) >1){
pc <- prcomp((scale(env, center = T, scale = T)[,grpindx[[i]]]))
out <- list(rotation = pc$rotation[,"PC1"],
exp.var = pc$sdev[1]^2/sum(pc$sdev^2))
return(out)
}else{
return(NULL)
}
})
colnames(syn.env) <- names(syn.env.info) <- new.var.name
rownames(syn.env) <- rownames(env)
In the last step, we conduct variable selection according to the variance inflation factor (VIF).
We define an R function cal.VIF
to calculate VIF, which is written as \[ VIF=\frac{1}{1-R^2} \]
cal.VIF <- function(X){
if(!is.data.frame(X)){stop("The input data format must be 'data.frame'.")}
library(car)
out <-
lapply( 1:ncol(X), function(i){
overall.vif <- 1/(1-summary(lm(eval(parse(text = paste(colnames(X)[i], "~.", sep = ""))), data = X))$r.squared) # calculate the overall VIF -> 1/(1-R^2)
vifout <- overall.vif
return(vifout)
}) # lapply end
names(out) <- colnames(X)
return(out)
} # function 'cal.VIF' end
We carry out variable selection by removing the variable with the largest VIF in each iteration. Because the precipitation and temperature are regarded as the most important selective gradients (Hubner et al. 2009), we manually keep them in the selection process by putting them in a whitelist.
syn.env <- as.data.frame(syn.env)
VIF <- unlist(cal.VIF(syn.env))
white.list <- c("Lat_gsprec_msrad", "Elev_gsmt_tmax_tmin")
todrop <- c()
while(any(VIF > 5)){
tmp <- VIF[-grep(names(VIF), pattern = paste(white.list, collapse = "|"))] # artificially exclude the variables related to mean precipitation and temperature from the selection process
todrop <- c(todrop, names(tmp)[which.max(tmp)]
) # add the variable with the largest VIF to a list
VIF <- unlist(cal.VIF(syn.env[,-which(colnames(syn.env) %in% todrop)]))
}
The variable selection removed six variables, including five synthetic variables:
* AWCh123_sl234
: Soil water capacity (depth 5cm-30cm)
* CEC_WWP_sl1234
: Soil cation exchange capacity + water capacity until wilting point (depth 0cm-30cm)
* Longitude
* CL_SND_sl1234
: Soil clay content + sand content (depth 0cm-30cm)
* gsmdr_iso
: Mean diurnal range + Isothermality in the growing season
* BD_sl12
: Soil bulk density (depth 0cm-5cm)
BD_sl12
was the last variable to be removed in the selection process. However, the BD_sl34
(soil bulk density at a depth of 15-30cm) was kept. In general, we may expect the characters of a shallow soil layer be more important than a deep soil layer because the deep layer supposedly influcence the plant growth only if roots reach it. Besides, in the VIFs shown below, the VIF of BD_sl12
is just 5.03, which is close to our threshold.
todrop
[1] "AWCh123_sl234" "CEC_WWP_sl1234" "Longitude" "CL_SND_sl1234"
[5] "gsmdr_iso" "BD_sl12"
sapply(cal.VIF(syn.env[,-which(colnames(syn.env) %in% todrop[-length(todrop)])]), function(x){x[length(x)]})
Lat_gsprec_msrad Aspect Slope Elev_gsmt_tmax_tmin
4.015157 1.181480 1.568493 2.642213
BD_sl12 BD_sl34 AWCh123_sl1 OCC_sl123
5.032382 3.887125 2.540763 4.676382
OCC_sl4 pH_sl1234 SLT_sl1234 gstsd
2.731634 2.674157 2.071673 1.538495
gscvprec
3.160113
In addition, if we droop BD_sl34
instead, we can also make all VIF < 5. Thus, I arbitrarily decided to remove BD_sl34
instead of BD_sl12
todrop <- gsub(todrop, pattern = "BD_sl12", replacement = "BD_sl34")
sapply(cal.VIF(syn.env[,-which(colnames(syn.env) %in% todrop)]), function(x){x[length(x)]})
Lat_gsprec_msrad Aspect Slope Elev_gsmt_tmax_tmin
3.921094 1.177832 1.473049 2.297442
BD_sl12 AWCh123_sl1 OCC_sl123 OCC_sl4
2.605299 2.208383 3.979243 2.624242
pH_sl1234 SLT_sl1234 gstsd gscvprec
2.618653 2.066102 1.518079 3.101438
Lastly, we rename the selected environmental varaibles with intelligible names as the names used in our publication.
orig.lab <- c("Lat_gsprec_msrad","Aspect","Slope","Elev_gsmt_tmax_tmin","BD_sl12","AWCh123_sl1","OCC_sl123",
"OCC_sl4","pH_sl1234","SLT_sl1234","gstsd","gscvprec")
new.lab <- c("Latitude+Rain+Solar_rad", "Aspect", "Slope", "Elevation+Temperature", "Soil_bulk_density", "Soil_water_capacity",
"Soil_carbon_content(0-15cm)", "Soil_carbon_content(30cm)", "Soil_pH", "Soil_silt_content", "StdDev_Temperature", "CoefVar_Rain")
selected.env <- syn.env[,-which(colnames(syn.env) %in% todrop)]
colnames(selected.env) <- new.lab[match(colnames(selected.env),orig.lab)]
selected.env.info <- syn.env.info[-which(colnames(syn.env) %in% todrop)]
names(selected.env.info) <- new.lab[match(names(selected.env.info),orig.lab)]
sum(!(sapply(selected.env.info, is.null)))
[1] 7
Since we take the first PCs as synthetic environmental variables, it is necessary to know how many percentages of variation of a collinear group is captured by its first PC. We can find that most of the synthetic variables explain more than 90% of variation of the corresponding collinear group.
Env | Exp_Var |
---|---|
Latitude+Rain+Solar_rad | 0.9100658 |
Elevation+Temperature | 0.9314568 |
Soil_bulk_density | 0.9468696 |
Soil_water_capacity | 0.9615446 |
Soil_carbon_content(0-15cm) | 0.8834362 |
Soil_pH | 0.9689760 |
Soil_silt_content | 0.9695062 |
We also look at the rotations that used to generate the synthetic environmental variables.
sapply(selected.env.info, function(x){
if(is.null(x)){return(NULL)}
return(x$rotation)
})
$`Latitude+Rain+Solar_rad`
Latitude gsprec gsmsrad
0.5884235 0.5565450 -0.5865283
$Aspect
NULL
$Slope
NULL
$`Elevation+Temperature`
Elevation gsmt gstmax gstmin
0.5084846 -0.5130144 -0.4907933 -0.4872182
$Soil_bulk_density
BLDFIE_sl1 BLDFIE_sl2
-0.7071068 -0.7071068
$Soil_water_capacity
AWCh1_sl1 AWCh2_sl1 AWCh3_sl1
-0.5748436 -0.5819211 -0.5752588
$`Soil_carbon_content(0-15cm)`
ORCDRC_sl1 ORCDRC_sl2 ORCDRC_sl3
0.5568221 0.5897761 0.5849046
$`Soil_carbon_content(30cm)`
NULL
$Soil_pH
PHIHOX_sl1 PHIHOX_sl2 PHIHOX_sl3 PHIHOX_sl4
0.4951438 0.5019901 0.5027992 0.5000314
$Soil_silt_content
SLTPPT_sl1 SLTPPT_sl2 SLTPPT_sl3 SLTPPT_sl4
-0.4994935 -0.5021920 -0.5030756 -0.4952014
$StdDev_Temperature
NULL
$CoefVar_Rain
NULL
Finally, we look at the correlation between all selected environmental variables with a heatmap.
Version | Author | Date |
---|---|---|
bc726c2 | Che-Wei Chang | 2021-10-22 |
sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.6 LTS
Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ggplot2_3.3.5 car_3.0-6 carData_3.0-3 dendextend_1.13.2
[5] workflowr_1.6.2
loaded via a namespace (and not attached):
[1] tidyselect_1.1.1 xfun_0.27 reshape2_1.4.3 purrr_0.3.4
[5] haven_2.3.1 colorspace_2.0-2 vctrs_0.3.8 generics_0.1.0
[9] htmltools_0.5.2 viridisLite_0.4.0 yaml_2.2.1 utf8_1.2.2
[13] rlang_0.4.12 later_1.3.0 pillar_1.6.4 withr_2.4.2
[17] foreign_0.8-71 glue_1.4.2 DBI_1.1.1 readxl_1.3.1
[21] plyr_1.8.6 lifecycle_1.0.1 stringr_1.4.0 cellranger_1.1.0
[25] munsell_0.5.0 gtable_0.3.0 zip_2.1.1 evaluate_0.14
[29] forcats_0.5.1 knitr_1.36 rio_0.5.16 fastmap_1.1.0
[33] httpuv_1.6.3 curl_4.3 fansi_0.5.0 Rcpp_1.0.7
[37] promises_1.2.0.1 scales_1.1.1 backports_1.1.5 abind_1.4-5
[41] farver_2.1.0 fs_1.5.0 gridExtra_2.3 hms_1.1.1
[45] digest_0.6.28 openxlsx_4.1.4 stringi_1.7.5 dplyr_1.0.7
[49] grid_3.6.0 rprojroot_1.3-2 tools_3.6.0 magrittr_2.0.1
[53] tibble_3.1.5 crayon_1.4.1 whisker_0.4 pkgconfig_2.0.3
[57] ellipsis_0.3.2 data.table_1.12.8 assertthat_0.2.1 rmarkdown_2.3
[61] viridis_0.5.1 R6_2.5.1 git2r_0.26.1 compiler_3.6.0