The metadata and outputs from bioinformatics processes are useful metrics for contextualising the genomic differences seen within a phylogeny. The ghruR package provides functions to combine these data into a format that is compatible with Microreact. This vignette illustrates some examples of how to use these functions.
To reduce the data to only those samples represented in the tree, the tree's tip labels can be used as the list of sample ids.
library(ape)
tree <- ape::read.tree('input_data/microreact_single_country_e.coli.tre')
sample_ids <- tree$tip.label
head(sample_ids)
[1] "G18763197" "G18760379" "G18755128" "G18763800" "G18761051" "G18763319"
The ghruR function that makes a Microreact data frame uses Google Sheets as the source of the data. It uses the master sheet list to find the country-specific Google sheet that contains the URLs of all the relevant data sources, and reads these data to build the data frame.
library(ghruR)
list_of_countries_sheet_url <- 'https://docs.google.com/spreadsheets/d/1NKO01Yo9gHnSNVkR-3FduTxH_ajSvJ57VfUTP6qRNOE'
country <- 'Philippines'
user_email <- 'anthony.underwood@cgps.group'
metadata <- ghruR::make_microreact_metadata_for_country(country, 'Escherichia coli', user_email, list_of_countries_sheet_url = list_of_countries_sheet_url, sample_ids = sample_ids)
[1] "anthony.underwood@cgps.group"
[1] "anthony.underwood@cgps.group"
[1] "anthony.underwood@cgps.group"
[1] "anthony.underwood@cgps.group"
[1] "anthony.underwood@cgps.group"
[1] "anthony.underwood@cgps.group"
[1] "anthony.underwood@cgps.group"
Microreact requires a row for each label in the tree. If you supply a list of sample ids, the function assumes that one (and only one) of the ids is a non-GHRU id and uses it to add a row containing only the reference id. If you need to add this row manually, the procedure below can be followed.
ref_dataframe <- data.frame('id' = 'NZ_HG941718.1')
metadata_with_ref <- plyr::rbind.fill(metadata, ref_dataframe)
Finally, the metadata can be written to a TSV file that can be uploaded to Microreact together with the Newick tree file.
write.table(metadata, 'output_data/single_country_microreact_metadata.tsv', sep = '\t', row.names = FALSE)
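Before uploading, it can be worth reading the file back to confirm that it is tab-separated with a single header row. The sketch below does this in base R on a toy data frame written to a temporary file; the column names, the temporary path, and the choice to write missing values as empty cells are assumptions made for illustration, not ghruR or Microreact requirements.

```r
# Toy metadata (an assumption for illustration; real column names may differ)
example_metadata <- data.frame(
  id = c("G18763197", "NZ_HG941718.1"),
  Country = c("Philippines", NA),
  stringsAsFactors = FALSE
)

# Write a tab-separated file without row names; NA is written as an empty cell
tsv_path <- tempfile(fileext = ".tsv")
write.table(example_metadata, tsv_path, sep = "\t", row.names = FALSE,
            quote = FALSE, na = "")

# Read the file back to confirm the header and the number of rows
round_trip <- read.delim(tsv_path, stringsAsFactors = FALSE, na.strings = "")
print(names(round_trip))
print(nrow(round_trip))
```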
If the tree contains leaves that are not specified in the Epidemiological Data spreadsheets, they must be added to the Microreact metadata; otherwise the upload to Microreact will fail. For a single country, the code below can be used to fill in the missing data.
incomplete_metadata <- metadata

# Collect the tip labels that have no row in the metadata (ids are in column 1)
missing_samples <- c()
for (i in seq_along(sample_ids)){
  if (length(incomplete_metadata[incomplete_metadata[,1] == sample_ids[i],1]) == 0){
    missing_samples <- append(missing_samples, sample_ids[i])
  }
}
sorted_missing_samples <- tibble::tibble(id = sort(missing_samples))

print(paste("These samples are missing from Epi data:", paste(sorted_missing_samples$id, collapse = ", ")))
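The loop above compares each tip label with the metadata ids one at a time. As a minimal sketch, the same comparison can be expressed with base R's setdiff(); the toy vectors and the assumption that the ids live in a column named id are illustrative only.

```r
# Toy stand-ins (assumptions): tip labels as returned by tree$tip.label,
# and a metadata data frame whose `id` column holds the sample ids
tip_labels <- c("G18763197", "G18760379", "NZ_HG941718.1")
metadata_example <- data.frame(id = c("G18763197", "G18760379"),
                               stringsAsFactors = FALSE)

# Tip labels that have no matching metadata row
tips_without_row <- sort(setdiff(tip_labels, metadata_example$id))

# Metadata rows that do not correspond to any tip in the tree
rows_without_tip <- setdiff(metadata_example$id, tip_labels)

print(tips_without_row)
```

Checking both directions also reveals metadata rows with no matching tip, which the loop does not report.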
# Infer the country from the numeric part of a sample id (characters 4 onwards)
determine_county <- function(x) {
  country_digits <- as.integer(substr(x, 4, nchar(x)))
  if (!is.na(country_digits)) {
    if (country_digits > 1 && country_digits < 250001) {
      return("Colombia")
    } else if (country_digits > 250000 && country_digits < 500001) {
      return("India")
    } else if (country_digits > 500000 && country_digits < 750001) {
      return("Nigeria")
    } else if (country_digits > 750000 && country_digits < 999999) {
      return("Philippines")
    }
  }
  # Non-numeric or out-of-range ids: return a character NA so that
  # purrr::map_chr always receives a character value
  return(NA_character_)
}
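The range lookup above handles one id at a time. A hypothetical vectorised variant (lookup_country is an illustrative name, not a ghruR function) can be sketched with base R's cut(), using the same integer ranges and the same assumption that the numeric part of an id starts at the fourth character.

```r
# Hypothetical vectorised version of the range lookup (not part of ghruR)
lookup_country <- function(ids) {
  # Non-numeric suffixes become NA; the coercion warning is suppressed
  digits <- suppressWarnings(as.integer(substr(ids, 4, nchar(ids))))
  # Right-closed intervals: (1, 250000], (250000, 500000],
  # (500000, 750000], (750000, 999998]
  as.character(cut(digits,
                   breaks = c(1, 250000, 500000, 750000, 999998),
                   labels = c("Colombia", "India", "Nigeria", "Philippines")))
}

countries <- lookup_country(c("G18763197", "G18100000", "NZ_HG941718.1"))
print(countries)
# [1] "Philippines" "Colombia"    NA
```

Ids outside every range, or without a numeric suffix such as the reference id, come back as NA rather than raising an error.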
library(magrittr)  # provides the %>% pipe
country_colours <- ghruR::get_country_colours()
country_locations <- ghruR::get_country_locations()

# Add country, colour and location columns to the missing samples
missing_samples_with_info <- sorted_missing_samples %>%
  dplyr::mutate(Country = purrr::map_chr(id, determine_county)) %>%
  dplyr::left_join(country_colours) %>%
  dplyr::rename(Country__colour = Colour) %>%
  dplyr::left_join(country_locations)

complete_metadata <- plyr::rbind.fill(incomplete_metadata, missing_samples_with_info)
These steps are wrapped in the function ghruR::fill_missing_metadata, so you do not need to run all of the commands individually.
incomplete_metadata <- metadata
complete_metadata <- ghruR::fill_missing_metadata(incomplete_metadata, sample_ids)