Making Microreact-compatible data from GHRU metadata and outputs

Introduction

The metadata and outputs from bioinformatics processes provide useful metrics to contextualise the genomic differences seen within a phylogeny. The ghruR package provides functions to combine these data into a format that is compatible with Microreact. This vignette illustrates how to use these functions.

How to make a Microreact metadata dataframe

Phylogenetic tree

To reduce the data to only those samples represented in the tree, the tip labels of the tree can be used as the list of sample ids.

library(ape)
tree <- ape::read.tree('input_data/microreact_single_country_e.coli.tre')
sample_ids <- tree$tip.label
head(sample_ids)
[1] "G18763197" "G18760379" "G18755128" "G18763800" "G18761051" "G18763319"
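
The effect of restricting the data to the tip labels can be sketched in base R. The data frame, its columns, and the extra id below are made up for illustration, and the filtering inside ghruR may be implemented differently.

```r
# Hypothetical tip labels, standing in for tree$tip.label
example_tip_labels <- c("G18763197", "G18760379", "G18755128")

# Hypothetical metadata table containing one sample that is not in the tree
all_metadata <- data.frame(
  id = c("G18763197", "G18760379", "G18755128", "G18700001"),
  Country = "Philippines",
  stringsAsFactors = FALSE
)

# Keep only the rows whose id is a tip label
tree_metadata <- all_metadata[all_metadata$id %in% example_tip_labels, ]

# Tip labels that have no metadata row (none in this toy example)
tips_without_metadata <- setdiff(example_tip_labels, all_metadata$id)
```

Checking both directions is useful: rows without a tip are simply dropped, whereas tips without a row need attention before the upload.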

Make dataframe for country

The ghruR function that makes a Microreact dataframe uses Google Sheets as its data source. It uses the master sheet list to find the country-specific Google Sheet that contains the URLs of all the relevant data sources, and then reads these data to:

Pull data from the Epidemiological Metadata
Merge with the QC and Assembly Data (‘Sample id’, ‘Bactinspector species’, ‘Quast num contigs’, ‘Quast N50’, ‘Quast Total length’)
Merge with the MLST data (‘ST’, ‘profile’) and add a ST_autocolour column
Read the acquired AMR prediction data and reshape it so that two columns are added per gene (gene_name and gene_name__colour). If the gene was assembled (yes, yes_nonunique), the best match is recorded in the gene_name column, otherwise ‘-’ is recorded. In the gene_name__colour column, ‘red’ is added if the gene was assembled (yes, yes_nonunique), otherwise ‘lime’ is added
Read the point AMR prediction data and reshape it so that two columns are added per mutation (mutation_name and mutation_name__colour). If the gene was assembled (yes, yes_nonunique) and the mutation was found, ‘yes’ is recorded in the mutation_name column, otherwise ‘no’ is recorded. In the mutation_name__colour column, ‘red’ is added if the mutation was found, otherwise ‘lime’ is added
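
The reshaping described for the acquired AMR data can be sketched in base R as follows. The input layout, column names, and gene names are assumptions made for illustration; the actual ghruR implementation and the real Google Sheet columns may differ.

```r
# Hypothetical long-format AMR results: one row per sample and gene.
# All names and values here are invented for the example.
amr_long <- data.frame(
  id = c("S1", "S1", "S2", "S2"),
  gene = c("gene_a", "gene_b", "gene_a", "gene_b"),
  assembled = c("yes", "no", "yes_nonunique", "yes"),
  best_match = c("gene_a_v1", "gene_b_v1", "gene_a_v2", "gene_b_v2"),
  stringsAsFactors = FALSE
)

# A gene counts as present when it was assembled (yes or yes_nonunique)
found <- amr_long$assembled %in% c("yes", "yes_nonunique")
amr_long$value <- ifelse(found, amr_long$best_match, "-")
amr_long$colour <- ifelse(found, "red", "lime")

# Build the wide table: two columns per gene (gene and gene__colour).
# The toy data has a row for every sample and gene combination.
amr_wide <- data.frame(id = unique(amr_long$id), stringsAsFactors = FALSE)
for (gene in unique(amr_long$gene)) {
  rows <- amr_long[amr_long$gene == gene, ]
  idx <- match(amr_wide$id, rows$id)
  amr_wide[[gene]] <- rows$value[idx]
  amr_wide[[paste0(gene, "__colour")]] <- rows$colour[idx]
}
```

The point mutation data would follow the same pattern, with ‘yes’ or ‘no’ in place of the best match.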

library(ghruR)
list_of_countries_sheet_url <- 'https://docs.google.com/spreadsheets/d/1NKO01Yo9gHnSNVkR-3FduTxH_ajSvJ57VfUTP6qRNOE'
country <- 'Philippines'
user_email <- 'anthony.underwood@cgps.group'
metadata <- ghruR::make_microreact_metadata_for_country(
  country,
  'Escherichia coli',
  user_email,
  list_of_countries_sheet_url = list_of_countries_sheet_url,
  sample_ids = sample_ids
)
[1] "anthony.underwood@cgps.group"
[1] "anthony.underwood@cgps.group"
[1] "anthony.underwood@cgps.group"
[1] "anthony.underwood@cgps.group"
[1] "anthony.underwood@cgps.group"
[1] "anthony.underwood@cgps.group"
[1] "anthony.underwood@cgps.group"

Adding a row for the reference

Microreact requires a row for each label in the tree. If you supply a list of sample ids, the function assumes that one (and only one) of the ids is a non-GHRU id and uses it to add a row containing just the reference id. If you need to add this row manually, the procedure below can be followed.

ref_dataframe <- data.frame('id' = 'NZ_HG941718.1')
metadata_with_ref <- plyr::rbind.fill(metadata, ref_dataframe)
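
To show what the row binding above produces, here is a base R sketch of the same idea on a toy table: columns that the reference row lacks are filled with NA. The example metadata is invented, and this is only an approximation of what plyr::rbind.fill does.

```r
# Hypothetical two-sample metadata table
example_metadata <- data.frame(
  id = c("G18763197", "G18760379"),
  Country = c("Philippines", "Philippines"),
  stringsAsFactors = FALSE
)

# The reference row only has an id
ref_row <- data.frame(id = "NZ_HG941718.1", stringsAsFactors = FALSE)

# Add every column the reference row lacks, filled with NA
missing_cols <- setdiff(names(example_metadata), names(ref_row))
ref_row[missing_cols] <- NA

# Bind the rows with the columns in the same order
example_with_ref <- rbind(example_metadata, ref_row[names(example_metadata)])
```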

Writing out the table

Finally, the metadata can be written to a TSV file that can be uploaded to Microreact together with the Newick tree file.

write.table(metadata, 'output_data/single_country_microreact_metadata.tsv', sep = '\t', row.names = FALSE)
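
Because Microreact requires a row for each label in the tree, it can be worth reading the file back and checking it before uploading. The sketch below does this on a toy table written to a temporary file. The data are invented, and the quote = FALSE and na = "" arguments are an optional choice here rather than something the call above uses.

```r
# Hypothetical tip labels and metadata, including the reference id
example_tip_labels <- c("G18763197", "G18760379", "NZ_HG941718.1")
example_metadata <- data.frame(
  id = c("G18763197", "G18760379", "NZ_HG941718.1"),
  Country = c("Philippines", "Philippines", NA),
  stringsAsFactors = FALSE
)

# Write to a temporary TSV file, leaving missing values as empty cells
tsv_path <- tempfile(fileext = ".tsv")
write.table(example_metadata, tsv_path, sep = "\t",
            row.names = FALSE, quote = FALSE, na = "")

# Read the file back and list any tip labels without a row
written <- read.delim(tsv_path, stringsAsFactors = FALSE, na.strings = "")
unmatched_tips <- setdiff(example_tip_labels, written$id)
```

An empty unmatched_tips vector means every tip label has a row in the file.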

Add rows to metadata for missing IDs in the Epi spreadsheets

If the tree contains leaves that have not been specified in the Epidemiological Data spreadsheets, rows for them must be added to the Microreact metadata, otherwise the upload to Microreact will not work. The code below can be used to fill in the missing data.

incomplete_metadata <- metadata

# Tip labels that have no matching id in the first (id) column of the metadata
missing_samples <- setdiff(sample_ids, incomplete_metadata[[1]])
sorted_missing_samples <- tibble::tibble(id = sort(missing_samples))

print(paste("These samples are missing from Epi data:", paste(sorted_missing_samples$id, collapse = ", ")))
      
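determine_county <- function(x) {
  # The numeric range of the id from the fourth character onwards gives the country
  country_digits <- suppressWarnings(as.integer(substr(x, 4, nchar(x))))
  if (is.na(country_digits)) {
    # Non-numeric suffix, for example a reference id
    return(NA_character_)
  }
  if (country_digits > 1 && country_digits < 250001) {
    return("Colombia")
  } else if (country_digits > 250000 && country_digits < 500001) {
    return("India")
  } else if (country_digits > 500000 && country_digits < 750001) {
    return("Nigeria")
  } else if (country_digits > 750000 && country_digits < 999999) {
    return("Philippines")
  } else {
    # Always return a character value so that purrr::map_chr works
    return(NA_character_)
  }
}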

library(magrittr)  # provides the %>% pipe used below

country_colours <- ghruR::get_country_colours()
country_locations <- ghruR::get_country_locations()

# Look up the country for each missing id, then attach its colour and location
missing_samples_with_info <- sorted_missing_samples %>%
  dplyr::mutate(Country = purrr::map_chr(id, determine_county)) %>%
  dplyr::left_join(country_colours) %>%
  dplyr::rename(Country__colour = Colour) %>%
  dplyr::left_join(country_locations)

# Append the new rows; columns absent from them are filled with NA
complete_metadata <- plyr::rbind.fill(incomplete_metadata, missing_samples_with_info)
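
The range lookup in determine_county can also be written in vectorised form with base R's cut, which avoids calling the function once per id. This is only a sketch: country_from_id is a hypothetical name, the id "G18100000" is invented, and the breaks are chosen to reproduce the same integer ranges as the comparisons above.

```r
# Hypothetical vectorised alternative to determine_county
country_from_id <- function(ids) {
  # Digits from the fourth character onwards; non-numeric suffixes become NA
  digits <- suppressWarnings(as.integer(substr(ids, 4, nchar(ids))))
  # Right-closed intervals: (1, 250000], (250000, 500000], and so on
  countries <- cut(
    digits,
    breaks = c(1, 250000, 500000, 750000, 999998),
    labels = c("Colombia", "India", "Nigeria", "Philippines")
  )
  as.character(countries)
}

example_countries <- country_from_id(c("G18763197", "G18100000", "NZ_HG941718.1"))
# "Philippines" "Colombia" NA
```

Ids outside every range, or without a numeric suffix such as the reference id, come back as NA.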

These steps are wrapped in the ghruR::fill_missing_metadata function, so you do not need to run the commands individually.

incomplete_metadata <- metadata
complete_metadata <- ghruR::fill_missing_metadata(incomplete_metadata, sample_ids)