-
Notifications
You must be signed in to change notification settings - Fork 9
Example use of MetQy for biological analysis
We will walk through some of the MetQy functions in the context of few a biological analyses as examples.
We will:
-
Use the query_genomes_to_modules function to
1.1) identify potential methanogens by searching for genomes whose organism name contain "methano", and to
1.2) evaluate metabolic processes (module completeness fraction, mcf) that are loosely related to anaerobic digestion (AD) processes across these genomes.
-
Visualise the mcf for the AD modules across the selected genomes in KEGG using a heatmap.
-
Investigate which are the genomes that have a low mcf based on the heatmap.
-
Visualise T04272's mcf for all the modules including module classes using a sunburst plot.
-
Carry out an automated analysis and report using the analysis_genomes_module_output function. This will automatically analyse all the data generated in step 1) and also the data grouped by genus (given as a factor).] This function will:
6.1) report the total number of data sets (genomes) and modules analysed,
6.2) generate a heatmap of the mcf of all genomes and modules analysed,
6.3) generate boxplots of the mcf across all genomes for each module,
6.4) generate a scatter plot of the standard deviation of the mcf across all genomes for each module,
6.5) identify any modules that have a constant (zero-variance) mcf across all genomes,
6.6) group the genomes by genus and make a heatmap of the mean mcf for each module and genus,
6.7) carry out a PCA analysis, showing the cumulative variance and a principal component (PC) plot,
6.8) visualise the PC plot overlaying the genus grouping, and
1) Identify potential methanogens and evaluate their genomes for select metabolic processes loosely relating to anaerobic digestion.
We are first interested in evaluating the mcf of organisms that have "methano" in the name across loosely AD-related modules, which have been "hand-picked".
# Load library
> library(MetQy)
# Create a folder to store output
> output_path <- "Example_1"
> dir.create(output_path)
# Modules loosely related to anaerobic digestion (AD)
> AD_modules <- c("M00087","M00356","M00357","M00563","M00567","M00596","M00530","M00377")
# Let the function find the KEGG genomes that have "methano" in the name and calculate the module completeness fraction (mcf) for the AD-relevant modules
> query_output <- query_genomes_to_modules(GENOME_INFO = "methano",MODULE_ID = AD_modules,META_OUT = T,ADD_OUT = T)
# Set OUT_MODULE_NAME to TRUE to retrieve back mcf data with module names instead of IDs.
# You can also use the module IDs to replace it with the NAME (contained in $METADATA)
# Retrieve information about matching organisms
> nrow(query_output$GENOME_INFO_DATA)
# [1] 86 - # of KEGG genomes that matched (partially) the organism name
> organisms <- query_output$GENOME_INFO_DATA$ORGANISM
# There are "Candidate" organisms in the KEGG genome database. We'll remove those.
> length(grep("candidat",organisms,ignore.case = T))
# [1] 4
# Remove candidate organisms
> candidate_index <- grep("candidat",organisms,ignore.case = T)
> query_output$GENOME_INFO_DATA[grep("candidat",organisms,ignore.case = T),]
# ID ORG_ID ORGANISM
# 22 T00823 mpl Candidatus Methanosphaerula palustris E1-9c
# 49 T02624 max Candidatus Methanomethylophilus alvus Mx1201
# 51 T02692 mer Candidatus Methanomassiliicoccus intestinalis Issoire-Mx1
# 57 T03540 mear Candidatus Methanoplasma termitum MpT1
> organisms <- organisms[-candidate_index]
> mcf_matrix <- query_output$MATRIX[-candidate_index,]
> genome_info <- query_output$GENOME_INFO_DATA[-candidate_index,]
> module_info <- query_output$METADATA[-candidate_index,]
# Retrieve the GENUS (first word) by removing the rest of the text
> genus <- gsub(pattern = "\\s.+$",replacement = "",organisms)
> length(unique(genus))
# [1] 27
2) Visualise the mcf for the user-specified modules across the selected genomes in KEGG using a heatmap
# Quick look at mcfs
> plot_heatmap(mcf_matrix,ORDER_MATRIX = T) # set ORDER_MATRIX to TRUE so that the data is ordered according to hierarchical dendogram
> plot_heatmap(mcf_matrix,ORDER_MATRIX = T,Filename = "Example_1/heatmap.png",Height = 4,Width = 5.4) # set ORDER_MATRIX to TRUE so that the data is ordered according to hierarchical dendogram

The heatmap shows that moduel M00567
(M00567: Methanogenesis, CO2 => methane) has a widespread "good" coverage-mcf >= 0.75 in 96% of genomes.
We can investigate which genomes those are, as well as get more information about the modules.
> names(module_info)
# [1] "MODULE_ID" "MODULE_NAME" "NAME_SHORT" "CLASS_I" "CLASS_II" "CLASS_III" "DEFINITION" "OPTIONAL"
> module_info[which(module_info$MODULE_ID=="M00567"),1:6]
# MODULE_ID MODULE_NAME NAME_SHORT CLASS_I CLASS_II CLASS_III
# 7 M00567 Methanogenesis, CO2 => methane Methanogenesis, CO2 => methane Pathway module Energy metabolism Methane metabolism
> small_M00567_index <- which(mcf_matrix[,7]<0.75)
> mcf_matrix[small_M00567_index,7]
# T03209 T03230 T04272
# 0.00 0.00 0.25 # mcf
# Look at the corresponding genomes (conserved order)
> genome_info[small_M00567_index,]
# ID ORG_ID ORGANISM
# T03209 bmet Bacillus methanolicus MGA3
# T03230 amq Amycolatopsis methanolica 239
# T04272 marc Methanogenic archaeon ISO4-H5
Investigate which genes are missing for M00567
to be complete in genome T04272
(Methanogenic archaeon ISO4-H5).
T04272
has an mcf for module M00567
of 0.25. This means that only a quarter of the blocks needed for the module to be complete are present.
> T04272_M00567_missing <- query_missingGenes_from_module(GENOME ="T04272",MODULE_ID = "M00567",PRINT_TO_SCREEN = F)
> print(T04272_M00567_missing)
# block_No PRESENT BLOCK_DEF MISSING_GENES
# 1 1 0 *K00200*&*K00201*&*K00202*&*K00203*&(*K00205*|*K11260*|*K00204*) K00200;K00201;K00202;K00203;K00205;K11260;K00204
# 2 2 0 *K00672* K00672
# 3 3 0 *K01499* K01499
# 4 4 0 *K00319*|*K13942* K00319;K13942
# 5 5 0 *K00320* K00320
# 6 6 0 *K00577*&*K00578*&*K00579*&*K00580*&*K00581*&*K00584* K00577;K00578;K00579;K00580;K00581;K00584
# 7 7 1 K00399&K00401&K00402
# 8 8 1 K03388&K03389&K03390
As expected, only 2 out of 8 module definition blocks are present (the bottom two).
T04272
has not been annotated with any of The K numbers flanket by asterisks (*).
Note that T04272
doesn't need to have ALL the genes listed on the MISSING_GENES
column; it only needs those to have a complete block according to the block definition (BLOCK_DEF
, logical expression). For instance, in order to have a complete first block, T04272
requires the following KEGG orthologs:
ALL THESE: K00200
, K00201
, K00202
and K00203
AND ONE OF THESE: K00205
, K11260
or K00204
.
We can also look more closely to T04272
's mcf with module information to have more of a context and potentially gain more insight.
# RETREIVE GENOME "T04272"'s mcf
> T04272_mcf <- mcf_matrix[which(rownames(mcf_matrix)=="T04272"),]
# MODULE INFORMATION
> AD_module_info <- query_output$METADATA[c(4:6,3)]
> names(AD_module_info)
# [1] "CLASS_I" "CLASS_II" "CLASS_III" "NAME_SHORT"
### Manually edit text for nicer output
# long text entry [1] "Methanogenesis, methylamine/dimethylamine/trimethylamine => methane"
> AD_module_info[6,4] <- "Methanogenesis, methylamine => methane"
# Replace spaces with new lines to make labels neater
for(c in 1:ncol(AD_module_info)) AD_module_info[,c] <- gsub(" ","\n",AD_module_info[,c])
### Manually edit text for nicer output
> AD_module_info[1,2] <- "Carbohydrate\nand lipid\nmetabolism"
> AD_module_info[1,3] <- "Fatty acid\nmetabolism"
> AD_module_info[8,4] <- "Dissimilatory\nsulfate reduction,\nsulfate => H2S"
# MAKE SUNBURST PLOT ----
# DISPLAY PLOT
> sunburst_output <- plot_sunburst(AD_module_info,fill_by = T04272_mcf,legend_name = "mcf\n",Filename = "",WIDTH = 8)
# SAVE PLOT TO FILE
> sunburst_output <- plot_sunburst(AD_module_info,fill_by = T04272_mcf,legend_name = "mcf\n",Filename = "Example_1/T04272_sunburst.png",WIDTH = 4,HEIGHT = 4)

We can use the analysis_genomes_module_output function to analyse all the data generated in step 1) and also the data grouped by genus (given as a factor), generating a report in the process.
This functions allows some analyses to be made based on grouping the data as defined by factor(s).
Thus by defining FACTOR
(which can be one factor or multiple, if it's a list), the mean mcf and the standard deviatioin of the mcf are visualised as a heatmap with a row for every group.
Here we provide the genus as a grouping factor.
Furthermore, the function generates a PC plot with data colour-code according to the grouping.
Finally, the function calculates the mean Euclidean distances from the PC space for each group as a proxy measurement for within group variation
(only for groups with more than one member).
In summary, the report will:
1) report the total number of data sets (genomes) and modules analysed,
2) generate a heatmap of the mcf of all genomes and modules analysed,
3) generate boxplots of the mcf across all genomes for each module,
4) generate a scatter plot of the standard deviation of the mcf across all genomes for each module,
5) identify any modules that have a constant (zero-variance) mcf across all genomes,
6) group the genomes by genus and make a heatmap of the mean mcf for each module and genus,
7) carry out a PCA analysis on the mcf matrix and plot the cumulative variance and a scatter plot of PCs,
8) visualise the PC plot overlaying the genus grouping, and
9) measure the within-group (genus) variance, using the mean Euclidean distance of the PCs as a proxy for spread.
The figures included in the report are shown below by section. Refer to the report (generated in the next code block) and the function documentation for more details.
## Use the analysis function to process the output from 'query_genomes_to_modules' and generate a LaTex report
# Use the genus names we have previously generated to group our data.
> useFactor <- list("genusFactor" = genus) # By naming the list item ("genusFactor"), we control the naming convention for the files generated. It would be FACTOR by default.
> analysis_output <- analysis_genomes_module_output(FRACTION_MATRIX = mcf_matrix,outPath = output_path,figType = ".png",FACTOR = useFactor)
# GENERATES THE FOLLOWING FILES IN THE OUTPUT FOLDER:
# PCA/
# module_mean_dist_output_FACTOR.rda
# pca_plot.png
# pca_var.png
# pca.rda
# plot_mean_dist_genusFactor.png
# plot_scatter_genusFactor.png
# module_allOrgs_sd_boxplot.png
# module_allOrgs_sd.png
# module_allOrgs.png
# module_constant_presence.txt
# module_mean_genusFactor.png
# module_sd_genusFactor.png
# module_output_plots.rda
# report.tex
# The files with 'genusFactor' in the name were generated from using the FACTOR argument, as we labelled the list in the definition.
Here are the module_allOrgs.png
, module_allOrgs_var.png
and module_allOrgs_var_boxplot.png
figures,
showing the heatmap (same as before) and the standard deviation across modules both as a scatter plot and as a boxplot.
See module_constant_presence.txt for information about modules that might be
absent, present or with the same value across all the genomes analysed for a specific module.



The figures module_mean_genusFactor.png
and module_sd_genusFactor.png
show the mean mcf and the standard deviation (sd) for every genus (row), respectively.


A PCA was done on the mcf matrix. The files pca_var.png
and pca_plot.png
, show the cumulative variation accounted for by the PCs and the scatter plot of the first two PCs, respectively.
File: plot_scatter_genusFactor.png
File: plot_mean_dist_genusFactor.png
