Skip to content

Example use of MetQy for biological analysis

Andrea Martinez Vernon edited this page Mar 12, 2018 · 10 revisions

We will walk through some of the MetQy functions in the context of few a biological analyses as examples.

We will:

  1. Use the query_genomes_to_modules function to

    1.1) identify potential methanogens by searching for genomes whose organism name contain "methano", and to

    1.2) evaluate metabolic processes (module completeness fraction, mcf) that are loosely related to anaerobic digestion (AD) processes across these genomes.

  2. Visualise the mcf for the AD modules across the selected genomes in KEGG using a heatmap.

  3. Investigate which are the genomes that have a low mcf based on the heatmap.

  4. Use one of these genomes (T04272) and investigate which genes are missing for that module to be complete.

  5. Visualise T04272's mcf for all the modules including module classes using a sunburst plot.

  6. Carry out an automated analysis and report using the analysis_genomes_module_output function. This will automatically analyse all the data generated in step 1) and also the data grouped by genus (given as a factor).] This function will:

    6.1) report the total number of data sets (genomes) and modules analysed,

    6.2) generate a heatmap of the mcf of all genomes and modules analysed,

    6.3) generate boxplots of the mcf across all genomes for each module,

    6.4) generate a scatter plot of the standard deviation of the mcf across all genomes for each module,

    6.5) identify any modules that have a constant (zero-variance) mcf across all genomes,

    6.6) group the genomes by genus and make a heatmap of the mean mcf for each module and genus,

    6.7) carry out a PCA analysis, showing the cumulative variance and a principal component (PC) plot,

    6.8) visualise the PC plot overlaying the genus grouping, and

    6.9) measure the within-group (genus) variance, using the mean Euclidean distance using the principal components resulting from the mcf matrix of all modules as a proxy for spread.


1) Identify potential methanogens and evaluate their genomes for select metabolic processes loosely relating to anaerobic digestion.

We are first interested in evaluating the mcf of organisms that have "methano" in the name across loosely AD-related modules, which have been "hand-picked".

# Load library
> library(MetQy)

# Create a folder to store output
> output_path <- "Example_1"
> dir.create(output_path)

# Modules loosely related to anaerobic digestion (AD)
> AD_modules   <- c("M00087","M00356","M00357","M00563","M00567","M00596","M00530","M00377")

# Let the function find the KEGG genomes that have "methano" in the name and calculate the module completeness fraction (mcf) for the AD-relevant modules
> query_output <- query_genomes_to_modules(GENOME_INFO = "methano",MODULE_ID = AD_modules,META_OUT = T,ADD_OUT = T)
# Set OUT_MODULE_NAME to TRUE to retrieve back mcf data with module names instead of IDs. 
# You can also use the module IDs to replace it with the NAME (contained in $METADATA)

# Retrieve information about matching organisms
> nrow(query_output$GENOME_INFO_DATA)
# [1] 86 - # of KEGG genomes that matched (partially) the organism name

> organisms   <- query_output$GENOME_INFO_DATA$ORGANISM
# There are "Candidate" organisms in the KEGG genome database. We'll remove those.
> length(grep("candidat",organisms,ignore.case = T))
#  [1] 4

# Remove candidate organisms
> candidate_index <- grep("candidat",organisms,ignore.case = T)

> query_output$GENOME_INFO_DATA[grep("candidat",organisms,ignore.case = T),]
#        ID ORG_ID                                                  ORGANISM
# 22 T00823    mpl               Candidatus Methanosphaerula palustris E1-9c
# 49 T02624    max              Candidatus Methanomethylophilus alvus Mx1201
# 51 T02692    mer Candidatus Methanomassiliicoccus intestinalis Issoire-Mx1
# 57 T03540   mear                    Candidatus Methanoplasma termitum MpT1

> organisms   <- organisms[-candidate_index]
> mcf_matrix  <- query_output$MATRIX[-candidate_index,]
> genome_info <- query_output$GENOME_INFO_DATA[-candidate_index,]
> module_info <- query_output$METADATA[-candidate_index,]

# Retrieve the GENUS (first word) by removing the rest of the text
> genus <- gsub(pattern = "\\s.+$",replacement = "",organisms)
> length(unique(genus))
#  [1] 27

2) Visualise the mcf for the user-specified modules across the selected genomes in KEGG using a heatmap

# Quick look at mcfs
> plot_heatmap(mcf_matrix,ORDER_MATRIX = T) # set ORDER_MATRIX to TRUE so that the data is ordered according to hierarchical dendogram

> plot_heatmap(mcf_matrix,ORDER_MATRIX = T,Filename = "Example_1/heatmap.png",Height = 4,Width = 5.4) # set ORDER_MATRIX to TRUE so that the data is ordered according to hierarchical dendogram

3) Investigate which are the genomes that have a low mcf based on the heatmap

The heatmap shows that moduel M00567 (M00567: Methanogenesis, CO2 => methane) has a widespread "good" coverage-mcf >= 0.75 in 96% of genomes. We can investigate which genomes those are, as well as get more information about the modules.

> names(module_info)
# [1] "MODULE_ID"   "MODULE_NAME" "NAME_SHORT"  "CLASS_I"     "CLASS_II"    "CLASS_III"   "DEFINITION"  "OPTIONAL"   

> module_info[which(module_info$MODULE_ID=="M00567"),1:6]
#   MODULE_ID                    MODULE_NAME                     NAME_SHORT        CLASS_I          CLASS_II          CLASS_III
# 7    M00567 Methanogenesis, CO2 => methane Methanogenesis, CO2 => methane Pathway module Energy metabolism Methane metabolism

> small_M00567_index <- which(mcf_matrix[,7]<0.75)
> mcf_matrix[small_M00567_index,7]
# T03209 T03230 T04272 
#   0.00   0.00   0.25  # mcf

# Look at the corresponding genomes (conserved order)
> genome_info[small_M00567_index,]
#        ID ORG_ID                      ORGANISM
# T03209   bmet    Bacillus methanolicus MGA3
# T03230    amq Amycolatopsis methanolica 239
# T04272   marc Methanogenic archaeon ISO4-H5

4) Find out which genes are missing for module M00567 to be complete for genome T04272

Investigate which genes are missing for M00567 to be complete in genome T04272 (Methanogenic archaeon ISO4-H5). T04272 has an mcf for module M00567 of 0.25. This means that only a quarter of the blocks needed for the module to be complete are present.

> T04272_M00567_missing <- query_missingGenes_from_module(GENOME ="T04272",MODULE_ID = "M00567",PRINT_TO_SCREEN = F)
> print(T04272_M00567_missing)
#   block_No PRESENT                                                        BLOCK_DEF                                    MISSING_GENES
# 1        1       0 *K00200*&*K00201*&*K00202*&*K00203*&(*K00205*|*K11260*|*K00204*) K00200;K00201;K00202;K00203;K00205;K11260;K00204
# 2        2       0                                                         *K00672*                                           K00672
# 3        3       0                                                         *K01499*                                           K01499
# 4        4       0                                                *K00319*|*K13942*                                    K00319;K13942
# 5        5       0                                                         *K00320*                                           K00320
# 6        6       0            *K00577*&*K00578*&*K00579*&*K00580*&*K00581*&*K00584*        K00577;K00578;K00579;K00580;K00581;K00584
# 7        7       1                                             K00399&K00401&K00402                                                 
# 8        8       1                                             K03388&K03389&K03390                                                 

As expected, only 2 out of 8 module definition blocks are present (the bottom two). T04272 has not been annotated with any of The K numbers flanket by asterisks (*). Note that T04272 doesn't need to have ALL the genes listed on the MISSING_GENES column; it only needs those to have a complete block according to the block definition (BLOCK_DEF, logical expression). For instance, in order to have a complete first block, T04272 requires the following KEGG orthologs: ALL THESE: K00200, K00201, K00202 and K00203 AND ONE OF THESE: K00205, K11260 or K00204.


5) Use a sunburst plot to visualise T04272's mcf for all the modules including module information.

We can also look more closely to T04272's mcf with module information to have more of a context and potentially gain more insight.

# RETREIVE GENOME "T04272"'s mcf
> T04272_mcf <- mcf_matrix[which(rownames(mcf_matrix)=="T04272"),]

# MODULE INFORMATION
> AD_module_info <- query_output$METADATA[c(4:6,3)]
> names(AD_module_info)
# [1] "CLASS_I"    "CLASS_II"   "CLASS_III"  "NAME_SHORT"

### Manually edit text for nicer output
# long text entry [1] "Methanogenesis, methylamine/dimethylamine/trimethylamine => methane"
> AD_module_info[6,4] <- "Methanogenesis, methylamine => methane"
# Replace spaces with new lines to make labels neater
for(c in 1:ncol(AD_module_info)) AD_module_info[,c] <- gsub(" ","\n",AD_module_info[,c])

### Manually edit text for nicer output
> AD_module_info[1,2] <- "Carbohydrate\nand lipid\nmetabolism"
> AD_module_info[1,3] <- "Fatty acid\nmetabolism"
> AD_module_info[8,4] <- "Dissimilatory\nsulfate reduction,\nsulfate => H2S"

# MAKE SUNBURST PLOT ----
# DISPLAY PLOT
> sunburst_output <- plot_sunburst(AD_module_info,fill_by = T04272_mcf,legend_name = "mcf\n",Filename = "",WIDTH = 8)
# SAVE PLOT TO FILE
> sunburst_output <- plot_sunburst(AD_module_info,fill_by = T04272_mcf,legend_name = "mcf\n",Filename = "Example_1/T04272_sunburst.png",WIDTH = 4,HEIGHT = 4)

6) Carry out an automated analysis

We can use the analysis_genomes_module_output function to analyse all the data generated in step 1) and also the data grouped by genus (given as a factor), generating a report in the process.

This functions allows some analyses to be made based on grouping the data as defined by factor(s). Thus by defining FACTOR (which can be one factor or multiple, if it's a list), the mean mcf and the standard deviatioin of the mcf are visualised as a heatmap with a row for every group. Here we provide the genus as a grouping factor. Furthermore, the function generates a PC plot with data colour-code according to the grouping. Finally, the function calculates the mean Euclidean distances from the PC space for each group as a proxy measurement for within group variation (only for groups with more than one member).

In summary, the report will:

1) report the total number of data sets (genomes) and modules analysed,
2) generate a heatmap of the mcf of all genomes and modules analysed,
3) generate boxplots of the mcf across all genomes for each module,
4) generate a scatter plot of the standard deviation of the mcf across all genomes for each module,
5) identify any modules that have a constant (zero-variance) mcf across all genomes,
6) group the genomes by genus and make a heatmap of the mean mcf for each module and genus,
7) carry out a PCA analysis on the mcf matrix and plot the cumulative variance and a scatter plot of PCs,
8) visualise the PC plot overlaying the genus grouping, and
9) measure the within-group (genus) variance, using the mean Euclidean distance of the PCs as a proxy for spread.

The figures included in the report are shown below by section. Refer to the report (generated in the next code block) and the function documentation for more details.

## Use the analysis function to process the output from 'query_genomes_to_modules' and generate a LaTex report
# Use the genus names we have previously generated to group our data.
> useFactor <- list("genusFactor" = genus) # By naming the list item ("genusFactor"), we control the naming convention for the files generated. It would be FACTOR by default. 

> analysis_output <- analysis_genomes_module_output(FRACTION_MATRIX = mcf_matrix,outPath = output_path,figType = ".png",FACTOR = useFactor)

  # GENERATES THE FOLLOWING FILES IN THE OUTPUT FOLDER:
  #     PCA/
  #         module_mean_dist_output_FACTOR.rda
  #         pca_plot.png
  #         pca_var.png
  #         pca.rda
  #         plot_mean_dist_genusFactor.png
  #         plot_scatter_genusFactor.png
  #     module_allOrgs_sd_boxplot.png
  #     module_allOrgs_sd.png
  #     module_allOrgs.png
  #     module_constant_presence.txt
  #     module_mean_genusFactor.png
  #     module_sd_genusFactor.png
  #     module_output_plots.rda
  #     report.tex

# The files with 'genusFactor' in the name were generated from using the FACTOR argument, as we labelled the list in the definition.

REPORT FIGURES RELATED TO THE STEPS OUTLINES ABOVE

2-4) mcf heatmap, scatter plot and boxplot

Here are the module_allOrgs.png, module_allOrgs_var.png and module_allOrgs_var_boxplot.png figures, showing the heatmap (same as before) and the standard deviation across modules both as a scatter plot and as a boxplot. See module_constant_presence.txt for information about modules that might be absent, present or with the same value across all the genomes analysed for a specific module.


6) Heatmaps showing the mcf and mcf standard deviation for every group

The figures module_mean_genusFactor.png and module_sd_genusFactor.png show the mean mcf and the standard deviation (sd) for every genus (row), respectively.


7) PCA: cumulative variance and PC plot of all the data

A PCA was done on the mcf matrix. The files pca_var.png and pca_plot.png, show the cumulative variation accounted for by the PCs and the scatter plot of the first two PCs, respectively.


8) Genus-level: PC plot with overlaying groups and mean Euclidean distance

File: plot_scatter_genusFactor.png


9) Genus-level: mean Euclidean dustance as a measure of within-group variance

File: plot_mean_dist_genusFactor.png

Clone this wiki locally