Cecret

Named after the beautiful Cecret Lake

Location: 40.570°N 111.622°W, Elevation: 9,875 feet (3,010 m), Hiking level: easy

(Image credit: Intermountain Healthcare)

Introduction

Cecret was originally developed by @erinyoung at the Utah Public Health Laboratory for SARS-CoV-2 sequencing with the artic/Illumina hybrid library prep workflow for MiSeq data with protocols here and here. This Nextflow workflow, however, is flexible enough for many additional organisms and primer schemes as long as the reference genome is "small" and "good enough." In 2022, @tives82 contributed support for Monkeypox virus, including converting IDT's primer scheme to NC_063383.1 coordinates. We are grateful to everyone who has contributed to this repo.

The Nextflow workflow was built to work on Linux-based operating systems. Additional config options are needed for cloud batch usage.

The library preparation method greatly impacts which bioinformatic tools are recommended for creating a consensus sequence. For example, amplicon-based library preparation methods require primer trimming and an elevated minimum depth for base-calling. Some bait-derived library preparation methods have a PCR amplification step, and PCR duplicates need to be removed. This has added complexity and several (admittedly confusing) options to this workflow. Please submit an issue if you run into problems.

It is possible to use this workflow to simply annotate fastas generated from any workflow or downloaded from GISAID or NCBI. There are also options for multiple sequence alignment (MSA) and phylogenetic tree creation from the fasta files.

Cecret is also part of the staphb-toolkit.

Dependencies

Usage

Cecret can also use a sample sheet for input, with the sample name and reads separated by commas. The header must be sample,fastq_1,fastq_2. As a general rule, each row gives the identifier for the file(s), the file location(s), and, if the input is not paired-end fastq files, the input type.

Rows match files with their processing needs.

  • paired-end reads: sample,read1.fastq.gz,read2.fastq.gz
  • single-end reads: sample,sample.fastq.gz,single
  • nanopore reads: sample,sample.fastq.gz,ont
  • fasta files: sample,sample.fasta,fasta
  • multifasta files: multifasta,multifasta.fasta,multifasta

Example sample sheet:

sample,fastq_1,fastq_2
SRR13957125,/home/eriny/sandbox/test_files/cecret/reads/SRR13957125_1.fastq.gz,/home/eriny/sandbox/test_files/cecret/reads/SRR13957125_2.fastq.gz
SRR13957170,/home/eriny/sandbox/test_files/cecret/reads/SRR13957170_1.fastq.gz,/home/eriny/sandbox/test_files/cecret/reads/SRR13957170_2.fastq.gz
SRR13957177S,/home/eriny/sandbox/test_files/cecret/single_reads/SRR13957177_1.fastq.gz,single
OQ255990.1,/home/eriny/sandbox/test_files/cecret/fastas/OQ255990.1.fasta,fasta
SRR22452244,/home/eriny/sandbox/test_files/cecret/nanopore/SRR22452244.fastq.gz,ont

# using docker on samples specified in SampleSheet.csv
nextflow run UPHL-BioNGS/Cecret -profile docker --sample_sheet SampleSheet.csv

# using a config file containing all inputs
nextflow run UPHL-BioNGS/Cecret -c file.config
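
For a directory of paired-end reads, a sample sheet in the format described above could be assembled with a short script. This is only a sketch: the function names and the `_1.fastq.gz` / `_2.fastq.gz` suffixes are assumptions chosen to match the example rows, not part of Cecret itself.

```python
import csv
from pathlib import Path

# Header required by the sample sheet format described above.
HEADER = ["sample", "fastq_1", "fastq_2"]


def paired_end_rows(reads_dir, r1_suffix="_1.fastq.gz", r2_suffix="_2.fastq.gz"):
    """Return one [sample, read1, read2] row per complete pair in reads_dir.

    The suffixes are assumptions; adjust them to your file naming.
    """
    rows = []
    for r1 in sorted(Path(reads_dir).glob(f"*{r1_suffix}")):
        sample = r1.name[: -len(r1_suffix)]
        r2 = r1.with_name(sample + r2_suffix)
        if r2.exists():  # skip samples that are missing their second read
            rows.append([sample, str(r1), str(r2)])
    return rows


def write_sample_sheet(rows, out_path="SampleSheet.csv"):
    """Write the header and rows as a comma-separated sample sheet."""
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, lineterminator="\n")
        writer.writerow(HEADER)
        writer.writerows(rows)
    return out_path
```

Rows for other input types (for example `["sample", "sample.fastq.gz", "ont"]`) can be appended to the list before writing, since the third column doubles as the type field.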

Results are roughly organized into 'params.outdir'/<analysis>/sample.result

A file summarizing all results is found in 'params.outdir'/cecret_results.csv and 'params.outdir'/cecret_results.txt.

Consensus sequences can be found in 'params.outdir'/consensus and end with *.consensus.fa.
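
The output locations above can be collected after a run with a few lines of Python. This is a minimal sketch that relies only on the file names stated above; the helper names are hypothetical, and no particular column names in cecret_results.csv are assumed.

```python
import csv
from pathlib import Path


def load_summary(outdir):
    """Read 'outdir'/cecret_results.csv into a list of dicts, one per row."""
    with open(Path(outdir) / "cecret_results.csv", newline="") as handle:
        return list(csv.DictReader(handle))


def consensus_files(outdir):
    """List the *.consensus.fa files under 'outdir'/consensus, sorted by name."""
    return sorted((Path(outdir) / "consensus").glob("*.consensus.fa"))
```

Here `outdir` is whatever directory was given as params.outdir for the run.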

Full workflow

Updating Cecret

nextflow pull UPHL-BioNGS/Cecret

Cecret has a weekly update schedule. Cecret's versions have three numbers: X.Y.Z. If the first number, X, changes, there has been a major modification: params may have changed, or subworkflows/channels may have been modified. If the second number, Y, changes, there has been a minor to moderate change, mainly bug fixes or changes to the defaults of params. If the last number, Z, changes, the workflow is basically the same and only the containers pulled for the workflow have been updated. Most of these updates keep Freyja, NextClade, and Pangolin current for SARS-CoV-2 analysis.
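
The X.Y.Z rule can be expressed as a small helper for deciding how carefully to review an update. This is an illustrative sketch only: the function names and the labels "major", "minor", "containers", and "none" are assumptions made here, and the version strings in the usage note are made up.

```python
def parse_version(version):
    """Split an 'X.Y.Z' string (optionally prefixed with 'v') into three ints."""
    parts = version.lstrip("v").split(".")
    if len(parts) != 3:
        raise ValueError(f"expected X.Y.Z, got {version!r}")
    return tuple(int(part) for part in parts)


def describe_update(old, new):
    """Label the kind of change between two versions using the X.Y.Z rule."""
    old_v, new_v = parse_version(old), parse_version(new)
    if new_v[0] != old_v[0]:
        return "major"       # params or subworkflows/channels may have changed
    if new_v[1] != old_v[1]:
        return "minor"       # bug fixes or changed param defaults
    if new_v[2] != old_v[2]:
        return "containers"  # only the pulled containers were updated
    return "none"
```

For example, `describe_update("1.2.3", "1.2.4")` returns `"containers"`, while `describe_update("1.2.3", "2.0.0")` returns `"major"`.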

The main components of Cecret are:

  • aci - for depth estimation over amplicons (optional, set params.aci = true)
  • artic network - for aligning and consensus creation of nanopore reads
  • bbnorm - for normalizing reads (optional, set params.bbnorm = true)
  • bcftools - for variants
  • bwa - for aligning reads to the reference
  • fastp - for cleaning reads (optional, set params.cleaner = 'fastp')
  • fastqc - for QC metrics
  • freyja - for multiple SARS-CoV-2 lineage classifications
  • heatcluster - for visualizing SNP matrices generated via snp-dists
  • iqtree2 - for phylogenetic tree generation (optional, set params.relatedness = true)
  • igv-reports - for visualizing SNPs (optional, set params.igv_reports = true)
  • ivar - for calling variants and creating a consensus fasta; default primer trimmer
  • kraken2 - for read classification
  • mafft - for multiple sequence alignment (optional, relatedness must be set to "true")
  • minimap2 - an alternative to bwa (optional, set params.aligner = 'minimap2')
  • multiqc - summary of results
  • nextclade - for SARS-CoV-2 clade classification (optional: aligned fasta can be used from this analysis when relatedness is set to "true" and msa is set to "nextclade")
  • pangolin - for SARS-CoV-2 lineage classification
  • pango aliasor - for SARS-CoV-2 lineage tracing
  • phytreeviz - for visualizing phylogenetic trees
  • samtools - for QC metrics and sorting; optional primer trimmer; optional converting bam to fastq files; optional duplication marking
  • seqyclean - for cleaning reads
  • snp-dists - for relatedness determination (optional, relatedness must be set to "true")
  • vadr - for annotating fastas like NCBI

Turning off unneeded processes

It came to my attention that some processes (like bcftools) do not work consistently. They might also take longer than wanted, or might not be needed by the end user at all. Here are the processes that can be turned off, with their default values:

params.bcftools_variants = true           # vcf of variants
params.fastqc = true                      # qc on the sequencing reads
params.ivar_variants = true               # itemize the variants identified by ivar
params.samtools_stats = true              # stats about the bam files
params.samtools_coverage = true           # stats about the bam files
params.samtools_depth = true              # stats about the bam files
params.samtools_flagstat = true           # stats about the bam files
params.samtools_ampliconstats = true      # stats about the amplicons
params.samtools_plot_ampliconstats = true # images related to amplicon performance
params.kraken2 = false                    # used to classify reads and needs a corresponding params.kraken2_db and organism if not SARS-CoV-2
params.aci = false                        # coverage approximation of amplicons
params.igv_reports = false                # SNP IGV images
params.nextclade = true                   # SARS-CoV-2 clade determination
params.pangolin = true                    # SARS-CoV-2 lineage determination
params.pango_aliasor = true               # SARS-CoV-2 lineage tracing
params.freyja = true                      # multiple SARS-CoV-2 lineage determination
params.vadr = false                       # NCBI fasta QC
params.relatedness = false                # create multiple sequence alignments with input fastq and fasta files
params.snpdists = true                    # creates snp matrix from mafft multiple sequence alignment
params.iqtree2 = true                     # creates phylogenetic tree from mafft multiple sequence alignment
params.bamsnap = false                    # has been removed
params.rename = false                     # needs a corresponding sample file and will rename files for GISAID and NCBI submission
params.filter = false                     # takes the aligned reads and turns them back into fastq.gz files
params.multiqc = true                     # aggregates data into single report
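
One way to apply a handful of these switches is to write them into a config file and pass it with `-c`. The helper below is a hypothetical sketch, not part of Cecret: it renders a dictionary of overrides as `params.<name> = <value>` lines, and the choice of which processes to disable is only an example.

```python
def render_overrides(overrides):
    """Render {param_name: value} as 'params.<name> = <value>' config lines."""
    lines = []
    for name, value in sorted(overrides.items()):
        if isinstance(value, bool):
            rendered = "true" if value else "false"  # lowercase booleans
        elif isinstance(value, str):
            rendered = f"'{value}'"                  # quote string values
        else:
            rendered = str(value)
        lines.append(f"params.{name} = {rendered}")
    return "\n".join(lines) + "\n"


# Example: skip two of the optional processes listed above.
config_text = render_overrides({"bcftools_variants": False, "fastqc": False})
```

Saving `config_text` to a file such as file.config and running `nextflow run UPHL-BioNGS/Cecret -c file.config` would then apply the overrides, assuming the param names match the list above.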
