Cecret

Named after the beautiful Cecret Lake

Location: 40.570°N 111.622°W, Elevation: 9,875 feet (3,010 m), Hiking level: easy

(Image credit: Intermountain Healthcare)

Introduction

Cecret was originally developed by @erinyoung at the Utah Public Health Laboratory for SARS-CoV-2 sequencing with the ARTIC/Illumina hybrid library prep workflow for MiSeq data with protocols here and here. This Nextflow workflow, however, is flexible for many additional organisms and primer schemes as long as the reference genome is "small" and "good enough." In 2022, @tives82 added contributions for Monkeypox virus, including converting IDT's primer scheme to NC_063383.1 coordinates. We are grateful to everyone who has contributed to this repo.

The Nextflow workflow was built to work on Linux-based operating systems. Additional config options are needed for cloud batch usage.

The library preparation method greatly impacts which bioinformatic tools are recommended for creating a consensus sequence. For example, amplicon-based library preparation methods will require primer trimming and an elevated minimum depth for base-calling. Some bait-derived library preparation methods have a PCR amplification step, and PCR duplicates will need to be removed. This has added complexity and several (admittedly confusing) options to this workflow. Please submit an issue if/when you run into problems.

It is possible to use this workflow to simply annotate fastas generated from any workflow or downloaded from GISAID or NCBI. There are also options for multiple sequence alignment (MSA) and phylogenetic tree creation from the fasta files.

Cecret is also part of the staphb-toolkit.

Dependencies

Nextflow
Singularity or Docker - set the profile as singularity or docker during runtime

Usage

Cecret can use a sample sheet for input, with the sample name and read files separated by commas. The header must be sample,fastq_1,fastq_2. As a general rule, each row gives the identifier for the file(s), the file location(s), and, for anything other than paired-end fastq files, the input type.

Each row matches files with their processing needs:

paired-end reads: sample,read1.fastq.gz,read2.fastq.gz
single-end reads: sample,sample.fastq.gz,single
nanopore reads: sample,sample.fastq.gz,ont
fasta files: sample,sample.fasta,fasta
multifasta files: multifasta,multifasta.fasta,multifasta

Example sample sheet:

sample,fastq_1,fastq_2
SRR13957125,/home/eriny/sandbox/test_files/cecret/reads/SRR13957125_1.fastq.gz,/home/eriny/sandbox/test_files/cecret/reads/SRR13957125_2.fastq.gz
SRR13957170,/home/eriny/sandbox/test_files/cecret/reads/SRR13957170_1.fastq.gz,/home/eriny/sandbox/test_files/cecret/reads/SRR13957170_2.fastq.gz
SRR13957177S,/home/eriny/sandbox/test_files/cecret/single_reads/SRR13957177_1.fastq.gz,single
OQ255990.1,/home/eriny/sandbox/test_files/cecret/fastas/OQ255990.1.fasta,fasta
SRR22452244,/home/eriny/sandbox/test_files/cecret/nanopore/SRR22452244.fastq.gz,ont
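
The row rules above can be sketched as a small shell helper that writes a sample sheet for a directory of Illumina reads. This is only an illustration: the function name `make_sample_sheet` is hypothetical, and it assumes files are named `<sample>_1.fastq.gz` and `<sample>_2.fastq.gz` as in the example sheet. A file without a mate is written as a single-end row.

```shell
# Hypothetical helper (not part of Cecret): print a sample sheet for the
# fastq files in one directory.
# Assumption: reads are named <sample>_1.fastq.gz / <sample>_2.fastq.gz.
make_sample_sheet() {
    reads_dir=$1
    echo "sample,fastq_1,fastq_2"
    for r1 in "$reads_dir"/*_1.fastq.gz; do
        # If nothing matches, the glob stays literal; skip it.
        [ -e "$r1" ] || continue
        sample=$(basename "$r1" _1.fastq.gz)
        r2="$reads_dir/${sample}_2.fastq.gz"
        if [ -e "$r2" ]; then
            # Paired-end row: sample,read1,read2
            echo "$sample,$r1,$r2"
        else
            # No mate file found: single-end row, third column is the type
            echo "$sample,$r1,single"
        fi
    done
}

# Usage: make_sample_sheet /path/to/reads > SampleSheet.csv
```

Nanopore, fasta, and multifasta rows still need to be added by hand with `ont`, `fasta`, or `multifasta` in the third column.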

# using docker on samples specified in SampleSheet.csv
nextflow run UPHL-BioNGS/Cecret -profile docker --sample_sheet SampleSheet.csv

# using a config file containing all inputs
nextflow run UPHL-BioNGS/Cecret -c file.config
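
As a rough illustration of what such a config file might hold, the snippet below writes a minimal `file.config`. It assumes that `params.sample_sheet` corresponds to the `--sample_sheet` flag used above (the usual Nextflow convention for command-line params), and the `params.outdir` value shown is a placeholder rather than a documented default.

```shell
# Write a minimal config file with the inputs for one run.
# Assumption: params.sample_sheet is the config form of --sample_sheet.
# 'my_cecret_run' is a placeholder output directory, not Cecret's default.
cat > file.config <<'EOF'
params.sample_sheet = 'SampleSheet.csv'
params.outdir       = 'my_cecret_run'
EOF

# Then start the workflow (needs Nextflow plus Docker or Singularity):
#   nextflow run UPHL-BioNGS/Cecret -profile docker -c file.config
```

Keeping the inputs in a config file makes a run easier to repeat than a long command line.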

Results are roughly organized into 'params.outdir'/<analysis>/sample.result

A file summarizing all results is found in 'params.outdir'/cecret_results.csv and 'params.outdir'/cecret_results.txt.

Consensus sequences can be found in 'params.outdir'/consensus and end with *.consensus.fa.
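
Given the layout described above, a quick check after a run could look like the sketch below. The function names are hypothetical, and the only argument is whatever directory was used for `params.outdir`.

```shell
# Hypothetical helpers (not part of Cecret) for inspecting a finished run.
# $1 is the directory that was set as params.outdir.

# List every consensus fasta, sorted by path.
list_consensus() {
    outdir=$1
    find "$outdir/consensus" -type f -name '*.consensus.fa' | sort
}

# Count how many consensus sequences were written.
count_consensus() {
    list_consensus "$1" | wc -l | tr -d ' '
}

# Usage:
#   count_consensus my_cecret_run
#   head -n 5 my_cecret_run/cecret_results.csv
```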

Full workflow

Updating Cecret

nextflow pull UPHL-BioNGS/Cecret

Cecret has a weekly update schedule. Cecret's versions have three numbers: X.Y.Z. If the first number, X, changes, there has been a major modification: params may have changed, or subworkflows/channels may have been modified. If the second number, Y, changes, there has been a minor to moderate change, mainly bug fixes or changes to the defaults of params. If the last number, Z, changes, the workflow is basically the same and only the containers pulled for the workflow have been updated. Most of these updates keep Freyja, Nextclade, and Pangolin current for SARS-CoV-2 analysis.
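
To make the X.Y.Z rule concrete, here is a small sketch that compares two version strings and reports which kind of update separates them. The function name `classify_update` and the version numbers in the usage lines are made up for illustration.

```shell
# Hypothetical helper (not part of Cecret): classify the change between two
# X.Y.Z version strings according to the rule described above.
classify_update() {
    old=$1
    new=$2
    # Split each version into its three fields with parameter expansion.
    old_major=${old%%.*}
    new_major=${new%%.*}
    old_rest=${old#*.}
    new_rest=${new#*.}
    old_minor=${old_rest%%.*}
    new_minor=${new_rest%%.*}
    old_patch=${old_rest#*.}
    new_patch=${new_rest#*.}
    if [ "$old_major" != "$new_major" ]; then
        echo "major"        # params or subworkflows/channels may have changed
    elif [ "$old_minor" != "$new_minor" ]; then
        echo "minor"        # bug fixes or changed param defaults
    elif [ "$old_patch" != "$new_patch" ]; then
        echo "containers"   # only the pulled containers were updated
    else
        echo "none"
    fi
}

# Usage (illustrative version numbers):
#   classify_update 3.9.2 4.0.0
#   classify_update 3.9.2 3.9.3
```

A "major" result is the cue to recheck any custom config against the current params before rerunning.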

The main components of Cecret are:

aci - for depth estimation over amplicons (optional, set params.aci = true)
artic network - for aligning and consensus creation of nanopore reads
bbnorm - for normalizing reads (optional, set params.bbnorm = true)
bcftools - for variants
bwa - for aligning reads to the reference
fastp - for cleaning reads (optional, set params.cleaner = 'fastp')
fastqc - for QC metrics
freyja - for multiple SARS-CoV-2 lineage classifications
heatcluster - for visualizing SNP matrices generated via snp-dists
iqtree2 - for phylogenetic tree generation (optional, set params.relatedness = true)
igv-reports - for visualizing SNPs (optional, set params.igv_reports = true)
ivar - for calling variants and creating a consensus fasta; default primer trimmer
kraken2 - for read classification
mafft - for multiple sequence alignment (optional, relatedness must be set to "true")
minimap2 - an alternative to bwa (optional, set params.aligner = minimap2)
multiqc - summary of results
nextclade - for SARS-CoV-2 clade classification (optional: aligned fasta can be used from this analysis when relatedness is set to "true" and msa is set to "nextclade")
pangolin - for SARS-CoV-2 lineage classification
pango aliasor - for SARS-CoV-2 lineage tracing
phytreeviz - for visualizing phylogenetic trees
samtools - for QC metrics and sorting; optional primer trimmer; optional converting bam to fastq files; optional duplication marking
seqyclean - for cleaning reads
snp-dists - for relatedness determination (optional, relatedness must be set to "true")
vadr - for annotating fastas like NCBI

Turning off unneeded processes

It came to my attention that some processes (like bcftools) do not work consistently. They might also take longer than wanted, or might not be needed by the end user. Here are the processes that can be turned off, with their default values:

params.bcftools_variants = true           # vcf of variants
params.fastqc = true                      # qc on the sequencing reads
params.ivar_variants = true               # itemize the variants identified by ivar
params.samtools_stats = true              # stats about the bam files
params.samtools_coverage = true           # stats about the bam files
params.samtools_depth = true              # stats about the bam files
params.samtools_flagstat = true           # stats about the bam files
params.samtools_ampliconstats = true      # stats about the amplicons
params.samtools_plot_ampliconstats = true # images related to amplicon performance
params.kraken2 = false                    # used to classify reads and needs a corresponding params.kraken2_db and organism if not SARS-CoV-2
params.aci = false                        # coverage approximation of amplicons
params.igv_reports = false                # SNP IGV images
params.nextclade = true                   # SARS-CoV-2 clade determination
params.pangolin = true                    # SARS-CoV-2 lineage determination
params.pango_aliasor = true               # SARS-CoV-2 lineage tracing
params.freyja = true                      # multiple SARS-CoV-2 lineage determination
params.vadr = false                       # NCBI fasta QC
params.relatedness = false                # create multiple sequence alignments with input fastq and fasta files
params.snpdists = true                    # creates snp matrix from mafft multiple sequence alignment
params.iqtree2 = true                     # creates phylogenetic tree from mafft multiple sequence alignment
params.bamsnap = false                    # has been removed
params.rename = false                     # needs a corresponding sample file and will rename files for GISAID and NCBI submission
params.filter = false                     # takes the aligned reads and turns them back into fastq.gz files
params.multiqc = true                     # aggregates data into single report
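
One way to switch several of these off at once is to put the overrides in a small config file. The sketch below generates one; the helper name `disable_params` and the file name `skip.config` are hypothetical, while the param names come from the list above.

```shell
# Hypothetical helper (not part of Cecret): print one "params.<name> = false"
# line for every param name given as an argument.
disable_params() {
    for name in "$@"; do
        echo "params.$name = false"
    done
}

# Turn off three of the optional processes listed above.
disable_params bcftools_variants fastqc samtools_plot_ampliconstats > skip.config

# Then pass the file alongside the usual inputs:
#   nextflow run UPHL-BioNGS/Cecret -profile docker --sample_sheet SampleSheet.csv -c skip.config
```

Processes that are already off by default (for example params.kraken2) need no entry.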

