AbAnalysis: Antibody Sequence Analysis Pipeline
Description: AbAnalysis is an antibody sequence analysis pipeline designed for processing, filtering, and annotating next-generation sequencing (NGS) antibody repertoires. It integrates germline assignment, junction identification, UAID parsing, and consensus sequence generation into a reproducible and automated workflow.
Key Features:
🔹 Paired-end read merging with PANDAseq
🔹 Germline assignment and junction identification using IgBLAST
🔹 Automatic filtering of non-functional sequences and sequencing artifacts
🔹 Frameshift correction for indel errors
🔹 UAID (Unique Antibody Identifier) parsing with configurable length (e.g., -u 20)
🔹 Data storage and querying via MongoDB integration
🔹 Consensus sequence generation using MUSCLE and Biopython
🔹 Cross-platform support with precompiled IgBLAST binaries for Linux and macOS
Workflow Summary:
1. Merge paired-end reads using PANDAseq.
2. Process merged reads through the IgBLAST-based analysis pipeline.
3. Filter out non-functional or artifact sequences; correct indels.
4. Parse UAIDs (Unique Antibody Identifiers) and populate sequence metadata.
5. Store annotated sequences in MongoDB for downstream analysis.
6. Bin sequences by UAID, discard singletons, and add germline variable gene sequences as tie-breakers.
7. Generate consensus sequences with MUSCLE and Biopython.
8. Re-analyze consensus sequences and store final results in a separate MongoDB database.
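The UAID parsing, binning, and consensus steps above can be pictured with a small, dependency-free sketch. This is not the AbAnalysis code. It assumes the UAID is the first N bases of each read, that the reads in a bin are already aligned to equal length (the pipeline uses MUSCLE for that), and that the germline sequence counts as one extra vote per column. The function names split_uaid, bin_by_uaid, and majority_consensus are hypothetical.

```python
from collections import Counter, defaultdict


def split_uaid(read, uaid_length):
    """Split a read into (uaid, remainder).

    Assumption: the UAID is the first `uaid_length` bases of the read.
    """
    return read[:uaid_length], read[uaid_length:]


def bin_by_uaid(reads, uaid_length):
    """Group reads by UAID and drop bins that hold a single read."""
    bins = defaultdict(list)
    for read in reads:
        uaid, body = split_uaid(read, uaid_length)
        bins[uaid].append(body)
    return {uaid: seqs for uaid, seqs in bins.items() if len(seqs) > 1}


def majority_consensus(aligned, germline=None):
    """Column-wise majority vote over equal-length, pre-aligned sequences.

    If given, `germline` adds one vote per column, so it decides ties
    such as a two-read bin that disagrees at one position.
    """
    voters = list(aligned) + ([germline] if germline else [])
    if len({len(seq) for seq in voters}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*voters))


reads = [
    "AAAA" + "ACGTAC",
    "AAAA" + "ACGTTC",  # disagrees with the read above at one position
    "CCCC" + "GGGGGG",  # singleton UAID, discarded
]
bins = bin_by_uaid(reads, uaid_length=4)
consensus = {
    uaid: majority_consensus(seqs, germline="ACGTAC")
    for uaid, seqs in bins.items()
}
print(consensus)  # {'AAAA': 'ACGTAC'}
```

With only two reads in the bin, the disagreeing column would be a tie; the germline vote resolves it, which is the role the workflow gives to the germline variable gene sequence.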
To run AbAnalysis on a single FASTA or FASTQ file:
python ab_analysis.py -i <input-file> -o <output-directory> -t <temp-directory>
To iteratively run AbAnalysis on all files in an input directory:
python ab_analysis.py -i <input-directory> -o <output-directory> -t <temp-directory>
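When several runs need to be scripted, the command line can be assembled from Python. The sketch below uses only the flags documented in this README; the helper names build_command and run are hypothetical, and it assumes ab_analysis.py is in the current working directory.

```python
import shlex
import subprocess


def build_command(input_path, output_dir, temp_dir, uaid_length=None,
                  species=None, merge=False, next_seq=False,
                  script="ab_analysis.py"):
    """Build the argument list for one ab_analysis.py run."""
    cmd = ["python", script, "-i", input_path, "-o", output_dir, "-t", temp_dir]
    if merge:
        cmd.append("-m")
    if uaid_length is not None:
        cmd += ["-u", str(uaid_length)]
    if species is not None:
        cmd += ["-s", species]
    if next_seq:
        cmd.append("-n")
    return cmd


def run(cmd):
    """Run the command and raise CalledProcessError on a non-zero exit."""
    subprocess.run(cmd, check=True)


cmd = build_command("reads/", "output/", "temp/", uaid_length=20, species="human")
print(" ".join(shlex.quote(part) for part in cmd))
# python ab_analysis.py -i reads/ -o output/ -t temp/ -u 20 -s human
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.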
-m, --merge
Input directory should contain paired FASTQ (or gzipped FASTQ) files. Paired files will be merged with PANDAseq prior to processing with AbAnalysis.
-u N, --uaid N
Sequences contain a Unique Antibody Identifier (UAID) of length N. The UAID will be parsed and added to the JSON output.
-s, --species
Select the species from which the input sequences are derived. Supported options are 'human', 'mouse', and 'macaque'. Default is 'human'.
-n, --next_seq
Use if the sequences were generated on a NextSeq sequencer. Multiple lane files from the same sample will be merged.
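To illustrate what merging lane files for the -n option involves, the sketch below groups files by sample and read direction. It assumes the standard Illumina naming pattern (for example SampleName_S1_L001_R1_001.fastq.gz). This README does not state how AbAnalysis matches lane files, so the pattern and the name group_lanes are assumptions.

```python
import re
from collections import defaultdict

# Assumed Illumina-style name: <sample>_L<lane>_<R1|R2>_001.fastq[.gz]
LANE_PATTERN = re.compile(
    r"^(?P<sample>.+)_L(?P<lane>\d{3})_(?P<read>R[12])_001\.fastq(?:\.gz)?$"
)


def group_lanes(filenames):
    """Map (sample, read) to the sorted list of lane files that share it."""
    groups = defaultdict(list)
    for name in filenames:
        match = LANE_PATTERN.match(name)
        if match is None:
            continue  # not a lane file under the assumed pattern; skipped here
        groups[(match["sample"], match["read"])].append(name)
    return {key: sorted(files) for key, files in groups.items()}


files = [
    "donor1_S1_L002_R1_001.fastq.gz",
    "donor1_S1_L001_R1_001.fastq.gz",
    "donor1_S1_L001_R2_001.fastq.gz",
    "donor1_S1_L002_R2_001.fastq.gz",
    "notes.txt",
]
for key, lane_files in group_lanes(files).items():
    print(key, lane_files)
```

Each group can then be concatenated in lane order, so that the R1 and R2 files of a sample keep their reads in matching order before paired-end merging.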
Two helper scripts are included:
batch_merge.py performs PANDAseq merging on a directory of paired FASTQ (or gzipped FASTQ) files.
mongoimport.py iteratively imports a directory of JSON files into a MongoDB database.
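As a rough picture of the import step, the sketch below loads every JSON file in a directory into a collection in batches. It is not the code of mongoimport.py. It assumes each file holds one JSON document per line, and the names read_json_lines, import_directory, and connect are hypothetical. The only pymongo calls used are MongoClient and Collection.insert_many.

```python
import json
from pathlib import Path


def read_json_lines(path):
    """Yield one dict per non-empty line (assumed newline-delimited JSON)."""
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def import_directory(json_dir, collection, batch_size=1000):
    """Insert all records from *.json files in json_dir; return the count."""
    total = 0
    for path in sorted(Path(json_dir).glob("*.json")):
        batch = []
        for record in read_json_lines(path):
            batch.append(record)
            if len(batch) == batch_size:
                collection.insert_many(batch)
                total += len(batch)
                batch = []
        if batch:  # flush the final partial batch of this file
            collection.insert_many(batch)
            total += len(batch)
    return total


def connect(db_name, collection_name, host="localhost", port=27017):
    """Return a pymongo collection; pymongo is imported only when needed."""
    from pymongo import MongoClient
    return MongoClient(host, port)[db_name][collection_name]
```

With a MongoDB server running, a call such as import_directory("output/", connect("my_db", "my_collection")) would load the files; the database and collection names here are placeholders.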
Requirements:
Python >= 3.7
biopython >= 1.76
batch_merge.py requires PANDAseq (https://github.com/neufeld/pandaseq)
mongoimport.py requires MongoDB >= 2.6 (http://www.mongodb.org/) and pymongo >= 3.7
You don't need to install igblastn. The binaries are included in this repository.
AbAnalysis should work on Windows (x86, x64), Linux, and OS X.
Most of the requirements can be installed with pip or Anaconda.
Compiling PANDAseq binaries for Windows requires some build expertise. Precompiled versions for OS X and Linux are available under the Releases tab of the official GitHub repository (https://github.com/neufeld/pandaseq/releases).
For the Python 2 version, use the python2 branch or the original repository.