rmukh/abanalysis

AbAnalysis (Python 3 only)

AbAnalysis: Antibody Sequence Analysis Pipeline

Description: AbAnalysis is an antibody sequence analysis pipeline designed for processing, filtering, and annotating next-generation sequencing (NGS) antibody repertoires. It integrates germline assignment, junction identification, UAID parsing, and consensus sequence generation into a reproducible and automated workflow.

Key Features:

🔹 Paired-end read merging with PANDAseq

🔹 Germline assignment and junction identification using IgBLAST

🔹 Automatic filtering of non-functional sequences and sequencing artifacts

🔹 Frameshift correction for indel errors

🔹 UAID (Unique Antibody Identifier) parsing with configurable length (e.g., -u 20)

🔹 Data storage and querying via MongoDB integration

🔹 Consensus sequence generation using MUSCLE and Biopython

🔹 Cross-platform support with precompiled IgBLAST binaries for Linux and macOS

Workflow Summary:

Merge paired-end reads using PANDAseq.

Process merged reads through the IgBLAST-based analysis pipeline.

Filter out non-functional or artifact sequences; correct indels.

Parse UAIDs (Unique Antibody Identifiers) and populate sequence metadata.

Store annotated sequences in MongoDB for downstream analysis.

Bin sequences by UAID, discard singletons, and add germline variable gene sequences as tie-breakers.

Generate consensus sequences with MUSCLE and Biopython.

Re-analyze consensus sequences and store final results in a separate MongoDB database.
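
The binning step in the workflow above (group by UAID, discard singletons) can be sketched in a few lines of Python. This is only an illustration, not the pipeline's actual code: the record layout and the field names `seq_id` and `uaid` are assumptions, and `bin_by_uaid` is a hypothetical helper.

```python
from collections import defaultdict


def bin_by_uaid(records, min_size=2):
    """Group records by their 'uaid' field and drop bins smaller than min_size.

    With the default min_size=2, singleton bins are discarded.
    """
    bins = defaultdict(list)
    for record in records:
        bins[record["uaid"]].append(record)
    return {uaid: seqs for uaid, seqs in bins.items() if len(seqs) >= min_size}


# Hypothetical annotated records; real AbAnalysis output has more fields.
records = [
    {"seq_id": "r1", "uaid": "AAAC"},
    {"seq_id": "r2", "uaid": "AAAC"},
    {"seq_id": "r3", "uaid": "GGTT"},  # singleton, will be discarded
]

bins = bin_by_uaid(records)
print({uaid: [r["seq_id"] for r in seqs] for uaid, seqs in bins.items()})
# {'AAAC': ['r1', 'r2']}
```

Each surviving bin would then be aligned and collapsed into a consensus sequence in the next step.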

Usage

To run AbAnalysis on a single FASTA or FASTQ file:
python ab_analysis.py -i <input-file> -o <output-directory> -t <temp-directory>

To iteratively run AbAnalysis on all files in an input directory:
python ab_analysis.py -i <input-directory> -o <output-directory> -t <temp-directory>

Additional options

-m, --merge Input directory should contain paired FASTQ (or gzipped FASTQ) files. Paired files will be merged with PANDAseq prior to processing with AbAnalysis.

-u N, --uaid N Sequences contain a unique antibody ID (uaid) of length N. The uaid will be parsed and added to the JSON output.

-s, --species Select the species from which the input sequences are derived. Supported options are 'human', 'mouse', and 'macaque'. Default is 'human'.

-n, --next_seq Use if the sequences were generated on a NextSeq sequencer. Multiple lane files from the same sample will be merged.
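
When driving the pipeline from another script, the options above can be assembled into a command line programmatically. The sketch below uses only the flags documented in this README; `build_command` is a hypothetical helper, and it assumes `-s` takes the species name as its value.

```python
def build_command(input_path, output_dir, temp_dir,
                  uaid_length=None, species=None, merge=False, next_seq=False):
    """Assemble an ab_analysis.py argument list from the documented options."""
    cmd = ["python", "ab_analysis.py",
           "-i", input_path, "-o", output_dir, "-t", temp_dir]
    if merge:
        cmd.append("-m")                    # merge paired FASTQ files with PANDAseq first
    if uaid_length is not None:
        cmd += ["-u", str(uaid_length)]     # parse a UAID of this length
    if species is not None:
        cmd += ["-s", species]              # 'human', 'mouse', or 'macaque'
    if next_seq:
        cmd.append("-n")                    # merge NextSeq lane files per sample
    return cmd


cmd = build_command("reads/", "output/", "temp/",
                    uaid_length=20, species="mouse", merge=True)
print(" ".join(cmd))
# python ab_analysis.py -i reads/ -o output/ -t temp/ -m -u 20 -s mouse
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` to launch the run.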

Helper scripts

Two helper scripts are included:
batch_merge.py performs PANDAseq merging on a directory of paired FASTQ (or gzipped FASTQ) files.
mongoimport.py iteratively imports a directory of JSON files into a MongoDB database.
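
For orientation, importing a directory of JSON files might look roughly like the sketch below. It is not the code in mongoimport.py: it assumes each file holds one JSON object per line, and the helper names, database name, and collection name are all hypothetical. The only MongoDB call used is pymongo's `Collection.insert_many`.

```python
import json
from pathlib import Path


def iter_documents(json_dir):
    """Yield one document per non-empty line of every *.json file in json_dir."""
    for path in sorted(Path(json_dir).glob("*.json")):
        with path.open() as handle:
            for line in handle:
                line = line.strip()
                if line:
                    yield json.loads(line)


def import_directory(json_dir, collection, batch_size=1000):
    """Insert all documents into `collection` in batches; return the total count."""
    batch, total = [], 0
    for doc in iter_documents(json_dir):
        batch.append(doc)
        if len(batch) >= batch_size:
            collection.insert_many(batch)
            total += len(batch)
            batch = []
    if batch:  # flush the final partial batch
        collection.insert_many(batch)
        total += len(batch)
    return total


# Usage with pymongo (requires a running MongoDB server; names are placeholders):
#   from pymongo import MongoClient
#   collection = MongoClient()["abanalysis"]["sequences"]
#   import_directory("output/", collection)
```

Batching keeps memory use bounded when a run produces many large JSON files.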

Requirements

Python 3 >= 3.7
biopython >= 1.76

batch_merge.py requires PANDAseq (https://github.com/neufeld/pandaseq)
mongoimport.py requires MongoDB >= 2.6 (http://www.mongodb.org/) and pymongo >= 3.7

Notes

You don't need to install igblastn. The binaries are included in this repository.
AbAnalysis should work on Windows (x86, x64), Linux, and OS X.

Almost all of the requirements can be installed with pip or Anaconda.

Compiling PANDAseq binaries for Windows requires some expertise. Precompiled OS X and Linux versions are available on the official GitHub releases page (https://github.com/neufeld/pandaseq/releases).

For the Python 2 version, use the python2 branch or the original repository.
