EvolCat-Python: A Python Suite for Evolutionary and Comparative Genomics

EvolCat-Python is a suite of command-line tools that simplifies common tasks in evolutionary biology and comparative genomics. Converted from a battle-tested collection of Perl scripts, this project builds on Biopython and other modern libraries to provide a robust, user-friendly toolkit for researchers.

Key Features

EvolCat-Python streamlines complex bioinformatics tasks. The suite covers the following areas:

🧬 Format Conversion: Effortlessly convert between common formats like GenBank, FASTA, and PHYLIP (gb2fasta.py, fas2phy.py).
🚀 BLAST Result Processing: Parse raw BLAST output into clean, usable tables for downstream analysis (parse_blast_text.py, blast_to_table.py).
🌳 Phylogenetic Pipeline Assistance: Prepare sequences, curate alignments, and calculate distance matrices for external tree-building software (nogaps.py, calculate_k2p.py).
🔬 Sequence Analysis & Extraction: Extract CDS regions from GenBank files, translate them, and analyze specific nucleotide positions (extract_cds_region.py, gbCDS.py).
🦠 Virus Genomics & Protein Analysis: Utilize dedicated pipelines and guides for viral analysis, including a full SARS-CoV-2 lineage classification pipeline, protein language model analysis, and evolutionary data retrieval pipelines. See pipelines/README.md for details.
🛠️ Powerful Wrappers: Simplify interaction with external tools like PAML and perform basic filtering on VCF files.
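
To make the distance-matrix feature above more concrete, here is a minimal sketch of the Kimura 2-parameter (K2P) distance between two aligned DNA sequences. It is not the code of calculate_k2p.py; the function name k2p_distance and the choice to skip gaps and ambiguity codes (pairwise deletion) are assumptions made for illustration.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq1: str, seq2: str) -> float:
    """Kimura 2-parameter distance between two aligned DNA sequences.

    Hypothetical helper for illustration only, not part of EvolCat-Python.
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")

    transitions = transversions = compared = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        # Assumption: ignore sites with gaps or ambiguity codes.
        if a not in "ACGT" or b not in "ACGT":
            continue
        compared += 1
        if a == b:
            continue
        # A<->G and C<->T are transitions; everything else is a transversion.
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1

    if compared == 0:
        raise ValueError("no comparable sites")

    p = transitions / compared    # proportion of transitions (P)
    q = transversions / compared  # proportion of transversions (Q)
    term1 = 1 - 2 * p - q
    term2 = 1 - 2 * q
    if term1 <= 0 or term2 <= 0:
        raise ValueError("sequences too divergent for the K2P correction")
    # d = -1/2 * ln((1 - 2P - Q) * sqrt(1 - 2Q))
    return -0.5 * math.log(term1 * math.sqrt(term2))
```

Calling k2p_distance on every pair of sequences in a curated alignment would give the entries of a pairwise distance matrix. The result should be checked against established software before it is relied on.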

Example Workflow: Phylogenetic Analysis

The diagram below shows how EvolCat-Python scripts work together with external tools to build a phylogenetic tree.

graph TD
    A[Input: Unaligned FASTA] -->|clean_fasta_name.py| B(Cleaned FASTA);
    B -->|External Tool: MAFFT/MUSCLE| C(Aligned FASTA);
    C -->|nogaps.py| D(Curated Alignment);
    D -->|fas2phy.py| E(PHYLIP File for ML/BI Tools);
    D -->|calculate_dna_distances.py| F(Distance Matrix);
    F -->|build_tree_from_distances.py| G(Newick Tree);
    E -->|External Tool: RAxML/IQ-TREE| G;
    G -->|External Tool: FigTree/iTOL| H(Visualize Tree);

    style A fill:#cde4ff,stroke:#333,stroke-width:2px
    style H fill:#cde4ff,stroke:#333,stroke-width:2px
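
The two middle steps of the workflow (Aligned FASTA to Curated Alignment to PHYLIP File) can be sketched in plain Python as follows. This is only an approximation of what nogaps.py and fas2phy.py are described as doing, not their actual code. The helper names read_fasta, drop_gap_columns, and to_phylip are hypothetical, and the output assumes a strict sequential PHYLIP layout with names padded or truncated to 10 characters.

```python
from typing import Dict


def read_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA text into {name: sequence}, keeping input order."""
    records: Dict[str, str] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # first token of the header line
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def drop_gap_columns(alignment: Dict[str, str], gap_chars: str = "-") -> Dict[str, str]:
    """Remove every alignment column in which any sequence has a gap."""
    seqs = list(alignment.values())
    if not seqs:
        return {}
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("all sequences must have the same aligned length")
    keep = [i for i in range(len(seqs[0]))
            if all(s[i] not in gap_chars for s in seqs)]
    return {name: "".join(seq[i] for i in keep)
            for name, seq in alignment.items()}


def to_phylip(alignment: Dict[str, str]) -> str:
    """Write a sequential PHYLIP-style string (assumed 10-character names)."""
    n_taxa = len(alignment)
    n_sites = len(next(iter(alignment.values()))) if alignment else 0
    lines = [f" {n_taxa} {n_sites}"]
    for name, seq in alignment.items():
        lines.append(f"{name[:10]:<10}{seq}")
    return "\n".join(lines) + "\n"
```

A typical use would be to_phylip(drop_gap_columns(read_fasta(text))), where text holds the aligned FASTA produced by MAFFT or MUSCLE. Tools such as RAxML and IQ-TREE also accept relaxed PHYLIP with longer names, so the 10-character limit here is a conservative choice.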

User Responsibility: The scripts and library components provided here are for research and informational purposes. Users are responsible for validating the results obtained using this software, interpreting them correctly, and ensuring they are appropriate for their specific application. The original authors and the converters of this code disclaim any liability for its use or misuse. It is recommended to test the tools with known datasets and compare results with other established bioinformatics software where appropriate. Users may need to adapt or modify the code to suit their specific research needs and computational environment.

Overview

The library is organized into:

- pylib/utils/: Contains core utility modules for tasks like sequence parsing.
- pylib/scripts/: Contains executable Python scripts that replicate and extend the functionality of the original bioinformatics command-line tools. Many of these scripts depend on the pylib/utils/ core utility modules. The scripts are designed to find these modules by default when EvolCat-Python is structured with pylib/utils/ as a subdirectory. Some scripts may have their own detailed README.md files within this directory (e.g., for extract_cds_region.py).
  - pylib/scripts/ncbi/: Contains tools specifically for interacting with NCBI.
  - pylib/scripts/paml_tools/: Contains tools specifically for PAML genomics analysis.
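
How the scripts locate pylib/utils/ is not spelled out in this section. One common pattern, shown here purely as an assumption and not as the project's actual mechanism, is for a script in pylib/scripts/ to put the repository root on sys.path before importing from pylib.utils. The helper name add_repo_root_to_path is hypothetical.

```python
import sys
from pathlib import Path


def add_repo_root_to_path(script_file: str, levels_up: int = 2) -> Path:
    """Put the directory that contains pylib/ on sys.path and return it.

    Assumed layout: <repo root>/pylib/scripts/<script>.py, so the root is
    two directory levels above the script's own directory.
    """
    root = Path(script_file).resolve().parents[levels_up]
    if str(root) not in sys.path:
        sys.path.insert(0, str(root))
    return root
```

Under this assumption, a script would call add_repo_root_to_path(__file__) near its top, after which an import of the form "from pylib.utils import ..." could be resolved regardless of the current working directory.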

bob-friedman/EvolCat-Python
