Python tools for Codon Usage Bias Analysis

This repository hosts several Python 3 command-line programs and a new graphical web app for calculating popular codon usage and amino acid usage statistics from FASTA sequence files (.fasta).
Motivation: I worked with hundreds of genomes, so I wrote these scripts to batch-process multiple genomes or input files and to output a CSV-formatted table that is easy to parse and amenable to statistical analyses such as PCA. I found this task tedious with previously published tools, which output the conventional wide-form codon usage table that needs extra processing.
- Genomics publications that used Codon-Usage-in-Python:
  - Transfer RNA Levels Are Tuned to Support Differentiation During Drosophila Neurogenesis
  - Kingdom-Wide Analysis of Fungal Protein-Coding and tRNA Genes Reveals Conserved Patterns of Adaptive Evolution
  - Super-pangenome analysis of 3562 human and animal papillomavirus isolates illuminates their genome and pathogenicity evolution
I validated these codon usage tools against the original and widely loved CodonW (Peden, 1995).

-------Software Setup----------:

All tools require Python 3.10 or higher and pandas 2.0 or higher.
Installing Python 3 via Anaconda is recommended, since it comes pre-loaded with pandas: https://docs.anaconda.com/anaconda/install/index.html

Open Terminal (Mac/Linux) or Command Prompt (Windows)

Clone the repository:

git clone https://github.com/rhondene/Codon-Usage-in-Python.git

Navigate to the project folder:

cd Codon-Usage-in-Python/codon-usage-gui

Install the package:
```
pip install -e .
```

See the test_data folder for examples of the outputs of each tool on the same input fasta file ('NB_CDS.fasta')

------Run Codon Analysis Via Browser Web App (Recommended) ------

🔬 Features

📊 Comprehensive Analysis Types:

- Transcriptome-wide RSCU (Relative Synonymous Codon Usage)
- Per-gene RSCU analysis for individual sequence patterns
- Amino acid usage analysis (expected vs observed frequencies)
- Codon usage per 1000 codons for normalization
- Relative codon frequencies per gene

Open your terminal and type:

codon-usage-gui

A web page will open automatically in your browser. If it does not, click the Local URL (http://localhost:xxx) shown in the terminal to open the page.

------How to Use Stand-alone Command-line Tool ------

Each tool runs with a single command that executes its .pyz file from a shell terminal.

Compute_RSCU_gene :

Computes the relative synonymous codon usage of each of the 59 degenerate codons for each coding sequence (CDS), according to Sharp and Li, 1986 (PMCID: PMC340524).
Input: FASTA file of N coding sequences (CDS)
Output: comma-separated table (csv) of the relative synonymous codon usage for each transcript: i.e. a matrix of N transcripts x 59 RSCU values

How to Use :

Copy the Compute_RSCU_gene.pyz binary from the Codon-Usage-in-Python/Compute_RSCU_gene folder into your project folder containing the input FASTA file.
Open a terminal window (bash, gitbash, powershell, etc) in the same working folder.
Type the following in the terminal, replacing the input and output arguments with your own file names:

	python Compute_RSCU_gene.pyz -CDS example_cds.fasta -out rscu_results

You can also run python Compute_RSCU_gene.pyz --help for the help menu.
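For batch processing of many genomes, the command above can be wrapped in a short loop. The following is only a sketch: it reuses the -CDS and -out flags shown above, while the helper names (`build_command`, `run_batch`) and the `<stem>_rscu` output naming are hypothetical choices made for illustration.

```python
import subprocess
import sys
from pathlib import Path


def build_command(pyz: Path, fasta: Path, out_name: str) -> list[str]:
    """Build the argument list for one run, using the -CDS and -out flags."""
    return [sys.executable, str(pyz), "-CDS", str(fasta), "-out", out_name]


def run_batch(pyz: Path, fasta_dir: Path, dry_run: bool = True) -> list[list[str]]:
    """Build one command per .fasta file in fasta_dir; execute them unless dry_run."""
    commands = []
    for fasta in sorted(fasta_dir.glob("*.fasta")):
        # Hypothetical naming scheme: reuse the FASTA file stem for the output.
        cmd = build_command(pyz, fasta, f"{fasta.stem}_rscu")
        commands.append(cmd)
        if not dry_run:
            # check=True raises CalledProcessError if the tool exits with an error.
            subprocess.run(cmd, check=True)
    return commands
```

Calling `run_batch(Path("Compute_RSCU_gene.pyz"), Path("."))` returns the commands without running them; pass `dry_run=False` to execute each one.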

Compute_RSCU_tw :

Computes relative synonymous codon usage (RSCU) and absolute counts of the 59 synonymous codons over the entire (aggregate) set of coding sequences ("transcriptome-wide"). Implemented according to Sharp and Li, 1986 (PMCID: PMC340524).
Input: single- or multi-FASTA file of coding sequences (CDS)
Output: a comma-separated table (.csv) file of the 59 RSCU values

How to Use :

Copy the Compute_RSCU_tw.pyz binary from the Codon-Usage-in-Python/Compute_RSCU_tw folder into the working folder that contains the input FASTA file of CDS.
Open a terminal window (bash, gitbash, powershell, etc) in the same working folder.
To run the program, type the command below in the terminal shell (be sure to replace the arguments with the actual names of the input and output files):
```
	python Compute_RSCU_tw.pyz -CDS example.fasta -out results
```

CodonCount:

Computes the length-normalized frequency of each of the 61 sense codons in a coding sequence (CDS) and returns a CSV file.

    Relative frequency of codon_i = (count of codon_i in CDS_j) / (total number of codons in CDS_j)
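As a minimal sketch of this formula (not the code inside CodonCount.pyz), the function below counts in-frame codons of a single CDS and divides by the total. It assumes the standard stop codons and, as a labeled assumption, counts only sense codons in the denominator; the actual tool may treat stop or ambiguous codons differently.

```python
from collections import Counter

# Assumption: standard genetic code stop codons are excluded from the counts.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def relative_codon_frequency(cds: str) -> dict[str, float]:
    """Return count(codon) / total sense codons for one coding sequence."""
    seq = cds.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    # Keep unambiguous sense codons only (drops stops and codons containing N, etc.).
    sense = [c for c in codons if c not in STOP_CODONS and set(c) <= set("ACGT")]
    total = len(sense)
    if total == 0:
        return {}
    return {codon: n / total for codon, n in Counter(sense).items()}
```

For example, `relative_codon_frequency("ATGGCTGCTTAA")` has three sense codons, so ATG gets one third and GCT two thirds.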

How to Use :

Copy the CodonCount.pyz file from the Codon-Usage-in-Python/CodonCount folder into the working folder that contains the input FASTA file(s).
Open a terminal window (bash, gitbash, powershell, etc) in the same working folder.
To run the program, type the command below in the terminal shell (be sure to replace the arguments with the actual names of the input and output files):
```
python CodonCount.pyz -CDS example.fasta -out example_output
```

You can also run python CodonCount.pyz --help for the help menu.

CodonUsage_per_1000:

Computes codon usage per 1000 codons across the whole transcriptome.
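The idea can be sketched as follows: pool the codon counts of all coding sequences, then scale each count to a total of 1000 codons. The function name is hypothetical, and this sketch counts every in-frame triplet (stop codons included); CodonUsage_per_1000.pyz may handle stops or ambiguous bases differently.

```python
from collections import Counter


def codon_usage_per_1000(sequences: list[str]) -> dict[str, float]:
    """Pool codon counts over all sequences and express them per 1000 codons."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
        counts.update(seq[i:i + 3] for i in range(0, usable, 3))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {codon: 1000 * n / total for codon, n in counts.items()}
```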

Copy the CodonUsage_per_1000.pyz file from the Codon-Usage-in-Python/CodonUsage_per_1000 folder into the working folder that contains the input FASTA file(s).
Open a terminal window (bash, gitbash, powershell, etc) in the same working folder.
To run the program, type the command below in the terminal shell (be sure to replace the arguments with the actual names of the input and output files):
```
python CodonUsage_per_1000.pyz -CDS all_CDS.fasta -out  results_cu
```

You can also run python CodonUsage_per_1000.pyz --help for the help menu.

fasta2csv :

Converts a FASTA file to a two-column CSV table (Header | Sequence).
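The conversion is roughly equivalent to the sketch below, which uses only the standard library. It is an illustration of the idea, not the contents of fasta2csv.py; the function names are hypothetical, and the column names are taken from the description above.

```python
import csv


def read_fasta(handle):
    """Yield (header, sequence) pairs, joining sequences wrapped over several lines."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def fasta_to_csv(fasta_handle, csv_handle) -> int:
    """Write a Header,Sequence table and return the number of records written."""
    writer = csv.writer(csv_handle)
    writer.writerow(["Header", "Sequence"])
    n_records = 0
    for header, sequence in read_fasta(fasta_handle):
        writer.writerow([header, sequence])
        n_records += 1
    return n_records
```

When writing to a real file, open the output with `newline=""` so that the csv module controls line endings.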

aa_usage :

Computes the expected and observed amino acid usage according to the methods outlined in https://pubmed.ncbi.nlm.nih.gov/5767777/ and https://qubeshub.org/publications/979/serve/1/3067?el=1&download=1
To run, download the script into your project folder and type in the terminal:

python aa_usage.py -CDS YOUR_CDS.fasta -out OUTPUT_NAME

fix_fasta.py:

Corrects the issue of newlines within the same sequence, so that each record's sequence sits on a single line.
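A minimal sketch of this kind of fix, assuming the goal is to join wrapped sequence lines under each header (the function name is hypothetical and fix_fasta.py itself may do more):

```python
def unwrap_fasta(text: str) -> str:
    """Return FASTA text with each sequence joined onto one line."""
    records = []  # each entry is [header_line, sequence]
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # drop blank lines inside or between records
        if line.startswith(">"):
            records.append([line, ""])
        elif records:
            records[-1][1] += line
    return "".join(f"{header}\n{sequence}\n" for header, sequence in records)
```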

Glossary Codon Usage Metrics

Codon Usage Bias

The unequal usage of synonymous codons within a gene or genome i.e. the deviation of synonymous codons from a uniform distribution due to a combination of natural selection, neutral mutational bias and genetic drift.

Relative Synonymous Codon Usage

The RSCU of a codon is computed as its observed frequency divided by its expected frequency within a gene or whole transcriptome under the null hypothesis of equal synonymous codon usage.

An RSCU greater than 1 means that the codon is used more often than expected under equal synonymous usage [Sharp & Li 1987].
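In other words, for a codon in a synonymous family of n codons, RSCU = (count of the codon) / (total count of the family / n). The sketch below illustrates that calculation under the standard genetic code. It is not the code shipped in the .pyz tools; the names are hypothetical, and skipping amino acids that never occur is an assumption of this sketch.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def rscu(sequences: list[str]) -> dict[str, float]:
    """Return observed/expected usage for the degenerate codons seen in the input."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper().replace("U", "T")
        for i in range(0, len(seq) - len(seq) % 3, 3):
            codon = seq[i:i + 3]
            if codon in CODON_TABLE:  # skips codons with ambiguous bases
                counts[codon] += 1

    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families[aa].append(codon)

    values = {}
    for aa, codons in families.items():
        if len(codons) < 2:
            continue  # Met and Trp have a single codon, so no synonymous choice
        family_total = sum(counts[c] for c in codons)
        if family_total == 0:
            continue  # assumption: amino acids absent from the input are skipped
        expected = family_total / len(codons)
        for c in codons:
            values[c] = counts[c] / expected
    return values
```

Passing a single sequence gives a per-gene result, while passing all coding sequences pools the counts for a transcriptome-wide result.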

Codons with high RSCU in highly expressed genes are referred to as "optimal codons". For many species the optimal codons are selectively recognised by the abundant tRNAs, which is often taken as an indication of selection pressures shaping codon usage patterns [Ikemura 1983; Wint et al 2022].

Amino Acid Frequency:

If a particular amino acid is in some way adaptive, then it should occur more frequently than expected by chance.
This can easily be tested by calculating the expected frequencies of amino acids and comparing to observed. The codons and observed frequencies of particular amino acids are given in the table.
The frequencies of the nucleotide bases in nature (written as RNA) are 22.0% uracil, 30.3% adenine, 21.7% cytosine, and 26.1% guanine. The expected frequency of a particular codon can then be calculated by multiplying the frequencies of the three bases that make up the codon. The expected frequency of an amino acid can then be calculated by adding the frequencies of each codon that codes for that amino acid.
As an example, the RNA codons for tyrosine are UAU and UAC, so the random expectation for its frequency is (0.220)(0.303)(0.220) + (0.220)(0.303)(0.217) ≈ 0.0291. Since 3 of the 64 codons are nonsense or stop codons, this frequency for each amino acid is multiplied by a correction factor of 1.057.
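The expected-frequency calculation described above can be sketched in a few lines. This is an illustration rather than the contents of aa_usage.py: it assumes the standard genetic code, uses the base frequencies and the 1.057 correction factor quoted in the text, and the function name is hypothetical.

```python
from itertools import product

# Base frequencies quoted in the text above.
BASE_FREQ = {"U": 0.220, "A": 0.303, "C": 0.217, "G": 0.261}

# Standard genetic code in RNA letters, codons ordered UUU, UUC, UUA, UUG, UCU, ...
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def expected_aa_frequencies(base_freq: dict[str, float],
                            correction: float = 1.0) -> dict[str, float]:
    """Sum the products of base frequencies over each amino acid's codons."""
    expected = {}
    for codon, aa in CODON_TABLE.items():
        if aa == "*":
            continue  # stop codons do not code for an amino acid
        p = base_freq[codon[0]] * base_freq[codon[1]] * base_freq[codon[2]]
        expected[aa] = expected.get(aa, 0.0) + p
    # Optional rescaling for stop codons, e.g. correction=1.057 as in the text.
    return {aa: p * correction for aa, p in expected.items()}
```

`expected_aa_frequencies(BASE_FREQ)["Y"]` gives the uncorrected tyrosine expectation from the worked example, and `expected_aa_frequencies(BASE_FREQ, correction=1.057)` applies the stop-codon correction to every amino acid.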
