Fstic: Allele‑Frequency‑based Genetic distance calculator

Paula Ruiz-Rodriguez¹ and Mireia Coscolla¹
¹ I²SysBio, University of Valencia-CSIC, FISABIO Joint Research Unit Infection and Public Health, Valencia, Spain

Overview

Fstic is a high‑performance command‑line tool written in Rust that calculates pairwise genetic distances from variant data. It ingests single‑sample VCF files or allele‑frequency tables and outputs an N × N distance matrix that summarises genetic differentiation among samples. The code is fully parallelised and scales to whole‑genome datasets on modern multi‑core CPUs.

Key Features

Multiple Distance Metrics · Eight standard estimators (FST, GST, Jost’s D, Reynolds, Nei, Cavalli‑Sforza chord, Rogers, Bray‑Curtis) let you view your data from complementary theoretical angles.
Flexible Input · Accepts raw VCFs or pre‑processed tables (.csv, .tsv, .tab). You can pass file lists for convenience.
Smart Filtering · Configurable filters on depth, allele frequency and allele counts for both VCF and table inputs ensure data quality.
Configurable Output · Add --normalize to divide cumulative distances by the number of loci (where applicable).
Optimised for Speed · Work is automatically distributed across all logical CPU cores (override with --workers).
Transparent Reporting · Live progress bar, ETA and a complete log of the filters applied.

Quick Start

# 1. Install Fstic (choose one option)

# Option 1: conda
conda install -c bioconda fstic

# Option 2: mamba
mamba install -c bioconda fstic

# Option 3: build from source (requires the Rust toolchain)
git clone https://github.com/<your-org>/fstic.git
cd fstic
cargo build --release  # binary at ./target/release/fstic

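# 2. Run examples (these assume the fstic binary is in the current directory; after a source build it is at ./target/release/fstic)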

# Example A: basic FST from three VCFs
./fstic \
  --vcf sample1.vcf sample2.vcf sample3.vcf \
  --output distances.csv

# Example B: Bray–Curtis from a list of tables, eight threads
./fstic \
  --table-list path_to_tables.txt \
  --output bray_curtis.csv \
  --formula bray-curtis \
  --workers 8

# Example C: stringent filters + normalised FST
./fstic \
  --table *.tsv \
  --reference ref.fa \
  --output fst_norm.tsv \
  --formula fst \
  --normalize \
  --min-depth 50 \
  --min-af 0.01 \
  --min-alt-reads 5

Command‑Line Reference

| Flag | Default | Description |
|---|---|---|
| `--vcf / --vcf-list` | – | One or more VCF files, or a file containing paths to them. |
| `--table / --table-list` | – | One or more table files, or a file containing paths to them. |
| `--reference <FASTA>` | required for table inputs | Reference genome used to infer missing reference alleles. Not needed in VCF mode. |
| `--output <FILE>` | required | Destination file for the distance matrix. |
| `--formula <METRIC>` | `fst` | Distance metric to compute (see below). |
| `--normalize` | off | Divide cumulative distances by the number of loci (affects `fst`, `bray-curtis`, `chord`, `jost_d`). |
| `--min-depth <INT>` | 30 | Filter out variants whose total depth is below this value. |
| `--min-af <FLOAT>` | 0.05 | Filter out variants whose alt‑allele frequency is below this value. |
| `--min-alt-reads <INT>` | 2 | Filter out variants supported by fewer alt reads than this value. |
| `--workers <INT>` | all logical cores | Number of threads to spawn. |
| `--help` | – | Print the full help message. |
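
To make the three filtering thresholds concrete, here is a minimal Python sketch of how a single variant record could be checked against `--min-depth`, `--min-af` and `--min-alt-reads`, using the defaults from the table. It is an illustration only, not Fstic's Rust implementation: the `Variant` class and the `passes_filters` function are hypothetical names, and skipping a filter when its field is absent is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Variant:
    """Hypothetical record holding the fields the filters look at."""
    freq: float                        # alt-allele frequency as a proportion (0 to 1)
    total_depth: Optional[int] = None  # total read depth, if reported
    alt_reads: Optional[int] = None    # reads supporting the alt allele, if reported


def passes_filters(v: Variant, min_depth: int = 30, min_af: float = 0.05,
                   min_alt_reads: int = 2) -> bool:
    """Return True if the variant survives all three thresholds.

    Assumption: a depth or alt-read filter is skipped when the field is missing.
    """
    if v.total_depth is not None and v.total_depth < min_depth:
        return False
    if v.freq < min_af:
        return False
    if v.alt_reads is not None and v.alt_reads < min_alt_reads:
        return False
    return True


if __name__ == "__main__":
    print(passes_filters(Variant(freq=0.48, total_depth=50, alt_reads=24)))
    print(passes_filters(Variant(freq=0.48, total_depth=20, alt_reads=10)))
```

A variant is kept only when every available field meets its threshold, so raising any one of the three values can only shrink the set of loci that enter the distance calculation.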

Guide to Distance Formulas

Metrics for Population Differentiation

| Name (`--formula`) | Global Formula | Notes & Recommended Use |
|---|---|---|
| GST | $G_{ST} = \dfrac{H_T - H_S}{H_T}$ | Classic overall differentiation (Nei 1973). |
| FST | $\theta = \dfrac{\sum_l (H_{T,l}-H_{S,l})}{\sum_l H_{T,l}}$ (Weir & Cockerham 1984 ratio‑of‑sums) | Default for relative differentiation; per‑locus estimates are also available. |
| Jost’s D | $D = \dfrac{H_T - H_S}{1 - H_S}$ | Measures the fraction of allelic diversity that is partitioned among populations; less sensitive to within‑population variation. |
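
The three formulas above can be worked through for one pair of samples with a short Python sketch. It rests on assumptions that this README does not state: the two samples are weighted equally, heterozygosity is computed as $1 - \sum_i p_i^2$ with no sample‑size correction, and $H_T$ is the heterozygosity of the averaged allele frequencies. All function names are hypothetical and the code is not taken from Fstic.

```python
from typing import Iterable, Sequence, Tuple

Freqs = Sequence[float]  # allele frequencies at one locus, summing to 1


def heterozygosities(p: Freqs, q: Freqs) -> Tuple[float, float]:
    """Return (H_S, H_T) for one locus and two equally weighted samples."""
    h_x = 1.0 - sum(a * a for a in p)
    h_y = 1.0 - sum(b * b for b in q)
    h_s = (h_x + h_y) / 2.0                                   # mean within-sample value
    h_t = 1.0 - sum(((a + b) / 2.0) ** 2 for a, b in zip(p, q))  # pooled value
    return h_s, h_t


def gst(p: Freqs, q: Freqs) -> float:
    """G_ST = (H_T - H_S) / H_T for a single locus."""
    h_s, h_t = heterozygosities(p, q)
    return (h_t - h_s) / h_t if h_t > 0 else 0.0


def jost_d(p: Freqs, q: Freqs) -> float:
    """D = (H_T - H_S) / (1 - H_S), exactly as written in the table."""
    h_s, h_t = heterozygosities(p, q)
    return (h_t - h_s) / (1.0 - h_s)


def fst_ratio_of_sums(loci: Iterable[Tuple[Freqs, Freqs]]) -> float:
    """theta = sum_l (H_T,l - H_S,l) / sum_l H_T,l across loci."""
    num = den = 0.0
    for p, q in loci:
        h_s, h_t = heterozygosities(p, q)
        num += h_t - h_s
        den += h_t
    return num / den if den > 0 else 0.0


if __name__ == "__main__":
    # One biallelic locus: alt frequency 0.5 in sample x, 0.8 in sample y.
    locus = ([0.5, 0.5], [0.8, 0.2])
    print(heterozygosities(*locus))    # H_S is about 0.41, H_T about 0.455
    print(gst(*locus), jost_d(*locus))
    print(fst_ratio_of_sums([locus, ([1.0, 0.0], [0.0, 1.0])]))
```

Summing numerator and denominator separately before dividing means that loci with high total heterozygosity contribute more to the estimate than nearly monomorphic ones, which is the practical difference from averaging per‑locus ratios.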

Metrics for Phylogenetic / Divergence Analysis

| Name (`--formula`) | Formula | Notes |
|---|---|---|
| Reynolds | $D_R = -\ln(1 - \theta)$ | Linear with drift time for recently diverged populations. |
| Nei’s D | $D = -\ln \left( \dfrac{J_{xy}}{\sqrt{J_x J_y}} \right)$ | Effective over long time‑scales; $J$ = probability of allele identity. |
| Cavalli‑Sforza “Chord” | $D_{CH} = \sqrt{2\bigl(1-\sum_i \sqrt{p_i q_i}\bigr)}$ | Geometric distance satisfying triangle inequality; useful for tree inference. |
| Rogers | $D_{R} = \sqrt{\dfrac{\sum_i (p_i-q_i)^2}{2L}}$ | Euclidean‑based distance bounded between 0 and 1. |
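
As a rough illustration of the four divergence formulas above, the following Python sketch evaluates each one for a pair of samples. It assumes that $J_{xy} = \sum_i p_i q_i$, $J_x = \sum_i p_i^2$ and $J_y = \sum_i q_i^2$ are summed over all loci, and that the chord distance is evaluated per locus; the README does not specify either detail. The function names are hypothetical.

```python
import math
from typing import Sequence, Tuple

Freqs = Sequence[float]              # allele frequencies at one locus
Locus = Tuple[Freqs, Freqs]          # (sample x, sample y)


def reynolds(theta: float) -> float:
    """D_R = -ln(1 - theta); theta must be below 1."""
    return -math.log(1.0 - theta)


def nei_d(loci: Sequence[Locus]) -> float:
    """D = -ln(J_xy / sqrt(J_x * J_y)), identities summed over loci (assumption)."""
    j_xy = sum(a * b for p, q in loci for a, b in zip(p, q))
    j_x = sum(a * a for p, _ in loci for a in p)
    j_y = sum(b * b for _, q in loci for b in q)
    if j_xy == 0.0:
        return math.inf              # no shared alleles at any locus
    return -math.log(j_xy / math.sqrt(j_x * j_y))


def chord(p: Freqs, q: Freqs) -> float:
    """D_CH = sqrt(2 * (1 - sum_i sqrt(p_i * q_i))) for a single locus."""
    overlap = sum(math.sqrt(a * b) for a, b in zip(p, q))
    return math.sqrt(max(0.0, 2.0 * (1.0 - overlap)))  # clamp rounding noise


def rogers(loci: Sequence[Locus]) -> float:
    """D = sqrt(sum_i (p_i - q_i)^2 / (2 * L)), with L the number of loci."""
    squared = sum((a - b) ** 2 for p, q in loci for a, b in zip(p, q))
    return math.sqrt(squared / (2.0 * len(loci)))


if __name__ == "__main__":
    loci = [([0.5, 0.5], [0.8, 0.2])]
    print(reynolds(0.1), nei_d(loci), chord(*loci[0]), rogers(loci))
```

With completely fixed differences (`[1, 0]` against `[0, 1]`) the Rogers value reaches its upper bound of 1, the chord value reaches $\sqrt{2}$, and Nei's D is unbounded because the two samples share no alleles.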

Metric for Allele‑Frequency Profile Comparison

| Name (`--formula`) | Formula | Notes |
|---|---|---|
| Bray‑Curtis | $BC = \tfrac{1}{2} \sum_i \lvert p_i - q_i \rvert$ | Dissimilarity (does not satisfy the triangle inequality). |

Variables: $p_i$, $q_i$ = frequencies of allele $i$ in populations x and y; $H_S$ and $H_T$ = within‑population and total heterozygosity; $\theta$ = the FST estimate; $l$ indexes loci and $L$ = number of loci.
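
The Bray‑Curtis formula is simple enough to check by hand. The Python sketch below applies it to a single locus; `bray_curtis` is a hypothetical helper name, and the assumption is that both frequency vectors list the same alleles in the same order.

```python
from typing import Sequence


def bray_curtis(p: Sequence[float], q: Sequence[float]) -> float:
    """BC = 1/2 * sum_i |p_i - q_i| for two allele-frequency vectors."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


if __name__ == "__main__":
    # Alt frequency 0.5 in one sample and 0.8 in the other:
    # 1/2 * (|0.5 - 0.8| + |0.5 - 0.2|) = 0.3, up to floating-point rounding.
    print(bray_curtis([0.5, 0.5], [0.8, 0.2]))
```

Identical profiles give 0, and profiles that share no alleles give 1, provided each vector sums to 1.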

Input Formats

VCF

One file per sample (single‑sample VCF). Sample name is taken from the filename.
Requires FORMAT/FREQ field (supports decimal or percentage). Optionally uses DP, AD, ADR for filtering.
Indels fully supported.
--reference not required in VCF mode: REF allele is part of the file.

Example 1 — VCF with all filtering fields

##fileformat=VCFv4.2
#CHROM POS  ID REF ALT QUAL FILTER INFO FORMAT        sample_A
chr1    100 .  A   T   .    .      .    GT:DP:AD:FREQ 0/1:50:24:48.0%
chr1    250 .  C   CAA .    .      .    GT:DP:AD:FREQ 0/1:45:10:22.2%

Example 2 — VCF missing DP/AD but with FREQ only

##fileformat=VCFv4.2
#CHROM POS  ID REF ALT QUAL FILTER INFO FORMAT sample_B
chr1    100 .  A   T   .    .      .    GT:FREQ 0/1:85.0%
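
To show how the FORMAT and sample columns in the two examples fit together, here is a minimal Python sketch that extracts the frequency and the optional depth fields from one record. It is not Fstic's parser. The function names are hypothetical, a value without a `%` sign is assumed to already be a proportion, and `AD` is assumed to hold the alt‑supporting read count (if it is comma‑separated, the last value is taken).

```python
from typing import Dict, Union


def parse_freq(text: str) -> float:
    """Convert '48.0%' or '0.48' to a proportion.

    Assumption: a value without '%' is already a proportion.
    """
    text = text.strip()
    if text.endswith("%"):
        return float(text[:-1]) / 100.0
    return float(text)


def parse_sample_column(format_field: str, sample_field: str) -> Dict[str, Union[int, float]]:
    """Pair FORMAT keys with sample values and keep the fields used for filtering."""
    record = dict(zip(format_field.split(":"), sample_field.split(":")))
    if "FREQ" not in record:
        raise ValueError("FORMAT/FREQ is required")
    parsed: Dict[str, Union[int, float]] = {"freq": parse_freq(record["FREQ"])}
    if "DP" in record:
        parsed["total_depth"] = int(record["DP"])
    if "AD" in record:
        # Assumption: the last comma-separated value is the alt read count.
        parsed["alt_reads"] = int(record["AD"].split(",")[-1])
    return parsed


if __name__ == "__main__":
    print(parse_sample_column("GT:DP:AD:FREQ", "0/1:50:24:48.0%"))
    print(parse_sample_column("GT:FREQ", "0/1:85.0%"))
```

The second call returns only a frequency, which mirrors Example 2: without `DP` or `AD`, only the allele‑frequency threshold has anything to act on.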

Table (`.csv`, `.tsv`, `.tab`)

Required columns: sample, position, sequence (alt allele), frequency.
Recommended: ref_allele (crucial for indels).
Optional filtering columns: total_dp, alt_dp, alt_rv.
Column names are case‑insensitive; delimiter auto‑detected from extension.

Example 1 — CSV with all columns

sample,position,ref_allele,sequence,frequency,total_dp,alt_dp
sample_A,100,A,T,0.5,50,24
sample_B,100,A,T,0.8,60,48

Example 2 — TSV with only required columns

sample	position	sequence	frequency
sample_C	550	T	0.12
sample_D	550	A	0.95
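
The table rules above (required columns, case‑insensitive headers, delimiter chosen by extension) can be sketched in a few lines of Python. This is an illustration rather than Fstic's reader: `read_table` is a hypothetical function, and treating `.tab` as tab‑delimited is an assumption.

```python
import csv
import io
import os
from typing import Dict, List, TextIO

# Assumption: .tab files are tab-delimited, like .tsv.
DELIMITERS = {".csv": ",", ".tsv": "\t", ".tab": "\t"}
REQUIRED = {"sample", "position", "sequence", "frequency"}


def read_table(handle: TextIO, filename: str) -> List[Dict[str, str]]:
    """Read rows as dicts keyed by lower-cased column names."""
    ext = os.path.splitext(filename)[1].lower()
    delimiter = DELIMITERS.get(ext)
    if delimiter is None:
        raise ValueError(f"unsupported extension: {ext!r}")
    reader = csv.reader(handle, delimiter=delimiter)
    header = [name.strip().lower() for name in next(reader)]
    missing = REQUIRED - set(header)
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")
    return [dict(zip(header, row)) for row in reader if row]


if __name__ == "__main__":
    text = "Sample\tPosition\tSequence\tFrequency\nsample_C\t550\tT\t0.12\n"
    for row in read_table(io.StringIO(text), "example.tsv"):
        print(row["sample"], row["position"], row["sequence"], row["frequency"])
```

Lower‑casing the header once up front lets the rest of the code refer to columns by a single canonical name, whatever capitalisation the input file uses.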

Minor Clarifications & Defaults

Threads: By default Fstic launches one worker per logical CPU core. Override with --workers N.
Frequencies: The frequency field can be a proportion (0.125) or a percentage (12.5 %). Both are auto‑detected.
Normalisation: Adding --normalize turns cumulative distances into per‑locus means for metrics where that is meaningful.
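
The effect of `--normalize` can be pictured with a small Python sketch. It assumes the cumulative distance is a plain sum of per‑locus contributions; `summarise` is a hypothetical helper, not part of Fstic.

```python
from typing import Sequence


def summarise(per_locus: Sequence[float], normalize: bool = False) -> float:
    """Return the cumulative distance, or its per-locus mean when normalize is True."""
    total = sum(per_locus)
    if not normalize:
        return total
    return total / len(per_locus) if per_locus else 0.0


if __name__ == "__main__":
    contributions = [0.3, 0.1, 0.2]            # three loci
    print(summarise(contributions))             # cumulative distance
    print(summarise(contributions, normalize=True))  # per-locus mean
```

The per‑locus mean makes pairs of samples comparable when they retain different numbers of loci after filtering, whereas the cumulative value grows with the number of loci.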

✨ Contributors

fstic is developed with ❤️ by:

Paula Ruiz-Rodriguez
💻 🔬 🤔 🔣 🎨 🔧

Mireia Coscolla
🔍 🤔 🧑‍🏫 🔬 📓

This project follows the all-contributors specification (emoji key).

PathoGenOmics-Lab/fstic
