scE2G

A computational pipeline for predicting genome-wide enhancer-gene regulatory links from single-cell ATAC-seq or paired ATAC and RNA-seq (multiome) data.

Overview

Input: Single-cell ATAC-seq or paired ATAC and RNA-seq (multiome) data per cell cluster

Output: Genome-wide enhancer-gene regulatory link predictions per cell cluster

Pipeline components

ABC model predictions - Compute ABC model predictions for each cell cluster
E2G feature generation - Generate element-gene features from ABC predictions
Correlation analysis (multiome only) - Compute Kendall correlation and/or ARC-E2G score for each cell cluster
Feature integration - Combine components 2 & 3 to construct feature file for predictive model
Model training (optional) - Train predictive model using CRISPR-validated E-G pairs from K562 dataset
Prediction - Apply trained model to assign scores to each element-gene pair

System requirements

Hardware requirements

The scE2G pipeline necessitates a standard computer furnished with ample RAM to facilitate the operations as defined by a user. For optimal performance, we suggest using a computer equipped with 32+ GB RAM.

Software requirements

The scE2G pipeline is compatible with Linux. It has been tested successfully on the following systems:

Linux: Red Hat Enterprise Linux 8.10 (Ootpa)
Linux: CentOS Linux 7 (Core)

The software dependencies and versions on which the software has been tested are listed in

workflow/envs/run_snakemake.yml
workflow/envs/sc_e2g.yml files

Installation

Clone repository

# Quick clone with submodules
git clone --recurse-submodules --shallow-submodules --depth 1 https://github.com/EngreitzLab/scE2G.git

# Initialize and update nested submodules
cd scE2G
git submodule update --init --recursive

Set up environment

We highly recommend using the provided conda environment for compatibility:

# Configure conda for flexible channel packaging
conda config --set channel_priority flexible

# Create and activate environment
conda env create -f workflow/envs/run_snakemake.yml
conda activate run_snakemake

Usage

Prerequisites

Before running scE2G, perform clustering to define cell clusters through standard single-cell analysis (e.g., Seurat & Signac).

Input data requirements

See example data in resources/example_chr22_multiome_cluster/ folder.

1. Pseudobulk fragment files

Format: One file per cell cluster with corresponding *.tbi index files
Requirements:
- Sorted by coordinates, compressed using bgzip, and indexed with tabix
- 5 columns (no header) corresponding to: chr, start, end, cell_name, read_count
- Cell names must match RNA count matrix column names
- Must contain exact set of cells represented in RNA matrix

Preparation:

# Sort if needed
sort -k1,1 -k2,2n atac_fragments.unsorted.tsv > atac_fragments.tsv

# Compress and index
bgzip atac_fragments.tsv
tabix -p bed atac_fragments.tsv.gz

2. RNA count matrix (multiome only)

Format: Gene × cell matrix for each cell cluster
Requirements:
- Use unnormalized (raw) counts
- No duplicated gene names
- Supported formats:
  - .csv.gz
  - .h5ad or .h5 (may require matching anndata version)
  - Sparse matrix directory (matrix.mtx.gz, genes.tsv.gz/features.tsv.gz, barcodes.tsv.gz)

Gene mapping: By default, genes are mapped via Ensembl ID using GENCODE v43 annotations. Modify gene_annotation in config/config.yaml for different versions (e.g., GENCODE v32 for CellRanger data).

Advanced file format options

Pre-processed fragment files

If your fragment files are already properly sorted and filtered to main chromosomes, you can skip preprocessing steps by setting fragments_preprocessed: True in your config file. Use this option only if you are certain that your files:

Are sorted with sort -k1,1 -k2,2n (chromosome then numerical position)
Only contain fragments on chromosomes present in the chromosome sizes reference file

This option allows you to skip the sorting and filtering steps of the pipeline, which can be very resource intensive for large fragment files.

Cell filtering configurations

The default pipeline settings assume the each cell cluster has a corresponding fragment file and RNA matrix that contain the exact same cells. If you instead have an RNA matrix containing cells from many clusters, you can avoid making cluster-specific matrices by setting RNA_matrix_filtered: False in your config file, and using the same RNA matrix for all clusters in your cell_cluster_config. The pipeline will then use the intersection of cells contained in the cluster-specific fragment file and combined RNA matrix to compute the Kendall correlation. Please note:

You still must provide cluser-specific fragment files
The RNA matrix must meet the formatting requirements indicated above
The memory requirements to load a very large RNA matrix may exceed the default estimations in the pipeline.

Configuration

Main config - Edit config/config.yaml:
- Set results_dir path
Cell clusters - Edit config/config_cell_clusters.tsv:
- Specify cluster name, ATAC fragment file path, RNA matrix path
- For ATAC-only analysis: leave RNA matrix path empty but include the column
Model selection - Specify path to model directory, or a comma-separated list of multiple models. Current supported models estimate contact using a power law function of genomic distance:
- For multiome predictions: models/multiome_powerlaw_v3
- For ATAC-only predictions: models/scATAC_powerlaw_v3

Running the pipeline

snakemake -j1 --use-conda --configfile config/config.yaml

Note: First run may take time to build conda environments, usually around 30-40 minutes according to CircleCI tests. If it exceeds 1 hour, ensure you're using mamba and have sufficient memory.

Output

Key outputs

Tabular predictions
- {results_dir}/{cell_cluster}/{model_name}/scE2G_predictions.tsv.gz: All putative enhancer-gene predictions for a cell cluster
- {results_dir}/{cell_cluster}/{model_name}/scE2G_predictions_threshold{model_threshold}.tsv.gz: Thresholded predictions containing enhancer-gene pairs that pass score threshold and other filtering steps
- Key score column in these files is E2G.Score.qnorm (quantile-normalized scE2G score)
Genome-browser files (produced if make_IGV_tracks: True in your config file)
- {IGV_dir}/{cell_cluster}/ATAC_norm.bw: bigWig file with read-depth normalized pseudobulk ATAC signal
- {IGV_dir}/{cell_cluster}/scE2G_predictions_threshold{model_threshold}.bedpe: bedpe file with filtered enhnancer-gene predictions
QC report
- {results_dir}/qc_plots/predictions_qc_report.html: Report summarizing properties of predictions in comparison to reference values

Full output structure

{results_dir}/                                         # Main results directory
├── {cell_cluster}/                                      # Outputs for each cell cluster
│   ├── ActivityOnly_features.tsv.gz                       # All element-gene pairs with activity-based features
│   ├── ActivityOnly_plus_external_features.tsv.gz         # All element-gene pairs with activity-based and other features
│   ├── ARC/                                               # ARC-E2G results and intermediate files (multiome only)
│   ├── external_features_config.tsv                       # Configuration for external features
│   ├── feature_table.tsv                                  # Feature table reflecting all models designated for the cell cluster
│   ├── genomewide_features.tsv.gz                         # Genome-wide element-gene pairs with features, formatted for model application
│   ├── Kendall/                                           # Kendall correlation results and intermediate results (multiome only)
│   ├── {model_name}/                                      # scE2G model-specific results (e.g., multiome_powerlaw_v3)
│   │   ├── scE2G_predictions.tsv.gz                         # All predictions with scores
│   │   ├── scE2G_predictions_threshold{threshold}.tsv.gz    # Filtered predictions
│   │   ├── scE2G_predictions_threshold{threshold}_stats.tsv # Properties of filtered predictions
│   │   ├── scE2G_element_list.tsv.gz                        # List of candidate elements and associated features
│   │   └── scE2G_gene_list.tsv.gz                           # List of genes in reference file and associated features
│   ├── Neighborhoods/                                    # Results from "Neighborhoods" step of ABC
│   ├── new_features/                                     # Additional computed features
│   ├── Peaks/                                            # Results from "Peaks" step of ABC
│   ├── Predictions/                                      # Results from "Predictions" step of ABC
│   ├── processed_genes_file.bed                          # Processed gene annotations
│   ├── tagAlign/                                         # Results from converting fragment to tagAlign file
│   └── to_generate.txt                                   # Pipeline generation tracking file
└── qc_plots/                                           # Quality control outputs across clusters
  └── predictions_qc_report.html                          # Report of prediction properties compared to reference
  └── [PDFs of prediction property plots]

{IGV_dir (= results_dir if not defined}/             # Genome-browser results directory (only if make_IGV_tracks is True)
├── {cell_cluster}/                                    # Outputs for each cell cluster
│   ├── ATAC.bw                                          # bigWig with unnormalized pseudobulk ATAC signal
│   ├── ATAC_norm.bw                                     # bigWig with read-depth normalized pseudobulk ATAC signal
└── └── {model_name}/                                    # scE2G model results (e.g., multiome_powerlaw_v3)
       └── scE2G_predictions_threshold{threshold}.bedpe   # bedpe file corresponding to filtered predictions

Model training

⚠️ Important: Only train models for biosamples matching the corresponding CRISPR data (currently K562).

Configuration

Edit config/config_training.yaml:

model_config columns:
- model: Model name
- dataset: Dataset identifier
- ABC_directory: ABC results directory
- feature_table: Feature table path
- polynomial: Use polynomial features? (Note: models with polynomial features cannot be directly used in Apply model workflow)
cell_cluster_config rows:
- Each "dataset" in model_config must correspond to a "cluster" here
- If ABC_directory not specified, must contain required ABC biosample parameters

Assembling model directory to use in application workflow

Each model directory must contain:

model.pkl
feature_table.tsv
score_threshold_{score_threshold}, where score_threshold is a value from 0–1 (e.g., 0.177)
tpm_threshold_{tpm_threshold}, where tpm_threshold is any non-negative value (use 0 for ATAC-only models)
qnorm_reference.tsv.gz (single column with header E2G.Score containing raw scores)

Running training workflow

snakemake -s workflow/Snakefile_training -j1 --use-conda

Output appears at the path to results_dir specified in config_training.yaml.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Citation

If you use scE2G in your research, please cite our preprint: bioRxiv preprint

Support

For questions and issues, please use the GitHub Issues page.

Name		Name	Last commit message	Last commit date
Latest commit History 332 Commits
.circleci		.circleci
ENCODE_rE2G @ ab8cc1d		ENCODE_rE2G @ ab8cc1d
config		config
models		models
resources		resources
tests		tests
workflow		workflow
.gitignore		.gitignore
.gitmodules		.gitmodules
LICENSE.txt		LICENSE.txt
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

scE2G

Overview

Pipeline components

System requirements

Hardware requirements

Software requirements

Installation

Clone repository

Set up environment

Usage

Prerequisites

Input data requirements

1. Pseudobulk fragment files

2. RNA count matrix (multiome only)

Advanced file format options

Configuration

Running the pipeline

Output

Key outputs

Full output structure

Model training

Configuration

Assembling model directory to use in application workflow

Running training workflow

License

Citation

Support

About

Uh oh!

Releases 3

Packages

Contributors 4

Uh oh!

Languages

License

EngreitzLab/scE2G

Folders and files

Latest commit

History

Repository files navigation

scE2G

Overview

Pipeline components

System requirements

Hardware requirements

Software requirements

Installation

Clone repository

Set up environment

Usage

Prerequisites

Input data requirements

1. Pseudobulk fragment files

2. RNA count matrix (multiome only)

Advanced file format options

Configuration

Running the pipeline

Output

Key outputs

Full output structure

Model training

Configuration

Assembling model directory to use in application workflow

Running training workflow

License

Citation

Support

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 3

Packages 0

Contributors 4

Uh oh!

Languages

Packages