A computational pipeline for predicting genome-wide enhancer-gene regulatory links from single-cell ATAC-seq or paired ATAC and RNA-seq (multiome) data.
Input: Single-cell ATAC-seq or paired ATAC and RNA-seq (multiome) data per cell cluster
Output: Genome-wide enhancer-gene regulatory link predictions per cell cluster
- ABC model predictions - Compute ABC model predictions for each cell cluster
- E2G feature generation - Generate element-gene features from ABC predictions
- Correlation analysis (multiome only) - Compute Kendall correlation and/or ARC-E2G score for each cell cluster
- Feature integration - Combine components 2 & 3 to construct feature file for predictive model
- Model training (optional) - Train predictive model using CRISPR-validated E-G pairs from K562 dataset
- Prediction - Apply trained model to assign scores to each element-gene pair
The scE2G pipeline runs on a standard computer with enough RAM for the analyses the user configures. For best performance, we recommend at least 32 GB of RAM.
The scE2G pipeline is compatible with Linux. It has been tested successfully on the following systems:
- Linux: Red Hat Enterprise Linux 8.10 (Ootpa)
- Linux: CentOS Linux 7 (Core)
The dependencies and versions against which scE2G has been tested are listed in:
- workflow/envs/run_snakemake.yml
- workflow/envs/sc_e2g.yml
# Quick clone with submodules
git clone --recurse-submodules --shallow-submodules --depth 1 https://github.com/EngreitzLab/scE2G.git
# Initialize and update nested submodules
cd scE2G
git submodule update --init --recursive
We highly recommend using the provided conda environment for compatibility:
# Configure conda for flexible channel priority
conda config --set channel_priority flexible
# Create and activate environment
conda env create -f workflow/envs/run_snakemake.yml
conda activate run_snakemake
Before running scE2G, perform clustering to define cell clusters through standard single-cell analysis (e.g., Seurat & Signac).
See example data in resources/example_chr22_multiome_cluster/ folder.
- Format: One file per cell cluster with corresponding *.tbi index files
- Requirements:
  - Sorted by coordinates, compressed using bgzip, and indexed with tabix
  - 5 columns (no header) corresponding to: chr, start, end, cell_name, read_count
  - Cell names must match RNA count matrix column names
  - Must contain the exact set of cells represented in the RNA matrix
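The fragment-file requirements above can be checked before launching the pipeline. The following is a minimal sketch, not part of scE2G: the function name `check_fragment_lines` is hypothetical, and it assumes byte-wise string ordering for chromosome names (as `sort` would give under `LC_ALL=C`).

```python
from typing import Iterable, List


def check_fragment_lines(lines: Iterable[str]) -> List[str]:
    """Return a list of problems found in fragment-file lines (empty if none).

    Checks the 5-column layout (chr, start, end, cell_name, read_count)
    and that lines are sorted by chromosome, then by numeric start.
    """
    problems = []
    prev_key = None
    for n, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 5:
            problems.append(f"line {n}: expected 5 columns, found {len(fields)}")
            continue
        chrom, start, end, _cell_name, read_count = fields
        if not (start.isdigit() and end.isdigit() and read_count.isdigit()):
            problems.append(f"line {n}: start, end, and read_count must be integers")
            continue
        key = (chrom, int(start))
        # Assumption: chromosome names compare byte-wise, like `LC_ALL=C sort -k1,1 -k2,2n`
        if prev_key is not None and key < prev_key:
            problems.append(f"line {n}: not sorted by chromosome then start")
        prev_key = key
    return problems
```

Because bgzip output is gzip-compatible, lines from a compressed file can be supplied with `gzip.open(path, "rt")` from the standard library.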
Preparation:
# Sort if needed
sort -k1,1 -k2,2n atac_fragments.unsorted.tsv > atac_fragments.tsv
# Compress and index
bgzip atac_fragments.tsv
tabix -p bed atac_fragments.tsv.gz
- Format: Gene × cell matrix for each cell cluster
- Requirements:
  - Use unnormalized (raw) counts
  - No duplicated gene names
- Supported formats:
  - .csv.gz
  - .h5ad or .h5 (may require matching anndata version)
  - Sparse matrix directory (matrix.mtx.gz, genes.tsv.gz/features.tsv.gz, barcodes.tsv.gz)
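Two of the requirements above (no duplicated gene names, and the same set of cells in the fragment file and RNA matrix) are easy to verify once the gene names and cell names have been read into lists. This is an illustrative sketch only; both function names are hypothetical, and how the names are loaded depends on which matrix format you use.

```python
from collections import Counter


def duplicated_names(gene_names):
    """Return the sorted gene names that occur more than once."""
    counts = Counter(gene_names)
    return sorted(name for name, count in counts.items() if count > 1)


def cell_set_mismatch(fragment_cells, rna_cells):
    """Return (cells only in fragments, cells only in RNA matrix), each sorted."""
    frag, rna = set(fragment_cells), set(rna_cells)
    return sorted(frag - rna), sorted(rna - frag)
```

Both results should be empty for inputs that satisfy the default pipeline settings.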
 
 
Gene mapping: By default, genes are mapped via Ensembl ID using GENCODE v43 annotations. Modify gene_annotation in config/config.yaml for different versions (e.g., GENCODE v32 for CellRanger data).
Pre-processed fragment files
If your fragment files are already properly sorted and filtered to main chromosomes, you can skip preprocessing steps by setting fragments_preprocessed: True in your config file. Use this option only if you are certain that your files:
- Are sorted with sort -k1,1 -k2,2n (chromosome, then numerical position)
- Only contain fragments on chromosomes present in the chromosome sizes reference file
This option allows you to skip the sorting and filtering steps of the pipeline, which can be very resource intensive for large fragment files.
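Before setting fragments_preprocessed: True, it may help to confirm the second condition above. The sketch below is a hypothetical helper, not pipeline code; it assumes the chromosome sizes file is tab-separated with the chromosome name in the first column.

```python
def chromosomes_not_in_reference(fragment_lines, chrom_sizes_lines):
    """Return sorted chromosome names seen in fragments but absent from the reference."""
    # Assumption: chromosome sizes lines look like "chr22<TAB>50818468"
    reference = {
        line.split("\t")[0].strip() for line in chrom_sizes_lines if line.strip()
    }
    seen = {line.split("\t", 1)[0] for line in fragment_lines if line.strip()}
    return sorted(seen - reference)
```

An empty result means every fragment lies on a chromosome listed in the reference file.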
Cell filtering configurations
The default pipeline settings assume that each cell cluster has a corresponding fragment file and RNA matrix that contain exactly the same cells. If you instead have an RNA matrix containing cells from many clusters, you can avoid making cluster-specific matrices by setting RNA_matrix_filtered: False in your config file and using the same RNA matrix for all clusters in your cell_cluster_config. The pipeline will then use the intersection of cells contained in the cluster-specific fragment file and the combined RNA matrix to compute the Kendall correlation.
Please note:
- You must still provide cluster-specific fragment files
- The RNA matrix must meet the formatting requirements indicated above
- The memory requirements to load a very large RNA matrix may exceed the default estimations in the pipeline.
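To illustrate the idea of correlating over the intersection of cells, here is a generic sketch of Kendall's tau computed on cells shared between an ATAC signal and an RNA count. It is not the scE2G implementation, which may handle ties, normalization, and scale differently; the function names and the per-cell dictionaries are assumptions made for this example.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Naive O(n^2) Kendall tau-a: (concordant - discordant) / number of pairs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length of at least 2")
    score = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        score += (s > 0) - (s < 0)  # +1 concordant, -1 discordant, 0 tied
    n_pairs = len(x) * (len(x) - 1) // 2
    return score / n_pairs


def shared_cell_correlation(atac_by_cell, rna_by_cell):
    """Correlate one element's ATAC signal with one gene's RNA counts over shared cells."""
    shared = sorted(set(atac_by_cell) & set(rna_by_cell))
    x = [atac_by_cell[cell] for cell in shared]
    y = [rna_by_cell[cell] for cell in shared]
    return shared, kendall_tau_a(x, y)
```

Cells present in only one of the two inputs are simply dropped, which mirrors the intersection behavior described above.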
- Main config - Edit config/config.yaml:
  - Set results_dir path
- Cell clusters - Edit config/config_cell_clusters.tsv:
  - Specify cluster name, ATAC fragment file path, RNA matrix path
  - For ATAC-only analysis: leave RNA matrix path empty but include the column
- Model selection - Specify path to model directory, or a comma-separated list of multiple models. Currently supported models estimate contact using a power-law function of genomic distance:
  - For multiome predictions: models/multiome_powerlaw_v3
  - For ATAC-only predictions: models/scATAC_powerlaw_v3
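For the cell cluster table described above, a common mistake in ATAC-only runs is dropping the RNA matrix column entirely instead of leaving it empty. The sketch below flags rows whose number of tab-separated fields differs from the header. It is a hypothetical helper that relies only on column counts, so it makes no assumption about the actual column names in config/config_cell_clusters.tsv.

```python
def rows_with_wrong_width(tsv_lines):
    """Return 1-based row numbers (header is row 1, blank lines ignored)
    whose field count differs from the header's field count."""
    lines = [line.rstrip("\n") for line in tsv_lines if line.strip()]
    header = lines[0].split("\t")
    bad = []
    for n, line in enumerate(lines[1:], start=2):
        # A trailing tab yields an empty final field, so an empty RNA column still counts
        if len(line.split("\t")) != len(header):
            bad.append(n)
    return bad
```

A row ending in a tab (empty RNA path) passes, while a row that omits the column is reported.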
snakemake -j1 --use-conda --configfile config/config.yaml
Note: The first run may take time to build conda environments, usually around 30-40 minutes according to CircleCI tests. If it exceeds 1 hour, ensure you're using mamba and have sufficient memory.
- Tabular predictions
  - {results_dir}/{cell_cluster}/{model_name}/scE2G_predictions.tsv.gz: All putative enhancer-gene predictions for a cell cluster
  - {results_dir}/{cell_cluster}/{model_name}/scE2G_predictions_threshold{model_threshold}.tsv.gz: Thresholded predictions containing enhancer-gene pairs that pass the score threshold and other filtering steps
  - Key score column in these files is E2G.Score.qnorm (quantile-normalized scE2G score)
- Genome-browser files (produced if make_IGV_tracks: True in your config file)
  - {IGV_dir}/{cell_cluster}/ATAC_norm.bw: bigWig file with read-depth normalized pseudobulk ATAC signal
  - {IGV_dir}/{cell_cluster}/scE2G_predictions_threshold{model_threshold}.bedpe: bedpe file with filtered enhancer-gene predictions
- QC report
  - {results_dir}/qc_plots/predictions_qc_report.html: Report summarizing properties of predictions in comparison to reference values
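As a small example of working with the tabular predictions, the sketch below keeps rows whose E2G.Score.qnorm is at or above a cutoff of your choosing. It is a hedged illustration: the function name is hypothetical, the cutoff is arbitrary, and it does not reproduce the additional filtering steps that the pipeline applies when writing the thresholded file.

```python
import csv
import io


def pairs_above_score(tsv_text, min_score, score_column="E2G.Score.qnorm"):
    """Return rows (as dicts) from tab-separated text whose score is >= min_score."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row for row in reader if float(row[score_column]) >= min_score]
```

The text of a compressed predictions file can be obtained with `gzip.open(path, "rt").read()`; for very large files, iterating over the reader directly would use less memory.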
 
{results_dir}/                                         # Main results directory
├── {cell_cluster}/                                      # Outputs for each cell cluster
│   ├── ActivityOnly_features.tsv.gz                       # All element-gene pairs with activity-based features
│   ├── ActivityOnly_plus_external_features.tsv.gz         # All element-gene pairs with activity-based and other features
│   ├── ARC/                                               # ARC-E2G results and intermediate files (multiome only)
│   ├── external_features_config.tsv                       # Configuration for external features
│   ├── feature_table.tsv                                  # Feature table reflecting all models designated for the cell cluster
│   ├── genomewide_features.tsv.gz                         # Genome-wide element-gene pairs with features, formatted for model application
│   ├── Kendall/                                           # Kendall correlation results and intermediate results (multiome only)
│   ├── {model_name}/                                      # scE2G model-specific results (e.g., multiome_powerlaw_v3)
│   │   ├── scE2G_predictions.tsv.gz                         # All predictions with scores
│   │   ├── scE2G_predictions_threshold{threshold}.tsv.gz    # Filtered predictions
│   │   ├── scE2G_predictions_threshold{threshold}_stats.tsv # Properties of filtered predictions
│   │   ├── scE2G_element_list.tsv.gz                        # List of candidate elements and associated features
│   │   └── scE2G_gene_list.tsv.gz                           # List of genes in reference file and associated features
│   ├── Neighborhoods/                                    # Results from "Neighborhoods" step of ABC
│   ├── new_features/                                     # Additional computed features
│   ├── Peaks/                                            # Results from "Peaks" step of ABC
│   ├── Predictions/                                      # Results from "Predictions" step of ABC
│   ├── processed_genes_file.bed                          # Processed gene annotations
│   ├── tagAlign/                                         # Results from converting fragment to tagAlign file
│   └── to_generate.txt                                   # Pipeline generation tracking file
└── qc_plots/                                           # Quality control outputs across clusters
    ├── predictions_qc_report.html                        # Report of prediction properties compared to reference
    └── [PDFs of prediction property plots]
{IGV_dir}/                                             # Genome-browser results directory (only if make_IGV_tracks is True; defaults to results_dir if IGV_dir is not defined)
└── {cell_cluster}/                                    # Outputs for each cell cluster
    ├── ATAC.bw                                          # bigWig with unnormalized pseudobulk ATAC signal
    ├── ATAC_norm.bw                                     # bigWig with read-depth normalized pseudobulk ATAC signal
    └── {model_name}/                                    # scE2G model results (e.g., multiome_powerlaw_v3)
        └── scE2G_predictions_threshold{threshold}.bedpe   # bedpe file corresponding to filtered predictions
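The layout above makes it straightforward to construct output paths in downstream scripts. The helper below is a hypothetical convenience function that follows the file names shown in the tree; the threshold is passed as a string because the exact formatting used in file names depends on the model directory.

```python
from pathlib import PurePosixPath


def prediction_paths(results_dir, cell_cluster, model_name, threshold):
    """Build the main per-model output paths following the directory layout above."""
    base = PurePosixPath(results_dir) / cell_cluster / model_name
    return {
        "all": base / "scE2G_predictions.tsv.gz",
        "thresholded": base / f"scE2G_predictions_threshold{threshold}.tsv.gz",
        "stats": base / f"scE2G_predictions_threshold{threshold}_stats.tsv",
    }
```

For example, `prediction_paths("results", "K562", "multiome_powerlaw_v3", "0.177")["all"]` points at the unthresholded predictions for that cluster and model.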
⚠️ Important: Only train models for biosamples matching the corresponding CRISPR data (currently K562).
Edit config/config_training.yaml:
- model_config columns:
  - model: Model name
  - dataset: Dataset identifier
  - ABC_directory: ABC results directory
  - feature_table: Feature table path
  - polynomial: Use polynomial features? (Note: models with polynomial features cannot be directly used in the Apply model workflow)
- cell_cluster_config rows:
  - Each "dataset" in model_config must correspond to a "cluster" here
  - If ABC_directory is not specified, must contain required ABC biosample parameters
Each model directory must contain:
- model.pkl
- feature_table.tsv
- score_threshold_{score_threshold}, where score_threshold is a value from 0–1 (e.g., 0.177)
- tpm_threshold_{tpm_threshold}, where tpm_threshold is any non-negative value (use 0 for ATAC-only models)
- qnorm_reference.tsv.gz (single column with header E2G.Score containing raw scores)
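The model directory contents listed above can be sanity-checked with a short script. This is a sketch under stated assumptions: the function name is hypothetical, the threshold files are assumed to be named exactly as the prefix followed by a number, and the input is a plain list of file names (for example from `os.listdir(model_dir)`).

```python
REQUIRED_FILES = ("model.pkl", "feature_table.tsv", "qnorm_reference.tsv.gz")


def inspect_model_dir(filenames):
    """Return (problems, thresholds) for a list of file names in a model directory."""
    names = set(filenames)
    problems = [f"missing {name}" for name in REQUIRED_FILES if name not in names]
    thresholds = {}
    for prefix in ("score_threshold_", "tpm_threshold_"):
        matches = [name for name in names if name.startswith(prefix)]
        if len(matches) != 1:
            problems.append(f"expected exactly one {prefix}* file, found {len(matches)}")
            continue
        # Assumption: everything after the prefix parses as a number
        thresholds[prefix.rstrip("_")] = float(matches[0][len(prefix):])
    score = thresholds.get("score_threshold")
    if score is not None and not 0 <= score <= 1:
        problems.append("score_threshold must be between 0 and 1")
    tpm = thresholds.get("tpm_threshold")
    if tpm is not None and tpm < 0:
        problems.append("tpm_threshold must be non-negative")
    return problems, thresholds
```

An empty problems list indicates that the directory has the files and threshold values described above.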
snakemake -s workflow/Snakefile_training -j1 --use-conda
Output appears in the results_dir specified in config_training.yaml.
This project is licensed under the MIT License - see the LICENSE file for details.
If you use scE2G in your research, please cite our preprint: bioRxiv preprint
For questions and issues, please use the GitHub Issues page.