scE2G

This project is licensed under the terms of the MIT license.

Input: Single-cell ATAC-seq or paired ATAC and RNA-seq (mulitome) data per cell cluster

Output: Genomewide enhancer-gene regulatory link predictions per cell cluster

The pipeline consists of the following components:

Compute ABC model predictions for each cell cluster
Generate E2G features from ABC predictions
If running scE2G (Multiome): compute Kendall correlation and/or ARC-E2G score for each cell cluster
Combine 2 & 3 to construct a feature file to be used as input to predictive model
(optional) Train predictive model using CRISPR-validated E-G pairs from K562 dataset
Apply trained model to make predictions by assigning a score to each E-G pair

Set up

Clone the repo and set it up for submodule usage

git clone --recurse-submodules https://github.com/EngreitzLab/scE2G.git
git config --global submodule.recurse true

When running for the first time, the conda environments have to be setup. For speed, it's recommended that your current environment has mamba installed

conda config --set channel_priority flexible  # Make sure your conda config uses flexible channel packaging to prevent unsatisfiable errors
conda create -n mamba -c conda-forge mamba=1.5.11 "python<=3.11"
conda activate mamba
mamba install -c conda-forge -c bioconda snakemake=7

Apply model

Before running this workflow, users should perform clustering to define cell clusters through regular single-cell analysis (such as Seurat & Signac). Required input data includes (refer to the example data in the resources/example_chr22_multiome_cluster folder):

Pseudobulk fragment files and their corresponding *.tbi index files in the same directory for each cell cluster
- Must be sorted by coordinates and gzipped. If it is not sorted, you can use the sortBed tool from bedtools: sortBed atac_fragments.unsorted.tsv > atac_fragments.tsv.
- Must have 5 columns (no header) corresponding to chr, start, end, cell_name, read_count (usually just 1)
- The cell_name column should correspond to the cell names in the RNA count matrix. All cells in the RNA matrix must be represented in the fragment file.
- To create a .tbi index, use bgzip from HTSlib instead of gzip to compress the fragment file: bgzip atac_fragments.tsv, then generate the corresponding .tbi index file using tabix -p bed atac_fragments.tsv.gz.
For scE2G (Multiome): RNA count matrix (gene x cell) for each cell cluster
- Use unnormalized (raw) counts
- Must be in either .csv.gz or .h5ad format
- Ensure there are not duplicated gene names

To configure the pipeline:

Modify config/config.yaml to specify paths for results_dir.
Modify config/config_cell_clusters.tsv to specify the RNA matrix path, fragment file path, Hi-C file path, Hi-C data type, Hi-C resolution, TSS coordinates, and gene coordinates for each cell cluster. If running scE2G (ATAC), leave the RNA matrix path blank.
Specify model directory from those in models/ to be applied

Running the pipeline:

snakemake -j1 --use-conda

This command make take a while the first time you run it, as it needs to build the conda environments. But if it takes more than 1 hour, that's usually a bad sign, meaning that you're not using mamba and/or need more memory to build the environment.

Output will show up in the results/ directory by default, with the structure results/cell_cluster/model_name/encode_e2g_predictions.tsv.gz. The score column to use is E2G.Score.qnorm.

Train model

Important: Only train models for biosamples matching the corresponding CRISPR data (in this case, K562)

Modify config/config_training.yaml with your model and cell_cluster configs

model_config has columns: model, dataset, ABC_directory, feature_table, polynomial (do you want to use polynomial features?) Note that trained models generated using polynomial features cannot directly be used in the Apply model workflow
cell_cluster_config has rows representing each "dataset" in model_config, where each "dataset" must correspond to a "cluster" in cell_cluster_config
- If an ABC_directory is not specified for a dataset, its entry in cell_cluster_config must also contain the required ABC biosample parameters
- TO DO: specify how to generate and formats for Kendall parameters
To apply a trained model, it must contain the following files: 1) model.pkl, 2) feature_table.tsv, 3) score_threshold_.XX , 5) tpm_threshold_YY (YY=0 if ATAC-only model), 4) qnorm_reference.tsv.gz (single column with header E2G.Score that contains raw scores for genomewide predictions)

Running the pipeline:

snakemake -s workflow/Snakefile_training -j1 --use-conda

Output will show up in the results_training/ directory by default

Name		Name	Last commit message	Last commit date
Latest commit History 159 Commits
.circleci		.circleci
ENCODE_rE2G @ 5020b37		ENCODE_rE2G @ 5020b37
config		config
models		models
resources		resources
workflow		workflow
.gitignore		.gitignore
.gitmodules		.gitmodules
LICENSE.txt		LICENSE.txt
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

scE2G

Set up

Apply model

Train model

About

Releases 1

Packages

Contributors 3

Languages

License

EngreitzLab/scE2G

Folders and files

Latest commit

History

Repository files navigation

scE2G

Set up

Apply model

Train model

About

Resources

License

Stars

Watchers

Forks

Releases 1

Packages 0

Contributors 3

Languages

Packages