This repo contains instructions on how to screen for the presence of Gubaphage lineages in a set of predicted viral sequences (nucleotide FASTA file). For more information about the Gubaphage, see Camarillo-Guerrero et al. Cell 2021 for a formal description of this clade.
The Gubaphage is a recently discovered clade of bacteriophages highly prevalent in the gut microbiome of diverse human populations. Understanding its global distribution is important to uncover its potential role(s) in the gut ecosystem.
To perform a targeted detection of the Gubaphage, I performed a pan-genome analysis of all known Gubaphage genomes retrieved from Camarillo-Guerrero et al. Cell 2021, leading to the identification of a set of 6 core genes present in >90% of the genomes. For each core gene, HMMER was used to determine the optimal alignment bitscores (maximum F1 score, calculated with scripts/hmm-thresholds.R
) that would enable a clear discrimination between Gubaphage and non-Gubaphage sequences. The resulting HMM models alongside their scores can be found in hmm_models/
.
- Install the following dependencies:
- Python (tested v3.7.3)
- BioPython
- Prodigal (tested v2.6.3)
- HMMER (tested v3.1b2)
- MUSCLE (tested v3.8.31)
- IQ-TREE (tested v1.6.11)
- FastTree (tested v2.1.10)
- Clone the repo.
git clone https://github.com/alexmsalmeida/gubascreen.git
- Add the
scripts/
directory to your$PATH
environmental variable.
- Predict protein sequences from your nucleotide FASTA file (
input.fa
).
prodigal -i input.fa -a proteins.faa -p meta
- Run HMMER to detect the presence of Gubaphage marker genes using pre-defined thresholds.
hmmsearch --cpu {threads} --cut_ga --tblout guba_hmmer.tsv --noali hmm_models/guba_core.hmm proteins.faa
- Build phylogenetic tree from HMMER output (
iqtree
can be replaced withfasttree
for a faster, albeit less accurate analysis).
hmmer2tree.sh -t {threads} -i guba_hmmer.tsv -p proteins.faa -m iqtree -o phylo_tree
The main output file is a phylogenetic (phylo_tree/concat_alignment.aln.treefile
) file containing your input sequences and reference Gubaphage genomes.