NCBI download and Orthofinder analysis
ls /home/share/workshop/faa_files/
We will be using Orthofinder for our main comparative genomic analysis. The manual is very detailed, I recommend taking some time to read it. To run the program we will need some genomes to compare.
Orhtofinder Manual: https://github.com/davidemms/OrthoFinder
The program takes a set of protein sequences for each species and runs pair-wise comarisons to identify orhtologous groups. For each orthogroup a gene tree is calculated and in the end an overall species tree is computed. To get any sort of meaningful phylogenetic tree we need to be sure to include at least four different genome datasets. Ideally we would run this program with all of the avaialble sequences on NCBI. As you can imagine, a pairwise comparison with 1,185 Streptomyces genomes will take a lone time (days). We will therfore run it with a reduced set. Next we will determine what genomes we want to download and go over the best ways to retrieve them from NCBI.
cd ~/genomics_tutorial/
mkdir genbank_downloads
cd genbank_downloads/
FAQs about genome download from NCBI: https://www.ncbi.nlm.nih.gov/genome/doc/ftpfaq/#GBorRS
Whatever method you use be sure to grab an outgroup, or don't thats your call.
We ran a NCBI blast during the genome assembly tutorial. This BLAST should have given you the closest match against the nt database. Chances are it 'hit' well to many genomes. Choose the top hit to a full genome and follow the links to retriev the download link of the genome FAA from the ftp site.
Alternatively, you can download the reference genome used as part of the MR study in staphylococcus.
Staphylococcus aureus ATCC 29213 is the reference strain. https://www.ncbi.nlm.nih.gov/assembly/GCF_001879295.1
FOllow the links to the FTP download.
wget "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/879/295/GCF_001879295.1_StAu00v1/GCF_001879295.1_StAu00v1_protein.faa.gz"
link to NCBI prokaryote tables: https://www.ncbi.nlm.nih.gov/genome/browse#!/prokaryotes/ link to genome reports FTP: ftp://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/
- Download the genome report file for all all of prokaryotes
We will download the file directly to the server, there is no need to download it to your computer. Right lick on the link and copy the link address. This is a big file so we will filter it a bit first before opening it with tabview.
This file has a lot of useful information. For now we really care about column 21 which is the downloa link for the genome on the ftp site. Copy that link and paste it into a browser to see the files. We will then download the FAA files to the server.
# download the file
wget "ftp://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/prokaryotes.txt"
# view it
tabview prokaryotes.txt
# grep for species in question and view.
grep -i "Staphylococcus" prokaryotes.txt | grep REPR | tabview -
# print the download commands
grep "Staphylococcus" prokaryotes.txt | grep REPR | awk -F'\t' '{print "wget "$21"/*protein.faa.gz"}'
# download all the faa files automatically. -P is the number of processes at a time.
grep "Staphylococcus" prokaryotes.txt | grep REPR | awk -F'\t' '{print $21"/*protein.faa.gz"}' | xargs -P 1 wget -i
# or even better, rename the files as you go. (Delete the files created from the previous command before proceeding).
grep "Staphylococcus" prokaryotes.txt | grep REPR | sed 's/ /_/g' | awk -F'\t' '{print $1"_"$19".faa.gz",$21"/*protein.faa.gz"}' | xargs -n 2 -P 1 wget -O
Remove the empty files, some of them don't have annotations and it will mess up the next part.
zgrep -c '>' *.faa.gz
zgrep -c '>' *.faa.gz | awk -F':' '$2 == 0'
# automatic deletion with xargs
zgrep -c '>' *.faa.gz| awk -F':' '$2 == 0 {print $1}' | xargs rm
Unzip all the files
gunzip *.faa.gz
# move to analysis folder
mkdir ~/genomics_tutorial/orthofinder-analysis
cd ~/genomics_tutorial/orthofinder-analysis
# create a soft link to the FAA files we just downloaded
ln -s ../genbank_downloads/*.faa ./
# create a soft link to the FAA fles from our PROKKA analysis
ln -s /home/share/workshop/faa_files/*.faa ./
Think about what these numbers tell us right off the bat.
grep -c '>' *.faa
The input to the program is a directory containing a FAA file for each species.
# view the manual
orthofinder --help
# run the program, it will take some time
nohup time orthofinder -t 16 -a 16 -S diamond -f ./ &
I will review some, but not all of the files. The manual goes into extensive detail.
cd Results*/
ls
A tab seperated table. Each orthogroup is a raw, each column is a different sample.
The table provides all of the data for orthogroups that are in at least two different samples. If a sample has more than one protein for that particular orthogroup than it will have a comma seperated list for the entry.
tabview Orthogroups.csv
The same style table. Instead this one contains Orthogroups that are not belonging to an orthogroup, they are unique to a single sample. As you scroll down you should notice the proteins belong to different samples.
tabview Orthogroups_UnassignedGenes.csv
My favorite 'Orthogroup' Output file. Orthogroups are the rows, columns are gene counts per species. This can be easily parsed to see what orthogroups are specific to waht species. It provides total gene counts for each sample.
tabview Orthogroups.GeneCount.csv
- add annotations from a reference sequence
~/orthogroups_add_annotations.py <reference_faa> Orthogroups.txt Orthogroups.GeneCount.csv
orthogroups_add_annotations.py ../GCF_000203835.1_ASM20383v1_protein.faa Orthogroups.txt Orthogroups.GeneCount.csv | tabview -
A file containing the overall statistcis for the analysis. Total number of genes in the dataset etc.
tabview Statistics_Overall.csv
In my opinion this is the most important statistics output file. It provides details for each sample. How many genes were speciifc to that sample. If you want to know a quick statistics of how 'differen't your genome is, this is it.
tabview Statistics_PerSpecies.csv
All of the work that external programs like BLAST or DIAMOND. 'ls' this directory. It contains all the results for each pairwise comparison.
This directory contains a lot of useful data related to the Orthofinder analysis and how they commpute the phylogenetic trees.
A directory containing inferred trees for every orthogroup.
A rooted-species tree. Orthofinder commputes a root for the tree automatically. You can view this in any tree viewing program like FigTree or TreeView (macs). This file is in newick format. Check it out.
more Orthologues*/SpeciesTree_rooted.txt
orthogroups_add_annotations.py ../GCF_000203835.1_ASM20383v1_protein.faa Orthogroups.txt Orthogroups.GeneCount.csv
orthotools-venn.py Results_*/ PROKKA_*.faa species1.faa species2.faa venn
orthotools-UpSet.R Results_*/Orthogroups.GeneCount.csv