Metadata - Add taxonomic breakdown to sourmash output #195
Comments
Good idea, but where should we get this information from?
|
Something like Downloads data from ftp://ftp.ncbi.nih.gov/pub/taxonomy/, might need to figure out what they do with it. |
@luizirber suggested piggybacking off of the Kraken tree information. http://ccb.jhu.edu/software/kraken/MANUAL.html E.g. taxonomy/nodes.dmp + taxonomy/names.dmp: taxonomy names might have good information? I can look more deeply. My slightly more Luddite and straightforward idea was that we might just snag the tax ID information from the metadata and use that as a means of pulling out phylogeny. You can get the full lineage from this file, which is apparently updated regularly. So... that might be made into a sort of db to go alongside the sbt? Info on taxon id db: ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump_readme.txt |
Alright, so as I understand them from reading over the files quickly... nodes.dmp: for any given tax ID it provides the information to identify the 'parental' tax ID (in addition to other things, e.g. the rank level for the node ID). Take the example of Prochlorococcus marinus (tax ID 1041938): the lookup in names.dmp provides the different strains of Prochlorococcus. Looking it up in nodes.dmp provides the genus ID 1218, which, if you look it up in names.dmp, is Prochlorococcus. One problem/difficulty that immediately jumps out at me is that there are sometimes multiple different entries for a given tax ID. Because why choose just one??? For example:
1213 | "Prochlorales" Lewin 1977 | | synonym | |
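The parent-pointer lookup described above can be sketched roughly as follows. This is a minimal illustration, not anyone's actual implementation: the helper names (`split_row`, `load_nodes`, `load_scientific_names`, `lineage`) are hypothetical, and the sample rows are hand-made and heavily abbreviated rather than copied from the real dump. It assumes the taxdump layout in which fields are separated by `\t|\t`, rows end in `\t|`, and the root node is its own parent. Keeping only rows whose name class is `scientific name` is one possible way around the multiple-entries problem.

```python
import io

# Toy rows in the taxdump layout; abbreviated and hand-made for illustration.
NODES_DMP = (
    "1\t|\t1\t|\tno rank\t|\n"
    "2\t|\t1\t|\tsuperkingdom\t|\n"
    "1218\t|\t2\t|\tgenus\t|\n"
    "1041938\t|\t1218\t|\tno rank\t|\n"
)
NAMES_DMP = (
    "1\t|\troot\t|\t\t|\tscientific name\t|\n"
    "2\t|\tBacteria\t|\t\t|\tscientific name\t|\n"
    "1218\t|\tProchlorococcus\t|\t\t|\tscientific name\t|\n"
    "1218\t|\tan older synonym\t|\t\t|\tsynonym\t|\n"
    "1041938\t|\texample strain\t|\t\t|\tscientific name\t|\n"
)


def split_row(line):
    """Split one .dmp row into stripped fields, dropping the empty tail."""
    return [field.strip() for field in line.split("|")][:-1]


def load_nodes(handle):
    """Return (parents, ranks) dicts keyed by integer tax ID."""
    parents, ranks = {}, {}
    for line in handle:
        tax_id, parent_id, rank = split_row(line)[:3]
        parents[int(tax_id)] = int(parent_id)
        ranks[int(tax_id)] = rank
    return parents, ranks


def load_scientific_names(handle):
    """Keep one name per tax ID: the row whose class is 'scientific name'."""
    names = {}
    for line in handle:
        tax_id, name, _unique_name, name_class = split_row(line)[:4]
        if name_class == "scientific name":
            names[int(tax_id)] = name
    return names


def lineage(tax_id, parents, ranks, names):
    """Walk parent pointers up to the root; return (rank, name) pairs, root first."""
    path = []
    while True:
        path.append((ranks[tax_id], names.get(tax_id)))
        parent = parents[tax_id]
        if parent == tax_id:  # the root node points at itself
            break
        tax_id = parent
    return list(reversed(path))


parents, ranks = load_nodes(io.StringIO(NODES_DMP))
names = load_scientific_names(io.StringIO(NAMES_DMP))
for rank, name in lineage(1041938, parents, ranks, names):
    print(rank, name, sep="\t")
```

With the real files, the `io.StringIO` objects would be replaced by open file handles on nodes.dmp and names.dmp; loading both into dictionaries once makes each lineage lookup a handful of dictionary hits.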
Asked Zach Foster about what kind of info and output would be most useful for packages like metacoder. |
talking to @meren to find out what they need for anvi'o and to see if we can swipe some code, too. |
I've been using something like this to build my own database (I just isolated this function from a larger context):

import sys
import requests
from xml.etree import ElementTree

# Ranks to report, from broadest to most specific.
taxonomy = ['superkingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species']

def get_lineage(tax_id):
    # Fetch the taxonomy record for this tax ID from NCBI E-utilities.
    response = requests.get('https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=taxonomy&id=%s' % str(tax_id))
    tree = ElementTree.fromstring(response.content)

    # Start with every rank unset.
    lineage = dict(zip(taxonomy, [None] * len(taxonomy)))

    try:
        taxon = tree.findall('Taxon')[0]
    except IndexError:
        # No record came back for this tax ID.
        return lineage

    # Fill in each ancestor whose rank is one we report.
    for e in taxon.findall('LineageEx')[0].findall('Taxon'):
        rank = e.find('Rank').text
        if rank in lineage:
            lineage[rank] = e.find('ScientificName').text.replace(' ', '_')

    return lineage

if __name__ == '__main__':
    print(get_lineage(sys.argv[1]))

Here:

$ python get_lineage.py 665953
{'superkingdom': 'Bacteria', 'family': 'Bacteroidaceae', 'class': 'Bacteroidia', 'phylum': 'Bacteroidetes', 'species': 'Bacteroides_eggerthii', 'genus': 'Bacteroides', 'order': 'Bacteroidales'}

This is just to play, of course. There are much better ways to deal with this issue in general, but they require lots of communication and logistics :/ |
Hi @ctb, thanks for asking.
Currently, metacoder is best at parsing any text-based format that is 1) sequential and 2) encodable by regular expressions. By these terms I mean:
For examples of this, I suggest looking at our parsing tutorial: https://grunwaldlab.github.io/metacoder_documentation/vignettes--01--extracting_taxonomy_data.html However, these examples are for parsing FASTA headers, not tables of information.
I expect that I will be rewriting Hope that helps. Feel free to ask any questions if something does not make sense. I could say more if I had a better idea of the information you want to output. |
RE krona & NCBI taxonomy numbers: I think supporting krona output is not necessarily the way to go. Perhaps outputting NCBI taxonomy numbers instead would be a better use of time, so that it's easy for people to feed their results into a lot of downstream analyses, not just krona. I like the Krona visualization, but it is glitchy with large-ish datasets (RNAseq, 4M reads). Originally posted in #174 |
ok, given that we want to do this for 100,000 entries, I'm going to mash up @halexand's and @meren's suggestions. On MSU HPC under
My current plan is to:
...and then we'll at least have the code to annotate the signatures when we figure out what the right metadata approach is. Probably this will end up in the
@taylorreiter what's the input format for krona again? thx! |
ok, I can now get output like this:
yay w00t. |
too much detail but my memory is weak:
|
CSV file
now available at: Note that approximately 40 accessions in our genbank SBTs do not seem to appear in either |
We'll work on utilities to extract and correlate this info with sourmash output in useful ways. In the meantime special thx to @meren for providing some useful code! |
A few more thoughts on taxonomy: in the metadata block for signatures generated from NCBI, we should definitely have a way to include accession(s) of included sequences and taxid(s) of included sequences. I do now think we should also provide an 'ncbi-tax-lineage' entry that encodes something standard like |
We've made quite a bit of headway on this issue, most recently with the sourmash LCA stuff (see gist and ncbi_taxdump_utils specifically). Just wanted to link that in here :) |
metacodeR viz working: http://ivory.idyll.org/blog/2017-classify-genome-bins-with-custom-db-part-1.html |
a tutorial for metacodeR analysis on gather output: https://hackmd.io/EYMwTAhgHGwMwFoCmcAsqGoAwGMwIggE4ATBKHCVANiSxMpziA== |
|
@halexand suggested adding taxonomic breakdowns to SBT signatures and, subsequently, to gather output. This could facilitate downstream analysis at different taxonomic levels. For example, kraken does
d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacteriales|f__Enterobacteriaceae|g__Escherichia|s__Escherichia_coli
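A lineage string in the style shown above could be assembled from a rank-to-name mapping such as the dictionary produced by the `get_lineage` snippet in the comments. The sketch below is only an illustration: the function name `format_lineage` and the `RANK_PREFIXES` table are hypothetical, the one-letter prefixes are read off the example string, and skipping ranks that have no name is an assumed choice rather than documented kraken behaviour.

```python
# Rank order and one-letter prefixes, inferred from the example string above.
RANK_PREFIXES = [
    ("superkingdom", "d"),
    ("phylum", "p"),
    ("class", "c"),
    ("order", "o"),
    ("family", "f"),
    ("genus", "g"),
    ("species", "s"),
]


def format_lineage(lineage):
    """Join the known ranks of a {rank: name} dict into one '|'-separated string."""
    parts = []
    for rank, prefix in RANK_PREFIXES:
        name = lineage.get(rank)
        if name:  # assumption: ranks without a name are simply left out
            parts.append(f"{prefix}__{name.replace(' ', '_')}")
    return "|".join(parts)


example = {
    "superkingdom": "Bacteria",
    "phylum": "Proteobacteria",
    "class": "Gammaproteobacteria",
    "order": "Enterobacteriales",
    "family": "Enterobacteriaceae",
    "genus": "Escherichia",
    "species": "Escherichia coli",
}
print(format_lineage(example))
```

Because each rank carries its own prefix, a downstream tool can split on `|` and truncate at any level (for instance, everything up to `g__`) to aggregate results by genus or family.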