Metadata - Add taxonomic breakdown to sourmash output #195

brooksph · 2017-04-25T23:41:36Z

@halexand suggested adding taxonomic breakdowns to sbt signatures and subsequently, gather output. This could facilitate downstream analysis at different taxonomic levels. For example, kraken does d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacteriales|f__Enterobacteriaceae|g__Escherichia|s__Escherichia_coli

The text was updated successfully, but these errors were encountered:

ctb · 2017-04-25T23:42:31Z

Good idea, but where should we get this information from?

luizirber · 2017-04-25T23:45:04Z

Something like kraken-build --download-taxonomy --db $DBNAME can be a start (from http://ccb.jhu.edu/software/kraken/MANUAL.html)

Downloads data from ftp://ftp.ncbi.nih.gov/pub/taxonomy/, might need to figure out what they do with it.

halexand · 2017-04-25T23:49:21Z

@luizirber suggested piggybacking off of the Karen tree information. http://ccb.jhu.edu/software/kraken/MANUAL.html

E.g. taxonomy/nodes.dmp + taxonomy/names.dmp: Taxonomy names might have good information? I can look more deeply.

My slightly more Luddite and straight forward idea was that we might just snag the tax id information from the metadata and use that as a means of pulling out phylogeny. You can get the full lineage from this file which is apparently updated regularly. So... that might be made into a sort of db to go alongside the sbt?

Info on taxon id db: ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump_readme.txt
Actual download: ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.Z

halexand · 2017-04-26T00:04:06Z

Alright, so as I understand them from reading over the files quickly...

nodes.dmp: for any given tax ID it provides the information to identify the 'parental' tax-id (in addition to other things e.g. rank level for the node id)
names.dmp: gives the actual name for the tax ID.

So, if we take the example of Prochlorococcus marinus (Tax ID 1041938). The look up in names.dmp provides the different strains of Prochlorococcus. Looking up in nodes.dmp provides the genus id 1218 which if you look up in names.dmp is Prochlorococcus.

One problem/difficulty that immediately jumps out at me is that there are multiple different entries for a given tax id sometimes. Because why choose just one???

ctb · 2017-05-13T13:09:19Z

Asked Zach Foster about what kind of info and output would be most useful for packages like metacoder.

ctb · 2017-05-14T15:29:32Z

talking to @meren to find out what they need for anvi'o and to see if we can swipe some code, too.

meren · 2017-05-14T22:43:59Z

I've been using something like this to build my own database here (just isolated this function from a larger context):

import sys
import requests

from xml.etree import ElementTree

taxonomy = ['superkingdom', 'phylum', 'order', 'class', 'family', 'genus', 'species']

def get_lineage(tax_id):
    response = requests.get('http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=taxonomy&id=%s' % str(tax_id))

    tree = ElementTree.fromstring(response.content)

    lineage = dict(zip(taxonomy, [None] * len(taxonomy)))

    try:
        taxon = tree.findall('Taxon')[0]
    except:
        return lineage

    for e in taxon.findall('LineageEx')[0].findall('Taxon'):
        rank = e.find('Rank').text
        if rank in lineage:
            lineage[rank] = e.find('ScientificName').text.replace(' ', '_')

    return lineage

if __name__ == '__main__':
    print(get_lineage(sys.argv[1]))

Here:

$ python get_lineage.py 665953
{'superkingdom': 'Bacteria', 'family': 'Bacteroidaceae', 'class': 'Bacteroidia', 'phylum': 'Bacteroidetes', 'species': 'Bacteroides_eggerthii', 'genus': 'Bacteroides', 'order': 'Bacteroidales'}

This is just to play, of course. There are much better ways to deal with this issue in general. But requires lots of communication and logistics :/

zachary-foster · 2017-05-15T22:34:55Z

Hi @ctb, thanks for asking.

what kind of info and output would be most useful for packages like metacoder.

Currently, metacoder is best at parsing any text-based format that is 1) sequential and 2) encodable by regular expressions. By these terms I mean:

Repeating units in the same format representing taxonomic classifications and associated information.
All pieces of information that should be preserved during parsing can be defined by capture groups (parentheses) in regular expressions.

For examples of this, I suggest looking at our parsing tutorial: https://grunwaldlab.github.io/metacoder_documentation/vignettes--01--extracting_taxonomy_data.html

However, these examples are for parsing FASTA headers, not tables of information.
I dont know what kind of information you are looking to plot or how you want your output files to be formatted yet.
If they are tables, then there is a function called parse_taxonomy_table, that is just a wrapper for extract_taxonomy, that will get the taxonomic information for each row from one column and save the rest of the data in the parsed object and associate it with the taxonomy. That column could have nearly any format, but the easiest to parameterize would be something like:

Fungi;Ascomycota;Leotiomycetes;Helotiales...
or

k__Fungi;p__Ascomycota;c__Leotiomycetes;o__Helotiales...

I expect that I will be rewriting parse_taxonomy_table in the near future to be simpler and more flexible. When I do that I plan on adding convenience wrappers for parse_taxonomy_table that parse common formats. I could easily add one for sourmash once I get an idea of what output format you decide on. I expect that whatever format you choose, metacoder will be able to parse it, but some formats will be easier for users to figure out the correct parameters than others. The simplest would be a table with one column that has classifications like Fungi;Ascomycota;Leotiomycetes;Helotiales and any number of other columns with any kind of information you like. The data from the other columns will be available in the parsed object for plotting/filtering/whatever.

Hope that helps. Feel free to ask any questions if something does not make sense. I could say more if I had a better idea of the information you want to output.

taylorreiter · 2017-05-19T20:13:22Z

RE krona & NCBI taxonomy numbers:

I think supporting krona output is not necessarily the way to go. Perhaps instead outputting NCBI taxonomy number would be a better use of time, so that it's easy for people to feed their results in to a lot of downstream analyses, not just krona.

I like the Krona visualization, but it is glitchy with large-ish datasets (RNAseq, 4M reads).

Originally posted #174

ctb · 2017-05-28T17:29:22Z

ok given that we want to do this for 100,000 entries I'm going to mash up @halexand and @meren suggestions.

On MSU HPC under /mnt/home/ctb/research/tax, I have code to produce a leveldb of the access2taxid files (acc2taxid-to-leveldb.py) and under the subdir taxdump the following script seems to roughly work --

 % python -i parse.py nodes.dmp names.dmp
('Shewanella baltica OS223', '', 'scientific name')
('Shewanella baltica', '', 'scientific name')
('Shewanella', '', 'scientific name')
('Shewanellaceae', '', 'scientific name')
('Alteromonadales', '', 'scientific name')
('Gammaproteobacteria', '', 'scientific name')
('Proteobacteria', '', 'scientific name')
('Bacteria', 'Bacteria <prokaryotes>', 'scientific name')
('cellular organisms', '', 'scientific name')

My current plan is to:

finish producing the leveldb for the WGS and Genbank Nucleotide databases, so that we can quickly look up the taxid for a given acc;
make some form of db dump (maybe another leveldb?) for the much smaller nodes.dmp and names.dmp files;
extract all the accessions from our current genbank signatures;
write a CSV file that lets us take accession -> lineage like @meren's code above;
produce an ad hoc script that may suit our present needs for going from gather/search -> lineage;

...and then we'll at least have the code to annotate the signatures when we figure out what the right metadata approach is. Probably this will end up in the utils submodule as a set of extra commands...

@taylorreiter what's the input format for krona again? thx!

ctb · 2017-05-29T03:30:09Z

ok, I can now get output like this:

CP001056,935198,Bacteria;Firmicutes;Clostridia;Clostridiales;Clostridiaceae;Clostridium;Clostridium botulinum
CP001072,512562,Bacteria;Proteobacteria;Epsilonproteobacteria;Campylobacterales;Helicobacteraceae;Helicobacter;Helicobacter pylori
CP000805,455434,Bacteria;Spirochaetes;Spirochaetia;Spirochaetales;Spirochaetaceae;Treponema;Treponema pallidum
CP000934,498211,Bacteria;Proteobacteria;Gammaproteobacteria;Cellvibrionales;Cellvibrionaceae;Cellvibrio;Cellvibrio japonicus
CU469464,37692,Bacteria;Tenericutes;Mollicutes;Acholeplasmatales;Acholeplasmataceae;Candidatus Phytoplasma;Candidatus Phytoplasma mali
CP001132,380394,Bacteria;Proteobacteria;Acidithiobacillia;Acidithiobacillales;Acidithiobacillaceae;Acidithiobacillus;Acidithiobacillus ferrooxidans
CP000964,507522,Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Klebsiella;Klebsiella pneumoniae

yay w00t.

ctb · 2017-05-29T14:15:02Z

too much detail but my memory is weak:

# in genbank SBT directory: extract accession info from genbank
find . \! -name \*internal\* -exec grep -h \"name\" {} \; > ../genbank-names.txt

# in /mnt/home/ctb/research/tax: connect accession info to taxid using REALLY BIG files
python acc2taxid.py ../genbank-k31-acc.txt nucl_gb.accession2taxid nucl_wgs.accession2taxid.gz 

# in /mnt/home/ctb/research/tax/taxdump: output CSV file with accession, taxid, and ';'-separated lineage
python do-the-csv.py nodes.dmp names.dmp ../../genbank-k31-acc.txt.taxid

ctb · 2017-05-29T14:59:44Z

CSV file genbank-genomes-accession+lineage-20170529.csv.gz containing 93,209 accessions + taxid + lineage in this format:

CP001132,380394,Bacteria;Proteobacteria;Acidithiobacillia;Acidithiobacillales;Acidithiobacillaceae;Acidithiobacillus;Acidithiobacillus ferrooxidans
CP000964,507522,Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Klebsiella;Klebsiella pneumoniae

now available at:

https://osf.io/28ymr/

Note that approximately 40 accessions in our genbank SBTs do not seem to appear in either nucl_wgs.accession2taxid or nucl_gb.accession2taxid and so are not represented in this CSV file. shrug

ctb · 2017-05-29T15:00:27Z

We'll work on utilities to extract and correlate this info with sourmash output in useful ways. In the meantime special thx to @meren for providing some useful code!

ctb · 2017-05-29T15:07:46Z

code in https://github.com/dib-lab/2017-ncbi-taxdump

ctb · 2017-06-03T19:04:53Z

A few more thoughts on taxonomy:

in the metadata block for signatures generated from NCBI, we should definitely have a way to include accession(s) of included sequences and taxid(s) of included sequences. I do now think we should also provide an 'ncbi-tax-lineage' entry that encodes something standard like Bacteria;Proteobacteria;Acidithiobacillia;Acidithiobacillales;Acidithiobacillaceae;Acidithiobacillus;Acidithiobacillus ferrooxidans because this really is unlikely to change and is quite useful to have ready to hand.

ctb · 2017-08-26T21:21:35Z

We've made quite a bit of headway on this issue, most recently with the sourmash LCA stuff (see gist and ncbi_taxdump_utils specifically. Just wanted to link that in here :)

ctb · 2017-09-10T13:03:45Z

metacodeR viz working: http://ivory.idyll.org/blog/2017-classify-genome-bins-with-custom-db-part-1.html

ctb · 2017-09-28T18:57:14Z

a tutorial for metacodeR analysis on gather output: https://hackmd.io/EYMwTAhgHGwMwFoCmcAsqGoAwGMwIggE4ATBKHCVANiSxMpziA==

ctb · 2018-02-18T02:36:19Z

sourmash lca and https://github.com/dib-lab/2018-ncbi-lineages now handle various bits of this. sourmash lca is the way forward for taxonomy in sourmash, at least for now.

ctb mentioned this issue May 19, 2017

what's needed for a 2.0 release? #174

Closed

taylorreiter mentioned this issue May 19, 2017

support krona output? #178

Closed

This was referenced Jun 1, 2017

Provide a utility for human investigation of taxonomic results #265

Open

Potential modifications to signature format for major releases #268

Open

ctb mentioned this issue Jun 7, 2017

Possible inconsistency in sbt_gather output #275

Closed

ctb closed this as completed Feb 18, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Metadata - Add taxonomic breakdown to sourmash output #195

Metadata - Add taxonomic breakdown to sourmash output #195

brooksph commented Apr 25, 2017 •

edited

Loading

ctb commented Apr 25, 2017 via email

luizirber commented Apr 25, 2017

halexand commented Apr 25, 2017

halexand commented Apr 26, 2017

ctb commented May 13, 2017

ctb commented May 14, 2017

meren commented May 14, 2017

zachary-foster commented May 15, 2017 •

edited

Loading

taylorreiter commented May 19, 2017

ctb commented May 28, 2017

ctb commented May 29, 2017

ctb commented May 29, 2017

ctb commented May 29, 2017

ctb commented May 29, 2017

ctb commented May 29, 2017

ctb commented Jun 3, 2017

ctb commented Aug 26, 2017

ctb commented Sep 10, 2017

ctb commented Sep 28, 2017

ctb commented Feb 18, 2018

Metadata - Add taxonomic breakdown to sourmash output #195

Metadata - Add taxonomic breakdown to sourmash output #195

Comments

brooksph commented Apr 25, 2017 • edited Loading

ctb commented Apr 25, 2017 via email

luizirber commented Apr 25, 2017

halexand commented Apr 25, 2017

halexand commented Apr 26, 2017

ctb commented May 13, 2017

ctb commented May 14, 2017

meren commented May 14, 2017

zachary-foster commented May 15, 2017 • edited Loading

taylorreiter commented May 19, 2017

ctb commented May 28, 2017

ctb commented May 29, 2017

ctb commented May 29, 2017

ctb commented May 29, 2017

ctb commented May 29, 2017

ctb commented May 29, 2017

ctb commented Jun 3, 2017

ctb commented Aug 26, 2017

ctb commented Sep 10, 2017

ctb commented Sep 28, 2017

ctb commented Feb 18, 2018

brooksph commented Apr 25, 2017 •

edited

Loading

zachary-foster commented May 15, 2017 •

edited

Loading