Implement more/better revindex functionality on top of LCA databases. #581

Closed
ctb opened this issue Dec 19, 2018 · 8 comments
Labels
revisit_me An issue that needs attention and clarification

Comments

@ctb
Contributor

ctb commented Dec 19, 2018

With the merge of #533, the LCA databases (should) now contain the full set of hashes that SBTs do: previously, LCA DBs were buggy and contained a somewhat random collection of hashes that were connected with taxonomic IDs, but now they have everything whether or not it has a tax id. There's a lot of opportunity to connect back to the revindex code that @halexand and I have been using for various things, viz https://github.com/ctb/2017-sourmash-revindex, and make that code better/more complete/more usable/integrated into sourmash, either as an extension or directly in sourmash.

Just a thought :)

@ctb
Contributor Author

ctb commented Jan 7, 2019

more thoughts, based on prodding from @bluegenes to think about renaming "LCA databases" :)

I think the right medium term thing to do is:

  • split LCA databases into "revindex" and "taxinfo";
  • rename LCA database format to revindex;
  • (somehow) enable the provision of taxinfo to sourmash search and sourmash gather routines, so that we can add taxinfo-based output and analyses on top of any operation;

but I don't want to delay 2.0 for this; I think it's a 3.0-ish kind of thing.
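To make the proposed split concrete, here is a minimal sketch, not sourmash's actual implementation. The `RevIndex` class, the `annotate` helper, and the plain-dict taxinfo are all hypothetical names invented for illustration: the revindex only maps hash values to signature indices, and the taxinfo is a separate, optional mapping from signature index to lineage that can be layered on top of any lookup.

```python
from collections import defaultdict


class RevIndex:
    """Hypothetical reverse index: hashval -> set of signature indices."""

    def __init__(self):
        self.hashval_to_idx = defaultdict(set)
        self.idx_to_name = {}

    def insert(self, name, hashes):
        # Assign the next free integer index to this signature.
        idx = len(self.idx_to_name)
        self.idx_to_name[idx] = name
        for hashval in hashes:
            self.hashval_to_idx[hashval].add(idx)
        return idx

    def lookup(self, hashval):
        # Use .get so that lookups do not create empty entries.
        return self.hashval_to_idx.get(hashval, set())


def annotate(revindex, taxinfo, query_hashes):
    """Count query-hash hits per lineage.

    `taxinfo` is any mapping idx -> lineage; signatures without
    tax info are counted under the key None.
    """
    counts = defaultdict(int)
    for hashval in query_hashes:
        for idx in revindex.lookup(hashval):
            counts[taxinfo.get(idx)] += 1
    return dict(counts)
```

Because the taxinfo is just a mapping keyed by signature index, the same revindex could in principle be paired with different taxonomies, or with none at all.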

@ctb
Contributor Author

ctb commented Jan 16, 2019

Some more thoughts!

If we make taxinfo databases include a taxonomic hierarchy as well (see taxlist in sourmash/lca/lca_utils.py) then we can make this work with LINS as well as AQU's taxonomy and NCBI's taxonomy.
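As a rough sketch of that idea (the `TaxInfo` class below is hypothetical and is not the `taxlist` code in sourmash), a taxinfo object could carry its own ordered list of ranks, so that lineage lookups never hard-code one taxonomy's hierarchy:

```python
from dataclasses import dataclass, field


@dataclass
class TaxInfo:
    """Hypothetical taxinfo store that declares its own rank hierarchy."""

    ranks: tuple  # ordered from most general to most specific
    lineages: dict = field(default_factory=dict)  # idx -> tuple of names

    def assign(self, idx, names):
        if len(names) > len(self.ranks):
            raise ValueError("lineage is deeper than the declared hierarchy")
        self.lineages[idx] = tuple(names)

    def at_rank(self, idx, rank):
        # Position of the rank within this taxonomy's own hierarchy.
        pos = self.ranks.index(rank)
        lineage = self.lineages.get(idx, ())
        return lineage[pos] if pos < len(lineage) else None


# An NCBI-style hierarchy (truncated for brevity).
ncbi = TaxInfo(ranks=("superkingdom", "phylum", "class"))
ncbi.assign(0, ("Bacteria", "Proteobacteria"))

# A made-up positional scheme with different rank names and depth.
custom = TaxInfo(ranks=("A", "B", "C", "D"))
custom.assign(0, ("1", "0", "4", "2"))
```

The rank names and the positional scheme above are illustrative assumptions only; the point is that the hierarchy travels with the data.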

@ctb
Contributor Author

ctb commented Apr 18, 2020

Digging into the revindex code, it looks like the three main pieces of functionality we'd want in LCA_Database are:

we might also be interested in incorporating code from https://github.com/ctb/2017-sourmash-revindex/blob/master/classify-common-hashes.py to do some kind of full-database classification.

@luizirber
Member

luizirber commented Apr 18, 2020

tracking abundances, per https://github.com/ctb/2017-sourmash-revindex/blob/master/extract-hashvals-by-sample.py

An initial way of doing this would be storing (idx, abund) for each hashval. Not terribly efficient, but...

(long term research project: go super fancy and implement a REINDEER index, but not today =])
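A minimal sketch of the "store (idx, abund) for each hashval" approach, assuming a plain in-memory dictionary; the `AbundRevIndex` class and its method names are hypothetical and do not correspond to existing sourmash code:

```python
from collections import defaultdict


class AbundRevIndex:
    """Hypothetical index: hashval -> list of (idx, abund) pairs."""

    def __init__(self):
        self.hashval_to_hits = defaultdict(list)

    def insert(self, idx, hash_abunds):
        # hash_abunds maps hashval -> abundance for one signature;
        # each signature index is expected to be inserted once.
        for hashval, abund in hash_abunds.items():
            self.hashval_to_hits[hashval].append((idx, abund))

    def abundances(self, hashval):
        # Return {idx: abund} for every signature containing hashval.
        return dict(self.hashval_to_hits.get(hashval, []))
```

This stores one pair per (hash, signature) occurrence, so memory grows with the total number of hashes across all signatures, which is the inefficiency noted above.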

@ctb
Contributor Author

ctb commented Jun 6, 2020

#1013 adds protein, dayhoff, and hp signature indexing in LCA databases.

@ctb
Contributor Author

ctb commented Jun 7, 2020

#1015 adds abundance tracking into LCA databases. (CLOSED / NOT MERGED)

@ctb
Contributor Author

ctb commented Jan 26, 2022

it would be easy to support storing abundances in SqliteIndex ref #1808, but we may decide not to do it just yet.

@ctb
Contributor Author

ctb commented Sep 23, 2023

this idea lives on in mastiff, and also we've moved away from recommending LCA based approaches for taxonomy. So I'm closing.

ref #2760 for arguments in favor of using gather+tax rather than LCA.

ctb closed this as completed Sep 23, 2023
2 participants