
consider ways to improve speed of LCA database #821

Closed
ctb opened this issue Jan 5, 2020 · 7 comments · Fixed by #1808

Comments

@ctb (Contributor) commented Jan 5, 2020

Right now LCA databases are loaded into memory from a JSON file, but loading big JSON files (especially in Python) is slow and causes significant delay at startup. It'd be nice to have an on-disk option, like with SBTs.

I have been thinking about using something like a SQLite database to support the basic "hash -> idx" lookup, since that is the bulk of the data AFAIK. Or we could use some other on-disk dictionary structure, but of course that gets slow for big databases.

This could be implemented as an optional caching mechanism, too, but that could add a lot of nasty code.
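A minimal sketch of the kind of SQLite-backed "hash -> idx" lookup described above, using only the stdlib `sqlite3` module. The table and column names here are hypothetical illustrations, not sourmash's actual schema:

```python
import sqlite3

# Hypothetical schema: one row per (hash, idx) pair, with an index on the
# hash column so lookups don't require a full table scan.
conn = sqlite3.connect(":memory:")  # would be an on-disk file in practice
conn.execute("CREATE TABLE hash_idx (hashval INTEGER, idx INTEGER)")
conn.execute("CREATE INDEX hash_idx_hashval ON hash_idx (hashval)")

conn.executemany("INSERT INTO hash_idx VALUES (?, ?)",
                 [(10, 1), (20, 2), (10, 3)])

def get_idx(hashval):
    "Return all dataset idx values associated with a single hash."
    cursor = conn.execute("SELECT idx FROM hash_idx WHERE hashval = ?",
                          (hashval,))
    return [row[0] for row in cursor]

print(get_idx(10))   # hash 10 maps to two datasets
print(get_idx(99))   # missing hash -> empty list
```

Because the data lives on disk and the index is consulted per query, startup cost is near zero regardless of database size, which is the main win over parsing a large JSON file up front.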

@luizirber (Member) commented:

> Right now LCA databases are loaded into memory from a JSON file, but loading big JSON files (esp in Python) is slow and causes significant delay at startup. It'd be nice to have an on disk option like with SBTs.

One good starting point is verifying how much of the information we have to hold in memory, and exactly what it is. I have the impression that a lot of the lineage info is redundant.

> I have been thinking about using something like a SQLite database to support the basic "hash -> idx" lookup, since that is the bulk of the data AFAIK.

That only makes things faster if we optimize the querying (probably by writing SQL?), because if we query hash by hash... it's going to be slow anyway, no?

@ctb (Contributor, author) commented Jan 7, 2020 via email

@luizirber (Member) commented Jan 7, 2020

No, I don't think it is... unless you mean that we store each lineage in full, rather than doing an NCBI-taxonomy-like breakdown. I don't think that's a major chunk of data but I guess I could be wrong. Have to think about how to most easily measure it.
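To illustrate the redundancy point: storing each lineage in full repeats shared prefixes once per genome, while an NCBI-taxonomy-like breakdown stores each taxon once with a parent pointer. A toy sketch (all names and IDs invented for illustration):

```python
# Full lineages: the shared prefix ("Bacteria", "Proteobacteria") is stored
# once per genome, which is the redundancy in question.
full = {
    "genome1": ("Bacteria", "Proteobacteria", "Escherichia"),
    "genome2": ("Bacteria", "Proteobacteria", "Salmonella"),
}

# NCBI-style breakdown: each taxon appears exactly once, with a parent id;
# genomes point at a single node instead of carrying the whole lineage.
nodes = {
    1: ("Bacteria", None),
    2: ("Proteobacteria", 1),
    3: ("Escherichia", 2),
    4: ("Salmonella", 2),
}
assignments = {"genome1": 3, "genome2": 4}

def lineage(node_id):
    "Reconstruct a full lineage by walking parent pointers back to the root."
    out = []
    while node_id is not None:
        name, node_id = nodes[node_id]
        out.append(name)
    return tuple(reversed(out))

print(lineage(3))  # -> ('Bacteria', 'Proteobacteria', 'Escherichia')
```

Whether the savings matter in practice depends on how deep and how shared the lineages are, which is exactly the measurement question raised above.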

I was checking https://stackoverflow.com/a/40880923; it fails with CFFI objects (the MinHash), but that should be easy to fix.
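The linked answer is a recursive `sys.getsizeof`. A sketch of the easy fix, assuming the failure surfaces as a `TypeError` from `sys.getsizeof` (the exact failure mode with CFFI objects may differ); unmeasurable objects are skipped instead of aborting the whole traversal:

```python
import sys

def deep_getsizeof(obj, seen=None):
    """Recursively estimate an object's memory footprint, skipping anything
    sys.getsizeof cannot measure rather than propagating the error."""
    if seen is None:
        seen = set()
    if id(obj) in seen:      # avoid double-counting shared objects
        return 0
    seen.add(id(obj))
    try:
        size = sys.getsizeof(obj)
    except TypeError:        # e.g. an object getsizeof refuses to measure
        return 0
    if isinstance(obj, (str, bytes)):
        return size          # don't recurse into individual characters
    if isinstance(obj, dict):
        size += sum(deep_getsizeof(k, seen) + deep_getsizeof(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(deep_getsizeof(x, seen) for x in obj)
    return size

print(deep_getsizeof({"hashes": [10, 20, 30]}))  # includes keys and elements
```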

for things like lca classify and lca summarize, it seems like 99.9% of the current time is spent loading the JSON file. My intuition is that shifting 10% of the processing to SQLite is still a win :).

I have a solution for fast JSON file parsing, but you might not like the other consequences 🦀

More seriously, this is also what I've been seeing with signature parsing: benchmarks are trending toward signature loading taking most of the time now. This should be better after #532 lands, but that doesn't benefit the LCA code much.

@ctb (Contributor, author) commented Apr 23, 2020

With the API refactoring in #946, all this is a lot clearer to me now - the creation/saving/loading is pretty separate, as are the hooks needed to engage with queries. We could start to provide different backends pretty easily, I think. I have more (and better) experience with SQLite than with other on-disk stores, but in any case I think the value of such a refactoring would be in more cleanly separating out the storage API for LCA databases.

We may want to provide a way to store lineages differently or separately (raised in #948, for Rust reasons; but also here, because maybe SBT databases could benefit from taxonomy info). One idea is to have a separate "taxinfo" object that can be part of both LCA databases and SBTs, but could also be provided separately (as e.g. a spreadsheet, as in sourmash lca index). This would permit various kinds of taxonomic output from searches more generally.

ref also: oxidize the LCA database saving/loading (#948), as well as #909, one of the motivating issues for improving speed.

@ctb (Contributor, author) commented Jan 26, 2022

The early results of implementing SqliteIndex are ...very promising; see #1808. For GTDB genomic-reps, the SQLite database is approximately the same size as an .sbt.zip file, with very fast loading time and minimal memory usage. Query time via the reverse index is twice as fast as a linear search of a .zip file; I haven't yet compared it to the .sbt.zip (but memory usage will surely be an improvement!)

@ctb (Contributor, author) commented Apr 6, 2022

I have what appears to be a fully functional LCA database implementation in SQLite - see #1933 (comment) - that seems fast and should support multiple concurrent read-only clients. cc @phiweger.

@ctb (Contributor, author) commented Apr 6, 2022

ref #1930 for motivation/discussion.
