consider ways to improve speed of LCA database #821
Comments
One good starting point is verifying how much of the information we have to hold in memory, and exactly what it is. I have the impression that a lot of the lineage info is redundant?
That only makes things faster if we optimize the querying (probably writing SQL?), because if we query hash by hash... it's going to be slow anyway, no?
On Mon, Jan 06, 2020 at 03:26:32PM -0800, Luiz Irber wrote:
>> Right now LCA databases are loaded into memory from a JSON file, but loading big JSON files (esp. in Python) is slow and causes significant delay at startup. It'd be nice to have an on-disk option like with SBTs.
> One good starting point is verifying how much of the information we have to hold in memory, and exactly what it is. I have the impression that a lot of the lineage info is redundant?

No, I don't think it is... unless you mean that we store each lineage in full, rather than doing an NCBI-taxonomy-like breakdown. I don't think that's a major chunk of data, but I guess I could be wrong. Have to think about how to most easily measure it.

>> I have been thinking about using something like a SQLite database to support the basic "hash -> idx" lookup, since that is the bulk of the data AFAIK.
> That only makes things faster if we optimize the querying (probably writing SQL?), because if we query hash by hash... it's going to be slow anyway, no?

For things like lca classify and lca summarize, it seems like 99.9% of the current time is spent loading the JSON file. My intuition is that shifting 10% of the processing to SQLite is still a win :).
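To make the trade-off concrete, here is a minimal sketch of a SQLite-backed "hashval -> idx" table with both per-hash and batched lookups. The schema and function names are hypothetical, not sourmash's actual implementation.

```python
import sqlite3

def build_hash_to_idx(db_path, pairs):
    """Create a hashval -> idx table from an iterable of (hashval, idx) pairs.

    Note: Python's sqlite3 binds signed 64-bit integers, so unsigned 64-bit
    hash values above 2**63 - 1 would need to be offset or stored as text.
    """
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS hash_to_idx (hashval INTEGER, idx INTEGER)")
    conn.execute("CREATE INDEX IF NOT EXISTS hash_to_idx_hashval ON hash_to_idx (hashval)")
    conn.executemany("INSERT INTO hash_to_idx (hashval, idx) VALUES (?, ?)", pairs)
    conn.commit()
    return conn

def get_idxs(conn, hashval):
    """One query per hash value -- simple, but slow if done hash by hash."""
    cursor = conn.execute("SELECT idx FROM hash_to_idx WHERE hashval = ?", (hashval,))
    return [row[0] for row in cursor]

def get_idxs_many(conn, hashvals):
    """Batch lookup to amortize per-query overhead.

    Very long hash lists may need chunking, since SQLite limits the number
    of bound parameters per statement.
    """
    hashvals = list(hashvals)
    placeholders = ",".join("?" * len(hashvals))
    cursor = conn.execute(
        f"SELECT hashval, idx FROM hash_to_idx WHERE hashval IN ({placeholders})",
        hashvals,
    )
    return list(cursor)
```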
I was checking https://stackoverflow.com/a/40880923; it fails with CFFI objects (the MinHash), but that should be easy to fix.
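For illustration, that kind of recursive size measurement can be adapted to skip non-traversable objects; a rough sketch in the spirit of the linked answer (not code from the project) is:

```python
import gc
import sys

def total_size(obj, skip_types=()):
    """Approximate the in-memory footprint of an object graph.

    Walks referents with gc.get_referents() and sums sys.getsizeof() over
    unique objects; anything in skip_types (e.g. a CFFI-backed MinHash type)
    is skipped, since getsizeof() can't see its native allocations anyway.
    """
    seen = set()
    size = 0
    stack = [obj]
    while stack:
        o = stack.pop()
        if id(o) in seen or isinstance(o, skip_types):
            continue
        seen.add(id(o))
        size += sys.getsizeof(o)
        stack.extend(gc.get_referents(o))
    return size
```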
I have a solution for fast JSON file parsing, but you might not like the other consequences 🦀 More seriously, this is also what I've been seeing with signature parsing: benchmarks are trending toward signature loading taking most of the time now. It should be better after #532 lands, but that doesn't benefit the LCA code that much.
With the API refactoring in #946, all this is a lot clearer to me now - the creation/saving/loading is pretty separate, as are the hooks needed to engage with queries. We could start to provide different backends pretty easily, I think. I have more (and better) experience with SQLite than with other on-disk stores, but in any case I think the value of such a refactoring would be in more cleanly separating out the storage API for LCA databases.

We may want to provide a way to store lineages differently or separately (raised in #948, for Rust reasons; but also here because maybe SBT databases could benefit from taxonomy info). One idea is to have a separate "taxinfo" object that can be part of both LCA databases and SBTs, but could also be provided separately (as e.g. a spreadsheet).

ref also: oxidize the LCA database saving/loading, #948, as well as #909, one of the motivating issues for improving speed.
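A rough sketch of the kind of separation described above - hypothetical class and method names, not sourmash's actual API - might look something like this:

```python
from abc import ABC, abstractmethod

class LcaDBBackend(ABC):
    """Storage backend responsible only for the hashval -> idx mapping."""

    @abstractmethod
    def get_idxs(self, hashval):
        "Return the dataset idxs associated with one hash value."

    @abstractmethod
    def save(self, path):
        "Write the mapping to disk (JSON, SQLite, ...)."

    @classmethod
    @abstractmethod
    def load(cls, path):
        "Open or load the mapping from disk."

class TaxInfo:
    """Lineage info kept separate, so SBTs (or a spreadsheet) could supply it too."""

    def __init__(self, idx_to_lineage):
        self.idx_to_lineage = idx_to_lineage

    def get_lineage(self, idx):
        return self.idx_to_lineage[idx]
```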
I have what appears to be a fully functional LCA database implementation in SQLite - see #1933 (comment) - that seems fast and should support multiple concurrent read-only clients. cc @phiweger.
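For illustration, the multiple-readers part comes essentially for free with SQLite: any number of clients can open the same database file read-only via the standard sqlite3 URI syntax. The filename below is a placeholder.

```python
import sqlite3

# Any number of processes can hold read-only connections to the same file;
# SQLite handles concurrent readers natively.
conn = sqlite3.connect("file:lca_db.sqlite?mode=ro", uri=True)
```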
ref #1930 for motivation/discussion. |
Right now LCA databases are loaded into memory from a JSON file, but loading big JSON files (esp. in Python) is slow and causes significant delay at startup. It'd be nice to have an on-disk option like with SBTs.
I have been thinking about using something like a SQLite database to support the basic "hash -> idx" lookup, since that is the bulk of the data AFAIK. Or we could use some other on-disk dictionary structure, but of course that gets slow for big databases.
This could be implemented as an optional caching mechanism, too, but that could add a lot of nasty code.
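For the "other on-disk dictionary structure" option mentioned above, one minimal sketch (using the standard library's shelve module; the filename and values are placeholders) would be:

```python
import shelve

# shelve keys must be strings, and every lookup goes through pickle, which
# is part of why this approach tends to slow down for big databases.
with shelve.open("hash_to_idx.shelf") as db:
    db[str(0x1234ABCD)] = [17, 42]   # hashval -> list of dataset idxs

with shelve.open("hash_to_idx.shelf", flag="r") as db:
    idxs = db.get(str(0x1234ABCD), [])
```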