
consider ways to improve speed of LCA database #821

Closed
ctb opened this issue Jan 5, 2020 · 7 comments · Fixed by #1808

Comments

@ctb (Contributor) commented Jan 5, 2020

Right now LCA databases are loaded into memory from a JSON file, but loading big JSON files (especially in Python) is slow and causes significant delay at startup. It'd be nice to have an on-disk option, like with SBTs.

I have been thinking about using something like a SQLite database to support the basic "hash -> idx" lookup, since that is the bulk of the data AFAIK. Or we could use some other on-disk dictionary structure, but of course that gets slow for big databases.

This could be implemented as an optional caching mechanism, too, but that could add a lot of nasty code.
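A minimal sketch of the kind of SQLite-backed "hash -> idx" lookup described above, using only the stdlib `sqlite3` module. The table and column names here are hypothetical illustrations, not sourmash's actual schema:

```python
import sqlite3

# Hypothetical schema: one row per (hash, idx) pair, with an index on the
# hash column so lookups don't require a full table scan.
conn = sqlite3.connect(":memory:")  # would be an on-disk file in practice
conn.execute("CREATE TABLE hash_idx (hashval INTEGER, idx INTEGER)")
conn.execute("CREATE INDEX hash_idx_hashval ON hash_idx (hashval)")

conn.executemany("INSERT INTO hash_idx VALUES (?, ?)",
                 [(10, 1), (20, 2), (10, 3)])

def get_idx(hashval):
    "Return all dataset idx values associated with a single hash."
    cursor = conn.execute("SELECT idx FROM hash_idx WHERE hashval = ?",
                          (hashval,))
    return [row[0] for row in cursor]

print(get_idx(10))   # hash 10 maps to two datasets
print(get_idx(99))   # missing hash -> empty list
```

Because the data lives on disk and the index is consulted per query, startup cost is near zero regardless of database size, which is the main win over parsing a large JSON file up front.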

@luizirber (Member) commented:

> Right now LCA databases are loaded into memory from a JSON file, but loading big JSON files (esp in Python) is slow and causes significant delay at startup. It'd be nice to have an on disk option like with SBTs.

One good starting point is verifying how much of the information we have to hold in memory, and exactly what it is. I have the impression that a lot of the lineage info is redundant.

> I have been thinking about using something like a SQLite database to support the basic "hash -> idx" lookup, since that is the bulk of the data AFAIK.

That only makes things faster if we optimize the querying (probably by writing SQL?), because if we query hash by hash... it's going to be slow anyway, no?

@ctb (Contributor, author) commented Jan 7, 2020 via email

@luizirber (Member) commented Jan 7, 2020

No, I don't think it is... unless you mean that we store each lineage in full, rather than doing an NCBI-taxonomy-like breakdown. I don't think that's a major chunk of data but I guess I could be wrong. Have to think about how to most easily measure it.
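To illustrate the redundancy point: storing each lineage in full repeats shared prefixes once per genome, while an NCBI-taxonomy-like breakdown stores each taxon once with a parent pointer. A toy sketch (all names and IDs invented for illustration):

```python
# Full lineages: the shared prefix ("Bacteria", "Proteobacteria") is stored
# once per genome, which is the redundancy in question.
full = {
    "genome1": ("Bacteria", "Proteobacteria", "Escherichia"),
    "genome2": ("Bacteria", "Proteobacteria", "Salmonella"),
}

# NCBI-style breakdown: each taxon appears exactly once, with a parent id;
# genomes point at a single node instead of carrying the whole lineage.
nodes = {
    1: ("Bacteria", None),
    2: ("Proteobacteria", 1),
    3: ("Escherichia", 2),
    4: ("Salmonella", 2),
}
assignments = {"genome1": 3, "genome2": 4}

def lineage(node_id):
    "Reconstruct a full lineage by walking parent pointers back to the root."
    out = []
    while node_id is not None:
        name, node_id = nodes[node_id]
        out.append(name)
    return tuple(reversed(out))

print(lineage(3))  # -> ('Bacteria', 'Proteobacteria', 'Escherichia')
```

Whether the savings matter in practice depends on how deep and how shared the lineages are, which is exactly the measurement question raised above.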

I was checking https://stackoverflow.com/a/40880923; it fails with CFFI objects (the MinHash), but that should be easy to fix.
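The linked answer is a recursive `sys.getsizeof`. A sketch of the easy fix, assuming the failure surfaces as a `TypeError` from `sys.getsizeof` (the exact failure mode with CFFI objects may differ); unmeasurable objects are skipped instead of aborting the whole traversal:

```python
import sys

def deep_getsizeof(obj, seen=None):
    """Recursively estimate an object's memory footprint, skipping anything
    sys.getsizeof cannot measure rather than propagating the error."""
    if seen is None:
        seen = set()
    if id(obj) in seen:      # avoid double-counting shared objects
        return 0
    seen.add(id(obj))
    try:
        size = sys.getsizeof(obj)
    except TypeError:        # e.g. an object getsizeof refuses to measure
        return 0
    if isinstance(obj, (str, bytes)):
        return size          # don't recurse into individual characters
    if isinstance(obj, dict):
        size += sum(deep_getsizeof(k, seen) + deep_getsizeof(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(deep_getsizeof(x, seen) for x in obj)
    return size

print(deep_getsizeof({"hashes": [10, 20, 30]}))  # includes keys and elements
```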

for things like lca classify and lca summarize, it seems like 99.9% of the current time is spent loading the JSON file. My intuition is that shifting 10% of the processing to SQLite is still a win :).

I have a solution for fast JSON file parsing, but you might not like the other consequences 🦀

More seriously, this is also what I've been seeing with signature parsing: benchmarks are trending toward signature loading taking most of the time now. This should be better after #532 lands, but that doesn't benefit the LCA code much.

@ctb (Contributor, author) commented Apr 23, 2020

With the API refactoring in #946, all this is a lot clearer to me now - the creation/saving/loading is pretty separate, as are the hooks needed to engage with queries. We could start to provide different backends pretty easily, I think. I have more (and better) experience with SQLite than with other on-disk stores, but in any case I think the value of such a refactoring would be in more cleanly separating out the storage API for LCA databases.

We may want to provide a way to store lineages differently or separately (raised in #948, for Rust reasons; but also here, because maybe SBT databases could benefit from taxonomy info). One idea is to have a separate "taxinfo" object that can be part of both LCA databases and SBTs, but could also be provided separately (as e.g. a spreadsheet, as in sourmash lca index). This would permit various kinds of taxonomic output from searches more generally.

ref also: oxidize the LCA database saving/loading (#948), as well as #909, one of the motivating issues for improving speed.

@ctb (Contributor, author) commented Jan 26, 2022

The early results of implementing SqliteIndex are ...very promising; see #1808. For GTDB genomic-reps, the SQLite database is approximately the same size as an .sbt.zip file, with very fast loading time and minimal memory usage. Query time via the reverse index is twice as fast as a linear search of a .zip file; I haven't yet compared it to the .sbt.zip (but memory usage will surely be an improvement!)

@ctb (Contributor, author) commented Apr 6, 2022

I have what appears to be a fully functional LCA database implementation in SQLite - see #1933 (comment) - that seems fast and should support multiple concurrent read-only clients. cc @phiweger.

@ctb (Contributor, author) commented Apr 6, 2022

ref #1930 for motivation/discussion.
