other makeshift strategies for large scale database search - the "greyhound" issue #1226
Comments
ok, I implemented a simple script. For hu-s1 it completed in ~90 minutes and ~30 GB of RAM for the ~700k genomes from the latest genbank. The actual sourmash gather is still running, 5 days in, and it's only about 2/3 done 😆 Usage:
huzzah!
@luizirber said on slack: a couple of dishwashing1 hot takes:
1: dishwashing is great for deep thinking
is it? I'm not sure it is - this is the main content of an LCA DB, I think?
I implemented 1 + 4 + 5 in greyhound1, and tried it with some queries.

At a high level, greyhound is implemented in two steps: filtering and indexing (step 1) and gather (step 2). For step 1, the query is loaded into memory and then all the reference sigs are loaded in parallel and checked for intersection size with the query (a map operation). If they are above the threshold, they go into a counter keyed by reference, along with a reverse index from hashes to references. For step 2, while there are matches in the counter the top match is selected, and each reference in the counter has its count decreased if it has hashes in the top match (using the revindex to find which references to decrease). The 3rd step is doing stats/summaries properly, but for now I'm just printing the matches.

While this is already way faster than the other gather approaches (in LcaDB and SBTs), it could be even faster if some optimization is done in the …

Oh, and another hot take: I think there is space for a …

1: if smol is the tiniest and cutest gather, then the fastest gather should be greyhound, right? =P
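For readers following along, here is a minimal Python sketch of the counter + revindex scheme described above, using plain hash sets. It is illustrative only, not the actual (Rust) greyhound code; the function name, arguments, and threshold handling are made up for the example.

```python
from collections import Counter

def counter_gather(query_hashes, ref_hashes, threshold=0):
    """Greedy gather over plain hash sets (illustrative sketch only).

    query_hashes: set of query hashes.
    ref_hashes: dict mapping reference name -> set of hashes.
    threshold: minimum overlap with the query to keep a reference.
    """
    # step 1: counter of overlap sizes + reverse index hash -> references
    counter = Counter()
    revindex = {}
    for name, hashes in ref_hashes.items():
        overlap = query_hashes & hashes
        if len(overlap) > threshold:
            counter[name] = len(overlap)
            for h in overlap:
                revindex.setdefault(h, []).append(name)

    # step 2: repeatedly pick the best remaining match, then decrement the
    # counts of every reference sharing hashes with it (via the revindex)
    results = []
    remaining = set(query_hashes)
    while counter:
        best, count = counter.most_common(1)[0]
        if count == 0:
            break
        results.append((best, count))
        consumed = remaining & ref_hashes[best]
        for h in consumed:
            for name in revindex.get(h, ()):
                if name in counter:
                    counter[name] -= 1
        remaining -= consumed
        del counter[best]
    return results
```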
as a side note, I really like the simplicity of all our underlying algorithms. nothin' special, let's just reimplement everything 5 different ways to experiment... |
... and five years building tooling, tearing the software apart and rebuilding it again, shaving yaks, sharpening, automating, thinking a lot to find the simple algorithm, and finally be able to reorganize and move the whole ship quickly when needed =] |
...I wonder what deeply laid bug we'll find next? 🤪 |
The seeds of the yin are in the yang, every advance contains its own destruction =] |
I think this new strategy offers some nice output options not available with …
Yup, that's easy to output too (report after step 1, where the counter is already available). I was focusing on reporting after step 2 (gather).
related to previous comment, I expect "which genomes are relevant (by containment)" to be much less dependent on k-mer size than gather. this is relevant to stuff going on in genome-grist, where the accuracy of the hash matching rates sometimes seems quite sensitive to k-mer size (using mapping as a baseline with which to measure "accuracy"). so, we can do something like 'large scale containment search at k=whatever', and then use gather on just the resulting genomes across a variety of k-sizes to explore how gather results and mapping rates depend on k-mer size.
do you think the sourmash gather CLI should take …
idle musings: it seems like this approach would support signature search at higher resolutions than is currently practical. my understanding/belief is that we go with a scaled of 2000 for genomes in part because SBTs and LCAs get too big and unwieldy when you use smaller scaled values. BUT, with the prefetch/greyhound approach, the primary cost would be in loading the larger signatures, and nothing else, right? So we could potentially get to higher resolutions.
Makes sense. I've been using scaled=2000 with the GTDB sigs (31k), and it uses 4.4 GB of memory for the full DB. Anecdotally, tests with …

The challenge then becomes recalculating the reference sigs with smaller scaled values (100?) and storing them efficiently. JSON + gzip for sigs is at the limit for sizes, but I'm not sure what format would keep a good balance of being archival, self-describing, easy to parse, and small.
greyhound is being merged into sourmash, soon-ish - #1238 |
All or most of these ideas have been canonized in sourmash, sourmash-rs, or pyo3_branchwater. I do feel like nominating this issue as a "posterity" issue because it really unleashed a whole new wave of optimizations (massively parallel stuff a la greyhound, AND ridiculously scalable on disk storage a la mastiff)... but in any case I think we can close this issue :) |
I'm on day 3 of a gather of the Hu S1 dataset against all genbank (500k+ genomes), and chatting with @taylorreiter about the find-the-unassigned script that @luizirber wrote,
https://github.com/taylorreiter/cosmo-kmers/blob/master/scripts/unassigned.py
it occurs to me that in some situations (such as environmental metagenomes), it might be more efficient to do a single linear pass across the 500k genomes looking for genomes with substantial containment, and then build an in-memory (or even on disk!) database of just those genomes and do the gather on that.
stopgap? yes. takes less than 4 days? probably 😄
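As a rough illustration only (not the actual script), here is a sketch of that stopgap using the sourmash Python API: one linear pass over the reference sigs to keep genomes with substantial overlap, then a plain `sourmash gather` against just those survivors. The filenames, ksize, and overlap threshold below are placeholders, and it assumes all signatures share the same ksize and scaled value.

```python
# Hypothetical sketch, not the actual find-the-unassigned/prefetch script.
import sourmash
from sourmash import save_signatures

KSIZE = 31
MIN_SHARED_HASHES = 10  # placeholder for "substantial containment"

query = sourmash.load_one_signature("metagenome.sig", ksize=KSIZE)

# linear pass: keep only reference genomes with enough shared hashes
survivors = []
with open("genbank_sig_paths.txt") as pathlist:  # one .sig path per line
    for path in pathlist:
        for ref in sourmash.load_file_as_signatures(path.strip(), ksize=KSIZE):
            if query.minhash.count_common(ref.minhash) >= MIN_SHARED_HASHES:
                survivors.append(ref)

# write the survivors out as a much smaller database
with open("survivors.sig", "w") as fp:
    save_signatures(survivors, fp)

# then run gather against just the survivors:
#   sourmash gather metagenome.sig survivors.sig
```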