
Gather performance improvements discussion #838

Closed
luizirber opened this issue Jan 11, 2020 · 33 comments

luizirber (Member) commented Jan 11, 2020

Moving the conversation from #835 here.

luizirber (Member Author):

We would love it if you guys can improve gather's performance.

Here are some stats... it looks like v3 may be slower than v2?

Should we file a task? We are happy to help. Perhaps in v4? Is a smaller release planned on top of the v3 codebase anytime soon?

Thanks.

sourmash v2:

```
sourmash compute -k 31 --scaled 5000 -o testv2.sig test.01M_R1_L001.fastq.gz
0m17.907s

sourmash gather -k 31 --scaled 5000 -o gatherv2 testv2.sig db/ecolidb.sbt.json
0m1.896s
```

sourmash v3.0.1:

```
sourmash compute -k 31 --scaled 5000 -o testv3.sig test.01M_R1_L001.fastq.gz
0m42.243s

sourmash gather -k 31 --scaled 5000 -o gatherv3 testv3.sig db/ecolidb.sbt.json
0m4.429s
```

Originally posted by @satishv in #835 (comment)

ctb (Contributor) commented Jan 13, 2020

Whoa, there's something bad going on all right :)

I used the following code:

```python
#! /usr/bin/env python
import sourmash
import time

print('loaded from:', sourmash)
print('version:', sourmash.VERSION)

# load the query signature and a single LCA database
query = sourmash.load_one_signature('podar-ref/0.fa.sig', ksize=31)
db = sourmash.lca.lca_utils.load_single_database('xxx.lca.json')

# the database list format changed between 2.3.x and 3.x
if sourmash.VERSION.startswith('3.'):
    dblist = [db]
elif sourmash.VERSION.startswith('2.3.'):
    dblist = [(db[0], 'xxx', 'LCA')]

# time 500 rounds of gather against the database list
start = time.time()
for i in range(500):
    g = sourmash.search.gather_databases(query, dblist,
                                         threshold_bp=0,
                                         ignore_abundance=False)
    g = list(g)
end = time.time()

print(end - start)
```

and got:

```
loaded from: <module 'sourmash' from '/Users/t/dev/sourmash/sourmash/__init__.py'>
version: 3.0.2.dev6+g86e1105
5.426559925079346
```

```
loaded from: <module 'sourmash' from '/Users/t/dev/sourmash/sourmash/__init__.py'>
version: 2.3.1
0.07218098640441895
```

(This was with the podar-ref LCA database.)

I didn't think we'd changed that code much between 3.0.1 and 2.3.1 but clearly I am mistaken :)

ctb (Contributor) commented Jan 13, 2020

Yes, my recollection is indeed ...vastly wrong. We merged in the new index ABC (#556) after 2.3.1 and then moved immediately to Rust and released 3.x. I suppose there's a lesson in there about our versioning, @luizirber!

ctb (Contributor) commented Jan 13, 2020

commit ec8a00c, with index ABC:

```
version: 2.3.2.dev1+gec8a00c
5.255207061767578
```

previous commit 03e5269, no index ABC:

```
version: 2.3.1
0.11417484283447266
```

ok, so it was #556 that did the dirty deed.

ctb (Contributor) commented Jan 13, 2020

(nothing to do with rust!)

ctb (Contributor) commented Jan 13, 2020

OK, using py-spy and digging around a bit, I think the culprit is probably the new interplay between _find_best in search.py and the Index.gather implementations. _find_best does not pass on any hints about search thresholds, and the Index.gather implementation wouldn't take them into account anyway. 😿

luizirber (Member Author):

> Yes, my recollection is indeed ...vastly wrong. We merged in the new index ABC (#556) after 2.3.1 and then moved immediately to Rust and released 3.x. I suppose there's a lesson in there about our versioning, @luizirber!

I'm a fan of "one release per PR", but we also wanted to do #556 and #424 in 3.0.0, so... yeah. See #655 for more info.

luizirber (Member Author) commented Jan 13, 2020

> OK, using py-spy and digging around a bit, I think the culprit is probably the new interplay between _find_best in search.py and the Index.gather implementations.

But both SBTs and LCA indices define their own gather, and don't use Index.gather.

> _find_best does not pass on any hints about search thresholds, and the Index.gather implementation wouldn't take them into account anyway. 😿

It's not ideal, but we can pass the threshold as a kwarg and define a default if it is not provided (like search already does); a sketch of the idea follows.
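A minimal sketch of that kwargs idea against a toy index; ToyIndex, num_common_hashes, and DEFAULT_THRESHOLD_BP are hypothetical stand-ins, not sourmash's actual Index API:

```python
DEFAULT_THRESHOLD_BP = 0

def num_common_hashes(query_hashes, subject_hashes):
    # containment counted as hashes shared between query and subject
    return len(set(query_hashes) & set(subject_hashes))

class ToyIndex:
    def __init__(self, signatures, scaled):
        self.signatures = signatures  # list of (name, hash set) pairs
        self.scaled = scaled

    def gather(self, query_hashes, **kwargs):
        # accept an optional threshold hint from the caller; fall back
        # to a default when none is given, the way search() already does
        threshold_bp = kwargs.get('threshold_bp', DEFAULT_THRESHOLD_BP)
        min_hashes = threshold_bp / self.scaled

        # skip subjects that cannot meet the threshold, instead of
        # scoring everything and filtering afterwards
        for name, hashes in self.signatures:
            common = num_common_hashes(query_hashes, hashes)
            if common and common >= min_hashes:
                yield common, name

idx = ToyIndex([('sigA', {1, 2, 3}), ('sigB', {3})], scaled=1000)
print(max(idx.gather({1, 2, 3}, threshold_bp=2000)))  # (3, 'sigA')
```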

ctb (Contributor) commented Jan 13, 2020 via email

luizirber (Member Author) commented Jan 13, 2020

I'll dump some other pointers to gather improvements here; they should probably become separate issues eventually (and this one stays as a meta/tracking issue?). I'll break it into two sections: Engineering, for making what already exists faster, and Research, for deeper changes that involve rethinking data structures and methods.

Engineering

We can continue running py-spy and heaptrack to optimize specific routines and lower overall memory consumption.

Research

SBT indexing is not very smart: it finds the next available position in the tree and puts the signature there. Ideally we would want signatures to be clustered by some metric (probably similarity, but it can also just be common hashes), because the search algorithms (be it depth-first search as in gather, or breadth-first search as in search) strongly benefit from pruning the search as early as possible (and avoiding loading data from disk).

So, one direction to solve gather issues: better organization of the SBT nodes. At least two ways to do it:

  • Make the insertion of a dataset in an index smarter. We could look for the best position to put the signature, instead of just putting it in the next available place (sketched after this list). @phoenixAja showed interest in this approach, because it also makes #710 ([WIP] Add "knn" and "umap" commands) viable.
    • Benefit: this is also an online/streaming approach.
    • Possible problem: SBTs built with the current method are "dense", with the lowest possible depth. Keeping the tree dense with this approach would also involve rebalancing it (a self-balancing binary search tree is ideal), but that is more expensive to do and might not work very well with the current internal implementation of SBTs (which uses enumeration of the nodes instead of pointers, so a rotation involves changing a lot of node positions).
  • Scaffold an index. Given a collection of signatures, we can cluster them by similarity and build a tree which is both dense and maximizes similarity between nodes under the same parent. This is what HowDeSBT does (with the howdesbt cluster command). There is some support for this in sourmash, hidden in the Rust crate.
    • Drawback: this is an offline approach, because it needs a batch of signatures to work.
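A hedged sketch of the first approach (similarity-guided insertion) on a toy binary tree; Node, jaccard, and insert are illustrative stand-ins, not sourmash's SBT implementation:

```python
def jaccard(a, b):
    return len(a & b) / len(a | b) if (a | b) else 0.0

class Node:
    def __init__(self, hashes):
        self.hashes = set(hashes)  # union of every signature below this node
        self.left = None
        self.right = None

def insert(node, hashes):
    """Insert `hashes` under `node`, returning the subtree's (new) root."""
    if node is None:
        return Node(hashes)
    if node.left is None and node.right is None:
        # split the leaf: the new parent's union covers both leaves
        parent = Node(node.hashes | hashes)
        parent.left, parent.right = node, Node(hashes)
        return parent
    node.hashes |= hashes
    # descend toward the more similar child, instead of the next free
    # slot, so that similar signatures cluster under the same parent
    # (nothing here rebalances the tree; that is the stated drawback)
    if jaccard(hashes, node.left.hashes) >= jaccard(hashes, node.right.hashes):
        node.left = insert(node.left, hashes)
    else:
        node.right = insert(node.right, hashes)
    return node

root = None
for sig in ({1, 2, 3}, {8, 9}, {1, 2, 4}):
    root = insert(root, sig)
# {1,2,4} is routed next to {1,2,3}, not next to {8,9}
```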

Another direction, dependent on the first one: change the search strategy.

  • Use a best-first search to find matches in gather (sketched after this list). Eventually the DFS will find the same result, but best-first can reach it faster (by taking the path with the largest similarity first). But this only works if the index is well-behaved...
    • Check the best_first branch for an implementation of this search strategy. It abuses the results dict a bit, so it breaks some tests with search functions that don't take extra kwargs.
  • Use a simple simulated annealing approach: while descending the tree also check some random signature to see if there is a better score.
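And a hedged sketch of best-first search over the same kind of toy tree (not the actual best_first branch). Internal nodes hold the union of everything below them, so an internal node's containment score upper-bounds every leaf underneath it; that is what makes returning the first leaf popped off the priority queue correct:

```python
import heapq
from dataclasses import dataclass

@dataclass
class Node:
    hashes: set
    left: "Node" = None
    right: "Node" = None

def containment(query, node_hashes):
    return len(query & node_hashes) / len(query) if query else 0.0

def best_first(root, query, threshold=0.0):
    counter = 0  # tie-breaker so heapq never compares Node objects
    heap = [(-containment(query, root.hashes), counter, root)]
    while heap:
        neg, _, node = heapq.heappop(heap)
        score = -neg
        if score < threshold:
            return None  # best remaining upper bound is below threshold
        if node.left is None and node.right is None:
            return score, node  # first leaf popped is the best match
        for child in (node.left, node.right):
            counter += 1
            heapq.heappush(heap, (-containment(query, child.hashes), counter, child))
    return None

leaf_a, leaf_b = Node({1, 2, 3}), Node({4, 5})
root = Node(leaf_a.hashes | leaf_b.hashes, leaf_a, leaf_b)
print(best_first(root, {1, 2}))  # (1.0, Node(hashes={1, 2, 3}, ...))
```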

satishv commented Jan 14, 2020

Thanks for all of your attention on this task. Currently, the gather function does not support multi-threading. Has that been considered before? I also emailed you about this.

ctb (Contributor) commented Jan 14, 2020 via email

ctb (Contributor) commented Jan 14, 2020

Embarrassing realization: my earlier benchmarking of lca gather was off, because the v2.3.1 code modified the incoming query object and removed all the minhashes, so I was essentially comparing gather on an empty signature (in v2.3.1) to gather on a full signature (in v3.0.1). When I run a real comparison, the LCA gather is about 50% faster in v3 than in v2.3.1.

OK, now that I have that figured out, going to take a look at the SBT performance.

ctb (Contributor) commented Jan 14, 2020

Well, at least one problem (mentioned above) is that neither gather implementation (LCA or SBT) was using thresholding in search. That's fixed in #843. My benchmarking is highly variable but I'm seeing what looks like consistent improvements in gather performance on SBTs.

luizirber added this to the pre-4.0 milestone Jan 14, 2020
luizirber (Member Author):

> Thanks for all of your attention on this task. Currently, the gather function does not support multi-threading. Has that been considered before? I also emailed you about this.

One of the big reasons to move to Rust, as I understand it, is better support for multithreading. Before, the C++ layer was a real pain in terms of threading...
The algorithm is not trivial to parallelize, however.

Maybe, maybe... It's not trivial, but I've been thinking about doing something like this, using rayon:
https://github.com/oconnor663/bao/blob/a2fedd649487de63e4d0c0a9e45f5a22d1e46365/src/hash.rs#L224L270

But that's still a bit far in the future, because #532 needs to land (3.1), and then I need to write the Index Python abc/Rust trait bridge (3.3). I want to focus on improving compute (3.2, check #845) before doing the bridge.

satishv commented Jan 15, 2020

For some more context: we are running with a DB file (200 GB zipped) with over 100k genomes. Would sorting these genomes in the DB file help? As a last resort, we may consider chopping the large DB file into multiple DB files; alternatively, passing both a start and an end index into the DB file would be awesome, if you can support that kind of DB lookup. Happy to expand on this further. Not sure I am making myself very clear here. cc @metajinomics

ctb (Contributor) commented Jan 15, 2020

hi @satishv, one thing you can do today would be to split the large DB into multiple files, as you suggest. If you can do so in such a way that related genomes are kept in the same database subset, that might improve things significantly.

A rough pipeline for doing so might be to:

  • take something like the GTDB taxonomy database that I just posted on my blog
  • run sourmash lca classify on all your genome signatures (should take ~3-6 hours)
  • pick a taxonomic level that groups your genomes into conveniently sized chunks of 20k or less
  • make databases for each of those (sketched below).
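A hedged sketch of that pipeline; file names and the grouping step are illustrative, and the flags should be double-checked against your sourmash version (`sourmash lca classify --help`):

```
sourmash lca classify --db gtdb.lca.json --query sigs/*.sig -o classified.csv

# split sigs/ into per-lineage directories based on classified.csv
# (a few lines of external scripting; sourmash does not do this step),
# then build one SBT per chunk:
sourmash index -k 31 db_chunk1 sigs/chunk1/*.sig
```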

best,
--titus

satishv commented Jan 15, 2020

@ctb - Thanks for the above info. Doing the above will certainly add to our timelines. We are hoping for only small changes on our side at this point. It would help if you also make changes on your end to speed things up as much as possible.

ctb (Contributor) commented Jan 15, 2020 via email

satishv commented Jan 15, 2020

Thanks a lot!

We are using sourmash 2.0.1 now. I assume no outputs or formats will be changed in 4.0 or the above fix? I assume we will be able to easily swap 2.0.1 for your above fix, ideally without a lot of effort?

cc @brendanwee

ctb (Contributor) commented Jan 15, 2020 via email

satishv commented Jan 17, 2020

Thanks. Looking forward to next week; we hope to incorporate your changes as soon as they are ready. Can't wait.

ctb (Contributor) commented Jan 19, 2020

re improving SBT organization, I wanted to link in #756, which is a very nice discussion of the issue, and also #545.

satishv commented Jan 25, 2020

@ctb - we request that the performance of gather be prioritized over compute performance. Let us know if you have any estimate of the release date. We totally appreciate this effort. Please note that we are not interested in lca gather at this point. cc @luizirber

ctb (Contributor) commented Jan 25, 2020

hi @satishv, I can't speak for @luizirber here, but I have no immediate plans to work on improving gather performance. The current performance is not really an obstacle for anything I want to do, and I'm usually much more interested in correctness, user experience, and memory usage than I am in performance.

More generally, we are always happy to consider pull requests that implement your priorities; we could also help connect you with some consultants that might be able to do the work on your desired timeline. Since the basic gather functionality is well tested and (presumably) performance improvements wouldn't change the API or behavior, there is no obstacle to releasing a new version of sourmash as soon as a PR is merged.

You should also note that there is a distinction between sourmash-gather-on-LCA-databases and sourmash lca gather. The former should return ~identical results given the same collection of sequences (but has very different performance characteristics than sourmash-gather-on-SBT-databases); the latter has somewhat different functionality. I'm pretty sure (but not 100% positive...) that you can give sourmash lca index an empty spreadsheet to build an LCA database with no taxonomy; you might give that a try if you haven't already.

ctb (Contributor) commented Apr 14, 2020

more performance optimization stuff for gather on SBTs, here.

basically, performance improvements for gather on an SBT from parallelization could come from --

  • on multiple databases, searching each database independently and then selecting the best containment - easy-ish if we can do the search entirely in Rust (a Python-level sketch of the idea follows this list).
  • parallelizing the tree search WITHIN an SBT, by e.g. queueing up a list of parts of the tree to search in parallel - hard-ish.
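A hedged Python-level sketch of the first option: search toy "databases" in separate processes and keep the best containment. DATABASES, containment, and the rest are illustrative stand-ins, not sourmash's API:

```python
from concurrent.futures import ProcessPoolExecutor

# toy "databases": name -> list of (signature name, hash set) pairs
DATABASES = {
    'db1': [('sigA', {1, 2, 3}), ('sigB', {7, 8})],
    'db2': [('sigC', {1, 2}), ('sigD', {9})],
}

def containment(query, subject):
    # fraction of the query's hashes present in the subject
    return len(query & subject) / len(query) if query else 0.0

def best_in_db(item):
    # each worker scans one database independently...
    db_name, query = item
    best = max(((containment(query, hashes), name)
                for name, hashes in DATABASES[db_name]),
               default=(0.0, None))
    return db_name, best

def parallel_gather_round(query):
    with ProcessPoolExecutor() as pool:
        results = pool.map(best_in_db, [(db, query) for db in DATABASES])
        # ...and the single best containment across databases wins
        return max(results, key=lambda r: r[1][0])

if __name__ == '__main__':
    print(parallel_gather_round({1, 2, 3}))  # ('db1', (1.0, 'sigA'))
```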

#925 may also help, by doing a better job of constructing the SBT so that less of the tree is searched.

ctb (Contributor) commented May 3, 2020

with the new compressed Bloom filter loading in #648, I think we can explore using larger Bloom filters in SBTs as another performance optimization; ref #304.

luizirber (Member Author) commented May 4, 2020

> with the new compressed Bloom filter loading in #648, I think we can explore using larger Bloom filters in SBTs as another performance optimization; ref #304.

Important note here: the compressed BFs are on disk, not in memory. Using larger BFs might incur more memory usage. To fix that:

  • Better internal node caching and eviction. For search, the nodes are being unloaded after they are traversed (as per #784 (comment), "Expose an unload method for SBT nodes"), but for gather simply unloading them is too aggressive (since they might be traversed again). Having a cache that can call .unload() when the node is evicted would be perfect, but I didn't find any available with a good API for that; we'll probably have to write our own (see the sketch after this list). (Calling .unload() is fine, because the nodes know how to reload the data; but it would involve reading data from storage again, which will be slower.)
  • Succinct representation in memory. This could probably be RRR encoding, like the original SBT. Pointers: https://alexbowe.com/rrr/
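A hedged sketch of such a cache: an LRU that calls .unload() on eviction. ToyNode and UnloadingLRU are hypothetical names; sourmash's real nodes know how to reload their data from storage:

```python
from collections import OrderedDict

class ToyNode:
    # stand-in for an SBT internal node that can drop and reload its data
    def __init__(self, name):
        self.name = name
        self.data = b'...bloom filter bytes...'

    def unload(self):
        self.data = None  # the real node can re-read this from storage

class UnloadingLRU:
    def __init__(self, capacity):
        self.capacity = capacity
        self._cache = OrderedDict()

    def get(self, key):
        if key in self._cache:
            self._cache.move_to_end(key)  # mark as most recently used
            return self._cache[key]
        return None

    def put(self, key, node):
        if key in self._cache:
            self._cache.move_to_end(key)
        self._cache[key] = node
        while len(self._cache) > self.capacity:
            # evict the least recently used node and release its data
            _, evicted = self._cache.popitem(last=False)
            evicted.unload()

cache = UnloadingLRU(capacity=2)
for i in range(3):
    cache.put(i, ToyNode(i))
# node 0 was evicted and unloaded; nodes 1 and 2 remain resident
```

With something like this, gather could keep recently traversed internal nodes resident and pay the reload cost only on cold paths.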

ctb (Contributor) commented May 4, 2020

ok, with the release of v3.3.0 (link), things have improved dramatically -

[screenshot of benchmark results omitted]

this is because #799 now permits direct loading of compressed nodegraphs without intermediate files. #648 also adds the convenience of keeping the entire database in one compressed .zip file.

I think #925 is the next big optimization expected to land.

ctb (Contributor) commented May 5, 2020

re:

> Better internal node caching and eviction. For search, the nodes are being unloaded after they are traversed (as per #784 (comment)), but for gather simply unloading them is too aggressive (since they might be traversed again).

since hashes from the best match (across all databases) are removed from the query, it might be possible to provide a hint to unload the path to that best match in a database.

sourmash-bio deleted a comment from ctb May 5, 2020
luizirber (Member Author):

> since hashes from the best match (across all databases) are removed from the query, it might be possible to provide a hint to unload the path to that best match in a database.

Good idea! But maybe avoid unloading the nodes near the root (4 levels, ~31 nodes?), because they will probably be queried every time anyway.

Something else that can help: gather starts with best_similarity = 0, but after the first search we have more results (not only the best_match) that we could check first in the second round, to start with a better score. That will also probably help limit the number of backtracking steps/paths that need to be checked. To avoid this being unbounded, maybe keep the top-20 leaves from each search for seeding the next round? (Sketched below.)
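A hedged sketch of that seeding idea over a flat list of toy leaves; seeded_round and containment are illustrative names, and the real gather also removes the best match's hashes from the query between rounds, which this sketch omits:

```python
def containment(query, leaf):
    return len(query & leaf) / len(query) if query else 0.0

def seeded_round(query, leaves, seeds, seed_size=20):
    # score last round's survivors first to establish a lower bound...
    best_score = max((containment(query, s) for s in seeds), default=0.0)
    kept = []
    for leaf in leaves:
        score = containment(query, leaf)
        if score >= best_score:  # ...so weaker candidates are pruned early
            kept.append((score, leaf))
    kept.sort(key=lambda pair: pair[0], reverse=True)
    best = kept[0] if kept else None
    next_seeds = [leaf for _, leaf in kept[:seed_size]]  # bounded, as suggested
    return best, next_seeds

leaves = [{1, 2, 3}, {1, 2}, {7}]
best, seeds = seeded_round({1, 2, 3}, leaves, seeds=[])
best, seeds = seeded_round({1, 2, 3}, leaves, seeds=seeds)  # starts at 1.0 now
```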

ctb (Contributor) commented May 5, 2020

> Something else that can help: gather starts with best_similarity = 0, but after the first search we have more results (not only the best_match) that we could check first in the second round, to start with a better score. That will also probably help limit the number of backtracking steps/paths that need to be checked.

yes! but of course it's more complicated - see #930 for a slightly rambling discussion of how this would have to work across multiple databases.

ctb (Contributor) commented Jul 18, 2020

remaining issues moved to #1110

ctb closed this as completed Jul 18, 2020