
Thinking about further gather performance improvements. #1110

Closed
2 of 4 tasks
ctb opened this issue Jul 18, 2020 · 4 comments


ctb commented Jul 18, 2020

This is a summary of the remaining issues in #838, which has become unreadable :)


from @luizirber in #838 (comment):

I'll dump here some other pointers to gather improvements; they should probably become separate issues eventually (and this one stays as a meta/tracking issue?). I'll break it into two sections: Engineering, for making what already exists faster, and Research, for deeper changes that involve rethinking data structures and methods.

Engineering

We can continue running py-spy and heaptrack to optimize specific routines or lower overall memory consumption.

Research

SBT indexing is not very smart: it finds the next available position in the tree and puts the signature there. Ideally we would want signatures to be clustered by some metric (probably similarity, but it could also just be common hashes), because the search algorithms (be it depth-first search as in gather, or breadth-first search as in search) strongly benefit from pruning the search as early as possible (and avoiding loading data from disk).

So, one direction for solving gather issues: better organization of the SBT nodes. There are at least two ways to do it:

  • Make the insertion of a dataset into an index smarter. We could look for the best position to put the signature in, instead of just using the next available place; see the sketch after this list. @phoenixAja showed interest in this approach, because it also makes [WIP] Add "knn" and "umap" commands #710 viable.
    • Benefit: this is also an online/streaming approach.
    • Possible problem: SBTs built with the current method are "dense", with the lowest possible depth. Keeping the tree dense with this approach would also involve rebalancing it (a self-balancing binary search tree is ideal), but that is more expensive to do and might not work very well with the current internal implementation of SBTs (which enumerates nodes instead of using pointers, so a rotation involves changing a lot of node positions).
  • Scaffold an index. Given a collection of signatures, we can cluster them by similarity and build a tree which is both dense and maximizes similarity between nodes under the same parent. This is what HowDeSBT does (with the howdesbt cluster command). There is some support for this in sourmash, hidden in the Rust crate.
    • Drawback: this is an offline approach, because it needs a batch of signatures to work.
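As a rough illustration of the first option, here is a minimal sketch of similarity-guided insertion. It assumes a hypothetical, simplified node layout (internal nodes hold the union of all hashes below them); it is not sourmash's actual SBT API:

```rust
use std::collections::HashSet;

// Hypothetical, simplified SBT-like node: internal nodes hold the union
// of all hashes below them; leaves have no children.
struct Node {
    union: HashSet<u64>,
    children: Vec<Node>,
}

// Greedy placement: descend toward the child sharing the most hashes
// with the new signature, instead of taking the next available position.
// Returns the leaf-level node where the signature should be attached.
fn place<'a>(node: &'a mut Node, sig: &HashSet<u64>) -> &'a mut Node {
    // Every node on the path gains the new signature's hashes.
    node.union.extend(sig.iter().copied());
    if node.children.is_empty() {
        return node;
    }
    let best = node
        .children
        .iter_mut()
        .max_by_key(|c| c.union.intersection(sig).count())
        .expect("internal nodes have children");
    place(best, sig)
}
```

Since placement is one root-to-leaf descent per signature, this stays an online operation; the rebalancing concern above is what it leaves unsolved.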

Another direction, which depends on the first one: change the search strategy.

  • Use a best-first search to find matches in gather. Eventually the DFS will find the same result, but best-first can reach it faster (by taking the path with the largest similarity first). This only works if the index is well-behaved, though... See the sketch after this list.
    • Check best_first for a branch implementing this search strategy. It abuses the results dict a bit, so it breaks some tests with search functions that don't take extra kwargs.
  • Use a simple simulated-annealing approach: while descending the tree, also check some random signature to see if there is a better score.
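For flavor, a minimal best-first sketch over the same kind of simplified node as above: a max-heap frontier ordered by an upper-bound score (hashes shared with the query), with pruning once nothing left can beat the best leaf found. This is illustrative only, not the code on the best_first branch:

```rust
use std::collections::{BinaryHeap, HashSet};

struct Node {
    union: HashSet<u64>, // union of hashes below this node (leaf: its own)
    children: Vec<Node>,
}

// Upper bound on how well any leaf below `node` can match the query.
fn bound(node: &Node, query: &HashSet<u64>) -> usize {
    node.union.intersection(query).count()
}

// Best-first search: always expand the frontier node with the highest
// bound; in a well-clustered tree this reaches the best leaf quickly
// and prunes everything that can no longer beat it.
fn best_first<'a>(root: &'a Node, query: &HashSet<u64>) -> Option<&'a Node> {
    let mut arena: Vec<&'a Node> = vec![root];
    let mut heap: BinaryHeap<(usize, usize)> = BinaryHeap::new(); // (bound, arena id)
    heap.push((bound(root, query), 0));
    let mut best: Option<(usize, &'a Node)> = None;

    while let Some((b, id)) = heap.pop() {
        if let Some((best_score, _)) = best {
            if b <= best_score {
                break; // heap is max-ordered: nothing left can do better
            }
        }
        let node = arena[id];
        if node.children.is_empty() {
            best = Some((b, node)); // for a leaf, the bound is exact
        } else {
            for child in &node.children {
                arena.push(child);
                heap.push((bound(child, query), arena.len() - 1));
            }
        }
    }
    best.map(|(_, node)| node)
}
```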

more engineering / multi-threading

@satishv:

Thanks for all of your attention on this task. Currently the gather function does not support multi-threading. Has that been considered before? I also emailed you about this.

@ctb:

One of the big reasons to move to Rust, as I understand it, is better support for multithreading. Before, the C++ layer was a real pain in terms of threading...
The algorithm is not trivial to parallelize, however.

Maybe, maybe... It's not trivial, but I've been thinking about doing something like this, using rayon:
https://github.com/oconnor663/bao/blob/a2fedd649487de63e4d0c0a9e45f5a22d1e46365/src/hash.rs#L224L270
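To make the shape of that idea concrete, here is a minimal divide-and-conquer sketch in the spirit of the linked bao code, using rayon::join to fork the two halves of a flat signature slice onto worker threads. The `Sig` type and overlap scoring are stand-ins, not sourmash's API, and it assumes `rayon` as a dependency:

```rust
use std::collections::HashSet;

type Sig = HashSet<u64>;

// Find the largest overlap between the query and any signature in the
// slice, splitting the work recursively across rayon's thread pool.
fn best_overlap(sigs: &[Sig], query: &Sig) -> usize {
    if sigs.len() <= 4 {
        // Small base case: score sequentially.
        return sigs
            .iter()
            .map(|s| s.intersection(query).count())
            .max()
            .unwrap_or(0);
    }
    let (left, right) = sigs.split_at(sigs.len() / 2);
    // rayon::join may run the two closures on different threads.
    let (l, r) = rayon::join(
        || best_overlap(left, query),
        || best_overlap(right, query),
    );
    l.max(r)
}
```

The flat scan parallelizes cleanly; the hard part flagged above is that gather's iterative min-set-cover loop is sequential between rounds, so only the within-round scoring is embarrassingly parallel.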


reorganizing SBT structure

re improving SBT organization, I wanted to link in #756, which is a very nice discussion of the issue, and also #545.


larger bloom filters?

with the compressed Bloom filter loading added in #648, I think we can explore using larger Bloom filters in SBTs as another performance optimization; ref #304.

Important note here: the BFs are compressed on DISK, not in MEMORY, so using larger BFs might incur more memory usage. To fix that:

  • Better internal node caching and eviction. For search, nodes are unloaded after they are traversed (as per Expose an unload method for SBT nodes #784 (comment)), but for gather simply unloading them is too aggressive (since they might be traversed again). Having a cache that calls .unload() when a node is evicted would be perfect, but I didn't find an available one with a good API for that; we'll probably have to write our own. See the sketch after this list. (Calling .unload() is fine, because nodes know how to reload their data; but it involves reading from storage again, which is slower.)
  • Succinct in-memory representation. This could probably use RRR encoding, like the original SBT. Pointers: https://alexbowe.com/rrr/
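A minimal sketch of the eviction idea, assuming a hypothetical Unload trait standing in for SBT nodes that can drop their in-memory Bloom filter and reload it from storage on demand:

```rust
use std::collections::VecDeque;

// Hypothetical trait: a node that can drop its in-memory data and
// reload it from storage later (reloading just costs another read).
trait Unload {
    fn unload(&mut self);
}

// Minimal bounded LRU sketch: when capacity is exceeded, the least
// recently used node is evicted and unloaded.
struct NodeCache {
    capacity: usize,
    order: VecDeque<usize>, // node ids, most recently used at the back
}

impl NodeCache {
    fn touch<N: Unload>(&mut self, id: usize, nodes: &mut [N]) {
        self.order.retain(|&n| n != id); // drop any older entry for `id`
        self.order.push_back(id);
        while self.order.len() > self.capacity {
            if let Some(victim) = self.order.pop_front() {
                nodes[victim].unload();
            }
        }
    }
}
```

The capacity knob is what bounds memory with larger BFs: gather can re-traverse hot nodes for free while cold ones fall back to disk.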

changing the way we do search

@luizirber:

Something else that can help: gather starts with best_similarity = 0, but after the first search we have more results (not only the best_match) that we could check first in the second round to start with a better score. That would also probably help limit the number of backtracks/paths that need to be checked.

@ctb:
yes! but of course it's more complicated - see #930 for a slightly rambling discussion of how this would have to work across multiple databases.
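A minimal sketch of that seeding idea, with overlap counts as a stand-in for the actual similarity scores:

```rust
use std::collections::HashSet;

type Sig = HashSet<u64>;

fn overlap(sig: &Sig, query: &Sig) -> usize {
    sig.intersection(query).count()
}

// Instead of starting a gather round with best_similarity = 0, re-score
// last round's runners-up against the reduced query. The maximum gives
// a lower bound that lets the tree search prune paths immediately.
fn seed_threshold(prev_results: &[Sig], remaining_query: &Sig) -> usize {
    prev_results
        .iter()
        .map(|sig| overlap(sig, remaining_query))
        .max()
        .unwrap_or(0)
}
```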


ctb commented Jan 10, 2021

gather performance is improved dramatically by greyhound, as mentioned in #1226 (comment). Practically speaking, we can now do ~million bacterial signature searches in under 2 hours, even in Python (see genome-grist for some functioning code).


ctb commented May 8, 2021

ref #1370 and beyond for massive performance increases coming soon. Maybe we can close this issue soon!


ctb commented Oct 13, 2022

from slack, luiz sayeth:

More future optimizations:

  • On the first round, when calculating overlaps, save a revindex from the hashes in the query to the dataset IDs containing each hash. For the gather part you can then remove hashes from the counter using only the revindex (no need to keep the sigs around, since they might be large); see the sketch below.
  • If you do need to keep sigs, support lazy-loading them (check once_cell), so you don't need to keep them all in memory. Alternatively, use a Storage =]

Before future optimizations: write some tests, maybe also benchmarks using criterion.
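A minimal sketch of the revindex idea, with plain hash sets as stand-in signatures (a real implementation would differ in detail): one pass builds hash -> dataset ids, and the greedy gather loop then runs entirely on a counter, never touching signatures again.

```rust
use std::collections::{HashMap, HashSet};

// Returns (dataset id, overlap) pairs in greedy min-set-cover order.
fn gather_with_revindex(
    query: &HashSet<u64>,
    datasets: &[HashSet<u64>],
) -> Vec<(usize, usize)> {
    // revindex: query hash -> ids of datasets containing it
    let mut revindex: HashMap<u64, Vec<usize>> = HashMap::new();
    for (id, ds) in datasets.iter().enumerate() {
        for &h in ds.intersection(query) {
            revindex.entry(h).or_default().push(id);
        }
    }

    // counter: dataset id -> overlap with the (shrinking) query
    let mut counter: HashMap<usize, usize> = HashMap::new();
    for ids in revindex.values() {
        for &id in ids {
            *counter.entry(id).or_insert(0) += 1;
        }
    }

    let mut results = Vec::new();
    loop {
        // Pick the dataset covering the most remaining hashes.
        let (best, overlap) = match counter.iter().max_by_key(|(_, &c)| c) {
            Some((&id, &c)) if c > 0 => (id, c),
            _ => break,
        };
        results.push((best, overlap));
        counter.remove(&best);

        // Remove every hash covered by `best`, updating only the
        // affected counters via the revindex -- no signatures needed.
        let covered: Vec<u64> = revindex
            .iter()
            .filter(|(_, ids)| ids.contains(&best))
            .map(|(&h, _)| h)
            .collect();
        for h in covered {
            if let Some(ids) = revindex.remove(&h) {
                for id in ids {
                    if let Some(c) = counter.get_mut(&id) {
                        *c -= 1;
                    }
                }
            }
        }
    }
    results
}
```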


ctb commented Sep 23, 2023

we've moved beyond most of this discussion; between mastiff updates and the multithreading work in https://github.com/sourmash-bio/pyo3_branchwater, we're in good shape.

ctb closed this as completed Sep 23, 2023