Thinking about further `gather` performance improvements. #1110

ctb · 2020-07-18T14:22:31Z

This is a summarization of remaining issues in #838, which has become unreadable :)

I'll dump here some other pointers to gather improvements, and they should probably become other issues eventually (and this one stays as a meta/tracking issue?). I'll break it in two sections: Engineering for making what already exists faster, and Research for deeper changes that involve rethinking data structures and methods.

Engineering

We can continue running py-spy and heaptrack for optimizing specific routines or lowering overall memory consumption.

Finish [MRG] replace khmer.Nodegraph with rust nodegraph #799 to allow loading Nodegraph from memory (avoid the tempfile dance when loading data from storage).
Generate a Nodegraph from the query, and use it to query internal nodes (instead of checking every hash in the query). This branch has this somewhat implemented, but depends on an unreleased khmer branch. Should also be doable in [MRG] replace khmer.Nodegraph with rust nodegraph #799
Expose an unload method for SBT nodes #784 is not so good for gather or long-running processes (in these cases you might prefer to keep the full index in memory), but it can lower the memory consumption substantially. There is also [WIP] cache top node results for gather #520 which tries to cache internal nodes, but I think it needs to cut a bit deeper to work well.
- is there any cache that, once an item is evicted, allows calling a method? That would work well with the .unload() from Expose an unload method for SBT nodes #784, I think.
[MRG] add ZipStorage, support loading tree from storage #648 is a disk-space saver (because you don't need to uncompress the SBT indices), but for now it makes things slower

Research

SBT indexing is not very smart: it finds the next available position in the tree, and put the signature there. Ideally we would want signatures to be clustered by some metric (probably similarity, but can also just be common hashes), because the search algorithms (be it depth-first search like in gather, or breadth-first search like in search) strongly benefit from prunning the search as early as possible (and avoid loading data from disk).

So, one direction to solve gather issues: better organization of the SBT nodes. At least two ways to do it:

Make the insertion of a dataset in an index smarter. We could look for the best position to put the signature, instead of just putting on the next available place. @phoenixAja showed interest in this approach, because it also makes [WIP] Add "knn" and "umap" commands #710 viable.
- Benefit: this is also an online/streaming approach
- Possible problem: SBTs built with the current method are "dense", with the lowest possible depth. To keep it dense with this approach it would also involve rebalancing the tree (a self-balancing binary search tree is ideal), but it is more expensive to do and might not work very well with the current internal implementation of SBTs (which uses enumeration on the nodes instead of pointers, so a rotation involves changing a lot of node positions).
Scaffold an index. Given a collection of signatures we can cluster them by similarity and build a tree which is both dense and maximize similarity between nodes under the same parent. This is what HowDeSBT does (with the howdesbt cluster command). There is some support for this in sourmash, hidden in the Rust crate.
- Drawback: this is an offline approach, because it needs a batch of signatures to work.

Another direction, but depending on the first one: change the search strategy.

Use a best-first search to find matches in gather. Eventually the DFS will find the same result, but the best-first can reach it faster (by taking the path with largest similarity first). But this only works if the index is well-behaved...
- Check best_first for a branch implementing the search strategy. It abuses the results dict a bit, so it breaks some tests with search functions that don't take extra kwargs.
Use a simple simulated annealing approach: while descending the tree also check some random signature to see if there is a better score.

more engineering / multi-threading

@satishv:

Thanks for all of your attention on this task. Currently, gather function does not support multi-threading. has that been considered before? I also emailed you about this.

@ctb:

One of the big reasons to move to rust, as I understand it, is better support for multithreading. Before, the C++ layer was a real pain in terms of threading...
The algorithm is not trivial to parallelize, however.

Maybe, maybe... It's not trivial, but I've been thinking about doing something like this, using rayon:
https://github.com/oconnor663/bao/blob/a2fedd649487de63e4d0c0a9e45f5a22d1e46365/src/hash.rs#L224L270

But that's still a bit far in the future, because Moving loading and save sigs to Rust #532 needs to land (3.1), and then I need to write the Index Python abc/Rust trait bridge (3.3). I want to focus on improving compute (3.2, check Compute oxidation #845) before doing the bridge.

reorganizing SBT structure

re improving SBT organization, I wanted to link in #756, which is a very nice discussion of the issue, and also #545.

larger bloom filters?

with the new compressed loading Bloom filters in #648, I think we can explore using larger Bloom filters in SBTs as another performance optimization, ref #304.

Important note here: the compressed BF are on DISK, not on MEMORY. Using larger BFs might incur more memory usage. To fix that:

Better internal nodes caching and eviction. For search the nodes are being unloaded after they are traversed (as per Expose an unload method for SBT nodes #784 (comment)), but for gather simply unloading them is too aggressive (since they might be traversed again). Having a cache that can call .unload() when the node is evicted would be perfect, but I didn't find any available with a good API for that. Probably will have to write our own. (Calling .unload() is fine, because the nodes know how to reload the data; but it would involve reading data from storage again, which will be slower)
Succint representation in memory. This could probably be RRR encoding, like the original SBT. Pointers: https://alexbowe.com/rrr/
- the fixedbitset API is super nice to work on, so having something like it but with internal RRR encoding would be perfect
- https://github.com/sujayakar/rsdict is a good candidate for using in our implementation, as well as the https://github.com/tov/succinct-rs crate

changing the way we do search

@luizirber:

Something else that can help: gather starts with best_similarity = 0, but after the first search we have more results (not only the best_match) that we could check first at the second round to start with a better score. That will also probably help limit the number of backtracking/paths that need to be checked.

@ctb:
yes! but of course it's more complicated - see #930 for a slightly rambling discussion of how this would have to work across multiple databases.

The text was updated successfully, but these errors were encountered:

ctb · 2021-01-10T15:13:55Z

gather performance is improved dramatically by greyhound, as mentioned here #1226 (comment). Practically speaking we can now do ~million bacterial signature searches in under 2 hours even in Python (see genome-grist for some functioning code)

ctb · 2021-05-08T12:27:38Z

ref #1370 and beyond for massive performance increases coming soon. Maybe we can close this issue soon!

ctb · 2022-10-13T02:19:55Z

from slack, luiz sayeth:

More future optimizations:

on the first round for calculating overlaps, save a revindex of the hashes in the query -> datasets ID containing hash. For the gather part you can remove from the counter using only the revindex (no need to keep the sigs around, since they might be large
if you do need to keep sigs, support lazy-loading them (check once_cell), so you don’t need to keep them all in memory. Alternatively, use a Storage =]

Before future optimizations: write some tests, maybe also benchmarks using criterion

ctb · 2023-09-23T16:20:40Z

we've moved beyond most of this discussion; between mastiff updates and the multithreading work in https://github.com/sourmash-bio/pyo3_branchwater, we're in good shape.

ctb mentioned this issue Jul 18, 2020

Gather performance improvements discussion #838

Closed

ctb added the speeding-up-gather label Oct 13, 2022

ctb mentioned this issue Feb 6, 2023

sourmash sketch & search use one thread only #2458

Open

ctb closed this as completed Sep 23, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Thinking about further `gather` performance improvements. #1110

Thinking about further `gather` performance improvements. #1110

ctb commented Jul 18, 2020 •

edited by luizirber

Loading

ctb commented Jan 10, 2021

ctb commented May 8, 2021

ctb commented Oct 13, 2022

ctb commented Sep 23, 2023

Thinking about further gather performance improvements. #1110

Thinking about further gather performance improvements. #1110

Comments

ctb commented Jul 18, 2020 • edited by luizirber Loading

Engineering

Research

more engineering / multi-threading

reorganizing SBT structure

larger bloom filters?

changing the way we do search

ctb commented Jan 10, 2021

ctb commented May 8, 2021

ctb commented Oct 13, 2022

ctb commented Sep 23, 2023

Thinking about further `gather` performance improvements. #1110

Thinking about further `gather` performance improvements. #1110

ctb commented Jul 18, 2020 •

edited by luizirber

Loading