Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Feat: on-disk RevIndex based on RocksDB (#2230)
On-disk RevIndex based on RocksDB, initially implemented in https://github.com/luizirber/2022-06-26-rocksdb-eval This is the index/core data structure backing https://mastiff.sourmash.bio There are many changes in the Rust code, so bumping the version to `0.12.0`. This is mostly not exposed thru the FFI yet. Tests from the from the in-memory `RevIndex` (greyhound) from #1238 were kept working, but it is not well supported (doesn't allow saving/loading from disk, for example), and should be wholly replaced by `sourmash::index::revindex::disk_revindex` (the on-disk RevIndex) in the future. It is confusing to have these different RevIndex impls in Rust, and I started converging them, but the work is not completely done yet. #2727 is a better starting point for how `Index` abc/trait should work acrosss Python/Rust, and I started moving the Rust indices to start from a `LinearIndex` and later specialize into a `RevIndex`, which will make easier to translate the work from #2727 for future indices across FFI. A couple of new concepts introduced in this PR: - a `Collection` is a `Manifest` + `Storage`. So a zip file like the ones for GTDB databases fit this easily (storage = `ZipStorage`, manifest is read from the zipfile), but a file paths list does too (manifest built from the file paths, storage = `FSStorage`). This goes in a bit of different direction than #1901, which was extending `Storage` to support more functionality. I think `Storage` should stay pretty bare and mostly deal with loading/saving data, but not much knowledge of **what** data is there (this is covered with `Manifest`). - a `CollectionSet` is a consistent collection of signatures. Consistent here means: same k-size, downsample-compatible for scaled, same moltype. You can create a `CollectionSet` by running `.select()` on a `Collection`. `CollectionSet` is required for building indices (because we shouldn't be building indices mixing k-size/moltype), and greatly simplifies the logic in many places because it is not necessary to check for compatibility. - `LinearIndex` was rewritten based on `Collection` (and also the `greyhound`/`branchwater` parallelism), and this supports the "parallel search without an index" use case. There is no index construction per se here, pretty much just a thin layer on top of `Collection` implementing functionality expected from indices. - `Manifest`, `Selection`, and `Picklist` are still incomplete, but the relevant function definitions are in place, need to barrage it with tests (and potentially exposing to Python and reusing the ones there in #2726) ## Feature - Initial implementation for `Manifest`, `Selection`, and `Picklist` following the Python API. - `Collection` is a new abstraction for working with a set of signatures. A collection needs a `Storage` for holding the signatures (on-disk, in-memory, or remotely), and a `Manifest` to describe the metadata for each signature. - Expose CSV parsing and RocksDB errors. - New module `sourmash::index::revindex::disk_revindex` with the on-disk RevIndex implementation based on RocksDB. - Add `iter` and `iter_mut` methods for `Signature`. - Add `load_sig` and `save_sig` methods to `Storage` trait for higher-level data manipulation and caching. - Add `spec` method to `Storage` to allow constructing a concrete `Storage` from a string description. - Add `InnerStorage` for synchronizing parallel access to `Storage` implementations. - Add `MemStorage` for keeping signatures in-memory (mostly for debugging and testing). ## Refactor - Rename `HashFunctions` variants to follow camel-case, so `Murmur64Protein` instead of `murmur64_protein` - `LinearIndex` is now implemented as a thin layer on top of `Collection`. - Move `GatherResult` to `sourmash::index` module. - Move `sourmash::index::revindex` to `sourmash::index::mem_revindex` (this is the Greyhound version of revindex, in-memory only). It was also refactored internally to build a version of a `LinearIndex` that will be merged in the future with `sourmash::index::LinearIndex` - Move `select` method from `Index` trait into a separate `Select` trait, and implement it for `Signature` based on the new `Selection` API. - Move `SigStore` into `sourmash::storage` module, and remove the generic. Now it always stores `Signature`. Also implement `Select` for it. ## Build - Add new `branchwater` feature (enabled by default), which can be disabled by downstream projects to limit bringing heavy dependencies like rocksdb - Add new `rkyv` feature (disabled by default), making `MinHash` serializable with the `rkyv` crate. - Add semver checks for CI (so we bump versions accordingly, or avoid breaking changes) - Reduce features combinations on Rust checks (takes much less time to run) - Disable `musllinux` wheels (need to figure out how to build rocksdb for it) --------- Co-authored-by: Tessa Pierce Ward <[email protected]> Co-authored-by: C. Titus Brown <[email protected]>
- Loading branch information