-
Notifications
You must be signed in to change notification settings - Fork 80
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
thoughts on Storage
, manifests, etc.
#1901
Comments
maybe relevant: perhaps |
ctb
added a commit
that referenced
this issue
Dec 1, 2023
On-disk RevIndex based on RocksDB, initially implemented in https://github.com/luizirber/2022-06-26-rocksdb-eval This is the index/core data structure backing https://mastiff.sourmash.bio There are many changes in the Rust code, so bumping the version to `0.12.0`. This is mostly not exposed thru the FFI yet. Tests from the from the in-memory `RevIndex` (greyhound) from #1238 were kept working, but it is not well supported (doesn't allow saving/loading from disk, for example), and should be wholly replaced by `sourmash::index::revindex::disk_revindex` (the on-disk RevIndex) in the future. It is confusing to have these different RevIndex impls in Rust, and I started converging them, but the work is not completely done yet. #2727 is a better starting point for how `Index` abc/trait should work acrosss Python/Rust, and I started moving the Rust indices to start from a `LinearIndex` and later specialize into a `RevIndex`, which will make easier to translate the work from #2727 for future indices across FFI. A couple of new concepts introduced in this PR: - a `Collection` is a `Manifest` + `Storage`. So a zip file like the ones for GTDB databases fit this easily (storage = `ZipStorage`, manifest is read from the zipfile), but a file paths list does too (manifest built from the file paths, storage = `FSStorage`). This goes in a bit of different direction than #1901, which was extending `Storage` to support more functionality. I think `Storage` should stay pretty bare and mostly deal with loading/saving data, but not much knowledge of **what** data is there (this is covered with `Manifest`). - a `CollectionSet` is a consistent collection of signatures. Consistent here means: same k-size, downsample-compatible for scaled, same moltype. You can create a `CollectionSet` by running `.select()` on a `Collection`. `CollectionSet` is required for building indices (because we shouldn't be building indices mixing k-size/moltype), and greatly simplifies the logic in many places because it is not necessary to check for compatibility. - `LinearIndex` was rewritten based on `Collection` (and also the `greyhound`/`branchwater` parallelism), and this supports the "parallel search without an index" use case. There is no index construction per se here, pretty much just a thin layer on top of `Collection` implementing functionality expected from indices. - `Manifest`, `Selection`, and `Picklist` are still incomplete, but the relevant function definitions are in place, need to barrage it with tests (and potentially exposing to Python and reusing the ones there in #2726) ## Feature - Initial implementation for `Manifest`, `Selection`, and `Picklist` following the Python API. - `Collection` is a new abstraction for working with a set of signatures. A collection needs a `Storage` for holding the signatures (on-disk, in-memory, or remotely), and a `Manifest` to describe the metadata for each signature. - Expose CSV parsing and RocksDB errors. - New module `sourmash::index::revindex::disk_revindex` with the on-disk RevIndex implementation based on RocksDB. - Add `iter` and `iter_mut` methods for `Signature`. - Add `load_sig` and `save_sig` methods to `Storage` trait for higher-level data manipulation and caching. - Add `spec` method to `Storage` to allow constructing a concrete `Storage` from a string description. - Add `InnerStorage` for synchronizing parallel access to `Storage` implementations. - Add `MemStorage` for keeping signatures in-memory (mostly for debugging and testing). ## Refactor - Rename `HashFunctions` variants to follow camel-case, so `Murmur64Protein` instead of `murmur64_protein` - `LinearIndex` is now implemented as a thin layer on top of `Collection`. - Move `GatherResult` to `sourmash::index` module. - Move `sourmash::index::revindex` to `sourmash::index::mem_revindex` (this is the Greyhound version of revindex, in-memory only). It was also refactored internally to build a version of a `LinearIndex` that will be merged in the future with `sourmash::index::LinearIndex` - Move `select` method from `Index` trait into a separate `Select` trait, and implement it for `Signature` based on the new `Selection` API. - Move `SigStore` into `sourmash::storage` module, and remove the generic. Now it always stores `Signature`. Also implement `Select` for it. ## Build - Add new `branchwater` feature (enabled by default), which can be disabled by downstream projects to limit bringing heavy dependencies like rocksdb - Add new `rkyv` feature (disabled by default), making `MinHash` serializable with the `rkyv` crate. - Add semver checks for CI (so we bump versions accordingly, or avoid breaking changes) - Reduce features combinations on Rust checks (takes much less time to run) - Disable `musllinux` wheels (need to figure out how to build rocksdb for it) --------- Co-authored-by: Tessa Pierce Ward <[email protected]> Co-authored-by: C. Titus Brown <[email protected]>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
This issue extracts the leftover bits from #1352, mostly to do with evolving the
Storage
class and SBTs to better intersect.From comment - @luizirber:
From comment - @ctb:
and more -
The text was updated successfully, but these errors were encountered: