Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature: FFI for Rust LinearIndex #2727

Draft
wants to merge 1 commit into
base: latest
Choose a base branch
from

Conversation

luizirber
Copy link
Member

(spun off #2230, needs it and #2726 to work)

Exposes the LinearIndex in Rust for use in Python.

ctb added a commit that referenced this pull request Dec 1, 2023
On-disk RevIndex based on RocksDB, initially implemented in
https://github.com/luizirber/2022-06-26-rocksdb-eval
This is the index/core data structure backing
https://mastiff.sourmash.bio

There are many changes in the Rust code, so bumping the version to
`0.12.0`.

This is mostly not exposed thru the FFI yet. Tests from the from the
in-memory `RevIndex` (greyhound) from #1238 were kept working, but it is
not well supported (doesn't allow saving/loading from disk, for
example), and should be wholly replaced by
`sourmash::index::revindex::disk_revindex` (the on-disk RevIndex) in the
future.
It is confusing to have these different RevIndex impls in Rust, and I
started converging them, but the work is not completely done yet.

#2727 is a better starting point for how `Index` abc/trait should work
acrosss Python/Rust, and I started moving the Rust indices to start from
a `LinearIndex` and later specialize into a `RevIndex`, which will make
easier to translate the work from #2727 for future indices across FFI.

A couple of new concepts introduced in this PR:

- a `Collection` is a `Manifest` + `Storage`. So a zip file like the
ones for GTDB databases fit this easily (storage = `ZipStorage`,
manifest is read from the zipfile), but a file paths list does too
(manifest built from the file paths, storage = `FSStorage`). This goes
in a bit of different direction than #1901, which was extending
`Storage` to support more functionality. I think `Storage` should stay
pretty bare and mostly deal with loading/saving data, but not much
knowledge of **what** data is there (this is covered with `Manifest`).

- a `CollectionSet` is a consistent collection of signatures. Consistent
here means: same k-size, downsample-compatible for scaled, same moltype.
You can create a `CollectionSet` by running `.select()` on a
`Collection`. `CollectionSet` is required for building indices (because
we shouldn't be building indices mixing k-size/moltype), and greatly
simplifies the logic in many places because it is not necessary to check
for compatibility.

- `LinearIndex` was rewritten based on `Collection` (and also the
`greyhound`/`branchwater` parallelism), and this supports the "parallel
search without an index" use case. There is no index construction per se
here, pretty much just a thin layer on top of `Collection` implementing
functionality expected from indices.

- `Manifest`, `Selection`, and `Picklist` are still incomplete, but the
relevant function definitions are in place, need to barrage it with
tests (and potentially exposing to Python and reusing the ones there in
#2726)

## Feature

- Initial implementation for `Manifest`, `Selection`, and `Picklist`
following
  the Python API.
- `Collection` is a new abstraction for working with a set of
signatures. A
collection needs a `Storage` for holding the signatures (on-disk,
in-memory,
or remotely), and a `Manifest` to describe the metadata for each
signature.
- Expose CSV parsing and RocksDB errors.
- New module `sourmash::index::revindex::disk_revindex` with the on-disk
  RevIndex implementation based on RocksDB.
- Add `iter` and `iter_mut` methods for `Signature`.
- Add `load_sig` and `save_sig` methods to `Storage` trait for
higher-level data
  manipulation and caching.
- Add `spec` method to `Storage` to allow constructing a concrete
`Storage` from
  a string description.
- Add `InnerStorage` for synchronizing parallel access to `Storage`
  implementations.
- Add `MemStorage` for keeping signatures in-memory (mostly for
debugging and
  testing).

## Refactor

- Rename `HashFunctions` variants to follow camel-case, so
`Murmur64Protein` instead of `murmur64_protein`
- `LinearIndex` is now implemented as a thin layer on top of
`Collection`.
- Move `GatherResult` to `sourmash::index` module.
- Move `sourmash::index::revindex` to `sourmash::index::mem_revindex`
(this is
the Greyhound version of revindex, in-memory only). It was also
refactored
internally to build a version of a `LinearIndex` that will be merged in
the
  future with `sourmash::index::LinearIndex`
- Move `select` method from `Index` trait into a separate `Select`
trait,
  and implement it for `Signature` based on the new `Selection` API.
- Move `SigStore` into `sourmash::storage` module, and remove the
generic. Now
  it always stores `Signature`. Also implement `Select` for it.

## Build

- Add new `branchwater` feature (enabled by default), which can be
disabled by downstream projects to limit bringing heavy dependencies
like rocksdb
- Add new `rkyv` feature (disabled by default), making `MinHash`
serializable
  with the `rkyv` crate.
- Add semver checks for CI (so we bump versions accordingly, or avoid
breaking changes)
- Reduce features combinations on Rust checks (takes much less time to
run)
- Disable `musllinux` wheels (need to figure out how to build rocksdb
for it)

---------

Co-authored-by: Tessa Pierce Ward <[email protected]>
Co-authored-by: C. Titus Brown <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant