[WIP] Using bitmagic for internal Nodegraph representation #1221

Closed
wants to merge 16 commits into from

luizirber
Member

@luizirber luizirber commented Oct 27, 2020

This PR replaces the fixedbitset dependency with bitmagic, a crate wrapping the bitmagic C++ library for compressed bit-vector containers. It is used to implement the internal representation of Nodegraph at the moment, but I want to use it to implement HowDeSBT-like internal nodes in the future.

This is not changing much in the codebase because I wrote the bitmagic crate with the fixedbitset API in mind, and despite some (necessary) optimizations it is already working. It is especially interesting because bitmagic also keeps the compressed representation in memory, which means that the memory consumption is way lower than what can be achieved with large fixedbitset vectors (or with khmer Nodegraph, which also allocates a large memory buffer for the Bloom Filter). For example, I ran gather with SBTs (k=51, -x 1e5) with 5.9k signatures:

                           Runtime (s)   Memory (MB)   Index size (MB)
original                          7.36         7,367               233
bitmagic                         24.94           334               241
#1137 + #1138 + bitmagic          6.51           358               241

There are more things to fix along the way, but an unholy union of #1137, #1138 and this PR yields a runtime of 6.51 seconds with 358 MB of memory consumption (I added the numbers to the table).

(no, the memory consumption is not a typo)
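
To illustrate why a dense bit vector costs the same memory no matter how few bits are set, here is a minimal Python sketch. `DenseBits` and `SparseBits` are hypothetical toy classes, not the fixedbitset or bitmagic APIs, and a Python set is only a stand-in for a compressed container. The point is that both expose the same small interface, so one can replace the other without touching the calling code.

```python
class DenseBits:
    """Toy fixed-size bit vector: one bit per slot, always fully allocated."""

    def __init__(self, size):
        self.size = size
        self.buf = bytearray((size + 7) // 8)

    def insert(self, pos):
        self.buf[pos // 8] |= 1 << (pos % 8)

    def contains(self, pos):
        return bool(self.buf[pos // 8] & (1 << (pos % 8)))

    def payload_bytes(self):
        # Independent of how many bits are set.
        return len(self.buf)


class SparseBits:
    """Toy stand-in for a compressed container: stores set positions only."""

    def __init__(self, size):
        self.size = size
        self.positions = set()

    def insert(self, pos):
        self.positions.add(pos)

    def contains(self, pos):
        return pos in self.positions

    def stored_positions(self):
        # Grows with the number of set bits, not with the table size.
        return len(self.positions)


def fill(bits, hashes):
    """Insert every hash (reduced modulo the table size) and return the container."""
    for h in hashes:
        bits.insert(h % bits.size)
    return bits


hashes = [11, 4242, 99991, 123456789]
dense = fill(DenseBits(100_000), hashes)
sparse = fill(SparseBits(100_000), hashes)
```

In this toy setup the dense buffer is sized by the table (100,000 slots) while the sparse container only tracks the four inserted positions, and both answer `contains` identically.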

#1138 is a potential improvement for lowering the runtime (doing hash-by-hash membership checks with the bitmagic library is slow, because it is going thru the C FFI layer).
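
The cost of hash-by-hash checks can be sketched in plain Python. This is only an illustration of the general idea of replacing many small calls with one bulk operation: the function names are hypothetical, Python ints stand in for bit vectors, and nothing here reflects the actual code in #1138.

```python
def to_bits(hashes, size):
    """Pack hash positions into a Python int used as a toy bit vector."""
    bits = 0
    for h in hashes:
        bits |= 1 << (h % size)
    return bits


def matches_per_hash(node_bits, query_hashes, size):
    """One membership check per hash: many small calls across the boundary.

    Returns (number of query positions present, number of checks made).
    """
    found = 0
    checks = 0
    for pos in {h % size for h in query_hashes}:
        checks += 1
        if (node_bits >> pos) & 1:
            found += 1
    return found, checks


def matches_bulk(node_bits, query_bits):
    """One bulk operation: AND the two vectors, then count the set bits."""
    return bin(node_bits & query_bits).count("1")


SIZE = 1_000
node = to_bits([3, 17, 256, 999, 1_003], SIZE)  # 1_003 also maps to slot 3
query = [17, 256, 500]

found, checks = matches_per_hash(node, query, SIZE)
bulk = matches_bulk(node, to_bits(query, SIZE))
```

Both approaches count the same matches, but the per-hash version makes one check for every distinct query position, while the bulk version is a single operation on two vectors.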

And I think this will work very well with #1201, but that needs more work =]

Why is this an experiment?

  • This is creating a new Nodegraph version (99 =]), which is incompatible with khmer
  • This brings C++ again into the codebase, and I would like to keep it working with webassembly (although bitmagic can be built for wasm, so this is potentially fixable)
  • It works better if [WIP] SBT scaffold #1201 is functional and merged first

Checklist

  • Is it mergeable?
  • make test Did it pass the tests?
  • make coverage Is the new code covered?
  • Did it change the command-line interface? Only additions are allowed
    without a major version increment. Changing file formats also requires a
    major version number increment.
  • Was a spellchecker run on the source code and documentation after
    changes were made?

@codecov

codecov bot commented Oct 27, 2020

Codecov Report

Merging #1221 (7e8458f) into latest (ffbc919) will decrease coverage by 0.25%.
The diff coverage is 67.10%.

@@            Coverage Diff             @@
##           latest    #1221      +/-   ##
==========================================
- Coverage   82.75%   82.49%   -0.26%     
==========================================
  Files         122      122              
  Lines       13206    13246      +40     
  Branches     1780     1780              
==========================================
- Hits        10928    10927       -1     
- Misses       2014     2055      +41     
  Partials      264      264              

Flag     Coverage Δ
python   90.62% <16.66%> (-0.04%) ⬇️
rust     65.22% <71.42%> (-0.58%) ⬇️

Impacted Files                      Coverage Δ
src/core/src/ffi/nodegraph.rs       0.00% <0.00%> (ø)
src/sourmash/nodegraph.py           69.91% <16.66%> (-1.91%) ⬇️
src/core/src/sketch/nodegraph.rs    82.40% <76.92%> (-8.78%) ⬇️
src/sourmash/sourmash_args.py       93.50% <0.00%> (-0.04%) ⬇️
src/sourmash/lca/lca_db.py          91.27% <0.00%> (ø)
src/sourmash/index/__init__.py      96.45% <0.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@ctb
Contributor

ctb commented Oct 27, 2020

sounds great ;)

@luizirber
Member Author

Tentative: explore using simple-sds instead of bitmagic. It is a pure Rust impl (good for wasm!), but it is very new compared to bitmagic.

(we don't need all features of bitmagic, and I think simple-sds has all we need)

@luizirber
Member Author

luizirber commented Mar 16, 2021

I've been thinking about this PR, and maybe it is time to graduate it from [EXP] to [WIP].

My main concerns were new Nodegraph format and bringing C++ into the codebase again (even if as a dependency), but:

  • bitmagic supports loading older versions of its serialization format. It is more complex/opaque than what we have with Nodegraphs (which is a simple format and well documented), but we can stick to whatever bitmagic version works for us.
  • The Nodegraph usage in sourmash is mostly as caching (the internal nodes in the SBT), and they can be rebuilt if needed (because all the data are in the leaves), so even if something happens with bitmagic we can still switch to another format.
  • The C++ dep complicates wasm support, but in [WIP] Reduce wasm surface by removing index #1399 I proposed to support only a subset of the sourmash API in wasm (which excludes the part that uses Nodegraphs). bitmagic can work on wasm too, but I didn't figure out how to build it together with Rust yet, so this is probably fixable in the future.
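
The point that internal nodes are only a cache can be sketched as follows. This is a toy model with hypothetical names, assuming each internal node is the bitwise union of its children; it is not sourmash's SBT implementation.

```python
def leaf_bits(hashes, size):
    """Pack a leaf's hash positions into a Python int used as a toy bit vector."""
    bits = 0
    for h in hashes:
        bits |= 1 << (h % size)
    return bits


def rebuild_levels(leaves):
    """Rebuild every internal level of a binary tree from the leaves alone.

    Returns a list of levels, leaves first and the root level last.
    Each parent is the bitwise OR (union) of its one or two children.
    """
    levels = [list(leaves)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        parents = []
        for i in range(0, len(prev), 2):
            merged = 0
            for child in prev[i:i + 2]:
                merged |= child
            parents.append(merged)
        levels.append(parents)
    return levels


SIZE = 64
leaves = [leaf_bits(hs, SIZE) for hs in ([1, 2], [2, 3], [40], [41, 63], [5])]
levels = rebuild_levels(leaves)
root = levels[-1][0]
```

Because every level above the leaves is derived data, throwing the internal nodes away and calling `rebuild_levels` again gives back the same tree, whatever format the internal nodes were stored in.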

So, the path for merging this is

  • Make an official Nodegraph version (instead of 99), and document how it is saved on disk
  • Fix wheel building to include what is needed for building bitmagic (cmake and so on)
  • Decide how to save new Nodegraphs: expose version on the save function? (for keeping compatibility with khmer)
  • Maybe just drop compat with khmer (allow loading, but not saving anymore)?
  • Review [WIP] Remove min_n_below from search code #1137 and [WIP] Use a Nodegraph for searching in internal nodes #1138 again, and see if they still apply with current changes/explorations in Index
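
One possible shape for the "expose version on the save function" and "allow loading, but not saving" options is sketched below. The magic bytes, version numbers and payload layouts are invented for illustration and do not describe the khmer or bitmagic on-disk formats.

```python
import io
import struct

MAGIC = b"TOYN"             # hypothetical file signature
LEGACY, CURRENT = 1, 2      # hypothetical version numbers
WRITABLE = {CURRENT}        # legacy files can be read but not written


def save(positions, size, fp, version=CURRENT):
    """Write the set positions using the requested (writable) version."""
    if version not in WRITABLE:
        raise ValueError(f"saving version {version} is not supported")
    # Version 2 payload: table size, count, then the sorted set positions.
    fp.write(MAGIC + struct.pack("<BQQ", version, size, len(positions)))
    for pos in sorted(positions):
        fp.write(struct.pack("<Q", pos))


def load(fp):
    """Read either version and return (size, set of positions)."""
    if fp.read(4) != MAGIC:
        raise ValueError("not a toy nodegraph file")
    (version,) = struct.unpack("<B", fp.read(1))
    (size,) = struct.unpack("<Q", fp.read(8))
    if version == LEGACY:
        # Version 1 payload: the dense bit buffer, one bit per slot.
        buf = fp.read((size + 7) // 8)
        positions = {i for i in range(size) if buf[i // 8] & (1 << (i % 8))}
    elif version == CURRENT:
        (count,) = struct.unpack("<Q", fp.read(8))
        positions = {struct.unpack("<Q", fp.read(8))[0] for _ in range(count)}
    else:
        raise ValueError(f"unknown version {version}")
    return size, positions


# Round trip through the current version.
out = io.BytesIO()
save({3, 9}, 16, out)
out.seek(0)
size, positions = load(out)

# A hand-built legacy file: header plus a two-byte dense buffer with bits 3 and 9 set.
legacy = io.BytesIO(
    MAGIC + struct.pack("<BQ", LEGACY, 16) + bytes([0b00001000, 0b00000010])
)
legacy_size, legacy_positions = load(legacy)
```

Keeping the version as an explicit argument with a set of writable versions makes "drop saving in the old format" a one-line change, while the loader keeps accepting old files.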

@ctb
Contributor

ctb commented Mar 19, 2021

Sounds good to me; I'm not a big fan of complicating builds further, but the performance gains seem significant and as you say it's not locking us into anything in particular.

@luizirber
Member Author

> Sounds good to me; I'm not a big fan of complicating builds further, but the performance gains seem significant and as you say it's not locking us into anything in particular.

Note that the "complicating builds" part is about building wheels in CI, it still works fine for local dev with conda (because it has all the compilers/deps installed already). I'm trying to contain all the build complexity in the bitmagic-rs crate, and it should be as transparent as possible here in sourmash.

@ctb
Contributor

ctb commented Mar 19, 2021

> Note that the "complicating builds" part is about building wheels in CI, it still works fine for local dev with conda (because it has all the compilers/deps installed already). I'm trying to contain all the build complexity in the bitmagic-rs crate, and it should be as transparent as possible here in sourmash.

ahh! you are indeed wise. :gandalf emoji:

@luizirber
Member Author

A curious bump: the bitmagic C API has issues building on aarch64/s390x/ppc64le, which are platforms that we currently build wheels for. I opened tlk00/BitMagic#65 to discuss solutions and luizirber/bitmagic-rs#2 to add cross-platform testing to bitmagic-rs CI.

(it is probably solvable by disabling SIMD feature detection in these platforms, because it is using x86-specific instructions to execute it)

@luizirber
Member Author

> A curious bump: the bitmagic C API has issues building on aarch64/s390x/ppc64le, which are platforms that we currently build wheels for. I opened tlk00/BitMagic#65 to discuss solutions and luizirber/bitmagic-rs#2 to add cross-platform testing to bitmagic-rs CI.
>
> (it is probably solvable by disabling SIMD feature detection in these platforms, because it is using x86-specific instructions to execute it)

Fixed by tlk00/BitMagic#66

Tests are still failing on s390x, and since 1) I don't have access to an IBM mainframe to debug it and 2) I highly doubt anyone is using sourmash on an IBM mainframe, I'm proposing to drop s390x wheel support.
(this also cuts a bit of GH Actions runtime, since the wheels are built using qemu to emulate an s390x machine...)

@luizirber luizirber changed the title [EXP] Using bitmagic for internal Nodegraph representation [WIP] Using bitmagic for internal Nodegraph representation Apr 16, 2021
@luizirber
Member Author

I still like the idea of using bitmagic for Nodegraph, but I'm not working on SBTs these days (and #2230 goes with roaring bitmaps instead)

@luizirber luizirber closed this Feb 13, 2023
@luizirber luizirber deleted the bitmagic branch February 13, 2023 04:39