-
Notifications
You must be signed in to change notification settings - Fork 80
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Using bitmagic for internal Nodegraph representation #1221
Conversation
Codecov Report
@@ Coverage Diff @@
## latest #1221 +/- ##
==========================================
- Coverage 82.75% 82.49% -0.26%
==========================================
Files 122 122
Lines 13206 13246 +40
Branches 1780 1780
==========================================
- Hits 10928 10927 -1
- Misses 2014 2055 +41
Partials 264 264
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
sounds great ;) |
Tentative: explore using simple-sds instead of bitmagic. It is a pure Rust impl (good for wasm!), but is is very new if compared to bitmagic. (we don't need all features of bitmagic, and I think |
I've been thinking about this PR, and maybe it is time to graduate it from My main concerns were new Nodegraph format and bringing C++ into the codebase again (even if as a dependency), but:
So, the path for merging this is
|
Sounds good to me; I'm not a big fan of complicating builds further, but the performance gains seem significant and as you say it's not locking us into anything in particular. |
Note that the "complicating builds" part is about building wheels in CI, it still works fine for local dev with conda (because it has all the compilers/deps installed already). I'm trying to contain all the build complexity in the bitmagic-rs crate, and it should be as transparent as possible here in sourmash. |
ahh! you are indeed wise. :gandalf emoji: |
6977802
to
aa45cb6
Compare
A curious bump: the bitmagic C API has issues building on aarch64/s390x/ppc64le, which are platforms that we currently build wheels for. I opened tlk00/BitMagic#65 to discuss solutions and luizirber/bitmagic-rs#2 to add cross-platform testing to bitmagic-rs CI. (it is probably solvable by disabling SIMD feature detection in these platforms, because it is using x86-specific instructions to execute it) |
Fixed by tlk00/BitMagic#66 Tests are still failing on s390x, and since 1) I don't have access to an IBM mainframe to debug it and 2) I highly doubt someone is using sourmash in an IBM mainframe, I'm proposing to drop |
d756bb1
to
f190db9
Compare
d2bc8c3
to
49a8412
Compare
I still like the idea of using bitmagic for Nodegraph, but not working on SBTs these days (and #2230 goes with roaring bitmaps instead) |
This PR replaces the
fixedbitset
dependency withbitmagic
, a crate wrapping the bitmagic C++ library for compressed bit-vector containers. It is used to implement the internal representation ofNodegraph
at the moment, but I want to use it to implementHowDeSBT
-like internal nodes in the future.This is not changing much in the codebase because I wrote the
bitmagic
crate with thefixedbitset
API in mind, and despite some (necessary) optimizations it is already working. It is especially interesting becausebitmagic
also keeps the compressed representation in memory, which means that the memory consumption is way lower than what can be achieved with largefixedbitset
vectors (or with khmerNodegraph
, which also allocates a large memory buffer for the Bloom Filter). For example, I rangather
with SBTs (k=51
,-x 1e5
) with 5.9k signatures:There are more things to fix along the way, but an unholy union of #1137 #1138 and this PR yield a 6.51 seconds runtime, with 358 MB of memory consumption (I added the numbers to the table).
(no, the memory consumption is not a typo)
#1138 is a potential improvement for lowering the runtime (doing hash-by-hash membership checks with the
bitmagic
library is slow, because it is going thru the C FFI layer).And I think this will work very well with #1201, but that needs more work =]
Why is this an experiment?
99
=]), which is incompatible withkhmer
Checklist
make test
Did it pass the tests?make coverage
Is the new code covered?without a major version increment. Changing file formats also requires a
major version number increment.
changes were made?