Try adding 'gzip_level' parameter for HDF5 writers #42
Hi @ycli1995, this looks like a great contribution! I'm going through it in detail now, but the first pass looks really good -- I appreciate you adding in the tests for your code, and catching a mistake I had in my error-checking code. I'm basically happy taking the
P.S. I see you've started up some repos re-implementing bitpacking compression in Rust -- super cool! Not sure if you're in industry, grad school, etc. but I'd be happy to provide advice on any BPCells-related projects if that would be useful, as well as helping plan any further improvements to BPCells you might be interested in.
Hi @bnprks, thanks for reviewing my code! For the two questions:
BPCells is one of the most wonderful packages I've come across in 2023! It provides a quite promising framework which may become one of the foundations for large-scale single-cell analysis on desktop. It would be great if BPCells could become a kind of common library in the future, so that it can be conveniently imported by diverse analysis tools or algorithms, regardless of programming language. The bitpacking extension library you found is just a trial at bringing a BPCells-like framework to Rust. The reason is that I quite like the one-step importing using
Oh no, I accidentally messed up your home repo commits with a force push. Very sorry about that, purely my own dumb git mistake while I was trying to add a couple minor edits. I think the way to restore your fork will be to just run a In the meantime, I have merged these changes into the main branch since I think everything was ready to go! Let me know if you had any other changes you had in mind and we can do that in a new pull request that I haven't messed up.
@ycli1995 Do you mind if I re-license BPCells from GPLv3 to be dual-licensed Apache-2.0 and MIT? I want to make sure this change has approval from all the BPCells contributors. If yes, could you give a thumbs up or a quick comment reply? Why this change: GPLv3 requires that all "derivative work" also be licensed under GPLv3 or a related license, which can limit how BPCells can be used or contributed to by companies. The Apache-2.0 and MIT licenses are less restrictive open-source licenses that I hope will expand who can contribute to BPCells while keeping it freely available and open source.
Sorry for the delayed response. I'm totally okay with the less restrictive licenses.
Hi,

I added a `gzip_level` parameter for functions where data are written to HDF5 files. This may help to reduce file size when users want to write a matrix in the vanilla `H5SparseMatrix` format, where an H5 group should contain at least three datasets: `data`, `indices` and `indptr`.

Since `H5SparseMatrix`-like formats are widely used in cellranger output, `HDF5Array` matrices and `AnnData`, users may want a `BPCells` matrix to be easily convertible into those formats, not only for integers but also for floats (normalized expression). As one possible solution, one can first write the matrix to a temporary H5 file in an un-bitpacked format, then copy these H5 links (`index`, `idxptr` and `val`) to the destination file to form an `H5SparseMatrix`, using something like `hdf5r`. Therefore, a `gzip_level` option can help to save disk storage, though at the cost of longer write times. I set the default `gzip_level` to `0L` so that the behavior matches the original `BPCells`.
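To give a rough feel for the size side of this trade-off, here is a minimal sketch in Python using only the standard library. It is not BPCells code and does not touch HDF5 at all: it builds synthetic `data`, `indices` and `indptr` arrays for a sparse count matrix and compresses the raw bytes with `zlib`, which implements DEFLATE, the same algorithm behind HDF5's gzip filter. The helper names (`make_csc_arrays`, `pack_uint32`, `compressed_sizes`), the matrix dimensions and the value range are all assumptions made up for illustration, and real files will behave differently because HDF5 compresses chunk by chunk.

```python
import random
import struct
import zlib


def make_csc_arrays(n_rows, n_cols, density, seed=0):
    """Build synthetic CSC-style arrays (hypothetical helper, illustration only)."""
    rng = random.Random(seed)
    nnz_per_col = int(n_rows * density)
    data, indices, indptr = [], [], [0]
    for _ in range(n_cols):
        rows = sorted(rng.sample(range(n_rows), nnz_per_col))
        indices.extend(rows)
        # Small integer counts, loosely mimicking raw expression counts.
        data.extend(rng.randint(1, 5) for _ in rows)
        indptr.append(len(indices))
    return data, indices, indptr


def pack_uint32(values):
    """Serialize a list of non-negative ints as little-endian uint32 bytes."""
    return struct.pack(f"<{len(values)}I", *values)


def compressed_sizes(raw, levels=(0, 1, 4, 9)):
    """Return {level: compressed byte count} for each DEFLATE level."""
    return {level: len(zlib.compress(raw, level)) for level in levels}


data, indices, indptr = make_csc_arrays(n_rows=2000, n_cols=200, density=0.05)
raw = {
    "data": pack_uint32(data),
    "indices": pack_uint32(indices),
    "indptr": pack_uint32(indptr),
}
sizes = {name: compressed_sizes(buf) for name, buf in raw.items()}

for name, by_level in sizes.items():
    print(f"{name}: raw={len(raw[name])} bytes, by level={by_level}")
```

In zlib, level 0 stores the bytes without compressing them, so its output is slightly larger than the input, while higher levels spend more CPU time searching for shorter encodings. That is the same shape of trade-off described above for `gzip_level`: a default of 0 keeps writes fast, and a higher level trades write time for disk space.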