Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Try adding 'gzip_level' parameter for HDF5 writers #42

Closed
wants to merge 0 commits into from

Conversation

ycli1995
Copy link
Contributor

Hi,
I added a gzip_level parameter for functions where data are written to HDF5 files. This may help to reduce file size when users want to write matrix into vanilla H5SparseMatrix format, where an H5 group should at least contain three datasets: data, indices and indptr.

Since H5SparseMatrix-like formats are widely used in cellranger output, HDF5Array matrix and AnnData, users may desire that BPCells matrix can be easily converted into those formats, in not only integers but also floats (normalized expression). As one of the possible solutions, one can first write matrix to a temporary H5 file with un-bitpacked formats, then copy these H5 links (index, idxptr and val) to the destination file to form an H5SparseMatrix, using something like hdf5r. Therefore, an option of gzip_level can help to save the disk storage, in the cost of more writing time though. I set the default gzip_level to 0L so that it should work as the original BPCells.

@bnprks
Copy link
Owner

bnprks commented Aug 29, 2023

Hi @ycli1995, this looks like a great contribution! I'm going through it in detail now, but the first pass looks really good -- I appreciate you adding in the tests for your code, and catching a mistake I had in my error-checking code.

I'm basically happy taking the gzip_level parameter as-is, though I have two areas I'm curious for your thoughts on:

  1. With the default bitpacking compression enabled, I'd suspect (though I'm not certain) that adding gzip would provide little additional space savings while imposing a fairly noticeable speed penalty. Would you mind if I add in a warning when people use gzip_level combined with compress = TRUE?

  2. When gzip is used in HDF5 files for single cell datasets, I most often see it combined with the shuffle filter, which can yield both speed and size savings when applied numeric data prior to gzip. This corresponds to the Shuffle() property in HighFive. I could see a few options for how to handle this:

    • Just default to using the Shuffle() filter on all numeric data if gzip_level != 0
    • Allow some more complicated argument type to gzip_level like a list or character vector for users that want access to turn Shuffle() on/off

    I think I lean towards the first of these options -- what are your thoughts here?

P.S. I see you've started up some repos re-implementing bitpacking compression in Rust -- super cool! Not sure if you're in industry, gradschool, etc. but I'd be happy to provide advice on any BPCells-related projects if that would be useful, as well as helping plan any further improvements to BPCells you might be interested in.

@ycli1995
Copy link
Contributor Author

ycli1995 commented Aug 30, 2023

Hi @bnprks, thanks for reviewing my code!

For the two questions:

  1. I agree that adding a warning when people use gzip_level with compress = TRUE is a good idea. We can then let the users decide whether to use HDF5 compression by themselves.
  2. Sorry for missing the Shuffle() thing. I've seen it in rhdf5 and SeuratDisk before, and yes, it is usually turn on when there is gzip compression. I think that setting default to using shuffle with gizp_level != 0 is quite sufficient, because this has little influence the regular IO behaviors. Users of BPCells may not want to take extra cares for such an internal HDF5 option. Besides, it seems that simply adding an conditional props.add(HighFive::Shuffle()) following props.add(HighFive::Deflate(gzip_level)) in src/arrayIO/hdf5.h can just work.

BPCells is one of the most wonderful packages I've met in 2023! It provides a quite promising framework which may become one of the foundations for large-scale single-cell analysis on desktop. It will be so great if BPCells might become some kind of common library in the future, so that it can be conveniently imported by diverse analysis tools or algorithms, regardless of the program languages.

The bitpacking extension library you found is just some trial for bringing BPCells-like framework to Rust. The reason is that I quite like the one-step importing using cargo add in comparison to c++. However, it's definitely a complex work considering the different philosophy for Rust. I may keep trying it as long as I have spare time from my main job.

@bnprks
Copy link
Owner

bnprks commented Aug 31, 2023

Oh no, I accidentally messed up your home repo commits with a force push. Very sorry about that, purely my own dumb git mistake while I was trying to add a couple minor edits.

I think the way to restore your fork will be to just run a git push --force of your own.

In the mean time, I have merged these changes in to the main branch since I think everything was ready to go! Let me know if you had any other changes you had in mind and we can do that in a new pull request that I haven't messed up.

@bnprks
Copy link
Owner

bnprks commented Sep 7, 2023

@ycli1995 Do you mind if I re-license BPCells from GPLv3 to be dual-licensed Apache-2.0 and MIT? I want to make sure this change has approval from all the BPCells contributors. If yes, could you give a thumbs up or a quick comment reply?

Why this change: GPLv3 requires that all "derivative work" also be licensed under GPLv3 or a related license, which can limit how BPCells can be used or contributed to by companies. The Apache-2.0 and MIT licenses are less restrictive open-source licenses that I hope will expand who can contribute to BPCells while keeping it freely available and open source.

@ycli1995
Copy link
Contributor Author

@ycli1995 Do you mind if I re-license BPCells from GPLv3 to be dual-licensed Apache-2.0 and MIT? I want to make sure this change has approval from all the BPCells contributors. If yes, could you give a thumbs up or a quick comment reply?

Why this change: GPLv3 requires that all "derivative work" also be licensed under GPLv3 or a related license, which can limit how BPCells can be used or contributed to by companies. The Apache-2.0 and MIT licenses are less restrictive open-source licenses that I hope will expand who can contribute to BPCells while keeping it freely available and open source.

Sorry for the delayed response. I'm totally okay with the less restrictive licenses.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants