How to run polars dataframe methods on large FASTQ files in a memory-efficient way #99
Thanks again for filing these issues. For the first, without loading everything into memory, you can stream through the file:

```python
import biobear as bb
import polars as pl

session = bb.connect()
result = session.sql(
    "SELECT name, sequence FROM fastq_scan('./624-02_lib26670_nextseq_n0240_151bp_R1.fastq.gz')"
)
rbr = result.to_arrow_record_batch_reader()

for batch in rbr:
    # each batch arrives as an arrow record batch; convert it to polars
    chunk_df = pl.from_arrow(batch)
```

This would stream arrow record batches from disk, and you'd work on chunks of 8192 records as polars dataframes. Like I mentioned in the other issue, whatever you push down into SQL will be lazy and use all of your cores (except the fastq file IO), so if you can do more there, you'll get better performance. Going back to the parquet point: if you initially write the data to parquet, you can use lazy scans over the parquet file for later queries.

FWIW, I'm looking at implementing multithreaded readers for some of the common bioinformatics formats, but it's complex, as you might imagine.
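A hedged illustration of the pushdown advice above (this sketch is not from the thread): the unique-sequence counting this issue is about can be expressed entirely in SQL, so the aggregation runs in the lazy, multi-core engine and only one row per unique sequence ever reaches Python. It assumes the result object exposes `to_polars()`, as in other biobear examples, and uses a placeholder file path.

```python
import biobear as bb

session = bb.connect()

# The GROUP BY executes inside the SQL engine; Python only ever sees
# the aggregated counts, not the raw reads.
counts = session.sql(
    "SELECT sequence, COUNT(*) AS n "
    "FROM fastq_scan('./reads.fastq.gz') "  # placeholder path
    "GROUP BY sequence"
).to_polars()
```

If the data is first written to parquet, the same aggregation could instead be run lazily over the parquet file, e.g. with polars' `pl.scan_parquet`.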
Hey @tshauck – thanks for your response. One more technical question: should a user "close" / "disconnect" the session after doing this? I didn't see that in any of your examples, so I wanted to confirm.
Hey @abearab -- sorry for the confusion; you don't need to worry about closing or disconnecting. The session is implicitly local, though at some point it may take an external server to connect to.
I see, thanks for the clarification. Is there any way to allocate resources? For instance, can we tell it to use just X CPUs and Y amount of memory?
As I mentioned in this comment, I'm interested in applying the "groupby" function on a polars dataframe. My goal is counting all unique sequences in a FASTQ file. I have a few questions:

- How can I use `biobear` to read large FASTQ files without loading the whole data into memory?

Thanks for your time.

Related: #89 and ArcInstitute/ScreenPro2#28
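A hedged sketch (not part of the issue) combining the streaming answer above with this group_by goal: count unique sequences within each streamed chunk, then merge the per-chunk results, so the full read set is never held in memory at once. The file path is a placeholder, and it assumes a recent polars with `group_by` and `pl.len()`.

```python
import biobear as bb
import polars as pl

session = bb.connect()
rbr = session.sql(
    "SELECT sequence FROM fastq_scan('./reads.fastq.gz')"  # placeholder path
).to_arrow_record_batch_reader()

# Count unique sequences within each streamed chunk.
partial_counts = []
for batch in rbr:
    chunk_df = pl.from_arrow(batch)
    partial_counts.append(chunk_df.group_by("sequence").agg(pl.len().alias("n")))

# Merge: a sequence seen in several chunks has its per-chunk counts summed.
total = pl.concat(partial_counts).group_by("sequence").agg(pl.col("n").sum())
```

If most sequences repeat, memory stays closer to the number of distinct sequences than to the number of reads, which is the point of streaming here.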