
How to run polars dataframe methods on large FASTQ files in a memory-efficient way #99

Open
abearab opened this issue Feb 27, 2024 · 5 comments


@abearab
Contributor

abearab commented Feb 27, 2024

As I mentioned in this comment, I'm interested in applying the "groupby" function to a polars dataframe. My goal is to count all unique sequences in a FASTQ file. I have a few questions:

  1. How can I take advantage of biobear to read large FASTQ files without loading the whole file into memory?
  2. How can I use multiprocessing to speed up the computation? e.g. https://docs.pola.rs/user-guide/misc/multiprocessing/

Thanks for your time.


Related: #89 and ArcInstitute/ScreenPro2#28

@tshauck
Member

tshauck commented Feb 27, 2024

Thanks again for filing these issues.

For the first question: to avoid loading everything into memory, you can stream through the RecordBatchReader from pyarrow like so:

import biobear as bb
import polars as pl

session = bb.connect()

result = session.sql("SELECT name, sequence FROM fastq_scan('./624-02_lib26670_nextseq_n0240_151bp_R1.fastq.gz')")

rbr = result.to_arrow_record_batch_reader()

for batch in rbr:
    # only one record batch is materialized at a time
    chunk_df = pl.from_arrow(batch)

This streams Arrow record batches from disk, and you'd work on chunks of 8192 records at a time as polars DataFrames.
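
For your unique-sequence counting goal specifically, here's a minimal sketch of that loop that folds per-chunk counts into a running total (the path is illustrative, and it assumes a recent polars where the method is group_by and pl.len() exists; older releases use groupby and pl.count()):

import biobear as bb
import polars as pl

session = bb.connect()
result = session.sql("SELECT sequence FROM fastq_scan('./reads.fastq.gz')")  # illustrative path
rbr = result.to_arrow_record_batch_reader()

# count sequences chunk by chunk, keeping only the running
# (sequence, count) totals in memory
running = None
for batch in rbr:
    chunk_counts = (
        pl.from_arrow(batch)
        .group_by("sequence")
        .agg(pl.len().alias("count"))
    )
    running = chunk_counts if running is None else (
        pl.concat([running, chunk_counts])
        .group_by("sequence")
        .agg(pl.col("count").sum())
    )

Only one chunk plus the running totals is resident at a time, so memory scales with the number of unique sequences rather than the number of reads.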

As I mentioned in the other issue, whatever you push down into SQL is evaluated lazily and will use all of your cores (except the FASTQ file IO), so the more work you can express there, the better the performance.
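
For example, the unique-sequence count can be pushed down entirely into SQL (a sketch; the file path is illustrative):

import biobear as bb

session = bb.connect()

# the GROUP BY executes inside the query engine across all cores; only
# the aggregated (sequence, count) table is materialized in Python
counts = session.sql(
    "SELECT sequence, COUNT(*) AS count "
    "FROM fastq_scan('./reads.fastq.gz') "
    "GROUP BY sequence"
).to_polars()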

Going back to the parquet point: if you first write the data to Parquet, you can use pl.scan_parquet as described in the polars lazy docs (https://docs.pola.rs/user-guide/lazy/using/) and you'll also avoid loading everything into memory. This gives you parallelism during both the file scan and the subsequent computation.
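
A sketch of that two-step flow, writing Parquet from the Arrow stream with pyarrow and then scanning it lazily (file names are illustrative):

import biobear as bb
import polars as pl
import pyarrow.parquet as pq

session = bb.connect()
result = session.sql("SELECT name, sequence FROM fastq_scan('./reads.fastq.gz')")
rbr = result.to_arrow_record_batch_reader()

# write the stream to Parquet one record batch at a time
with pq.ParquetWriter("reads.parquet", rbr.schema) as writer:
    for batch in rbr:
        writer.write_batch(batch)

# lazy scan: polars parallelizes both the scan and the aggregation
counts = (
    pl.scan_parquet("reads.parquet")
    .group_by("sequence")
    .agg(pl.len().alias("count"))
    .collect()
)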


FWIW, I'm looking at implementing multithreaded readers for some of the common bioinformatics formats, but it's complex, as you might imagine.

@abearab
Contributor Author

abearab commented Mar 5, 2024

Hey @tshauck – thanks for your response.

One more technical question: should a user "close" / "disconnect" the session after doing this? I didn't see that in any of your examples, so I wanted to confirm.

session = bb.connect()

@tshauck
Member

tshauck commented Mar 5, 2024

Hey @abearab -- sorry for the confusion; you don't need to worry about closing or disconnecting. The session is implicitly local, though at some point it may support connecting to an external server.

@abearab
Contributor Author

abearab commented Mar 5, 2024

I see, thanks for the clarification. Is there any way to allocate resources? For instance, can we tell it to use only X CPUs and Y amount of memory?

@tshauck
Member

tshauck commented Mar 7, 2024

Currently it's possible to set the number of CPUs used for a few different parts, mostly via settings that control the underlying query engine's (Apache DataFusion) parallelism, such as datafusion.execution.target_partitions.


You can modify this within a session like:

In [1]: import biobear as bb

In [2]: s = bb.connect()

In [3]: s.execute("SET datafusion.execution.target_partitions=1")

In [5]: s.sql("SHOW datafusion.execution.target_partitions").to_polars()
Out[5]: 
shape: (1, 2)
┌───────────────────────────────────┬───────┐
│ name                              ┆ value │
│ ---                               ┆ ---   │
│ str                               ┆ str   │
╞═══════════════════════════════════╪═══════╡
│ datafusion.execution.target_part… ┆ 1     │
└───────────────────────────────────┴───────┘

In [6]: s.execute("SET datafusion.execution.target_partitions=4")

In [7]: s.sql("SHOW datafusion.execution.target_partitions").to_polars()
Out[7]: 
shape: (1, 2)
┌───────────────────────────────────┬───────┐
│ name                              ┆ value │
│ ---                               ┆ ---   │
│ str                               ┆ str   │
╞═══════════════════════════════════╪═══════╡
│ datafusion.execution.target_part… ┆ 4     │
└───────────────────────────────────┴───────┘
