
How to run polars dataframe methods on large FASTQ files in a memory-efficient way #99

Open
abearab opened this issue Feb 27, 2024 · 5 comments


@abearab
Contributor

abearab commented Feb 27, 2024

As I mentioned in this comment, I'm interested in applying the "groupby" function to a polars dataframe. My goal is to count all unique sequences in a FASTQ file. I have a few questions:

  1. How can I take advantage of biobear to read large FASTQ files without loading the whole file into memory?
  2. How can I use multiprocessing to speed up the computation? e.g. https://docs.pola.rs/user-guide/misc/multiprocessing/

Thanks for your time.


Related: #89 and ArcInstitute/ScreenPro2#28

@tshauck
Member

tshauck commented Feb 27, 2024

Thanks again for filing these issues.

For the first question: to avoid loading everything into memory, you can stream through the RecordBatchReader from pyarrow like so:

import biobear as bb
import polars as pl

session = bb.connect()

result = session.sql("SELECT name, sequence FROM fastq_scan('./624-02_lib26670_nextseq_n0240_151bp_R1.fastq.gz')")

rbr = result.to_arrow_record_batch_reader()

for batch in rbr:
    # only one record batch is materialized at a time
    chunk_df = pl.from_arrow(batch)

This streams Arrow record batches from disk, and you'd work on chunks of 8192 records at a time as polars DataFrames.
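
For your unique-sequence counting goal specifically, here's a minimal sketch of that loop that folds per-chunk counts into a running total (the path is illustrative, and it assumes a recent polars where the method is group_by and pl.len() exists; older releases use groupby and pl.count()):

import biobear as bb
import polars as pl

session = bb.connect()
result = session.sql("SELECT sequence FROM fastq_scan('./reads.fastq.gz')")  # illustrative path
rbr = result.to_arrow_record_batch_reader()

# count sequences chunk by chunk, keeping only the running
# (sequence, count) totals in memory
running = None
for batch in rbr:
    chunk_counts = (
        pl.from_arrow(batch)
        .group_by("sequence")
        .agg(pl.len().alias("count"))
    )
    running = chunk_counts if running is None else (
        pl.concat([running, chunk_counts])
        .group_by("sequence")
        .agg(pl.col("count").sum())
    )

Only one chunk plus the running totals is resident at a time, so memory scales with the number of unique sequences rather than the number of reads.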

As I mentioned in the other issue, whatever you push down into SQL is evaluated lazily and will use all of your cores (except the FASTQ file IO), so the more work you can express there, the better the performance.
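
For example, the unique-sequence count can be pushed down entirely into SQL (a sketch; the file path is illustrative):

import biobear as bb

session = bb.connect()

# the GROUP BY executes inside the query engine across all cores; only
# the aggregated (sequence, count) table is materialized in Python
counts = session.sql(
    "SELECT sequence, COUNT(*) AS count "
    "FROM fastq_scan('./reads.fastq.gz') "
    "GROUP BY sequence"
).to_polars()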

Going back to the parquet point: if you first write the data to Parquet, you can use pl.scan_parquet as described in the polars lazy docs (https://docs.pola.rs/user-guide/lazy/using/) and you'll also avoid loading everything into memory. This gives you parallelism during both the file scan and the subsequent computation.
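
A sketch of that two-step flow, writing Parquet from the Arrow stream with pyarrow and then scanning it lazily (file names are illustrative):

import biobear as bb
import polars as pl
import pyarrow.parquet as pq

session = bb.connect()
result = session.sql("SELECT name, sequence FROM fastq_scan('./reads.fastq.gz')")
rbr = result.to_arrow_record_batch_reader()

# write the stream to Parquet one record batch at a time
with pq.ParquetWriter("reads.parquet", rbr.schema) as writer:
    for batch in rbr:
        writer.write_batch(batch)

# lazy scan: polars parallelizes both the scan and the aggregation
counts = (
    pl.scan_parquet("reads.parquet")
    .group_by("sequence")
    .agg(pl.len().alias("count"))
    .collect()
)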


FWIW, I'm looking at implementing multithreaded readers for some of the common bioinformatics formats, but it's complex, as you might imagine.

@abearab
Contributor Author

abearab commented Mar 5, 2024

Hey @tshauck – thanks for your response.

One more technical question: should a user "close" / "disconnect" the session after doing this? I didn't see that in any of your examples, so I wanted to confirm.

session = bb.connect()

@tshauck
Member

tshauck commented Mar 5, 2024

Hey @abearab -- sorry for the confusion; you don't need to worry about closing or disconnecting. The session is implicitly local, though at some point it may support connecting to an external server.

@abearab
Contributor Author

abearab commented Mar 5, 2024

I see, thanks for the clarification. Is there any way to allocate resources? For instance, can we tell it to use only X CPUs and Y amount of memory?

@tshauck
Member

tshauck commented Mar 7, 2024

Currently it's possible to set the number of CPUs used for a few different parts, mostly via settings that control the underlying query engine's (Apache DataFusion) parallelism, such as datafusion.execution.target_partitions.


You can modify this within a session like:

In [1]: import biobear as bb

In [2]: s = bb.connect()

In [3]: s.execute("SET datafusion.execution.target_partitions=1")

In [5]: s.sql("SHOW datafusion.execution.target_partitions").to_polars()
Out[5]: 
shape: (1, 2)
┌───────────────────────────────────┬───────┐
│ name                              ┆ value │
│ ---                               ┆ ---   │
│ str                               ┆ str   │
╞═══════════════════════════════════╪═══════╡
│ datafusion.execution.target_part… ┆ 1     │
└───────────────────────────────────┴───────┘

In [6]: s.execute("SET datafusion.execution.target_partitions=4")

In [7]: s.sql("SHOW datafusion.execution.target_partitions").to_polars()
Out[7]: 
shape: (1, 2)
┌───────────────────────────────────┬───────┐
│ name                              ┆ value │
│ ---                               ┆ ---   │
│ str                               ┆ str   │
╞═══════════════════════════════════╪═══════╡
│ datafusion.execution.target_part… ┆ 4     │
└───────────────────────────────────┴───────┘
