
10x run leaks memory like crazy #559

Closed · olgabot opened this issue Oct 26, 2018 · 5 comments
@olgabot (Collaborator) commented Oct 26, 2018

TL;DR: Running sourmash compute on larger 10X bam files crashed our 2TB ram machine (!!!)

I've been trying to run sourmash compute on a few 10x bam files with 3458 and 610 barcodes; previously I had tested files with 150 and 625 barcodes with no problem. Because the bam file is sorted by coordinate, the code iterates over each alignment, checks whether that alignment's barcode has already been added, adds it if not, and then adds the sequence. Since it's unknown a priori which sequences correspond to which barcodes, every barcode's data has to stay resident, and this ends up taking a LOT of memory. I crashed our 2TB ram machine running sourmash compute on these two files 😱

The options I see are:

  • Refactor the 10x bam code to first sort the bam file by barcode, then write the signature generation as an iterator that yields a signature each time it hits a new barcode, writing each one to file before purging it from memory (see the sketch after this list)
  • Require a sorted-by-barcode bam file as input (annoying to lazy users like me)
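For the first option, here's a minimal sketch of what I mean, assuming the bam has already been sorted by the CB (cell barcode) tag, e.g. with `samtools sort -t CB in.bam -o sorted.bam`. The `make_minhash` factory is a stand-in (something like `lambda: sourmash.MinHash(n=500, ksize=31)`), not sourmash's actual compute code:

```python
# Minimal sketch, assuming the bam is already sorted by the CB tag.
# `make_minhash` is a hypothetical factory, not sourmash's real API.
import pysam

def signatures_by_barcode(bam_path, make_minhash):
    """Yield (barcode, minhash) pairs, one barcode at a time."""
    current_barcode, mh = None, None
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for aln in bam:
            # skip alignments with no cell barcode or no sequence
            if not aln.has_tag("CB") or aln.query_sequence is None:
                continue
            barcode = aln.get_tag("CB")
            if barcode != current_barcode:
                if mh is not None:
                    yield current_barcode, mh  # previous barcode is complete
                current_barcode, mh = barcode, make_minhash()
            mh.add_sequence(aln.query_sequence, force=True)
    if mh is not None:
        yield current_barcode, mh  # flush the last barcode
```

The caller could then write each signature to disk as it's yielded, so only one barcode's sketch is ever in memory at a time.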

What do you think?

@ctb (Contributor) commented Nov 1, 2018

I have no real knowledge here :). What about building signatures for all barcodes simultaneously?
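One way to read that suggestion, as a rough sketch only: a dict of minhashes keyed by barcode, filled in a single pass. Note this keeps every barcode's sketch resident at once, which is exactly the scaling concern above (`make_minhash` is the same hypothetical factory as in the earlier sketch):

```python
# Sketch of the "all barcodes simultaneously" idea: every barcode's
# minhash lives in one dict for the whole pass over the bam.
import pysam

def signatures_all_at_once(bam_path, make_minhash):
    sketches = {}  # barcode -> minhash, all resident at the same time
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for aln in bam:
            if not aln.has_tag("CB") or aln.query_sequence is None:
                continue
            bc = aln.get_tag("CB")
            if bc not in sketches:
                sketches[bc] = make_minhash()
            sketches[bc].add_sequence(aln.query_sequence, force=True)
    return sketches
```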

@ctb (Contributor) commented Aug 2, 2019

Does #685/#687 fix this?

@olgabot (Collaborator, Author) commented Aug 3, 2019

Haven't tested it yet, but at a glance, those PRs address the compare functionality, whereas this issue was happening just on compute: the file included all ~700k possible barcodes, and inefficient data structures were used to store them all. @pranathivemuri is working on a fix to reading the 10x data here: https://github.com/pranathivemuri/sourmash/blob/pranathi-10x/sourmash/commands.py

We may also need a filter that only allows barcodes with at least N reads, to remove the 'bad' barcodes and reduce the total memory used (roughly sketched below).
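Something like this could work for that filter — a rough sketch that does a counting pass first; the cutoff value here is purely illustrative and would need tuning:

```python
# First pass: count reads per CB tag, then keep barcodes with at
# least `min_reads`. The default cutoff is an illustrative guess.
from collections import Counter
import pysam

def passing_barcodes(bam_path, min_reads=10):
    counts = Counter()
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for aln in bam:
            if aln.has_tag("CB"):
                counts[aln.get_tag("CB")] += 1
    return {bc for bc, n in counts.items() if n >= min_reads}
```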

@luizirber (Member) commented:

Was this fixed with bam2fasta, @olgabot?

@olgabot (Collaborator, Author) commented Jan 3, 2020 via email
