Large discrepancy between number of reads in the same sample but in different lanes causes a sample to get dropped #1357

Closed
adamrtalbot opened this issue Dec 15, 2023 · 3 comments
Labels
bug Something isn't working

Comments

@adamrtalbot
Contributor

adamrtalbot commented Dec 15, 2023

Description of the bug

When analysing a sample that has been sequenced on multiple lanes, a large discrepancy in read counts between the lanes makes the grouping key wait for (number of lanes) * (number of split chunks) files for that sample, a total that may never arrive.

Here's an example:

  • sample1, lane 1: 10 million reads

  • sample1, lane 2: 1 million reads

  • split into 1 million read chunks by FASTP

  • the groupKey here calculates there should be 2 * 10 chunks for a total of 20

  • Only 12 ever appear, so the sample is dropped out of the pipeline.

  • The pipeline still finishes with a green tick because nothing actually 'fails'; the sample is silently missing from the results
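
To make the failure mode concrete, here is a minimal sketch in plain Python (not the pipeline's actual Nextflow code). It mimics a grouping step that only releases a sample once a pre-computed number of chunks has arrived. The helper `group_by_expected_size` and the smaller chunk counts (4 and 1) are hypothetical, chosen only to keep the illustration short.

```python
from collections import defaultdict


def group_by_expected_size(items, expected_size):
    """Release a group only once it holds exactly expected_size[key] items.

    Hypothetical stand-in for a size-aware grouping key: groups that never
    reach their expected size are never emitted.
    """
    pending = defaultdict(list)
    emitted = {}
    for key, value in items:
        pending[key].append(value)
        if len(pending[key]) == expected_size[key]:
            emitted[key] = pending.pop(key)
    return emitted, dict(pending)


# Hypothetical chunk counts after splitting: lane 1 is much larger than lane 2.
chunks_per_lane = {"lane1": 4, "lane2": 1}

# Flawed estimate: number of lanes * chunks seen for one (large) lane.
expected = {"sample1": len(chunks_per_lane) * chunks_per_lane["lane1"]}

# The chunks that actually exist: one item per chunk per lane.
items = [
    ("sample1", f"{lane}_chunk{i}")
    for lane, n in chunks_per_lane.items()
    for i in range(n)
]

emitted, stuck = group_by_expected_size(items, expected)
print("expected:", expected["sample1"])
print("arrived:", len(stuck.get("sample1", [])))
print("emitted samples:", list(emitted))
```

The sample stays in `stuck` and is never emitted, yet no step raises an error, which matches the silent drop described above.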

I've attached a samplesheet using the test data from profile test and test_full which recreates the problem.

Setting splitFastq to size 0 is a workaround for this problem.

Command used and terminal output

nextflow run nf-core/sarek -r v3.4.0 --splitFastq 5000000 --input sarek-uneven-size-error.csv --outdir results/

Relevant files

sarek-uneven-size-error.csv

System information

All

@adamrtalbot adamrtalbot added the bug Something isn't working label Dec 15, 2023
@maxulysse
Member

cf nextflow-io/nextflow#4592

@maxulysse
Member

related to #853

adamrtalbot added a commit that referenced this issue Dec 20, 2023
Changes:
 - The grouping strategy for sharded data has been improved
 - The number of BAM files per sample is now calculated by grouping by sample ID after the FASTQ files have been split, then counting the total number of FASTQ files created.
 - This has to wait for FASTP to produce all FASTQ files, but it is more reliable.
 - After alignment, that FASTQ file count is used as the expected number of BAM files that groupBy waits for.

Fixes #1357
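
The strategy described in this commit can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the code from the commit: the helpers `count_chunks_per_sample` and `group_alignments`, the file names, and the chunk counts are all hypothetical.

```python
from collections import Counter, defaultdict


def count_chunks_per_sample(chunks):
    """Count the FASTQ chunks actually produced for each sample, across all lanes."""
    return Counter(sample for sample, _lane, _chunk in chunks)


def group_alignments(bams, expected):
    """Release a sample once the number of BAMs equals its counted chunk total."""
    pending = defaultdict(list)
    complete = {}
    for sample, bam in bams:
        pending[sample].append(bam)
        if len(pending[sample]) == expected[sample]:
            complete[sample] = pending.pop(sample)
    return complete


# Hypothetical split output: lane 1 yields 4 chunks, lane 2 yields 1.
fastq_chunks = [("sample1", "lane1", i) for i in range(4)] + [("sample1", "lane2", 0)]

# Count after splitting instead of estimating lanes * chunks up front.
expected_bams = count_chunks_per_sample(fastq_chunks)

# One BAM per chunk after alignment (file names are made up).
bams = [(s, f"{s}_{lane}_{chunk}.bam") for s, lane, chunk in fastq_chunks]

grouped = group_alignments(bams, expected_bams)
print(len(grouped["sample1"]), "BAMs grouped for sample1")
```

Because the expected size comes from the chunks that really exist, uneven lanes no longer leave a sample waiting. The trade-off noted in the commit still applies: counting can only finish once every chunk has been produced.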
@FriederikeHanssen
Contributor

Closing this issue since it was fixed in the linked PR
