Large discrepancy between number of reads in the same sample but in different lanes causes a sample to get dropped #1357

adamrtalbot · 2023-12-15T14:02:02Z

Description of the bug

When analysing a sample that has been sequenced on multiple lanes, if there is a large discrepancy in read counts between lanes, the grouping key waits for (n lanes * n split chunks) files for the sample, a total that may never be reached.

Here's an example:

sample1, lane 1: 10 million reads
sample1, lane 2: 1 million reads
Reads are split into 1 million read chunks by FASTP.
The groupKey here calculates that there should be 2 * 10 chunks, for a total of 20.
Only 12 ever appear, so the sample is dropped out of the pipeline.
The pipeline still gets a green tick because nothing 'fails'.
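The mismatch can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the pipeline's actual code: the function names are hypothetical, the per-lane chunk counts (10 and 2) are assumed so that the totals match the 20 expected and 12 observed above, and using the largest lane's chunk count for the estimate is also an assumption.

```python
def expected_chunks(chunks_per_lane):
    """Faulty estimate: number of lanes times the chunk count of one lane.

    Assumption: the largest lane's chunk count is used for every lane.
    """
    n_lanes = len(chunks_per_lane)
    return n_lanes * max(chunks_per_lane.values())


def actual_chunks(chunks_per_lane):
    """Chunks that really get produced: the sum over all lanes."""
    return sum(chunks_per_lane.values())


def group_is_complete(chunks_per_lane):
    """A size-based group is only emitted once the expected count is reached."""
    return actual_chunks(chunks_per_lane) >= expected_chunks(chunks_per_lane)


# Hypothetical per-lane chunk counts for sample1 (illustrative only).
sample1 = {"lane1": 10, "lane2": 2}

print(expected_chunks(sample1))    # estimate the group waits for
print(actual_chunks(sample1))      # chunks that actually arrive
print(group_is_complete(sample1))  # whether the sample would ever be emitted
```

With even lanes the two counts agree and the group completes; any imbalance makes the estimate too high, so the group never fills and the sample silently disappears.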

I've attached a samplesheet that uses the test data from the test and test_full profiles and recreates the problem.

Setting splitFastq to size 0 is a workaround for this problem.

Command used and terminal output

nextflow run nf-core/sarek -r v3.4.0 --splitFastq 5000000 --input sarek-uneven-size-error.csv --outdir results/

Relevant files

sarek-uneven-size-error.csv

System information

All


maxulysse · 2023-12-15T15:37:18Z

cf nextflow-io/nextflow#4592

maxulysse · 2023-12-19T10:31:19Z

related to #853

Changes:
- The grouping strategy for sharded data has been improved.
- The number of BAM files per sample is calculated by grouping the samples by ID after splitting the FASTQ files, then counting the total number of FASTQ files created.
- This has to wait for all FASTQ files to be produced by FASTP, but is more reliable.
- After alignment, the number of FASTQ files determines the expected number of BAM files that groupBy waits for.

Fixes #1357
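The counting approach described in these changes might look roughly like the following Python sketch. It is not the Nextflow code from the linked PR: the function name, the tuple layout, and the file names are all hypothetical, chosen only to show the idea of deriving the expected group size from the FASTQ chunks that were actually produced.

```python
from collections import Counter, defaultdict


def group_bams_by_sample(fastq_chunks, bam_files):
    """Group BAM files per sample, using the real FASTQ chunk count as group size.

    fastq_chunks: list of (sample_id, fastq_name) produced by the split step.
    bam_files:    list of (sample_id, bam_name) produced by alignment.
    Returns only the samples whose BAM count matches their FASTQ chunk count.
    """
    # Count chunks per sample after splitting, instead of estimating the total.
    expected = Counter(sample_id for sample_id, _ in fastq_chunks)

    grouped = defaultdict(list)
    for sample_id, bam in bam_files:
        grouped[sample_id].append(bam)

    return {
        sample_id: sorted(bams)
        for sample_id, bams in grouped.items()
        if len(bams) == expected[sample_id]
    }


# Hypothetical uneven sample: 10 chunks from lane 1, 2 chunks from lane 2.
chunks = [("sample1", f"lane1_{i}.fastq.gz") for i in range(10)]
chunks += [("sample1", f"lane2_{i}.fastq.gz") for i in range(2)]
bams = [(sample_id, name.replace(".fastq.gz", ".bam")) for sample_id, name in chunks]

complete = group_bams_by_sample(chunks, bams)
print(len(complete["sample1"]))  # number of BAM files grouped for sample1
```

The cost is the one the changes mention: the count is only known once every FASTQ chunk for the sample exists, so grouping cannot start until splitting has finished.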

FriederikeHanssen · 2024-01-17T10:36:15Z

Closing this issue since it was fixed in the linked PR

adamrtalbot added the bug (Something isn't working) label Dec 15, 2023

adamrtalbot mentioned this issue Dec 20, 2023

Minimum split_fastq value is 250 #1363

Closed

adamrtalbot mentioned this issue Dec 20, 2023

1357 grouping strategy applied by counting number of FASTQ files generated by FASTP #1364

Merged

10 tasks

FriederikeHanssen closed this as completed Jan 17, 2024
