1357 grouping strategy applied by counting number of FASTQ files generated by FASTP #1364

adamrtalbot · 2023-12-20T17:48:11Z

Split FASTQs are grouped earlier for a more reliable grouping strategy (Large discrepancy between number of reads in the same sample but in different lanes causes a sample to get dropped #1357)
Minimum number of fastqs for split_fastq is now 250 (Minimum split_fastq value is 250 #1363)

PR checklist

Changes:
- The grouping strategy for sharded data has been improved.
- The number of BAM files per sample is calculated by grouping the sample by ID after splitting the FASTQ files, then counting the total number of FASTQ files created.
- This has to wait for all FASTQ files to be produced by FASTP, but is more reliable.
- After alignment, the number of FASTQ files is used to determine the expected number of BAM files that groupBy waits for.

Fixes #1357
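The counting approach described above can be modeled outside Nextflow. The following is a minimal plain-Python sketch of the idea, not the pipeline's code: the function names `count_shards` and `group_bams`, the tuple layout, and the sample data are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def count_shards(fastq_shards):
    """Count how many FASTQ shards were produced for each sample ID.

    `fastq_shards` is an iterable of (sample_id, shard_name) pairs. Every
    shard must be known before counting, which mirrors waiting for the
    splitting step to finish.
    """
    return Counter(sample_id for sample_id, _ in fastq_shards)


def group_bams(bams, expected):
    """Yield (sample_id, [bam, ...]) as soon as a sample's group is complete.

    `bams` is an iterable of (sample_id, bam_name) pairs in arrival order and
    `expected` maps each sample ID to its shard count.
    """
    pending = defaultdict(list)
    for sample_id, bam in bams:
        pending[sample_id].append(bam)
        if len(pending[sample_id]) == expected[sample_id]:
            yield sample_id, pending.pop(sample_id)


# Hypothetical data: sample "s1" has a large lane (three shards) and a small
# lane (one shard); sample "s2" has a single shard.
shards = [("s1", "L1.0001.fq"), ("s1", "L1.0002.fq"), ("s1", "L1.0003.fq"),
          ("s1", "L2.0001.fq"), ("s2", "L1.0001.fq")]
expected = count_shards(shards)

# One BAM per shard, arriving in arbitrary order after alignment.
bams = [("s1", "a.bam"), ("s2", "b.bam"), ("s1", "c.bam"),
        ("s1", "d.bam"), ("s1", "e.bam")]
groups = list(group_bams(bams, expected))
```

Because the expected size comes from the shards that actually exist, uneven lanes within a sample do not affect the result in this model; the cost is that counting cannot start until all shards are available.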

Changes:
- FASTP uses blocks of 250 reads when splitting a FASTQ file.
- This update makes 250 the minimum-sized block to split a FASTQ file into.
- Updates help text accordingly.

Fixes #1363
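As a rough illustration of the resulting rule, here is a small Python sketch of a parameter check. It is not the pipeline's validation code (which lives in the Nextflow schema); the function name `check_split_fastq` and the choice to return `None` when splitting is disabled are assumptions. The zero case reflects the later commit in this PR, "Support for split_fastq to be zero to disable splitting".

```python
MIN_SPLIT_SIZE = 250  # minimum block size stated in this PR


def check_split_fastq(value):
    """Validate a split_fastq value.

    Returns None when splitting is disabled (value 0), returns the value
    unchanged when it is at least MIN_SPLIT_SIZE, and raises ValueError
    otherwise.
    """
    if value == 0:
        return None  # assumption: None stands for "do not split"
    if value < MIN_SPLIT_SIZE:
        raise ValueError(
            f"split_fastq must be 0 or at least {MIN_SPLIT_SIZE}, got {value}"
        )
    return value
```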

github-actions · 2023-12-20T17:50:19Z

`nf-core lint` overall result: Passed ✅ ⚠️

Posted for pipeline commit b815378

✅ 146 tests passed
❔ 10 tests were ignored
❗ 2 tests had warnings

❗ Test warnings:

files_exist - File not found: .github/workflows/awstest.yml
pipeline_todos - TODO string in WorkflowSarek.groovy: Optionally add in-text citation tools to this list.

❔ Tests ignored:

files_exist - File is ignored: .github/workflows/awsfulltest.yml
files_exist - File is ignored: conf/modules.config
files_unchanged - File ignored due to lint config: assets/nf-core-sarek_logo_light.png
files_unchanged - File ignored due to lint config: docs/images/nf-core-sarek_logo_light.png
files_unchanged - File ignored due to lint config: docs/images/nf-core-sarek_logo_dark.png
files_unchanged - File ignored due to lint config: lib/NfcoreTemplate.groovy
files_unchanged - File ignored due to lint config: .gitignore or .prettierignore or pyproject.toml
actions_ci - actions_ci
actions_awstest - 'awstest.yml' workflow not found: /home/runner/work/sarek/sarek/.github/workflows/awstest.yml
template_strings - template_strings

✅ Tests passed:

files_exist - File found: .gitattributes
files_exist - File found: .gitignore
files_exist - File found: .nf-core.yml
files_exist - File found: .editorconfig
files_exist - File found: .prettierignore
files_exist - File found: .prettierrc.yml
files_exist - File found: CHANGELOG.md
files_exist - File found: CITATIONS.md
files_exist - File found: CODE_OF_CONDUCT.md
files_exist - File found: CODE_OF_CONDUCT.md
files_exist - File found: LICENSE or LICENSE.md or LICENCE or LICENCE.md
files_exist - File found: nextflow_schema.json
files_exist - File found: nextflow.config
files_exist - File found: README.md
files_exist - File found: .github/.dockstore.yml
files_exist - File found: .github/CONTRIBUTING.md
files_exist - File found: .github/ISSUE_TEMPLATE/bug_report.yml
files_exist - File found: .github/ISSUE_TEMPLATE/config.yml
files_exist - File found: .github/ISSUE_TEMPLATE/feature_request.yml
files_exist - File found: .github/PULL_REQUEST_TEMPLATE.md
files_exist - File found: .github/workflows/branch.yml
files_exist - File found: .github/workflows/ci.yml
files_exist - File found: .github/workflows/linting_comment.yml
files_exist - File found: .github/workflows/linting.yml
files_exist - File found: assets/email_template.html
files_exist - File found: assets/email_template.txt
files_exist - File found: assets/sendmail_template.txt
files_exist - File found: assets/nf-core-sarek_logo_light.png
files_exist - File found: conf/test.config
files_exist - File found: conf/test_full.config
files_exist - File found: docs/images/nf-core-sarek_logo_light.png
files_exist - File found: docs/images/nf-core-sarek_logo_dark.png
files_exist - File found: docs/output.md
files_exist - File found: docs/README.md
files_exist - File found: docs/README.md
files_exist - File found: docs/usage.md
files_exist - File found: lib/nfcore_external_java_deps.jar
files_exist - File found: lib/NfcoreTemplate.groovy
files_exist - File found: lib/Utils.groovy
files_exist - File found: lib/WorkflowMain.groovy
files_exist - File found: main.nf
files_exist - File found: assets/multiqc_config.yml
files_exist - File found: conf/base.config
files_exist - File found: conf/igenomes.config
files_exist - File found: lib/WorkflowSarek.groovy
files_exist - File found: modules.json
files_exist - File found: pyproject.toml
files_exist - File not found check: Singularity
files_exist - File not found check: parameters.settings.json
files_exist - File not found check: pipeline_template.yml
files_exist - File not found check: .nf-core.yaml
files_exist - File not found check: bin/markdown_to_html.r
files_exist - File not found check: conf/aws.config
files_exist - File not found check: .github/workflows/push_dockerhub.yml
files_exist - File not found check: .github/ISSUE_TEMPLATE/bug_report.md
files_exist - File not found check: .github/ISSUE_TEMPLATE/feature_request.md
files_exist - File not found check: docs/images/nf-core-sarek_logo.png
files_exist - File not found check: .markdownlint.yml
files_exist - File not found check: .yamllint.yml
files_exist - File not found check: lib/Checks.groovy
files_exist - File not found check: lib/Completion.groovy
files_exist - File not found check: lib/Workflow.groovy
files_exist - File not found check: .travis.yml
nextflow_config - Config variable found: manifest.name
nextflow_config - Config variable found: manifest.nextflowVersion
nextflow_config - Config variable found: manifest.description
nextflow_config - Config variable found: manifest.version
nextflow_config - Config variable found: manifest.homePage
nextflow_config - Config variable found: timeline.enabled
nextflow_config - Config variable found: trace.enabled
nextflow_config - Config variable found: report.enabled
nextflow_config - Config variable found: dag.enabled
nextflow_config - Config variable found: process.cpus
nextflow_config - Config variable found: process.memory
nextflow_config - Config variable found: process.time
nextflow_config - Config variable found: params.outdir
nextflow_config - Config variable found: params.input
nextflow_config - Config variable found: params.validationShowHiddenParams
nextflow_config - Config variable found: params.validationSchemaIgnoreParams
nextflow_config - Config variable found: manifest.mainScript
nextflow_config - Config variable found: timeline.file
nextflow_config - Config variable found: trace.file
nextflow_config - Config variable found: report.file
nextflow_config - Config variable found: dag.file
nextflow_config - Config variable (correctly) not found: params.nf_required_version
nextflow_config - Config variable (correctly) not found: params.container
nextflow_config - Config variable (correctly) not found: params.singleEnd
nextflow_config - Config variable (correctly) not found: params.igenomesIgnore
nextflow_config - Config variable (correctly) not found: params.name
nextflow_config - Config variable (correctly) not found: params.enable_conda
nextflow_config - Config timeline.enabled had correct value: true
nextflow_config - Config report.enabled had correct value: true
nextflow_config - Config trace.enabled had correct value: true
nextflow_config - Config dag.enabled had correct value: true
nextflow_config - Config manifest.name began with nf-core/
nextflow_config - Config variable manifest.homePage began with https://github.com/nf-core/
nextflow_config - Config dag.file ended with .html
nextflow_config - Config variable manifest.nextflowVersion started with >= or !>=
nextflow_config - Config manifest.version ends in dev: 3.5dev
nextflow_config - Config params.custom_config_version is set to master
nextflow_config - Config params.custom_config_base is set to https://raw.githubusercontent.com/nf-core/configs/master
nextflow_config - Lines for loading custom profiles found
nextflow_config - nextflow.config contains configuration profile test
files_unchanged - .gitattributes matches the template
files_unchanged - .prettierrc.yml matches the template
files_unchanged - CODE_OF_CONDUCT.md matches the template
files_unchanged - LICENSE matches the template
files_unchanged - .github/.dockstore.yml matches the template
files_unchanged - .github/CONTRIBUTING.md matches the template
files_unchanged - .github/ISSUE_TEMPLATE/bug_report.yml matches the template
files_unchanged - .github/ISSUE_TEMPLATE/config.yml matches the template
files_unchanged - .github/ISSUE_TEMPLATE/feature_request.yml matches the template
files_unchanged - .github/PULL_REQUEST_TEMPLATE.md matches the template
files_unchanged - .github/workflows/branch.yml matches the template
files_unchanged - .github/workflows/linting_comment.yml matches the template
files_unchanged - .github/workflows/linting.yml matches the template
files_unchanged - assets/email_template.html matches the template
files_unchanged - assets/email_template.txt matches the template
files_unchanged - assets/sendmail_template.txt matches the template
files_unchanged - docs/README.md matches the template
files_unchanged - lib/nfcore_external_java_deps.jar matches the template
readme - README Nextflow minimum version badge matched config. Badge: 23.04.0, Config: 23.04.0
readme - README Zenodo placeholder was replaced with DOI.
pipeline_name_conventions - Name adheres to nf-core convention
schema_lint - Schema lint passed
schema_lint - Schema title + description lint passed
schema_lint - Input mimetype lint passed: 'text/csv'
schema_params - Schema matched params returned from nextflow config
system_exit - No System.exit calls found
actions_schema_validation - Workflow validation passed: clean-up.yml
actions_schema_validation - Workflow validation passed: cloudtest.yml
actions_schema_validation - Workflow validation passed: linting_comment.yml
actions_schema_validation - Workflow validation passed: fix-linting.yml
actions_schema_validation - Workflow validation passed: branch.yml
actions_schema_validation - Workflow validation passed: linting.yml
actions_schema_validation - Workflow validation passed: ci.yml
actions_schema_validation - Workflow validation passed: release-announcments.yml
merge_markers - No merge markers found in pipeline files
modules_json - Only installed modules found in modules.json
multiqc_config - 'assets/multiqc_config.yml' contains report_section_order
multiqc_config - 'assets/multiqc_config.yml' contains export_plots
multiqc_config - 'assets/multiqc_config.yml' contains report_comment
multiqc_config - 'assets/multiqc_config.yml' follows the ordering scheme of the minimally required plugins.
multiqc_config - 'assets/multiqc_config.yml' contains a matching 'report_comment'.
multiqc_config - 'assets/multiqc_config.yml' contains 'export_plots: true'.
modules_structure - modules directory structure is correct 'modules/nf-core/TOOL/SUBTOOL'

Run details

nf-core/tools version 2.11.1
Run at 2023-12-21 17:41:30

adamrtalbot · 2023-12-20T17:50:45Z

@FriederikeHanssen @maxulysse I created a test dataset with only 60 reads in lane 2 to recreate this problem. It is attached here; you'll have to modify the path in the input samplesheet.
fastq_single.csv
test_1_slice60.1.fastq.gz
test_1_slice60.2.fastq.gz
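For anyone wanting to build a similar small lane from their own data, a minimal Python sketch of slicing the first reads from a gzipped FASTQ file follows. It is not how the attached files were necessarily produced; the function name `slice_fastq` is hypothetical, and the code assumes the usual four-line FASTQ record layout (no wrapped sequence lines).

```python
import gzip
from itertools import islice

LINES_PER_READ = 4  # header, sequence, '+', quality


def slice_fastq(src, dest, n_reads=60):
    """Copy the first `n_reads` records of a gzipped FASTQ file to `dest`.

    Returns the number of complete records written. If the input holds fewer
    records than requested, everything available is copied.
    """
    with gzip.open(src, "rt") as fin, gzip.open(dest, "wt") as fout:
        lines = list(islice(fin, n_reads * LINES_PER_READ))
        fout.writelines(lines)
    return len(lines) // LINES_PER_READ
```

For paired-end data, run it once per mate with the same `n_reads` so that the two output files stay in sync.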

There are no tests in Sarek right now. What shall we do? It's easy to do but we'd need to add some more data to test-datasets (such as those FASTQ files).

maxulysse

LGTM

workflows/sarek.nf

maxulysse · 2023-12-20T18:11:02Z

Can you update changelog too?

FriederikeHanssen · 2023-12-20T18:15:46Z

Testing might be good, but that data probably can't be added to the modules repo, right?

nextflow_schema.json

adamrtalbot · 2023-12-21T17:59:55Z

> Testing might be good, but that data probably can't be added to the modules repo, right?

I don't see why not. I just sliced 60 reads from the existing data. Alternatively we could generate it on the fly?

Here is the mini workflow to generate a channel with a slice of the reads:

workflow UNEVEN_FASTQ {
    take:
        csv

    main:
        // Read the samplesheet and keep only its first row
        ch_csv = Channel.fromPath(csv, checkIfExists: true)
            .splitCsv(header: true)
            .map { row ->
                [
                    [
                        patient: row.patient,
                        sex:     row.sex,
                        status:  row.status,
                        sample:  row.sample,
                        lane:    "small_lane"
                    ],
                    file(row.fastq_1),
                    file(row.fastq_2)
                ]
            }
            .first()
        // Split the read pairs into chunks of 60 reads, keep the first chunk,
        // then emit it alongside the original, unsplit row
        ch_csv
            .splitFastq(by: 60, file: true, pe: true)
            .map { meta, read1, read2 -> [ meta, [ read1, read2 ] ] }
            .first()
            .mix(ch_csv)
            .set { fastq }

    emit:
        fastq
}

FriederikeHanssen

🚀 😍

FriederikeHanssen · 2023-12-21T18:22:05Z

workflows/sarek.nf

+            // Group
+            .groupTuple()
+
+        bai_mapped = FASTQ_ALIGN_BWAMEM_MEM2_DRAGMAP_SENTIEON.out.bai


The BAI is only produced/tested with Sentieon, but since it is the same, it should work.

adamrtalbot added 2 commits December 20, 2023 17:41

Minimum number of fastqs for split_fastq is now 250

991501d

adamrtalbot requested review from FriederikeHanssen and maxulysse as code owners December 20, 2023 17:48

adamrtalbot changed the title ~~1357 grouping strategy fails with large mismatch in sizes~~ 1357 grouping strategy applied by counting number of FASTQ files generated by FASTP Dec 20, 2023

maxulysse approved these changes Dec 20, 2023

workflows/sarek.nf (outdated, resolved)

FriederikeHanssen requested changes Dec 20, 2023

nextflow_schema.json (outdated, resolved)

adamrtalbot added 4 commits December 21, 2023 09:47

Support for split_fastq to be zero to disable splitting

d0b8cdc

CHANGELOG

162397d

remove .view() statement

6b42de6

Add 'type' parameter to cheat nf-core linter

b815378

FriederikeHanssen approved these changes Dec 21, 2023

adamrtalbot merged commit 048f06e into dev Dec 21, 2023
23 checks passed

adamrtalbot deleted the 1357_grouping_strategy_fails_with_large_mismatch_in_sizes branch December 21, 2023 18:38
