rnaseq_preprocess is a Nextflow pipeline for RNA-seq quantification with salmon. The processing steps are quality control with fastqc, quantification with salmon, aggregation to gene level with tximport, and a small summary report with MultiQC. Multiple fastq files per sample are supported; these technical replicates are merged prior to quantification. Reads can optionally be trimmed to a fixed length. The pipeline is containerized via Docker and Singularity. Outputs, including command lines and software versions, are written to rnaseq_preprocess_results/. The expected Nextflow version is 21.10.6.
Run the test profile to see which outputs are produced. Downloading the Docker image may take a minute or two:
NXF_VER=21.10.6 nextflow run atpoint/rnaseq_preprocess -r main -profile docker,test_with_existing_idx,test_resources
See the misc folder, which contains the software versions used in the pipeline and the exact command lines. When you run the pipeline yourself, this information is written to the pipeline_info folder of the output directory.
Indexing
The pipeline does not cover the indexing step because several kinds of salmon index exist, for example transcriptome-only without genome decoys, partial genome decoys, and full genome decoys.
Please produce an index up front and then pass its output folder to the --idx option.
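As a rough illustration of the full genome decoy variant, the inputs for salmon index can be prepared as sketched below. This follows the general approach described in the salmon documentation; the file names (transcripts.fa, genome.fa, decoys.txt, gentrome.fa, salmon_idx) and the toy sequences are assumptions made for this example, not something the pipeline prescribes.

```shell
# Toy inputs so the sketch runs anywhere; use real transcriptome and genome FASTA files in practice.
printf '>tx1 gene=g1\nACGTACGT\n' > transcripts.fa
printf '>chr1 toy\nACGTACGTACGT\n>chr2 toy\nTTTTACGT\n' > genome.fa

# Decoy list: genome sequence names without the leading ">" and without the description.
grep '^>' genome.fa | cut -d ' ' -f 1 | sed 's/^>//' > decoys.txt

# Combined reference: transcripts first, followed by the genome sequences acting as decoys.
cat transcripts.fa genome.fa > gentrome.fa

# Command that would build the index from these files (printed here, not executed).
echo "salmon index -t gentrome.fa -d decoys.txt -i salmon_idx"
```

The resulting index folder (salmon_idx in this sketch) is what would be passed to --idx.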
The pipeline has a hardcoded 8GB memory limit for the quantification step, which should be sufficient for transcriptome-only and partial genome decoy indices.
For a full genome decoy index, please change the withLabel:process_quant memory definition in nextflow.config to something like 20GB, depending on the organism.
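If editing nextflow.config directly is inconvenient, one possible alternative is a small override file passed with Nextflow's -c option. The sketch below assumes the process_quant label mentioned above; the file name quant_memory.config is hypothetical, and whether the override behaves as intended with this pipeline has not been verified here.

```shell
# Write a small override config (the file name is an arbitrary choice for this example).
cat > quant_memory.config <<'EOF'
process {
    withLabel: process_quant {
        memory = '20 GB'
    }
}
EOF

# Settings from a file given with -c take precedence over the pipeline's own nextflow.config.
# The command is printed rather than executed; append the usual pipeline options.
echo "NXF_VER=21.10.6 nextflow run atpoint/rnaseq_preprocess -r main -c quant_memory.config"
```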
Quantification/tximport
The pipeline runs via a samplesheet, which is a CSV file with the columns sample,r1,r2,libtype. The first column is the name of the sample, followed by the paths to the R1 and R2 files and the salmon libtype. If R2 is left blank, single-end mode is triggered for that sample. Multiple fastq files (lane/technical replicates) are supported. These must share the same value in the sample column and will then be merged prior to quantification. Optionally, a seqtk module can trim reads to a fixed read length. This is triggered by --trim_reads and controlled by --trim_length, with a default of 75bp.
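A minimal samplesheet along these lines might look like the sketch below. The sample names and file paths are made up, the header row is assumed from the column description above, and the libtype A (salmon's automatic library type detection) is used purely as an example value.

```shell
# Hypothetical samplesheet: sampleA has two lanes (merged before quantification),
# sampleB has an empty r2 field and is therefore treated as single-end.
cat > samplesheet.csv <<'EOF'
sample,r1,r2,libtype
sampleA,/data/sampleA_L001_R1.fastq.gz,/data/sampleA_L001_R2.fastq.gz,A
sampleA,/data/sampleA_L002_R1.fastq.gz,/data/sampleA_L002_R2.fastq.gz,A
sampleB,/data/sampleB_R1.fastq.gz,,A
EOF

# Quick sanity check: number of fastq rows per sample and the resulting mode.
awk -F',' 'NR > 1 { n[$1]++; mode[$1] = ($3 == "" ? "single-end" : "paired-end") }
           END { for (s in n) print s, n[s], mode[s] }' samplesheet.csv | sort
```

Checking the sheet this way before launching makes it easy to spot a sample whose lanes were given slightly different names and would therefore not be merged.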
The quantification then runs with the salmon options --gcBias --seqBias --posBias (--gcBias is omitted for single-end samples).
Transcript abundance estimates from salmon are then summarized to the gene level using tximport with its lengthScaledTPM option. This means the returned gene-level counts are already corrected for average transcript length and can go into any downstream DEG analysis, for example with limma. Both a matrix of counts and a matrix of effective gene lengths are returned.
Other options:
--idx: path to the salmon index folder
--tx2gene: path to the tx2gene map matching transcripts to genes
--samplesheet: path to the input samplesheet
--trim_reads: logical, whether to trim reads to a fixed length
--trim_length: numeric, length for trimming
--quant_additional: additional options to salmon quant beyond --gcBias --seqBias --posBias
We hardcoded 8GB RAM and 6 CPUs for the quantification. On our HPC we use:
NXF_VER=21.10.6 nextflow run atpoint/rnaseq_preprocess -r main -profile singularity,slurm \
--idx path/to/idx --tx2gene path/to/tx2gene.txt --samplesheet path/to/samplesheet.csv \
-with-report quant_report.html -with-trace quant_report.trace -bg > quant_report.log
Other options
--merge_keep: logical, whether to keep the merged fastq files
--merge_dir: folder inside the output directory to store the merged fastq files
--trim_keep: logical, whether to keep the trimmed fastq files
--trim_dir: folder inside the output directory to store the trimmed fastq files
--skip_fastqc: logical, whether to skip fastqc
--only_fastqc: logical, whether to only run fastqc and skip quantification
--skip_multiqc: logical, whether to skip multiqc
--skip_tximport: logical, whether to skip the tximport process downstream of the quantification
--fastqc_dir: folder inside the output directory to store the fastqc results
--multiqc_dir: folder inside the output directory to store the multiqc results
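To show how several of these options combine, the sketch below assembles a run that trims reads to 50bp, keeps the trimmed files, and skips MultiQC. It relies on standard Nextflow behaviour where a bare --flag sets a logical parameter to true; the paths are placeholders, and the docker profile and the value 50 are chosen only for illustration.

```shell
# Placeholder paths; replace with real ones.
inputs="--idx path/to/idx --tx2gene path/to/tx2gene.txt --samplesheet path/to/samplesheet.csv"

# Trim reads to 50 bp, keep the trimmed fastq files, and skip MultiQC.
extras="--trim_reads --trim_length 50 --trim_keep --skip_multiqc"

cmd="NXF_VER=21.10.6 nextflow run atpoint/rnaseq_preprocess -r main -profile docker $inputs $extras"

# Print the command for review; paste it into the shell to launch the run.
echo "$cmd"
```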