strict read quality and end trimming leads to mispaired files #96

Open
yannickwurm opened this issue Sep 29, 2021 · 2 comments

yannickwurm commented Sep 29, 2021

  • In the first practical, running cutadapt with relatively stringent parameters (e.g. a quality-trim cutoff of 20-25) changes the FASTQ files.
  • Those changes (perhaps many nucleotides being converted to N, i.e. masking?) lead kmc2 to drop some reads.
  • Thus the FASTQ files that kmc outputs are out of sync.
  • This means that the subsequent cutadapt step does not run properly, because cutadapt expects correctly paired reads as input.
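
One way to confirm that two FASTQ files have fallen out of sync is to compare read names record by record. Below is a minimal Python sketch, assuming plain 4-line FASTQ records and assuming that mates share a name up to an optional `/1` or `/2` suffix. The function names are hypothetical and are not part of cutadapt or kmc.

```python
from itertools import zip_longest


def read_names(lines):
    """Yield the read name of each 4-line FASTQ record (assumed layout)."""
    non_blank = (line for line in lines if line.strip())
    for index, line in enumerate(non_blank):
        if index % 4 == 0:
            # Header looks like "@name optional-comment": drop '@' and the comment.
            name = line[1:].split()[0]
            # Assumption: mates differ only by a trailing /1 or /2.
            if name.endswith(("/1", "/2")):
                name = name[:-2]
            yield name


def first_mismatch(lines_r1, lines_r2):
    """Return (record_index, name_r1, name_r2) for the first unpaired record, or None."""
    pairs = zip_longest(read_names(lines_r1), read_names(lines_r2))
    for index, (name_r1, name_r2) in enumerate(pairs):
        if name_r1 != name_r2:
            return index, name_r1, name_r2
    return None
```

With real files this could be called as `first_mismatch(open(path_r1), open(path_r2))`, where both paths are placeholders. A `None` result means every record has a mate at the same position.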

We need to do one of the following:

  • find a way for kmc to not drop reads (is there an option in a newer version of KMC that enables masking, through N or lowercase, instead?)
  • or add a step where we manually drop orphan reads to ensure the files are in sync prior to running paired cutadapt
  • or use something other than kmc (probably not)
  • or maybe a newer version of cutadapt has a "check my reads are paired and skip the orphaned ones" option

and/or replace with a different process that is less susceptible to extreme cleaning

This issue doesn't seem to appear if they use lenient quality cutoffs
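
The "manually drop orphan reads" option could look roughly like the Python sketch below. It is an illustration under stated assumptions, not an existing cutadapt or kmc feature: records are assumed to be plain 4-line FASTQ, mates are assumed to share a name up to an optional `/1` or `/2` suffix, and both files are assumed small enough to hold in memory. The function names are hypothetical.

```python
def fastq_records(lines):
    """Yield (read_name, record_lines) for each 4-line FASTQ record."""
    kept = [line for line in lines if line.strip()]
    for start in range(0, len(kept) - 3, 4):
        record = kept[start:start + 4]
        # Header looks like "@name optional-comment": drop '@' and the comment.
        name = record[0][1:].split()[0]
        # Assumption: mates differ only by a trailing /1 or /2.
        if name.endswith(("/1", "/2")):
            name = name[:-2]
        yield name, record


def drop_orphans(lines_r1, lines_r2):
    """Keep only reads present in both inputs, in the order of the first input."""
    records_r2 = dict(fastq_records(lines_r2))
    paired_r1, paired_r2 = [], []
    for name, record in fastq_records(lines_r1):
        mate = records_r2.get(name)
        if mate is not None:
            paired_r1.extend(record)
            paired_r2.extend(mate)
    return paired_r1, paired_r2
```

The two returned lists can be written out with `writelines` and passed to paired cutadapt. For full-size read sets, a streaming approach or a dedicated re-pairing tool would be a better fit than this in-memory version.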

@yannickwurm

I encourage the assistants to resolve this one this year

piplus2 commented Jul 26, 2022

I have added a note in the doc:

Note:
If you trim too much of your sequence (i.e. use values that are too large for --cut and --quality-cutoff), you increase the likelihood of eliminating important information. Additionally, if the trimming is too aggressive, some sequences may be discarded completely, which will cause problems in the subsequent steps of the pre-processing.
For this example, we suggest keeping --cut below 5 and --quality-cutoff below 10.

I've also corrected the text: cutadapt fails in those cases; it doesn't drop the unpaired reads.
