Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Genes are predicted across N stretches even when providing the -m flag #80

Open
fpusan opened this issue Jul 9, 2020 · 3 comments
Open

Comments

@fpusan
Copy link

fpusan commented Jul 9, 2020

Hi guys, hope the week is going well!

As per the title, prodigal v2.6.3 is predicting genes across N stretches even when supplied with the -m flag. According to the documentation -m should make prodigal treat Ns as masked segments of the sequence.

A reproducible example would be:

echo -e ">test\nCCAGACCAAGCGCCGCCAGCATCAGGCTGGACGGCTCCGGGATGACCTGGATGTCCATCGCGAATTGGCCCGTCGTCCCGTCAAGGAAGAAGAAACCCGAGATGTCGCCGTTGTAGAAATCAACCGGCTGGTTGGTATACTCCTGGCCAATGATATTGCCAAGATCGTTCGTACCCGTGAGGGAGATAAACCCGGTCGAGACCTCGACGTCCGTGTCCGTATTTAAGAGGTCAAACCCGGGATCGATGTTGGCATCGTTGTCGAACGAGACCAGAATCGGCGCTGCGGACGCGCTGATCGCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN" | prodigal -p meta -m -f gff`

and its output is:

-------------------------------------
PRODIGAL v2.6.3 [February, 2016]         
Univ of Tenn / Oak Ridge National Lab
Doug Hyatt, Loren Hauser, et al.     
-------------------------------------
Request:  Metagenomic, Phase:  Training
Initializing training files...done!
-------------------------------------
Request:  Metagenomic, Phase:  Gene Finding
Finding genes in sequence #1 (371 bp)...done!
##gff-version  3
# Sequence Data: seqnum=1;seqlen=371;seqhdr="test"
# Model Data: version=Prodigal.v2.6.3;run_type=Metagenomic;model="41|Shigella_dysenteriae_Sd197|B|51.2|11|1";gc_cont=51.20;transl_table=11;uses_sd=1
test	Prodigal_v2.6.3	CDS	2	370	37.3	-	0	ID=1_1;partial=11;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.656;conf=99.98;score=37.28;cscore=35.68;sscore=1.61;rscore=0.00;uscore=0.00;tscore=1.61;

Note the predicted CDS going from 2 to 370, even if Ns start at position 301 of the input sequence.

Is this the expected behaviour?

Best regards,

Fernando

@GabeAl
Copy link

GabeAl commented Sep 17, 2020

Same issue, and this was also reported a lot here and various places on the web.

Since I don't think there will ever be a v3.0 (a coming v3.0 is usually cited as the reason this rather severe bug never gets fixed), perhaps one of us should make a pull request to fix this. 👍

If I have time maybe I'll try diving in...

@GabeAl
Copy link

GabeAl commented Sep 17, 2020

See this other comment as well (shouldn't have been closed IMO): #30

So I'm doing some digging and at first glance I see -m triggers the global "do_mask" which just tallies the integer for "number masked" (nm) via the start and end positions.

I guess it all gets lumped into one variable controlling a mask penalty. If it's short enough, it won't actually do what the behavior promised on the main page. Something reminds me there was a threshold of 50 (like, there have to be 50 masked or such before it actually stops looking).

But all this is just coming together for me, not sure yet. If it's a simple penalty threshold I'll just bump the increment (i.e. from +1 to +5 or some such until it finally stops looking across N's).

@GabeAl
Copy link

GabeAl commented Sep 18, 2020

Okay so after printing some debug statements it looks like it never populates the mask variable at all. Literally nothing gets added at any point to the mask variable despite the code starting to make the mask (it never completes a mask).

This is indeed because MASK_SIZE is set to 50 (which is a bit too high IMO), which is the threshold for adding a mask. I've changed it to 16.

In highly fragmented but linearized genomes this may bump the number of masks to a very high value (>5000 easily if there were as many contigs). So I've bumped this up as well. Unfortunately the algorithm for mask detection is brute force search (ideally something smarter like a hash or binary search would be used here) so if you get up to a high number of masks, you're out of luck.

Simplest was to qsort the ranges using the start-of-range, then fuzzy binary search to find nearest start range of query (<20 checks for a million ranges!), and continue checking forward linearly until the start goes out of range (only 2-3 checks typically). When you have thousands of contig-join-sites in very large genomes, this can speed things up quite a bit. Also, replacing the multiple calls to this function throughout multiple if-else statements to just one is trivial and reduces the # of calls to the function.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants