Genes are predicted across N stretches even when providing the -m flag #80
Comments
Same issue here, and it has also been reported many times, both here and in various places on the web. Since I don't think there will ever be a v3.0 (a coming v3.0 is usually cited as the reason this rather severe bug never gets fixed), perhaps one of us should make a pull request to fix this. 👍 If I have time, maybe I'll try diving in...
See this other comment as well (shouldn't have been closed IMO): #30

So I'm doing some digging, and at first glance I see that -m triggers the global "do_mask", which just tallies an integer for "number masked" (nm) via the start and end positions. I guess it all gets lumped into one variable controlling a mask penalty. If the N stretch is short enough, it won't actually produce the behavior promised on the main page. Something reminds me there was a threshold of 50 (as in, there have to be 50 masked positions or so before it actually stops looking). But all this is just coming together for me, so I'm not sure yet. If it's a simple penalty threshold, I'll just bump the increment (e.g. from +1 to +5 or some such) until it finally stops looking across N's.
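To make the threshold idea concrete, here is a minimal sketch of how a length threshold on N runs could cause short stretches to go unmasked. This is not Prodigal's actual code: `mask_range`, `build_masks`, and the `min_run` parameter are hypothetical names made up for illustration, and the sketch only assumes that a run of Ns is recorded as a mask when it is at least `min_run` bases long.

```c
#include <stddef.h>

/* Hypothetical type: an inclusive [begin, end] range of masked positions
   (0-based). Not taken from Prodigal's source. */
typedef struct { int begin; int end; } mask_range;

/* Scan seq and record every run of 'N'/'n' that is at least min_run long.
   Shorter runs are silently skipped, which is the behaviour suspected above.
   Returns the number of ranges written to out (at most max_masks). */
size_t build_masks(const char *seq, size_t len, int min_run,
                   mask_range *out, size_t max_masks)
{
    size_t count = 0;
    long run_start = -1;                 /* -1 means "not inside a run" */

    /* Iterate one position past the end so a run touching the end is closed. */
    for (size_t i = 0; i <= len; i++) {
        int is_n = (i < len) && (seq[i] == 'N' || seq[i] == 'n');
        if (is_n) {
            if (run_start < 0) run_start = (long)i;
        } else if (run_start >= 0) {
            /* The run covered run_start .. i-1; keep it only if long enough. */
            if ((long)i - run_start >= min_run && count < max_masks) {
                out[count].begin = (int)run_start;
                out[count].end = (int)i - 1;
                count++;
            }
            run_start = -1;
        }
    }
    return count;
}
```

With a threshold of 50, a 20-base run of Ns produces no mask at all in this sketch, so a gene caller consulting the mask list would see nothing to avoid there; lowering the threshold makes the same run show up.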
Okay, so after printing some debug statements, it looks like it never populates the mask variable at all. Literally nothing gets added to the mask variable at any point, even though the code starts to make a mask (it never completes one). This is indeed because MASK_SIZE is set to 50 (which is a bit too high IMO), which is the threshold for adding a mask. I've changed it to 16.

In highly fragmented but linearized genomes, this may bump the number of masks to a very high value (easily >5000 if there are that many contigs), so I've raised the limit on the number of masks as well. Unfortunately the algorithm for mask detection is a brute-force search (ideally something smarter like a hash or binary search would be used here), so if you get up to a high number of masks, you're out of luck.

The simplest fix was to qsort the ranges by start-of-range, then do a fuzzy binary search to find the range start nearest the query (<20 checks for a million ranges!), and continue checking forward linearly until the start goes out of range (typically only 2-3 checks). When you have thousands of contig-join sites in very large genomes, this can speed things up quite a bit. Also, replacing the multiple calls to this function spread across several if-else statements with a single call is trivial and reduces the number of calls to the function.
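The sort-then-binary-search lookup described above could look roughly like the following. This is a sketch under stated assumptions, not the actual patch: `mask_range`, `sort_masks`, and `overlaps_mask` are hypothetical names, ranges are assumed to be inclusive, and the masks are assumed not to overlap each other (true for distinct runs of Ns). Under that last assumption the forward scan collapses to a single check of the next mask.

```c
#include <stdlib.h>
#include <stddef.h>

/* Hypothetical type: an inclusive [begin, end] range of masked positions. */
typedef struct { int begin; int end; } mask_range;

/* qsort comparator: order masks by their start position. */
static int cmp_mask_begin(const void *a, const void *b)
{
    const mask_range *x = a;
    const mask_range *y = b;
    return (x->begin > y->begin) - (x->begin < y->begin);
}

/* Sort once, after all masks have been collected. */
void sort_masks(mask_range *masks, size_t n)
{
    qsort(masks, n, sizeof *masks, cmp_mask_begin);
}

/* Return 1 if the inclusive query range [q_begin, q_end] overlaps any mask.
   Requires masks to be sorted by begin and mutually non-overlapping. */
int overlaps_mask(const mask_range *masks, size_t n, int q_begin, int q_end)
{
    size_t lo = 0, hi = n;

    /* Binary search for the first mask whose begin is > q_begin. */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (masks[mid].begin <= q_begin)
            lo = mid + 1;
        else
            hi = mid;
    }

    /* The mask just before lo starts at or before q_begin; it overlaps
       if it extends to q_begin or beyond. */
    if (lo > 0 && masks[lo - 1].end >= q_begin)
        return 1;

    /* The mask at lo starts after q_begin; it overlaps if it starts
       no later than q_end. Later masks start even further right. */
    if (lo < n && masks[lo].begin <= q_end)
        return 1;

    return 0;
}
```

Each lookup costs O(log n) comparisons instead of a scan over every mask, which matters when a linearized assembly has thousands of contig joins. For the case reported in this issue, a query for positions 2 to 370 against a mask starting at 301 would report an overlap.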
Hi guys, hope the week is going well!
As per the title, prodigal v2.6.3 is predicting genes across N stretches even when supplied with the -m flag. According to the documentation, -m should make prodigal treat Ns as masked segments of the sequence.

A reproducible example would be:
and its output is:
Note the predicted CDS going from 2 to 370, even though the Ns start at position 301 of the input sequence.
Is this the expected behaviour?
Best regards,
Fernando