wrong orf of interest #76

Open
claudelelemaystdenis opened this issue Apr 23, 2020 · 2 comments


@claudelelemaystdenis

Hi,
I am interested in a particular gene in my genome sequences, but unfortunately Prodigal doesn't predict it. Instead, it predicts a shorter gene that does not code for the right protein. When I look at the candidate genes Prodigal considers, my gene is present (the 10099 to 10335 candidate), but the program chooses another one (the 10185 to 10385 candidate):
Beg    End    Std  Total   CodPot  StrtSc  Codon  RBSMotif     Spacer   RBSScr  UpsScr  TypeScr  GCCont
10099  10188  -    -36.36    0.79  -37.15  TTG    None         None      -4.77   -0.93   -31.45  0.544
10099  10248  -    -15.28   -4.36  -10.93  GTG    GGAG/GAGG    5-10bp     3.13  -10.05    -3.50  0.567
10099  10260  -     -9.23   -4.61   -4.62  ATG    None         None      -2.61   -4.28     2.77  0.562
10099  10335  -     -9.63  -14.81    5.18  ATG    GGA/GAG/AGG  5-10bp    -1.73    3.41     4.00  0.549

10185  10289  -    -48.46  -17.73  -30.74  TTG    None         None      -4.07    0.66   -26.82  0.571
10185  10301  -    -45.41  -13.93  -31.48  TTG    None         None      -3.64   -3.34   -24.00  0.573
10185  10319  -    -12.03   -7.63   -4.40  GTG    GGA/GAG/AGG  5-10bp    -3.01    3.01    -3.90  0.548
10185  10382  -    -18.21   10.49  -28.70  TTG    None         None      -2.13  -12.54   -14.03  0.525
10185  10385  -      3.51   11.90   -8.39  GTG    None         None      -2.10   -3.69    -2.60  0.532

How can I make sure my gene gets predicted?

Gene of interest:

ATGGACCAAGGCAGAAGTGAAGTCAGTAATCCAGTTGCTGGCCAGTTTGCGTTCCCTTCAAACGCCGCGTTCGGAATGGGAGATCGCGTGCGCAAGAAATCTGGCGCCGCTTGGCAAGGCCAGATTGTCGGGTGGTACTGCACAAAATTGACCCCTGAAGGGTACGCTGTCGAGTCTGAGGCTCACCCTGGCTCGGTACAGATTTATCCTGTTGCGGCACTGGAACGCATCAACTGA

Predicted gene:

gi|xxxxxxxxxx|ref|NZ_xxxxxxxxxxxxxxxx.x|_12 # 10185 # 10385 # -1 # ID=1_12;partial=00;start_type=GTG;rbs_motif=None;rbs_spacer=None;gc_cont=0.532
GTGTTGTCGGGCTACGCAGCAACCCTAGAAATTCAAAAGAAGGGTCATAAATGGACCAAGGCAGAAGTGAAGTCAGTAATCCAGTTGCTGGCCAGTTTGCGTTCCCTTCAAACGCCGCGTTCGGAATGGGAGATCGCGTGCGCAAGAAATCTGGCGCCGCTTGGCAAGGCCAGATTGTCGGGTGGTACTGCACAAAATTGA
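
The relationship between the two sequences above can be checked directly. The following is a minimal sketch in Python that uses only the two sequences as posted; the function name `find_frame_offset` and the `seed_len` parameter are illustrative and not part of Prodigal. It looks for the first bases of the gene of interest inside the predicted gene and reports the offset and whether the two share a reading frame.

```python
# Sequences copied from the issue text.
GENE_OF_INTEREST = (
    "ATGGACCAAGGCAGAAGTGAAGTCAGTAATCCAGTTGCTGGCCAGTTTGCGTTCCCTTCAAACGCCGCGTTCGG"
    "AATGGGAGATCGCGTGCGCAAGAAATCTGGCGCCGCTTGGCAAGGCCAGATTGTCGGGTGGTACTGCACAAAAT"
    "TGACCCCTGAAGGGTACGCTGTCGAGTCTGAGGCTCACCCTGGCTCGGTACAGATTTATCCTGTTGCGGCACTG"
    "GAACGCATCAACTGA"
)
PREDICTED_GENE = (
    "GTGTTGTCGGGCTACGCAGCAACCCTAGAAATTCAAAAGAAGGGTCATAAATGGACCAAGGCAGAAGTGAAGTC"
    "AGTAATCCAGTTGCTGGCCAGTTTGCGTTCCCTTCAAACGCCGCGTTCGGAATGGGAGATCGCGTGCGCAAGAA"
    "ATCTGGCGCCGCTTGGCAAGGCCAGATTGTCGGGTGGTACTGCACAAAATTGA"
)


def find_frame_offset(predicted, gene, seed_len=30):
    """Locate the first `seed_len` bases of `gene` inside `predicted`.

    Returns (offset, offset % 3), or None if the seed is not found.
    A non-zero remainder means the two genes are read in different frames.
    """
    offset = predicted.find(gene[:seed_len])
    if offset == -1:
        return None
    return offset, offset % 3


result = find_frame_offset(PREDICTED_GENE, GENE_OF_INTEREST)
if result is None:
    print("gene of interest does not start inside the predicted gene")
else:
    offset, frame_shift = result
    same_frame = "same" if frame_shift == 0 else "different"
    print(f"gene of interest starts {offset} bases into the predicted gene "
          f"({same_frame} reading frame)")
```

A non-zero remainder would explain why the predicted gene does not code for the expected protein even though most of the nucleotides overlap.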

@hyattpd
Owner

hyattpd commented Apr 23, 2020

Unfortunately, machine learning algorithms are never going to be perfect. The only way to guarantee a known gene gets found is through a database search.

Prodigal collects a variety of signals for each gene candidate. In your case, the wrong gene has a better coding score but a bad start site (GTG with no RBS), while the real gene has a terrible coding score but a much better start site (ATG with a 3-base RBS). So the short answer is that Prodigal somehow has to get better at recognizing this sequence as coding. The fact that its coding score is low means it uses unusual codons relative to the rest of the organism.
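
The trade-off described here can be read off the candidate list in the original post. The sketch below assumes that the fourth, fifth, and sixth columns of that list are the total, coding, and start scores; that reading is an assumption based on the numbers adding up, and the variable names are illustrative.

```python
# (label, coding score, start score, total score), copied from the posted candidate list.
# The column interpretation is an assumption, not something stated in the thread.
candidates = [
    ("gene of interest (10099-10335, ATG, 3-base RBS)", -14.81, 5.18, -9.63),
    ("predicted gene   (10185-10385, GTG, no RBS)", 11.90, -8.39, 3.51),
]

# Difference between coding + start and the reported total, per candidate.
gaps = []
for label, coding, start, total in candidates:
    gap = abs((coding + start) - total)
    gaps.append(gap)
    print(f"{label}: coding {coding:+.2f} + start {start:+.2f} "
          f"= {coding + start:+.2f} (reported total {total:+.2f})")
```

Under that assumption, the gene of interest wins on start score by a wide margin but loses more on coding score, so its total ends up lower.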

One thing I've thought about is an option to search candidates against a database when there are candidates in more than one reading frame (missing the start site is less of a problem than calling a gene in the wrong frame), but only if they are the best gene in their region along at least one axis (i.e. best coding score or best start score).
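
The "best along at least one axis" filter might look something like the following. This is a hypothetical sketch, not an existing Prodigal option: it treats the nine candidates from the original post as a single region, assumes the fifth and sixth columns are the coding and start scores, and leaves out the database search itself. The function name `best_on_any_axis` is made up for illustration.

```python
# (begin, end, coding score, start score), copied from the posted candidate list.
# Treating all nine rows as one "region" is an assumption for this example.
region = [
    (10099, 10188, 0.79, -37.15),
    (10099, 10248, -4.36, -10.93),
    (10099, 10260, -4.61, -4.62),
    (10099, 10335, -14.81, 5.18),
    (10185, 10289, -17.73, -30.74),
    (10185, 10301, -13.93, -31.48),
    (10185, 10319, -7.63, -4.40),
    (10185, 10382, 10.49, -28.70),
    (10185, 10385, 11.90, -8.39),
]


def best_on_any_axis(candidates):
    """Keep candidates that have the top coding score or the top start score."""
    best_coding = max(c[2] for c in candidates)
    best_start = max(c[3] for c in candidates)
    return [c for c in candidates if c[2] == best_coding or c[3] == best_start]


shortlist = best_on_any_axis(region)
for begin, end, coding, start in shortlist:
    print(f"{begin}-{end}: coding {coding:+.2f}, start {start:+.2f}")
```

With these numbers, the shortlist contains both the predicted gene (best coding score) and the gene of interest (best start score), so both would be passed on to a database search under this scheme.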

@claudelelemaystdenis
Author

Thanks for your quick answer! Your remark on unusual codons is actually really insightful :)
This database option is not part of the current Prodigal, right?
In short, should I give up on Prodigal as a tool for predicting this gene?
