Up to 2x faster prodigal #95

Open: jaebeom-kim wants to merge 3 commits into base: GoogleImport

jaebeom-kim commented Aug 23, 2022

Hi, I'm Jaebeom Kim, and I'm developing a metagenomic classifier, Metabuli.
I use prodigal to predict genes in thousands of prokaryotic genomes,
so I have been working on making prodigal faster.

Here are some modifications.
Test environment: MacBook Pro, 1.4 GHz quad-core Intel Core i5, 16GB 2133 MHz LPDDR3

  1. Rapid mode.
    In the function 'dprog', I found a suspicious part.
    I think lines 52 and 53 are making the program slow. When i==999 and MAX_NODE_DIST==500, 'min' becomes 0 at line 54, which leads to 'score_connection' being called 999 times. I'm not sure whether this is intended, so I added a 'Rapid mode' that skips those lines.
    When I tested with an E. coli genome (GCF_000008865.2), Rapid mode decreased the running time (training + prediction) from ~9.9 sec to ~6.5 sec while producing the same results.

  2. MAX_NODE_DIST
    While reviewing the source code, I found that reducing MAX_NODE_DIST in dprog.h can decrease the running time. So I decreased it from 500 to 300 and tested it with the E. coli genome.
    The running time (training + prediction) decreased from ~9.9 sec to ~7.2 sec while producing the same result. Still, a smaller value may lower prediction quality in other cases, so I made it an option for users who want faster results.

  • When I tested a value of 100, the running time was about 20 sec and the result was different. So this can be a tricky option to deal with, but I think it will be useful for those who need it.

When rapid mode is used with MAX_NODE_DIST 300,
the same predictions were produced in ~5.5 seconds, which is about a 1.8x speedup over the ~9.9 sec baseline.
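
To make the effect of the two changes easier to see, here is a small toy model of the look-back window. It is only a sketch based on the description above, not the actual 'dprog' code: the function names `window_start` and `connections_for_node` are hypothetical, and the second adjustment of 'min' is an assumption about what lines 52 and 53 do. It also ignores any other logic the real function may apply to 'min'.

```c
/* Toy model of the look-back window. Hypothetical names; this is
   an illustration of the behavior described in this PR, not the
   real dprog implementation. */

/* First index j that node i would be scored against.
   rapid == 0 models the default path, rapid != 0 models Rapid mode. */
int window_start(int i, int max_node_dist, int rapid) {
    /* Initial window: look back at most max_node_dist nodes. */
    int min = (i < max_node_dist) ? 0 : i - max_node_dist;
    if (!rapid) {
        /* Assumed second adjustment (the step Rapid mode skips):
           the window start is pushed back by max_node_dist again. */
        min = (min < max_node_dist) ? 0 : min - max_node_dist;
    }
    return min;
}

/* Number of score_connection-style calls for node i:
   one call for each j in [window_start, i). */
int connections_for_node(int i, int max_node_dist, int rapid) {
    return i - window_start(i, max_node_dist, rapid);
}
```

Under this model, `connections_for_node(999, 500, 0)` gives 999, which matches the case described in item 1, while `connections_for_node(999, 500, 1)` gives 500. With a window of 300, the same node needs 600 calls by default and 300 in Rapid mode.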

I tried to follow the code style :)
If you like the changes, please accept this PR and update the conda package as well.

hyattpd (Owner) commented Dec 5, 2022

I need to think about this one. By default, dynamic programming considers every pair of nodes (each of the n nodes compared against the remaining n-1). This would be extremely slow, so a fix was put in to only look back 500 nodes, which changes the running time to n × 500. Unfortunately, you can get some incredibly long genes, so the code needed a further fix to handle that edge case and allow a connection longer than 500 nodes if it connects a start and a stop node in the same gene. Changing that 500 number will of course give a linear speedup (going from n × 500 to n × 250 will halve the run time of this function), but if the number becomes too low, legitimate connections between nodes could be disallowed. It would take some testing to determine what a reasonable value is (it also varies by GC content, since you can get more or fewer nodes per 1000 bp).
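
As a rough illustration of the scaling described above, the following sketch counts how many node pairs are examined with and without a look-back limit. It is a simplified model, not Prodigal's code: the helper names `pairs_full` and `pairs_windowed` are hypothetical, and the long-gene exception for start and stop nodes in the same gene is left out.

```c
/* Simplified cost model for the dynamic programming step.
   Hypothetical helpers for illustration only. */

/* Pairs examined when every node is compared with all earlier nodes. */
long long pairs_full(long long n) {
    return n * (n - 1) / 2;
}

/* Pairs examined when each node looks back at most `window` nodes.
   The first `window` nodes see all of their predecessors; every
   later node sees exactly `window` of them. */
long long pairs_windowed(long long n, long long window) {
    if (n <= window) {
        return pairs_full(n);
    }
    return pairs_full(window) + (n - window) * window;
}
```

For 100,000 nodes, this model gives 49,874,750 pairs with a window of 500 and 24,968,625 with a window of 250, so halving the window roughly halves the work, while the unrestricted version would examine 4,999,950,000 pairs.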
