Up to 2x faster prodigal #95

jaebeom-kim · 2022-08-23T06:16:10Z

Hi, I'm Jaebeom Kim and developing a metagenomic classifier, Metabuli.
I'm using prodigal to predict genes of thousands of prokaryotic genomes,
so I'm working on making prodigal faster.

Here are some modifications.
Test environment: MacBook Pro, 1.4 GHz quad-core Intel Core i5, 16GB 2133 MHz LPDDR3

Rapid mode.
In the function named 'dprog', I found a suspicious part.
I think lines 52 and 53 are making the program slow. When i==999 and MAX_NODE_DIST==500, 'min' becomes 0 in line line 54, which leads to calling 'score_connection' 999 times. I'm not sure it is intended or not. So I just make an 'Rapid mode' to jump the lines.
When I tested with E.coli genome (GCF_000008865.2), Rapid mode decreased the running time (training + prediction) from ~9.9 sec to ~6.5 sec, while producing the same results.
MAX_NODE_DIST
While reviewing the source code, I found that reducing MAX_NODE_DIST in dprog.h can decrease running time. So, I decreased it from 500 to 300 and tested it using E.coli genome.
The running time (training + predicting) decreased from ~9.9 sec to ~7.2 sec, while producing the same result. But still, it may lead to prediction of lower quality in other cases, so I just made it as an option for users who want to get results faster.

When I tested 100, the running time was about 20 sec producing different result. So, this can be a tricky option to deal with, but I think it will be useful for those who need it.

When rapid mode is used with MAX_NODE_DIST 300,
the same predictions were produced in ~5.5 seconds, which is about 2X acceleration.

I tried to follow the code style :)
If you like the changes, please accept this PR and update the conda package as well.

hyattpd · 2022-12-05T17:28:58Z

I need to think about this one. Dynamic programming by default just considers every node (n compared to remaining n-1). This would be extremely slow, so a fix was put in to only look back 500 nodes. This changes the running speed to nx500. Unfortunately, you can get some incredibly long genes so the code needed a fix to handle that edge case and allow a connection bigger than that if it was connecting a start and a stop node in the same gene. Changing that 500 number will of course give linear speedup (nx500 to nx250 will halve the run time of this function), but if the number becomes too low, legitimate connections between nodes could be disallowed. It would take some testing to determine what's a reasonable value (it also varies by GC content, since you can get more/less nodes per 1000bp).

jaebeom-kim added 3 commits August 23, 2022 11:40

1)Add parameter x. 2)Remove what appears to be an error.

58d5f24

Rapid mode

74514ca

edit usage

0920435

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Up to 2x faster prodigal #95

Up to 2x faster prodigal #95

jaebeom-kim commented Aug 23, 2022 •

edited

Loading

hyattpd commented Dec 5, 2022 •

edited

Loading

Up to 2x faster prodigal #95

Are you sure you want to change the base?

Up to 2x faster prodigal #95

Conversation

jaebeom-kim commented Aug 23, 2022 • edited Loading

hyattpd commented Dec 5, 2022 • edited Loading

jaebeom-kim commented Aug 23, 2022 •

edited

Loading

hyattpd commented Dec 5, 2022 •

edited

Loading