DB lock for babies #1974
base: master
Conversation
Sorry for the delay @meren, my anvi'o environment on our cluster was giving me a bit of trouble and I was a bit stupid about setting proper parameters, so I had to restart my test workflow many times :)

Testing

I tested this on 12,684 GTDB genomes using the anvi'o contigs workflow, with a config file that included both the HMM and tRNA scanning steps, and I found plenty of instances of the sad tables error. Then, I removed any existing data from the hmm_hits table of each genome.

Results

After the workflow finished, I repeated the search for duplicate entry_id values in all of the genomes. Unfortunately, I found many (at least 6,700 of the DBs had duplicates) :( I did find many instances of the "LOCK FOUND" warning across the log files, so the lock was clearly engaging at least some of the time.

So this is a bit weird, because the DB lock is working sometimes, but not always. Unfortunately, I am not 100% sure that it is a bug, because I had to restart this workflow many times. I was careful on the latest restart to be in the correct environment/branch and to run the deletion script before starting the workflow, but I might have missed something in all the mess :/ The commands I used are below in case you want to check what I did; maybe you will see something I didn't. Or maybe my strategy for identifying the sad tables error is wrong? I am happy to start from scratch and test it again, with extra care, if you think that would be best :)

Commands
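Roughly, the duplicate check for a single contigs-db boils down to something like this (a minimal sketch using plain sqlite3; the db path is a placeholder and the wrapper that loops over all genomes is omitted):

```python
import sqlite3
import sys

# A sketch of the duplicate check for one contigs-db (path given as the first
# argument): the 'sad tables' symptom is more than one row sharing an entry_id
# in the hmm_hits table.
db_path = sys.argv[1]

db = sqlite3.connect(db_path)
duplicates = db.execute(
    "SELECT entry_id, COUNT(*) FROM hmm_hits "
    "GROUP BY entry_id HAVING COUNT(*) > 1"
).fetchall()
db.close()

# only report dbs that show the problem, so looping over thousands of genomes
# produces a short list of affected ones
if duplicates:
    print(f"{db_path}: {len(duplicates)} duplicated entry_id values")
```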
This is very annoying, sorry Iva :( But I have no idea how this could happen. Even if a given job was killed before it had a chance to remove its lock file, you shouldn't get a config error about `set()`, since neither of the programs actually uses the `set()` function; they use `wait` instead. So in the worst-case scenario you should have waited forever due to a remaining lock file :( Let me check the code and come back.
Yes, the code looks good. Dumb question here: are you sure you have git pulled and in fact have the latest updates? If yes, and if it is not too much work for you, can you please search for remaining lock files, remove them, and restart the workflow one more time after deleting the HMMs? :( I don't get what is going on, as you should never get a config error unless someone calls `set()` explicitly, which is not the case in the current code.
I remember doing this at least once, but maybe I didn't get everything. And anyway, it is worth trying again just in case I screwed something else up during the multiple starts and stops :) I will restart everything now, and I will document exactly what I did to make sure it goes smoothly this time!
Restart Test Workflow

I logged into the cluster from a new terminal window for a guaranteed fresh start :)
Git says that the branch is up to date:
I removed the snakemake outputs, the existing HMM hits, and any remaining lock files.
Note that my script to delete the existing HMM hits runs over every contigs-db in the fasta.txt; the gist of it is sketched below.
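Conceptually, the per-db cleanup is just this (a sketch with plain sqlite3 for illustration; it only clears the `hmm_hits` table we are checking, and the db path is a placeholder argument):

```python
import sqlite3
import sys

# A sketch of the per-db cleanup (db path given as the first argument): wipe
# whatever is currently in hmm_hits so that any duplicates found later must
# have been written during this test run, not left over from older runs.
db_path = sys.argv[1]

db = sqlite3.connect(db_path)
db.execute("DELETE FROM hmm_hits")
db.commit()
db.close()

# NOTE: this only clears the table used to detect the sad tables error; a
# proper reset of HMM results in anvi'o would touch more than this one table.
```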
I double checked the first and last genomes in the fasta.txt to make sure there is absolutely nothing in their hmm_hits tables (there is no output from the check, so we are good).

And now that everything is deleted, I'm starting the workflow again :)
I will keep an eye on it, and report back once it is finished.
Well, this is very strange. Now every single job in the workflow is failing. I see the following error in the log files:
Same error for both programs. I am absolutely sure that you have tested these changes enough to have caught this sort of issue :( which means it is probably something wrong with my copy of anvi'o on the cluster, right?
Yes, certainly :( From the error:
All files should point to the same repo, not multiple.
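A quick way to sanity-check which copy of the codebase is actually being used (a sketch; the program names are just examples of anything on your PATH):

```python
import shutil
import anvio

# the directory this module lives in tells you which copy of the codebase
# Python is importing
print("anvio module lives in:", anvio.__file__)

# the command-line programs you run should come from the same copy
print("anvi-run-hmms:  ", shutil.which("anvi-run-hmms"))
print("anvi-scan-trnas:", shutil.which("anvi-scan-trnas"))
```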
Update - still finding errors

I completely cleaned and re-generated my anvi'o environment on Midway (I made it from scratch this time rather than cloning, to make sure there was no cross-talk between it and our usual anvio-dev env). Then I cleaned everything related to prior tests and restarted the workflow (my comment above has been updated to reflect what I did).

The workflow is about 60% done, but it appears to have stalled. For some reason there are jobs on the queue that have been running > 10 hours, and they are taking up all the allowed resources and preventing anything else from being submitted. When I check the logs associated with the slurm job IDs, it looks like the jobs finished properly (I see the usual timing info at the end), so I will probably have to kill those jobs and resume the workflow.

But before I do that, I checked to see if things are working now, and unfortunately the answer is no :( When I look in the databases for duplicate IDs in the hmm_hits table, I am still finding many genomes that have the sad tables error. It is less than before, but still happening. The lock is working in some cases, because I see the "LOCK FOUND" warning in some of the logs. I am also seeing several (~18) jobs that have again failed with the error we saw before, with the mysterious call to the `set()` function.
So, I think there must be a race condition here. It is either a runtime bug with the lock code, or a persistent issue in my environment (though I think the latter is less likely now that I am using a fresh environment). @meren, when you have some time, could you take a look at this again? :)
This is very sad :( I don't want this to delay you any further :( Let's leave this PR as is, and solve your problem by not running tRNAs separately, but as part of `anvi-run-hmms` with the flag `--also-scan-trnas`, so you don't have the sad tables problem. Thank you very much for your help with this, and I apologize for not doing a good job here.
For a long time, anvi'o has suffered from occasional 'sad tables' when a user runs programs such as `anvi-run-hmms` on the same database with multiple models in parallel. Even running different programs that target the same tables, such as `anvi-run-hmms` and `anvi-scan-trnas`, caused occasional race conditions that kind of ruined things.

A solution could have been implementing db lock mechanisms, but in most cases many anvi'o programs can operate on the same anvi'o db without any problem, so blocking write access to anvi'o dbs would have been too much of a bottleneck for performance.
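To make the failure mode concrete, here is a hypothetical illustration of such a race (a toy sketch, not anvi'o's actual code): two writers compute the 'next' entry_id from the same table state, so both end up inserting the same id.

```python
import sqlite3

# Toy illustration of the race (not anvi'o's actual code): two writers pick
# the next entry_id from the same table state and both end up inserting it.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE hmm_hits (entry_id INTEGER, gene_name TEXT)")

def next_entry_id():
    max_id = db.execute("SELECT MAX(entry_id) FROM hmm_hits").fetchone()[0]
    return (max_id or 0) + 1

# imagine these two calls happening in two separate processes at the same
# time, before either of them has written anything:
id_seen_by_hmms = next_entry_id()    # e.g. anvi-run-hmms
id_seen_by_trnas = next_entry_id()   # e.g. anvi-scan-trnas

db.execute("INSERT INTO hmm_hits VALUES (?, ?)", (id_seen_by_hmms, "hit from HMMs"))
db.execute("INSERT INTO hmm_hits VALUES (?, ?)", (id_seen_by_trnas, "hit from tRNA scan"))

# the table now holds two rows with entry_id == 1: a 'sad table'
print(db.execute("SELECT * FROM hmm_hits").fetchall())
```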
This PR introduces a quick-and-dirty (but effective) solution to this problem, and applies it to `anvi-run-hmms` and `anvi-scan-trnas`. Now if you start `anvi-run-hmms` on a contigs-db and open a new terminal to start another `anvi-run-hmms` or `anvi-scan-trnas`, the second process will wait until the lock is released by the first one, and the user sees a message about the lock in their terminal. Once the first process is done, the user sees that the lock is removed, and the second process continues.
(It is a DB lock for babies because it is not over-engineered, it is very simple to use, and it will solve 99% of our problems. Coding for babies should be a trend against 'everything but the kitchen sink' in programming, and here we are leading it by example?)
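For anyone curious, the whole idea boils down to something like this (a minimal sketch, not the actual code in this PR; the class name, method names, and messages here are made up for illustration):

```python
import os
import time

class ToyDBLock:
    """A toy file-based lock: whoever creates <db_path>.lock 'owns' the db."""

    def __init__(self, db_path):
        self.lock_path = db_path + ".lock"

    def wait(self, check_interval=5):
        """Block until no one else holds the lock, then claim it."""
        while True:
            try:
                # O_EXCL makes the creation atomic, so only one process can win
                fd = os.open(self.lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
                os.write(fd, str(os.getpid()).encode())
                os.close(fd)
                return
            except FileExistsError:
                print(f"LOCK FOUND at {self.lock_path}, waiting ...")
                time.sleep(check_interval)

    def remove(self):
        """Release the lock so the next process in line can continue."""
        if os.path.exists(self.lock_path):
            os.remove(self.lock_path)


# both programs would wrap their table writes in something like this:
lock = ToyDBLock("CONTIGS.db")
lock.wait()        # blocks here if another process holds the lock
try:
    pass           # ... write HMM hits or tRNA hits into the db ...
finally:
    lock.remove()  # always release, even if the run dies with an error
```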