
DB lock for babies #1974

Open · wants to merge 3 commits into master from db-lock-for-babies
Conversation

@meren (Member) commented Aug 27, 2022

For a long time, anvi'o has suffered from occasional 'sad tables' when a user runs programs such as anvi-run-hmms on the same database with multiple models in parallel. Even running different programs that target the same tables, such as anvi-run-hmms and anvi-scan-trnas, caused occasional race conditions that ruined things.

One solution would have been to implement a full db lock mechanism, but in most cases many anvi'o programs can operate on the same anvi'o db without any problem, so blocking write access to anvi'o dbs across the board would have been too much of a performance bottleneck.

This PR introduces a quick-and-dirty (but effective) solution for this problem, and its application to anvi-run-hmms and anvi-scan-trnas.

Now if you start anvi-run-hmms on a contigs-db and open a new terminal to start another anvi-run-hmms or anvi-scan-trnas, the second process will wait until the lock is released by the first one. Here is an example of what the user sees in their terminal in that case:

[screenshot: the second process reports the lock held by the first process and waits for it to be released]

Once the first process is done, the user sees that the lock is removed, and the process continues:

[screenshot: the lock is removed and the second process continues]

(It is a DB lock 'for babies' because it is not over-engineered, it is very simple to use, and it will solve 99% of our problems. Coding for babies should be a trend against 'everything but the kitchen sink' programming, and here we are leading it by example?)
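The implementation itself is not shown in this thread, but from the class name in the traceback further down (`utils.DatabaseLockForBabies`), the error messages, and the lock file paths quoted in the test reports (e.g. `.GCF_002608075.1-contigs.db.HMMs.lock`), the idea can be sketched roughly as below. Only the constructor arguments, `set()`, and the 'wait' behavior are actually mentioned in the thread; the method names `is_set()` and `remove()`, the polling interval, and the exact lock file naming are assumptions for illustration.

```python
# A rough sketch of the idea only, NOT the actual anvi'o code: the lock is a hidden
# file next to the database, named after the database and the program holding it.
import os
import time


class DatabaseLockForBabies:
    def __init__(self, db_path, name):
        # assumed naming: GENOME-contigs.db + 'HMMs' -> .GENOME-contigs.db.HMMs.lock
        directory, db_file = os.path.split(os.path.abspath(db_path))
        self.lock_file_path = os.path.join(directory, f".{db_file}.{name}.lock")

    def is_set(self):
        return os.path.exists(self.lock_file_path)

    def set(self):
        # refuses to set a lock that already exists; this is where the Config Error
        # quoted later in the thread would come from
        if self.is_set():
            raise RuntimeError(f"There is already a lock at {self.lock_file_path}")
        open(self.lock_file_path, "w").close()

    def wait(self, check_interval=5):
        # block while another process holds the lock (the real code prints the
        # "LOCK FOUND" message seen in the logs while it waits), then take the lock;
        # how the real code acquires the lock after waiting is not visible in this
        # thread, and a naive check-then-set like this has its own race window
        while self.is_set():
            time.sleep(check_interval)
        self.set()

    def remove(self):
        # release the lock so a waiting process can continue
        if self.is_set():
            os.remove(self.lock_file_path)


# Roughly the flow described later in the thread: wait for (and take) the lock,
# do the work, release it.
if __name__ == "__main__":
    lock = DatabaseLockForBabies("GENOME-contigs.db", "HMMs")
    lock.wait()
    try:
        pass  # run HMMs / scan tRNAs against the contigs-db here
    finally:
        lock.remove()
```

The appeal is exactly what the parenthetical above says: no changes to how SQLite is used, no daemon, just a file whose presence other anvi'o processes check before touching the same tables.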

@meren requested review from mschecht and ivagljiva on August 27, 2022
@ivagljiva (Contributor) commented Sep 1, 2022

Sorry for the delay @meren, my anvi'o environment on our cluster was giving me a bit of trouble and I was a bit stupid about setting proper parameters, so I had to restart my test workflow many times :)

Testing

I tested this on 12,684 GTDB genomes using the anvi'o contigs workflow and a config file that included both anvi-run-hmms and anvi-scan-trnas (which could run at the same time on the same database). First, I checked that a prior run of this workflow had produced the sad tables error, by looking in all the databases for duplicate entry_id values in the hmm_hits table:

sqlite3 ${genome}-contigs.db "select entry_id, COUNT(*) c from hmm_hits GROUP BY entry_id HAVING c > 1"

and I found plenty of instances of the sad tables error.

Then, I removed any existing data from the hmm_hits table by running anvi-delete-hmms on all of these genomes, and started the workflow.

Results

After the workflow finished, I repeated the search for duplicate entry_id values in all of the genomes. Unfortunately, I found many (at least 6,700 of the DBs had duplicates) :(

For example, genome GCA_000210715 had duplicate entry IDs in the hmm_hits table. Both anvi-run-hmms and anvi-scan-trnas ran successfully on this genome, according to the log files, and there was no "LOCK FOUND" message in either of these logs.

I did find many instances of the "LOCK FOUND" warning across the log files, and 56 of the anvi-run-hmms jobs failed, with the following error in their logs:

Config Error: There is a lock for this database set by the program 'HMMs'. You can either wait
              for it to be released, or manually remove the lock before running the member
              function `set()` again. FYI, the lock file is here:
              [...]

So this is a bit weird, because the DB lock is working sometimes, but not always. Unfortunately, I am not 100% sure that it is a bug, because I had to restart this workflow many times. I was careful on the latest restart to be in the correct environment/branch and to run the deletion script before starting the workflow, but I might have missed something in all the mess :/ The commands I used are below in case you want to check what I did; maybe you will see something I didn't.

Or maybe my strategy for identifying the sad tables error is wrong?

I am happy to start from scratch and test it again, with extra care, if you think that would be best :)

Commands

# working dir
cd /project2/meren/RESOURCES/PUBLIC/GENOMES/GTDB/release95/03_SNAKEMAKE

# check for sad table errors
while read genome; do echo $genome; sqlite3 ../GTDB-ANVIO/GENOMES/${genome}-contigs.db "select entry_id, COUNT(*) c from hmm_hits GROUP BY entry_id HAVING c > 1"; done < <(tail -n+2 rerun_hmms_test_pr_1974_fasta.txt | cut -f 1)

# remove .done files
while read genome; do rm ../GTDB-ANVIO/GENOMES/anvi_run_hmms-${genome}.done; rm ../GTDB-ANVIO/GENOMES/anvi_run_trna_scan-${genome}.done; done < <(tail -n+2 rerun_hmms_test_pr_1974_fasta.txt | cut -f 1)
# delete old HMMs
./rerun_hmms_iva_delete_old_hmms.sh

# load my personal dev environment
conda deactivate
activate-iva # alias
# checkout PR branch
cd /project2/meren/PEOPLE/iveseli/anvio-iva/anvio
git pull
git checkout db-lock-for-babies
# verify location of script
which anvi-run-workflow
cd -
# command to run workflow
clusterize "anvi-run-workflow -w contigs -c rerun_hmms_test_pr_1974_config.json --skip-dry-run --additional-params --cluster \"clusterize -j={rule} -o={log} -n={threads} -x --nodelist midway-l16b-28 --exclude \'\' \" --jobs 40 --resource nodes=40 --latency-wait 100 --keep-going --rerun-incomplete"  -n 1  -o test-pr-1974-workflow.log -j test-pr-1974-workflow
# check for sad tables after workflow finished (with genome names so manual check is possible)
while read genome; do echo $genome; sqlite3 ../GTDB-ANVIO/GENOMES/${genome}-contigs.db "select entry_id, COUNT(*) c from hmm_hits GROUP BY entry_id HAVING c > 1"; done < <(tail -n+2 rerun_hmms_test_pr_1974_fasta.txt | cut -f 1)
# how many genomes have sad table errors?
i=0; while read genome; do  dups=$(sqlite3 ../GTDB-ANVIO/GENOMES/${genome}-contigs.db "select entry_id, COUNT(*) c from hmm_hits GROUP BY entry_id HAVING c > 1" | wc -l); if [ $dups -gt 0 ]; then ((i=i+1)); fi; done < <(tail -n+2 rerun_hmms_test_pr_1974_fasta.txt | cut -f 1)
echo $i
# this took a very long time, so I cancelled it partway. But the number it had reached was 6705 :(

# look for locks output:
while read genome; do grep "LOCK FOUND" /project2/meren/RESOURCES/PUBLIC/GENOMES/GTDB/release95/04_WORKFLOW/00_LOGS/${genome}-anvi_run_trna_scan.log; grep "LOCK FOUND" /project2/meren/RESOURCES/PUBLIC/GENOMES/GTDB/release95/04_WORKFLOW/00_LOGS/${genome}-anvi_run_hmms.log; done < <(tail -n+2 rerun_hmms_test_pr_1974_fasta.txt | cut -f 1)
# how many HMMs failed?
grep -c "Error in rule anvi_run_hmms" test-pr-1974-workflow_YCZONPufYK.log

@meren (Member, Author) commented Sep 1, 2022

This is very annoying, sorry Iva :(

But I have no idea how this could happen. Even if a given job was killed before it had a chance to remove its lock file, you shouldn't get a config error about `set()`, since neither of the programs actually uses the set() function; they use wait instead. So in the worst-case scenario you should have waited forever due to a remaining lock file :(

Let me check the code and come back.

@meren (Member, Author) commented Sep 1, 2022

Yes, the code looks good. Dumb question here: are you sure you have git pulled and in fact have the latest updates? If yes, and if it is not too much work for you, can you please search for remaining lock files, remove them, and restart the workflow one more time after deleting HMMs? :( I don't get what is going on, as you should never get a config error unless someone calls set() explicitly, which is not the case in the current code.

@ivagljiva (Contributor) commented:

are you sure you have git pulled

I remember doing this at least once, but maybe I didn't get everything. And anyway, it is worth trying again just in case I screwed something else up during the multiple starts and stops :)

I will restart everything now, and I will document exactly what I did to make sure it goes smoothly this time!

@ivagljiva (Contributor) commented Sep 2, 2022

Restart Test Workflow

I logged into the cluster from a new terminal window for a guaranteed fresh start :)

# First, make sure my copy of anvi'o is up to date
cd /project2/meren/PEOPLE/iveseli/github/anvio/
git pull
git checkout db-lock-for-babies
git status

Git says that the branch is up to date:

On branch db-lock-for-babies
Your branch is up-to-date with 'origin/db-lock-for-babies'.
nothing to commit, working tree clean

# load my personal anvi'o environment
conda deactivate
activate-iva
which anvi-run-workflow
# /project2/meren/PEOPLE/iveseli/anvio-iva/anvio/bin/anvi-run-workflow
echo $PYTHONPATH
# /project2/meren/PEOPLE/iveseli/github/anvio/
# looks like it loaded properly :)

I removed snakemake outputs, HMM hits, .done files, former log files (just in case), and any lock files:

# go to the wd
cd /project2/meren/RESOURCES/PUBLIC/GENOMES/GTDB/release95/03_SNAKEMAKE

# delete everything relevant to the previous runs of the workflow
rm -r .snakemake/ test-pr-1974-workflow_*.log

# remove old HMM hits
./rerun_hmms_iva_delete_old_hmms.sh
# when these jobs are all done, check for errors and delete the log files
grep -i error delete_hmm_*.log
rm delete_hmm_*.log 

# remove .done files
while read genome; do \
    rm ../GTDB-ANVIO/GENOMES/anvi_run_hmms-${genome}.done; \
    rm ../GTDB-ANVIO/GENOMES/anvi_run_trna_scan-${genome}.done; \
done < <(tail -n+2 rerun_hmms_test_pr_1974_fasta.txt | cut -f 1)
# this produces errors for the files that do not exist

# remove locks
rm -rf ../GTDB-ANVIO/GENOMES/.*.lock
ls -ad ../GTDB-ANVIO/GENOMES/.* # check all hidden files to be sure they are gone

# remove old logs just in case we are seeing old log output (unlikely but still)
while read genome; do \
    rm ../04_WORKFLOW/00_LOGS/${genome}-anvi_run_hmms.log; \
    rm ../04_WORKFLOW/00_LOGS/${genome}-anvi_run_trna_scan.log; \
done < <(tail -n+2 rerun_hmms_test_pr_1974_fasta.txt | cut -f 1)

Note that my script to delete the existing HMM hits runs anvi-delete-hmms on each genome (it also makes unbridled use of the cluster resources, so check the queue before starting):

#!/bin/bash
# A script to delete the old HMMs from my subset of genomes

i=0
while read genome;
do
    clusterize -j $genome -o delete_hmm_${genome}.log -x --exclude '' "anvi-delete-hmms -c /project2/meren/RESOURCES/PUBLIC/GENOMES/GTDB/release95/GTDB-ANVIO/GENOMES/${genome}-contigs.db --just-do-it";
    ((i=i+1))
    if [ $(expr $i % 100) == "0" ]; then
        echo "Sleeping 20 sec..."
        sleep 20
    fi
done < <(tail -n+2 rerun_hmms_test_pr_1974_fasta.txt | cut -f 1)

I double-checked the first and last genomes in the fasta.txt to make sure there is absolutely nothing in their hmm_hits tables:

sqlite3 /project2/meren/RESOURCES/PUBLIC/GENOMES/GTDB/release95/GTDB-ANVIO/GENOMES/GCA_000007185.1-contigs.db "select * from hmm_hits"
sqlite3 /project2/meren/RESOURCES/PUBLIC/GENOMES/GTDB/release95/GTDB-ANVIO/GENOMES/GCF_901411515.1-contigs.db "select * from hmm_hits"

(there is no output, so we are good)

And now that everything is deleted, I'm starting the workflow again :)

clusterize "anvi-run-workflow -w contigs -c rerun_hmms_test_pr_1974_config.json --skip-dry-run --additional-params --cluster \"clusterize -j={rule} -o={log} -n={threads} -x --nodelist midway-l16b-28 --exclude \'\' \" --jobs 40 --resource nodes=40 --latency-wait 100 --keep-going --rerun-incomplete"  -n 1  -o test-pr-1974-workflow.log -j test-pr-1974-workflow

I will keep an eye on it, and report back once it is finished.

@ivagljiva (Contributor) commented:

Well, this is very strange. Now every single job in the workflow is failing. I see the following error in the log files:

Traceback (most recent call last):
  File "/project2/meren/PEOPLE/iveseli/anvio-iva/anvio/bin/anvi-run-hmms", line 161, in <module>
    main(args)
  File "/project2/meren/GITHUB/anvio/anvio/terminal.py", line 879, in wrapper
    program_method(*args, **kwargs)
  File "/project2/meren/PEOPLE/iveseli/anvio-iva/anvio/bin/anvi-run-hmms", line 98, in main
    lock = utils.DatabaseLockForBabies(args.contigs_db, 'HMMs')
AttributeError: module 'anvio.utils' has no attribute 'DatabaseLockForBabies'

The same error appears for both anvi-run-hmms and anvi-scan-trnas.

I am absolutely sure that you have tested these changes enough to have caught this sort of issue :( which means it is probably something wrong with my copy of anvi'o on the cluster, right?

@meren (Member, Author) commented Sep 2, 2022

probably something wrong with my copy of anvi'o on the cluster, right?

Yes, certainly :(

From the error:

(...)
File "/project2/meren/PEOPLE/iveseli/anvio-iva/anvio/bin/anvi-run-hmms"
(...)
File "/project2/meren/GITHUB/anvio/anvio/terminal.py", line 879, in wrapper
(...)
File "/project2/meren/PEOPLE/iveseli/anvio-iva/anvio/bin/anvi-run-hmms"

All files should point to the same repo, not multiple.
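
A quick, generic way to confirm which copy Python is actually loading (this check is not from the thread, just standard Python) is to ask for the file behind `anvio.utils` and whether the new class is present:

```python
# Hypothetical sanity check, not part of this PR: show which anvio.utils is imported
# by the active environment, and whether it already contains the new lock class.
import anvio.utils

print(anvio.utils.__file__)                           # should live in the repo that was git-pulled
print(hasattr(anvio.utils, "DatabaseLockForBabies"))  # False would explain the AttributeError above
```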

@ivagljiva (Contributor) commented:

Update - still finding errors

I completely cleaned and re-generated my anvi'o environment on Midway (I made it from scratch this time rather than cloning, to make sure there was no cross-talk between it and our usual anvio-dev env). Then I cleaned everything related to prior tests and restarted the workflow (my comment above has been updated to reflect what I did).

The workflow is about 60% done, but it appears to have stalled. For some reason there are jobs on the queue that have been running > 10 hours, and they are taking up all the allowed resources and preventing anything else from being submitted. When I check the logs associated with the slurm job ID, it looks like the jobs finished properly (I see timing info like ✓ anvi-scan-trnas took 0:00:11.446662 at the bottom). But for some reason slurm failed to end the jobs (this is likely a slurm error, and nothing to do with the DB lock changes). I cancelled these stalled jobs, but unfortunately they are now stuck in the COMPLETING phase (so the workflow cannot continue anyway). I think I will have to cancel and re-start the workflow yet again.

But before I do that, I checked to see if things are working now, and unfortunately the answer is no :(

When I look in the databases for duplicate IDs in the hmm_hits table, I am still finding many genomes that have the sad tables error. It is less than before, but still happening.

The lock is working in some cases, because I see the LOCK FOUND output in some of the log files.

I am also seeing several (~18) jobs that have again failed with this error that we saw before, with the mysterious call to the set() function:

Config Error: There is a lock for this database set by the program 'HMMs'. You can either wait
              for it to be released, or manually remove the lock before running the member
              function `set()` again. FYI, the lock file is here:
              /project2/meren/RESOURCES/PUBLIC/GENOMES/GTDB/release95/GTDB-
              ANVIO/GENOMES/.GCF_002608075.1-contigs.db.HMMs.lock

So, I think there must be a race condition here. It is either a runtime bug with the lock code, or a persistent issue in my environment (though I think the latter is less likely now that I am using a fresh environment).
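
For what it's worth, the classic way a file-based lock races is a non-atomic check-then-create: two processes both see no lock file and both proceed. Whether that is what happens in this branch is not visible in the thread, but the pattern and an atomic alternative look like this (a generic sketch, not anvi'o code):

```python
# Generic illustration of a lock-file race, not taken from the anvi'o code base.
import os


def acquire_lock_racy(lock_file_path):
    # there is a window between the exists() check and the file creation in which
    # a second process can pass the same check, so both end up "holding" the lock
    if os.path.exists(lock_file_path):
        return False
    open(lock_file_path, "w").close()
    return True


def acquire_lock_atomic(lock_file_path):
    # O_CREAT | O_EXCL makes creation fail atomically if the file already exists
    # (though atomicity guarantees can be weaker on some networked filesystems)
    try:
        fd = os.open(lock_file_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    os.close(fd)
    return True
```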

@meren, when you have some time, could you take a look at this again? :)

@meren (Member, Author) commented Sep 7, 2022

This is very sad :(

I don't want this to delay you any further :(

Let's leave this PR as is, and solve your problem by not running the tRNA scan separately but as part of anvi-run-hmms with the flag --also-scan-trnas, so you don't run into the sad tables problem.

Thank you very much for your help with this and I apologize for not doing a good job here.
