Proposal for simulation claiming with fewer queries #839
mattdailis started this conversation in Ideas
-
Just to make sure I understand, this sounds less like "block each other" (which sounds like a deadlock) and more like "are linearized" -- if one worker is in the process of trying to claim a job, then every other worker has to wait on the first one before they get a shot. Is that right?
-
Currently, simulations are claimed as follows:
- The `simulate` action inserts a row into `simulation_dataset`.
- A worker claims that row with `update set status='incomplete' where id=? and status='pending'`.
There are a couple of aspects here that I think have room for improvement:
The former is a more pressing issue than the latter - the multiple requests aren't really a performance bottleneck. Two open questions there are:
- [...] `incomplete` simulation with its own id, could/should it toggle it to `failed`?

I engaged in a thought exercise - would it be possible to write a "Claim" action (in the Java database action sense) that atomically claimed any single pending simulation? This action could be used when a worker starts up, to claim previously issued simulation requests, and it could also be used in lieu of iterating over every notification in the queue.
Essentially I'm looking for something like this:
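A minimal sketch of such a "claim any" query, assuming PostgreSQL and a `simulation_dataset` table with `id` and `status` columns. The `returning id` clause is an assumption about what the worker would need back, not something stated in this post:

```sql
-- Naive "claim any pending simulation" (illustrative sketch).
-- The inner select picks one pending row; the outer update marks it claimed.
update simulation_dataset
set status = 'incomplete'
where id = (
  select id
  from simulation_dataset
  where status = 'pending'
  limit 1
)
returning id;  -- assumed: tells the worker which simulation it claimed
```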
It turns out there are a few problems with this. The trouble comes from the fact that the `select` statement may return rows that are concurrently being modified by another transaction - i.e. another worker may already be in the process of claiming that simulation. The consequence is two-fold:

To try to solve that, we could add a seemingly redundant where clause to the update statement:
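One way that guarded update might look, again assuming PostgreSQL and the `id` and `status` columns of `simulation_dataset`. The extra `status = 'pending'` condition on the outer statement is the "seemingly redundant" clause, and `returning id` is an assumption:

```sql
-- Guarded claim (illustrative sketch).
-- If another transaction claims the selected row first, the outer
-- status check no longer matches once this update gets the row,
-- so zero rows are updated instead of claiming the row a second time.
update simulation_dataset
set status = 'incomplete'
where id = (
  select id
  from simulation_dataset
  where status = 'pending'
  limit 1
)
and status = 'pending'
returning id;  -- empty result means "nothing claimed this time"
```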
This prevents double claiming, but it doesn't quite give us the "claim any simulation" semantics. Instead, this is "try to claim the first pending dataset - give up if you can't". We could issue the above query repeatedly, and eventually claim a simulation request, but that seems suboptimal.
We need the select and the update to operate atomically - so that no concurrent modification can sneak in between the worker observing a row as 'pending' and that worker updating that row to 'incomplete'. Enter the `select for update` feature: ⭐
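A minimal sketch of a claim query using this feature, assuming PostgreSQL and the same `id` and `status` columns of `simulation_dataset` (the `returning id` clause is an assumption):

```sql
-- Claim with a row-level lock (illustrative sketch).
-- "for update" locks the row returned by the inner select, so a
-- concurrent claim has to wait for this transaction to finish
-- before it can select that row.
update simulation_dataset
set status = 'incomplete'
where id = (
  select id
  from simulation_dataset
  where status = 'pending'
  limit 1
  for update
)
returning id;
```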
`select for update` will set row-level locks on each row returned from that select statement. In this case, that means the select statement will lock the first 'pending' simulation request it encounters, and then the update can safely claim it, with no interference from concurrent transactions.

This achieves the goal of a single query that claims any simulation request 🎉 I'm less sure this next step is necessary, but for the sake of completeness I'll include it.
The only arguably undesirable property of the above query is that concurrent claim requests block each other - the first query will lock the first pending simulation, and the second query cannot proceed until the first query's transaction commits or rolls back. If we want to allow the second query to skip the first simulation request, and move on to the second request, we can add the `skip locked` option to `select for update`:

This query will claim the first pending simulation request that doesn't have a row-level lock. It introduces a vulnerability: if the locking transaction rolls back rather than commits, our worker may claim zero simulation requests, even though there is one that is available to be claimed. For that reason, I think it might be better to leave off the `skip locked` option, use the option marked with a ⭐, and make sure the claiming transactions are short.
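For reference, the `skip locked` variant described above might be sketched as follows, assuming PostgreSQL 9.5 or later (where `skip locked` is available) and the same `id` and `status` columns of `simulation_dataset`; `returning id` is an assumption:

```sql
-- Non-blocking claim (illustrative sketch).
-- Rows already locked by another transaction are skipped rather than
-- waited on, so this can return no rows even if the other transaction
-- later rolls back and leaves its row pending.
update simulation_dataset
set status = 'incomplete'
where id = (
  select id
  from simulation_dataset
  where status = 'pending'
  limit 1
  for update skip locked
)
returning id;
```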