Fix duplicate job submissions on reload. #6345
Conversation
Force-pushed from 2d672db to 8d77c6b
Yikes!
It is new-ish, but I would have thought that this change would have made reload safer in this regard because there can't be any preparing tasks at the time of the reload?
Could definitely do with pinning this down, as it's likely an interaction of an internal list or task state with some other part of the system, which could potentially spring a leak under other circumstances too?
That was definitely the intention, but whilst waiting for the preparing tasks to submit we repeatedly add the same job-submit commands to the process pool command queue. [Note to self for tomorrow: my fix here might not be good enough if batch job submission is not deterministic in terms of batch membership during the pre_prep period...]
Oh heck, whilst! The …
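For illustration, here is a minimal, self-contained simulation of the failure mode described above (all names are made up, not the real cylc.flow internals): while tasks sit in the preparing state waiting to submit, every pass of the main loop re-queues the same job-submit command.

from collections import deque

command_queue = deque()
preparing_tasks = ["1/foo", "1/bar"]  # hypothetical task IDs

# Suppose the reload stays pending for three main-loop passes:
for _ in range(3):
    for task_id in preparing_tasks:
        # Nothing records that a submit command for this task is
        # already queued, so duplicates accumulate.
        command_queue.append(("job-submit", task_id))

print(len(command_queue))  # 6 commands for 2 tasks: duplicate submissions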
This seems to fix it:

diff --git a/cylc/flow/scheduler.py b/cylc/flow/scheduler.py
index 92702b0b5..9648b025a 100644
--- a/cylc/flow/scheduler.py
+++ b/cylc/flow/scheduler.py
@@ -1245,7 +1245,7 @@ class Scheduler:
         # don't release queued tasks, finish processing preparing tasks
         pre_prep_tasks = [
             itask for itask in self.pool.get_tasks()
-            if itask.state(TASK_STATUS_PREPARING)
+            if itask.waiting_on_job_prep
         ]
         # Return, if no tasks to submit. We use the …
Nope, damn it. That fixes the duplicate job submissions, but it does so by no longer waiting for preparing tasks to clear before doing the reload. I guess the question is: do we need to wait for all preparing tasks to clear, or just for those with waiting_on_job_prep set?
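A self-contained illustration of the two conditions under discussion (the Task dataclass and its values are made up; the attribute names mirror the diff above): waiting_on_job_prep selects a subset of the tasks in the preparing state.

from dataclasses import dataclass

@dataclass
class Task:
    id: str
    status: str
    waiting_on_job_prep: bool

tasks = [
    Task("1/foo", "preparing", True),   # still waiting on job prep
    Task("1/bar", "preparing", False),  # preparing, but prep has finished
    Task("1/baz", "waiting", False),    # not preparing at all
]

preparing = [t for t in tasks if t.status == "preparing"]  # original filter
pre_prep = [t for t in tasks if t.waiting_on_job_prep]     # proposed filter

print(len(preparing), len(pre_prep))  # 2 1 -> the proposed set is narrower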
Force-pushed from 8d77c6b to 37b6593
Yes, this shifted other bugs.
Yes, but I'm asking if those bugs were solely due to tasks with waiting_on_job_prep set, or to preparing tasks more generally?
In the case of the auto-restart functionality, it's preparing more generally. Auto-restart will restart the workflow on another host, so we must wait for all localhost task submissions to complete first, because the localhost platform will be different on the new host.
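A hedged sketch of the auto-restart constraint just described (the function and task model here are assumptions, not the real Scheduler API): before the workflow restarts on another host, drain any submissions still in flight on localhost, since "localhost" will resolve to a different machine after the restart.

import asyncio
from dataclasses import dataclass

@dataclass
class Task:
    id: str
    preparing: bool = True

async def drain_local_submissions(tasks, poll=0.05):
    """Block until no task is still preparing (i.e. submitting locally)."""
    while any(t.preparing for t in tasks):
        await asyncio.sleep(poll)

async def main():
    tasks = [Task("1/foo")]

    async def finish_prep():
        await asyncio.sleep(0.1)  # simulate job prep completing
        tasks[0].preparing = False

    await asyncio.gather(drain_local_submissions(tasks), finish_prep())
    print("safe to restart on the new host")

asyncio.run(main())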
So @oliver-sanders - just to make sure we're on the same page here, and so I can hopefully solve this problem tomorrow: your suggestion above to use waiting_on_job_prep seems to be at odds with what you've just said (we have to wait for ALL preparing tasks).

Assuming the latter comment overrides the former, I take it you now think we need to keep the original code in the scheduler module, and come up with a different solution to prevent the duplicate submissions?
Sorry, I wasn't proposing it as a fix (else I would have opened a PR), just pointing out that it seemed to fix the example in the issue.
Force-pushed from 37b6593 to 44c00c0
OK, got it. I resorted to a functional test as I wasn't sure how to do better in the time available today.
I can't figure out the one seemingly repeatable functional test failure on this PR. Pretty sure it's unrelated: it only fails in the macOS CI run; it passes in the Linux run, and it passes locally on macOS for me.

The test workflow graph is:

R1 = """FAM:finish-any => foo"""  # FAM is a, b, c

[UPDATE] Damn it, kicking the macOS test batch for a third time worked. Well, that was a complete waste of time. I'll leave the above comment in just in case it indicates that the test is fundamentally flaky though. (Maybe by coincidence the system load was such that the "fast" task took exactly 10 seconds too...)
Works.
We would have run into this at NIWA sooner or later 👍
# Avoid duplicate job submissions when flushing
# preparing tasks before a reload. See
# https://github.com/cylc/cylc-flow/pull/6345
continue
If this is commented out, the test fails (as expected).
Found 12 (not 1) of 1/foo.*submitted to in /home/sutherlander/cylc-run/cylctb-20240905T051008Z-1U8S/functional/reload/28-preparing/log/scheduler/log
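A hedged reconstruction of how the continue guard above behaves (the loop and variable names are hypothetical, not copied from cylc.flow): skip any task whose submit command has already been queued, so flushing preparing tasks before a reload queues each command at most once.

queued_ids = set()
command_queue = []

# Simulate the same tasks being flushed on successive main-loop passes:
for task_id in ["1/foo", "1/bar", "1/foo", "1/bar"]:
    if task_id in queued_ids:
        # Avoid duplicate job submissions when flushing
        # preparing tasks before a reload. See
        # https://github.com/cylc/cylc-flow/pull/6345
        continue
    queued_ids.add(task_id)
    command_queue.append(("job-submit", task_id))

print(command_queue)  # one entry per task, no duplicates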
itask.waiting_on_job_prep = False

if not job_log_dirs:
    continue
Close #6344 - tasks in the preparing state at reload time will submit multiple times.

My tentative fix prevents queuing a command to the subprocess pool if the same command is already queued or running. This works, but I haven't grokked the root cause well enough to see if we can prevent the queue attempt in the first place.

It's something to do with the fact that we now wait for preparing tasks to submit before actioning the requested reload (which I think is new-ish) ... and maybe how that interacts with the evil "pre-prep" task list.
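A minimal sketch of the tentative fix just described, assuming (this is not the real cylc.flow.subprocpool interface) that the pool can compare an incoming command against those already queued or running.

from collections import deque

class SubProcPoolSketch:
    def __init__(self):
        self.queued = deque()   # commands waiting for a process slot
        self.running = set()    # commands currently executing

    def put_command(self, cmd):
        """Queue cmd unless an identical command is queued or running."""
        if cmd in self.queued or cmd in self.running:
            return False  # drop the duplicate rather than re-queue it
        self.queued.append(cmd)
        return True

pool = SubProcPoolSketch()
print(pool.put_command(("job-submit", "1/foo")))  # True: queued
print(pool.put_command(("job-submit", "1/foo")))  # False: duplicate dropped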
Check List

- I have read CONTRIBUTING.md and added my name as a Code Contributor.
- … setup.cfg (and conda-environment.yml if present).
- … ?.?.x branch.