Fix duplicate job submissions on reload. #6345
Conversation
Force-pushed from 2d672db to 8d77c6b
Yikes!
It is new-ish, but I would have thought that this change would have made reload safer in this regard because there can't be any preparing tasks at the time of the reload?
Could definitely do with pinning this down, as it's likely an interaction of an internal list or task state with some other part of the system, which could potentially spring a leak under other circumstances too?
That was definitely the intention, but whilst waiting for the preparing tasks to submit we repeatedly add the same job-submit commands to the process pool command queue. [Note to self for tomorrow: my fix here might not be good enough if batch job submission is not deterministic in terms of batch membership during the pre_prep period...]
Oh heck, whilst! The …
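For illustration, here is a minimal, self-contained simulation of the failure mode described above (all names are made up, not the real cylc.flow internals): while tasks sit in the preparing state waiting to submit, every pass of the main loop re-queues the same job-submit command.

from collections import deque

command_queue = deque()
preparing_tasks = ["1/foo", "1/bar"]  # hypothetical task IDs

# Suppose the reload stays pending for three main-loop passes:
for _ in range(3):
    for task_id in preparing_tasks:
        # Nothing records that a submit command for this task is
        # already queued, so duplicates accumulate.
        command_queue.append(("job-submit", task_id))

print(len(command_queue))  # 6 commands for 2 tasks: duplicate submissions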
This seems to fix it:

diff --git a/cylc/flow/scheduler.py b/cylc/flow/scheduler.py
index 92702b0b5..9648b025a 100644
--- a/cylc/flow/scheduler.py
+++ b/cylc/flow/scheduler.py
@@ -1245,7 +1245,7 @@ class Scheduler:
         # don't release queued tasks, finish processing preparing tasks
         pre_prep_tasks = [
             itask for itask in self.pool.get_tasks()
-            if itask.state(TASK_STATUS_PREPARING)
+            if itask.waiting_on_job_prep
         ]
         # Return, if no tasks to submit. We use the …
Nope, damn it. That fixes the duplicate job submissions, but it does so by no longer waiting for preparing tasks to clear before doing the reload. I guess the question is: do we need to wait for all preparing tasks to clear, or just for those with waiting_on_job_prep set?
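A self-contained illustration of the two conditions under discussion (the Task dataclass and its values are made up; the attribute names mirror the diff above): waiting_on_job_prep selects a subset of the tasks in the preparing state.

from dataclasses import dataclass

@dataclass
class Task:
    id: str
    status: str
    waiting_on_job_prep: bool

tasks = [
    Task("1/foo", "preparing", True),   # still waiting on job prep
    Task("1/bar", "preparing", False),  # preparing, but prep has finished
    Task("1/baz", "waiting", False),    # not preparing at all
]

preparing = [t for t in tasks if t.status == "preparing"]  # original filter
pre_prep = [t for t in tasks if t.waiting_on_job_prep]     # proposed filter

print(len(preparing), len(pre_prep))  # 2 1 -> the proposed set is narrower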
Force-pushed from 8d77c6b to 37b6593
Yes, this shifted other bugs.
Yes, but I'm asking if those bugs were solely due to tasks with waiting_on_job_prep set, or to preparing tasks more generally?
In the case of the auto-restart functionality, it's preparing more generally. Auto-restart will restart the workflow on another host, so we must wait for all localhost task submissions to complete first, because the localhost platform will be different on the new host.
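A hedged sketch of the auto-restart constraint just described (the function and task model here are assumptions, not the real Scheduler API): before the workflow restarts on another host, drain any submissions still in flight on localhost, since "localhost" will resolve to a different machine after the restart.

import asyncio
from dataclasses import dataclass

@dataclass
class Task:
    id: str
    preparing: bool = True

async def drain_local_submissions(tasks, poll=0.05):
    """Block until no task is still preparing (i.e. submitting locally)."""
    while any(t.preparing for t in tasks):
        await asyncio.sleep(poll)

async def main():
    tasks = [Task("1/foo")]

    async def finish_prep():
        await asyncio.sleep(0.1)  # simulate job prep completing
        tasks[0].preparing = False

    await asyncio.gather(drain_local_submissions(tasks), finish_prep())
    print("safe to restart on the new host")

asyncio.run(main())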
So @oliver-sanders - just to make sure we're on the same page here, and so I can hopefully solve this problem tomorrow: your suggestion above to use waiting_on_job_prep seems to be at odds with what you've just said (we have to wait for ALL preparing tasks).

Assuming the latter comment overrides the former, I take it you now think we need to keep the original code in the scheduler module, and come up with a different solution to prevent the duplicate submissions?
Sorry, I wasn't proposing it as a fix (else I would have opened a PR), just pointing out that it seemed to fix the example in the issue.
Force-pushed from 37b6593 to 44c00c0
OK, got it. I resorted to a functional test as I wasn't sure how to do better in the time available today.
I can't figure out the one seemingly repeatable functional test failure on this PR. Pretty sure it's unrelated: it only fails in the macOS CI run; it passes in the Linux run, and it passes locally on macOS for me.

The test workflow graph is:

R1 = """FAM:finish-any => foo"""  # FAM is a, b, c

[UPDATE] Damn it, kicking the macOS test batch for a third time worked. Well, that was a complete waste of time. I'll leave the above comment in just in case it indicates that the test is fundamentally flaky though. (Maybe by coincidence the system load was such that the "fast" task took exactly 10 seconds too...)
Works.
We would have run into this at NIWA sooner or later 👍
# Avoid duplicate job submissions when flushing
# preparing tasks before a reload. See
# https://github.com/cylc/cylc-flow/pull/6345
continue
If this is commented out, the test fails (as expected).
Found 12 (not 1) of 1/foo.*submitted to in /home/sutherlander/cylc-run/cylctb-20240905T051008Z-1U8S/functional/reload/28-preparing/log/scheduler/log
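A hedged reconstruction of how the continue guard above behaves (the loop and variable names are hypothetical, not copied from cylc.flow): skip any task whose submit command has already been queued, so flushing preparing tasks before a reload queues each command at most once.

queued_ids = set()
command_queue = []

# Simulate the same tasks being flushed on successive main-loop passes:
for task_id in ["1/foo", "1/bar", "1/foo", "1/bar"]:
    if task_id in queued_ids:
        # Avoid duplicate job submissions when flushing
        # preparing tasks before a reload. See
        # https://github.com/cylc/cylc-flow/pull/6345
        continue
    queued_ids.add(task_id)
    command_queue.append(("job-submit", task_id))

print(command_queue)  # one entry per task, no duplicates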
itask.waiting_on_job_prep = False

if not job_log_dirs:
    continue
Close #6344 - tasks in the preparing state at reload time will submit multiple times.

My tentative fix prevents queuing a command to the subprocess pool if the same command is already queued or running. This works, but I haven't grokked the root cause well enough to see if we can prevent the queue attempt in the first place.

It's something to do with the fact that we now wait for preparing tasks to submit before actioning the requested reload (which I think is new-ish) ... and maybe how that interacts with the evil "pre-prep" task list.
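A minimal sketch of the tentative fix just described, assuming (this is not the real cylc.flow.subprocpool interface) that the pool can compare an incoming command against those already queued or running.

from collections import deque

class SubProcPoolSketch:
    def __init__(self):
        self.queued = deque()   # commands waiting for a process slot
        self.running = set()    # commands currently executing

    def put_command(self, cmd):
        """Queue cmd unless an identical command is queued or running."""
        if cmd in self.queued or cmd in self.running:
            return False  # drop the duplicate rather than re-queue it
        self.queued.append(cmd)
        return True

pool = SubProcPoolSketch()
print(pool.put_command(("job-submit", "1/foo")))  # True: queued
print(pool.put_command(("job-submit", "1/foo")))  # False: duplicate dropped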
Check List

- I have read CONTRIBUTING.md and added my name as a Code Contributor.
- … setup.cfg (and conda-environment.yml if present).
- … ?.?.x branch.