
Fix duplicate job submissions on reload. #6345

Merged: 2 commits merged into cylc:8.3.x on Sep 12, 2024

Conversation

@hjoliver (Member) commented Sep 2, 2024

Close #6344 - tasks in the preparing state at reload time will submit multiple times.

My tentative fix prevents queuing a command to the subprocess pool if the same command is already queued or running. This works, but I haven't grokked the root cause well enough to see if we can prevent the queue attempt in the first place.

It's something to do with the fact that we now wait for preparing tasks to submit before actioning the requested reload (which I think is new-ish) ... and maybe how that interacts with the evil "pre-prep" task list.
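
For illustration, a minimal sketch of the kind of guard described above, assuming a simple pool with queued and running command lists (the CommandPool class and its attributes are hypothetical stand-ins, not the actual cylc-flow subprocess pool API):

# Hypothetical sketch of the de-duplication guard: refuse to queue a
# subprocess command if an identical command is already queued or running,
# so the same job-submit command cannot be enqueued twice.
from typing import List, Tuple

Cmd = Tuple[str, ...]


class CommandPool:
    def __init__(self) -> None:
        self.queued: List[Cmd] = []
        self.running: List[Cmd] = []

    def queue_command(self, cmd: Cmd) -> bool:
        """Queue cmd, unless the same command is already queued or running."""
        if cmd in self.queued or cmd in self.running:
            return False  # duplicate: drop it rather than submit the job again
        self.queued.append(cmd)
        return True


pool = CommandPool()
assert pool.queue_command(("cylc", "jobs-submit", "1/foo"))
assert not pool.queue_command(("cylc", "jobs-submit", "1/foo"))  # duplicate dropped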

Check List

  • I have read CONTRIBUTING.md and added my name as a Code Contributor.
  • Contains logically grouped changes (else tidy your branch by rebase).
  • Does not contain off-topic changes (use other PRs for other changes).
  • Applied any dependency changes to both setup.cfg (and conda-environment.yml if present).
  • Tests are included (or explain why tests are not needed).
  • Changelog entry included if this is a change that can affect users
  • Cylc-Doc pull request opened if required at cylc/cylc-doc/pull/XXXX.
  • If this is a bug fix, PR should be raised against the relevant ?.?.x branch.

@hjoliver hjoliver added the bug Something is wrong :( label Sep 2, 2024
@hjoliver hjoliver added this to the 8.3.4 milestone Sep 2, 2024
@hjoliver hjoliver self-assigned this Sep 2, 2024
@hjoliver hjoliver changed the title from "First hack at a dupl job submission fix." to "Fix duplicate job submissions on reload." on Sep 2, 2024
@hjoliver hjoliver force-pushed the fix-more-dupl-submt branch from 2d672db to 8d77c6b on September 2, 2024 05:45
@oliver-sanders (Member) commented:

tasks in the preparing state at reload time will submit multiple times.

Yikes!

It's something to do with the fact that we now wait for preparing tasks to submit before actioning the requested reload (which I think is new-ish)

It is new-ish, but I would have thought that this change would have made reload safer in this regard because there can't be any preparing tasks at the time of the reload?

I haven't grokked the root cause well enough to see if we can prevent the queue attempt in the first place.

Could definitely do with pinning this down, as it's likely an interaction of an internal list or task state with some other part of the system, which could potentially spring a leak under other circumstances too.

@hjoliver (Member, Author) commented Sep 2, 2024

It is new-ish, but I would have thought that this change would have made reload safer in this regard because there can't be any preparing tasks at the time of the reload?

That was definitely the intention, but whilst waiting for the preparing tasks to submit we repeatedly add the same job-submit commands to the process pool command queue.

[Note to self for tomorrow: my fix here might not be good enough, if batch job submission is not deterministic in terms of batch membership, during the pre_prep period...]

@oliver-sanders (Member) commented Sep 2, 2024

but whilst waiting for the preparing tasks to submit we repeatedly add the same job-submit commands to the process pool command queue.

Oh heck, whilst!

The cylc.flow.commands.reload_workflow routine contains a mini main-loop: the small subset of functionality required to flush preparing tasks through, wrapped in a while schd.release_queued_tasks(): loop. I wonder if there's some other bit of logic required that isn't contained within this loop?
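
For context, a simplified, self-contained sketch of that mini main-loop shape (FakeScheduler is a stand-in; only the while schd.release_queued_tasks(): loop condition comes from the real code quoted above):

import asyncio


class FakeScheduler:
    """Stand-in for the real Scheduler; only the loop shape matters here."""

    def __init__(self, n_preparing: int) -> None:
        self.n_preparing = n_preparing

    def release_queued_tasks(self) -> bool:
        """Return True while there is still preparing work to flush through."""
        if self.n_preparing:
            self.n_preparing -= 1  # pretend one task made it to submission
            return True
        return False


async def flush_preparing_tasks(schd: FakeScheduler) -> None:
    # Keep cycling until nothing is left to flush; only then does the
    # requested reload proceed.
    while schd.release_queued_tasks():
        # In the real routine, a small subset of main-loop functionality
        # runs here to process the resulting job submissions.
        await asyncio.sleep(0)  # yield to the event loop between passes


asyncio.run(flush_preparing_tasks(FakeScheduler(n_preparing=3)))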

@oliver-sanders (Member) commented Sep 2, 2024

This seems to fix it:

diff --git a/cylc/flow/scheduler.py b/cylc/flow/scheduler.py
index 92702b0b5..9648b025a 100644
--- a/cylc/flow/scheduler.py
+++ b/cylc/flow/scheduler.py
@@ -1245,7 +1245,7 @@ class Scheduler:
             # don't release queued tasks, finish processing preparing tasks
             pre_prep_tasks = [
                 itask for itask in self.pool.get_tasks()
-                if itask.state(TASK_STATUS_PREPARING)
+                if itask.waiting_on_job_prep
             ]
 
         # Return, if no tasks to submit.

We use the waiting_on_job_prep flag rather than the preparing status in the task_pool for the release-from-queue side of this.

@hjoliver (Member, Author) commented Sep 2, 2024

Yes!! I was homing in on that flag myself. I think that's got it.

Nope, damn it. That fixes duplicate job submissions, but it does so by no longer waiting for preparing tasks to clear before doing the reload. The thing is, .waiting_on_job_prep is just a subset of the preparing task state.

I guess the question is, do we need to wait for preparing tasks to clear, or just for those with .waiting_on_job_prep? If the latter, then I just need to fix an integration test that is really testing the former. Investigating...
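
Schematically (plain Python with a stand-in task class, not cylc-flow code), the two filters select nested sets, which is why swapping one for the other changes what the reload waits for:

# Schematic only: every task waiting on job prep is also in the preparing
# state, but not vice versa, so filtering on the flag waits for fewer
# tasks than filtering on the state.
from dataclasses import dataclass
from typing import List


@dataclass
class FakeTask:
    name: str
    is_preparing: bool
    waiting_on_job_prep: bool


def by_state(tasks: List[FakeTask]) -> List[FakeTask]:
    return [t for t in tasks if t.is_preparing]


def by_flag(tasks: List[FakeTask]) -> List[FakeTask]:
    return [t for t in tasks if t.waiting_on_job_prep]


tasks = [
    FakeTask("a", is_preparing=True, waiting_on_job_prep=True),    # still in job prep
    FakeTask("b", is_preparing=True, waiting_on_job_prep=False),   # job prep already done
    FakeTask("c", is_preparing=False, waiting_on_job_prep=False),  # not preparing at all
]
assert len(by_state(tasks)) == 2 and len(by_flag(tasks)) == 1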

@hjoliver hjoliver force-pushed the fix-more-dupl-submt branch from 8d77c6b to 37b6593 on September 3, 2024 02:32
@oliver-sanders (Member) commented:

I guess the question is, do we need to wait for preparing tasks to clear

Yes, this shifted other bugs.

@hjoliver (Member, Author) commented Sep 3, 2024

I guess the question is, do we need to wait for preparing tasks to clear

Yes, this shifted other bugs.

Yes, but I'm asking if those bugs were solely due to preparing (waiting_on_job_prep), or preparing more generally.

@oliver-sanders (Member) commented:

In the case of the auto-restart functionality, it's preparing more generally.

The auto-restart functionality restarts the workflow on another host, so we must wait for all localhost task submissions to complete first, because the localhost platform will be different on the new host.

@hjoliver (Member, Author) commented Sep 3, 2024

So @oliver-sanders - just to make sure we're on the same page here, and so I can hopefully solve this problem tomorrow:

Your suggestion above to use the itask.waiting_on_job_prep flag, i.e.:

We use the waiting_on_job_prep flag rather than the preparing status in the task_pool for the release-from-queue side of this.

seems to be at odds with what you've just said (we have to wait for ALL preparing tasks to clear, not just those waiting on job prep):

In the case of the auto-restart functionality, it's preparing more generally.

Assuming the latter comment overrides the former, I take it you now think we need to keep the original code in the scheduler module, and come up with a different solution to prevent the duplicate submissions?

@oliver-sanders (Member) commented:

Sorry, I wasn't proposing it as a fix (else I would have opened a PR), just pointing out that it seemed to fix the example in the issue.

@hjoliver hjoliver force-pushed the fix-more-dupl-submt branch from 37b6593 to 44c00c0 on September 4, 2024 04:16
@hjoliver (Member, Author) commented Sep 4, 2024

OK got it.

I resorted to a functional test as I wasn't sure how to do better in the time available today.

@hjoliver hjoliver marked this pull request as ready for review September 4, 2024 05:29
@hjoliver (Member, Author) commented Sep 5, 2024

I can't figure out the one seemingly repeatable functional test failure on this PR: tests/f/triggering/08-family-finish-any.

Pretty sure it's unrelated. It only fails in the macOS CI run; it passes in the Linux run, and it passes locally on macOS for me.

    triggering is NOT consistent with the reference log:
    --- reference
    +++ this run
    -1/foo -triggered off ['1/b']
    +1/foo -triggered off ['1/a', '1/b', '1/c']

The test workflow graph is:

   R1 = """FAM:finish-any => foo"""  # FAM is a, b, c

Task b has script = True, and a and c have script = sleep 10, but in the test run b takes exactly 10 seconds to run, just like the other two tasks, so foo triggers off all of them at once, hence the result. I downloaded the test artifact, and b's job logs confirm the 10 second run time and that its script really is just true. 🤯

[UPDATE] Damn it, kicking the macOS test batch for a third time worked. Well, that was a complete waste of time. I'll leave the above comment in just in case it indicates that the test is fundamentally flaky though. (Maybe by coincidence the system load was such that the "fast" task took exactly 10 seconds too...)

@dwsutherland (Member) left a review:

Works.
We would have run into this at NIWA sooner or later 👍

# Avoid duplicate job submissions when flushing
# preparing tasks before a reload. See
# https://github.com/cylc/cylc-flow/pull/6345
continue
@dwsutherland (Member) commented on the hunk above:
Commenting this out, the test fails (as expected):
Found 12 (not 1) of 1/foo.*submitted to in /home/sutherlander/cylc-run/cylctb-20240905T051008Z-1U8S/functional/reload/28-preparing/log/scheduler/log
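
That check amounts to counting matches of a pattern in the scheduler log and requiring exactly one; roughly, in plain Python (a sketch only, not the actual cylc functional-test harness):

# Plain-Python sketch of the kind of check behind the failure message above:
# count how many times 1/foo was submitted according to the scheduler log
# and require exactly one submission. The path and pattern are illustrative.
import re
from pathlib import Path


def count_matches(log_path: Path, pattern: str = r"1/foo.*submitted to") -> int:
    """Count scheduler log lines recording a job submission for 1/foo."""
    return sum(
        1 for line in log_path.read_text().splitlines()
        if re.search(pattern, line)
    )

# e.g. assert count_matches(Path("log/scheduler/log")) == 1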

itask.waiting_on_job_prep = False

if not job_log_dirs:
    continue
@dwsutherland (Member) commented Sep 5, 2024 on the hunk above:
Commenting this out, the test passes, but:

[image attachment]

So clearly both changes are needed.

@oliver-sanders oliver-sanders linked an issue Sep 11, 2024 that may be closed by this pull request
@oliver-sanders oliver-sanders merged commit 2837fe2 into cylc:8.3.x Sep 12, 2024
25 checks passed
@hjoliver hjoliver deleted the fix-more-dupl-submt branch September 13, 2024 04:30
Labels: bug
Milestone: 8.3.4
Linked issue: Duplicate job submissions on reload (#6344)