
reload: fix submission errors for jobs awaiting preparation #4984

Merged (5 commits) on Jul 21, 2022

Conversation

@oliver-sanders (Member) commented Jul 15, 2022

Closes #4974

When we reload a workflow we create new TaskProxy instances for all non-active tasks in the pool and replace the previous instances with these new ones. The job submission pipeline maintains references to the pre-reload TaskProxy instances, which means that if the workflow is reloaded whilst a task is going through job prep, the pre-reload task gets submitted but the post-reload task is left in the pool. This results in submission issues as the submit numbers of the two instances are off by one.

For busy workflows, where the job submission pipeline is constantly churning away, this bug is almost guaranteed to strike with any reload operation.
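
As a minimal, self-contained illustration of the stale-reference problem (the class and identifiers below are simplified stand-ins, not the real Cylc objects):

    class TaskProxy:
        def __init__(self, identity, submit_num=0):
            self.identity = identity
            self.submit_num = submit_num

    # task pool before the reload
    pool = {'1/foo': TaskProxy('1/foo')}

    # the job submission pipeline holds a reference to the pool instance
    awaiting_prep = [pool['1/foo']]

    # a reload replaces the pool entry with a brand new instance
    pool['1/foo'] = TaskProxy('1/foo')

    # the pre-reload instance gets submitted, so only its submit number is
    # incremented; the post-reload instance left in the pool never sees it
    awaiting_prep[0].submit_num += 1

    assert pool['1/foo'].submit_num != awaiting_prep[0].submit_num  # off by one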

This PR attempts to resolve this issue by:

  1. Ensuring job submission is fed the same TaskProxy instance as present in the task pool.
  2. Incrementing the job submission number at preparation time (rather than after job submit).

Explanation (a rough sketch of both changes follows this list):

  1. scheduler: re-compute pre_prep_tasks for each iteration

    • Addresses stuck in preparing state #4974
    • Tasks which are awaiting job preparation used to be stored in
      Scheduler.pre_prep_tasks; however, this effectively created an
      intermediate "task pool" which had nasty interactions with reload.
    • This commit removes the pre_prep_tasks list by merging the listing
      of these tasks into TaskPool.release_queued_tasks (to avoid
      unnecessary task pool iteration).
    • waiting_on_job_prep now defaults to False rather than True.
  2. job: increment the submission number at preparation time

    • Addresses stuck in preparing state #4974
    • Job submission number used to be incremented after submission
      (i.e. only once there is a "submission" of which to speak).
    • However, we also incremented the submission number if submission
      (or preparation) failed (in which cases there isn't really a
      "submission" but we need one for internal purposes).
    • Now the submission number is incremented when tasks enter the
      "preparing" state.
    • This resolves an issue where jobs which were going through the
      submission pipeline during a reload got badly broken in the scheduler
      (until restarted).
    • It also makes the scheduler logs nicer as the submission number for
      preparing tasks now matches that of the subsequent submitted or
      submit-failed outcome.
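
A rough sketch of the two changes, using simplified stand-ins rather than the actual Scheduler/TaskPool code (release_queued_tasks and waiting_on_job_prep are real names from this PR; the set_preparing helper is illustrative only):

    TASK_STATUS_PREPARING = 'preparing'

    class TaskProxy:
        def __init__(self, identity):
            self.identity = identity
            self.submit_num = 0
            self.state = 'waiting'
            self.waiting_on_job_prep = False  # now defaults to False

    class TaskPool:
        def __init__(self, tasks):
            self.pool = {t.identity: t for t in tasks}

        def release_queued_tasks(self):
            # The list of tasks awaiting job preparation is recomputed from
            # the live pool on every main-loop iteration, so after a reload
            # the submission pipeline only ever sees the current TaskProxy
            # instances, never stale pre-reload references.
            return [t for t in self.pool.values() if t.waiting_on_job_prep]

    def set_preparing(itask):
        # Increment the submit number as the task *enters* the preparing
        # state, so it already matches the subsequent submitted or
        # submit-failed outcome (previously it was only bumped after
        # submission, or after a failed submission/preparation).
        itask.submit_num += 1
        itask.state = TASK_STATUS_PREPARING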

Testing this one is a nightmare; I can't really see a way to do it meaningfully. Look at the issue for example tests which demonstrate how tasks could get stuck.

Requirements check-list

  • I have read CONTRIBUTING.md and added my name as a Code Contributor.
  • Contains logically grouped changes (else tidy your branch by rebase).
  • Does not contain off-topic changes (use other PRs for other changes).
  • Applied any dependency changes to both setup.cfg and conda-environment.yml.
  • Already covered by existing tests, cannot really test this
  • Appropriate change log entry included.
  • No documentation update required.

@oliver-sanders added the bug ("Something is wrong :(") label on Jul 15, 2022
@oliver-sanders added this to the cylc-8.0.0 milestone on Jul 15, 2022
@oliver-sanders self-assigned this on Jul 15, 2022
@oliver-sanders (Member Author)

This leaves one remaining question of reload safety:

if itask.state(*TASK_STATUSES_ACTIVE, TASK_STATUS_PREPARING):
    LOG.warning(
        f"[{itask}] active with pre-reload settings"
    )

Should preparing tasks be included here? I think there is the potential for preparation to begin with pre-reload settings but possibly to receive post-reload settings later on?

@oliver-sanders requested a review from hjoliver on July 15, 2022 10:56
@hjoliver (Member)

It also makes the scheduler logs nicer as the submission number for
preparing tasks now matches that of the subsequent submitted or
submit-failed outcome.

👍 must admit this annoyed me!

@hjoliver (Member)

Your explanation sounds reasonable, but the functional tests aren't happy.

This leaves one remaining question of reload safety:...

While thinking about how to test this, I discovered this bug: #4987

@hjoliver (Member)

(Agreed good to have this fix in 8.0 if we can nail it in time).

@oliver-sanders (Member Author) commented Jul 19, 2022

the functional tests aren't happy.

Couple of small breaks, will investigate, expect I can shift them quickly.

(Agreed good to have this fix in 8.0 if we can nail it in time).

Unfortunately I think this is 8.0 essential as at the moment cylc reload is almost guaranteed to break workflows every time. The workaround is to restart.

@oliver-sanders (Member Author)

I had got my if/elif branches the wrong way around, fixed, think that should hold it.

* Previously, if submission on a host failed with 255 (SSH error), we put
  a submission retry on the task to allow it to retry on another host.
  We decremented the submission number to make it look like the same
  attempt.
* Now we set the flag which sends the task back through the submission
  pipeline, allowing it to retry without intermediate state changes.
@hjoliver (Member) commented Jul 20, 2022

This leaves one remaining question of reload safety:

if itask.state(*TASK_STATUSES_ACTIVE, TASK_STATUS_PREPARING):
    LOG.warning(
        f"[{itask}] active with pre-reload settings"
    )

Should preparing tasks be included here? I think there is the potential for preparation to begin with pre-reload settings but possibly to receive post-reload settings later on?

I think we include preparing here because pre-reload settings are fixed once written to the job file. Presumably it's possible to reload a preparing task before it writes the job file, though. Maybe we should have a separate warning for preparing tasks: "may be active with pre-reload settings". Or be specific by flagging whether the job file has been written yet or not. (Can be a follow-up if needed).

@hjoliver (Member) left a comment:

Makes sense to me, and the tests are happy now.

@hjoliver (Member)

(The failed functional test is just codecov upload)

@dwsutherland self-requested a review on July 20, 2022 03:08
@dwsutherland (Member) left a comment:

Works, tests passed, and the code makes sense.
From what I can see, the order of events in the main loop is:

  • Do reload if flagged (this reloads the task pool at this point, along with the DB and data store etc)
  • Process the command queue:
    -- the reload command sets up the new config, then reconfigures broadcasts, the pool (config only), task events, and the DB... which flags the reload.
  • Releases/computes runahead (uses the pre-reload pool with the new post-reload config)
  • Proc Pool Processed
  • Checks triggers (xtrigger etc), and sets expired tasks
  • Releases queued tasks (self.release_queued_tasks() changed with this PR) (still pre-reload pool)
  • . . .
  • Processes queued task messages
  • Processes the command queue (Again)
  • Processes task events
  • DB, data-store, health checks, shutdown ...etc
  • Sleep interval..

END LOOP
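
Purely as an illustration, the ordering above as a skeleton (the method names here are placeholders, not the actual Scheduler API):

    async def main_loop(schd):
        while not schd.should_shutdown():
            if schd.reload_pending:
                schd.reload_task_pool()           # do reload if flagged
            schd.process_command_queue()          # 'reload' reconfigures and
                                                  # flags the pool reload
            schd.release_runahead_tasks()         # still the pre-reload pool
            schd.process_proc_pool()
            schd.check_xtriggers_and_expire_tasks()
            schd.release_queued_tasks()           # changed by this PR
            schd.process_queued_task_messages()
            schd.process_command_queue()          # again
            schd.process_task_events()
            schd.update_db_data_store_and_health_check()
            await schd.sleep_interval()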

Probably a bigger issue to look into the ideal order of events; however, there are (intentionally or otherwise) a number of things that happen between processing a reload command (reconfiguring) and reloading the task pool.

Would it be better to do the reload straight after processing the reload command (setting up the new config)?
Because at present tasks are potentially released/prepped between setting up a new config and reloading the task pool. Perhaps the first two steps can be switched? (or does it not matter?)

@oliver-sanders (Member Author)

Or be specific by flagging if the job file was written yet, or not. (Can be a follow-up if needed).

Not sure, needs a bit more investigation - #4990

@oliver-sanders (Member Author) commented Jul 20, 2022

Probably a bigger issue to look into the ideal order of events

Definitely. This code is fragile, as the order will be critical to particular niche behaviours, so it would need some time to unravel :(

Because at present tasks are potentially released/prepped between setting up a new config and reloading the task pool

Because of remote init etc, prep can span multiple main loop iterations so I don't think swapping the order will solve the problem outright.

Comment on lines -1040 to -1048
itask.submit_num -= 1
self.task_events_mgr._retry_task(
    itask, time(), submit_retry=True
)
return
@oliver-sanders (Member Author) commented on the diff:

@wxtim could you take a look at this bit to make sure you're happy.

  • It used to reset the task back to waiting and slap a submission-retry trigger on it.
  • It now leaves the task unchanged and sets a flag to send it back through job submission.
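
Roughly, as a sketch: the old behaviour mirrors the removed lines above, while the new-behaviour flag is presumably the waiting_on_job_prep attribute described in the PR, so treat the details as illustrative rather than the exact diff.

    from time import time

    def retry_on_ssh_failure_old(task_events_mgr, itask):
        # Old: roll the submit number back and reset the task to waiting
        # with a submission-retry trigger.
        itask.submit_num -= 1
        task_events_mgr._retry_task(itask, time(), submit_retry=True)

    def retry_on_ssh_failure_new(itask):
        # New: leave the task state and submit number unchanged; just flag
        # it to go back through the job submission pipeline (e.g. so it can
        # try another host).
        itask.waiting_on_job_prep = True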

@wxtim (Member) left a comment:

Looks legit and much nicer than my original solution 👍🏼

Labels: bug ("Something is wrong :(")
Projects: none yet
Development: successfully merging this pull request may close the issue "stuck in preparing state" (#4974)
4 participants