stuck in preparing state #4974

Closed · oliver-sanders opened this issue Jul 8, 2022 · 5 comments · Fixed by #4984

@oliver-sanders (Member)

Tasks can get "stuck" in the preparing state. Seen in the wild twice, but struggling to reproduce.

Possibly related to reloads; possibly caused by a submission-number mismatch (we increment the submit number on submission, so a subsequent preparing job will have the same submit number as a previous instance).

Order of events:

  • submission 01 preparing
  • submission 01 submit-failed
  • reload # no change
  • submission 02 preparing
  • unhandled jobs-submit output ... same-task/02 ...
  • [jobs-submit cmd] ... same-task/02
    [jobs-submit ret_code] 1
    [jobs-submit out] ...|cycle/task/01|1
  • submission 01 (polled-ignored) submission failed # i.e. previous job showed up but was ignored

At the end of this, job 02 is still marked as preparing; however, it is no longer preparing anything, submission seemingly having failed long ago, just after the unhandled jobs-submit output. The task is not killable (because the preparing state is not killable).

@oliver-sanders oliver-sanders added the bug Something is wrong :( label Jul 8, 2022
@oliver-sanders oliver-sanders added this to the cylc-8.0.0 milestone Jul 8, 2022
@oliver-sanders oliver-sanders changed the title preparing stuck in preparing state Jul 8, 2022

oliver-sanders commented Jul 8, 2022

It looks like the cause is a mix-up of submit numbers.

When we perform reload we make a copy of the task:

new_task = TaskProxy(
    self.config.get_taskdef(itask.tdef.name),
    itask.point, itask.flow_nums, itask.state.status)
itask.copy_to_reload_successor(new_task)

The copy is then triggered, incrementing its submit number. If something out there somewhere is maintaining a reference to the original task, however, that could explain the submit-number disparity.

A possible mechanism is pre_prep_tasks, which is not updated on reload:

  • Run the task.
  • Pause the workflow.
  • Get the task into pre_prep_tasks somehow.
    • Probably not possible whilst the workflow is paused, but there may be other mechanisms besides workflow-paused: slow remote-init? auto retries? pausing and triggering in the same main loop cycle? pre_prep_tasks length above the batched submission limit?
  • Reload the workflow.
  • Release the workflow.
  • Now the task in pre_prep_tasks is the pre-reload copy, while the newer version is in the task_pool (see the sketch below)!
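
To illustrate the suspected failure mode, here is a minimal, self-contained sketch (the class and attribute names are illustrative stand-ins, not the real cylc-flow objects):

class FakeTaskProxy:
    """Illustrative stand-in for a task proxy."""
    def __init__(self, name, submit_num=0):
        self.name = name
        self.submit_num = submit_num

# Scheduler state before the reload.
task_pool = {'1/foo': FakeTaskProxy('1/foo')}
pre_prep_tasks = [task_pool['1/foo']]  # holds a reference to the same proxy

# Reload replaces the pool entry with a fresh copy of the task...
task_pool['1/foo'] = FakeTaskProxy('1/foo')
# ...but pre_prep_tasks still points at the pre-reload object.

# Submission then increments the submit number on the stale copy only.
pre_prep_tasks[0].submit_num += 1

assert task_pool['1/foo'].submit_num == 0           # pool copy never updated
assert pre_prep_tasks[0].submit_num == 1            # stale copy did the work
assert task_pool['1/foo'] is not pre_prep_tasks[0]  # two divergent proxies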

@oliver-sanders (Member, Author)

The issue was resolved via a restart, which suggests it was due to scheduler state not being mirrored in the DB, so the pre_prep_tasks theory is likely. An easy fix if so...

@oliver-sanders (Member, Author)

I haven't managed to reproduce it; however, this test shows how reload can interfere with the pre_prep_tasks pipeline:

async def test_pre_prep_reload(one, start):
    """Ensure reload does not interfere with the task preparation pipeline."""
    async with start(one):
        # Get the task into the pre_prep_tasks list.
        one.resume_workflow()
        one.pool.release_runahead_tasks()
        one.release_queued_tasks()
        assert len(one.pre_prep_tasks) == 1
        # Reload, then check the pooled task and the pre-prep task are
        # still the same proxy.
        one.command_reload_workflow()
        assert one.pool.get_tasks()[0] == one.pre_prep_tasks[0]


oliver-sanders commented Jul 11, 2022

The underlying issue here is that the submit_num is incremented during the job submission process. Because of the way reload works, this means that if a job is awaiting submission (i.e. if it is preparing or in the pre_prep_tasks list) when a reload occurs, the submit_num will be incremented on the pre-reload copy of the task (which exists in the submission pipeline) and not on the post-reload copy (which exists in the task pool).

This leaves the submit number one out of line, which means job polling will be looking for the wrong submission and job messaging will be associated with the wrong submission. If something goes wrong with preparation/submission, the task can become a ghost, so to speak, resulting in the stuck state reported.
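
To make the off-by-one concrete, here is a toy illustration (the helper and path layout below are hypothetical, just to show where polling would end up looking):

def job_log_dir(workflow, point, task, submit_num):
    # Hypothetical: job logs keyed by the zero-padded submit number.
    return f"~/cylc-run/{workflow}/log/job/{point}/{task}/{submit_num:02d}"

# The pre-reload copy (in the submission pipeline) submitted job 02...
actual_submit_num = 2
# ...but the post-reload copy (in the task pool) still thinks it is on 01,
# so polling and messaging get matched against the wrong submission.
pool_submit_num = 1

print(job_log_dir('wflow', '1', 'foo', actual_submit_num))  # where the job really is
print(job_log_dir('wflow', '1', 'foo', pool_submit_num))    # where the scheduler looks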

Solutions:

  1. Remove the pre_prep_tasks list.
    • I don't think we need to maintain this from cycle-to-cycle, we can build it when we need it.
    • This solves the issue for reloads on paused workflows (because released tasks build up in pre_prep_tasks).
  2. Consider incrementing the submission number before preparation rather than during/after submission.
    • I think this makes more logical sense, especially since a failure in preparation results in the submit-failed state (and an increment in the submission number anyway).
    • At the moment the logs are confusing to read, you will see entries like this all referring to the same submission:
      • Queue released: 1/foo
      • [1/foo waiting job:00 flows:1] => preparing (00 preparing)
      • [1/foo preparing job:01 flows:1] host=<host> (01 preparing ???)
      • [1/foo preparing job:01 flows:1] (internal)submitted at <time> (01 submitted)
    • Would also make the submit_num logic a bit more watertight as we would only be incrementing the number in one place rather than many (e.g. we actually decrement the submission number after SSH failures, then re-increment it once we've exhausted available platform-hosts ATM).
  3. Don't re-create tasks from scratch at reload, just swap out the TaskDef and perform any required updates.
    • Potentially harder, we would need to update prereqs/outputs.
    • This avoids the issue of having two copies of the task kicking around with all of the potential headaches that can cause.
  4. Other?

I think we should do (1) (quick n' easy); to close this issue we would also need to do either (2) or (3).
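
A rough sketch of what (2) might look like (simplified and hypothetical, not the actual cylc-flow task-state code):

class TaskProxySketch:
    """Hypothetical, simplified task proxy for illustration only."""
    def __init__(self):
        self.state = 'waiting'
        self.submit_num = 0

    def set_preparing(self):
        # Increment exactly once, at the point the task enters "preparing",
        # rather than during/after submission (and never decrement it for
        # SSH failure / host re-selection).
        self.submit_num += 1
        self.state = 'preparing'

    def set_submitted(self):
        # Submission (or submit-failure) no longer touches submit_num.
        self.state = 'submitted'

itask = TaskProxySketch()
itask.set_preparing()
itask.set_submitted()
assert itask.submit_num == 1  # one increment per preparation attempt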

@dwsutherland (Member)

I was thinking (2) before I read your solutions...

I can't imagine it would be too hard to update the task pool first during the submission process?

@oliver-sanders oliver-sanders self-assigned this Jul 12, 2022
oliver-sanders added a commit to oliver-sanders/cylc-flow that referenced this issue Jul 15, 2022
* Addresses cylc#4974
* Job submission number used to be incremented *after* submission
  (i.e. only once there is a "submission" of which to speak).
* However, we also incremented the submission number if submission
  (or preparation) failed (in which cases there isn't really a
  "submission" but we need one for internal purposes).
* Now the submission number is incremented when tasks enter the
  "preparing" state.
* This resolves an issue where jobs which were going through the
  submission pipeline during a reload got badly broken in the scheduler
  (until restarted).
oliver-sanders added a commit to oliver-sanders/cylc-flow that referenced this issue Jul 15, 2022
* Addresses cylc#4974
* Tasks which are awaiting job preparation used to be stored in
  `Scheduler.pre_prep_tasks`, however, this effectively created an
  intermediate "task pool" which had nasty interactions with reload.
* This commit removes the pre_prep_tasks list by merging the listing
  of these tasks in with TaskPool.release_queued_tasks (to avoid
  unnecessary task pool iteration).
* `waiting_on_job_prep` now defaults to `False` rather than `True`.
oliver-sanders added a commit to oliver-sanders/cylc-flow that referenced this issue Jul 19, 2022
wxtim added a commit that referenced this issue Jul 21, 2022
* job: increment the submission number at preparation time

* Addresses #4974
* Job submission number used to be incremented *after* submission
  (i.e. only once there is a "submission" of which to speak).
* However, we also incremented the submission number if submission
  (or preparation) failed (in which cases there isn't really a
  "submission" but we need one for internal purposes).
* Now the submission number is incremented when tasks enter the
  "preparing" state.
* This resolves an issue where jobs which were going through the
  submission pipeline during a reload got badly broken in the scheduler
  (until restarted).

* scheduler: re-compute pre_prep_tasks for each iteration

* Addresses #4974
* Tasks which are awaiting job preparation used to be stored in
  `Scheduler.pre_prep_tasks`, however, this effectively created an
  intermediate "task pool" which had nasty interactions with reload.
* This commit removes the pre_prep_tasks list by merging the listing
  of these tasks in with TaskPool.release_queued_tasks (to avoid
  unnecessary task pool iteration).
* `waiting_on_job_prep` now defaults to `False` rather than `True`.

* platforms: don't re-try, re-attempt submission

* Previously, if submission on a host failed with 255 (SSH error), we put
  a submission retry on it to allow the task to retry on another host.
  We decremented the submission number to make it look like the same
  attempt.
* Now we set the flag which sends the task back through the submission
  pipeline, allowing it to retry without intermediate state changes.

* changelog [skip ci]

Co-authored-by: Tim Pillinger <[email protected]>
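
For context, a minimal sketch of the "flag instead of decrement" idea from the last commit bullet above (the handler and host-selection details are illustrative, not the real implementation; only waiting_on_job_prep is a real attribute mentioned in this thread):

class TaskSketch:
    """Illustrative stand-in for a task proxy."""
    def __init__(self):
        self.submit_num = 1
        self.waiting_on_job_prep = False

def handle_ssh_failure(itask, remaining_hosts):
    # Old approach: add a submission retry and decrement submit_num so the
    # re-attempt looked like the same submission.
    # New approach: flag the task so it goes back through the submission
    # pipeline with the same submit_num and no intermediate state change.
    itask.waiting_on_job_prep = bool(remaining_hosts)

itask = TaskSketch()
handle_ssh_failure(itask, remaining_hosts=['other_host'])
assert itask.submit_num == 1 and itask.waiting_on_job_prep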