stuck in preparing state #4974
Comments
It looks like the cause is a mix-up of submit numbers. When we perform a reload we make a copy of the task (`cylc/flow/task_pool.py`, lines 845 to 848 at 8ff0c80).

The copy is then triggered, incrementing its submit number. If something somewhere is still holding a reference to the original task, however, that could explain the submit-number disparity. A possible mechanism is sketched below.
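To make that concrete, here is a minimal, self-contained sketch of the suspected mechanism. The names (`Task`, `pool`, `pre_prep`) are hypothetical stand-ins, not cylc's actual classes; it simply assumes a reload replaces the pool entry with a copy while another list still references the original object.

```python
from dataclasses import dataclass


@dataclass
class Task:
    identifier: str
    submit_num: int = 0


# the scheduler's task pool, plus a separate list remembering tasks that
# are awaiting job preparation
pool = {"1/foo": Task("1/foo")}
pre_prep = [pool["1/foo"]]  # holds a reference to the same object

# reload: the pool entry is replaced with a fresh copy of the task
pool["1/foo"] = Task("1/foo", submit_num=pool["1/foo"].submit_num)

# the copy in the pool is then triggered, incrementing its submit number
pool["1/foo"].submit_num += 1

# the stale reference still carries the old number, so the two views of
# "the same task" now disagree
assert pool["1/foo"].submit_num == 1
assert pre_prep[0].submit_num == 0
```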
The issue was resolved via a restart, which suggests that the problem was in scheduler state that is not mirrored in the DB.
Haven't managed to reproduce it; however, this test shows how reload can interfere with the `pre_prep_tasks` list:

```python
async def test_pre_prep_reload(one, start):
    """Ensure reload does not interfere with the task preparation pipeline."""
    async with start(one):
        one.resume_workflow()
        one.pool.release_runahead_tasks()
        one.release_queued_tasks()
        assert len(one.pre_prep_tasks) == 1
        one.command_reload_workflow()
        # after the reload the pool holds a fresh copy of the task, so it no
        # longer matches the object still referenced by pre_prep_tasks
        assert one.pool.get_tasks()[0] == one.pre_prep_tasks[0]
```
The underlying issue here is that the `pre_prep_tasks` list holds on to the pre-reload task objects after the pool has been given fresh copies. This causes the submit number to be incremented on one copy of the task but not the other. Solutions:
I think we should do (1) (quick n' easy); to close this issue we would also need to do either (2) or (3).
I was thinking (2) before I read your solutions. I can't imagine it would be too hard to update the task pool first during the submission process?
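As a rough illustration of that idea, the sketch below re-resolves tasks against the current pool just before submission, so that any reload-created copies are the objects that actually go through the pipeline. The function and argument names are hypothetical, not cylc's actual API.

```python
def release_for_submission(pool, pending_ids):
    """Return the current pool objects for tasks awaiting job preparation.

    Looking the tasks up in the pool at submission time (rather than caching
    the task objects earlier) means a reload-created copy is what gets
    submitted, not a stale pre-reload reference.
    """
    released = []
    for task_id in pending_ids:
        task = pool.get(task_id)
        if task is not None:  # the task may have been removed by the reload
            released.append(task)
    return released
```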
* job: increment the submission number at preparation time
  * Addresses #4974
  * The job submission number used to be incremented *after* submission (i.e. only once there is a "submission" of which to speak).
  * However, we also incremented the submission number if submission (or preparation) failed (in which cases there isn't really a "submission", but we need one for internal purposes).
  * Now the submission number is incremented when tasks enter the "preparing" state.
  * This resolves an issue where jobs which were going through the submission pipeline during a reload got badly broken in the scheduler (until restarted).
* scheduler: re-compute pre_prep_tasks for each iteration
  * Addresses #4974
  * Tasks which are awaiting job preparation used to be stored in `Scheduler.pre_prep_tasks`; however, this effectively created an intermediate "task pool" which had nasty interactions with reload.
  * This commit removes the pre_prep_tasks list by merging the listing of these tasks in with `TaskPool.release_queued_tasks` (to avoid unnecessary task pool iteration).
  * `waiting_on_job_prep` now defaults to `False` rather than `True`.
* platforms: don't re-try, re-attempt submission
  * Previously, if submission on a host failed with 255 (an SSH error), we put a submission retry on the task to allow it to retry on another host, and we decremented the submission number to make it look like the same attempt.
  * Now we set the flag which sends the task back through the submission pipeline, allowing it to retry without intermediate state changes.
* changelog [skip ci]

Co-authored-by: Tim Pillinger <[email protected]>
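As a toy illustration of the first change above (allocating the submit number when a task enters the preparing state rather than after submission), here is a minimal sketch. The class and method names are hypothetical, not cylc's actual code.

```python
class TaskProxy:
    """Toy stand-in for a task, tracking only its state and submit number."""

    def __init__(self):
        self.submit_num = 0
        self.state = "waiting"

    def enter_preparing(self):
        # new behaviour: allocate the submit number up front, so a job that
        # goes through preparation again (e.g. after a failed submission or
        # a reload) gets a new number rather than reusing the previous one
        self.submit_num += 1
        self.state = "preparing"

    def submitted(self):
        # old behaviour incremented submit_num here, after submission, which
        # meant a subsequent preparing job shared the previous job's number
        self.state = "submitted"


task = TaskProxy()
task.enter_preparing()
assert task.submit_num == 1
task.submitted()
task.enter_preparing()  # e.g. a retried submission gets a distinct number
assert task.submit_num == 2
```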
Tasks can get "stuck" in the preparing state. This has been seen in the wild twice, but we are struggling to reproduce it.
Possibly related to reloads; possibly caused by a mismatch of submission numbers (we increment the submit number on submission, so a subsequent preparing job will have the same submit number as the previous instance), but we are not sure.
Order of events:

```
[jobs-submit ret_code] 1
[jobs-submit out] ...|cycle/task/01|1
```
At the end of this, job 02 is still marked as preparing; however, nothing is actually being prepared any more, submission having seemingly failed long ago, just after the unhandled jobs-submit output. The task is not killable (because the preparing state is not killable).