Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ihs: catch NoHostsError #5195

Merged
merged 2 commits into from
Dec 7, 2022
Merged

ihs: catch NoHostsError #5195

merged 2 commits into from
Dec 7, 2022

Conversation

oliver-sanders
Copy link
Member

@oliver-sanders oliver-sanders commented Oct 14, 2022

  • Catch NoHostsError in the code it can occur in.
  • Attempt to handle the error in these contexts.
  • Mark the functions where the error can crop up for future warning.

Check List

  • I have read CONTRIBUTING.md and added my name as a Code Contributor.
  • Contains logically grouped changes (else tidy your branch by rebase).
  • Does not contain off-topic changes (use other PRs for other changes).
  • Applied any dependency changes to both setup.cfg and conda-environment.yml.
  • Tests are included (or explain why tests are not needed).
  • CHANGES.md entry included if this is a change that can affect users
  • Cylc-Doc pull request opened if required at cylc/cylc-doc/pull/XXXX.
  • If this is a bug fix, PRs raised to both master and the relevant maintenance branch.

@oliver-sanders oliver-sanders added the bug Something is wrong :( label Oct 14, 2022
@oliver-sanders oliver-sanders added this to the cylc-8.0.4 milestone Oct 14, 2022
@oliver-sanders oliver-sanders self-assigned this Oct 14, 2022
@oliver-sanders oliver-sanders marked this pull request as ready for review October 18, 2022 12:26
@oliver-sanders oliver-sanders force-pushed the no-hosts-error branch 2 times, most recently from 02cd1b8 to e0fca23 Compare October 18, 2022 14:26
Copy link
Member

@wxtim wxtim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Read code
  • Checked out, run new tests
  • Proof of pudding test, played with code.

caplog.set_level(logging.WARN, 'cylc')
host_stats, data = _get_metrics(['not-a-host'], None)
# a warning should be logged
assert len(caplog.records) == 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any warning at all?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup.

Note: This warning could only come from _get_metrics which is a small simple function.

cylc/flow/task_events_mgr.py Show resolved Hide resolved
cylc/flow/task_job_mgr.py Show resolved Hide resolved
Copy link
Member

@MetRonnie MetRonnie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have found this does not address Dave's NoHostsError during shutdown

Traceback (most recent call last):
  File "~/cylc-flow/cylc/flow/scheduler.py", line 1742, in _shutdown
    self.proc_pool.terminate()
  File "~/cylc-flow/cylc/flow/subprocpool.py", line 345, in terminate
    self.process()
  File "~/cylc-flow/cylc/flow/subprocpool.py", line 224, in process
    self._proc_exit(
  File "~/cylc-flow/cylc/flow/subprocpool.py", line 206, in _proc_exit
    self._run_command_exit(
  File "~/cylc-flow/cylc/flow/subprocpool.py", line 529, in _run_command_exit
    res = _run_callback(callback_255, callback_255_args)
  File "~/cylc-flow/cylc/flow/subprocpool.py", line 481, in _run_callback
    callback(ctx, *args_)
  File "~/cylc-flow/cylc/flow/task_job_mgr.py", line 820, in _poll_task_jobs_callback_255
    self._manip_task_jobs_callback(
  File "~/cylc-flow/cylc/flow/task_job_mgr.py", line 806, in _manip_task_jobs_callback
    summary_callback(workflow, itask, ctx, line)
  File "~/cylc-flow/cylc/flow/task_job_mgr.py", line 829, in _poll_task_job_callback_255
    self.poll_task_jobs(workflow, [itask])
  File "~/cylc-flow/cylc/flow/task_job_mgr.py", line 210, in poll_task_jobs
    self._run_job_cmd(
  File "~/cylc-flow/cylc/flow/task_job_mgr.py", line 948, in _run_job_cmd
    host = get_host_from_platform(
  File "~/cylc-flow/cylc/flow/platforms.py", line 516, in get_host_from_platform
    raise NoHostsError(platform)

(line numbers here are as on master not this branch)

tests/unit/test_host_select.py Outdated Show resolved Hide resolved
@oliver-sanders
Copy link
Member Author

Rebased, added a try/except for kill & poll to address the reported traceback.

@MetRonnie
Copy link
Member

MetRonnie commented Oct 26, 2022

Hmm, with this I now get new traceback for an invalid ssh config

INFO - platform: exvcylcdev01 - remote init (on exvcylcdev01)
ERROR - [Errno 2] No such file or directory: 'jobs-poll'
    Traceback (most recent call last):
      File "~/cylc-flow/cylc/flow/subprocpool.py", line 429, in _run_command_init
        proc = procopen(
      File "~/cylc-flow/cylc/flow/cylc_subproc.py", line 55, in procopen
        process = Popen(command, bufsize, executable, stdin, stdout,  # nosec
      File "~/.conda/envs/cylc8/lib/python3.9/subprocess.py", line 951, in __init__
        self._execute_child(args, executable, preexec_fn, close_fds,
      File "~/.conda/envs/cylc8/lib/python3.9/subprocess.py", line 1821, in _execute_child
        raise child_exception_type(errno_num, err_msg, err_filename)
    FileNotFoundError: [Errno 2] No such file or directory: 'jobs-poll'
ERROR - [jobs-poll cmd] jobs-poll -- '$HOME/cylc-run/remote-example/run1/log/job'
    [jobs-poll ret_code] 1
    [jobs-poll err] [Errno 2] No such file or directory: 'jobs-poll'

and the workflow seems to hang.

Might be why tests/f/restart/26-remote-kill.t is failing, it also seems to hang

@oliver-sanders oliver-sanders marked this pull request as draft October 26, 2022 12:46
* Catch NoHostsError in the code it can occur in.
* Attempt to handle the error in these contexts.
* Mark the functions where the error can crop up for future warning.
* Fix job log retrieval retries.
@oliver-sanders oliver-sanders marked this pull request as ready for review December 6, 2022 16:18
@oliver-sanders
Copy link
Member Author

oliver-sanders commented Dec 6, 2022

Hmm, with this I now get new traceback for an invalid ssh config

Sorted. I had made a mistake with the error handling in one place, all tests should now be passing.

Copy link
Member

@hjoliver hjoliver left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, tested, no problems found 👍

def test_get_metrics_no_hosts_error(caplog):
"""It should handle SSH errors.

If a host is not contactable then it should be shipped.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"shipped/skipped"? (Obvious enough, I guess!)

@hjoliver
Copy link
Member

hjoliver commented Dec 7, 2022

Tests didn't run on last commit? Kicking...

@hjoliver hjoliver closed this Dec 7, 2022
@hjoliver hjoliver reopened this Dec 7, 2022
@hjoliver
Copy link
Member

hjoliver commented Dec 7, 2022

Passed. Patch-diff coverage only 75%, but it'll do for now.

@hjoliver hjoliver merged commit d8d4946 into cylc:8.0.x Dec 7, 2022
@oliver-sanders oliver-sanders deleted the no-hosts-error branch December 7, 2022 09:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something is wrong :(
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants