
Hardening the cancellation functionality #675

Open · wants to merge 4 commits into base: master
Conversation

@HansVRP (Contributor) commented Nov 29, 2024

No description provided.

```python
try:
    running_start_time_str = row.get("running_start_time")
    if not running_start_time_str or pd.isna(running_start_time_str):
        _log.warning(f"Job {job.job_id} does not have a valid running start time. Cancellation skipped.")
```
Member:

This warning might be a bit too alarming. It will be shown every minute on each job that has no recorded start time, so this could be quite spammy.

"Cancellation skipped" might also give the wrong impression that the job manager still thinks that the job should be cancelled for some reason, but it won't actually do it.

some possible improvements:

  • only show this once per job, or for the whole job tracking run
  • if the running start time is missing, fill it in with the timestamp of the first observation where it is missing, to have a fallback value, so that the auto-cancel feature can still work
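Both suggestions above could be sketched roughly as follows. This is a minimal, hypothetical illustration, not the PR's actual code: the helper name, the `warned` set, and the use of `print` instead of the project's `_log` are all assumptions.

```python
import datetime

import pandas as pd


def ensure_running_start_time(active, i, job_id, warned):
    """Fill a missing running_start_time with the current time, warning at most once per job."""
    value = active.loc[i, "running_start_time"]
    if not value or pd.isna(value):
        if job_id not in warned:
            # Warn only on the first observation for this job, to avoid spamming every minute.
            print(f"No running_start_time recorded for job {job_id}; falling back to current time.")
            warned.add(job_id)
        # Use the first observation time as a best-effort start time.
        active.loc[i, "running_start_time"] = datetime.datetime.now(datetime.timezone.utc).isoformat()


# Usage example
active = pd.DataFrame({"job_id": ["j-1"], "running_start_time": [None]})
warned = set()
ensure_running_start_time(active, 0, "j-1", warned)
ensure_running_start_time(active, 0, "j-1", warned)  # value already filled; no second warning
```

With the fallback in place, the auto-cancel logic always has a timestamp to compare against, at the cost of the start time being approximate for jobs that were already running when tracking began.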

Contributor Author:

Is the underlying issue then in:

```python
if previous_status in {"created", "queued"} and new_status == "running":
    stats["job started running"] += 1
    active.loc[i, "running_start_time"] = rfc3339.utcnow()

if new_status == "canceled":
    stats["job canceled"] += 1
    self.on_job_cancel(the_job, active.loc[i])

if self._cancel_running_job_after and new_status == "running":
    self._cancel_prolonged_job(the_job, active.loc[i])
```

The problem would be removed if I also only ran the cancel-prolonged-job check when the previous state was "created" or "queued". Then we would know for sure that a start time has been set?

Member:

That won't work in practice: you want cancelling to happen long after the state changed to "running", so both the previous and the current state will typically be "running" at the point you want to cancel.

What you could do is change how "running_start_time" is set, to something like (pseudo-code):

```
if running_start_time is not set and new_status == "running":
    active.loc[i, "running_start_time"] = rfc3339.utcnow()
```

Then running_start_time degrades to a best-effort guess of the actual start time, but at least you have something to work with.

```python
"""
Ensures the running start time is valid. If missing, approximates with the current time.
Returns the parsed running start time as a datetime object.
"""
```
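For context, a helper matching this docstring might look roughly like the sketch below. The function name, the assumption that timestamps are stored as RFC 3339 / ISO 8601 strings, and the parsing approach are all guesses, not the PR's actual implementation.

```python
import datetime

import pandas as pd


def ensure_valid_running_start_time(value):
    """Parse running_start_time; if missing, approximate with the current UTC time."""
    if not value or pd.isna(value):
        # No recorded start time: fall back to "now" as a best-effort guess.
        return datetime.datetime.now(datetime.timezone.utc)
    # Assumes the value is an RFC 3339 / ISO 8601 string like "2024-11-29T10:00:00Z".
    return datetime.datetime.fromisoformat(value.replace("Z", "+00:00"))


print(ensure_valid_running_start_time("2024-11-29T10:00:00Z"))
```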
Member:

This extra method makes the whole construction quite complex. E.g. it drags in the requirement to have the whole dataframe (df) available at this point, including the assumption that mutations on it will properly be persisted.

Isn't it easier to just modify this existing `if` in `_track_statuses`:

```python
if previous_status in {"created", "queued"} and new_status == "running":
    stats["job started running"] += 1
    active.loc[i, "running_start_time"] = rfc3339.utcnow()
```

E.g. something like:

```python
if new_status == "running" and (
    not active.loc[i, "running_start_time"] or pd.isna(active.loc[i, "running_start_time"])
):
    if previous_status not in {"created", "queued"}:
        _log.warning(
            f"Unknown 'running_start_time' for running job {job_id}. Using current time as an approximation."
        )
    stats["job started running"] += 1
    active.loc[i, "running_start_time"] = rfc3339.utcnow()
```

Contributor Author:

Sounds good.

The reason for the additional function was to harden the cancel-prolonged-job logic itself. It makes sense to call this from within _track_statuses; it would, however, not resolve the issue in the cancellation function itself.

@soxofaan soxofaan linked an issue Dec 5, 2024 that may be closed by this pull request
@HansVRP HansVRP requested a review from soxofaan December 11, 2024 18:24
@HansVRP (Contributor Author) commented Dec 13, 2024

@soxofaan any other changes required?

@soxofaan (Member) left a comment:

Just a minor thing, but apart from that it's OK to merge, I think.

```python
elapsed = current_time - job_running_start_time

if elapsed > self._cancel_running_job_after:
    try:
```
Member:

I think this nested try-except is a bit of overkill now and doesn't add any value. I'd remove it to keep _cancel_prolonged_job more to the point.
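A slimmed-down version of the elapsed-time check, with a single try-except around the actual cancel call, could look roughly like this. The class name, the dict standing in for a job object, and the `print` calls are placeholders, not the real openEO job manager API.

```python
import datetime


class JobManagerSketch:
    """Illustration only: auto-cancel jobs that have been running too long."""

    def __init__(self, cancel_running_job_after):
        self._cancel_running_job_after = cancel_running_job_after

    def _cancel_prolonged_job(self, job, running_start_time):
        elapsed = datetime.datetime.now(datetime.timezone.utc) - running_start_time
        if elapsed > self._cancel_running_job_after:
            try:
                # Real code would call something like job.stop(); a dict stands in here.
                job["status"] = "canceled"
            except Exception as e:
                print(f"Failed to cancel job {job['id']}: {e}")


# Usage example: a job that started 3 hours ago with a 2-hour limit gets cancelled.
manager = JobManagerSketch(cancel_running_job_after=datetime.timedelta(hours=2))
job = {"id": "j-1", "status": "running"}
started = datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(hours=3)
manager._cancel_prolonged_job(job, started)
```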

Development

Successfully merging this pull request may close these issues.

harden _cancel_prolonged_job against missing running_start_time