Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

data store: preparing jobs appearing on restart #4994

Closed
oliver-sanders opened this issue Jul 21, 2022 · 13 comments · Fixed by #5011
Closed

data store: preparing jobs appearing on restart #4994

oliver-sanders opened this issue Jul 21, 2022 · 13 comments · Fixed by #5011
Assignees
Labels
bug Something is wrong :( small
Milestone

Comments

@oliver-sanders
Copy link
Member

oliver-sanders commented Jul 21, 2022

Reported by @MetRonnie on Element.

"Preparing" is a task state not a job state, we should see preparing tasks but never preparing jobs (jobs represent submissions, they have IDs and can be queried).

It looks like this only happens on restart but that will need to be confirmed properly. The data store loads in jobs from the DB and is probably being tricked by the submit number being higher than expected as a result of #4984 - #4994 (comment)

The code needs to be patched to prevent these preparing jobs from being spawned somewhere around here:

for row in flow_db.select_jobs_for_datastore(
{itask.tokens.relative_id}
):
self.insert_db_job(1, row)

XUinUpHfovaqAsrvxfKhTvtM
HsFtrEzqcAmKPYuQaVLLelah

Pull requests welcome!
This is an Open Source project - please consider contributing a bug fix
yourself (please read CONTRIBUTING.md before starting any work though).

@oliver-sanders oliver-sanders added bug Something is wrong :( small labels Jul 21, 2022
@oliver-sanders oliver-sanders added this to the cylc-8.0.0 milestone Jul 21, 2022
@datamel datamel self-assigned this Jul 21, 2022
@MetRonnie
Copy link
Member

Actually it predates #4984

@dwsutherland
Copy link
Member

preparing state was introduced to jobs here:
#4891

Because jobs exist before submission, and had to have some sort of state, so they are assigned on DB load according to their attributes:

        if run_status is not None:
            if run_status == 0:
                status = TASK_STATUS_SUCCEEDED
            else:
                status = TASK_STATUS_FAILED
        elif time_run is not None:
            status = TASK_STATUS_RUNNING
        elif submit_status is not None:
            if submit_status == 0:
                status = TASK_STATUS_SUBMITTED
            else:
                status = TASK_STATUS_SUBMIT_FAILED
        else:
            status = TASK_STATUS_PREPARING

If the workflow shutdown before updating the status of the job in the DB, or the status never got followed up on, then the job will be in this preparing state.

@hjoliver
Copy link
Member

Hmm, in that case do we need either:
-1. hide/ignore preparing jobs in the datastore or UI?
-2. or a preparing job state (and icon)?

Option 2. has been discussed and decided against, but it's not entirely unreasonable - "preparing" is not that much more abstract than "submitted" as a job state (submitted does not represent a running process yet either, although it does have a job ID at that point).

@MetRonnie
Copy link
Member

MetRonnie commented Jul 22, 2022

I think option 1 (exclude preparing jobs from the data store) is straightforward, essentially boils down to

@@ @@ def insert_db_job(self, row_idx, row):

         if run_status is not None:
             if run_status == 0:
                 status = TASK_STATUS_SUCCEEDED
             else:
                 status = TASK_STATUS_FAILED
         elif time_run is not None:
             status = TASK_STATUS_RUNNING
         elif submit_status is not None:
             if submit_status == 0:
                 status = TASK_STATUS_SUBMITTED
             else:
                 status = TASK_STATUS_SUBMIT_FAILED
         else:
-            status = TASK_STATUS_PREPARING
+            return

(keeps the the historical failed tasks in the store)

@oliver-sanders
Copy link
Member Author

oliver-sanders commented Jul 22, 2022

-2. or a preparing job state (and icon)?

NOOOO!

Option 2. has been discussed and decided against

^ This.

@hjoliver
Copy link
Member

Fair enough, but as I said, it wouldn't be entirely unreasonable, it's just a choice 😁

@dwsutherland
Copy link
Member

dwsutherland commented Jul 22, 2022

-1. hide/ignore preparing jobs in the datastore or UI?

We may have missing job history (i.e. "why can I only see job #2? where's #1?"), and possibly have an orphaned job running (if the successful submission never got back to the scheduler and/or DB before shutdown), then on restart I assume an orphaned preparing job/task wouldn't be followed up on... Is this possible?
If so, this could be dangerous, as a 2nd job may be submitted..

@dwsutherland
Copy link
Member

dwsutherland commented Jul 22, 2022

We can't just "sweep it under the rug" I'm afraid.

The DB load isn't the only place we have to contend with a preparing job status:

JOB_STATUSES_ALL = [
    TASK_STATUS_PREPARING,
    TASK_STATUS_SUBMITTED,
    TASK_STATUS_SUBMIT_FAILED,
    TASK_STATUS_RUNNING,
    TASK_STATUS_SUCCEEDED,
    TASK_STATUS_FAILED,
]

# Faster lookup where order not needed.
JOB_STATUS_SET = set(JOB_STATUSES_ALL)
    def insert_job(self, name, point_string, status, job_conf):
        """Insert job into data-store.

        Args:
            name (str): Corresponding task name.
            point_string (str): Cycle point string
            job_conf (dic):
                Dictionary of job configuration used to generate
                the job script.
                (see TaskJobManager._prep_submit_task_job_impl)

        Returns:

            None

        """
        sub_num = job_conf['submit_num']
        tp_id, tproxy = self.store_node_fetcher(name, point_string)
        if not tproxy:
            return
        update_time = time()
        tp_tokens = Tokens(tp_id)
        j_tokens = tp_tokens.duplicate(job=str(sub_num))
        j_id, job = self.store_node_fetcher(
            j_tokens['task'],
            j_tokens['cycle'],
            j_tokens['job'],
        )
        if job:
            # Job already exists (i.e. post-submission submit failure)
            return

        if status not in JOB_STATUS_SET:
            status = TASK_STATUS_PREPARING

And we can't just do away with jobs here (live/new jobs)...

So if it's not preparing and definitely not submitted yet, then what is it?

@datamel
Copy link
Contributor

datamel commented Jul 22, 2022

Maybe a pre-submit job state?

@hjoliver
Copy link
Member

Maybe a pre-submit job state?

Well that's exactly what "preparing" means 😁

@dwsutherland
Copy link
Member

dwsutherland commented Jul 23, 2022

Maybe a pre-submit job state?

Sounds sensible, it can't be nothing (since the submit num is incremented), and it can't be submitted (as that means it's been handed off to the runner successfully).

Some job states are tasks states also, so they don't necessarily have to be different 😄

@wxtim wxtim assigned wxtim and unassigned datamel Jul 25, 2022
@dwsutherland
Copy link
Member

If we really don't want preparing job state, then we should not increment the submit number, create date-store job, and not save to the DB until submitted or submit-failure happens...

Otherwise, we should just add an icon to the UI.

@oliver-sanders
Copy link
Member Author

Surely it's just a single if branch that returns if the task state is preparing?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something is wrong :( small
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants