Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Scheduler and Task Creation improvements #1076

Merged

Conversation

seriousben
Copy link
Member

@seriousben seriousben commented Dec 2, 2024

Context

Fixes:

  • If a reducer parent and a reducer task finish at the same time, the scheduler will not know about it which might result in more than one reduced output.
  • If a reducer failed for an output, we should stop scheduling new reducers while we wait for all parent tasks to complete. (This is will show up as error processing state change: No outputs found for previous reducer task, should never happen)

What

  • We skip creating or enqueuing a new reducer task, if the compute node has failed tasks.
    • New unit test covers this case.
  • We only do process and persist one state change at a time to prevent timing issues.
    • Existing unit test was changed to test this.

Testing

Contribution Checklist

  • If the python-sdk was changed, please run make fmt in python-sdk/.
  • If the server was changed, please run make fmt in server/.
  • Make sure all PR Checks are passing.

@seriousben seriousben force-pushed the seriousben/stop-creating-reduce-tasks-if-prev-failed branch from c42ad60 to 1f589a6 Compare December 2, 2024 20:44
@seriousben seriousben requested a review from diptanu December 2, 2024 20:45
Comment on lines +103 to +114
let invocation_finished = if invocation_ctx.outstanding_tasks == 0 {
info!("task failed, invocation finished");
true
} else {
info!("task failed invocation finishing, waiting for outstanding tasks to finish");
false
};
return Ok(TaskCreationResult::no_tasks(
&task.namespace,
&task.compute_graph_name,
&task.invocation_id,
invocation_finished,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fixing a regression introduced in #1072

@seriousben seriousben merged commit 34892a1 into main Dec 2, 2024
5 checks passed
@seriousben seriousben deleted the seriousben/stop-creating-reduce-tasks-if-prev-failed branch December 2, 2024 22:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant