Address various scheduler timing issues #1069
Conversation
@seriousben Can you run some tests on the following scenarios -
server/src/scheduler.rs (Outdated)
@@ -111,11 +99,41 @@ impl Scheduler {
            },
            diagnostic_msgs,
        }),
-       state_changes_processed: processed_state_changes,
+       state_changes_processed: processed_state_changes.iter().map(|x| x.id).collect(),
What does this do? Are we filtering something here?
We are mapping the processed_state_changes to their id.
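A minimal sketch of that projection, using hypothetical types (the real `StateChange` struct has more fields):

```rust
// Hypothetical types; the result records only the ids of the processed
// state changes, nothing is filtered out.
#[derive(Clone, Copy, Debug, PartialEq)]
struct StateChangeId(u64);

struct StateChange {
    id: StateChangeId,
    // ...other fields elided
}

fn processed_ids(changes: &[StateChange]) -> Vec<StateChangeId> {
    // 1:1 projection from each processed change to its id.
    changes.iter().map(|x| x.id).collect()
}
```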
Force-pushed from 3b738fd to 94b1526.
pub fn get_compute_parent(&self, node_name: &str) -> Option<&str> {
    // Find parent of the node
    self.edges
We could just precompute this in a hash map in the ComputeGraph object. But the logic seems fine.
For simplicity, and because it is only needed for an edge case, I would like to postpone precomputing it. Precomputing comes with challenges, such as supporting existing graphs, that I would prefer not to tackle in this PR.
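For context, a rough sketch of both approaches. The types and edge layout below are assumptions for illustration, not the real `ComputeGraph`:

```rust
use std::collections::HashMap;

// Sketch only; assumes edges are stored as parent -> children, which may not
// match the real ComputeGraph layout.
struct ComputeGraph {
    edges: HashMap<String, Vec<String>>,
}

impl ComputeGraph {
    // Linear scan over the edges, as in the PR; acceptable because the lookup
    // is only needed for a rare edge case.
    pub fn get_compute_parent(&self, node_name: &str) -> Option<&str> {
        self.edges
            .iter()
            .find(|(_, children)| children.iter().any(|c| c.as_str() == node_name))
            .map(|(parent, _)| parent.as_str())
    }

    // The precomputed alternative discussed above (hypothetical): deferred
    // because graphs persisted before the index existed would need migration
    // support.
    pub fn parent_index(&self) -> HashMap<&str, &str> {
        let mut index = HashMap::new();
        for (parent, children) in &self.edges {
            for child in children {
                index.insert(child.as_str(), parent.as_str());
            }
        }
        index
    }
}
```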
    task_key = task.key(),
    "Task already completed but allocation still exists, deleting allocation",
);
txn.delete_cf(
I don't think we should check this in. This feels like a bandaid. Let's investigate some more before we do this.
I fixed the root cause as part of this PR. But without this, we risk losing executors stuck in a bad state.
I think if this happens in the future it should be an alert and we should debug it.
Since the root cause is fixed and this will prevent outages in case a similar problem happens in the future, I would like to keep this and have an alert to get us to investigate and fix other root causes.
Summary of discussion: Since the root cause is fixed, we'll go ahead with this change.
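A simplified, in-memory sketch of the cleanup being kept. The real code performs the delete via `txn.delete_cf` inside a RocksDB transaction; the types below are hypothetical:

```rust
use std::collections::HashMap;

// If a task has already reached a terminal outcome but an allocation for it
// still exists, drop the allocation so executors cannot get stuck re-running
// an already finished task.
#[derive(PartialEq)]
enum TaskOutcome {
    Unknown,
    Success,
    Failure,
}

struct Task {
    outcome: TaskOutcome,
}

fn reconcile_allocations(
    tasks: &HashMap<String, Task>,
    allocations: &mut HashMap<String, String>, // task_key -> executor_id
) {
    allocations.retain(|task_key, _executor| match tasks.get(task_key) {
        // Task already completed but allocation still exists: delete it.
        Some(task) if task.outcome != TaskOutcome::Unknown => false,
        // Task still pending (or unknown): keep the allocation.
        _ => true,
    });
}
```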
Force-pushed from 9f7ad73 to ad18f21.
if requires_task_allocation {
    let task_placement_result = self.task_allocator.schedule_unplaced_tasks()?;
    new_allocations.extend(task_placement_result.task_placements);
    diagnostic_msgs.extend(task_placement_result.diagnostic_msgs);
}
This is what could cause the same task to be allocated multiple times.
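As a rough illustration of the failure mode (names are hypothetical, not the PR's actual fix): if the allocator runs again before earlier placements are committed, the same task can show up twice. One guard is to dedupe by task id before extending the allocation list:

```rust
use std::collections::HashSet;

#[derive(Clone)]
struct TaskPlacement {
    task_id: String,
    executor_id: String,
}

fn extend_without_duplicates(
    new_allocations: &mut Vec<TaskPlacement>,
    placements: Vec<TaskPlacement>,
) {
    // Track which tasks already have a pending allocation in this pass.
    let mut seen: HashSet<String> =
        new_allocations.iter().map(|p| p.task_id.clone()).collect();
    for placement in placements {
        // Skip tasks that were already placed; keep the first placement only.
        if seen.insert(placement.task_id.clone()) {
            new_allocations.push(placement);
        }
    }
}
```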
Force-pushed from 41c38d9 to 3f2500f, then from 3f2500f to 3cf4f75.
Merging to get rid of lots of timing issues. I am happy to make quick changes before the next release as needed, @diptanu.
Context
Running test_graph_behavior.py continuously with as many as 9 executors results in various failures that do not show themselves with only one executor.
Error seen and addressed:
What
In this PR, on top of addressing the edge cases found, we also add extensive tracing and make sure a scheduler error does not block the loop from processing other state changes.
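A minimal sketch of that error-isolation idea (not the actual scheduler loop; the handler signature and state-change type are assumed):

```rust
// Handle each state change independently so a single failure is logged and
// skipped instead of blocking all later state changes in the loop.
fn run_scheduler_pass<E: std::fmt::Debug>(
    state_changes: Vec<u64>,
    mut handle: impl FnMut(u64) -> Result<(), E>,
) -> Vec<u64> {
    let mut processed = Vec::new();
    for change in state_changes {
        match handle(change) {
            Ok(()) => processed.push(change),
            Err(err) => {
                // Trace the error and keep going; the change can be retried later.
                eprintln!("scheduler error for state change {change}: {err:?}");
            }
        }
    }
    processed
}
```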
Reducer problem 1
Reducer problem 2
Known edge case to address in a future PR: the scheduler run loop expects to process state changes for a single compute graph at a time. This is an incorrect assumption and can result in edge cases.
Testing
To exercise the fixes and surface edge cases, I detected errors by running the following:
command_stress_test is https://github.com/seriousben/serious-nixos-config/blob/main/home-manager/files/command_stress_test.fish
Before these changes:
After 480s (107/500 runs), ALL executors very quickly become stuck in a loop doing ingest_file for already finished tasks.
We can still infer these failures:
After these changes:
The single failure seen in this run is:
Future work will look into this other edge case.
Contribution Checklist
make fmt in python-sdk/.
make fmt in server/.