
Address various scheduler timing issues #1069

Merged

Conversation

seriousben
Member

@seriousben seriousben commented Nov 26, 2024

Context

Running test_graph_behaviours.py continuously with as many as 9 executors results in various failures that do not show up with only one executor.

Errors seen and addressed:

  1. Executor stuck in an infinite loop when a scheduler loop processes 2 state changes.
  2. Reducer function returning more than the expected single output when a reducer finishes before the next parent output finishes.
  3. Reducer function returning more than the expected single output when its ingest_file call happens right before a scheduler run loop.

What

In this PR, on top of addressing the edge cases found, we also add extensive tracing and make sure that a scheduler error no longer blocks the loop from processing other state changes.
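
A minimal sketch of that error-handling strategy follows. This is not the actual indexify scheduler code; StateChange, process_one, and run_scheduler_pass are illustrative stand-ins:

    #[derive(Debug)]
    struct StateChange {
        id: u64,
        payload: String,
    }

    // Hypothetical per-change handler; in the real scheduler this would apply
    // the state change to storage and create tasks/allocations.
    fn process_one(change: &StateChange) -> Result<(), String> {
        if change.payload.is_empty() {
            return Err(format!("state change {} had an empty payload", change.id));
        }
        Ok(())
    }

    // Process every state change in the batch; a failure is logged and skipped
    // so later state changes are still handled and marked as processed.
    fn run_scheduler_pass(changes: &[StateChange]) -> Vec<u64> {
        let mut processed = Vec::new();
        for change in changes {
            match process_one(change) {
                Ok(()) => processed.push(change.id),
                Err(err) => eprintln!("scheduler error (continuing): {err}"),
            }
        }
        processed
    }

    fn main() {
        let changes = vec![
            StateChange { id: 1, payload: "invoke".into() },
            StateChange { id: 2, payload: String::new() }, // simulated failure
            StateChange { id: 3, payload: "task_finished".into() },
        ];
        // State changes 1 and 3 are still processed despite the failure on 2.
        println!("processed: {:?}", run_scheduler_pass(&changes));
    }

The key point is that a single failing state change is logged and skipped rather than aborting the whole pass.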

Reducer problem 1

(illustrated by screenshots in the original PR)

Reducer problem 2

(illustrated by screenshots in the original PR)

Known edge case to address in a future PR: the scheduler run loop expects to process state changes for a single compute graph at a time. This is an incorrect assumption and can result in edge cases.
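
As a rough illustration of the likely direction for that future fix, the run loop could group the batch of state changes by compute graph before processing. This is only a sketch; the StateChange fields and function names are assumptions, not indexify code:

    use std::collections::HashMap;

    #[derive(Debug)]
    struct StateChange {
        id: u64,
        compute_graph: String,
    }

    // Group the pending state changes by compute graph so each graph can be
    // scheduled independently within one run loop.
    fn group_by_graph(changes: Vec<StateChange>) -> HashMap<String, Vec<StateChange>> {
        let mut groups: HashMap<String, Vec<StateChange>> = HashMap::new();
        for change in changes {
            groups.entry(change.compute_graph.clone()).or_default().push(change);
        }
        groups
    }

    fn main() {
        let changes = vec![
            StateChange { id: 1, compute_graph: "graph_a".into() },
            StateChange { id: 2, compute_graph: "graph_b".into() },
            StateChange { id: 3, compute_graph: "graph_a".into() },
        ];
        for (graph, batch) in group_by_graph(changes) {
            println!("{graph}: {} state change(s)", batch.len());
        }
    }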

Testing

To test the fixes and detect edge cases, I ran the following:

TEST_MAX=500 INDEXIFY_URL=http://localhost:8900 command_stress_test poetry run python -u -m unittest tests/test_graph_behaviours.py 2>&1 | tee test-out.log

command_stress_test is https://github.com/seriousben/serious-nixos-config/blob/main/home-manager/files/command_stress_test.fish

Before these changes:

After 480s (107/500 runs), ALL executors quickly become stuck in a loop doing ingest_file for already-finished tasks.

From the logs, the following failures were observed:

     12 FAIL: test_map_reduce_operation_1 (tests.test_graph_behaviours.TestGraphBehaviors.test_map_reduce_operation_1)
     11 FAIL: test_pipeline_1 (tests.test_graph_behaviours.TestGraphBehaviors.test_pipeline_1)
      8 FAIL: test_router_graph_behavior_1 (tests.test_graph_behaviours.TestGraphBehaviors.test_router_graph_behavior_1)

After these changes:

===========================
Success Rate = 99.8%
===========================
Failures     = 1
Total        = 500
Elapsed      = 1566s

The single failure seen in this run is:

FAIL: test_map_reduce_operation_1 (tests.test_graph_behaviours.TestGraphBehaviors.test_map_reduce_operation_1)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/seriousben/Library/Caches/pypoetry/virtualenvs/indexify-dlsxfW2b-py3.11/lib/python3.11/site-packages/parameterized/parameterized.py", line 620, in standalone_func
    return func(*(a + p.args), **p.kwargs, **kw)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/seriousben/src/tensorlakeai/indexify/python-sdk/tests/test_graph_behaviours.py", line 318, in test_map_reduce_operation
    self.assertEqual(output_sum_sq, [Sum(val=5)])
AssertionError: Lists differ: [Sum(val=1)] != [Sum(val=5)]

First differing element 0:
Sum(val=1)
Sum(val=5)

- [Sum(val=1)]
?          ^

+ [Sum(val=5)]
?

Future work will look into this other edge case.

Contribution Checklist

  • If the python-sdk was changed, please run make fmt in python-sdk/.
  • If the server was changed, please run make fmt in server/.
  • Make sure all PR Checks are passing.

server/src/scheduler.rs: two outdated review threads (resolved)
@seriousben seriousben requested a review from diptanu November 26, 2024 21:54
@diptanu
Collaborator

diptanu commented Nov 27, 2024

@seriousben Can you run some tests on the following scenarios?

  1. Create a graph with no executor, invoke the graph (it will create tasks but no allocations), delete the invocation, then bring up an executor (it will create an allocation) -- expecting something to break here.
  2. Create a graph with an executor, invoke the graph with a function that takes 10 seconds to complete, delete the graph in the meantime, and let the task complete -- expecting something to break.
  3. Run the graph with a map-reduce over a sequence of 100000000 and 15 executors running locally on Docker. The map and reduce will run concurrently on all the machines -- expecting a lot to break here.

server/src/scheduler.rs: outdated review thread (resolved)
@@ -111,11 +99,41 @@ impl Scheduler {
  },
  diagnostic_msgs,
}),
- state_changes_processed: processed_state_changes,
+ state_changes_processed: processed_state_changes.iter().map(|x| x.id).collect(),
Collaborator

What does this do? Are we filtering something here?

Member Author

We are mapping the processed_state_changes to their ids.
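
As a tiny self-contained illustration of that mapping (the StateChange type below is a stand-in, not the real data model):

    #[derive(Debug, Clone, Copy, PartialEq)]
    struct StateChangeId(u64);

    struct StateChange {
        id: StateChangeId,
    }

    fn main() {
        let processed_state_changes = vec![
            StateChange { id: StateChangeId(7) },
            StateChange { id: StateChangeId(8) },
        ];
        // Same shape as the diff above: keep only the ids of what was processed.
        let ids: Vec<StateChangeId> = processed_state_changes.iter().map(|x| x.id).collect();
        assert_eq!(ids, vec![StateChangeId(7), StateChangeId(8)]);
        println!("{ids:?}");
    }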

@seriousben seriousben changed the title make scheduler process its work without being blocked on errors Address various scheduler timing issues Dec 1, 2024
@seriousben seriousben force-pushed the seriousben/scheduler-process-all-state-changes-on-error branch from 3b738fd to 94b1526 Compare December 1, 2024 20:50
server/data_model/src/lib.rs: outdated review thread (resolved)

pub fn get_compute_parent(&self, node_name: &str) -> Option<&str> {
    // Find parent of the node
    self.edges
Collaborator

We could just precompute this in a hash map in the ComputeGraph object. But the logic seems fine.
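
For illustration, the suggested precomputation might look roughly like this, assuming edges maps a node name to its downstream node names (the real ComputeGraph representation may differ):

    use std::collections::HashMap;

    // Build a child -> parent lookup once from the edge list so that
    // get_compute_parent becomes an O(1) map access.
    fn build_parent_map(edges: &HashMap<String, Vec<String>>) -> HashMap<String, String> {
        let mut parents = HashMap::new();
        for (from, targets) in edges {
            for to in targets {
                parents.insert(to.clone(), from.clone());
            }
        }
        parents
    }

    fn main() {
        let mut edges = HashMap::new();
        edges.insert("map".to_string(), vec!["reduce".to_string()]);
        let parents = build_parent_map(&edges);
        assert_eq!(parents.get("reduce").map(String::as_str), Some("map"));
    }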

Member Author

For simplicity, and because it is only needed for an edge case, I would like to postpone precomputing it. Precomputing comes with challenges, like support for existing graphs, that I would prefer not to tackle in this PR.

server/src/scheduler.rs: outdated review thread (resolved)
task_key = task.key(),
"Task already completed but allocation still exists, deleting allocation",
);
txn.delete_cf(
Collaborator

I don't think we should check this in. This feels like a bandaid. Let's investigate some more before we do this.

Member Author

I fixed the root cause as part of this PR. But without this, we risk losing executors stuck in a bad state.

I think if this happens in the future it should be an alert and we should debug it.

Member Author

Since the root cause is fixed and this will prevent outages in case a similar problem happens in the future, I would like to keep this and have an alert to get us to investigate and fix other root causes.

Member Author

Summary of discussion: Since the root cause is fixed, we'll go ahead with this change.
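
For reference, the safeguard discussed in this thread boils down to something like the following self-contained sketch; the in-memory maps stand in for the real storage transaction, and the type and function names are assumptions:

    use std::collections::HashMap;

    #[derive(Debug, PartialEq)]
    enum TaskOutcome { Running, Finished }

    // If a task is already finished but an allocation for it still exists,
    // drop the allocation so an executor cannot keep re-ingesting a
    // completed task.
    fn cleanup_stale_allocations(
        tasks: &HashMap<String, TaskOutcome>,
        allocations: &mut HashMap<String, String>, // task_key -> executor_id
    ) {
        allocations.retain(|task_key, executor| {
            let finished = tasks.get(task_key) == Some(&TaskOutcome::Finished);
            if finished {
                eprintln!("task {task_key} already completed, deleting allocation on {executor}");
            }
            !finished
        });
    }

    fn main() {
        let mut tasks = HashMap::new();
        tasks.insert("task-1".to_string(), TaskOutcome::Finished);
        tasks.insert("task-2".to_string(), TaskOutcome::Running);

        let mut allocations = HashMap::new();
        allocations.insert("task-1".to_string(), "executor-a".to_string());
        allocations.insert("task-2".to_string(), "executor-b".to_string());

        cleanup_stale_allocations(&tasks, &mut allocations);
        assert!(!allocations.contains_key("task-1"));
        assert!(allocations.contains_key("task-2"));
    }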

@seriousben seriousben force-pushed the seriousben/scheduler-process-all-state-changes-on-error branch 4 times, most recently from 9f7ad73 to ad18f21 Compare December 2, 2024 00:58
if requires_task_allocation {
    let task_placement_result = self.task_allocator.schedule_unplaced_tasks()?;
    new_allocations.extend(task_placement_result.task_placements);
    diagnostic_msgs.extend(task_placement_result.diagnostic_msgs);
}
Member Author

This is what could cause the same task to be allocated multiple times.
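
One way such double allocation could be guarded against, shown purely as an illustrative sketch (TaskPlacement and dedupe_placements are assumed names, not the PR's actual fix), is deduplicating placements by task id before committing them:

    use std::collections::HashSet;

    #[derive(Debug)]
    struct TaskPlacement {
        task_id: String,
        executor_id: String,
    }

    // Keep only the first placement seen for each task id.
    fn dedupe_placements(placements: Vec<TaskPlacement>) -> Vec<TaskPlacement> {
        let mut seen = HashSet::new();
        placements
            .into_iter()
            .filter(|p| seen.insert(p.task_id.clone()))
            .collect()
    }

    fn main() {
        let placements = vec![
            TaskPlacement { task_id: "t1".into(), executor_id: "e1".into() },
            TaskPlacement { task_id: "t1".into(), executor_id: "e2".into() }, // duplicate
        ];
        let remaining = dedupe_placements(placements);
        assert_eq!(remaining.len(), 1);
        println!("{remaining:?}");
    }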

@seriousben seriousben requested a review from diptanu December 2, 2024 02:48
@seriousben seriousben force-pushed the seriousben/scheduler-process-all-state-changes-on-error branch from 41c38d9 to 3f2500f Compare December 2, 2024 11:36
@seriousben seriousben force-pushed the seriousben/scheduler-process-all-state-changes-on-error branch from 3f2500f to 3cf4f75 Compare December 2, 2024 11:42
@seriousben
Member Author

Merging to get rid of lots of timing issues. I am happy to make quick changes before the next release as needed, @diptanu.

@seriousben seriousben merged commit deee6c0 into main Dec 2, 2024
5 checks passed
@seriousben seriousben deleted the seriousben/scheduler-process-all-state-changes-on-error branch December 2, 2024 13:43