Reconsidering the behavior of `call()` in simulation #537

Twisol · 2022-12-13T18:52:47Z

Twisol
Dec 13, 2022
Collaborator

Summary

Currently, when a task spawned by call() terminates, we wait one tick of the simulation engine before resuming its parent. As an illustrative edge case, performing call(() -> {}) (i.e. delegating to and blocking on the completion of a do-nothing task) ends up behaving identically to delay(Duration.ZERO), since both cause the current task to yield for a tick.

I am concerned that this behavior is actually incorrect in general, as it can cause tasks that a modeler expects to occur simultaneously to, in fact, occur sequentially. Moreover, the only remedy for a task occurring late is to add extra delays to all the other potentially-simultaneous tasks, which is exactly the kind of non-local reasoning we sought to eliminate in the Merlin modeling system (cf. the effects of `immediately` on model design and maintenance in APGen).

I propose that we change the behavior of call() to resume the calling task immediately when the called task completes. Not only does this address the above problem, but it more generally aligns with existing expectations about method invocation in Java. In other words, this removes a rough edge where Merlin does not meet the bar of being "just Java with added flavor". Moreover, if the original behavior is desired in some places, it can be achieved by inserting delay(Duration.ZERO) after the call -- or, indeed, anywhere else near the call site that might be more appropriate to capture the modeler's local intent.

Proposed source-level change

This is a very small change in terms of lines of code, so I'm putting this section first so we can set the development effort question to rest quickly.

In SimulationEngine#stepEffectModel, instead of deferring the AwaitingChildren phase of a task to the next tick, begin processing children immediately. In other words, replace these lines:

this.tasks.put(task, progress.completedAt(currentTime, children));
this.scheduledJobs.schedule(JobId.forTask(task), SubInstant.Tasks.at(currentTime));

with this line:

stepWaitingTask(task, progress.completedAt(currentTime, children), frame, currentTime);

The stepWaitingTask logic already resumes the parent immediately on completion, so it is only the tick artificially inserted between "modeled logic completed" and "all delegated work completed" that needs to be removed.

If we also decide that the specific case of call(new MyActivity()) should retain the extra tick -- see below -- then we will want to insert an extra delay(Duration.ZERO) into the generation of call stubs in MissionModelGenerator#generateActivityActions.

Historical rationale

The current behavior is something of a historical artifact due to how we ended up with the current design. Back when the system was first implemented, we didn't have call -- at least, not as a primitive action. Instead, we had waitFor, which would take a task ID and block until that task completed. The call action was defined as waitFor(spawn()), so it inherited the behavior of waitFor.

With waitFor, the "general" case is that the target task won't be completed yet, so the code following a waitFor will resume at a later time. (We considered this to be the general case because you wouldn't use waitFor if you knew you didn't have to wait!) This means that the transaction enclosing the effects prior to waitFor will have closed, and the transactions of any other task at that time will also have closed, meaning that the later code will observe the effects of all of those transactions.

In the edge case where the task has already completed, we don't want to change our behavior suddenly and discontinuously from the modeler's normal expectation. As such, we extended the bulk behavior to the boundary, inserting an extra tick to ensure that the modeler can always assume they're in a new transaction following a waitFor.

However, waitFor is now gone, and with it goes any possibility that a task might attempt to wait on a task that has already completed. (Consider: when you use call(act), the task described by act can't have completed because we haven't even started it yet!) This means that the reason for the extra tick has now disappeared. That doesn't on its own mean that we should remove it, but it does mean we should ask whether it's still necessary -- and whether we gain more by a different choice.

Problems with the current behavior

Keep in mind that the problems below have always existed -- but our hands were tied because call was built on waitFor, and waitFor had to gracefully deal with the edge case described above. Now we have the opportunity to fix these problems.

`call` cannot be used for optimization

First, call is little more than a simulation-aware Java method call. This is more true than ever with the recent changes to allow call to return the result of the called task. In fact, there are exactly two differences between call(() -> x) and x itself:

1. The observable extra tick closing the current transaction and opening a new one
2. The unobservable creation of a separate task in the simulation engine

The first difference can already be achieved with the explicit use of delay(Duration.ZERO), so call() adds no additional capabilities here.

The second actually has significant optimization benefits. A modeler can wrap expensive parts of a task in a call() to a ThreadedTask, where the thread overhead is insignificant relative to the computation itself, and then replace the outer task with a ReplayingTask, which affords fast context-switching at the cost of replaying over earlier parts of the task. Since the costly parts of the task are wrapped in a separate task, the replaying task remains paused until the subtask completes, and every time thereafter the replaying task can skip over that whole subtask in one step, no matter how complex the logic in that subtask was.
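To make the replay argument concrete, here is a toy sketch in plain Java. It does not use Merlin's ThreadedTask or ReplayingTask; ReplaySketch, ReplayLog, and the resumption counts are all hypothetical stand-ins. It assumes only what is described above: a replaying task re-runs its body from the top on each resumption, and a subtask that has already completed is skipped in one step on later replays.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntSupplier;

final class ReplaySketch {
  /** Counts how often the costly computation really executes. */
  static int expensiveRuns = 0;

  static int expensive() {
    expensiveRuns++;
    return 42;
  }

  /** Hypothetical stand-in for a replaying task's record of finished subtasks. */
  static final class ReplayLog {
    private final List<Integer> results = new ArrayList<>();
    private int cursor = 0;

    /** Start a new replay of the task body from the top. */
    void rewind() {
      cursor = 0;
    }

    /** First encounter runs the subtask; later replays return the recorded result. */
    int call(IntSupplier subtask) {
      if (cursor == results.size()) {
        results.add(subtask.getAsInt());
      }
      return results.get(cursor++);
    }
  }

  /** Costly work written inline: every replay of the body executes it again. */
  static int runsWhenInline(int resumptions) {
    expensiveRuns = 0;
    for (int i = 0; i < resumptions; i++) {
      expensive();
    }
    return expensiveRuns;
  }

  /** Costly work wrapped in a call-like subtask: replays skip over it. */
  static int runsWhenWrapped(int resumptions) {
    expensiveRuns = 0;
    ReplayLog log = new ReplayLog();
    for (int i = 0; i < resumptions; i++) {
      log.rewind();
      log.call(ReplaySketch::expensive);
    }
    return expensiveRuns;
  }
}
```

With four resumptions, the inline version executes the costly work four times and the wrapped version once. That saving is what the wrapping is meant to buy.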

Unfortunately, the extra tick actively prevents call() from being used for optimization, because that extra tick can end up causing extremely different observable behavior! For instance, if a task added a series of rates that might alternate between positive and negative -- a situation not uncommon in data models with channels that may even overflow into other channels -- then the net sum might end up rather close to zero, but any number of intermediate sums might end up quite positive or quite negative. These unnecessarily-observable positive or negative quantities could then trip conditions, e.g. on daemon tasks designed to take automatic action when some threshold is reached. So any attempt to use call() to optimize simulation performance can actually lead to significant discrepancies in modeled behavior.
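A toy sketch of that hazard, in plain Java rather than against the Merlin API. The class and method names are hypothetical, and the deltas and threshold are made-up numbers. It assumes that each tick makes the running sum visible to a threshold-watching daemon, whereas effects applied within a single transaction expose only the final sum.

```java
final class IntermediateSums {
  /**
   * Returns true if any sum that an observer can see reaches the threshold.
   * With tickAfterEachDelta, every intermediate sum is observable;
   * without it, only the net sum is.
   */
  static boolean thresholdTripped(int[] deltas, int threshold, boolean tickAfterEachDelta) {
    int sum = 0;
    for (int delta : deltas) {
      sum += delta;
      if (tickAfterEachDelta && sum >= threshold) {
        return true;  // an intermediate sum became visible and tripped the condition
      }
    }
    return sum >= threshold;  // only the net sum is visible
  }

  public static void main(String[] args) {
    int[] deltas = {5, -4, 6, -7};  // running sums: 5, 1, 7, 0
    System.out.println(thresholdTripped(deltas, 6, true));
    System.out.println(thresholdTripped(deltas, 6, false));
  }
}
```

Here the net sum is 0, but the intermediate sum of 7 trips a threshold of 6 once a tick separates each step. This is the discrepancy that wrapping each step in call() would introduce under the current behavior.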

Aside on `call` on activities:

There is a third use of call: as convenient syntactic sugar for invoking activities with call(new MyActivity()). The tasks produced by activities are special in that their output spans are visible to planners, so modelers will often model things that should be visible to planners as activities, then invoke them with call(). From a simulation perspective, however, this boils down to a normal call -- the generated code stub for that activity type just looks up the actual task to invoke, which itself emits events causing its span to be visible in the simulation results.

The proposed change doesn't implicate activities specially; activity output spans will still be visible to planners as always, and the concrete task lookup will remain unchanged. However, if there is a desire for activities to generally force an extra tick on their callers, that can be added to the generated call stubs. (I would prefer not to, for the sake of a consistent mental model.)

(I've long wanted to add an ability to generate visible spans without having to model them with activity types, so I don't consider this part of the essence of call. In fact, modeling something as an activity type means that a planner can use that activity type in their plan, even if the modeler only wanted to provide output information, and not a new opportunity for control. IIRC, this also came up in discussions about APGen, but I don't remember the details.)

Nth-distant ancestors are delayed by N ticks

Tasks can decompose into quite complex and deep trees of subtasks, and parent tasks often end by calling a child task. Since an extra tick is inserted between every terminating task and its caller, a sequence of five nested calls will incur a total of five ticks' worth of delay by the time the first task resumes. Together with spawning multiple children, this can lead to tasks that look like they should occur simultaneously actually occurring in sequence. For instance:

register.set(0);

spawn(() -> {
  call(() -> {});  // 1 tick of delay
  register.set(1);
});

spawn(() -> {
  call(() -> call(() -> {}));  // 2 ticks of delay
  System.out.println(register.get());
});

In the current system, this program will output 1; under the proposed change, this program will output 0.

The calls do nothing, and intuitively we should be able to remove them locally. If we remove them from the second spawn, the current system will now output 0, while under the proposed change there will be no difference.

We can make something similar happen with an unintuitive ordering of effects:

register.set(0);

spawn(() -> {
  call(() -> {});
  System.out.println(register.get());
});

register.set(1);

Under the current system, this program will output 1; under the proposed change, this program will output 0. The common theme here is that call ought to be transparent -- if you call a no-op, the whole call should be a no-op. The behavior of a call should be all and only the behavior of the thing being called.
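The outputs quoted for the snippets in this section can be reproduced with a toy tick count. This is plain Java, not the Merlin engine, and every name in it is hypothetical. It rests on one simplifying assumption taken from the examples themselves: a read observes a write by another task only if the write happened in a strictly earlier tick.

```java
final class CallTickSketch {
  /** Ticks that pass before the code following `depth` nested no-op calls resumes. */
  static int ticksOfDelay(int depth, boolean extraTickPerCall) {
    return extraTickPerCall ? depth : 0;
  }

  /**
   * Value read from a register that starts at 0 and that one other task sets to 1.
   * Assumption: the write is visible only to reads in a strictly later tick.
   */
  static int observed(int writerTick, int readerTick) {
    return writerTick < readerTick ? 1 : 0;
  }

  /** First snippet: the writer sits behind one call, the reader behind two nested calls. */
  static int firstExample(boolean extraTickPerCall) {
    return observed(ticksOfDelay(1, extraTickPerCall), ticksOfDelay(2, extraTickPerCall));
  }

  /** First snippet with the no-op calls removed from the second spawn. */
  static int firstExampleWithoutReaderCalls(boolean extraTickPerCall) {
    return observed(ticksOfDelay(1, extraTickPerCall), ticksOfDelay(0, extraTickPerCall));
  }

  /** Second snippet: the parent writes in tick 0; the child reads after one no-op call. */
  static int secondExample(boolean extraTickPerCall) {
    return observed(0, ticksOfDelay(1, extraTickPerCall));
  }
}
```

Passing true models the current behavior and false the proposed one. Under the proposal, every variant reads 0 regardless of how many no-op calls are added or removed, which is the transparency property argued for above.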

This issue also has implications for my prototype proposal for exposing a relative scheduling capability from the simulation engine. That prototype draws on the existing ability of tasks to form a delegation tree, leveraging call to achieve end-relative invocation. However, in the course of compiling a plan into a top-level task that decomposes into its constituent directives, and to achieve a sensible modularity between the tree-structure and the individual directives, it ends up being useful to nest two calls together. The inclusion of an extra tick makes me nervous about whether it's possible to have two directives planned which we expect to occur at the same time, but actually occur in two distinct ticks due to the current behavior of call. Whether or not the problem is real, the current call complicates reasoning about simple program refactorings like this.

(Yes, this particular issue is what led me to think about the current behavior of call; but I hope I've demonstrated that this isn't particular to my pet prototype.)

mattdailis · 2023-03-30T19:28:46Z

mattdailis
Mar 30, 2023
Maintainer

Thanks for the write-up! (and apologies for the glacial response)

I am supportive of removing the extra tick from the end of call. It adds symmetry to the begin and end semantics: the first bit of a call currently starts in the same tick as the caller, so it seems reasonable for the last bit to occur in the same tick in which the caller is resumed.

This will also make it easier to reason about a caller and its callee together: nothing can interfere with the observable simulation state between the end of the callee's execution and the resumption of the caller.

Regarding how this affects existing models, and whether we should auto-generate a delay(0) for calling activities, I'm not sure. My leaning is towards not doing that, and instead documenting the semantics change, and providing guidance for where to insert the tick if it turns out to be important. My hope is that in most cases, this change won't violate any explicit assumptions made by user code.

Twisol Mar 31, 2023
Collaborator Author

> Regarding how this affects existing models, and whether we should auto-generate a delay(0) for calling activities, I'm not sure. My leaning is towards not doing that, and instead documenting the semantics change, and providing guidance for where to insert the tick if it turns out to be important. My hope is that in most cases, this change won't violate any explicit assumptions made by user code.

I will say, it would be very nice if we could run some user plans both before and after this change to see if (and by how much) the simulation results are meaningfully affected.

Twisol · 2023-09-05T23:17:31Z

Twisol
Sep 5, 2023
Collaborator Author

Actually, I've realized that call(() -> x) is distinct from x in a way not mentioned originally. (Thanks to @DavidLegg for causing me to break my own mental model.) I believe my arguments above still hold, especially w.r.t. the nested-call issue (with n calls causing a delay of n ticks), but I want to correct the record.

Consider the following snippets (where log is made up):

// 1
call(() -> spawn(() -> delay(2, SECONDS)));
log("Hello");

// 2
spawn(() -> delay(2, SECONDS));
log("Hello");

In the first example, "Hello" is logged after two seconds pass. In the second, "Hello" is logged immediately. This is because call is morally a delegation of duties: we want to resume only once those duties have been performed. Whether the called task further delegates via spawn is immaterial: the delegated duties are only complete once the whole subtree of delegates is complete. Regular Java function calls do not make this promise, so they can return before any spawned tasks are completed.
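The difference can be sketched as a small recursion over a delegation tree. This is a toy model in plain Java with hypothetical names, not Merlin code, and it assumes for simplicity that a task spawns all of its children at its own start time.

```java
import java.util.List;

final class DelegationSketch {
  /** A task that runs for ownSeconds and spawns its children when it starts (a simplification). */
  static final class Task {
    final long ownSeconds;
    final List<Task> spawned;

    Task(long ownSeconds, List<Task> spawned) {
      this.ownSeconds = ownSeconds;
      this.spawned = spawned;
    }
  }

  /** When a plain Java method running this body would return: spawned work is not awaited. */
  static long methodReturnsAt(Task task) {
    return task.ownSeconds;
  }

  /** When call(task) would resume its caller: only once the whole subtree has finished. */
  static long callResumesAt(Task task) {
    long end = task.ownSeconds;
    for (Task child : task.spawned) {
      end = Math.max(end, callResumesAt(child));
    }
    return end;
  }

  public static void main(String[] args) {
    Task delayTwoSeconds = new Task(2, List.of());
    Task body = new Task(0, List.of(delayTwoSeconds));  // () -> spawn(() -> delay(2, SECONDS))
    System.out.println(callResumesAt(body));    // snippet 1: caller resumes at 2 seconds
    System.out.println(methodReturnsAt(body));  // snippet 2: caller continues at 0 seconds
  }
}
```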
