Parallelism across independent resources in simulation #620
-
Hi Joel! Thanks for writing all of this up. I think speeding up simulation along these lines is a really good idea. I have some misgivings about the specifics, but I do believe that there is a way to push the core idea through. I'll focus here on my responses to your proposal itself. I may write a separate reply about my thoughts for overcoming the obstacles I identify below.
Maybe the nomenclature has shifted (again), but from what I recall, a "resource" in Merlin is a quantity that is observable and trackable by the external environment (i.e. to be recorded as part of simulation results). I think you're referring to cells, i.e. the internal state of the model.
Can I take this to mean you have a Rust prototype of this proposal? 👀
For some fun historical context, this was something we explored very early in the development of activity modeling in Merlin. At least at the time, we found it really hard to design a static dependency declaration and injection system that didn't entail a significant loss of ergonomics. That's why we ended up with the current approach, where we can infer a dynamic dependency graph based on what the activity is observed to do. We never did end up fully constructing that dependency graph, although we do perform the same kind of dependency analysis to determine when a resource needs to be re-queried due to changes to upstream cells. There may even still be a comment in the codebase to that effect.
When we first considered an approach like this, we had a lot of trouble with how to give names to those pieces of state, and how to do type-safe dependency injection of the same. Moreover, individual cells are not really meant to be accessed directly -- a cell is morally a field of some other model class, and it is that model that defines the relevant methods for manipulating that state. I think we would want to name these models and do dependency injection at that level.

The issue with binding against models instead of cells is that there's no obvious way to tell which cells can be influenced from one model. Firstly, the sim driver doesn't observe the mission model as anything more than one big black box -- we don't currently have any insight into its internal structure. We would need some new mechanism to at least identify groups of cell allocations. Secondly, there is nothing stopping a modeler from providing one model instance a reference to another model instance, so the cells a model can influence are a superset of the cells the model allocated itself. If an activity declares a dependency on a given model instance, that doesn't give us enough information to precisely infer all of the cells that are accessible from that root.

On the flip side, the issue with binding against cells is that it entirely breaks encapsulation. An activity is liable to call a method on a model, which calls another method on another model, etc., until finally doing something to a cell. Activity type authors cannot be expected to know the full breadth of cells each method influences, just so that they can declare them in the activity's annotation.

In addition (and this is perhaps a more solvable issue), cells (and models) can be allocated in e.g. a loop. If we want to name each cell (or each model) that can be depended upon by an activity, we'll need an addressing scheme that can cope with the fact that we don't necessarily know all of the instantiated cells/models until the mission model has been instantiated.
I'm further concerned about how activity decomposition factors into this. If an activity with access to state X spawns an activity with access to state Y, then before the spawn, we might be fooled into thinking that nothing can influence Y, and hence Y can move forward. However, if the child is then spawned, we suddenly have changes to Y occurring when we didn't expect them to. It seems we would need to declare which child activities can be spawned, too, so that the child dependencies can be imputed to the parent as well. All this means that I think it will be very hard to precisely identify pre-simulation which cells an activity is allowed to influence whilst also preserving the "it's just Java with mild restrictions" flavor of the current modeling style.
I like this idea very much; it's definitely at the heart of how this whole endeavor could be achieved. With our current approach, the problem is: how do we tell when we can't step a cell up? Put differently, how do we know whether stepping the cell up will miss effects that haven't been committed yet at earlier times? Our current solution is to simply never run a task until all cells are guaranteed to be available -- we never block during the task, because we already blocked on all cells before we resumed the task to begin with. Your solution to this problem can, I think fairly, be characterized as statically declaring which cells an activity can affect; this would allow us to simply look at the activities prior to our current time and ask if any of them could potentially affect us. However, as noted above, I'm concerned about (1) the amount of information we need to be given to achieve a precise analysis, (2) the ergonomics implications of requiring that information, and (3) how certain we can reasonably be that we didn't miss some loophole that renders our analysis unsound. I think there's a middle ground, where we still use runtime dependency tracking, but optimistically run tasks that might query cells that cannot be stepped up to the current time with confidence. I want to call this the "branch prediction" approach -- we step a cell up optimistically, and roll back if we find our assumption invalidated -- but I'll wait to describe this separately. (I'm not sure if this is what you referred to by "time warp", but this would leverage logic similar to how ReplayingTask already works.)
@mattdailis and I have been talking about an approach that decouples resource profiling from task simulation, so I think this doesn't need to be considered (luckily). The idea is that, as the simulation commits events to its history (in your proposal, changes are considered committed when all clocks have moved past them), a separate worker thread walks along history behind the simulation, and steps copies of the resource-relevant cells. That is, simulation and profiling have their own copies of state; profiling is just replaying the history that simulation has already traversed. In principle, this lets simulation get arbitrarily far ahead of profiling, without jeopardizing our ability to stream results.
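A minimal sketch of that decoupling, with invented names (`ProfilingStream`, `commit`, `drainSamples` are mine, not Merlin's API): the simulation thread appends committed, timestamped deltas to a shared history, and a trailing profiler replays them against its own private copy of the state.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch: profiling decoupled from simulation by replaying history.
class ProfilingStream {
    record Committed(long time, double delta) {}

    // Simulation writes here; the profiler reads. Simulation never waits on profiling.
    private final BlockingQueue<Committed> history = new LinkedBlockingQueue<>();
    private double profiledValue = 0.0; // the profiler's private copy of state

    // Simulation side: called once an event is guaranteed final (all clocks have passed it).
    void commit(long time, double delta) { history.add(new Committed(time, delta)); }

    // Profiler side: replay everything committed so far, producing (time, value) samples.
    List<double[]> drainSamples() {
        List<double[]> samples = new ArrayList<>();
        Committed c;
        while ((c = history.poll()) != null) {
            profiledValue += c.delta();
            samples.add(new double[] { c.time(), profiledValue });
        }
        return samples;
    }
}
```

Because the profiler only ever consumes committed history, it can lag arbitrarily far behind without blocking simulation, which is exactly the streaming property described above.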
Yes, this is something we knew and designed around early on. This is also why we didn't bake any type of cell into Merlin itself. Merlin allows modelers to define their own cell types, with their own events and own semantics for those events. The hope is that, if you know a lot about what you need your particular cell for, you can encode domain operations on that cell as events, giving them better behavior with respect to concurrency and potentially better behavior with respect to dependency tracking. As this becomes more important, I hope that we can see cell modeling become more of a valuable tool. As it stands, the existing cells may not always be the most performant, but at least modelers can start with them and refine them as they identify performance bottlenecks.
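As a toy illustration of that idea (far simpler than Merlin's real cell interface, and all names here are my own): a cell whose events are commutative deltas, so effects from concurrent branches can be merged without caring about their order.

```java
// Hypothetical modeler-defined cell: events are increments, not absolute sets,
// so they commute and concurrent effects can be merged cheaply.
final class CounterCell {
    private long value;

    CounterCell(long initial) { value = initial; }

    // The cell's event type: a delta, which carries its own merge semantics.
    record Increment(long amount) {}

    void apply(Increment e) { value += e.amount(); }

    // Events from concurrent branches combine by summing deltas -- order-free.
    static Increment merge(Increment a, Increment b) {
        return new Increment(a.amount() + b.amount());
    }

    long get() { return value; }
}
```

A `set`-style event would not merge this way (last writer wins, so order matters), which is exactly why encoding domain operations as events can buy better concurrency behavior.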
I would very much like if we could instrument simulation to get a better idea of how these dependencies look. As I've mentioned, we designed the interface between the driver and the model so that we could precisely identify all state, and reads and writes against that state, over the course of the simulation. The intent was to build a complete graph of dependencies amongst cells, tasks, and resources, laid out over time, which could be used to drive incremental resimulation. (If, when re-running a changed activity, it still produces all the same effects, then we need not resimulate anything else. If it's different in some ways but not others, we can propagate those changes only to those entities that were influenced.) This kind of dependency graph would also be useful for us, to get an idea of how interdependent -- or how decoupled -- the activities and cells in a simulation are. And if read-write dependencies become as performance-critical as we hope in this proposal, the graph will also be a major boon to modelers who are trying to optimize their system. As such, I think a really strong first step would be to implement the capability to generate a dependency graph from a simulation, so that we can get a better handle on the kinds of performance gains we'd realistically see -- and perhaps identify any other structure that could be exploited in the name of performance.
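The recording half of that instrumentation could be as simple as the following (all names hypothetical): log each task's reads and writes against cells as edges, then answer coupling queries over the resulting graph afterwards.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical dependency recorder: writes produce task -> cell edges,
// reads produce cell -> task edges, forming a bipartite dependency graph.
final class DependencyRecorder {
    record Edge(String from, String to) {}

    private final List<Edge> edges = new ArrayList<>();

    void onRead(String taskId, String cellId)  { edges.add(new Edge(cellId, taskId)); }
    void onWrite(String taskId, String cellId) { edges.add(new Edge(taskId, cellId)); }

    // Two tasks are coupled if the first wrote a cell the second read.
    boolean coupled(String writerTask, String readerTask) {
        Set<String> written = new HashSet<>();
        for (Edge e : edges) if (e.from().equals(writerTask)) written.add(e.to());
        for (Edge e : edges) if (e.to().equals(readerTask) && written.contains(e.from())) return true;
        return false;
    }
}
```

A real implementation would also timestamp edges (a write at t=10 cannot couple to a read at t=5), but even this order-free version would reveal how decoupled a mission model's activities actually are.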
-
Currently simulation has no parallelism, even though each task (such as an activity) has its own thread. Only one task is allowed to proceed at a time, even though many activities in different subsystems could theoretically operate on completely disjoint sets of resources.
Given that we aren't going to make a backtracking engine (i.e. time warp), the limiting factor of how much parallelism we could achieve is Merlin's knowledge of what resources are going to be operated on, when, how, and by what activities. Currently, we have no knowledge, which forced the current design to have no parallel tasks at all. This is one extreme. At the other extreme, we could have the mission modeller register the operations each activity could make in between each delay, which would tell Merlin exactly when each resource would be queried or changed. This would be impractical for everyone involved, would limit the modeller's ability to use variable delays, and would just generally be a whole lotta work.
I'd like to outline a system in the middle that keeps track of resource usage at the activity level; i.e. Merlin would know what resources each activity is allowed to query or mutate, but not precisely when it will happen. Unfortunately this would be a nearly ground-up redesign of Merlin, but wouldn't require much work for mission modellers. And of course I wouldn't propose this if I didn't think the performance gains would be worth it.
Design
This would be best suited for Loom virtual threads, which are currently a preview feature in Java 19. This is because there will be a lot of context switching in between CPU-bound (not IO-bound) units of work, which would be much more expensive on OS threads. I also fiddled around with async `Future`s in Rust for a while to make sure it would work with that paradigm - and it would, but it would be pretty obnoxious with Java's `CompletableFuture`s.

Mission model changes
Instead of getting the whole `MissionModel` to operate on, each activity effect model would declare in its `@EffectModel` annotation which resources it will read and write, and be provided resource handles that allow reading and/or writing accordingly. This allows Merlin to know which activities can read/write which resources, and requires the activity to abide by the resources that it declares. I don't imagine any other changes to mission model code.
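A sketch of what that extended annotation might look like. The `reads`/`writes` members, the `ReadHandle`/`WriteHandle` interfaces, and `BatteryActivity` are all invented here for illustration, not Merlin's actual API.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Hypothetical extended @EffectModel: declares the resources touched up front.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface EffectModel {
    String[] reads() default {};
    String[] writes() default {};
}

// Handles the engine would inject instead of the whole MissionModel.
interface ReadHandle<T>  { T get(); }          // may block on the resource's clock
interface WriteHandle<T> { void set(T value); } // enqueues an operation

class BatteryActivity {
    @EffectModel(reads = {"stateOfCharge"}, writes = {"heaterPower"})
    public void run(ReadHandle<Double> stateOfCharge, WriteHandle<Double> heaterPower) {
        if (stateOfCharge.get() < 0.2) heaterPower.set(0.0);
    }
}
```

The type system then enforces the declaration for free: an activity physically cannot write a resource it only holds a `ReadHandle` to, and the engine can read the annotation up front to schedule non-conflicting activities in parallel.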
Clocks

A key part of any parallel simulation engine is that each piece of computation might be happening at a different simulation time. In this case, since this is not a backtracking algorithm, each thread has exactly one clock that tracks where it is in simulation time. I'll be referring to a hypothetical `Clock` class that each thread owns, which allows other threads to wait on it until it advances to a certain simulation time.

Activities
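A minimal sketch of such a `Clock`, using plain monitor-based waiting (the method names `advanceTo`/`awaitAtLeast` are my own):

```java
// Hypothetical per-thread simulation clock: the owner advances it,
// other threads can block until it reaches a given simulation time.
final class Clock {
    private long now = 0; // simulation time, in whatever unit the engine uses

    // Called only by the owning thread: move forward and wake any waiters.
    synchronized void advanceTo(long t) {
        if (t < now) throw new IllegalArgumentException("clocks only move forward");
        now = t;
        notifyAll();
    }

    // Called by other threads: block until this clock reaches at least time t.
    synchronized void awaitAtLeast(long t) throws InterruptedException {
        while (now < t) wait(); // loop guards against spurious wakeups
    }

    synchronized long now() { return now; }
}
```

With Loom virtual threads, parking inside `wait()` is cheap, which is part of why the proposal leans on them.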
Instead of an activity waiting on `delay`, it would wait (if needed) on resource queries (`.get()`). `delay` would simply advance the activity clock and notify any resource threads that were waiting on it (see below). This should still be compatible with incremental simulation.

Resources
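To make the `delay`/`get` interplay concrete, here is a self-contained toy (all names are mine; the real design would use separate per-thread `Clock`s rather than one shared object): `delay` only moves the activity's time forward and wakes waiters, while `get` blocks until the resource thread has caught up to the activity's current time.

```java
import java.util.function.LongUnaryOperator;

// Toy model of one activity/resource pair sharing two clocks.
class ActivitySketch {
    private long activityTime = 0;
    private long resourceTime = 0; // advanced by the resource's worker thread

    // delay(): no blocking -- just advance our clock and notify waiters.
    synchronized void delay(long d) {
        activityTime += d;
        notifyAll();
    }

    // Called from the resource's worker thread as it churns through operations.
    synchronized void resourceAdvanceTo(long t) {
        resourceTime = t;
        notifyAll();
    }

    // A resource query: safe to read only once the resource has caught up
    // to the activity's current simulation time.
    synchronized long get(LongUnaryOperator valueAt) throws InterruptedException {
        while (resourceTime < activityTime) wait();
        return valueAt.applyAsLong(activityTime);
    }
}
```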
Ideal world
If our jobs were easy, no activities would cause dependencies between resources, meaning that `set`-like operations can be instantiated immediately in any order. So `resA.set(5)` is acceptable, and even `resA.add(5)`. But `resA.set(resB.get())` and `if (resB.get() == 5) resA.set(5)` create a dependency between `resA` and `resB`. In this ideal world where we don't have to query resources, simulation could look like this (and be pretty performant): as activities `delay`, their clocks advance. The point of all of this is that each resource can churn through its heap of operations in parallel with the others.
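That per-resource heap of operations could be sketched like so (the names `ResourceWorker`, `submit`, and `stepTo` are my own, assuming timestamped operations on a double-valued resource):

```java
import java.util.PriorityQueue;
import java.util.function.DoubleUnaryOperator;

// Hypothetical per-resource worker: activities enqueue timestamped operations
// in any order; the worker applies them in simulation-time order.
class ResourceWorker {
    record Op(long time, DoubleUnaryOperator f) {}

    private final PriorityQueue<Op> heap =
        new PriorityQueue<>((a, b) -> Long.compare(a.time(), b.time()));
    private double value;
    private long clock = 0; // how far this resource has been stepped

    ResourceWorker(double initial) { value = initial; }

    // Activity side: non-blocking; operations may arrive out of order.
    synchronized void submit(long time, DoubleUnaryOperator op) {
        heap.add(new Op(time, op));
    }

    // Resource side: apply every queued operation up to time t. In the real
    // engine this runs on the resource's own thread, advancing its Clock.
    synchronized double stepTo(long t) {
        while (!heap.isEmpty() && heap.peek().time() <= t) {
            value = heap.poll().f().applyAsDouble(value);
        }
        clock = t;
        return value;
    }
}
```

Each resource owns one such heap and worker, so disjoint resources churn through their operations fully in parallel; the hard part (below) is knowing how far `stepTo` is allowed to go.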
Reality
Turns out our jobs are not easy. Any time an activity needs to query a resource, it has to wait for the resource thread's clock to pass the activity clock before reading a value. But (a) the above algorithm doesn't store values in memory, and (b) let's be real, we want profile streaming. So instead we do this:
Limitations

Consider the difference between `res.add(5)` and `res.set(res.get() + 5)`. `add` is an operation that can be sent to the resource heap immediately; `set(get() + 5)` will block the activity thread until the operator catches up, and then block the operator until the activity has a chance to resume and send the operation.

A resource like `state of charge` is read frequently by the majority of activities. I assume that most activities' charge behavior can be expressed without `.get()` (as in `charge.subtract(5)` or `charge.addRate(0.1)`), but if it can't, then we have a problem. We would still get parallelism, because other resources could progress in small increments on other cores, but state of charge would become a bottleneck.

Benefits