-
Notifications
You must be signed in to change notification settings - Fork 214
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
split the golang and Node.js parts of cosmic-swingset ? #5287
Comments
One impact is the loadgen which currently detects vat worker process by finding children of the chain process it starts. This will require updating the loadgen to support finding the kernel process as a child of the |
Could control via a flag for the node binary and a flag for simagd (e.g. default is both in the same process, -split means we are running in split mode). Not a lot of changes on the Go side, since it talks JSON - serialize string and write to socket instead of calling function. |
Another argument for splitting would be signal handling. While looking into remediations for #5419, we realized that Node likely conflicts with go's signal handlers, and loses. We'd like to add a handler in Node to prevent committing a block after it has received a SIGTERM. |
Notes from "Split Brain Design" document Motivation
Current Architecture
Proposed New Architecture
Development plan
Notes on JSON-RPCv1
MVP for the actual split
Example TrafficVbank balance update notification (agd → agvm):
Vbank balance query request (agvm → agd):
response (agd → agvm):
|
One issue that may potentially be fixed by this split is #8047 |
As discussed in the project management meeting for OrchestrationI will remove this from the orchestration delivery timeline and project |
What is the Problem Being Solved?
In our chain architecture, each validator runs N+1 processes: one
xsnap
worker process for each vat, plus a single "hub" process to drive them all. This "hub" process hosts four components:In ancient times, @michaelfig performed some deep magic to create this composite Go+Node executable. It depends upon some features of Node that might go away in future versions.
Recently, we've seen some strange crashes (#5031) in the loadgen task, in which this hub process experiences segfaults or double-free errors. We don't know if this is a Node.js assertion triggering, or something in the Go runtime.
We think it might be easier to diagnose problems like this if we could separate the hub into two separate programs: one written entirely in Go, the other written entirely in Node.js.
Description of the Design
We'd move cosmic-swingset (which would import Swingset) into its own process. The
agd
binary would be written in Go. Whenagd
is launched, it would open some pipes and then spawn a copy ofNode.js
, pointed at the cosmic-swingset entry point.js
file. All further communication between the Go side and the Node side would happen over those pipes, much like how the kernel talks to its workers over pipes.Performance Considerations
This will introduce some new overhead between the two processes. Some examples of interactions that would be somewhat slower are:
mailbox
orbridge
messages, which must be delivered to a swingset devicebridge
device)activityhash
) need to traverse the boundaryWe don't think those paths are heavily travelled right now, but making them slower might discourage us from taking fuller advantage of them. For example, it would help #3769 tremendously if swingset could simply store all of its state in the cosmos-sdk IAVL tree. Our first prototype actually did just that, but we found it was too slow. We've probably learned some tricks that would make it better now (we haven't tried the experimen recently), but that would certainly put the Go<->Node boundary on the hot path (worse because we already have a worker<->kernel hop on many of those DB requests).
Security Considerations
In general, separating software into smaller pieces is better for security, because the edges can perform validation on their inputs, and compromise of one component doesn't automatically mean compromise of the other. But, realistically, the kernel is such a deeply-trusted part of our system, that we wouldn't win much security by separating it from Node.js . A compromise of either side is fatal.
With the kernel/Node.js code running in its own process, we might be able to impose a #2836 -style seccomp jail upon it. If the kernel process became compromised, the attacker could control all of the vats, but could not take over the validator's account, which might protect other security and/or logging tools (including the supervisor that could re-launch the process).
Deployment Considerations
More processes are slightly more complicated. If one dies, we should make sure the other dies too. Admins will see more processes in their task listings, making it harder to know which one they should kill. However, most of this should be hidden in the primary
agd
process, and it's only a small increase over the status quo (where operators see several dozen worker processes already).Test Plan
Probably just the existing integration tests in
packages/cosmic-swingset
, although it would be great to have an automated test of the two processes terminating if/when they see their peer exit.The text was updated successfully, but these errors were encountered: