
split the golang and Node.js parts of cosmic-swingset ? #5287

Open
warner opened this issue May 4, 2022 · 6 comments · May be fixed by #8002
Assignees
Labels
cosmic-swingset package: cosmic-swingset enhancement New feature or request needs-design vaults_triage DO NOT USE

Comments


warner commented May 4, 2022

What is the Problem Being Solved?

In our chain architecture, each validator runs N+1 processes: one xsnap worker process for each vat, plus a single "hub" process to drive them all. This "hub" process hosts four components:

  • the Tendermint consensus layer (written in Go)
  • all the cosmos-sdk modules (written in Go)
  • the "cosmic-swingset" glue layer (written in a combination of Go and Node.js)
  • the swingset kernel (written in Node.js)

In ancient times, @michaelfig performed some deep magic to create this composite Go+Node executable. It depends upon some features of Node that might go away in future versions.

Recently, we've seen some strange crashes (#5031) in the loadgen task, in which this hub process experiences segfaults or double-free errors. We don't know if this is a Node.js assertion triggering, or something in the Go runtime.

We think it might be easier to diagnose problems like this if we could separate the hub into two separate programs: one written entirely in Go, the other written entirely in Node.js.

Description of the Design

We'd move cosmic-swingset (which would import Swingset) into its own process. The agd binary would be written in Go. When agd is launched, it would open some pipes and then spawn a copy of Node.js, pointed at the cosmic-swingset entry point .js file. All further communication between the Go side and the Node side would happen over those pipes, much like how the kernel talks to its workers over pipes.

Performance Considerations

This will introduce some new overhead between the two processes. Some examples of interactions that would be somewhat slower are:

  • cosmos transactions that supply mailbox or bridge messages, which must be delivered to a swingset device
  • cosmos Bank transactions which change a vpurse balance need to notify swingset (probably through the bridge device)
  • swingset writes to the merkle tree (both merkle-subtree-writing objects #4558 and the activityhash) need to traverse the boundary

We don't think those paths are heavily travelled right now, but making them slower might discourage us from taking fuller advantage of them. For example, it would help #3769 tremendously if swingset could simply store all of its state in the cosmos-sdk IAVL tree. Our first prototype actually did just that, but we found it was too slow. We've probably learned some tricks that would make it better now (we haven't tried the experiment recently), but that would certainly put the Go<->Node boundary on the hot path (made worse because many of those DB requests already involve a worker<->kernel hop).

Security Considerations

In general, separating software into smaller pieces is better for security: the edges can perform validation on their inputs, and compromise of one component doesn't automatically mean compromise of the other. But, realistically, the kernel is such a deeply-trusted part of our system that we wouldn't win much security by separating it from Node.js. A compromise of either side is fatal.

With the kernel/Node.js code running in its own process, we might be able to impose a #2836 -style seccomp jail upon it. If the kernel process became compromised, the attacker could control all of the vats, but could not take over the validator's account, which might protect other security and/or logging tools (including the supervisor that could re-launch the process).

Deployment Considerations

More processes are slightly more complicated. If one dies, we should make sure the other dies too. Admins will see more processes in their task listings, making it harder to know which one they should kill. However, most of this should be hidden in the primary agd process, and it's only a small increase over the status quo (where operators see several dozen worker processes already).

Test Plan

Probably just the existing integration tests in packages/cosmic-swingset, although it would be great to have an automated test of the two processes terminating if/when they see their peer exit.
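The "terminate when the peer exits" property can be tested without the real binaries. A sketch under that assumption: spawn a child that reads its stdin (`cat` standing in for agvm, which would likewise exit on stdin EOF), close the pipe to simulate our own death, and confirm the child terminates.

```go
package main

import (
	"fmt"
	"os/exec"
)

// peerExitsOnEOF checks the "die with your peer" property: start a
// child that reads its stdin, close the pipe to simulate our own
// exit, and confirm the child terminates. `cat` stands in for the
// real agvm process here.
func peerExitsOnEOF() error {
	cmd := exec.Command("cat")
	stdin, err := cmd.StdinPipe()
	if err != nil {
		return err
	}
	if err := cmd.Start(); err != nil {
		return err
	}
	stdin.Close()     // the peer "disappears"
	return cmd.Wait() // returns once cat exits on stdin EOF
}

func main() {
	if err := peerExitsOnEOF(); err != nil {
		fmt.Println("peer did not exit cleanly:", err)
		return
	}
	fmt.Println("peer exited on pipe close")
}
```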

@warner warner added enhancement New feature or request cosmic-swingset package: cosmic-swingset labels May 4, 2022

mhofman commented May 6, 2022

One impact is the loadgen, which currently detects vat worker processes by finding children of the chain process it starts. This will require updating the loadgen to find the kernel process as a child of the agd process first.

Tartuffo commented

We could control this via a flag for the node binary and a flag for simagd (e.g. the default is both in the same process; -split means we are running in split mode). Not a lot of changes are needed on the Go side, since it already talks JSON: serialize the string and write it to a socket instead of calling a function.


mhofman commented May 25, 2022

Another argument for splitting would be signal handling. While looking into remediations for #5419, we realized that Node likely conflicts with go's signal handlers, and loses. We'd like to add a handler in Node to prevent committing a block after it has received a SIGTERM.


JimLarson commented Feb 23, 2023

Notes from "Split Brain Design" document

Motivation

  • Conventional debugging and profiling tools do not play well with a dual-runtime process. In particular, the Go runtime holding the Cosmos-level state cannot be dynamically debugged.
  • It’s difficult to have any integration test short of a full system test, and as above it’s difficult to debug a full system test.
  • Full system tests are also less stable and more resource intensive.
  • Ag-solo remains an idiosyncrasy within the overall architecture.
  • Some valuable use modes are awkward, such as async downcalls or sync upcalls.

Current Architecture

  • agd
    • spawns ag-chain-cosmos and a Node.js runtime, then exits
  • ag-chain-cosmos
    • Single SwingSet instance
    • loads libagcosmosdaemon, containing an entire Go runtime for the Cosmos environment, plus some glue code in C++
    • The JS runtime is dominant. Synchronously calls into Go, then returns.
    • Glue code copies byte strings between the two runtimes, so the two GCs can run independently.
    • Databases are written exclusively by one side or the other - no shared DBs.
  • ag-solo
    • Single SwingSet instance
    • Persistent standalone Agoric runtime.
    • Runs a simulated chain across an OS pipe via a Node.js-specific protocol.
      • Sends same ABCI phase messages (Begin/EndBlock, etc.)
    • Reads Agoric chain mailboxes over Tendermint WebSocket, and submits messages by execing agd tx subcommands
  • ag-cosmos-helper
    • Go runtime only.
    • Obsolete - can be removed

Proposed New Architecture

  • agd
    • remains a pure Go runtime
    • agd start - spawns SwingSet peer via agvm start and drops into Cosmos-level startup
    • communicates via JSON-RPC (v1) over a pipe.
    • Each side registers Cosmic-Swingset protocol handler with pipe
    • Each side writes to its own databases, as before.
    • Either process exits upon peer termination or closed pipe.
  • agvm
    • Pure JS runtime
    • Single SwingSet instance
    • connection.json configurations to support chain, solo, sim
      • Support all existing ag-solo connection types
      • JSON-RPCv1-over-pipe
      • Cosmic-swingset handler function sharable with sim-chain
      • Composable to create solo+sim-chain
    • agvm start - new command to easily configure and/or resume idempotently
      • simple flag to configure connections with only stdin/stdout JSON-RPC
      • argument for different boot script
  • libagcosmosdaemon goes away
    • C++ glue code goes away.
    • No more use of NAPI, leaving us less dependent on Node.js internals.
      • Easier to change version of Node.js
      • Plausible to use different Javascript environments.
    • Since current bridge code copies data anyhow (separate garbage collectors), and per-Tx data is sent indirectly via “actions”, performance regression, if any, should be minimal.
  • New sim-vm can simulate Swingset for agd, log bridge messages or make them available for unit tests.
  • JSON-RPC(v1)
    • minimal peer-to-peer async RPC framing on top of bi-directional byte stream
    • similar to web sockets, but much simpler and less overhead
    • each side implements a handler for incoming calls
    • each side associates responses with blocked callers and/or promise resolution
    • replaces current bridge protocol with its ad-hoc channels
    • Can have async downcalls (JS → Go) and sync upcalls (Go → JS)
    • Either side can initiate the connection.
    • Use newline-delimited JSON for ease-of-use
  • Simplifications
    • C++ glue code goes away.
    • No more use of NAPI, leaving us less dependent on Node.js internals.
      • Easier to change version of Node.js
      • Plausible to use different Javascript environments.

Development plan

  • Take bridge performance measurements to detect later regression.
    • Maybe. Really only significant
  • Rewrite JS side sync device downcalls to be promise-based.
  • Add new JSON-RPC connection type to connections.js and have a modified sim-chain speak it by updating “bridge” device callbacks.
  • Create new agvm binary and command-line, replacing ag-solo.
    • ag-solo can be a simple script for compatibility
  • Create modified “agd start” command to launch agvm and call-to-controller via pipe.
  • Implement sim-vm for Cosmos-level testing with simulated swingset
  • Clean up various scaffolding code to deal with new binaries
    • testing
    • release
    • monitoring
    • documentation
  • Clean up obsolete code:
    • ag-chain-cosmos
    • gyp instructions in Makefile
    • g/c/build/libagcosmosdaemon.h
    • g/c/cmd/helper and libdaemon
    • g/c/src/* including agcosmosdaemon-node.cc

Notes on JSON-RPCv1

  • Spec says that result field, error field, and each entry of the params array must be a JSON object. This is a silly restriction, so as long as we’re rolling our own code for this, we’ll allow these all to be arbitrary JSON values. The method field needs to remain a string, however.
    • matches the Go standard library implementation
  • The Go side will use 32-bit integers for request IDs (looping and skipping over any IDs already in use), but will accept any JSON value for incoming IDs, since it doesn’t need to match against them.
  • To replace the “port” grouping, and to play nicely with Golang’s rpc framework, method names will use a dotted `module.Method` pattern, such as vbank.BalanceUpdate. The method name should follow Golang conventions: capitalized and CamelCase.
  • The Javascript side can set its own conventions.

MVP for the actual split

  • The naughty intra-process RPC data exchange bits:
    • vm.ReplyToGo(replyPort int, isError int, resp string) int
    • vm.SendToGo(port int, msg string) string
    • daemon.RunWithController(func (needReply bool, str string) (string, error))
  • agvm '{"agd":{"protocol":"jsonrpc","fd":[5,6]}}'

Example Traffic

Vbank balance update

notification (agd → agvm):

{
  "id": null,
  "method": "vbank.BalanceUpdate",
  "params": [{
    "nonce": 123,
    "updates": [{
      "address": "agoric1abc123",
      "denom": "uist",
      "amount": "654321"
    }]
  }]
}

Vbank balance query

request (agvm → agd):

{
  "id": 123,
  "method": "vbank.GetBalance",
  "params": [{"address": "agoric1abc123", "denom": "uist"}]
}

response (agd → agvm):

{"id": 123, "result": "654321", "error": null}


mhofman commented Aug 23, 2023

One issue that may potentially be fixed by this split is #8047

aj-agoric commented

As discussed in the project management meeting for Orchestration, I will remove this from the orchestration delivery timeline and project.
