-
Notifications
You must be signed in to change notification settings - Fork 217
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add transcript events: init, snapshot save/load, shutdown #7484
Conversation
w.r.t. #6770:
I think we're in agreement that computrons should be in-consensus, so adding a w.r.t. #7199
For these, the new I think the transcript (as recorded in swing-store) should only have bundleIDs, not the full bundles. However it might make sense for a tool like |
Agreed, and this is roughly what #7434 does: introduce a new tool that extracts the bundles, and keep referencing bundleId in the extracted transcript file. |
3ee143a
to
9e67ee6
Compare
@mhofman : best reviewed commit-by-commit. The commit comments have explanations of the intended changes. |
9e67ee6
to
93e0a6a
Compare
cc @FUDCo |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a couple minor nits. This is awesome!
8ea57b6
to
719ee82
Compare
The deployment test failure revealed a bug: swingstore imports insist upon |
719ee82
to
b91f3ad
Compare
@FUDCo could you look at the last commit in particular? In the new approach, each old span looks like e.g.:
followed by a current span like:
Which means the latest snapshot will be from This tripped a consistency check in the swingstore importer, but the only test that actually triggered it was the To reduce the chances I miss this again, I added a more direct test to swingset itself, which creates a state-sync export from a short-lived kernel, and then tries to load it back in. I tested that the new test fails if I revert the swingstore-importer consistency check to it's previous "positions must be equal" form. Please double-check that I updated the test properly, and that you agree with my new consistency check. |
I haven't yet looked at the code (next in my queue) but I think this is a mistake. My take is there shouldn't be anything like that happening. The special entries you are adding to the transcript shouldn't be special from the transcriptStore's point of view; they're just entries. Remember that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add comments once I've read the code, but since @mhofman approved I'm preemptively clicking the "Request changes" button so this doesn't land until we sort out the startPos/endPos thing, which from your description is just broken.
take a look.. the check is comparing the snapshot endPos (which used to be the number just past the snapshot was taken, because it was the not-yet-used number at that moment, and is now the number of the snapshot, save-snapshot is now a fully-fledged member of the numberspace) against the current-transcript |
dtype === 'save-snapshot' || | ||
dtype === 'load-snapshot' || | ||
dtype === 'initialize-worker' || | ||
dtype === 'shutdown-worker' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might it be safer under maintenance to look for and then allow the kinds of entries that are allowed under replay rather than looking for and then rejecting the ones that are prohibited? It's one less special if
that we need to remember to update if we add another special entry type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm. The list of ordinary deliveries lives in the VatDeliveryObject
type in liveslots:
agoric-sdk/packages/swingset-liveslots/src/types.js
Lines 34 to 47 in dde6eee
* @typedef { [tag: 'message', target: string, msg: Message]} VatDeliveryMessage | |
* @typedef { [vpid: string, isReject: boolean, data: SwingSetCapData ] } VatOneResolution | |
* @typedef { [tag: 'notify', resolutions: VatOneResolution[] ]} VatDeliveryNotify | |
* @typedef { [tag: 'dropExports', vrefs: string[] ]} VatDeliveryDropExports | |
* @typedef { [tag: 'retireExports', vrefs: string[] ]} VatDeliveryRetireExports | |
* @typedef { [tag: 'retireImports', vrefs: string[] ]} VatDeliveryRetireImports | |
* @typedef { [tag: 'changeVatOptions', options: Record<string, unknown> ]} VatDeliveryChangeVatOptions | |
* @typedef { [tag: 'startVat', vatParameters: SwingSetCapData ]} VatDeliveryStartVat | |
* @typedef { [tag: 'stopVat', disconnectObject: SwingSetCapData ]} VatDeliveryStopVat | |
* @typedef { [tag: 'bringOutYourDead' ]} VatDeliveryBringOutYourDead | |
* @typedef { VatDeliveryMessage | VatDeliveryNotify | VatDeliveryDropExports | |
* | VatDeliveryRetireExports | VatDeliveryRetireImports | VatDeliveryChangeVatOptions | |
* | VatDeliveryStartVat | VatDeliveryStopVat | VatDeliveryBringOutYourDead | |
* } VatDeliveryObject |
and currently contains 9 types: message
, notify
, dropExports
, retireExports
, retireImports
, changeVatOptions
, startVat
, stopVat
, bringOutYourDead
. To do it right, we'd need liveslots to export a (non-type) list of normal ones. Except that different vats can be using different versions of liveslots, so if we e.g. add one in the future, then this vat-warehouse.js check would need to fetch the right list (vatKeeper.getLiveslotsSettings
??).
It feels a bit more future-proof to have swingset reject the pseudo-delivery types that swingset has defined (since no version of liveslots will know about any of these), than to have it only accept the normal-delivery types that liveslots (which version?) defines.
But it would feel a bit better if swingset had a concrete list of these pseudo-delivery types, rather than vat-warehouse having a special if
statement that is supposed to know about them. I'm not sure where that list ought to live, though, they're nominally defined in types-internal.js
:
agoric-sdk/packages/SwingSet/src/types-internal.js
Lines 102 to 112 in b91f3ad
* @typedef { [tag: 'initialize-worker', options: TranscriptDeliveryInitializeWorkerOptions] } TranscriptDeliveryInitializeWorker | |
* @typedef { [tag: 'save-snapshot'] } TranscriptDeliverySaveSnapshot | |
* @typedef { { snapshotID: string } } TranscriptDeliverySnapshotConfig | |
* @typedef { [tag: 'load-snapshot', config: TranscriptDeliverySnapshotConfig] } TranscriptDeliveryLoadSnapshot | |
* @typedef { [tag: 'shutdown-worker'] } TranscriptDeliveryShutdownWorker | |
* @typedef { VatDeliveryObject | |
* | TranscriptDeliveryInitializeWorker | |
* | TranscriptDeliverySaveSnapshot | |
* | TranscriptDeliveryLoadSnapshot | |
* | TranscriptDeliveryShutdownWorker | |
* } TranscriptDelivery |
but of course types aren't available for introspection by code. I suppose I could add a real export to types-internal.js
, with a list of names? Or a Set
, which might allow for faster lookup?
Anyways, I'm inclined to defer this to the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couldn't there simply be an isDeliverable
predicate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If isDeliverable
would live in vat-warehouse.js
, then that's basically what onlyRealDelivery
is, except that it takes a transcriptEntry and returns a delivery object:
/**
* @param {TranscriptEntry} transcriptEntry
* @returns {VatDeliveryObject}
*/
function onlyRealDelivery(transcriptEntry) {
const dtype = transcriptEntry.d[0];
if (
dtype === 'save-snapshot' ||
dtype === 'load-snapshot' ||
dtype === 'initialize-worker' ||
dtype === 'shutdown-worker'
) {
throw Fail`replay should not see ${dtype}`;
}
return transcriptEntry.d;
}
If it should live in liveslots, then we have the question of which (version of) liveslots it lives in. The kernel doesn't talk directly to liveslots, it talks to a vat which may or may not contain (some version of) liveslots. So there's no simple way for packages/liveslots
to export a predicate that the kernel can query (no liveslots code is ever evaluated inside the kernel's Compartment).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would expect this to be a kernel predicate, as it's about purely kernel-side stuff that the kernel wants to avoid delivering to the vat. My concern was simply that an allow-list is safer than a deny-list if we introduce another kernel-side special pseudo-deliverable thingie in the future. If somebody does that and misses the need to update the onlyRealDelivery
function here, then we could end up dropping something into a vat that could break it. The list of things that can be delivered into a vat is likely to be more stable, as there's no way to introduce a new actual deliverable type without coordination with the vat anyway.
To reiterate, this is purely a hygiene thing and doesn't touch on fundamental semantics.
@@ -1073,7 +1073,7 @@ export async function importSwingStore(exporter, dirPath = null, options = {}) { | |||
transcriptInfo.startPos === 0 || | |||
Fail`missing current snapshot for vat ${q(vatID)}`; | |||
} else { | |||
snapshotInfo.endPos === transcriptInfo.startPos || | |||
snapshotInfo.endPos + 1 === transcriptInfo.startPos || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On deeper thought I'm reasonably convinced this is a mistake and that the old invariant check is the correct one. The thing that's adding the snapshot entry into the transcript should be incrementing the endPos
value the same as any other addition to the transcript. After looking at the code that's adding these special entries, I'm puzzled by why that's not happening, but I think something is wrong there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I gave this some thoughts, and I believe this change is logically correct. However it highlights a potential problem with this PR
When we create a snapshot, we record the current transcript position endPos
.
Now with this PR when the snapshot is created, we also insert a transcript entry with the snapshot details, then roll over the span.
It used to be that the new span started at that endPos
, but now it starts at that endPos + 1
because of the extra transcript entry.
Either we accept that difference (which has the nice property that the snapshot position is the position of the corresponding snapshot-save
transcript entry), or we record the snapshot position as endPos + 1
instead to keep the snapshotPos === next transcript span start.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just don't get this. The endPos
isn't part of the snapshot, so I don't see how the timing of the snapshot enters into the discussion at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So Brian and I had a conversation about this and I figured out the source of my confusion, which I'm recording here for the benefit of anybody else who might have the same confusion in the future. The confusion stemmed from the fact that the name endPos
is being used in two different contexts, and Brian was referring to one context while I was interpreting it in the other one.
In context 1, the transcript span record, endPos
is the (open) upper end of the range of transcript items that the span covers. In context 2, the snapshot record (note: NOT in the snapshot itself), endPos
identifies the snapshot within a series of snapshots for a given vat. We chose to use the end position for this purpose because it uniquely identifies the snapshot relative to its vat, conveniently sorts in the temporal order that the snapshots were taken, and makes it easy to locate where the snapshot fits in the timeline that is the sequence of events in the transcript item table. In particular, vat replay relies on the latter because replay's model of the state of a vat is in terms of a point in time T where the snapshot captures all the state earlier than T and the transcript captures all the state changes later than T. The value endPos
was, in effect, what we were storing as our representation of T, simply by noting the transcript's endPos
value in the snapshot record when we make the snapshot. Consequently it made sense as a consistency check to verify that the snapshot's T (looking backwards) was the same as the T (looking forwards) that begins the following transcript span. Because this was the (prior) transcript span's end position, using the same field name, endPos
also made sense. However, by recording the snapshot action itself as an event in the transcript, we end up incrementing the span's endPos
value (in the transcript span record, context 1) but the value stored into the snapshot record (context 2) was the value that was captured at the time the snapshot was initiated, i.e., prior to the increment operation. This is what caused the off-by-one that motivated Brian to change the endPos
consistency check in the way that I was objecting to.
In principle there are several ways to deal with the off-by-one. One would be what Brian did. Another would be to pass endPos + 1
into the function that writes the snapshot record or for that function to increment its copy of that value prior to storing it. Both of the latter things, however, would read a bit weird in the code, so I think the modified consistency check is the right choice. However, we would probably be well served to change the name of the field in the snapshot record from endPos
to, say, snapshotPos
to make it clear that what's being recorded is the time of the snapshot without entangling it with the transcript span record via name punning.
b91f3ad
to
0a33e8f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At this point my concerns have been resolved to my satisfaction.
63521fb
to
ab644de
Compare
ab644de
to
6cad69c
Compare
These didn't need to be exposed in types-external.js
This introduces four new pseudo-delivery events to the transcript: * 'initialize-worker': a new empty worker is created * 'load-snapshot': a worker is loaded from heap snapshot * 'save-snapshot': we tell the worker to write a heap snapshot * 'shutdown-worker': we stop the worker (e.g. during upgrade) These events are not actually delivered to the worker: they are not VatDeliveryObjects. However many of them are implemented with commands to the worker (but not `deliver()` commands). The vat-warehouse records these events in the transcript to help subsequent manual/external replay tools know what happened. Without them, we'd need to deduce e.g. the heap-snapshot writing schedule by counting deliveries and comparing them against snapshot initial/interval. The 'save-snapshot'/'load-snapshot' pair indicates what a replay would do. It does not mean that the vat-warehouse actually tore down the old worker and immediately replaced it with a new one (from snapshot). It might choose to do that, or the worker itself might choose to replace its XS engine instance with a fresh one, or it might keep using the old engine. The 'save-snapshot' command has side-effects (it does a forced GC), so it is important to keep track of when it happened. As before, the transcript is broken up into "spans", delimited by heap snapshots or upgrade-related shutdowns. To bring a worker up to date, we want to start a worker (either a blank one, or from a snapshot), and then replay the "current span". With this change, the current span always starts either with 'initialize-worker' or with 'load-snapshot', telling us exactly what needs to be done. The span then contains all the deliveries that must be replayed. Old spans will end with `save-snapshot` or `shutdown-worker`, but the current span will never include one of those: the span is closed immediately after those events are added. When the kernel replays a transcript to bring a worker up to date, that replay will never see 'save-snapshot' or 'shutdown-worker'. But an external tool which replays a historical span will see them at the end. The `initialize-worker` event contains `workerOptions` (which includes which type of worker is being used, as well as helper bundle IDs like lockdown and supervisor), as well as the `source.bundleID` for the vat bundle. The `save-snapshot` event results contain the `snapshotID` hash that was generated. The `load-snapshot` event includes the `snapshotID` in a record that could be extended with additional details in the future (like an xsnap version). The types were improved to make `TranscriptDelivery` be a superset of `VatDeliveryObject`. We also record TranscriptDeliveryResult, which is currently a stripped down subset of VatDeliveryResult (just the "ok" status), plus the save-snapshot hash. In the future, we'll probably record the deterministic subset of metering results (computrons, maybe something about memory allocation). In the slog, the `heap-snapshot-save` event details now contain `snapshotID` instead of `hash`, to be consistent. Previously vat-warehouse used `lastVatID` to track which vat received a delivery most recently, and `saveSnapshot()` used that to decide which vat requires a snapshot. This commit changes that path to be more explicit, and removes `lastVatID`. refs #7199 refs #6770
Any delivery that results in `metering.compute` (i.e. anything to an xsnap worker) will record that number as `metering.computrons` in the transcript delivery results. Our standard replay code doesn't do anything with this: `replayTranscript` ignores the old results. But it gives us something to compare against during external/diagnostic replays.
`importSwingStore()` has a consistency check which, for each vat, compares the `endPos` of the current snapshot against the `startPos` of the current transcript span. Now that the `save-snapshot` pseudo-entry is the last one of a transcript, these two values are expected to differ by one. Fixed the check, updated swingstore's test-exportImport.js to behave more like the real swingset, and added a new swingset test to exercise the whole run/export/import/run cycle, so we can discover this sort of thing locally in swingset's tests. Previously this only caused a failure of the "deployment-test", which doesn't run until all other CI has passed and the PR is at the front of the merge queue.
This should reduce the chances of confusing it with `transcriptSpan.endPos`. The two numbers are related: `snapPos` is the same as `endPos` in the brief moment when we've just written out the snapshot, but we immediately append a `save-snapshot` event, causing `endPos` to increment. The `endPos` of the transcript span, being an exclusive-right-edge kind of number, will be one larger than the last item in the span, which means it will also be one larger than the `snapPos`.
6cad69c
to
d59d930
Compare
When I added pseudo-deliveries to the transcript in #7484, I forgot to update the slog entry `deliveryNum` counter to match. As result, the `startVat` was deliveryNum 0 but transcript position 1 (since 0 is the `initialize-worker`). The gap grew by 2 after every snapshot save/load event pair. We don't slog the pseudo-deliveries as `.type: 'deliver'` (they're recorded other ways). While it will be weird to have gaps in the slogfile's `deliveryNum` sequence, I think it's better to maintain easy correlation between the slog and the transcript. So this changes `vatKeeper.nextDeliveryNum()`. Previously, that function maintained a separate counter (in `kvStore`) for each vat, which was only used by slog entries. Now, it queries the transcriptStore, with `getCurrentSpanBounds().endPos`. This is only incremented when we add an entry to the transcript, so the tests of `nextDeliveryNum()`'s auto-increment feature were updated too. closes #7549
When I added pseudo-deliveries to the transcript in #7484, I forgot to update the slog entry `deliveryNum` counter to match. As result, the `startVat` was deliveryNum 0 but transcript position 1 (since 0 is the `initialize-worker`). The gap grew by 2 after every snapshot save/load event pair. We don't slog the pseudo-deliveries as `.type: 'deliver'` (they're recorded other ways). While it will be weird to have gaps in the slogfile's `deliveryNum` sequence, I think it's better to maintain easy correlation between the slog and the transcript. So this changes `vatKeeper.nextDeliveryNum()`. Previously, that function maintained a separate counter (in `kvStore`) for each vat, which was only used by slog entries. Now, it queries the transcriptStore, with `getCurrentSpanBounds().endPos`. This is only incremented when we add an entry to the transcript, so the tests of `nextDeliveryNum()`'s auto-increment feature were updated too. closes #7549
This introduces four new pseudo-delivery events to the transcript:
These events are not actually delivered to the worker: they are not VatDeliveryObjects. However many of them are implemented with commands to the worker (just not
deliver()
commands). The vat-warehouse records these events in the transcript to helpsubsequent (manual/external) replay tools know what happened. Without them, we'd need to deduce e.g. the heap-snapshot writing schedule by counting deliveries and comparing them against
snapshotInitial/snapshotInterval .
The 'save-snapshot'/'load-snapshot' pair indicates what a replay would do. It does not mean that the vat-warehouse actually tore down the old worker and immediately replaced it with a new one (from snapshot). It might choose to do that, or the worker itself might choose to replace its XS engine instance with a fresh one, or it might keep using the old engine. The 'save-snapshot' command has side-effects (it does a forced GC), so it is important to keep track of when it happened.
The transcript is broken up into "spans", delimited by heap snapshots or upgrade-related shutdowns. To bring a worker up to date, we want to start a worker (either a blank one, or from a snapshot), and then replay the "current span".
With this change, the current span always starts either with 'initialize-worker' or with 'load-snapshot', telling us exactly what needs to be done. The span then contains all the deliveries that must be replayed. The current span will never include a 'save-snapshot' or 'shutdown-worker': the span is closed immediately after those events are added, so replay will never see them. But a tool which replays a historical span will see them at the end.
The types were improved to make
TranscriptDelivery
be a superset ofVatDeliveryObject
. We also record TranscriptDeliveryResult, which is currently a stripped down subset of VatDeliveryResult (just the "ok" status), except that save-snapshot includes the snapshot hash in its results. In the future, we'll probably record the deterministic subset of metering results (computrons, maybe something about memory allocation).refs #7199
refs #6770