-
Notifications
You must be signed in to change notification settings - Fork 1.6k
pvf: Update docs for PVF artifacts #6551
pvf: Update docs for PVF artifacts #6551
Conversation
node/core/pvf/src/artifacts.rs
Outdated
//! | ||
//! # Lifecycle of an artifact | ||
//! | ||
//! 1. During node start-up, the artifacts cache is cleaned up. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to re-prepare all PVF artifacts on each node restart? Or does "clean up" mean the artifacts' build host version matches the current host version?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, all local artifacts are cleared - we'll re-prepare the PVF the first time a new execute request comes in. I'll update the doc to make it more clear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should imho avoid clearing the artifacts cache when the host did not change. We could recompile only the parachain blocks when the host version did change, then lazily recompile the parathread ones.
Replying here to keep the threads focused. Yeah, thinking about it, I actually don't see why we want to clear the artifacts cache. On start-up, we could instead re-populate the Artifacts
table from the compiled artifacts on disk -- the PVF hash should already be in the filename -- and re-start the 24-hour TTL timers for each artifact. Or we could even use the system's last-modified/accessed metadata for the files (with some sanity checks). Then instead of lazily re-compiling the PVFs, we would lazily delete the ones we end up not needing, which seems a lot more efficient. 🙂
@s0me0ne-unkn0wn Would this need coordination with your execution environment PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I didn't see the last comment yet and continued in that thread. Yes, I forgot about 24h timer, it solves the problem. Do I understand correctly that the proposal is to keep artifacts only if the node is not upgraded? In that case, it might work. But we should always purge the artifacts if the node is upgraded.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would this need coordination with your execution environment PR?
Special coordination is not needed, I'm already used to merging master to that branch three times a week 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I didn't see the last comment yet and continued in that thread. Yes, I forgot about 24h timer, it solves the problem. Do I understand correctly that the proposal is to keep artifacts only if the node is not upgraded? In that case, it might work. But we should always purge the artifacts if the node is upgraded.
Yes, makes sense to me!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Raised https://github.com/paritytech/polkadot/issues/6940 to continue this discussion.
//! 1. If the artifact is already being processed, we add another execution request to the | ||
//! existing preparation job, without starting a new one. | ||
//! | ||
//! 2. Note that if the state is [`ArtifactState::FailedToProcess`], we usually do not retry |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do not currently have any "I'm taking a long time" messages, so if we send out approval assignments but do artifact builds lazily, then we'll cause no shows, given that builds can take more than the 12? second no show time out.
In theory, we could send messages for "building artifact" and/or "It's slow but I'm here", but @rphmeier wanted to avoid complicating the approval process with such messages, probably a wise decision. We therefore need PVF artifacts to be built in advance, or else we suck up the risk of correlated artifact builds creating de fact escalations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, I was wondering about this a couple of times already myself. I think for the time being, preparation is usually pretty fast so there are no issues.
The problem with preparation in advance is, that this will likely result in wasted effort in case of parathreads. As all validators would need to prepare a PVF, although only 30 approval checkers will actually need it. Might be fine.
Other options:
- Gather some data on actual preparation times - if they turn out to be very low, we could just enforce a reasonably low value in pre-checking.
- Interpret parathread PVFs instead of compiling. Given that they are expected to be executed only rarely, the compiliation effort is likely not worth the trouble anyway. (And for parachains prepare in advance)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should imho avoid clearing the artifacts cache when the host did not change. We could recompile only the parachain blocks when the host version did change, then lazily recompile the parathread ones.
We should've timings of course, but we'll never stop people building wasm blobs that screw up build times intentionally.
Interpreting kinda works. We have consensus upon who gets compiled vs interpreted, so interpreted then runs with different approval time parameters. We could similarly adjust approval parameters to include recompiling parathreads each block. This makes parathreads more expensive and second class though. We could've parathreads that "buy" being compiled in advance like parachains.
We do still have everyone compile the parathread when the PVF initially gets uploaded though, yes? I'd think this suggests parathreads and parachains should be all be precompiled, which just makes uploading a PVF more expensive. Implicitly then host upgrades become relatively more expensive, but this makes sense too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should imho avoid clearing the artifacts cache when the host did not change.
We'd need an intelligent garbage collector, then. Imagine the node is restarted and has a hundred artifacts in the cache. How do we know which ones we will use and which are stale?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should've timings of course, but we'll never stop people building wasm blobs that screw up build times intentionally.
Why? They would not pass the pre-checking phase, assuming:
Gather some data on actual preparation times - if they turn out to be very low, we could just enforce a reasonably low value in pre-checking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'd need to garbage collect PVFs when parachains deregister or when the PVF gets superseded by later PVFs. We could do pre-checking after a PVF upload gets finalized, so then we avoid fork concerns and each parachan has at most two PVFs in the cache. We'd loose the ability to upgrade parachain PVFs when finality stalls though, so system parachain could require some escape hatch here.
As always we pay for optimizations with complexity. Ain't clear how far this should go right now of course. We could stick with the current proposal for now, but make an issue for smarter PVF garbage collection in future.
Why? They would not pass the pre-checking phase, assuming:
It'll be possible to pass the pre-checking but be quite slow compared with average PVF builds.
It'll occasionally be possible to pass the pre-checking on one host, but be abysmal on some host upgrade in the pipeline. We could imho ignore this risk though, so yeah maybe you're right..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As always we pay for optimizations with complexity. Ain't clear how far this should go right now of course. We could stick with the current proposal for now, but make an issue for smarter PVF garbage collection in future.
That does sound like it would introduce some complexity. I guess my question would be, is it really necessary? How much disk space does each compiled PVF actually require? How bad is it to keep old artifacts around? So far, it seems that we have not had issues with the 24-hour TTL of artifacts AFAIK, so my grug brain thinks that we shouldn't introduce unnecessary optimizations. For parathreads I could see the artifacts needing to stay around for longer - but in that case I would just have a longer TTL for those, and not worry about the extra used disk space. 😛
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All this complexity comes from one scenario: An adversary can create many relay chain forks, so they can upload one artifact on each fork. We support all forks until we know which fork survives.
We've rough consensus on artifact age so we could use duration as a proxy for finality though: We retain all artifacts compatible with the current host, so long as either the artifact is active on some relay chain fork, or else the artifact is less than 24 hours old.
We add some abandon artifact call for artifacts uploaded but not activated, or else force activation at some block height, or something like that.
It's messy to create many relay chain forks without equivocation, so the attack might already result in slashing, which maybe suffices. If you've a run of blocks, then you could've some forks without equivocations, but not too many.
It's maybe just easier to wait for finality and have some override for system parachains, not sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Raised https://github.com/paritytech/polkadot/issues/6941 to continue this discussion.
Above my comments are not critiques of the text of course, but this is a convenient place to have a discussion that clarifies some design points, and maybe that discussion will be useful here. |
Since the discussions here seem to have stalled, I have raised a couple of issues to unblock this PR. This is a small PR that aims to document the current state of affairs, so discussion about future design can happen in those issues. |
Bump. :) |
bot merge |
* master: (30 commits) update rocksdb to 0.20.1 (#7113) Reduce base proof size weight component to zero (#7081) PVF: Move PVF workers into separate crate (#7101) Companion for #13923 (#7111) update safe call filter (#7080) PVF: Don't dispute on missing artifact (#7011) XCM: Properly set the pricing for the DMP router (#6843) pvf: Update docs for PVF artifacts (#6551) Bump syn from 2.0.14 to 2.0.15 (#7093) Companion for substrate#13771 (#6983) Added Dwellir Nigeria bootnodes. (#7097) Companion for Substrate #13889 (#7063) Switch to DNS name based bootnodes for Rococo (#7040) companion for substrate#13883 (#7059) [xcm] Added `UnpaidExecution` instruction to `UnpaidRemoteExporter` (#7091) Bump serde_json from 1.0.85 to 1.0.96 (#7072) Bump hex-literal from 0.3.4 to 0.4.1 (#7071) Small simplification (#7089) [XCM - UnpaidRemoteExporter] Remove unreachable code (#7088) sync versions with current release (#7083) ...
* master: (39 commits) malus: dont panic on missing validation data (#6952) Offences Migration v1: Removes `ReportsByKindIndex` (#7114) Fix stalling dispute coordinator. (#7125) Fix rolling session window (#7126) [ci] Update buildah command and version (#7128) Bump assigned_slots params (#6991) XCM: Remote account converter (#6662) Rework `dispute-coordinator` to use `RuntimeInfo` for obtaining session information instead of `RollingSessionWindow` (#6968) Revert default proof size back to 64 KB (#7115) update rocksdb to 0.20.1 (#7113) Reduce base proof size weight component to zero (#7081) PVF: Move PVF workers into separate crate (#7101) Companion for #13923 (#7111) update safe call filter (#7080) PVF: Don't dispute on missing artifact (#7011) XCM: Properly set the pricing for the DMP router (#6843) pvf: Update docs for PVF artifacts (#6551) Bump syn from 2.0.14 to 2.0.15 (#7093) Companion for substrate#13771 (#6983) Added Dwellir Nigeria bootnodes. (#7097) ...
Closes #3581 (no fix needed there, see comments. Just updating the docs).