PVF: Avoid clearing the artifacts cache on restart #685

mrcnski · 2023-03-23T12:27:11Z

ISSUE

Overview

Avoid clearing the artifacts cache on restart when the host did not change

Issue extracted from discussion in paritytech/polkadot#6551 (review). Let's continue here to unblock that PR.

Previous discussion

We should imho avoid clearing the artifacts cache when the host did not change. We could recompile only the parachain blocks when the host version did change, then lazily recompile the parathread ones.

We should've timings of course, but we'll never stop people building wasm blobs that screw up build times intentionally.

Interpreting kinda works. We have consensus upon who gets compiled vs interpreted, so interpreted then runs with different approval time parameters. We could similarly adjust approval parameters to include recompiling parathreads each block. This makes parathreads more expensive and second class though. We could've parathreads that "buy" being compiled in advance like parachains.

We do still have everyone compile the parathread when the PVF initially gets uploaded though, yes? I'd think this suggests parathreads and parachains should all be precompiled, which just makes uploading a PVF more expensive. Implicitly then host upgrades become relatively more expensive, but this makes sense too.

Originally posted by @burdges in paritytech/polkadot#6551 (comment)

We should imho avoid clearing the artifacts cache when the host did not change. We could recompile only the parachain blocks when the host version did change, then lazily recompile the parathread ones.

[...] I actually don't see why we want to clear the artifacts cache. On start-up, we could instead re-populate the Artifacts table from the compiled artifacts on disk -- the PVF hash should already be in the filename -- and re-start the 24-hour TTL timers for each artifact. Or we could even use the system's last-modified/accessed metadata for the files (with some sanity checks). Then instead of lazily re-compiling the PVFs, we would lazily delete the ones we end up not needing, which seems a lot more efficient. 🙂

Originally posted by @mrcnski in paritytech/polkadot#6551 (comment)
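
To make the start-up idea above concrete, here is a minimal Rust sketch of rebuilding the table from file names and restarting the 24-hour TTL. The `artifact_<hex>` file-name layout, the 64-character hash length, and both function names are assumptions made for illustration only; they are not taken from the node's code.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Default time-to-live mentioned in the discussion (24 hours).
const ARTIFACT_TTL: Duration = Duration::from_secs(24 * 60 * 60);

/// Hypothetical file-name prefix; the real layout may differ.
const PREFIX: &str = "artifact_";

/// Extracts the hex-encoded PVF hash from a file name, if the name is well-formed.
/// Assumes (for this sketch) a 32-byte hash written as 64 hex characters.
fn pvf_hash_from_file_name(name: &str) -> Option<&str> {
    let hex = name.strip_prefix(PREFIX)?;
    let ok = hex.len() == 64 && hex.bytes().all(|b| b.is_ascii_hexdigit());
    ok.then_some(hex)
}

/// Rebuilds the in-memory table from the file names found in the cache directory.
/// Every recognized artifact gets a fresh expiry time; unknown files are ignored.
fn rebuild_table<'a>(
    file_names: impl IntoIterator<Item = &'a str>,
    now: Instant,
) -> HashMap<String, Instant> {
    file_names
        .into_iter()
        .filter_map(pvf_hash_from_file_name)
        .map(|hash| (hash.to_owned(), now + ARTIFACT_TTL))
        .collect()
}
```

In a real node the names would come from something like `std::fs::read_dir` on the cache directory; the sketch takes them as plain strings so the parsing and TTL logic can be checked in isolation.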

[...] Do I understand correctly that the proposal is to keep artifacts only if the node is not upgraded? In that case, it might work. But we should always purge the artifacts if the node is upgraded.

Originally posted by @s0me0ne-unkn0wn in paritytech/polkadot#6551 (comment)

Related issue

https://github.com/paritytech/polkadot/issues/6941 (PVF preparation in advance)


s0me0ne-unkn0wn · 2023-03-23T13:58:45Z

As far as I understand, the main concern about not clearing the artifact cache is:

Unsafe functions are used to import the artifact
Their safety guarantees rely on the fact that the node has produced them with prepare() and they are not just some random files put there by someone.

Not that we can guarantee nobody will overwrite them during the node run time. I believe the cache pruning was implemented as a measure to mitigate those unwanted outcomes. But I don't have a strong opinion if it makes any sense. If someone wants to screw things up, he will.

bkchr · 2023-03-24T16:44:29Z

This will then require that we don't dispute when loading from the cache fails (which should be implemented nevertheless) and that invalid cache values are being recreated.

burdges · 2023-03-25T17:54:58Z

I'm curious, how deterministic are artifact builds?

s0me0ne-unkn0wn · 2023-03-25T22:59:34Z

I'm curious, how deterministic are artifact builds?

Right now, they are not deterministic at all, but we're actively fighting for that. There are several levels of non-determinism, and we're trying to find a way to handle each of them.

1. Non-determinism inside a single Wasmtime version on a single platform. When doing register allocation, Cranelift shuffles registers. We don't care much about this one. From the point of view of the CPU, general-purpose registers are equal. So we don't care which one is used for a concrete value. But that means that on two different nodes, the same Wasm code compiled with the same version of Wasmtime on the same platform produces different artifacts; that is, their execution result is the same, but when compared byte for byte, they are different. We're taking that into account (e.g., we're not trying to calculate artifact hashes and compare them), but we're not trying to avoid this non-determinism.
2. Non-determinism between different Wasmtime versions. That's something that should be handled by Explicit versioning of PVF execution environment #917. Right now, when we upgrade the Wasmtime version, we're just praying for the best. Nobody really guarantees that a function compiled with Wasmtime version N would use the same amount of native stack as a function compiled with version N-1. Thus, having half of the nodes upgraded to a new version and half of the nodes on an old version, we can theoretically get a situation where one half of the nodes executes the PVF successfully and the other half fails. Executor environment parametrization is the first step to mitigate that; the next will be versioning, which will hopefully allow us to eliminate this non-determinism completely.
3. Non-determinism between Wasmtimes with different environment semantics. It's not a problem right now as we don't change environment semantics, but we want to be able to. That is why execution environment parametrization was introduced in the first place in Executor Environment parameterization polkadot#6161. For now, this type of non-determinism is hopefully adequately handled.
4. Non-determinism between different platforms. There's no clear path to handling that yet, but we don't care much right now as the vast majority of validators are running on the same platform.

burdges · 2023-03-26T06:57:13Z

Around 1, is there some need for randomness like to prevent the PVF doing rowhammer attacks upon nodes or whatever? If so, we could select the epoch randomness from the epoch after the PVF was uploaded, so the flow goes: PVF uploaded during epoch n. Wait for randomness during epoch n+1. Artifact builds start in epoch n+2 after the end of epoch n+1 is finalized. We could optimize this flow considerably if you actually want this.

Around 3 & 4, there are not so many combinations here really, assuming 1 gets solved, right? If so, then we could sign a hash of our artifact build when we vote for the PVF. I'm not sure if this really helps that much, but it's maybe useful information when debugging, and maybe relevant elsewhere. I donno..

mrcnski · 2023-03-26T10:07:56Z

This will then require that we don't dispute when loading from the cache fails (which should be implemented nevertheless) and that invalid cache values are being recreated.

Yeah, this can already happen if the artifact goes missing for some reason. We assume that if we prepared an artifact, it remains there on-disk until we prune it, i.e. we never check again if it's still there. We can change it so that instead of artifact-not-found triggering a dispute, we retry once (like we do for AmbiguousWorkerDeath). And when enqueuing an execute job we check for the artifact on-disk, and start preparation if not found.

Of course, on node restart we would just recreate the list of known artifacts based on what's on-disk. But someone could still wipe the cache between that point and an execution attempt, which indeed would lead to disputes.
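
A rough Rust sketch of the retry-once behaviour described above, under the assumption that a missing artifact is a local problem that should trigger re-preparation rather than a dispute. `ExecuteError`, `execute_with_retry`, and the two closures are hypothetical names for illustration; they do not mirror the actual PVF host API.

```rust
/// Hypothetical outcome of one execution attempt.
#[derive(Debug, PartialEq)]
enum ExecuteError {
    /// The compiled artifact was not found on disk (a local issue).
    ArtifactMissing,
    /// The candidate itself is invalid.
    InvalidCandidate,
}

/// Runs `execute`; if the artifact is missing, calls `prepare` and retries exactly once.
/// Any other result (success or a different error) is returned unchanged.
fn execute_with_retry<P, E>(mut prepare: P, mut execute: E) -> Result<(), ExecuteError>
where
    P: FnMut(),
    E: FnMut() -> Result<(), ExecuteError>,
{
    match execute() {
        Err(ExecuteError::ArtifactMissing) => {
            prepare(); // rebuild the artifact from the PVF
            execute() // second and final attempt
        }
        other => other,
    }
}
```

If the second attempt still reports a missing artifact, the caller gets `ArtifactMissing` back and can treat it as an internal error instead of a verdict on the candidate.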

Not that we can guarantee nobody will overwrite them during the node run time. I believe the cache pruning was implemented as a measure to mitigate those unwanted outcomes. But I don't have a strong opinion if it makes any sense. If someone wants to screw things up, he will.

Yep, kind of similar to above point, as someone can screw it up by just modifying/deleting files. But the concern of overwritten files can also be mitigated. One simple way is to keep track of exact file modified time, and we could theoretically even make this persist across restart. (Though at that point we should move to using a persistent database.) When loading a file we make sure the times match what we expect. Someone would really have to be intentionally sabotaging their own system to change the file attributes.
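
For illustration, the modified-time check could look roughly like this in Rust, using only `std::fs::metadata` and `Metadata::modified`. The function names are made up for this sketch, and where the recorded time is stored (in memory or a persistent database) is left open.

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::time::SystemTime;

/// Reads the modification time to remember right after an artifact is written.
fn recorded_mtime(path: &Path) -> io::Result<SystemTime> {
    fs::metadata(path)?.modified()
}

/// Returns `true` only if the file still carries the modification time we recorded.
/// A missing file surfaces as an `Err`, so the caller can re-prepare the artifact.
fn is_untouched(path: &Path, expected: SystemTime) -> io::Result<bool> {
    Ok(fs::metadata(path)?.modified()? == expected)
}
```

This only detects accidental overwrites; anyone who deliberately resets the file attributes would still get past it, as noted above.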

And speaking of time attributes, we could also use the last-read attribute to recompute the TTL of artifacts on restart.

bkchr · 2023-03-26T19:10:06Z

Yeah, this can already happen if the artifact goes missing for some reason. We assume that if we prepared an artifact, it remains there on-disk until we prune it, i.e. we never check again if it's still there. We can change it so that instead of artifact-not-found triggering a dispute, we retry once (like we do for AmbiguousWorkerDeath). And when enqueuing an execute job we check for the artifact on-disk, and start preparation if not found.

A dispute should never be raised if the local cache doesn't provide a certain artifact. You can not dispute based on this reason, as it is a local hardware issue and not related to the candidate to check.

burdges · 2023-03-26T23:41:50Z

Yes, we believe we, or someone like us, built the artifact in the past. We'll de facto no-show if we cannot rebuild the artifact, which sounds fine I guess.

s0me0ne-unkn0wn · 2023-03-27T08:15:54Z

Around 1, is there some need for randomness

Well, currently I'm not even sure we can influence the randomness inside Cranelift, but the idea is interesting, I'll investigate.

Around 3 & 4, there are not so many combinations here really, assuming 1 gets solved, right?

I hope 3) is not a problem anymore, and 4) is not a problem yet. But if we see at some point in time that people try to run validators on ARM or whatever, we'll need to find a way to handle it properly.

And speaking of time attributes, we could also use the last-read attribute to recompute the TTL of artifacts on restart.

Do you mean the filesystem's "last access" attribute? It cannot be relied on. In server configurations, its usage is often switched off through filesystem mounting options to increase the filesystem's throughput.

mrcnski · 2023-03-27T10:33:27Z

A dispute should never be raised if the local cache doesn't provide a certain artifact. You can not dispute based on this reason, as it is a local hardware issue and not related to the candidate to check.

Ah yeah, absolutely. Raised paritytech/polkadot#6959 as this should be fixed regardless.

Do you mean the filesystem's "last access" attribute? It cannot be relied on. In server configurations, its usage is often switched off through filesystem mounting options to increase the filesystem's throughput.

Maybe we can use it if present, and otherwise just reset the artifact to the default TTL.

Alternatively, we could address this, as well as data integrity concerns, by storing last time accessed and a hash of the file contents + last time accessed, in the file name itself.
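
A small Rust sketch of what encoding this in the file name might look like. Everything here is an assumption for illustration: the `<last_accessed>_<checksum>` layout, the function names, and especially the checksum, which is a hand-rolled 64-bit FNV-1a used as a stand-in so the example needs no dependencies. A real implementation would presumably use a cryptographic hash and also keep the PVF hash in the name.

```rust
const FNV_OFFSET: u64 = 0xcbf2_9ce4_8422_2325;
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

/// 64-bit FNV-1a over `bytes`, continuing from `state` (stand-in checksum, not cryptographic).
fn fnv1a(state: u64, bytes: &[u8]) -> u64 {
    bytes
        .iter()
        .fold(state, |h, b| (h ^ u64::from(*b)).wrapping_mul(FNV_PRIME))
}

/// Builds `<last_accessed>_<checksum>`, where the checksum covers the file
/// contents followed by the last-accessed time.
fn encode_name(last_accessed_secs: u64, contents: &[u8]) -> String {
    let checksum = fnv1a(fnv1a(FNV_OFFSET, contents), &last_accessed_secs.to_le_bytes());
    format!("{last_accessed_secs}_{checksum:016x}")
}

/// Returns the stored access time if the name's checksum matches the contents.
fn verify_name(name: &str, contents: &[u8]) -> Option<u64> {
    let (secs, _) = name.split_once('_')?;
    let secs: u64 = secs.parse().ok()?;
    (encode_name(secs, contents) == name).then_some(secs)
}
```

Updating the access time then means renaming the file, which sidesteps the filesystem's own "last access" attribute entirely.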

burdges · 2023-03-27T14:38:40Z

I'm not even sure we can influence the randomness inside Cranelift

As a hack, you could cargo patch getrandom to replace OsRng by thread local SeedableRng which you seed, but maybe only if you've some deterministic thread identifier, otherwise all threads return the same, or else their startup might create non-determinism. Just please never do this inside code that signs anything. lol
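
To show just the "thread-local generator seeded with a deterministic thread identifier" part of this hack, here is a dependency-free Rust sketch. It does not patch getrandom and does not use the rand crate; it uses a SplitMix64 step as a stand-in generator, and `seed_thread`, `next_u64`, and the way the thread index is mixed into the seed are all assumptions for illustration.

```rust
use std::cell::Cell;

/// Odd constant used by SplitMix64 as its state increment.
const GOLDEN_GAMMA: u64 = 0x9E37_79B9_7F4A_7C15;

thread_local! {
    /// Per-thread generator state; a stand-in for a seeded RNG.
    static STATE: Cell<u64> = Cell::new(0);
}

/// Seeds the current thread's generator from a shared seed and a deterministic thread index.
/// Without the index, every thread would produce the same stream.
fn seed_thread(global_seed: u64, thread_index: u64) {
    STATE.with(|s| s.set(global_seed ^ thread_index.wrapping_mul(GOLDEN_GAMMA)));
}

/// SplitMix64 step: advances this thread's state and returns the next value.
fn next_u64() -> u64 {
    STATE.with(|s| {
        let x = s.get().wrapping_add(GOLDEN_GAMMA);
        s.set(x);
        let z = (x ^ (x >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        let z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    })
}
```

Re-seeding with the same pair reproduces the same sequence, which is the property the hack relies on; as said above, nothing like this belongs in code that signs anything.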

As an aside, I find this cap-rand crate bizarre btw, like you might derandomize something, but it should always be secure to give out system randomness, so restricting it as "ambient authority" makes no sense.

s0me0ne-unkn0wn · 2023-03-27T21:34:49Z

Maybe we can use it if present, and otherwise just reset the artifact to the default TTL.

We cannot tell if it's active or not ☹️ It's always present, just not getting updated if switched off.

If we want to be on the safe side, it makes sense to implement @koute's proposal to add version number to the artifact path, it solves all the possible issues

mrcnski · 2023-03-28T16:33:15Z

If we want to be on the safe side, it makes sense to implement @koute's proposal to add version number to the artifact path, it solves all the possible issues

I don't think it solves all the data integrity issues that have been brought up, though IMO those are almost orthogonal to this issue. I raised a separate ticket: #677. And IMO, to keep the scope low here we should just reset the artifact TTL to the default on restart.

For this issue we can add the version number to the filename, simple enough. Did I miss any other objections to clearing the cache?
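
A minimal Rust sketch of the version-in-filename idea, assuming a made-up `artifact_v<version>_<pvf-hash>` layout. The function names and the layout are hypothetical; they only illustrate how a node could keep artifacts built by its own version and prune the rest at start-up.

```rust
/// Builds an artifact file name that embeds the node version (hypothetical layout).
fn artifact_file_name(node_version: &str, pvf_hash_hex: &str) -> String {
    format!("artifact_v{node_version}_{pvf_hash_hex}")
}

/// Returns `true` if the file was not produced by `node_version` and should be pruned.
/// Files that do not follow the layout at all are treated as stale too.
fn is_stale(file_name: &str, node_version: &str) -> bool {
    let prefix = format!("artifact_v{node_version}_");
    !file_name.starts_with(&prefix)
}
```

With this, a restart without an upgrade keeps the whole cache, while an upgraded node sees every old artifact as stale and recompiles lazily.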

Polkadot-Forum · 2023-03-28T17:15:52Z

This issue has been mentioned on Polkadot Forum. There might be relevant details there:

https://forum.polkadot.network/t/exploring-alternatives-to-wasm-for-smart-contracts/2434/11


mrcnski added I10-optimisation labels Mar 23, 2023

This was referenced Aug 24, 2023

PVF: Preparation in advance #684

Closed

pvf: Update docs for PVF artifacts paritytech/polkadot#6551

Merged

mrcnski mentioned this issue Mar 27, 2023

PVF: Don't dispute on missing artifact paritytech/polkadot#6959

Closed

mrcnski mentioned this issue Mar 28, 2023

PVF: Compromised artifact file integrity can lead to disputes #677

Open

mrcnski mentioned this issue Apr 5, 2023

PVF: Don't dispute on missing artifact paritytech/polkadot#7011

Merged


Sophia-Gold transferred this issue from paritytech/polkadot Aug 24, 2023

mrcnski mentioned this issue Aug 24, 2023

PVF: include polkadot's version into the artifact path #695

Closed

the-right-joyce added I9-optimisation, T8-parachains_engineering and removed I10-optimisation labels Aug 25, 2023

eagr mentioned this issue Oct 9, 2023

Include polkadot version in artifact path #1828

Merged

bkchr pushed a commit that referenced this issue Oct 15, 2023

Include polkadot version in artifact path (#1828)

9e14470

closes #695 Could potentially be helpful to preserving caches when applicable, as discussed in #685 kusama address: FvpsvV1GQAAbwqX6oyRjemgdKV11QU5bXsMg9xsonD1FLGK

eagr mentioned this issue Oct 17, 2023

Preserve artifact cache unless stale #1918

Merged

the-right-joyce added this to parachains team board Oct 18, 2023

the-right-joyce moved this to Backlog in parachains team board Oct 18, 2023

the-right-joyce removed the T8-parachains_engineering label Oct 23, 2023

s0me0ne-unkn0wn moved this from Backlog to In Progress in parachains team board Nov 2, 2023

mrcnski closed this as completed in #1918 Nov 19, 2023

bkchr mentioned this issue Jan 5, 2024

Disputes raised due to RuntimeConstruction error in pvf execution #2863

Closed

eskimor moved this from In Progress to Completed in parachains team board Jan 30, 2024
