Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PVF: Research the use of CPU virtualization for PVF execution #652

Open
sandreim opened this issue Apr 18, 2023 · 15 comments
Open

PVF: Research the use of CPU virtualization for PVF execution #652

sandreim opened this issue Apr 18, 2023 · 15 comments

Comments

@sandreim
Copy link
Contributor

For an improved security posture, we should consider running the PVF on a KVM virtualized CPU. As I/O is not required and communication with the host can happen via a memory mapped region there shouldn't be any performance degradation even in the case of nested virtualization.

@bkchr
Copy link
Member

bkchr commented Apr 19, 2023

Sounds like this could be easier than using seccomp?

@sandreim
Copy link
Contributor Author

No, it is not. Hardware virtualization provides far better security than any other software sandboxing technology. AWS , GCP and other cloud providers are using KVM to virtualize their compute and memory.

@sandreim
Copy link
Contributor Author

sandreim commented Apr 19, 2023

Building a PoC would help to understand any shortcomings in the context our very specific usecase. We don't even need device emulation or a real VM, just a single virutalized CPU with a few memory regions. We could easily fork https://github.com/firecracker-microvm/firecracker and strip it down to our basic needs.

@bkchr
Copy link
Member

bkchr commented Apr 19, 2023

No, it is not. Hardware virtualization provides far better security than any other software sandboxing technology

Yeah for sure, I know :D What I meant is that doing this is much better then trying to predict what kind of syscalls we are doing and prohibiting the others. I'm still afraid that this will fail at some point when we oversee some syscall.

@sandreim
Copy link
Contributor Author

sandreim commented Apr 19, 2023

Yes, I am also afraid of stalling the chain because we did not add a syscall to whitelist. seccomp is a defense in depth measure, when all else has failed and you want to reduce the blast radius of the incident.

@mrcnski
Copy link
Contributor

mrcnski commented Apr 19, 2023

No, it is not. Hardware virtualization provides far better security than any other software sandboxing technology

Yeah for sure, I know :D What I meant is that doing this is much better then trying to predict what kind of syscalls we are doing and prohibiting the others. I'm still afraid that this will fail at some point when we oversee some syscall.

That's definitely a concern. I've been putting in a lot of work to mitigate that. Using @koute's script at build-time should mostly prevent even building a binary that contains disallowed syscalls. (In practice though, the syscalls that are actually used by the worker threads are quite few (about a dozen).)

But yeah, we want additional measures because just blocking syscalls is not enough -- if an attacker can break out of the WASM sandbox they probably can also get out-of-bounds memory, which essentially gives them a source of randomness. So say they make the worker job vote against with 50% chance, that would stall the chain.

@koute
Copy link
Contributor

koute commented Apr 19, 2023

Note that AFAIK seccomp should essentially work out-of-box everywhere while KVM might require some extra setup from the users (e.g. some distros disallow access to KVM and require the user to be added to a special kvm group), or could not work at all on machines without support for hardware virtualization (not sure how common is that nowadays).

But you're right that hardware virtualization would be technically more secure.

It should be worth it to make a proof of concept and test it out in practice.

if an attacker can break out of the WASM sandbox they probably can also get out-of-bounds memory, which essentially gives them a source of randomness

Hmm... preventing the attacker from acquiring a source of randomness is going to be tricky. In presence of remote code execution it's not possible to disallow access to a source of randomness. Even if you disallow things like creating threads or measuring time through seccomp (although IIRC grabbing the time on amd64 goes through the vdso shim, so it might not be even possible to sandbox that with seccomp as no syscalls are involved) you still have e.g. the rdrand hardware instruction which the attacker could execute to just ask the CPU directly for some random bytes. AFAIK the only way to prevent that is to use virtualization and set the appropriate VMX bit to make the VM abort when that's called. There might be more corner cases here.

@sandreim
Copy link
Contributor Author

Note that AFAIK seccomp should essentially work out-of-box everywhere while KVM might require some extra setup from the users (e.g. some distros disallow access to KVM and require the user to be added to a special kvm group), or could not work at all on machines without support for hardware virtualization (not sure how common is that nowadays).

AFAIK it is widely supported, and if it is not the validator can run without it. Indeed it requires some extra setup, but this is a small price to pay for the increased security. It could easily be part of the validator setup guide.

But you're right that hardware virtualization would be technically more secure.

It should be worth it to make a proof of concept and test it out in practice.

Yeah, that is something we should pursue !

if an attacker can break out of the WASM sandbox they probably can also get out-of-bounds memory, which essentially gives them a source of randomness

Hmm... preventing the attacker from acquiring a source of randomness is going to be tricky. In presence of remote code execution it's not possible to disallow access to a source of randomness. Even if you disallow things like creating threads or measuring time through seccomp (although IIRC grabbing the time on amd64 goes through the vdso shim, so it might not be even possible to sandbox that with seccomp as no syscalls are involved) you still have e.g. the rdrand hardware instruction which the attacker could execute to just ask the CPU directly for some random bytes. AFAIK the only way to prevent that is to use virtualization and set the appropriate VMX bit to make the VM abort when that's called. There might be more corner cases here.

What we would need is to develop and maintain a CPU template that customizes what we expose in CPUID and allow. An example, which is for general purpose VMs can be seen here: https://github.com/firecracker-microvm/firecracker/tree/main/src/vmm/src/cpuid . AFAIK this was not really supported on aarch64, but things might have changed. In our case we would have something very restrictive, or maybe we want to enable some CPU instruction set extensions that WASM can use for increased performance.

@bkchr
Copy link
Member

bkchr commented Apr 23, 2023

It should be worth it to make a proof of concept and test it out in practice.

@mrcnski I would highly recommend that this is done before moving forward with the seccomp implementation.

@koute
Copy link
Contributor

koute commented Apr 24, 2023

@mrcnski I would highly recommend that this is done before moving forward with the seccomp implementation.

Yep. But I think we still can first do the work of splitting the worker into a separate binary (and stripping it as much as possible) without necessarily sandboxing it yet, as that will be necessary regardless of which approach we pick.

@mrcnski
Copy link
Contributor

mrcnski commented Apr 24, 2023

@mrcnski I would highly recommend that this is done before moving forward with the seccomp implementation.

Agreed. I already have much of the seccomp logging implemented, so IMO it makes sense to finish that before a big context switch. And then the logging can run on validators for a while while I work on virtualization. And yeah, I will first split out the worker binaries (without musl-builder for now).1

Footnotes

  1. Not having musl may make the syscalls less deterministic in theory, but in practice very few are triggered and the logging may show us that musl is not strictly needed.

@bkchr
Copy link
Member

bkchr commented Apr 24, 2023

Okay ty!

@sandreim
Copy link
Contributor Author

@mrcnski I would highly recommend that this is done before moving forward with the seccomp implementation.

Agreed. I already have much of the seccomp logging implemented, so IMO it makes sense to finish that before a big context switch. And then the logging can run on validators for a while while I work on virtualization. And yeah, I will first split out the worker binaries (without musl-builder for now).1

Footnotes

  1. Not having musl may make the syscalls less deterministic in theory, but in practice very few are triggered and the logging may show us that musl is not strictly needed.

IMO virtualization PoC is a big rock to push uphill. We should shoud first enable the logging to at least collect some data while we work on the PoC.

@Polkadot-Forum
Copy link

This issue has been mentioned on Polkadot Forum. There might be relevant details there:

https://forum.polkadot.network/t/ux-of-distributing-multiple-binaries-take-2/2854/1

@alindima
Copy link
Contributor

alindima commented Sep 8, 2023

I like this idea and I think it'll provide far superior than any process-level sandboxing indeed 👍🏻 but as Andrei said, it'll be challenging.

What we would need is to develop and maintain a CPU template that customizes what we expose in CPUID and allow. An example, which is for general purpose VMs can be seen here: https://github.com/firecracker-microvm/firecracker/tree/main/src/vmm/src/cpuid . AFAIK this was not really supported on aarch64, but things might have changed. In our case we would have something very restrictive, or maybe we want to enable some CPU instruction set extensions that WASM can use for increased performance.

Actually, even if a given CPUID template is set (which doesn't advertise a certain instruction), a guest program can still try to call that instruction (and it'll work). CPUID only helps with non-malicious guests, to provide a common view of the CPU features to all guests.
We'd still need to set the right VMX bits to trap on instructions like RDRAND, on top of CPUID masking.

helin6 pushed a commit to boolnetwork/polkadot-sdk that referenced this issue Feb 5, 2024
* Unify rpc api naming

Signed-off-by: koushiro <[email protected]>

* Add some docs

Signed-off-by: koushiro <[email protected]>

* Make test more stable

Signed-off-by: koushiro <[email protected]>
bkchr pushed a commit that referenced this issue Apr 10, 2024
* Revert "Pin Rust Nightly to 2020-12-17 (#652)"

This reverts commit e54e6f7.

* fix clippy

* clippy again

* more clippy in test code

* and new cargo fmt

* another try
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: No status
Development

No branches or pull requests

7 participants