PVF worker: restrict networking #7334

mrcnski · 2023-06-05T15:59:36Z

Overview

For security we restrict networking by blocking the creation of sockets with seccomp. See explanation on the related issue.

Still has some TODOs that should be resolved, but it already works.

Not to my knowledge. It's what makes seccomp notoriously hard to use, and why landlock is generally better (or will be when it's fully implemented; it kind of does what you're asking for). But the benefit of this PR over https://github.com/paritytech/polkadot/issues/4718 is that this has a much smaller scope - it's only two syscalls. So, less chance of something going wrong until we have the full sandbox. And I proposed we do something like the following to make it even less brittle:

For example, we can set up a regular job that downloads the syscall table from the latest kernel version and makes sure no new syscalls containing "socket" were added. However, the risk of a new syscall appearing, and libc libraries using it, in the immediate-term is very low, so I think it shouldn't block this development.
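
A rough sketch of the check such a job could run, assuming the syscall names have already been extracted from the kernel's table into a plain list. The function name and the `KNOWN_SOCKET_SYSCALLS` list are made up for illustration, and `socket2` in the usage note is a hypothetical future syscall, not a real one.

```rust
use std::collections::BTreeSet;

/// Socket-creating syscalls the filter already covers (the two blocked in this PR).
const KNOWN_SOCKET_SYSCALLS: &[&str] = &["socket", "socketpair"];

/// Returns every syscall name that contains "socket" but is not in the known list.
/// A non-empty result means someone should review the new syscall by hand.
fn unexpected_socket_syscalls<'a>(
    names: impl IntoIterator<Item = &'a str>,
) -> BTreeSet<&'a str> {
    names
        .into_iter()
        .filter(|name| name.contains("socket"))
        .filter(|name| !KNOWN_SOCKET_SYSCALLS.contains(name))
        .collect()
}
```

Feeding it `["read", "socket", "socketpair", "socket2"]` would flag only `socket2`. A substring match is deliberately crude: it catches new names for review, but it says nothing about other ways of reaching the network.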

Thanks for clarifying!

Blacklisting is indeed very problematic IMO. It's hard to keep up with the syscalls and the way they can be misused.
There may be newly discovered CVEs in currently unused syscalls, so limiting the blast radius of potentially harmless syscalls that are not necessary is useful. Otherwise, an attacker could find a way of triggering an unneeded syscall from a linked library to cause havoc.

Another thing that comes to mind is io_uring. It exposes a way of bypassing system calls for creating sockets and performing network IO. This is another example of why blacklists are not a good idea.

I see. I started working on a bigger PR with seccompiler that whitelists only the syscalls needed, but I thought this solution here would be quicker to land. If it's controversial I can just close this PR and focus on the other one.

Sounds good, glad to hear we're taking the whitelist approach. This one will be tricky as well, but better in the long run.
I'm not against an incremental approach either; any sandboxing is better than none :)

koute · 2023-06-06T08:41:07Z

node/core/pvf/execute-worker/src/lib.rs

+					#[cfg(target_os = "linux")]
+					if let Err(err) = seccomp::try_restrict_thread() {
+						return (
+							Response::InternalError(InternalValidationError::Seccomp(format!(
+								"sandboxing the thread failed: {}",
+								err.to_string()
+							))),
+							landlock_status,
+						)
+					}


Hmmm... security-wise doing this here probably won't be very useful. (:

Since this only restricts a single thread, any threads (including the main thread) which were spawned before this one will still be fully unrestricted, and since all of the threads share an address space, an attacker who can execute arbitrary code could just hijack another unsandboxed thread to do what they want. Slightly annoying, but it won't stop a motivated attacker.

So 1) this needs to be done on the main thread, and 2) it needs to be done before any other threads are spawned (ideally we'd add an assertion assert_eq!(std::fs::read_dir("/proc/self/task").unwrap().count(), 1); before the sandboxing is enabled.)
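
The single-thread assertion could be wrapped up roughly as follows. This is only a sketch: `count_tasks` and `sandbox_if_single_threaded` are hypothetical helpers, the task directory is passed in as a parameter so the logic can be exercised outside `/proc`, and `enable_sandbox` stands in for whatever actually applies landlock or seccomp.

```rust
use std::{fs, io, path::Path};

/// Counts the entries of a task directory such as `/proc/self/task`,
/// where Linux exposes one entry per thread of the process.
fn count_tasks(task_dir: &Path) -> io::Result<usize> {
    let mut count = 0;
    for entry in fs::read_dir(task_dir)? {
        entry?; // surface I/O errors instead of silently skipping entries
        count += 1;
    }
    Ok(count)
}

/// Runs `enable_sandbox` only if exactly one thread exists, so that no
/// earlier-spawned thread is left unrestricted.
fn sandbox_if_single_threaded(
    task_dir: &Path,
    enable_sandbox: impl FnOnce() -> io::Result<()>,
) -> io::Result<()> {
    let threads = count_tasks(task_dir)?;
    if threads != 1 {
        return Err(io::Error::new(
            io::ErrorKind::Other,
            format!("expected exactly 1 thread before sandboxing, found {threads}"),
        ));
    }
    enable_sandbox()
}
```

In the worker this would be called with `Path::new("/proc/self/task")` on the main thread, before anything else is spawned. Returning an error instead of asserting is a design choice of the sketch, not something the review asked for.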

Interesting. We can add exceptions for both landlock and seccomp (birdcage adds exceptions for local sockets). For FS, the main thread only needs the cached artifact directory, so we can add a landlock exception for it. But I didn't do this because I would prefer attackers not being able to read/modify other artifacts. We could spawn a new process for each job and fully restrict it, but that would be more overhead and complication.

I would want to make sure that is actually needed though. How is hijacking other threads done? Why do landlock and seccomp allow restricting per-thread, if it can be worked around?

Also, if this sandboxing can stop simple attacks it's still worth having until we have full sandboxing.

I would want to make sure that is actually needed though. How is hijacking other threads done?

There are many ways to do it. With no extra sandboxing the simplest one would be to just 1) stop the thread, 2) set its rip to where you want the thread to go through ptrace, 3) let it run again. But even if you'd sandbox all syscalls it's still possible to do it simply by just reading and writing memory (e.g. just overwrite the return address of a function call which the other thread is coincidentally executing; it takes a little bit of skill to pull it off, but it's possible)

Why do landlock and seccomp allow restricting per-thread, if it can be worked around?

Because if they'd work on a per-process basis then that could result in race conditions. (:

Any spawned threads inherit the original thread's sandboxing, so what everyone does is that they first set up the sandboxing, and then spawn their threads. (And then optionally apply extra blacklists on all threads so that no further threads can be spawned.)

Also, if this sandboxing can stop simple attacks it's still worth having until we have full sandboxing.

For stopping simple attacks I think wasmtime's sandbox is enough of a barrier. If someone is skilled enough to find and exploit a remote code execution hole in wasmtime then I'm pretty sure they won't be stopped by this. (:

Also, if this sandboxing can stop simple attacks it's still worth having until we have full sandboxing.

Well, I mean, it's always good to lock your front door, but if you'll leave the back door unlocked it's not going to stop anyone. (: (But regardless extra defense in depth is always good to have.)

Full process isolation sounds like the safest approach to me. The point of threads is sharing an address space, in this case we really don't want that, so we should be using a process.

Full process isolation sounds like the safest approach to me. The point of threads is sharing an address space, in this case we really don't want that, so we should be using a process.

Indeed. The issue here is that this is already running in a separate process (the worker process), but not the whole worker process is sandboxed.

Maybe some restructuring is in order, where we have a clear (process) separation between what needs to be sandboxed and what is not. On the flip side the unsandboxed stuff might not need to be its own process?

@koute so you propose to restrict the entire worker process? Wouldn't it be more logical to do that after the worker binary separation, then? Seems like it would be more straightforward, as right now, we'd need to apply sandboxing in the polkadot-cli root, and later we'd need to move it into some worker-cli root anyway?

If we want this to be secure then we must restrict the entire process. As to whether to do this now or later, well, I guess it's up to you. But as-is right now the sandboxing effectively doesn't do anything security-wise, so we shouldn't say it does. (:

Maybe some restructuring is in order, where we have a clear (process) separation between what needs to be sandboxed and what is not. On the flip side the unsandboxed stuff might not need to be its own process?

Yeah, and it might also be more secure to limit processes to running one job instead of multiple jobs in sequence. See e.g. https://github.com/paritytech/polkadot/issues/7232#issuecomment-1549706304. Unfortunately it would be added overhead for each job to spawn and tear down a new process. We would also need to whitelist the artifacts directory from sandboxing.

I would wait on a response to landlock-lsm/rust-landlock#37. Maybe this is something that was already considered during development of landlock as I have not seen any caveats about this.

Update on this: @koute was right (of course), I've raised https://github.com/paritytech/polkadot/issues/7497 and started working on it.

koute · 2023-06-06T08:48:46Z

node/core/pvf/common/src/worker/security.rs

+		blacklisted_rules.insert(libc::SYS_socketpair, vec![]);
+		blacklisted_rules.insert(libc::SYS_socket, vec![]);


Hm... there could still be a way around this if there's an existing socket already created; the attacker could then probably just reconfigure/repurpose it. (I've never tried that but I suppose it should work?) Perhaps we could blacklist all socket-related syscalls? setsockopt, getsockopt, connect, accept, listen, bind, sendto, recvfrom, sendmsg, recvmsg, sendmmsg, recvmmsg, and I also think there were some compat_* variants of some of these.

The worker thread communicates with the host through a unix socket, I think restricting send/recv would break that communication?

If it uses those syscalls then yes; I don't remember if it does, or if it just uses normal read/write.

(Doing the communication through a shared memory map + futex instead of a socket would probably allow us to completely sandbox that, but it's probably not worth it.)

The thread shouldn't have access to the existing sockets unless it can go outside its address space as you mention in the other comment, right? The idea with only blocking two syscalls is to keep the scope low and minimize the chance of breakage, while we work on the main seccomp PR (right now still waiting on binary separation).

How is socket-repurposing done? Should be impossible to use them for networking if we block sockets with an exception for local sockets in the main thread (like birdcage does). Then we don't need to let through all the other syscalls.

If you want to be conservative and minimize the chance of breakage you can probably leave the send/recv/etc., accept and getsockopt syscalls unsandboxed. AFAIK blacklisting the rest shouldn't break anything as they should only be used with new sockets.
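
To make the conservative variant concrete, here is a small sketch that derives the resulting blacklist by name. It assumes "send/recv/etc." means the whole send*/recv* family from the list earlier in this thread; the constants and the function are illustrative only and do not touch seccompiler.

```rust
/// Socket-related syscalls named earlier in this review thread.
const SOCKET_SYSCALLS: &[&str] = &[
    "setsockopt", "getsockopt", "connect", "accept", "listen", "bind",
    "sendto", "recvfrom", "sendmsg", "recvmsg", "sendmmsg", "recvmmsg",
];

/// Syscalls the conservative variant leaves alone, because the existing
/// host/worker socket may still need them (assumed reading of "send/recv/etc.").
const LEFT_ALLOWED: &[&str] = &[
    "sendto", "recvfrom", "sendmsg", "recvmsg", "sendmmsg", "recvmmsg",
    "accept", "getsockopt",
];

/// The two syscalls already blocked by this PR plus everything that is
/// socket-related and not explicitly left allowed.
fn conservative_blacklist() -> Vec<&'static str> {
    let mut list = vec!["socket", "socketpair"];
    list.extend(
        SOCKET_SYSCALLS
            .iter()
            .copied()
            .filter(|name| !LEFT_ALLOWED.contains(name)),
    );
    list
}
```

Under that reading the blacklist grows from two entries to six: `socket`, `socketpair`, `setsockopt`, `connect`, `listen` and `bind`.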

The thread shouldn't have access to the existing sockets unless it can go outside its address space as you mention in the other comment, right?

Sockets don't reside in the process' memory per-se; they are kernel space objects to which the process holds a handle in the form of a file descriptor (which is just a normal number). The only thing necessary to access a given socket is that you know its socket number, which you can trivially get by e.g. iterating over /proc/self/fd (since filesystem access is not sandboxed) or you could even just loop through numbers from 0 to N and see if any of them are sockets (since file descriptors are not randomized).

Nevertheless, all threads have full access to all other threads' address space; or in other words, there's only a single address space - the address space of the process, which all of the threads share. That's the nature of how threads work. (:
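
The `/proc/self/fd` enumeration mentioned above takes only a few lines, which illustrates why an already-open socket is within reach of any code running in the process. A minimal sketch, Linux only, with `socket_fds` as a made-up name:

```rust
use std::{fs, io};

/// Lists the file descriptors of the current process that refer to sockets,
/// by reading the symlinks in `/proc/self/fd`.
fn socket_fds() -> io::Result<Vec<i32>> {
    let mut fds = Vec::new();
    for entry in fs::read_dir("/proc/self/fd")? {
        let entry = entry?;
        // A descriptor may have been closed between listing and inspection.
        let target = match fs::read_link(entry.path()) {
            Ok(target) => target,
            Err(_) => continue,
        };
        // Socket descriptors link to a target of the form `socket:[<inode>]`.
        if target.to_string_lossy().starts_with("socket:") {
            if let Ok(fd) = entry.file_name().to_string_lossy().parse::<i32>() {
                fds.push(fd);
            }
        }
    }
    Ok(fds)
}
```

Nothing here needs the `socket` or `socketpair` syscalls, so the two-syscall filter in this PR would not prevent it.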

koute · 2023-06-06T08:51:48Z

node/core/pvf/common/src/worker/security.rs

+/// # Action on Caught Syscall
+///
+/// TODO: Update with the action and explain why given action was chosen.
+#[cfg(target_os = "linux")]


Hmm.... we probably also want to gate this on architecture (target_arch = "x86_64").

Technically seccompiler should also support aarch64, but without us auditing aarch64's list of syscalls (which do differ between architectures) and testing that it actually works there's a chance that it'd be broken/incomplete/insecure.

seccompiler supports aarch64 and I didn't think there'd be any complication with the two socket creation syscalls. I considered gating on these two architectures, but seccompiler already does it so it would already be a compiler error on an unsupported arch.

What do you think about my idea of having a job that downloads the syscall table for these two architectures, from the latest kernel version, and checks for unexpected changes? I think we would need it eventually but it should not be a blocker for this PR as the scope is so small.

seccompiler supports aarch64 and I didn't think there'd be any complication with the two socket creation syscalls.

Yeah, probably, but we're going to extend this, right? I'd rather we either go all the way and properly support it, or not support it at all.

but seccompiler already does it so it would already be a compiler error on an unsupported arch.

It'd be nice to handle this gracefully on unsupported architectures (essentially treat it as being compiled on an unsupported OS) instead of failing to compile at all.
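
One way to get that graceful fallback is to gate on both OS and architecture and provide a stub for everything else. A sketch only: `SandboxStatus` and `try_restrict` are hypothetical names, and the supported branch is a placeholder where the real filter would be installed.

```rust
/// Outcome of trying to enable the seccomp restriction.
#[derive(Debug, PartialEq, Eq)]
enum SandboxStatus {
    /// The filter was installed.
    Enabled,
    /// This OS/architecture combination has not been audited, so nothing was done.
    Unsupported,
}

#[cfg(all(target_os = "linux", target_arch = "x86_64"))]
fn try_restrict() -> SandboxStatus {
    // Placeholder: real code would build and apply the seccomp filter here.
    SandboxStatus::Enabled
}

#[cfg(not(all(target_os = "linux", target_arch = "x86_64")))]
fn try_restrict() -> SandboxStatus {
    // Compiles everywhere, so callers can warn instead of failing to build.
    SandboxStatus::Unsupported
}
```

Callers can then log a warning on `Unsupported`, much like the existing landlock check does when landlock is unavailable.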

What do you think about my idea of having a job that downloads the syscall table for these two architectures, from the latest kernel version, and checks for unexpected changes?

You mean extract it from the Linux sources? I'm not entirely sure that it can be done with 100% reliability, but the scripts that are floating around which do that do seem like they can grab all of them from what I can see (at least for amd64). I don't think it's worth having a job like this to run for every PR, but a nightly job which runs once each day and pings us if there are any changes - sure, why not.

What do you think about my idea of having a job that downloads the syscall table for these two architectures, from the latest kernel version, and checks for unexpected changes? I think we would need it eventually but it should not be a blocker for this PR as the scope is so small.

This is doable. But what is the purpose?
There's a script in seccompiler for this: https://github.com/rust-vmm/seccompiler/blob/main/tools/generate_syscall_tables.sh, and I'm sure there are others.

Would this be used to manually check the newly added syscalls, to see if there's a newly introduced network syscall?

Would this be used to manually check the newly added syscalls, to see if there's a newly introduced network syscall?

Yeah, and my idea was to review new syscalls and determine if they should be blacklisted (I figure 99% of the time no action would be required)
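
For the nightly job, the comparison step could be as simple as diffing the syscall names between two table snapshots. The sketch below assumes input in the layout of the kernel's x86_64 `syscall_64.tbl` (`<number> <abi> <name> [<entry point>]`, with `#` comments); other architectures may need a different extraction step, and both function names are invented here.

```rust
use std::collections::BTreeSet;

/// Extracts syscall names (the third column) from table text,
/// ignoring blank lines and `#` comments.
fn syscall_names(table: &str) -> BTreeSet<&str> {
    table
        .lines()
        .map(|line| line.split('#').next().unwrap_or("").trim())
        .filter(|line| !line.is_empty())
        .filter_map(|line| line.split_whitespace().nth(2))
        .collect()
}

/// Names present in `new` but not in `old`, in sorted order.
/// These are the candidates for manual review.
fn added_syscalls<'a>(old: &'a str, new: &'a str) -> Vec<&'a str> {
    let old_names = syscall_names(old);
    syscall_names(new)
        .into_iter()
        .filter(|name| !old_names.contains(name))
        .collect()
}
```

An empty result is the expected common case; anything else would trigger the ping for review.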

koute · 2023-06-06T08:55:50Z

node/core/pvf/common/src/worker/security.rs

+
+	pub type Result<T> = std::result::Result<T, Error>;
+
+	// TODO: Compile the filter at build-time rather than runtime.


Yep. The simplest way to do it would probably be to generate the BpfProgram in build.rs, cast it to a &[u8] with slice::from_raw_parts, write it to file, and then include_bytes! it here. You could then slice::from_raw_parts it again into &'a [sock_filter] and pass it to seccompiler::apply_filter. (sock_filter is #[repr(C)] so this is safe to do)
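
A sketch of the byte round trip, with one deliberate variation: instead of reinterpreting the bytes in place with `slice::from_raw_parts`, it copies them into a `Vec`. `include_bytes!` yields a plain byte array, which carries no alignment guarantee beyond 1, so casting it directly to a slice of 4-byte-aligned structs may need extra care. `SockFilter` is a local stand-in mirroring the layout of the kernel's `sock_filter`; `encode` and `decode` are hypothetical helpers.

```rust
/// Local mirror of the kernel's `struct sock_filter` (8 bytes, `repr(C)`).
#[repr(C)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct SockFilter {
    code: u16,
    jt: u8,
    jf: u8,
    k: u32,
}

/// What build.rs would write to a file: the in-memory layout, native-endian.
fn encode(program: &[SockFilter]) -> Vec<u8> {
    let mut out = Vec::with_capacity(program.len() * 8);
    for insn in program {
        out.extend_from_slice(&insn.code.to_ne_bytes());
        out.push(insn.jt);
        out.push(insn.jf);
        out.extend_from_slice(&insn.k.to_ne_bytes());
    }
    out
}

/// What the worker would do with the `include_bytes!` data: copy it back
/// into properly aligned instructions. Returns `None` on a truncated blob.
fn decode(bytes: &[u8]) -> Option<Vec<SockFilter>> {
    if bytes.len() % 8 != 0 {
        return None;
    }
    Some(
        bytes
            .chunks_exact(8)
            .map(|c| SockFilter {
                code: u16::from_ne_bytes([c[0], c[1]]),
                jt: c[2],
                jf: c[3],
                k: u32::from_ne_bytes([c[4], c[5], c[6], c[7]]),
            })
            .collect(),
    )
}
```

Because the encoding is native-endian, the build machine and the target are assumed to share endianness, which a cross-compiling build script would have to account for.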

Yep, that's close to the instructions here: https://github.com/rust-vmm/seccompiler#compiling-filters.

For such a small filter, I don't think it's worth it.
Build-time compilation is especially helpful for large, JSON-encoded filters.

koute · 2023-06-06T08:57:29Z

node/core/pvf/common/src/worker/security.rs

+			// Restricted thread cannot open sockets.
+			let handle = thread::spawn(|| {
+				// TODO:Open a socket, this should succeed before seccomp is applied.
+				TcpListener::bind("127.0.0.1:7070").unwrap();


Suggested change

TcpListener::bind("127.0.0.1:7070").unwrap();

TcpListener::bind("127.0.0.1:0").unwrap();

Using 0 as a port will automatically pick a free one; this way this test won't randomly fail if the port's already used by something.
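
If a test ever needs to know which port it was given, the listener can report it. A minimal sketch with a made-up helper name:

```rust
use std::net::TcpListener;

/// Binds to an OS-assigned loopback port and returns the listener
/// together with the port that was actually picked.
fn bind_ephemeral() -> std::io::Result<(TcpListener, u16)> {
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let port = listener.local_addr()?.port();
    Ok((listener, port))
}
```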

Good catch, I totally forgot about that.

koute · 2023-06-06T08:57:39Z

node/core/pvf/common/src/worker/security.rs

+
+				// Try to open a socket after seccomp.
+				assert!(matches!(
+					TcpListener::bind("127.0.0.1:8080"),


Suggested change

TcpListener::bind("127.0.0.1:8080"),

TcpListener::bind("127.0.0.1:0"),

alindima · 2023-08-02T10:19:16Z

node/core/pvf/common/src/worker/security.rs

+	}
+}
+
+// TODO: Add a check for whether seccomp is supported and warn if not, like we do for landlock.


If the kernel does not support seccomp, it should return an error from the syscalls.

Calling seccomp(GET_ACTIONS_AVAIL) should be easy. You could directly call the syscall using libc::syscall and a bit of unsafe code.

However, I don't think we need this, since the actions we set here (ALLOW/KILL_PROCESS/ERRNO) have been in the kernel for a very long time (at least kernel 4.9, since we were testing Firecracker on those kernels).

If you'd really like to be safe, I can work on adding this helper function in seccompiler
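
For reference, a rough sketch of what the direct syscall could look like. Several things here are assumptions to check against your kernel headers: the operation is spelled `SECCOMP_GET_ACTION_AVAIL` in `linux/seccomp.h`, the constant values and syscall numbers (317 on x86_64, 277 on aarch64) are copied from there, and the sketch declares libc's `syscall` itself rather than using the `libc` crate, purely to stay self-contained. `action_available` and `required_actions_available` are hypothetical names.

```rust
use std::io;

// Values from `linux/seccomp.h` (assumed; verify against your headers).
const SECCOMP_GET_ACTION_AVAIL: u32 = 2;
const SECCOMP_RET_KILL_PROCESS: u32 = 0x8000_0000;
const SECCOMP_RET_ERRNO: u32 = 0x0005_0000;
const SECCOMP_RET_ALLOW: u32 = 0x7fff_0000;

#[cfg(all(target_os = "linux", any(target_arch = "x86_64", target_arch = "aarch64")))]
mod sys {
    use std::os::raw::c_long;

    #[cfg(target_arch = "x86_64")]
    pub const SYS_SECCOMP: c_long = 317;
    #[cfg(target_arch = "aarch64")]
    pub const SYS_SECCOMP: c_long = 277;
    pub const EOPNOTSUPP: i32 = 95;

    extern "C" {
        // libc's variadic syscall(2) wrapper, already linked in by std.
        pub fn syscall(num: c_long, ...) -> c_long;
    }
}

/// Asks the kernel whether it supports the given seccomp return action.
#[cfg(all(target_os = "linux", any(target_arch = "x86_64", target_arch = "aarch64")))]
fn action_available(action: u32) -> io::Result<bool> {
    let flags: u32 = 0;
    // SAFETY: the pointer refers to a live u32 for the duration of the call,
    // and this operation only reads from it.
    let rc = unsafe {
        sys::syscall(sys::SYS_SECCOMP, SECCOMP_GET_ACTION_AVAIL, flags, &action as *const u32)
    };
    if rc == 0 {
        return Ok(true);
    }
    let err = io::Error::last_os_error();
    if err.raw_os_error() == Some(sys::EOPNOTSUPP) {
        Ok(false) // the kernel understood the query; the action is not supported
    } else {
        Err(err) // e.g. the kernel predates this operation
    }
}

#[cfg(not(all(target_os = "linux", any(target_arch = "x86_64", target_arch = "aarch64"))))]
fn action_available(_action: u32) -> io::Result<bool> {
    Err(io::Error::new(io::ErrorKind::Unsupported, "seccomp query not supported here"))
}

/// Checks the three actions this PR relies on.
fn required_actions_available() -> io::Result<bool> {
    for action in [SECCOMP_RET_ALLOW, SECCOMP_RET_KILL_PROCESS, SECCOMP_RET_ERRNO] {
        if !action_available(action)? {
            return Ok(false);
        }
    }
    Ok(true)
}
```

An `Err` from the query is not the same as "unsupported action": it can also mean the kernel or a surrounding sandbox rejects the query itself, so a caller would probably warn rather than abort in that case.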

🤯 You're the seccompiler dev! Cool!

Calling seccomp(GET_ACTIONS_AVAIL) should be easy. You could directly call the syscall using libc::syscall and a bit of unsafe code.

I didn't realize it was that easy. I left this TODO because the seccompiler docs recommend it, but don't provide a way of directly doing so. I left this issue about it. ;) Sounds like this check may be unnecessary, we do require a somewhat recent kernel, but wouldn't hurt either.

🤯 You're the seccompiler dev! Cool!

😄 yep

Yep, I think it's unnecessary for this use case. It may be needed for more advanced use cases that use more exotic return actions, like SECCOMP_RET_USER_NOTIF (introduced in kernel 5.0); that's why I added the recommendation in the docs.

If you wanna make a contribution to seccompiler, I can help you out with reviews/guidance

mrcnski and others added 18 commits May 26, 2023 18:11

Begin adding landlock + test

146e6e2

Move PVF implementer's guide section to own page, document security

950add0

Implement test

e48605a

Add some docs

d1af7ee

Do some cleanup

3e5b6cd

Fix typo

39f2495

Warn on host startup if landlock is not supported

e555165

Clarify docs a bit

30178c9

Minor improvements

c096284

Add some docs about determinism

41d8d1a

Address review comments (mainly add warning on landlock error)

6524c81

Update node/core/pvf/src/host.rs

a9b2dfd

Co-authored-by: Andrei Sandu <[email protected]>

Update node/core/pvf/src/host.rs

b9d8fc1

Co-authored-by: Andrei Sandu <[email protected]>

Merge branch 'master' into mrcnski/pvf-landlock

609a82a

Fix unused fn

e9b5c17

Update ABI docs to reflect latest discussions

08d98e9

Remove outdated notes

2d72e31

PVF worker: restrict networking

7d3410b

mrcnski requested review from eskimor and s0me0ne-unkn0wn June 5, 2023 15:59

mrcnski self-assigned this Jun 5, 2023

eskimor reviewed Jun 5, 2023

mrcnski requested a review from koute June 5, 2023 16:29

koute reviewed Jun 6, 2023

s0me0ne-unkn0wn mentioned this pull request Jun 19, 2023

98.6% OF DEVELOPERS CANNOT REVIEW THIS PR! [read more...] #7337

Merged

Base automatically changed from mrcnski/pvf-landlock to master July 5, 2023 16:57

This was referenced Jul 7, 2023

Possibility of bypassing sandboxing of threads landlock-lsm/rust-landlock#37

Closed

PVF worker: apply sandboxing per-process paritytech/polkadot-sdk#600

Closed

alindima reviewed Aug 2, 2023

mrcnski mentioned this pull request Aug 4, 2023

PVF: Move landlock out of thread into process; add landlock exceptions #7580

Draft

4 tasks

This was referenced Aug 28, 2023

Potential ways to get around networking sandbox phylum-dev/birdcage#32

Closed

PVF: consider spawning a new process per job paritytech/polkadot-sdk#584

Closed

mrcnski mentioned this pull request Oct 24, 2023

PVF worker: Add seccomp restrictions (restrict networking) paritytech/polkadot-sdk#2009

Merged
