Skip to content
This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

PVF worker: restrict networking #7334

Draft
wants to merge 18 commits into
base: master
Choose a base branch
from

Conversation

mrcnski
Copy link
Contributor

@mrcnski mrcnski commented Jun 5, 2023

Overview

For security we restrict networking by blocking the creation of sockets with seccomp. See explanation on the related issue.

Still has some TODOs that should be resolved, but it already works.

Related

Target branch: #7303 (builds on top of that PR)
Closes paritytech/polkadot-sdk#619

@mrcnski mrcnski added A0-please_review Pull request needs code review. B0-silent Changes should not be mentioned in any release notes C1-low PR touches the given topic and has a low impact on builders. D3-trivial 🧸 PR contains trivial changes in a runtime directory that do not require an audit. T4-parachains_engineering This PR/Issue is related to Parachains performance, stability, maintenance. labels Jun 5, 2023
@mrcnski mrcnski self-assigned this Jun 5, 2023
blacklisted_rules.insert(libc::SYS_socketpair, vec![]);
blacklisted_rules.insert(libc::SYS_socket, vec![]);

let filter = SeccompFilter::new(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blacklisting is always brittle when it comes to security. Is whitelisting feasible? Maybe at least based on categories/wildcards?

Copy link
Contributor Author

@mrcnski mrcnski Jun 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not to my knowledge. It's what makes seccomp notoriously hard to use, and why landlock is generally better (or will be when it's fully implemented; it kind of does what you're asking for). But the benefit of this PR over https://github.com/paritytech/polkadot/issues/4718 is that this has a much smaller scope - it's only two syscalls. So, less chance of something going wrong until we have the full sandbox. And I proposed we do something like the following to make it even less brittle:

For example, we can set up a regular job that downloads the syscall table from the latest kernel version and makes sure no new syscalls containing "socket" were added. However, the risk of a new syscall appearing, and libc libraries using it, in the immediate-term is very low, so I think it shouldn't block this development.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for clarifying!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blacklisting is indeed very problematic IMO. It's hard to keep up with the syscalls and the way they can be misused.
There may be newly discovered CVEs in currently unused syscalls, so limiting the blast radius of potentially harmless syscalls that are not neccessary is useful. Otherwise, an attacker could find a way of triggering an unneeded syscall from a linked library to cause havoc.

Another thing that comes to mind is io_uring. It exposes a way of bypassing system calls for creating sockets and performing network IO. This is another example of why blacklists are not a good idea.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. I started working on a bigger PR with seccompiler that whitelists only the syscalls needed, but I thought this solution here would be quicker to land. If it's controversial I can just close this PR and focus on the other one.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good, glad to hear we're taking the whitelist approach. This one will be tricky as well, but better in the long run.
I'm not against an incremental approach either, any sandboxinig is better than none :)

@mrcnski mrcnski requested a review from koute June 5, 2023 16:29
Comment on lines +185 to +194
#[cfg(target_os = "linux")]
if let Err(err) = seccomp::try_restrict_thread() {
return (
Response::InternalError(InternalValidationError::Seccomp(format!(
"sandboxing the thread failed: {}",
err.to_string()
))),
landlock_status,
)
}
Copy link
Contributor

@koute koute Jun 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm... security-wise doing this here probably won't be very useful. (:

Since this only restricts a single thread any threads (including the main thread) which were spawned before this one will still be fully unrestricted, and since all of the threads share an address space if the attacker can execute arbitrary code they could just hijack another unsandboxed thread to do what they want. Slightly annoying but won't stop a motivated attacker.

So 1) this needs to be done on the main thread, and 2) it needs to be done before any other threads are spawned (ideally we'd add an assertion assert_eq!(std::fs::read_dir("/proc/self/task").unwrap().count(), 1); before the sandboxing is enabled.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. We can add exceptions for both landlock and seccomp (birdcage adds exceptions for local sockets). For FS, the main thread only needs the cached artifact directory, so we can add a landlock exception for it. But I didn't do this because I would prefer attackers not being able to read/modify other artifacts. We could spawn a new process for each job and fully restrict it, but that would be more overhead and complication.

I would want to make sure that is actually needed though. How is hijacking other threads done? Why do landlock and seccomp allow restricting per-thread, if it can be worked around?

Also, if this sandboxing can stop simple attacks it's still worth having until we have full sandboxing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would want to make sure that is actually needed though. How is hijacking other threads done?

There are many ways to do it. With no extra sandboxing the simplest one would be to just 1) stop the thread, 2) set its rip to where you want the thread to go through ptrace, 3) let it run again. But even if you'd sandbox all syscalls it's still possible to do it simply by just reading and writing memory (e.g. just overwrite the return address of a function call which the other thread is coincidentally executing; it takes a little bit of skill to pull it off, but it's possible)

Why do landlock and seccomp allow restricting per-thread, if it can be worked around?

Because if they'd work on a per-process basis then that could result in race conditions. (:

Any spawned threads inherit the original thread's sandboxing, so what everyone does is that they first set up the sandboxing, and then spawn their threads. (And then optionally apply extra blacklists on all threads so that no further threads can be spawned.)

Also, if this sandboxing can stop simple attacks it's still worth having until we have full sandboxing.

For stopping simple attacks I think wasmtime's sandbox is enough of a barrier. If someone is skilled enough to find and exploit a remote code execution hole in wasmtime then I'm pretty sure they won't be stopped by this. (:

Also, if this sandboxing can stop simple attacks it's still worth having until we have full sandboxing.

Well, I mean, it's always good to lock your front door, but if you'll leave the back door unlocked it's not going to stop anyone. (: (But regardless extra defense in depth is always good to have.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Full process isolation sounds like the safest approach to me. The point of threads is sharing an address space, in this case we really don't want that, so we should be using a process.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Full process isolation sounds like the safest approach to me. The point of threads is sharing an address space, in this case we really don't want that, so we should be using a process.

Indeed. The issue here is that this is already running in a separate process (the worker process), but not the whole worker process is sandboxed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe some restructuring is in order, where we have a clear (process) separation between what needs to be sandboxed and what is not. On the flip side the unsandboxed stuff might not need to be its own process?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@koute so you propose to restrict the entire worker process? Wouldn't it be more logical to do that after the worker binary separation, then? Seems like it would be more straightforward, as right now, we'd need to apply sandboxing in the polkadot-cli root, and later we'd need to move it into some worker-cli root anyway?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want this to be secure then we must restrict the entire process. As to whether do this now or later, well, I guess it's up to you. But as-is right now the sandboxing effectively doesn't do anything security-wise, so we shouldn't say it does. (:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe some restructuring is in order, where we have a clear (process) separation between what needs to be sandboxed and what is not. On the flip side the unsandboxed stuff might not need to be its own process?

Yeah, and it might also be more secure to limit processes to running one job instead of multiple jobs in sequence. See e.g. https://github.com/paritytech/polkadot/issues/7232#issuecomment-1549706304. Unfortunately it would be added overhead for each job to spawn and tear down a new process. We would also need to whitelist the artifacts directory from sandboxing.

I would wait on a response to landlock-lsm/rust-landlock#37. Maybe this is something that was already considered during development of landlock as I have not seen any caveats about this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update on this: @koute was right (of course), I've raised https://github.com/paritytech/polkadot/issues/7497 and started working on it.

Comment on lines +231 to +232
blacklisted_rules.insert(libc::SYS_socketpair, vec![]);
blacklisted_rules.insert(libc::SYS_socket, vec![]);
Copy link
Contributor

@koute koute Jun 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm... there could still be a way around this if there's an existing socket already created; the attacker could then just probably reconfigure/repurpose it. (I've never tried that but I suppose it should work?) Perhaps we could blacklist all socket-related syscalls? setsockopt, getsockopt, connect, accept, listen, bind, sendto, recvfrom, sendmsg, recvmsg, sendmmsg, recvmmsg, and I also think there were some compat_* variant of some of these.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The worker thread communicates with the host through a unix socket, I think restricting send/recv would break that communication?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it uses those syscalls then yes; I don't remember if it does, or if it just uses normal read/write.

(Doing the communication through a shared memory map + futex instead of a socket would probably allow us to completely sandbox that, but it's probably not worth it.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The thread shouldn't have access to the existing sockets unless it can go outside its address space as you mention in the other comment, right? The idea with only blocking two syscalls is to keep the scope low and minimize the chance of breakage, while we work on the main seccomp PR (right now still waiting on binary separation).

Copy link
Contributor Author

@mrcnski mrcnski Jun 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is socket-repurposing done? Should be impossible to use them for networking if we block sockets with an exception for local sockets in the main thread (like birdcage does). Then we don't need to let through all the other syscalls.

Copy link
Contributor

@koute koute Jun 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want to be conservative and minimize the change for breakage you can probably leave the send/recv/etc., accept and getsockopt syscalls unsandboxed. AFAIK blacklisting the rest shouldn't break anything as they should only be used with new sockets.

The thread shouldn't have access to the existing sockets unless it can go outside its address space as you mention in the other comment, right?

Sockets don't reside in the process' memory per-se; they are kernel space objects to which the process holds a handle in the form of a file descriptor (which is just a normal number). The only thing necessary to access a given socket is that you know its socket number, which you can trivially get by e.g. iterating over /proc/self/fd (since filesystem access is not sandboxed) or you could even just loop through numbers from 0 to N and see if any of them are sockets (since file descriptors are not randomized).

Nevertheless, all threads have full access to all other threads' address space; or in other words, there's only a single address space - the address space of the process, which all of the threads share. That's the nature of how threads work. (:

/// # Action on Caught Syscall
///
/// TODO: Update with the action and explain why given action was chosen.
#[cfg(target_os = "linux")]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm.... we probably also want to gate this on architecture (target_arch = "x86_64").

Technically seccompiler should also support aarch64, but without us auditing aarch64's list of syscalls (which do differ between architectures) and testing that it actually works there's a chance that it'd be broken/incomplete/insecure.

Copy link
Contributor Author

@mrcnski mrcnski Jun 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seccompiler supports aarch64 and I didn't think there'd be any complication with the two socket creation syscalls. I considered gating on these two architectures, but seccompiler already does it so it would already be a compiler error on an unsupported arch.

What do you think about my idea of having a job that downloads the syscall table for these two architectures, from the latest kernel version, and checks for unexpected changes? I think we would need eventually it but it should not be a blocker for this PR as the scope is so small.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seccompiler supports aarch64 and I didn't think there'd be any complication with the two socket creation syscalls.

Yeah, probably, but we're going to extend this, right? I'd rather we either go all the way and properly support it, or not support it at all.

but seccompiler already does it so it would already be a compiler error on an unsupported arch.

It'd be nice to handle this gracefully on unsupported architectures (essentially treat it as being compiled on an unsupported OS) instead of failing to compile at all.

What do you think about my idea of having a job that downloads the syscall table for these two architectures, from the latest kernel version, and checks for unexpected changes?

You mean extract it from the Linux sources? I'm not entirely sure that it can be done with 100% reliability, but the scripts that are floating around which do that do seem like they can grab all of them from what I can see (at least for amd64). I don't think it's worth having a job like this to run for every PR, but a nightly job which runs once each day and pings us if there are any changes - sure, why not.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about my idea of having a job that downloads the syscall table for these two architectures, from the latest kernel version, and checks for unexpected changes? I think we would need eventually it but it should not be a blocker for this PR as the scope is so small.

This is doable. But what is the purpose?
there's a script in seccompiler for this: https://github.com/rust-vmm/seccompiler/blob/main/tools/generate_syscall_tables.sh, and I'm sure there are others.

Would this be used to manually check the newly added syscalls, to see if there's a newly introduced network syscall?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this be used to manually check the newly added syscalls, to see if there's a newly introduced network syscall?

Yeah, and my idea was to review new syscalls and determine if they should be blacklisted (I figure 99% of the time no action would be required)


pub type Result<T> = std::result::Result<T, Error>;

// TODO: Compile the filter at build-time rather than runtime.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep. The simplest way to do it would probably be to generate the BpfProgram in build.rs, cast it to a &[u8] with slice::from_raw_parts, write it to file, and then include_bytes! it here. You could then slice::from_raw_parts it again into &'a [sock_filter] and pass it to seccompiler::apply_filter. (sock_filter is #[repr(C)] so this is safe to do)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, that's close to the instructions here: https://github.com/rust-vmm/seccompiler#compiling-filters.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For such a small filter, I don't think it's worth it.
Build-time compilation is especially helpful for large, JSON-encoded filters.

// Restricted thread cannot open sockets.
let handle = thread::spawn(|| {
// TODO:Open a socket, this should succeed before seccomp is applied.
TcpListener::bind("127.0.0.1:7070").unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
TcpListener::bind("127.0.0.1:7070").unwrap();
TcpListener::bind("127.0.0.1:0").unwrap();

Using 0 as a port will automatically pick a free one; this way this test won't randomly fail if the port's already used by something.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, I totally forgot about that.


// Try to open a socket after seccomp.
assert!(matches!(
TcpListener::bind("127.0.0.1:8080"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
TcpListener::bind("127.0.0.1:8080"),
TcpListener::bind("127.0.0.1:0"),

}
}

// TODO: Add a check for whether seccomp is supported and warn if not, like we do for landlock.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the kernel does not support seccomp, it should return an error from the syscalls.

calling seccomp(GET_ACTIONS_AVAIL) should be easy. you could directly call the syscall using libc::syscall and a bit of unsafe code.

However, I don't think we need this, since the actions we set here (ALLOW/KILL_PROCESS/ERRNO) have been in the kernel for a very long time (ar least kernel 4.9, since we were testing Firecracker on those kernels).

If you'd really like to be safe, I can work on adding this helper function in seccompiler

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤯 You're the seccompiler dev! Cool!

calling seccomp(GET_ACTIONS_AVAIL) should be easy. you could directly call the syscall using libc::syscall and a bit of unsafe code.

I didn't realize it was that easy. I left this TODO because the seccompiler docs recommend it, but don't provide a way of directly doing so. I left this issue about it. ;) Sounds like this check may be unnecessary, we do require a somewhat recent kernel, but wouldn't hurt either.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤯 You're the seccompiler dev! Cool!

😄 yep

Yep, I think it's unneccessary for this use case. It may be needed for more advanced use cases that use more exotic return actions, like SECCOMP_RET_USER_NOTIF (introduced in kernel 5.0), that's why I added the recommendation in the docs.

If you wanna make a contribution to seccompiler, I can help you out with reviews/guidance

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A0-please_review Pull request needs code review. B0-silent Changes should not be mentioned in any release notes C1-low PR touches the given topic and has a low impact on builders. D3-trivial 🧸 PR contains trivial changes in a runtime directory that do not require an audit. T4-parachains_engineering This PR/Issue is related to Parachains performance, stability, maintenance.
Projects
No open projects
Archived in project
Development

Successfully merging this pull request may close these issues.

PVF worker: restrict network access
5 participants