-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
nsenter: cloned_binary: "memfd" cleanups #1984
Conversation
This new setup would be an opportunity to fix #1980, by making the |
/cc @kolyshkin |
Thanks for writing this, you beat me to it :) Left some minor comments in the code. The only issue remaining is /* Use our own wrapper for memfd_create. */
#if !defined(__NR_memfd_create)
# ifdef __i386__
# define __NR_memfd_create 356
# elif __x86_64__
# define __NR_memfd_create 319
# elif __powerpc64__
# define __NR_memfd_create 360
# elif __aarch64__
# define __NR_memfd_create 279
# elif __arm__
# define __NR_memfd_create 385
/* FIXME more arches here */
# endif
#endif /* !__NR_memfd_create */ but this is borderline ugly, and in the light of #1988 and #1980 maybe it makes sense to just not use memfd_create() at all, only leaving other methods? I'm not proposinig it, just an idea to discuss... |
@kolyshkin We don't do that for any other syscall and I really don't like the idea of hard-coding syscall numbers. Yeah, it's part of the Linux ABI and thus won't break, but I'm really not a fan of it. |
@cyphar, did you think about dropping https://github.com/lxc/lxc/blob/2d8bd1db234462356c73b12f238a615b2d1cb550/src/lxc/rexec.c#L106 |
@brauner I don't have a strong opinion about whether I should keep |
Any strong opinions on this one (in particular, should we keep /cc @opencontainers/runc-maintainers |
ping @justincormack as well; PTAL |
@cyphar I am in favor of retaining |
I like the approach you have taken in the PR. I am on the fence on whether we should remove memfd_create right away as you provide a way to disable it. |
I am approving this but will be happy to re-approve even if majority thinks that we should drop memfd_create. Thanks! |
TL;DR: it looks like All the gory detailsI tested `memfd_create` and friends, using memfd_test.c from the kernel sources. On CentOS 7.2 kernel (3.10.0-327.el7.x86_64) and it fails right there, giving EINVAL on MFD_ALLOW_SEALING flag. I tested it on latest CentOS 7.6 kernel (3.10.0-957.5.1.el7.x86_64) and it fails mid-way on with:
Here's a relevant portion of strace:
Note that the above test case was part of initial implementation of memfd_create(). I have retried test with selinux disabled -- same failure. I didn't look whether this is the way that this runc patch uses memfd_create (probably not). All this gives me an impression that memfd_create() and friends are in bad shape even in latest RHEL7 kernels. |
Now, for my great surprise, I ran the same test binary on Debian Jessie (which is 3.16 but have memfd stuff backported), and it passes with flying colors:
|
Filed a bug to Red Hat: https://bugzilla.redhat.com/show_bug.cgi?id=1679829 |
When use _LIBCONTAINER_DISABLE_MEMFD_CLONE=1 runc create test1
The error is raised in here: runc/libcontainer/nsenter/cloned_binary.c Lines 374 to 375 in 89c0755
I use root permission to run runc. May be need some special permission? |
I'm really not sure, can you please provide some more information about your setup -- an I'm currently working on moving the |
For a variety of reasons, sendfile(2) can end up doing a short-copy so we need to just loop until we hit the binary size. Since /proc/self/exe is tautologically our own binary, there's no chance someone is going to modify it underneath us (or changing the size). Signed-off-by: Aleksa Sarai <[email protected]>
Ah dammit, this wasn't meant to be merged -- the bind-mount approach actually isn't fool-proof (if you can do a mount, you can get around it). However, the default profiles actually disallow mounting so this isn't the end of the world. I'll submit a follow-up PR that uses overlayfs rather than bind-mounts. |
Bad news... we plan to appy this patch to our production environment, looks like still need to wait |
It won't work in userns? |
No, but that's why we have fallbacks. I've submitted #2006. However I've come to the conclusion that running a privileged container with CAP_SYS_ADMIN is simply an insecure setup, so maybe it's better if we just leave it in this state -- where arguably it's not secure with CAP_SYS_ADMIN (but there's almost certainly other issues with running CAP_SYS_ADMIN). CAP_SYS_ADMIN allows you to disable seccomp protections, do privileged device ioctls, among other things. |
Backport from opencontainers#1984
{ | ||
switch (fdtype) { | ||
case EFD_MEMFD: | ||
return fcntl(*fd, F_ADD_SEALS, RUNC_MEMFD_SEALS); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about modify fallback like try_bindfd
? For example:
try_memfd()
try_otmpfile()
try_oldschooltemp()
Is there a fail situation like this: memfd_create success, but F_ADD_SEALS fail?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you want to submit a patch to do that, sure.
However, I spent several weeks working on it or answering questions about it, and given painfully trivial the fix for the CVE is ("make a copy and execute that"), working on it for any more time feels like a waste of time to me.
If F_ADD_SEALS
fails (with memfd_create
and MFD_ALLOW_SEALING
working) then the kernel is broken -- and while that can happen with some distributions I'd prefer to actually see it be an existing problem before trying to fix it.
Newer runC applied further improvements to their CVE-2019-5736 mitigation in opencontainers/runc#1984 which change the nature of our apparmor denial from `/` to `/bin/runc` (which I have also commented on https://bugs.launchpad.net/apparmor/+bug/1820344 about). See also canonical#6610. Signed-off-by: Tianon Gravi <[email protected]>
Newer runC applied further improvements to their CVE-2019-5736 mitigation in opencontainers/runc#1984 which change the nature of our apparmor denial from `/` to `/bin/runc` (which I have also commented on https://bugs.launchpad.net/apparmor/+bug/1820344 about). See also canonical#6610. Signed-off-by: Tianon Gravi <[email protected]>
Newer runC applied further improvements to their CVE-2019-5736 mitigation in opencontainers/runc#1984 which change the nature of our apparmor denial from `/` to `/bin/runc` (which I have also commented on https://bugs.launchpad.net/apparmor/+bug/1820344 about). See also canonical#6610. Signed-off-by: Tianon Gravi <[email protected]>
Newer runC applied further improvements to their CVE-2019-5736 mitigation in opencontainers/runc#1984 which change the nature of our apparmor denial from `/` to `/bin/runc` (which I have also commented on https://bugs.launchpad.net/apparmor/+bug/1820344 about). See also canonical#6610. (originally from Tianon Gravi, but re-committed due to CLA issues with the PR checks) Signed-off-by: Ian Johnson <[email protected]>
Newer runC applied further improvements to their CVE-2019-5736 mitigation in opencontainers/runc#1984 which change the nature of our apparmor denial from `/` to `/bin/runc` (which I have also commented on https://bugs.launchpad.net/apparmor/+bug/1820344 about). See also #6610. (originally from Tianon Gravi, but re-committed due to CLA issues with the PR checks) Signed-off-by: Ian Johnson <[email protected]>
For a variety of reasons, sendfile(2) can end up doing a short-copy so
we need to just loop until we hit the binary size. Since /proc/self/exe
is tautologically our own binary, there's no chance someone is going to
modify it underneath us (or changing the size).
In order to get around the memfd_create(2) requirement, 0a8e411
("nsenter: clone /proc/self/exe to avoid exposing host binary to
container") added an O_TMPFILE fallback. However, this fallback was
flawed in two ways:
It required O_TMPFILE which is relatively new (having been added to
Linux 3.11).
The fallback choice was made at compile-time, not runtime. This
results in several complications when it comes to running binaries
on different machines to the ones they were built on.
The easiest way to resolve these things is to have fallbacks work in a
more procedural way (though it does make the code unfortunately more
complicated) and to add a new fallback that uses mkotemp(3).
The usage of memfd_create(2) and other copying techniques is quite
wasteful, despite attempts to minimise it with _LIBCONTAINER_STATEDIR.
memfd_create(2) added ~10M of memory usage to the cgroup associated with
the container, which can result in some setups getting OOM'd (or just
hogging the hosts' memory when you have lots of created-but-not-started
containers sticking around).
The easiest way of solving this is by creating a read-only bind-mount of
the binary, opening that read-only bindmount, and then umounting it to
ensure that the host won't accidentally be re-mounted read-write. This
avoids all copying and cleans up naturally like the other techniques
used. Unfortunately, like the O_TMPFILE fallback, this requires being
able to create a file inside _LIBCONTAINER_STATEDIR (since bind-mounting
over the most obvious path -- /proc/self/exe -- is a very bad idea).
Unfortunately detecting this isn't fool-proof -- on a system with a
read-only root filesystem (that might become read-write during "runc
init" execution), we cannot tell whether we have already done an ro
remount. As a partial mitigation, we store a _LIBCONTAINER_CLONED_BINARY
environment variable which is checked alongside the protection being
present.
Fixes #1979
Fixes #1980
Fixes #1988
Signed-off-by: Aleksa Sarai [email protected]