Use systemd-nsresourced to allocate user namespaces and UID/GID ranges #24828

Open
ruihe774 opened this issue Dec 12, 2024 · 4 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@ruihe774
Contributor

Feature request description

Currently, podman uses newuidmap and newgidmap from shadow-utils to set up the UID/GID mappings for the user namespaces of rootless containers. This requires predefined UID/GID ranges in /etc/subuid and /etc/subgid. In some configurations, for example when users are managed by systemd-homed (#12590) or by a network authentication system, users have no entries in /etc/subuid and /etc/subgid, which prevents podman from creating rootless containers.

Systemd 256 has introduced a service, systemd-nsresourced, that exposes a Varlink interface, io.systemd.NamespaceResource. Unprivileged clients may allocate a user namespace and then request, via this service, that a transient UID/GID range be assigned to it. Users do not need to have predefined sub-UID/GID ranges. I wonder whether podman can use systemd-nsresourced to allocate user namespaces and UID/GID ranges for rootless containers.

Suggest potential solution

Podman can add a code path that uses systemd-nsresourced to allocate user namespaces and UID/GID ranges. If the service is not running, podman can fall back to newuidmap and newgidmap. This ensures backward compatibility.
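
A rough sketch of what that fallback decision could look like in Go. The socket path is the one used with varlinkctl in this issue; the function name `chooseIDMapBackend` and the returned strings are hypothetical, and a real implementation would presumably also handle the `UserNamespaceInterfaceNotSupported` error rather than only probing the socket.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// Socket path as used with varlinkctl in this issue.
const nsresourcedSocket = "/run/systemd/io.systemd.NamespaceResource"

// chooseIDMapBackend (hypothetical) probes the Varlink socket and reports
// which mechanism to use. Any connection failure selects the existing
// newuidmap/newgidmap path, so current behaviour is preserved.
func chooseIDMapBackend(socketPath string) string {
	conn, err := net.DialTimeout("unix", socketPath, time.Second)
	if err != nil {
		return "newuidmap"
	}
	conn.Close()
	return "nsresourced"
}

func main() {
	fmt.Println(chooseIDMapBackend(nsresourcedSocket))
}
```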

Have you considered any alternatives?

No.

Additional context

systemd/systemd#26826
man:systemd-nsresourced.service(8)

Interface `io.systemd.NamespaceResource`
$ varlinkctl introspect /run/systemd/io.systemd.NamespaceResource io.systemd.NamespaceResource
interface io.systemd.NamespaceResource

method AllocateUserRange(
        name: string,
        size: int,
        target: ?int,
        userNamespaceFileDescriptor: int
) -> ()

method RegisterUserNamespace(
        name: string,
        userNamespaceFileDescriptor: int
) -> ()

method AddMountToUserNamespace(
        userNamespaceFileDescriptor: int,
        mountFileDescriptor: int
) -> ()

method AddControlGroupToUserNamespace(
        userNamespaceFileDescriptor: int,
        controlGroupFileDescriptor: int
) -> ()

method AddNetworkToUserNamespace(
        userNamespaceFileDescriptor: int,
        networkNamespaceFileDescriptor: int,
        namespaceInterfaceName: ?string,
        mode: string
) -> (
        hostInterfaceName: string,
        namespaceInterfaceName: string
)

error UserNamespaceInterfaceNotSupported()

error NameExists()

error UserNamespaceExists()

error DynamicRangeUnavailable()

error NoDynamicRange()

error UserNamespaceNotRegistered()

error UserNamespaceWithoutUserRange()

error TooManyControlGroups()

error ControlGroupAlreadyAdded()
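
For reference, a minimal Go sketch of how an `AllocateUserRange` call could be framed on the wire. It relies only on the Varlink framing (one JSON object terminated by a NUL byte) and the parameter names from the introspection output above. The namespace name, the size value, and the reading of `userNamespaceFileDescriptor` as an index into file descriptors passed alongside the message via SCM_RIGHTS are assumptions; the sketch does not open the socket or pass any descriptor.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// allocateUserRangeParams mirrors the parameters of AllocateUserRange as
// listed in the introspection output. Target is optional ("?int").
type allocateUserRangeParams struct {
	Name                        string `json:"name"`
	Size                        int    `json:"size"`
	Target                      *int   `json:"target,omitempty"`
	UserNamespaceFileDescriptor int    `json:"userNamespaceFileDescriptor"`
}

// varlinkCall is the generic Varlink request envelope.
type varlinkCall struct {
	Method     string      `json:"method"`
	Parameters interface{} `json:"parameters"`
}

// encodeCall serializes a request and appends the NUL byte that terminates
// every Varlink message.
func encodeCall(method string, params interface{}) ([]byte, error) {
	msg, err := json.Marshal(varlinkCall{Method: method, Parameters: params})
	if err != nil {
		return nil, err
	}
	return append(msg, 0), nil
}

func main() {
	frame, err := encodeCall(
		"io.systemd.NamespaceResource.AllocateUserRange",
		allocateUserRangeParams{
			Name: "podman-demo", // hypothetical namespace name
			Size: 65536,         // illustrative size, not verified against the service
			// Assumed to be an index into the fds sent with this message.
			UserNamespaceFileDescriptor: 0,
		},
	)
	if err != nil {
		fmt.Println("encode failed:", err)
		return
	}
	fmt.Printf("%q\n", frame)
}
```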
@Luap99
Member

Luap99 commented Dec 13, 2024

I have only looked briefly at this when it was added but I don't think it is possible to switch to that with the current podman storage design.

We write files owned by the subuids to disk in your home directory as plain directories/files, so this cannot work when the UIDs are transient. And even systemd goes to great lengths to lock this UID "backdoor" down via BPF:

In order to ensure that clients cannot gain
persistency in their transient UID/GID range a BPF-LSM based
policy is enforced that ensures that user namespaces set up this
way can only write to file systems they allocate themselves or
that are explicitly allowlisted via systemd-nsresourced.

And if we go into the nspawn man page:

systemd-nspawn may be invoked with or without privileges. The
full functionality is currently only available when invoked with
privileges. When invoked without privileges, various limitations
apply, including, but not limited to:

Only disk image based containers are supported (i.e.
--image=). Directory based ones (i.e. --directory=) are not
supported.

So my understanding is that it is impossible to use a normal directory layout.
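
To illustrate the storage concern with transient ranges, here is a small Go sketch of the ID-mapping arithmetic. A mapping has the same shape as a line of /proc/<pid>/uid_map; the two base values are made up and only stand in for "whatever range gets handed out on each run".

```go
package main

import "fmt"

// idMapping has the shape of one uid_map line: container IDs
// [ContainerStart, ContainerStart+Length) map to host IDs starting at HostStart.
type idMapping struct {
	ContainerStart, HostStart, Length uint32
}

// toHost translates a container ID into the host ID stored on disk.
func (m idMapping) toHost(containerID uint32) (uint32, bool) {
	if containerID < m.ContainerStart || containerID-m.ContainerStart >= m.Length {
		return 0, false
	}
	return m.HostStart + (containerID - m.ContainerStart), true
}

// toContainer translates an on-disk host ID back into a container ID.
func (m idMapping) toContainer(hostID uint32) (uint32, bool) {
	if hostID < m.HostStart || hostID-m.HostStart >= m.Length {
		return 0, false
	}
	return m.ContainerStart + (hostID - m.HostStart), true
}

func main() {
	// Hypothetical transient ranges handed out on two separate runs.
	firstRun := idMapping{ContainerStart: 0, HostStart: 524288, Length: 65536}
	secondRun := idMapping{ContainerStart: 0, HostStart: 1048576, Length: 65536}

	// A file written by container UID 33 during the first run...
	hostUID, _ := firstRun.toHost(33)
	// ...has no valid owner inside the namespace of the second run.
	_, mapped := secondRun.toContainer(hostUID)
	fmt.Println(hostUID, mapped)
}
```

With a fixed range from /etc/subuid the two mappings would be identical, so ownership recorded in the image layers stays meaningful across runs.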

@ruihe774
Contributor Author

Can AddMountToUserNamespace() address this problem? According to the description, it is possible to use the method to add a mounted overlayfs to the allowlist. I haven't tried it yet.

@Luap99
Member

Luap99 commented Dec 13, 2024

I am not sure how the mounting is supposed to work, but I don't think that is the problem. We can natively mount overlayfs in a user namespace without privilege escalation. The issue, I think, is that we cannot write files with these extra UIDs to disk, which means all images would be limited to one UID or would need something like fuse-overlayfs, which can map UIDs dynamically in the extended attributes.

@ruihe774
Contributor Author

ruihe774 commented Dec 15, 2024

The issue I think is that we cannot write files with these extra uids to disk which means all images would be limited to one uid

According to my understanding of the BPF-LSM code of nsresourced, this is not enforced. For a mount that is in the userns or in the allowlist, operations in it are all allowed. The "no extra uids" limit is enforced by mountfsd. As we can mount overlayfs without privilege, we do not need to use mountfsd and can therefore bypass the limit.

These are all guesses on my part; I haven't confirmed them yet. Sorry.
