Use systemd-nsresourced to allocate user namespaces and UID/GID ranges #24828

ruihe774 · 2024-12-12T17:18:50Z

Feature request description

Currently, podman uses newuidmap and newgidmap from shadow-utils to set up UID/GID mapping for user namespaces of rootless containers. This requires predefined UID/GID ranges in /etc/subuid and /etc/subgid. In some configurations, for example users managed by systemd-homed (#12590) and users managed by a network authentication system, users do not have records in /etc/subuid and /etc/subgid, preventing podman from creating rootless containers.
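To illustrate the failure mode described above, here is a minimal sketch that checks whether a user has any entry in a subuid-style file. It assumes the usual `name:start:count` line format; the function name and the sample data are made up for illustration and are not part of podman.

```python
def subid_ranges(subid_text: str, user: str) -> list[tuple[int, int]]:
    """Return the (start, count) pairs that a subuid/subgid file grants to `user`."""
    ranges = []
    for line in subid_text.splitlines():
        fields = line.strip().split(":")
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        name, start, count = fields
        if name == user:
            ranges.append((int(start), int(count)))
    return ranges


# Hypothetical file contents: alice and bob are local users with ranges,
# carol stands in for a homed or network-managed user with no record.
sample = "alice:100000:65536\nbob:165536:65536\n"
print(subid_ranges(sample, "alice"))
print(subid_ranges(sample, "carol"))
```

An empty result for a user is the situation in which newuidmap and newgidmap have nothing to map.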

Systemd 256 introduced a service, systemd-nsresourced, that exposes a Varlink interface, io.systemd.NamespaceResource. Unprivileged clients may allocate a user namespace and then request, via this service, that a transient UID/GID range be assigned to it. Users do not need to have predefined sub-UID/GID ranges. I wonder whether podman can use systemd-nsresourced to allocate user namespaces and UID/GID ranges for rootless containers.

Suggest potential solution

Podman can add a code path that uses systemd-nsresourced to allocate user namespaces and UID/GID ranges. If the service is not running, podman can fall back to newuidmap and newgidmap. This ensures backward compatibility.
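A rough sketch of that fallback decision, in Python for brevity. It only checks whether the nsresourced socket exists at the path used in the introspection command below; the function name and the returned strings are hypothetical, and this is not how podman selects anything today.

```python
import os
import stat

# Socket path taken from the varlinkctl invocation in this issue.
NSRESOURCED_SOCKET = "/run/systemd/io.systemd.NamespaceResource"


def pick_idmap_backend(socket_path: str = NSRESOURCED_SOCKET) -> str:
    """Return "nsresourced" if the Varlink socket exists, else "newuidmap"."""
    try:
        mode = os.stat(socket_path).st_mode
    except OSError:
        # Socket missing or unreadable: use the shadow-utils helpers.
        return "newuidmap"
    return "nsresourced" if stat.S_ISSOCK(mode) else "newuidmap"


print(pick_idmap_backend())
```

A socket on disk does not prove the service will accept the request, so a real implementation would presumably also fall back when a call fails, for example with UserNamespaceInterfaceNotSupported.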

Have you considered any alternatives?

No.

Additional context

systemd/systemd#26826
man:systemd-nsresourced.service(8)

Interface `io.systemd.NamespaceResource`

$ varlinkctl introspect /run/systemd/io.systemd.NamespaceResource io.systemd.NamespaceResource
interface io.systemd.NamespaceResource

method AllocateUserRange(
        name: string,
        size: int,
        target: ?int,
        userNamespaceFileDescriptor: int
) -> ()

method RegisterUserNamespace(
        name: string,
        userNamespaceFileDescriptor: int
) -> ()

method AddMountToUserNamespace(
        userNamespaceFileDescriptor: int,
        mountFileDescriptor: int
) -> ()

method AddControlGroupToUserNamespace(
        userNamespaceFileDescriptor: int,
        controlGroupFileDescriptor: int
) -> ()

method AddNetworkToUserNamespace(
        userNamespaceFileDescriptor: int,
        networkNamespaceFileDescriptor: int,
        namespaceInterfaceName: ?string,
        mode: string
) -> (
        hostInterfaceName: string,
        namespaceInterfaceName: string
)

error UserNamespaceInterfaceNotSupported()

error NameExists()

error UserNamespaceExists()

error DynamicRangeUnavailable()

error NoDynamicRange()

error UserNamespaceNotRegistered()

error UserNamespaceWithoutUserRange()

error TooManyControlGroups()

error ControlGroupAlreadyAdded()
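As a sketch of what a client call to AllocateUserRange could look like on the wire: Varlink messages are JSON objects terminated by a NUL byte. Two things here are assumptions rather than facts taken from this issue: that the file descriptor is passed as ancillary data (SCM_RIGHTS) with the parameter holding its index in the passed list, and that 65536 is an accepted size. The function names are made up, and `socket.send_fds` needs Python 3.9 or later.

```python
import json
import socket

INTERFACE = "io.systemd.NamespaceResource"


def build_allocate_request(name: str, size: int, fd_index: int = 0) -> bytes:
    """Encode an AllocateUserRange call as a NUL-terminated Varlink message."""
    message = {
        "method": f"{INTERFACE}.AllocateUserRange",
        "parameters": {
            "name": name,
            "size": size,
            # Assumption: the value is an index into the fds sent alongside.
            "userNamespaceFileDescriptor": fd_index,
        },
    }
    return json.dumps(message).encode() + b"\0"


def allocate_user_range(socket_path: str, userns_fd: int, name: str, size: int) -> dict:
    """Send the request together with the user namespace fd and return the reply."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        socket.send_fds(sock, [build_allocate_request(name, size)], [userns_fd])
        reply = b""
        while not reply.endswith(b"\0"):
            chunk = sock.recv(4096)
            if not chunk:
                raise ConnectionError("connection closed before a full reply arrived")
            reply += chunk
    return json.loads(reply[:-1])


print(build_allocate_request("demo-container", 65536))
```

The optional `target` parameter is left out here. `userns_fd` would be an open file descriptor for a freshly created user namespace that has no ID mapping yet.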


Luap99 · 2024-12-13T10:32:00Z

I have only looked briefly at this when it was added but I don't think it is possible to switch to that with the current podman storage design.

We write the subuids to disk in your home directory as the owners of plain directories/files, so this cannot work when the UIDs are transient. And even systemd goes to great lengths to lock this UID "backdoor" down via BPF:

In order to ensure that clients cannot gain
persistency in their transient UID/GID range a BPF-LSM based
policy is enforced that ensures that user namespaces set up this
way can only write to file systems they allocate themselves or
that are explicitly allowlisted via systemd-nsresourced.

And if we go into the nspawn man page:

systemd-nspawn may be invoked with or without privileges. The
full functionality is currently only available when invoked with
privileges. When invoked without privileges, various limitations
apply, including, but not limited to:

Only disk image based containers are supported (i.e.
--image=). Directory based ones (i.e. --directory=) are not
supported.

So my understanding is it is impossible to use a normal directory layout.

ruihe774 · 2024-12-13T11:50:44Z

Can AddMountToUserNamespace() address this problem? According to the description, it is possible to use the method to add a mounted overlayfs to the allowlist. I haven't tried it yet.

Luap99 · 2024-12-13T12:34:09Z

I am not sure how the mounting is supposed to work, but I don't think that is the problem. We can natively mount overlayfs in a user namespace without privilege escalation. The issue, I think, is that we cannot write files with these extra UIDs to disk, which means all images would be limited to one UID or would need something like fuse-overlayfs that can map UIDs dynamically in the extended attributes.
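To make the on-disk problem concrete, here is a small worked example. It uses a deliberately simplified linear mapping (host UID = range start + container UID); real rootless mappings are more involved, and both range starts below are invented numbers.

```python
def host_uid(container_uid: int, range_start: int, range_size: int = 65536) -> int:
    """Map a container UID into a host range using a simplified linear mapping."""
    if not 0 <= container_uid < range_size:
        raise ValueError(f"container UID {container_uid} is outside the range")
    return range_start + container_uid


# The same file owner inside the image (container UID 33) under two
# different transient allocations with hypothetical range starts.
first_run = host_uid(33, range_start=100000)
second_run = host_uid(33, range_start=524288)
print(first_run, second_run)
```

A file written during the first run is stored with the first host UID. If a later run receives a different range, that stored owner no longer corresponds to container UID 33, which is why a stable range, or a layer that remaps ownership such as fuse-overlayfs, is needed.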

ruihe774 · 2024-12-15T13:51:03Z

The issue I think is that we cannot write files with these extra uids to disk which means all images would be limited to one uid

According to my understanding of the BPF-LSM code of nsresourced, this is not enforced. For a mount that is in the userns or in the allowlist, operations in it are all allowed. The "no extra uids" limit is enforced by mountfsd. As we can mount overlayfs without privilege, we do not need to use mountfsd and can therefore bypass the limit.

These are all guesses on my part; I haven't confirmed them yet. Sorry.
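Restated as a small predicate, the reading above looks like this. It models only the guess in this comment, not verified behaviour of nsresourced, and every name in it is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class UserNamespace:
    """Toy model: which mounts a namespace created itself or had allowlisted."""
    owned_mounts: set[str] = field(default_factory=set)
    allowlisted_mounts: set[str] = field(default_factory=set)

    def may_write(self, mount_id: str) -> bool:
        # Guess: writes are allowed on mounts the namespace allocated itself
        # or that were allowlisted (e.g. via AddMountToUserNamespace).
        # Ownership of the written files is not checked in this model.
        return mount_id in self.owned_mounts or mount_id in self.allowlisted_mounts


ns = UserNamespace(owned_mounts={"overlay"}, allowlisted_mounts={"image-store"})
print(ns.may_write("overlay"), ns.may_write("home"))
```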

ruihe774 added the kind/feature Categorizes issue or PR as related to a new feature. label Dec 12, 2024
