Nested LXD virtualization in peer containers causes vsock ID conflicts #11508

mpontillo · 2023-03-23T18:44:00Z

Required information

Distribution: Ubuntu
Distribution version: 22.04 "Jammy"
The output of "lxc info" or if that fails:
- Kernel version: 5.19.0-35-generic
- LXC version: 5.12
- LXD version: 5.12
- Storage backend in use: zfs

lxc info output

$ lxc info
config: {}
api_extensions:
- storage_zfs_remove_snapshots
- container_host_shutdown_timeout
- container_stop_priority
- container_syscall_filtering
- auth_pki
- container_last_used_at
- etag
- patch
- usb_devices
- https_allowed_credentials
- image_compression_algorithm
- directory_manipulation
- container_cpu_time
- storage_zfs_use_refquota
- storage_lvm_mount_options
- network
- profile_usedby
- container_push
- container_exec_recording
- certificate_update
- container_exec_signal_handling
- gpu_devices
- container_image_properties
- migration_progress
- id_map
- network_firewall_filtering
- network_routes
- storage
- file_delete
- file_append
- network_dhcp_expiry
- storage_lvm_vg_rename
- storage_lvm_thinpool_rename
- network_vlan
- image_create_aliases
- container_stateless_copy
- container_only_migration
- storage_zfs_clone_copy
- unix_device_rename
- storage_lvm_use_thinpool
- storage_rsync_bwlimit
- network_vxlan_interface
- storage_btrfs_mount_options
- entity_description
- image_force_refresh
- storage_lvm_lv_resizing
- id_map_base
- file_symlinks
- container_push_target
- network_vlan_physical
- storage_images_delete
- container_edit_metadata
- container_snapshot_stateful_migration
- storage_driver_ceph
- storage_ceph_user_name
- resource_limits
- storage_volatile_initial_source
- storage_ceph_force_osd_reuse
- storage_block_filesystem_btrfs
- resources
- kernel_limits
- storage_api_volume_rename
- macaroon_authentication
- network_sriov
- console
- restrict_devlxd
- migration_pre_copy
- infiniband
- maas_network
- devlxd_events
- proxy
- network_dhcp_gateway
- file_get_symlink
- network_leases
- unix_device_hotplug
- storage_api_local_volume_handling
- operation_description
- clustering
- event_lifecycle
- storage_api_remote_volume_handling
- nvidia_runtime
- container_mount_propagation
- container_backup
- devlxd_images
- container_local_cross_pool_handling
- proxy_unix
- proxy_udp
- clustering_join
- proxy_tcp_udp_multi_port_handling
- network_state
- proxy_unix_dac_properties
- container_protection_delete
- unix_priv_drop
- pprof_http
- proxy_haproxy_protocol
- network_hwaddr
- proxy_nat
- network_nat_order
- container_full
- candid_authentication
- backup_compression
- candid_config
- nvidia_runtime_config
- storage_api_volume_snapshots
- storage_unmapped
- projects
- candid_config_key
- network_vxlan_ttl
- container_incremental_copy
- usb_optional_vendorid
- snapshot_scheduling
- snapshot_schedule_aliases
- container_copy_project
- clustering_server_address
- clustering_image_replication
- container_protection_shift
- snapshot_expiry
- container_backup_override_pool
- snapshot_expiry_creation
- network_leases_location
- resources_cpu_socket
- resources_gpu
- resources_numa
- kernel_features
- id_map_current
- event_location
- storage_api_remote_volume_snapshots
- network_nat_address
- container_nic_routes
- rbac
- cluster_internal_copy
- seccomp_notify
- lxc_features
- container_nic_ipvlan
- network_vlan_sriov
- storage_cephfs
- container_nic_ipfilter
- resources_v2
- container_exec_user_group_cwd
- container_syscall_intercept
- container_disk_shift
- storage_shifted
- resources_infiniband
- daemon_storage
- instances
- image_types
- resources_disk_sata
- clustering_roles
- images_expiry
- resources_network_firmware
- backup_compression_algorithm
- ceph_data_pool_name
- container_syscall_intercept_mount
- compression_squashfs
- container_raw_mount
- container_nic_routed
- container_syscall_intercept_mount_fuse
- container_disk_ceph
- virtual-machines
- image_profiles
- clustering_architecture
- resources_disk_id
- storage_lvm_stripes
- vm_boot_priority
- unix_hotplug_devices
- api_filtering
- instance_nic_network
- clustering_sizing
- firewall_driver
- projects_limits
- container_syscall_intercept_hugetlbfs
- limits_hugepages
- container_nic_routed_gateway
- projects_restrictions
- custom_volume_snapshot_expiry
- volume_snapshot_scheduling
- trust_ca_certificates
- snapshot_disk_usage
- clustering_edit_roles
- container_nic_routed_host_address
- container_nic_ipvlan_gateway
- resources_usb_pci
- resources_cpu_threads_numa
- resources_cpu_core_die
- api_os
- container_nic_routed_host_table
- container_nic_ipvlan_host_table
- container_nic_ipvlan_mode
- resources_system
- images_push_relay
- network_dns_search
- container_nic_routed_limits
- instance_nic_bridged_vlan
- network_state_bond_bridge
- usedby_consistency
- custom_block_volumes
- clustering_failure_domains
- resources_gpu_mdev
- console_vga_type
- projects_limits_disk
- network_type_macvlan
- network_type_sriov
- container_syscall_intercept_bpf_devices
- network_type_ovn
- projects_networks
- projects_networks_restricted_uplinks
- custom_volume_backup
- backup_override_name
- storage_rsync_compression
- network_type_physical
- network_ovn_external_subnets
- network_ovn_nat
- network_ovn_external_routes_remove
- tpm_device_type
- storage_zfs_clone_copy_rebase
- gpu_mdev
- resources_pci_iommu
- resources_network_usb
- resources_disk_address
- network_physical_ovn_ingress_mode
- network_ovn_dhcp
- network_physical_routes_anycast
- projects_limits_instances
- network_state_vlan
- instance_nic_bridged_port_isolation
- instance_bulk_state_change
- network_gvrp
- instance_pool_move
- gpu_sriov
- pci_device_type
- storage_volume_state
- network_acl
- migration_stateful
- disk_state_quota
- storage_ceph_features
- projects_compression
- projects_images_remote_cache_expiry
- certificate_project
- network_ovn_acl
- projects_images_auto_update
- projects_restricted_cluster_target
- images_default_architecture
- network_ovn_acl_defaults
- gpu_mig
- project_usage
- network_bridge_acl
- warnings
- projects_restricted_backups_and_snapshots
- clustering_join_token
- clustering_description
- server_trusted_proxy
- clustering_update_cert
- storage_api_project
- server_instance_driver_operational
- server_supported_storage_drivers
- event_lifecycle_requestor_address
- resources_gpu_usb
- clustering_evacuation
- network_ovn_nat_address
- network_bgp
- network_forward
- custom_volume_refresh
- network_counters_errors_dropped
- metrics
- image_source_project
- clustering_config
- network_peer
- linux_sysctl
- network_dns
- ovn_nic_acceleration
- certificate_self_renewal
- instance_project_move
- storage_volume_project_move
- cloud_init
- network_dns_nat
- database_leader
- instance_all_projects
- clustering_groups
- ceph_rbd_du
- instance_get_full
- qemu_metrics
- gpu_mig_uuid
- event_project
- clustering_evacuation_live
- instance_allow_inconsistent_copy
- network_state_ovn
- storage_volume_api_filtering
- image_restrictions
- storage_zfs_export
- network_dns_records
- storage_zfs_reserve_space
- network_acl_log
- storage_zfs_blocksize
- metrics_cpu_seconds
- instance_snapshot_never
- certificate_token
- instance_nic_routed_neighbor_probe
- event_hub
- agent_nic_config
- projects_restricted_intercept
- metrics_authentication
- images_target_project
- cluster_migration_inconsistent_copy
- cluster_ovn_chassis
- container_syscall_intercept_sched_setscheduler
- storage_lvm_thinpool_metadata_size
- storage_volume_state_total
- instance_file_head
- instances_nic_host_name
- image_copy_profile
- container_syscall_intercept_sysinfo
- clustering_evacuation_mode
- resources_pci_vpd
- qemu_raw_conf
- storage_cephfs_fscache
- network_load_balancer
- vsock_api
- instance_ready_state
- network_bgp_holdtime
- storage_volumes_all_projects
- metrics_memory_oom_total
- storage_buckets
- storage_buckets_create_credentials
- metrics_cpu_effective_total
- projects_networks_restricted_access
- storage_buckets_local
- loki
- acme
- internal_metrics
- cluster_join_token_expiry
- remote_token_expiry
- init_preseed
- storage_volumes_created_at
- cpu_hotplug
- projects_networks_zones
- network_txqueuelen
- cluster_member_state
- instances_placement_scriptlet
- storage_pool_source_wipe
- zfs_block_mode
- instance_generation_id
- disk_io_cache
api_status: stable
api_version: "1.0"
auth: trusted
public: false
auth_methods:
- tls
environment:
  addresses: []
  architectures:
  - x86_64
  - i686
  certificate: |
    -----BEGIN CERTIFICATE-----
    <redacted>
    -----END CERTIFICATE-----
  certificate_fingerprint: 9c473ce74f6bda12dd4ec97c3a28cd8cd4063fcfbfcc0435d9bfe1e7ba7f15f6
  driver: lxc | qemu
  driver_version: 5.0.2 | 7.1.0
  firewall: nftables
  kernel: Linux
  kernel_architecture: x86_64
  kernel_features:
    idmapped_mounts: "true"
    netnsid_getifaddrs: "true"
    seccomp_listener: "true"
    seccomp_listener_continue: "true"
    shiftfs: "false"
    uevent_injection: "true"
    unpriv_fscaps: "true"
  kernel_version: 5.19.0-35-generic
  lxc_features:
    cgroup2: "true"
    core_scheduling: "true"
    devpts_fd: "true"
    idmapped_mounts_v2: "true"
    mount_injection_file: "true"
    network_gateway_device_route: "true"
    network_ipvlan: "true"
    network_l2proxy: "true"
    network_phys_macvlan_mtu: "true"
    network_veth_router: "true"
    pidfd: "true"
    seccomp_allow_deny_syntax: "true"
    seccomp_notify: "true"
    seccomp_proxy_send_notify_fd: "true"
  os_name: Ubuntu
  os_version: "22.04"
  project: default
  server: lxd
  server_clustered: false
  server_event_mode: full-mesh
  server_name: timeloop
  server_pid: 1603373
  server_version: "5.12"
  storage: zfs
  storage_version: 2.1.5-1ubuntu6
  storage_supported_drivers:
  - name: ceph
    version: 17.2.0
    remote: true
  - name: cephfs
    version: 17.2.0
    remote: true
  - name: cephobject
    version: 17.2.0
    remote: true
  - name: dir
    version: "1"
    remote: false
  - name: lvm
    version: 2.03.11(2) (2021-01-08) / 1.02.175 (2021-01-08) / 4.47.0
    remote: false
  - name: zfs
    version: 2.1.5-1ubuntu6
    remote: false
  - name: btrfs
    version: 5.16.2
    remote: false

Issue description

When using LXD to launch virtual machines in multiple peer containers, VMs can fail to start with errors such as vhost-vsock: unable to set guest cid: Address already in use.

Steps to reproduce

Create a profile allowing nested virtualization, such as:

lxc profile create virt && \
lxc profile set virt security.nesting=true && \
lxc profile device add virt kvm unix-char source=/dev/kvm && \
lxc profile device add virt vhost-net unix-char source=/dev/vhost-net && \
lxc profile device add virt vhost-vsock unix-char source=/dev/vhost-vsock && \
lxc profile device add virt vsock unix-char source=/dev/vsock

Launch two or more containers with this profile, such as:

lxc launch ubuntu:jammy hv1 -p virt -p default
lxc launch ubuntu:jammy hv2 -p virt -p default

Use lxc shell to enter both containers and run:

lxd init --auto
lxc launch images:ubuntu/bionic/cloud bionic --vm

Expected results

Both virtual machines should be created and started.

Actual results

The second VM to be created will fail to start. lxc info --show-log local:bionic will display:

[...]
qemu-system-x86_64:/var/snap/lxd/common/lxd/logs/bionic/qemu.conf:115: vhost-vsock: unable to set guest cid: Address already in use

Additional Information

There was an attempt to address this in PR #10216. However, this fix seems to assume that the vsock IDs are only shared with the parent, not peer containers.

libvirt seems to avoid this problem by iterating over usable IDs until a free ID is found.
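As a rough illustration of the "iterate until a free ID is found" idea, here is a minimal Python sketch. It is not libvirt's or LXD's code. The `in_use` callback is a hypothetical stand-in for whatever check reports that a context ID is already taken, and the upper bound assumes that 0xFFFFFFFF is the wildcard CID (VMADDR_CID_ANY) and therefore not assignable.

```python
from typing import Callable, Optional

FIRST_USABLE_CID = 3        # IDs 0-2 appear to be reserved
LAST_USABLE_CID = 2**32 - 2  # assumption: 0xFFFFFFFF is the wildcard CID


def first_free_cid(in_use: Callable[[int], bool],
                   limit: int = LAST_USABLE_CID) -> Optional[int]:
    """Return the lowest context ID that in_use() reports as free.

    in_use is a hypothetical probe supplied by the caller; it should
    return True when the given ID is already claimed by another guest.
    """
    for cid in range(FIRST_USABLE_CID, limit + 1):
        if not in_use(cid):
            return cid
    return None  # nothing free within the scanned range


if __name__ == "__main__":
    taken = {3, 4, 6}
    print(first_free_cid(lambda cid: cid in taken))
```

Because the scan is driven by what is actually in use on the host rather than by what one LXD instance believes it has handed out, peer containers sharing the same kernel would not pick the same ID.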


tomponline · 2023-03-23T21:19:11Z

Yes, this is the track I started to go down with https://github.com/lxc/lxd/pull/10216#issuecomment-1093320228

mpontillo · 2023-03-23T23:27:01Z

I like where you were going with this, @tomponline. A few thoughts on the commit you referenced:

- I doubt a hypervisor would have a number of virtual machines approaching anywhere close to the size of a 32-bit integer, and the call to check whether or not the ID is already in use should be relatively fast.
  - Rather than picking 10 IDs and then giving up (which could go badly if the user gets unlucky), why not make the loop less bounded? I wouldn't give up until checking at least a few thousand. The libvirt code I referenced never gives you up - which is arguably a bug, so hopefully it doesn't let you down.
  - Instead of picking values totally at random, why not create a hash using something unique about the VM and start trying IDs there? (Or maybe just take the first 16 bits of the VM UUID, shift them to the most significant bytes of the vsock ID, and try 2^16 iterations from there? That would also have the advantage of being very unlikely to clash with other tooling on the system, such as libvirt, which starts counting from 3.)
- Don't forget that IDs 0-2 seem to be reserved and unusable.
- I'm curious now how the kernel keeps track of the IDs, and if it is at all useful to have low-numbered IDs instead of random values. (probably an investigation for another day)
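To make the UUID-prefix suggestion concrete, here is a minimal Python sketch, assuming a caller-supplied `in_use` probe (hypothetical) that reports whether an ID is already claimed. It takes the top 16 bits of the VM UUID, places them in the upper half of the 32-bit vsock ID, and tries up to 2^16 consecutive IDs from there. Skipping 0xFFFFFFFF is an assumption based on that value being the wildcard CID.

```python
import uuid
from typing import Callable, Optional

RESERVED_MAX = 2           # IDs 0-2 seem to be reserved
WILDCARD_CID = 0xFFFFFFFF  # assumption: not assignable to a guest


def cid_from_uuid_prefix(vm_uuid: uuid.UUID,
                         in_use: Callable[[int], bool],
                         probes: int = 2**16) -> Optional[int]:
    """Probe IDs starting at (first 16 bits of the UUID) << 16."""
    prefix = vm_uuid.int >> 112   # top 16 bits of the 128-bit UUID
    base = prefix << 16           # move them into the high half of the ID
    for offset in range(probes):
        cid = base + offset
        if cid <= RESERVED_MAX or cid == WILDCARD_CID:
            continue              # never hand out reserved values
        if not in_use(cid):
            return cid
    return None                   # every probed ID was taken


if __name__ == "__main__":
    vm = uuid.uuid4()
    print(hex(cid_from_uuid_prefix(vm, lambda cid: False)))
```

Each VM gets its own 65536-wide window, so two VMs only compete for IDs when the first 16 bits of their UUIDs match, and tools that count upward from 3 stay out of the way unless the prefix happens to be zero.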

Anis-cpu-13 · 2023-03-27T01:08:54Z

Hello, I am writing to express my interest in working on the issue mentioned in the bug report #11508 for LXD. As a student at the university, I am eager to contribute to open source projects and gain valuable experience in software development.

I have experience in Linux systems and I am familiar with virtualization technologies. I believe that my skills and knowledge would be useful in resolving the issue mentioned in the bug report. I am willing to work with the LXD development team to find a solution to the problem and contribute to the project.

Thank you for considering my interest in this issue. I look forward to hearing back from you.

tomponline · 2023-03-27T08:27:38Z

Thanks @Anis-cpu-13 assigned to you!

Gio2241 · 2023-05-14T18:23:53Z

I think I have exactly the same issue: peer containers cause a vsock ID conflict when I try to launch a QEMU VM within containers.

What's the current status of the fix?

Is there a workaround/hack to fix it for now?

@tomponline

Gio2241 · 2023-05-26T08:34:57Z

@stgraber, this issue is what is stopping our company from moving to LXD. Is there any workaround before the fix?
I thought 5.14 would fix this one :/

tomponline · 2023-05-26T08:46:28Z

@kochia7 what is the use case for running VMs inside containers? (It may be that there is a workaround in the short term until this is fixed.)

Gio2241 · 2023-05-26T09:03:46Z

> @kochia7 what is the use case for running VMs inside containers? (It may be that there is a workaround in the short term until this is fixed.)

We have Android machines (Qemu/CrossVM) running one per container. Unfortunately, we are not able to run VMs directly on the machine, so we are planning to use LXC containers as lightweight isolation for each Qemu VM.

tomponline · 2023-05-26T09:06:17Z

Thanks. What is the reason for "we are not able to run VMs directly on the machine"?

And are you aware that by passing through non-namespaced devices like /dev/kvm, you are potentially exposing your host to attacks from the containers? I just want to check that you know this reduces the isolation. What sort of isolation are you expecting from running VMs inside containers?

Gio2241 · 2023-05-26T09:16:00Z

The VMM we are using creates artifacts that interfere with those of other machines. That is how the VMM is built: it is unable to work with several VMs in parallel. So LXC creates just enough isolation to make them work.

tomponline · 2023-05-26T09:17:55Z

You can set volatile.vsock_id on the instance before starting it, so if you're able to set them to non-conflicting IDs, then you can work around the problem for now.

lxc config set <VM instance> volatile.vsock_id=n

tomponline · 2023-05-26T09:19:41Z

I'm not sure how it will work with LXD's own vsock listener, though (as opposed to the lxd-agent's).
It may be that vsock just won't work properly when being run inside containers.

Gio2241 · 2023-05-26T09:26:36Z

> You can set volatile.vsock_id on the instance before starting it, so if you're able to set them to non-conflicting IDs, then you can work around the problem for now.
> lxc config set <VM instance> volatile.vsock_id=n

I will give it a try, seems promising! Thanks!

Gio2241 · 2023-05-26T13:15:17Z

I used lxc config set <VM instance> volatile.vsock_id=n for the LXC containers from the host, and guest-cid/-vsock_guest_cid for Qemu within the containers for the nested VMs, and it worked!

tomponline · 2023-05-26T13:41:36Z

Only the last one would have had any effect, as setting volatile.vsock_id on a container doesn't do anything.

Gio2241 · 2023-05-29T18:01:17Z

> You can set volatile.vsock_id on the instance before starting it, so if you're able to set them to non-conflicting IDs, then you can work around the problem for now.
> lxc config set <VM instance> volatile.vsock_id=n

Didn't really work for LXD: https://github.com/lxc/lxd/issues/11739#issuecomment-1567389272

When acquiring a new Context ID for the communication via vsock, pick the first four bytes of the instance's UUID (try to get as much randomness as possible out of the 16 bytes) and convert them into a uint32. If there is a collision, try again after adding 1 to the ID. The syscall to the vsock returns ENODEV in case the Context ID is not yet assigned. Fixes https://github.com/lxc/lxd/issues/11508 Signed-off-by: Julian Pelizäus <[email protected]>

When acquiring a new Context ID for the communication via vsock, use the UUID of the instance as a seed for generating random uint32 candidates. The loop is kept open until a free Context ID is found. The syscall to the vsock returns ENODEV in case the Context ID is not yet assigned. In case the Context ID of a stopped VM was already acquired again, a new one is now generated. Fixes https://github.com/lxc/lxd/issues/11508 Signed-off-by: Julian Pelizäus <[email protected]>

When acquiring a new Context ID for the communication via vsock, use the UUID of the instance as a seed for generating random uint32 candidates. The loop is kept open until a free Context ID is found or the timeout of 5s is reached. The syscall to the vsock returns ENODEV in case the Context ID is not yet assigned. In case the Context ID of a stopped VM was already acquired again, a new one is now generated. Fixes https://github.com/lxc/lxd/issues/11508 Signed-off-by: Julian Pelizäus <[email protected]>

When acquiring a new Context ID for the communication via vsock, use the UUID of the instance as a seed for generating random uint32 candidates. The loop is kept open until a free Context ID is found or the timeout of 5s is reached. The syscall to the vsock returns ENODEV in case the Context ID is not yet assigned. In case the Context ID of a stopped VM was already acquired again, a new one gets picked. Fixes https://github.com/lxc/lxd/issues/11508 Signed-off-by: Julian Pelizäus <[email protected]>

When acquiring a new Context ID for the communication via vsock, use the UUID of the instance as a seed for generating random uint32 candidates. The loop is kept open until a free Context ID is found or the timeout of 5s is reached. The syscall to the vsock returns ENODEV in case the Context ID is not yet assigned. In case the Context ID of a stopped VM was already acquired again, a new one gets picked. Fixes lxc#11508 Signed-off-by: Julian Pelizäus <[email protected]>

When acquiring a new Context ID for the communication via vsock, use the UUID of the instance as a seed for generating random uint32 candidates. The loop is kept open until a free Context ID is found or the timeout of 5s is reached. The syscall to the vsock returns ENODEV in case the Context ID is not yet assigned. In case the Context ID of a stopped VM was already acquired again, a new one gets picked. Removes the `vhost_vsock` feature since the value is no longer accessed. Fixes lxc#11508 Signed-off-by: Julian Pelizäus <[email protected]>
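The approach described in the last commit message can be sketched roughly as follows. This is an illustrative Python sketch, not the code merged into LXD. The `in_use` callback is a hypothetical stand-in for the vsock probe (which, per the commit message, reports ENODEV when an ID is unassigned), and the candidate range 3 to 0xFFFFFFFE is an assumption that excludes the reserved low IDs and the wildcard value.

```python
import random
import time
import uuid
from typing import Callable, Optional


def pick_cid(vm_uuid: uuid.UUID,
             in_use: Callable[[int], bool],
             timeout: float = 5.0) -> Optional[int]:
    """Draw UUID-seeded random candidates until one is free or time is up."""
    rng = random.Random(vm_uuid.int)      # same UUID -> same candidate order
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        # Assumed usable range: skip reserved IDs 0-2 and 0xFFFFFFFF.
        cid = rng.randrange(3, 2**32 - 1)
        if not in_use(cid):
            return cid
    return None                           # timed out without a free ID


if __name__ == "__main__":
    vm = uuid.uuid4()
    print(pick_cid(vm, lambda cid: False))
```

Seeding with the UUID means a restarted VM tries the same candidates in the same order, so it tends to get its previous ID back and only moves to the next candidate if another guest has claimed that ID in the meantime.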

tomponline added the Bug Confirmed to be a bug label Mar 23, 2023

mpontillo changed the title ~~Nested LXD virtualization in multiple containers causes vsock ID conflicts~~ Nested LXD virtualization in peer containers causes vsock ID conflicts Mar 24, 2023

tomponline assigned Anis-cpu-13 Mar 27, 2023

stgraber added the Easy Good for new contributors label Apr 14, 2023

stgraber modified the milestones: soon, lxd-5.14 Apr 14, 2023

Gio2241 mentioned this issue May 13, 2023

Unable to use nested virtualization with LXC container #11674

Closed

stgraber modified the milestones: lxd-5.14, lxd-5.15 May 25, 2023

tomponline mentioned this issue May 26, 2023

shared/instance: Separate some instance type specific config key validation #11735

Merged

stgraber mentioned this issue May 29, 2023

Unable to Set volatile.vsock_id for LXD VM #11739

Closed

tomponline modified the milestones: lxd-5.15, lxd-5.16 Jun 21, 2023

tomponline assigned roosterfish Jun 23, 2023

roosterfish mentioned this issue Jun 27, 2023

lxd/instance/drivers/qemu: Pick a random vsock Context ID #11896

Merged

tomponline closed this as completed in #11896 Jun 28, 2023
