
SR-IOV Device Plugin Pods Keep Restarting with 'terminated' Signal on All Nodes #610

Open
koh-hr opened this issue Dec 6, 2024 · 6 comments


koh-hr commented Dec 6, 2024

What happened?

The sriov-device-plugin Pods keep restarting with the log messages below. This happens to every Pod in the DaemonSet across the targeted nodes, not just a single Pod.

main.go:87] Received signal "terminated", shutting down.
server.go:318] stopping hostdev device plugin server...
server.go:182] ListAndWatch(hostdev): terminate signal received

The Operator logs do not contain anything useful. I'm stuck and would appreciate your help.
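
For reference, the restart loop can be confirmed from the Pod status and the last termination state. A minimal sketch, assuming the nvidia-network-operator namespace and the DaemonSet selector name=sriov-device-plugin shown later in this issue:

# Restart counts across the DaemonSet
$ kubectl get pods -n nvidia-network-operator -l name=sriov-device-plugin -o wide
# Reason and exit code of the previous container termination
$ kubectl describe pod -n nvidia-network-operator sriov-device-plugin-2lz5z | grep -A5 'Last State'
# Logs of the previous (terminated) container instance
$ kubectl logs -n nvidia-network-operator sriov-device-plugin-2lz5z --previous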

What did you expect to happen?

The Pods should remain running without restarting.

What are the minimal steps needed to reproduce the bug?

Deploy the sriov-device-plugin to Kubernetes using the NVIDIA Network Operator v24.1.0, following the guide below:
https://docs.nvidia.com/networking/display/kubernetes2410/getting+started+with+kubernetes#src-2494425587_GettingStartedwithKubernetes-NetworkOperatorDeploymentforGPUDirectWorkloads

Anything else we need to know?

This issue is occurring in an on-premises environment.
The same issue happens in two separate clusters.

Component Versions

Please fill in the below table with the version numbers of components used.

| Component | Version |
| --- | --- |
| SR-IOV Network Device Plugin | ghcr.io/k8snetworkplumbingwg/sriov-network-device-plugin:2cc723dcbc712290055b763dc9d3c090ba41e929 |
| SR-IOV CNI Plugin | Not used |
| Multus | v3.9.3 |
| Kubernetes | v1.27.14 |
| OS | Ubuntu 20.04.6 / 22.04.5 |

Config Files

Config file locations may be config dependent.

Device pool config file location (Try '/etc/pcidp/config.json')

Command executed on the host:

$ ls -l /etc/cni/multus/net.d
ls: cannot access '/etc/cni/multus/net.d': No such file or directory
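
Note: the command above checks the Multus path rather than the device pool config at '/etc/pcidp/config.json'. In an operator-managed deployment the device-plugin config is delivered through a ConfigMap mounted into the Pod (the kubelet logs below show a config-volume ConfigMap being mounted), so a hedged alternative check, assuming the config is mounted at the default path inside the Pod:

# List ConfigMaps in the operator namespace (the exact name may differ)
$ kubectl get configmap -n nvidia-network-operator
# Read the device pool config from inside a running device-plugin Pod
$ kubectl exec -n nvidia-network-operator sriov-device-plugin-2lz5z -- cat /etc/pcidp/config.json
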
Multus config (Try '/etc/cni/multus/net.d')

Command executed on the host:

$ ls -l /etc/cni/multus/net.d
ls: cannot access '/etc/cni/multus/net.d': No such file or directory
CNI config (Try '/etc/cni/net.d/')

Command executed on the host:

$ ls -l /etc/cni/net.d/
total 20
-rw------- 1 root root  872 12月  5 14:46 00-multus.conf
-rw-r--r-- 1 root root  679 12月  5 14:47 10-calico.conflist
-rw------- 1 root root 2574 12月  6 08:41 calico-kubeconfig
drwxr-xr-x 2 root root 4096 12月  5 14:46 multus.d
drwxr-xr-x 2 root root 4096 10月  3 13:48 whereabouts.d
Kubernetes deployment type (Bare Metal, Kubeadm etc.)

kubeadm

SR-IOV Network Custom Resource Definition

The configuration for the NicClusterPolicy is as follows:

$ kubectl get nicclusterpolicies.mellanox.com nic-cluster-policy -o yaml
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
・・・
spec:
  rdmaSharedDevicePlugin:
    config: |
      {
        "configList": [
          {
            "resourceName": "rdma_shared_device_a",
            "rdmaHcaMax": 63,
            "selectors": {
              "vendors": ["15b3"],
              "deviceIDs": [],
              "drivers": [],
              "ifNames": [],
              "linkTypes": []
            }
          }
        ]
      }
    image: k8s-rdma-shared-dev-plugin
    imagePullSecrets: []
    repository: ghcr.io/mellanox
    version: sha-fe7f371c7e1b8315bf900f71cd25cfc1251dc775
  secondaryNetwork:
    cniPlugins:
      image: plugins
      imagePullSecrets: []
      repository: ghcr.io/k8snetworkplumbingwg
      version: v1.3.0
    ipamPlugin:
      image: whereabouts
      imagePullSecrets: []
      repository: ghcr.io/k8snetworkplumbingwg
      version: v0.6.2
    multus:
      image: multus-cni
      imagePullSecrets: []
      repository: ghcr.io/k8snetworkplumbingwg
      version: v3.9.3
  sriovDevicePlugin:
    config: |
      {
        "resourceList": [
          {
            "resourcePrefix": "nvidia.com",
            "resourceName": "hostdev",
            "selectors": {
              "vendors": ["15b3"],
              "devices": ["1014", "101e"],
              "drivers": [],
              "pfNames": [],
              "pciAddresses": [],
              "rootDevices": [],
              "linkTypes": [],
              "isRdma": true
            }
          }
        ]
      }
    image: sriov-network-device-plugin
    imagePullSecrets: []
    repository: ghcr.io/k8snetworkplumbingwg
    version: 2cc723dcbc712290055b763dc9d3c090ba41e929
status:
  appliedStates:
  - name: state-multus-cni
    state: ready
  - name: state-container-networking-plugins
    state: ready
  - name: state-ipoib-cni
    state: ignore
  - name: state-whereabouts-cni
    state: ready
  - name: state-OFED
    state: ignore
  - name: state-SRIOV-device-plugin
    state: ready
  - name: state-RDMA-device-plugin
    state: ready
  - name: state-ib-kubernetes
    state: ignore
  - name: state-nv-ipam-cni
    state: ignore
  - name: state-nic-feature-discovery
    state: ignore
  state: ready
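
Given the sriovDevicePlugin config above (vendor 15b3, devices 1014/101e, resource nvidia.com/hostdev), the node should advertise nvidia.com/hostdev as an allocatable resource while the plugin is running. A quick sketch to verify, assuming node name node2h-1 from the kubelet logs below:

$ kubectl get node node2h-1 -o jsonpath="{.status.allocatable['nvidia\.com/hostdev']}"
# or scan the describe output
$ kubectl describe node node2h-1 | grep -A12 'Allocatable:' | grep hostdev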

Logs

The following logs are from a node targeted by SR-IOV. The same issue occurs on non-targeted nodes.

SR-IOV Network Device Plugin Logs (use kubectl logs $PODNAME)
$ kubectl logs -n nvidia-network-operator sriov-device-plugin-2lz5z -f
I1206 08:36:58.310502       1 manager.go:57] Using Kubelet Plugin Registry Mode
I1206 08:36:58.311032       1 main.go:46] resource manager reading configs
I1206 08:36:58.311060       1 manager.go:86] raw ResourceList: { "resourceList": [ { "resourcePrefix": "nvidia.com", "resourceName": "hostdev", "selectors": { "vendors": ["15b3"], "devices": ["1014", "101e"], "drivers": [], "pfNames": [], "pciAddresses": [], "rootDevices": [], "linkTypes": [], "isRdma": true } } ] }
I1206 08:36:58.311149       1 factory.go:198] *types.NetDeviceSelectors for resource hostdev is [0xc000392d20]
I1206 08:36:58.311157       1 manager.go:106] unmarshalled ResourceList: [{ResourcePrefix:nvidia.com ResourceName:hostdev DeviceType:netDevice ExcludeTopology:false Selectors:0xc000011db8 AdditionalInfo:map[] SelectorObjs:[0xc000392d20]}]
I1206 08:36:58.311187       1 manager.go:217] validating resource name "nvidia.com/hostdev"
I1206 08:36:58.311193       1 main.go:62] Discovering host devices
I1206 08:36:58.435168       1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:0b:00.0	02          	Intel Corporation   	Ethernet Controller X550
I1206 08:36:58.435215       1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:18:00.0	02          	Mellanox Technolo...	MT2910 Family [ConnectX-7]
I1206 08:36:58.435219       1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:18:00.1	02          	Mellanox Technolo...	ConnectX Family mlx5Gen Virtual Function
I1206 08:36:58.435221       1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:18:00.2	02          	Mellanox Technolo...	ConnectX Family mlx5Gen Virtual Function
I1206 08:36:58.435224       1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:18:00.3	02          	Mellanox Technolo...	ConnectX Family mlx5Gen Virtual Function
I1206 08:36:58.435226       1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:18:00.4	02          	Mellanox Technolo...	ConnectX Family mlx5Gen Virtual Function
I1206 08:36:58.435228       1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:18:00.5	02          	Mellanox Technolo...	ConnectX Family mlx5Gen Virtual Function
I1206 08:36:58.435231       1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:18:00.6	02          	Mellanox Technolo...	ConnectX Family mlx5Gen Virtual Function
I1206 08:36:58.435233       1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:18:00.7	02          	Mellanox Technolo...	ConnectX Family mlx5Gen Virtual Function
I1206 08:36:58.435241       1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:18:01.0	02          	Mellanox Technolo...	ConnectX Family mlx5Gen Virtual Function
I1206 08:36:58.435244       1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:29:00.0	02          	Mellanox Technolo...	MT2910 Family [ConnectX-7]
I1206 08:36:58.435246       1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:29:00.1	02          	Mellanox Technolo...	MT2910 Family [ConnectX-7]
I1206 08:36:58.435249       1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:40:00.0	02          	Mellanox Technolo...	MT2910 Family [ConnectX-7]
I1206 08:36:58.435252       1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:4f:00.0	02          	Mellanox Technolo...	MT2910 Family [ConnectX-7]
I1206 08:36:58.435254       1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:5e:00.0	02          	Mellanox Technolo...	MT2910 Family [ConnectX-7]
I1206 08:36:58.435258       1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:82:00.0	02          	Intel Corporation   	Ethernet Controller E810-C for QSFP
I1206 08:36:58.435261       1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:82:00.1	02          	Intel Corporation   	Ethernet Controller E810-C for QSFP
I1206 08:36:58.435266       1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:9a:00.0	02          	Mellanox Technolo...	MT2910 Family [ConnectX-7]
I1206 08:36:58.435269       1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:aa:00.0	02          	Mellanox Technolo...	MT2910 Family [ConnectX-7]
I1206 08:36:58.435272       1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:aa:00.1	02          	Mellanox Technolo...	MT2910 Family [ConnectX-7]
I1206 08:36:58.435274       1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:c0:00.0	02          	Mellanox Technolo...	MT2910 Family [ConnectX-7]
I1206 08:36:58.435277       1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:ce:00.0	02          	Mellanox Technolo...	MT2910 Family [ConnectX-7]
I1206 08:36:58.435279       1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:dc:00.0	02          	Mellanox Technolo...	MT2910 Family [ConnectX-7]
I1206 08:36:58.435284       1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:0b:00.0	02          	Intel Corporation   	Ethernet Controller X550
I1206 08:36:58.435576       1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:18:00.0	02          	Mellanox Technolo...	MT2910 Family [ConnectX-7]
I1206 08:36:58.437685       1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:18:00.1	02          	Mellanox Technolo...	ConnectX Family mlx5Gen Virtual Function
I1206 08:36:58.437862       1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:18:00.2	02          	Mellanox Technolo...	ConnectX Family mlx5Gen Virtual Function
I1206 08:36:58.438015       1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:18:00.3	02          	Mellanox Technolo...	ConnectX Family mlx5Gen Virtual Function
I1206 08:36:58.438152       1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:18:00.4	02          	Mellanox Technolo...	ConnectX Family mlx5Gen Virtual Function
I1206 08:36:58.438277       1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:18:00.5	02          	Mellanox Technolo...	ConnectX Family mlx5Gen Virtual Function
I1206 08:36:58.438412       1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:18:00.6	02          	Mellanox Technolo...	ConnectX Family mlx5Gen Virtual Function
I1206 08:36:58.438558       1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:18:00.7	02          	Mellanox Technolo...	ConnectX Family mlx5Gen Virtual Function
I1206 08:36:58.438687       1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:18:01.0	02          	Mellanox Technolo...	ConnectX Family mlx5Gen Virtual Function
I1206 08:36:58.438820       1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:29:00.0	02          	Mellanox Technolo...	MT2910 Family [ConnectX-7]
I1206 08:36:58.438975       1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:29:00.1	02          	Mellanox Technolo...	MT2910 Family [ConnectX-7]
I1206 08:36:58.439101       1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:40:00.0	02          	Mellanox Technolo...	MT2910 Family [ConnectX-7]
I1206 08:36:58.439240       1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:4f:00.0	02          	Mellanox Technolo...	MT2910 Family [ConnectX-7]
I1206 08:36:58.439383       1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:5e:00.0	02          	Mellanox Technolo...	MT2910 Family [ConnectX-7]
I1206 08:36:58.439559       1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:82:00.0	02          	Intel Corporation   	Ethernet Controller E810-C for QSFP
I1206 08:36:58.439708       1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:82:00.1	02          	Intel Corporation   	Ethernet Controller E810-C for QSFP
I1206 08:36:58.439849       1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:9a:00.0	02          	Mellanox Technolo...	MT2910 Family [ConnectX-7]
I1206 08:36:58.440022       1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:aa:00.0	02          	Mellanox Technolo...	MT2910 Family [ConnectX-7]
I1206 08:36:58.440176       1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:aa:00.1	02          	Mellanox Technolo...	MT2910 Family [ConnectX-7]
I1206 08:36:58.440331       1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:c0:00.0	02          	Mellanox Technolo...	MT2910 Family [ConnectX-7]
I1206 08:36:58.440499       1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:ce:00.0	02          	Mellanox Technolo...	MT2910 Family [ConnectX-7]
I1206 08:36:58.440641       1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:dc:00.0	02          	Mellanox Technolo...	MT2910 Family [ConnectX-7]
I1206 08:36:58.440790       1 main.go:68] Initializing resource servers
I1206 08:36:58.440794       1 manager.go:117] number of config: 1
I1206 08:36:58.440797       1 manager.go:121] Creating new ResourcePool: hostdev
I1206 08:36:58.440801       1 manager.go:122] DeviceType: netDevice
W1206 08:36:58.440816       1 pciNetDevice.go:71] RDMA resources for 0000:0b:00.0 not found. Are RDMA modules loaded?
I1206 08:36:58.442032       1 utils.go:81] Devlink query for eswitch mode is not supported for device 0000:18:00.1. <nil>
I1206 08:36:58.443401       1 utils.go:81] Devlink query for eswitch mode is not supported for device 0000:18:00.2. <nil>
I1206 08:36:58.444023       1 utils.go:81] Devlink query for eswitch mode is not supported for device 0000:18:00.3. <nil>
I1206 08:36:58.444543       1 utils.go:81] Devlink query for eswitch mode is not supported for device 0000:18:00.4. <nil>
I1206 08:36:58.445116       1 utils.go:81] Devlink query for eswitch mode is not supported for device 0000:18:00.5. <nil>
I1206 08:36:58.445632       1 utils.go:81] Devlink query for eswitch mode is not supported for device 0000:18:00.6. <nil>
I1206 08:36:58.446304       1 utils.go:81] Devlink query for eswitch mode is not supported for device 0000:18:00.7. <nil>
I1206 08:36:58.446958       1 utils.go:81] Devlink query for eswitch mode is not supported for device 0000:18:01.0. <nil>
W1206 08:36:58.449691       1 pciNetDevice.go:71] RDMA resources for 0000:82:00.0 not found. Are RDMA modules loaded?
W1206 08:36:58.449775       1 pciNetDevice.go:71] RDMA resources for 0000:82:00.1 not found. Are RDMA modules loaded?
I1206 08:36:58.453150       1 manager.go:138] initServers(): selector index 0 will register 8 devices
I1206 08:36:58.453167       1 factory.go:111] device added: [identifier: 0000:18:00.1, vendor: 15b3, device: 101e, driver: mlx5_core]
I1206 08:36:58.453170       1 factory.go:111] device added: [identifier: 0000:18:00.2, vendor: 15b3, device: 101e, driver: mlx5_core]
I1206 08:36:58.453173       1 factory.go:111] device added: [identifier: 0000:18:00.3, vendor: 15b3, device: 101e, driver: mlx5_core]
I1206 08:36:58.453175       1 factory.go:111] device added: [identifier: 0000:18:00.4, vendor: 15b3, device: 101e, driver: mlx5_core]
I1206 08:36:58.453177       1 factory.go:111] device added: [identifier: 0000:18:00.5, vendor: 15b3, device: 101e, driver: mlx5_core]
I1206 08:36:58.453179       1 factory.go:111] device added: [identifier: 0000:18:00.6, vendor: 15b3, device: 101e, driver: mlx5_core]
I1206 08:36:58.453181       1 factory.go:111] device added: [identifier: 0000:18:00.7, vendor: 15b3, device: 101e, driver: mlx5_core]
I1206 08:36:58.453186       1 factory.go:111] device added: [identifier: 0000:18:01.0, vendor: 15b3, device: 101e, driver: mlx5_core]
I1206 08:36:58.453200       1 manager.go:156] New resource server is created for hostdev ResourcePool
I1206 08:36:58.453204       1 main.go:74] Starting all servers...
I1206 08:36:58.453389       1 server.go:254] starting hostdev device plugin endpoint at: nvidia.com_hostdev.sock
I1206 08:36:58.453885       1 server.go:282] hostdev device plugin endpoint started serving
I1206 08:36:58.453927       1 main.go:79] All servers started.
I1206 08:36:58.453930       1 main.go:80] Listening for term signals
I1206 08:36:59.333405       1 server.go:116] Plugin: nvidia.com_hostdev.sock gets registered successfully at Kubelet
I1206 08:36:59.333424       1 server.go:157] ListAndWatch(hostdev) invoked
I1206 08:36:59.333464       1 server.go:170] ListAndWatch(hostdev): send devices &ListAndWatchResponse{Devices:[]*Device{&Device{ID:0000:18:00.5,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:0,},},},},&Device{ID:0000:18:00.6,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:0,},},},},&Device{ID:0000:18:00.7,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:0,},},},},&Device{ID:0000:18:01.0,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:0,},},},},&Device{ID:0000:18:00.1,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:0,},},},},&Device{ID:0000:18:00.2,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:0,},},},},&Device{ID:0000:18:00.3,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:0,},},},},&Device{ID:0000:18:00.4,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:0,},},},},},}
I1206 08:39:41.368880       1 main.go:87] Received signal "terminated", shutting down.
I1206 08:39:41.368971       1 server.go:318] stopping hostdev device plugin server...
I1206 08:39:41.369006       1 server.go:182] ListAndWatch(hostdev): terminate signal received
Multus logs (If enabled. Try '/var/log/multus.log' )
Kubelet logs (journalctl -u kubelet)
$ sudo journalctl -u kubelet | grep sriov-device-plugin-2lz5z
12月 06 17:36:57 node2h-1 kubelet[10422]: I1206 17:36:57.463195   10422 topology_manager.go:212] "Topology Admit Handler" podUID=62c854d3-f207-4578-93fb-c10114aadf46 podNamespace="nvidia-network-operator" podName="sriov-device-plugin-2lz5z"
12月 06 17:36:57 node2h-1 kubelet[10422]: I1206 17:36:57.596513   10422 reconciler_common.go:258] "operationExecutor.VerifyControllerAttachedVolume started for volume \"devicesock\" (UniqueName: \"kubernetes.io/host-path/62c854d3-f207-4578-93fb-c10114aadf46-devicesock\") pod \"sriov-device-plugin-2lz5z\" (UID: \"62c854d3-f207-4578-93fb-c10114aadf46\") " pod="nvidia-network-operator/sriov-device-plugin-2lz5z"
12月 06 17:36:57 node2h-1 kubelet[10422]: I1206 17:36:57.596545   10422 reconciler_common.go:258] "operationExecutor.VerifyControllerAttachedVolume started for volume \"device-info\" (UniqueName: \"kubernetes.io/host-path/62c854d3-f207-4578-93fb-c10114aadf46-device-info\") pod \"sriov-device-plugin-2lz5z\" (UID: \"62c854d3-f207-4578-93fb-c10114aadf46\") " pod="nvidia-network-operator/sriov-device-plugin-2lz5z"
12月 06 17:36:57 node2h-1 kubelet[10422]: I1206 17:36:57.596565   10422 reconciler_common.go:258] "operationExecutor.VerifyControllerAttachedVolume started for volume \"log\" (UniqueName: \"kubernetes.io/host-path/62c854d3-f207-4578-93fb-c10114aadf46-log\") pod \"sriov-device-plugin-2lz5z\" (UID: \"62c854d3-f207-4578-93fb-c10114aadf46\") " pod="nvidia-network-operator/sriov-device-plugin-2lz5z"
12月 06 17:36:57 node2h-1 kubelet[10422]: I1206 17:36:57.596585   10422 reconciler_common.go:258] "operationExecutor.VerifyControllerAttachedVolume started for volume \"kube-api-access-rp9rc\" (UniqueName: \"kubernetes.io/projected/62c854d3-f207-4578-93fb-c10114aadf46-kube-api-access-rp9rc\") pod \"sriov-device-plugin-2lz5z\" (UID: \"62c854d3-f207-4578-93fb-c10114aadf46\") " pod="nvidia-network-operator/sriov-device-plugin-2lz5z"
12月 06 17:36:57 node2h-1 kubelet[10422]: I1206 17:36:57.596606   10422 reconciler_common.go:258] "operationExecutor.VerifyControllerAttachedVolume started for volume \"config-volume\" (UniqueName: \"kubernetes.io/configmap/62c854d3-f207-4578-93fb-c10114aadf46-config-volume\") pod \"sriov-device-plugin-2lz5z\" (UID: \"62c854d3-f207-4578-93fb-c10114aadf46\") " pod="nvidia-network-operator/sriov-device-plugin-2lz5z"
12月 06 17:36:57 node2h-1 kubelet[10422]: I1206 17:36:57.596628   10422 reconciler_common.go:258] "operationExecutor.VerifyControllerAttachedVolume started for volume \"plugins-registry\" (UniqueName: \"kubernetes.io/host-path/62c854d3-f207-4578-93fb-c10114aadf46-plugins-registry\") pod \"sriov-device-plugin-2lz5z\" (UID: \"62c854d3-f207-4578-93fb-c10114aadf46\") " pod="nvidia-network-operator/sriov-device-plugin-2lz5z"
12月 06 17:37:38 node2h-1 kubelet[10422]: I1206 17:37:38.424618   10422 pod_startup_latency_tracker.go:102] "Observed pod startup duration" pod="nvidia-network-operator/sriov-device-plugin-2lz5z" podStartSLOduration=41.424578436 podCreationTimestamp="2024-12-06 17:36:57 +0900 JST" firstStartedPulling="0001-01-01 00:00:00 +0000 UTC" lastFinishedPulling="0001-01-01 00:00:00 +0000 UTC" observedRunningTime="2024-12-06 17:36:59.33157114 +0900 JST m=+96647.890058047" watchObservedRunningTime="2024-12-06 17:37:38.424578436 +0900 JST m=+96686.983065344"
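
Since the kubelet grep above only matches the Pod name, it may also be worth checking whether kubelet logs anything about the plugin's registration socket around the restart time. A sketch, assuming shell access on node2h-1 and the default kubelet plugins-registry path:

$ sudo journalctl -u kubelet --since "2024-12-06 17:36" --until "2024-12-06 17:45" | grep -iE 'hostdev|device.?plugin'
# The registration socket created by the plugin (nvidia.com_hostdev.sock)
$ ls -l /var/lib/kubelet/plugins_registry/ | grep hostdev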

zeeke commented Dec 6, 2024

Hi @koh-hr, can you please attach the logs from the sriov-network-config-daemon running on the same node? That daemon kills/restarts the device plugin when it completes a configuration, and it is probably stuck in a configuration loop.
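
A sketch for locating the config-daemon Pod scheduled on the same node as the restarting device plugin, assuming the nvidia-network-operator namespace used elsewhere in this issue (the placeholder pod name is hypothetical):

# Find the config-daemon Pod on the affected node
$ kubectl get pods -n nvidia-network-operator -o wide --field-selector spec.nodeName=node2h-1 | grep sriov-network-config-daemon
$ kubectl logs -n nvidia-network-operator <config-daemon-pod-on-that-node>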


koh-hr commented Dec 6, 2024

@zeeke
Thank you for the reply! Since the sriov-network-config-daemon was not running on that node, I added the label required by its DaemonSet node selector and the Pod started. The logs are as follows:

$ kubectl get ds -n nvidia-network-operator -o wide
NAME                                             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                                                                              AGE   CONTAINERS                    IMAGES                                                                                              SELECTOR
cni-plugins-ds                                   7         7         7       7            7           <none>                                                                                                     63d   cni-plugins                   ghcr.io/k8snetworkplumbingwg/plugins:v1.3.0                                                         name=cni-plugins
kube-multus-ds                                   7         7         7       7            7           <none>                                                                                                     63d   kube-multus                   ghcr.io/k8snetworkplumbingwg/multus-cni:v3.9.3                                                      name=multus
network-operator-node-feature-discovery-worker   10        10        10      10           10          <none>                                                                                                     63d   worker                        registry.k8s.io/nfd/node-feature-discovery:v0.13.2                                                  app.kubernetes.io/instance=network-operator,app.kubernetes.io/name=node-feature-discovery,role=worker
rdma-shared-dp-ds                                7         7         7       7            7           feature.node.kubernetes.io/pci-15b3.present=true,network.nvidia.com/operator.mofed.wait=false              63d   rdma-shared-dp                ghcr.io/mellanox/k8s-rdma-shared-dev-plugin:sha-fe7f371c7e1b8315bf900f71cd25cfc1251dc775            app=rdma-shared-dp
sriov-device-plugin                              7         7         7       7            7           feature.node.kubernetes.io/pci-15b3.present=true,network.nvidia.com/operator.mofed.wait=false              10s   kube-sriovdp                  ghcr.io/k8snetworkplumbingwg/sriov-network-device-plugin:2cc723dcbc712290055b763dc9d3c090ba41e929   name=sriov-device-plugin
sriov-network-config-daemon                      0         0         0       0            0           beta.kubernetes.io/os=linux,network.nvidia.com/operator.mofed.wait=false,node-role.kubernetes.io/worker=   63d   sriov-network-config-daemon   nvcr.io/nvidia/mellanox/sriov-network-operator-config-daemon:network-operator-24.1.0                app=sriov-network-config-daemon
whereabouts                                      7         7         7       7            7           <none>                                                                                                     63d   whereabouts                   ghcr.io/k8snetworkplumbingwg/whereabouts:v0.6.2                                                     name=whereabouts

$ kubectl label node node2h-1 node-role.kubernetes.io/worker=
node/node2h-1 labeled
$ kubectl logs -n nvidia-network-operator sriov-network-config-daemon-cf2jn -f
2024-12-06T10:50:52.230941843Z	INFO	sriov-network-config-daemon	cobra/command.go:940	starting node writer
2024-12-06T10:50:52.251782297Z	INFO	sriov-network-config-daemon	cobra/command.go:940	Running on	{"platform": "Baremetal"}
2024-12-06T10:50:52.267876463Z	ERROR	sriov-network-config-daemon/start.go:260	SendEvent(): Failed to fetch node state, skip SendEvent	{"name": "node2h-1", "error": "sriovnetworknodestates.sriovnetwork.openshift.io \"node2h-1\" not found"}
2024-12-06T10:50:52.267915984Z	INFO	sriov-network-config-daemon/start.go:263	RunOnce()
2024-12-06T10:50:52.267923355Z	INFO	sriov-network-config-daemon/start.go:263	RunOnce(): first poll for nic status
2024-12-06T10:50:52.389190289Z	ERROR	sriov/sriov.go:260	GetNetDevLinkSpeed(): fail to read Link Speed file	{"path": "/sys/class/net/ens6f0/speed", "error": "read /sys/class/net/ens6f0/speed: invalid argument"}
2024-12-06T10:50:52.389424691Z	ERROR	sriov/sriov.go:260	GetNetDevLinkSpeed(): fail to read Link Speed file	{"path": "/sys/class/net/ens6f1/speed", "error": "read /sys/class/net/ens6f1/speed: invalid argument"}
2024-12-06T10:50:52.396986803Z	ERROR	wait/wait.go:109	getNodeState(): Failed to fetch node state, close all connections and retry...	{"name": "node2h-1", "error": "sriovnetworknodestates.sriovnetwork.openshift.io \"node2h-1\" not found"}
2024-12-06T10:51:02.41984047Z	ERROR	wait/wait.go:109	getNodeState(): Failed to fetch node state, close all connections and retry...	{"name": "node2h-1", "error": "sriovnetworknodestates.sriovnetwork.openshift.io \"node2h-1\" not found"}
2024-12-06T10:51:12.419159128Z	ERROR	wait/wait.go:109	getNodeState(): Failed to fetch node state, close all connections and retry...	{"name": "node2h-1", "error": "sriovnetworknodestates.sriovnetwork.openshift.io \"node2h-1\" not found"}
2024-12-06T10:51:22.408352941Z	ERROR	wait/wait.go:109	getNodeState(): Failed to fetch node state, close all connections and retry...	{"name": "node2h-1", "error": "sriovnetworknodestates.sriovnetwork.openshift.io \"node2h-1\" not found"}
2024-12-06T10:51:32.42198733Z	ERROR	wait/wait.go:109	getNodeState(): Failed to fetch node state, close all connections and retry...	{"name": "node2h-1", "error": "sriovnetworknodestates.sriovnetwork.openshift.io \"node2h-1\" not found"}
2024-12-06T10:51:42.419426714Z	ERROR	wait/wait.go:109	getNodeState(): Failed to fetch node state, close all connections and retry...	{"name": "node2h-1", "error": "sriovnetworknodestates.sriovnetwork.openshift.io \"node2h-1\" not found"}
・・・
$ kubectl get sriovnetworknodestates.sriovnetwork.openshift.io -A
No resources found

It seems the label was applied correctly, but I would appreciate further assistance with any additional troubleshooting steps. Let me know if there are other configurations I should check or if anything else needs to be adjusted.


zeeke commented Dec 6, 2024

SriovNetworkNodeState resources are created by the sriov-network-operator for each node that carries the labels:

"node-role.kubernetes.io/worker": "",
"kubernetes.io/os":               "linux",

https://github.com/k8snetworkplumbingwg/sriov-network-operator/blob/5f492e5dbec9b5fcfb08dea5ca8f2687a6a820da/controllers/sriovnetworknodepolicy_controller.go#L111

can you please check the node labels and the sriov-network-operator logs?
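
A quick sketch to confirm whether the node carries those two labels (the kubernetes.io/os label is normally set automatically by kubelet):

$ kubectl get node node2h-1 --show-labels | tr ',' '\n' | grep -E 'node-role.kubernetes.io/worker|kubernetes.io/os'
# Nodes currently matching the operator's selector
$ kubectl get nodes -l node-role.kubernetes.io/worker,kubernetes.io/os=linux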


koh-hr commented Dec 9, 2024

@zeeke

It seems the SriovNetworkNodeState resource was created successfully.
However, the sriov-device-plugin Pods are still restarting.

ubuntu@jump:~$ kubectl get sriovnetworknodestates.sriovnetwork.openshift.io -A
NAMESPACE                 NAME       SYNC STATUS   AGE
nvidia-network-operator   node2h-1   Succeeded     2d20h
ubuntu@jump:~$
ubuntu@jump:~$ kubectl describe sriovnetworknodestates.sriovnetwork.openshift.io node2h-1 -n nvidia-network-operator
Name:         node2h-1
Namespace:    nvidia-network-operator
Labels:       <none>
Annotations:  <none>
API Version:  sriovnetwork.openshift.io/v1
Kind:         SriovNetworkNodeState
Metadata:
  Creation Timestamp:  2024-12-06T10:54:42Z
  Generation:          2
  Owner References:
    API Version:           sriovnetwork.openshift.io/v1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  SriovNetworkNodePolicy
    Name:                  default
    UID:                   a841810d-f4a2-4669-93f3-cf6fe676ab7e
  Resource Version:        2740379581
  UID:                     7e1f793d-2274-43b1-9773-d6e96d344330
Spec:
  Dp Config Version:  e44d41faac944961ab1dc002408be692
・・・

The logs from the sriov-network-config-daemon are as follows.
The interfaces ens6f0 and ens6f1 that show up in the errors are Intel NICs and are not intended for SR-IOV use.

$ kubectl logs -n nvidia-network-operator sriov-network-config-daemon-cf2jn
・・・
2024-12-09T07:11:06.984070151Z	INFO	daemon/daemon.go:328	nodeStateSyncHandler(): new generation	{"generation": 2}
2024-12-09T07:11:06.984133631Z	INFO	daemon/daemon.go:328	nodeStateSyncHandler(): Interface not changed
2024-12-09T07:11:06.984140933Z	INFO	daemon/daemon.go:344	Successfully synced
2024-12-09T07:11:21.983273925Z	INFO	daemon/daemon.go:328	nodeStateSyncHandler(): new generation	{"generation": 2}
2024-12-09T07:11:21.983335501Z	INFO	daemon/daemon.go:328	nodeStateSyncHandler(): Interface not changed
2024-12-09T07:11:21.983342108Z	INFO	daemon/daemon.go:344	Successfully synced
2024-12-09T07:11:27.562522367Z	ERROR	sriov/sriov.go:260	GetNetDevLinkSpeed(): fail to read Link Speed file	{"path": "/sys/class/net/ens6f0/speed", "error": "read /sys/class/net/ens6f0/speed: invalid argument"}
2024-12-09T07:11:27.562871543Z	ERROR	sriov/sriov.go:260	GetNetDevLinkSpeed(): fail to read Link Speed file	{"path": "/sys/class/net/ens6f1/speed", "error": "read /sys/class/net/ens6f1/speed: invalid argument"}
2024-12-09T07:11:27.573140181Z	INFO	daemon/writer.go:147	setNodeStateStatus(): status	{"sync-status": "Succeeded", "last-sync-error": ""}
2024-12-09T07:11:36.986560106Z	INFO	daemon/daemon.go:328	nodeStateSyncHandler(): new generation	{"generation": 2}
2024-12-09T07:11:36.986600914Z	INFO	daemon/daemon.go:328	nodeStateSyncHandler(): Interface not changed
2024-12-09T07:11:36.986607669Z	INFO	daemon/daemon.go:344	Successfully synced
2024-12-09T07:11:51.984636299Z	INFO	daemon/daemon.go:328	nodeStateSyncHandler(): new generation	{"generation": 2}
2024-12-09T07:11:51.984678459Z	INFO	daemon/daemon.go:328	nodeStateSyncHandler(): Interface not changed
2024-12-09T07:11:51.98468552Z	INFO	daemon/daemon.go:344	Successfully synced
2024-12-09T07:11:57.707216322Z	ERROR	sriov/sriov.go:260	GetNetDevLinkSpeed(): fail to read Link Speed file	{"path": "/sys/class/net/ens6f0/speed", "error": "read /sys/class/net/ens6f0/speed: invalid argument"}
2024-12-09T07:11:57.70745698Z	ERROR	sriov/sriov.go:260	GetNetDevLinkSpeed(): fail to read Link Speed file	{"path": "/sys/class/net/ens6f1/speed", "error": "read /sys/class/net/ens6f1/speed: invalid argument"}
2024-12-09T07:11:57.717879359Z	INFO	daemon/writer.go:147	setNodeStateStatus(): status	{"sync-status": "Succeeded", "last-sync-error": ""}
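
Since the node state above is owned by a SriovNetworkNodePolicy named "default" (see the Owner References), it may also help to dump that policy and check what the operator is trying to configure. A sketch, assuming the policy lives in the nvidia-network-operator namespace like the node state:

$ kubectl get sriovnetworknodepolicies.sriovnetwork.openshift.io -A
$ kubectl get sriovnetworknodepolicies.sriovnetwork.openshift.io default -n nvidia-network-operator -o yaml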


zeeke commented Dec 9, 2024

Please attach the full config-daemon logs, or search them for messages about restarting the device plugin pod.


koh-hr commented Dec 10, 2024

@zeeke
I checked, but the target string does not appear.
The logs I already shared just keep repeating.

$ kubectl logs -n nvidia-network-operator sriov-network-config-daemon-cf2jn | grep restart
$
$ kubectl logs -n nvidia-network-operator sriov-network-config-daemon-cf2jn | grep device
$
$ kubectl logs -n nvidia-network-operator sriov-network-config-daemon-cf2jn | grep plugin
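
For completeness, a combined, case-insensitive search over both the current and (if the daemon itself restarted) previous container logs — a sketch using the same pod name:

$ kubectl logs -n nvidia-network-operator sriov-network-config-daemon-cf2jn | grep -iE 'restart|device|plugin'
$ kubectl logs -n nvidia-network-operator sriov-network-config-daemon-cf2jn --previous 2>/dev/null | grep -iE 'restart|device|plugin'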
