There can be many reasons why a PersistentVolumeClaim (PVC) gets stuck in a Pending state. Some possible reasons and some troubleshooting suggestions are described below.
$ oc get pvc
NAME        STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
lvms-test   Pending                                      lvms-vg1       11s
To troubleshoot the issue, inspect the events associated with the PVC. These events can provide valuable insights into any errors or issues encountered during the provisioning process.
$ oc describe pvc lvms-test
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning ProvisioningFailed 4s (x2 over 17s) persistentvolume-controller storageclass.storage.k8s.io "lvms-vg1" not found
If you encounter a storageclass.storage.k8s.io "lvms-vg1" not found error, verify the presence of the LVMCluster resource:
$ oc get lvmcluster -n openshift-storage
NAME AGE
my-lvmcluster 65m
If an LVMCluster resource is not found, you can create one. See samples/lvm_v1alpha1_lvmcluster.yaml for an example CR that you can modify.
$ oc create -n openshift-storage -f https://github.com/openshift/lvm-operator/raw/main/config/samples/lvm_v1alpha1_lvmcluster.yaml
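As an alternative to the sample file, you can write a minimal CR yourself. The following sketch roughly mirrors the sample; the device class name, thin pool name, and size values are illustrative, and without a deviceSelector LVMS is expected to claim all available unused disks on the selected nodes:
$ cat <<'EOF' | oc apply -n openshift-storage -f -
apiVersion: lvm.topolvm.io/v1alpha1
kind: LVMCluster
metadata:
  name: my-lvmcluster
spec:
  storage:
    deviceClasses:
      - name: vg1
        default: true
        thinPoolConfig:
          name: thin-pool-1
          sizePercent: 90
          overprovisionRatio: 10
EOF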
If an LVMCluster resource already exists, check whether all the LVMS pods in the openshift-storage namespace are in the Running state:
$ oc get pods -n openshift-storage
NAME                             READY   STATUS    RESTARTS   AGE
lvms-operator-7b9fb858cb-6nsml   1/1     Running   0          70m
vg-manager-r6zdv                 1/1     Running   0          66m
There should be one running instance of lvms-operator, and one running instance of vg-manager per node.
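If either of them is not in the Running state, you can inspect the pod directly, for example (the pod name is a placeholder):
$ oc describe pod <POD_NAME> -n openshift-storage
$ oc logs <POD_NAME> -n openshift-storage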
A Pending PVC can also be caused by LVMS failing to locate an available disk for use. To investigate further and obtain relevant information, review the status of the LVMCluster CR:
$ oc describe lvmcluster -n openshift-storage
If you encounter a failure message such as no available devices found while inspecting the status, establish a direct connection to the host where the problem is occurring and run:
$ lsblk --paths --json -o NAME,ROTA,TYPE,SIZE,MODEL,VENDOR,RO,STATE,KNAME,SERIAL,PARTLABEL,FSTYPE
This prints information about the disks on the host. Review it to see why a device is not considered available for LVMS. For example, LVMS considers a device unavailable if it has the partition label bios or reserved, is suspended or read-only, has child devices, or already has an fstype set. Check filter.go for the complete list of filters LVMS applies.
If you set a device path in the LVMCluster CR under spec.storage.deviceClasses.deviceSelector.paths, make sure the paths match the kname of the device in the lsblk output.
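For example, assuming jq is available on the host, the following sketch lists each device's kname together with two of the properties LVMS checks, namely an existing filesystem signature and the presence of child devices:
$ lsblk --paths --json -o NAME,KNAME,FSTYPE,TYPE,STATE | jq '.blockdevices[] | {kname, fstype, has_children: has("children")}'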
You can also review the logs of the vg-manager pod to check for further issues:
$ oc logs -l app.kubernetes.io/name=vg-manager -n openshift-storage
If you encounter a failure message such as failed to check volume existence while inspecting the events associated with the PVC, it might indicate an issue with the underlying volume or disk. This message suggests that there is a problem with the availability or accessibility of the specified volume. Further investigation is recommended to identify the exact cause and resolve the underlying issue.
$ oc describe pvc lvms-test
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning ProvisioningFailed 4s (x2 over 17s) persistentvolume-controller failed to provision volume with StorageClass "lvms-vg1": rpc error: code = Internal desc = failed to check volume existence
To investigate the issue further, establish a direct connection to the host where the problem is occurring. From there, create a new file on the disk; this helps surface the underlying problem with the disk. If the issue still recurs after resolving the underlying disk problem, it might be necessary to perform a forced cleanup procedure for LVMS. After completing the cleanup process, re-create the LVMCluster CR. When you re-create the LVMCluster CR, all associated objects and resources are recreated, providing a clean starting point for the LVMS deployment. This helps to ensure a reliable and consistent environment.
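A minimal write test on the host could look like the following sketch; the mount point is a placeholder and depends on where the affected volume is mounted (CSI volumes are typically mounted below /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~csi/<pv-name>/mount):
# locate the mount point of the affected volume
$ mount | grep kubernetes.io~csi
# try to write and remove a file on that mount point
$ touch <MOUNT_POINT>/lvms-write-test && rm <MOUNT_POINT>/lvms-write-test
# check the kernel log for I/O errors
$ dmesg | tail -n 50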
If PVCs associated with a specific node remain in a Pending state, it suggests an issue with that particular node. To identify the problematic node, examine the restart count of the vg-manager pods. An increased restart count indicates potential problems with the underlying node, which might require further investigation and troubleshooting.
$ oc get pods -n openshift-storage
NAME                             READY   STATUS    RESTARTS      AGE
lvms-operator-7b9fb858cb-6nsml   3/3     Running   0             70m
vg-manager-r6zdv                 1/1     Running   17 (8s ago)   66m
vg-manager-990ut                 1/1     Running   0             66m
vg-manager-an118                 1/1     Running   0             66m
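To dig deeper into the restarting instance, you can, for example, check the logs of its previous (crashed) container and the state of the node it runs on; the pod name below is taken from the example output above and the node name is a placeholder:
$ oc logs vg-manager-r6zdv -n openshift-storage --previous
$ oc get pod vg-manager-r6zdv -n openshift-storage -o wide
$ oc describe node <NODE_NAME>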
After resolving the issue with the respective node, if the problem persists and reoccurs, it might be necessary to perform the forced cleanup procedure described below and then re-create the LVMCluster CR so that all associated objects and resources are recreated, providing a clean starting point for the LVMS deployment.
If the issue keeps recurring after you have resolved any disk- or node-related problems, it might be necessary to perform a forced cleanup procedure for LVMS. This procedure aims to comprehensively address any persistent issues and ensure the proper functioning of the LVMS solution.
- Remove all the PVCs created using LVMS, and the pods using those PVCs.
- Switch to the openshift-storage namespace:
  $ oc project openshift-storage
- Make sure there are no LogicalVolume resources left:
  $ oc get logicalvolume
  No resources found
  If there are LogicalVolume resources present, remove the finalizers from the resources and delete them:
  $ oc patch logicalvolume <name> -p '{"metadata":{"finalizers":[]}}' --type=merge
  $ oc delete logicalvolume <name>
- Make sure there are no LVMVolumeGroup resources left:
  $ oc get lvmvolumegroup
  No resources found
  If there are any LVMVolumeGroup resources left, remove the finalizers from these resources and delete them:
  $ oc patch lvmvolumegroup <name> -p '{"metadata":{"finalizers":[]}}' --type=merge
  $ oc delete lvmvolumegroup <name>
- Remove any LVMVolumeGroupNodeStatus resources:
  $ oc patch lvmvolumegroupnodestatus <name> -p '{"metadata":{"finalizers":[]}}' --type=merge
  $ oc delete lvmvolumegroupnodestatus --all
- Remove the LVMCluster resource:
  $ oc delete lvmcluster --all
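After the cleanup is complete, you can re-create the LVMCluster CR (for example from the sample used earlier) and verify that it becomes ready again; the jsonpath below assumes the CR reports its state in status.state:
$ oc create -n openshift-storage -f https://github.com/openshift/lvm-operator/raw/main/config/samples/lvm_v1alpha1_lvmcluster.yaml
$ oc get lvmcluster -n openshift-storage -o jsonpath='{.items[*].status.state}'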
In some situations, it might be undesirable to perform a forced cleanup, because it results in the loss of all data in the LVMCluster. In such cases you can perform a graceful cleanup with partial data recovery instead. This procedure aims to address the issue while preserving the data associated with the PVCs that are not affected. This is most likely to be useful in a multi-node environment, but it can sometimes also help in single-node situations where one or more, but not all, of the disks included in the LVMCluster are affected.
In case of a failing or infinitely progressing LVMCluster, you can try to perform a graceful cleanup when:
- The volume group in lvm2 is still recognizable on the host system and the disks are still accessible.
- The vg-manager pod is stuck in a CrashLoopBackOff state and the LVMCluster is in a continuous Failed or Progressing state.
Pre-requisites to perform a graceful cleanup with partial recovery of data after node failure (multiple nodes)
In case of a failing or infinitely progressing LVMCluster due to one or more failing nodes, you can try to perform a graceful cleanup when:
- The volume group in lvm2 is still recognizable on the host system and the disks are still accessible on at least one of the nodes selected in the NodeSelector of the LVMCluster.
- The vg-manager pod is stuck in a CrashLoopBackOff state and the LVMCluster is in a continuous Failed or Progressing state on all but that one accessible node.
In other words, make sure at least one node is available with a running vg-manager pod and a healthy volume group node status.
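To identify the healthy node(s), you can, for example, list the vg-manager pods per node and the per-node volume group statuses:
$ oc get pods -n openshift-storage -l app.kubernetes.io/name=vg-manager -o wide
$ oc get lvmvolumegroupnodestatus -n openshift-storage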
LVMCluster and its node-daemon vg-manager periodically reconcile changes on the node and attempt to recreate the Volume Group and Thin Pool (if necessary) on the node. If the Volume Group is still recognizable on the host system and the disks are still accessible, you can try to recover from the failure without resetting the LVMCluster by following the normal recovery process in lvm.
- Log in to the node where the vg-manager pod is stuck in a CrashLoopBackOff state.
- Check the status of the Volume Group:
$ vgs <YOUR_VG_NAME_HERE> -o all --reportformat json --units g | jq .report[0].vg[0]
If the Volume Group is still recognizable, you can try to recover from the failure based on its returned attributes:
Here is a healthy example of a volume group:
{ "vg_fmt": "lvm2", "vg_uuid": "mb1eRp-0x1O-jdJc-bFPN-tB2o-poS5-nOnmPi", "vg_name": "vg1", "vg_attr": "wz--n-", "vg_permissions": "writeable", "vg_extendable": "extendable", "vg_exported": "", "vg_autoactivation": "enabled", "vg_partial": "", "vg_allocation_policy": "normal", "vg_clustered": "", "vg_shared": "", "vg_size": "0.97g", "vg_free": "0.97g", "vg_sysid": "", "vg_systemid": "", "vg_lock_type": "", "vg_lock_args": "", "vg_extent_size": "0.00g", "vg_extent_count": "249", "vg_free_count": "249", "max_lv": "0", "max_pv": "0", "pv_count": "1", "vg_missing_pv_count": "0", "lv_count": "0", "snap_count": "0", "vg_seqno": "1", "vg_tags": "", "vg_profile": "", "vg_mda_count": "1", "vg_mda_used_count": "1", "vg_mda_free": "0.00g", "vg_mda_size": "0.00g", "vg_mda_copies": "unmanaged" }
There are quite a few fields available that give hints on possible issues. The most important ones are listed below; a sketch for querying just these fields follows the list.
- vg_attr: should be wz--n- for a normal volume group. This indicates a normal writeable, zeroing volume group. Any deviation from this might indicate a problem with the volume group.
- vg_free: should be (significantly) greater than 1. This indicates the amount of free space (in Gi) in the volume group. If this is 0, the volume group is full (or nearly full with less than 1 Gi available) and no more logical volumes can be created.
- vg_missing_pv_count: should be 0. This indicates the number of missing physical volumes in the volume group. If this is greater than 0, the volume group is incomplete and might not be able to function properly.
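For example, you can query just these three fields directly (the field names below are the ones accepted by vgs):
$ vgs <YOUR_VG_NAME_HERE> -o vg_attr,vg_free,vg_missing_pv_count --reportformat json --units g | jq .report[0].vg[0]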
Here is an example of a Volume Group missing a PV because its cable was defective:
// vgs vg1 -o all --reportformat json --units g | jq .report[0].vg[0]
// Devices file PVID K1LP093KYNSP2Cpn5fdDeRXyWMzgB0Ja last seen on /dev/vdc not found.
// WARNING: Couldn't find device with uuid K1LP09-3KYN-SP2C-pn5f-dDeR-XyWM-zgB0Ja.
// WARNING: VG vg1 is missing PV K1LP09-3KYN-SP2C-pn5f-dDeR-XyWM-zgB0Ja (last written to /dev/vdc).
{
  "vg_fmt": "lvm2",
  "vg_uuid": "SCh4hC-XNnd-9M87-zQ6f-7y8X-Gxjl-fCZXJ6",
  "vg_name": "vg1",
  "vg_attr": "wz-pn-",
  "vg_permissions": "writeable",
  "vg_extendable": "extendable",
  "vg_exported": "",
  "vg_autoactivation": "enabled",
  "vg_partial": "partial",
  "vg_allocation_policy": "normal",
  "vg_clustered": "",
  "vg_shared": "",
  "vg_size": "39.99g",
  "vg_free": "4.00g",
  "vg_sysid": "",
  "vg_systemid": "",
  "vg_lock_type": "",
  "vg_lock_args": "",
  "vg_extent_size": "0.00g",
  "vg_extent_count": "10238",
  "vg_free_count": "1024",
  "max_lv": "0",
  "max_pv": "0",
  "pv_count": "2",
  "vg_missing_pv_count": "1",
  "lv_count": "1",
  "snap_count": "0",
  "vg_seqno": "4",
  "vg_tags": "lvms",
  "vg_profile": "",
  "vg_mda_count": "1",
  "vg_mda_used_count": "1",
  "vg_mda_free": "0.00g",
  "vg_mda_size": "0.00g",
  "vg_mda_copies": "unmanaged"
}
- Allow erasure of missing volumes in the volume group (if necessary because vg_missing_pv_count > 0):
  If the volume group consisted of multiple physical volumes, it might be that one or more of them are missing while there are still healthy volumes left. In this case, you can remove the missing volumes from the volume group so that it can function properly again. This removes any remaining chance of automatic recovery of data on the failing disk, but lets you activate the volume group again with all the remaining disks and the data on them.
Note that it is advisable to deschedule all workloads using the storage class backed by the volume group, and then call:
vgchange --activate n <YOUR_VG_NAME_HERE>
This will ensure that no active writes can occur on the volume group while you are erasing the missing volumes. Sometimes, lvm2 will prohibit you from erasing the missing volumes before deactivating the volumes in the volume group.
Once you have confirmed a disk failure, you can erase the missing volumes from the volume group by calling:
vgreduce --removemissing <YOUR_VG_NAME_HERE>
If all goes well, another verification of the volume group should confirm that vg_missing_pv_count is now 0. This means the existing data on the remaining disks is still accessible and the volume group can be activated again. Note that in some cases, especially in scenarios with data loss, the request might fail as shown below:
vgreduce --removemissing vg1
  WARNING: Couldn't find device with uuid K1LP09-3KYN-SP2C-pn5f-dDeR-XyWM-zgB0Ja.
  WARNING: VG vg1 is missing PV K1LP09-3KYN-SP2C-pn5f-dDeR-XyWM-zgB0Ja (last written to /dev/vdc).
  WARNING: Couldn't find device with uuid K1LP09-3KYN-SP2C-pn5f-dDeR-XyWM-zgB0Ja.
  WARNING: Partial LV thin-pool-1 needs to be repaired or removed.
  WARNING: Partial LV thin-pool-1_tdata needs to be repaired or removed.
  There are still partial LVs in VG vg1.
  To remove them unconditionally use: vgreduce --removemissing --force.
  To remove them unconditionally from mirror LVs use: vgreduce --removemissing --mirrorsonly --force.
  WARNING: Proceeding to remove empty missing PVs.
  WARNING: Couldn't find device with uuid K1LP09-3KYN-SP2C-pn5f-dDeR-XyWM-zgB0Ja.
This is especially true for thin pool setups that span the now-broken disks, which can cause additional issues. In this case, you can attempt to recover the pool by activating the LV in partial mode:
lvchange --activate y vg1/thin-pool-1 --activationmode=partial
  PARTIAL MODE. Incomplete logical volumes will be processed.
  WARNING: Couldn't find device with uuid K1LP09-3KYN-SP2C-pn5f-dDeR-XyWM-zgB0Ja.
  WARNING: VG vg1 is missing PV K1LP09-3KYN-SP2C-pn5f-dDeR-XyWM-zgB0Ja (last written to [unknown]).
  Cannot activate vg1/thin-pool-1_tdata: pool incomplete.
If this fails as shown above, you must fall back to standard data-recovery tools to attempt to scrape data blocks from the disks that are still available. If it succeeds, you can still access the data left on the pool and should recover it to a separate storage medium before proceeding. After backing up your data, deactivate the LV again and proceed with the next steps.
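If the partial activation succeeds, copying the remaining data to a separate medium could look like the following sketch. The thin LV name, mount point, and target path are placeholders; the thin LVs created inside the pool carry the filesystems, not the pool itself:
# activate a thin LV of the pool in partial mode and mount it read-only
lvchange --activate y --activationmode partial vg1/<THIN_LV_NAME>
mount -o ro /dev/vg1/<THIN_LV_NAME> /mnt/recovery
# copy the data off to separate storage, then unmount and deactivate again
rsync -a /mnt/recovery/ /path/to/separate/storage/
umount /mnt/recovery
lvchange --activate n vg1/<THIN_LV_NAME>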
You can now proceed with one of two methods to recover the data from the volume group:
- Reducing the volume group to only the healthy disks and recovering the data from the remaining disks
- Replacing the failing disk with a new one and recovering the data from the remaining disks
Reducing the volume group to only the healthy disks and recovering the data from the remaining disks
Note that with this method, you will lose the data on the failing disk, but you can recover the data from the remaining disks. You will also not need to add additional disks to the volume group, as the volume group will be reduced to only the healthy disks. This has the benefit of getting the system running immediately, but will result in less overall capacity, less failure resiliency, and a necessary patch to the LVMCluster resource.
To reduce the volume group to only the healthy disks, run:
vgreduce --removemissing --force vg1
This will forcefully remove all missing disks from the volume group and make the volume group accessible again.
At this point, it is safe to reschedule your workloads and continue using the volume group. However, the LVMCluster might still be failing, as the Volume Group is expected to contain the now missing device. This can happen when the LVMCluster was created with a deviceSelector in the failing deviceClass and the DeviceDiscoveryPolicy is set to Preconfigured.
To fix this, change the LVMCluster with the following patch:
oc patch lvmcluster my-lvmcluster --type='json' -p='[{"op": "remove", "path": "/spec/storage/deviceClasses/0/deviceSelector/paths/0"}]'
This might be rejected if the LVMCluster's deviceSelector is static. In this case, it is necessary to temporarily disable the ValidatingWebhook that normally ensures that the deviceSelector is valid.
oc patch validatingwebhookconfigurations.admissionregistration.k8s.io lvm-operator-webhook --type='json' -p='[{"op": "replace", "path": "/webhooks/0/failurePolicy", "value": "Ignore"}]'
After you are done with the patch, you can re-enable the webhook by calling
oc patch validatingwebhookconfigurations.admissionregistration.k8s.io lvm-operator-webhook --type='json' -p='[{"op": "replace", "path": "/webhooks/0/failurePolicy", "value": "Fail"}]'
After this, wait for vg-manager and the LVMCluster to reconcile the changes; the LVMCluster should then be healthy again (this time, of course, without the failing disk).
Replacing the failing disk with a new one and recovering the data from the remaining disks
In this scenario, you replace the failing disk with a new one. This allows you to recover the data from the remaining disks and restore the volume group to its original capacity and resiliency before the failure, even though it still does not recover any data from the failed disk. This method is more time-consuming and requires additional hardware, but it results in the same state as before the failure.
To replace the failing disk with a new one, follow these steps:
- Check the necessary PVIDs to replace:
  pvs
    Devices file PVID K1LP093KYNSP2Cpn5fdDeRXyWMzgB0Ja last seen on /dev/vdc not found.
    WARNING: Couldn't find device with uuid K1LP09-3KYN-SP2C-pn5f-dDeR-XyWM-zgB0Ja.
    WARNING: VG vg1 is missing PV K1LP09-3KYN-SP2C-pn5f-dDeR-XyWM-zgB0Ja (last written to [unknown]).
    PV         VG  Fmt  Attr PSize   PFree
    /dev/vdb   vg1 lvm2 a--  <20.00g 4.00g
    [unknown]  vg1 lvm2 a-m  <20.00g     0
  In this case, the PVID of the failing disk is K1LP09-3KYN-SP2C-pn5f-dDeR-XyWM-zgB0Ja.
- Register the new disk (ideally available under the same device path as before):
  pvcreate --restorefile /etc/lvm/backup/vg1 --uuid K1LP09-3KYN-SP2C-pn5f-dDeR-XyWM-zgB0Ja /dev/vdc
    WARNING: Couldn't find device with uuid YForIB-bMtp-6d8V-o3WR-Eqim-paAG-cJMhgV.
    WARNING: Couldn't find device with uuid K1LP09-3KYN-SP2C-pn5f-dDeR-XyWM-zgB0Ja.
    WARNING: adding device /dev/vdc with PVID K1LP093KYNSP2Cpn5fdDeRXyWMzgB0Ja which is already used for missing device device_id none.
    Physical volume "/dev/vdc" successfully created.
  pvs
    WARNING: VG vg1 was previously updated while PV /dev/vdc was missing.
    WARNING: VG vg1 was missing PV /dev/vdc K1LP09-3KYN-SP2C-pn5f-dDeR-XyWM-zgB0Ja.
    PV         VG  Fmt  Attr PSize   PFree
    /dev/vdb   vg1 lvm2 a--  <20.00g 4.00g
    /dev/vdc   vg1 lvm2 a-m  <20.00g     0
  vgextend --restoremissing vg1 /dev/vdc
    WARNING: VG vg1 was previously updated while PV /dev/vdc was missing.
    WARNING: VG vg1 was missing PV /dev/vdc K1LP09-3KYN-SP2C-pn5f-dDeR-XyWM-zgB0Ja.
    WARNING: VG vg1 was previously updated while PV /dev/vdc was missing.
    WARNING: updating PV header on /dev/vdc for VG vg1.
    Volume group "vg1" successfully extended
Note that due to the restoration, the thin pool used up 100% of the new disk instead of following the thinPoolConfig from the LVMCluster. If that is undesired, you should manually resize the volume group on that PV afterward, or specify the desired sizes with vgextend.
It should be mentioned that if the path of the restored device differs from the path recorded in the LVMCluster deviceSelector, you might need to patch the LVMCluster resource, deactivating the webhook in the same way as when reducing the volume group.
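For example, if the replacement disk comes up under a new path, a patch along the following lines could update the first deviceSelector entry; the index 0 and the new path are placeholders:
oc patch lvmcluster my-lvmcluster --type='json' -p='[{"op": "replace", "path": "/spec/storage/deviceClasses/0/deviceSelector/paths/0", "value": "<NEW_DEVICE_PATH>"}]'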
LVMCluster and its node-daemon vg-manager periodically reconcile changes on the node and attempt to recreate the Volume Group and Thin Pool (if necessary) on the node. If the Volume Group is still recognizable on the host system and the disks are still accessible on at least one node, you can use the following procedure to reduce the LVMCluster to only the healthy node(s) and recover from the failure without resetting the LVMCluster.
- Modify the nodeSelector of the affected deviceClass to only include the healthy node(s) with the healthy volume group node status:
  oc patch lvmcluster <MY_LVMCLUSTER_NAME_HERE> --type='json' -p='[{"op": "replace", "path": "/spec/storage/deviceClasses/0/nodeSelector", "value": {"nodeSelectorTerms": [{"matchExpressions": [{"key": "kubernetes.io/hostname", "operator": "In", "values": ["<HEALTHY_NODE_NAME_HERE>"]}]}]}}]'
This will remove the failing node(s) from the LVMCluster and only include the healthy node(s) in the LVMCluster.
It might be necessary to temporarily disable the ValidatingWebhook that normally ensures that the nodeSelector is valid.
oc patch validatingwebhookconfigurations.admissionregistration.k8s.io <LVM_CLUSTER_WEBHOOK_HERE> --type='json' -p='[{"op": "replace", "path": "/webhooks/0/failurePolicy", "value": "Ignore"}]'
After you are done with the patch, you can re-enable the webhook by calling
oc patch validatingwebhookconfigurations.admissionregistration.k8s.io <LVM_CLUSTER_WEBHOOK_HERE> --type='json' -p='[{"op": "replace", "path": "/webhooks/0/failurePolicy", "value": "Fail"}]'
- Wait for the LVMCluster to reconcile the changes. The LVMCluster should now only contain the healthy node(s), the failing node(s) should be removed, and the LVMCluster should become Ready again. Note that pods using the deviceClass, or the StorageClass backed by it, will now only be scheduled on the healthy node(s); the failing node(s) will no longer be used or usable for this deviceClass. It is therefore recommended to use a different deviceClass for the failing node(s) if you want to use them again in the future, and to move workloads over after recovering their data. If the node failure was temporary, you can re-enable the failing node(s) in the LVMCluster by changing the nodeSelector back to include them again, using the same mechanism as described in the Recovery from disk failure without resetting LVMCluster section (a sketch follows below).
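For example, to add a recovered node back, you could patch the values list of the nodeSelector that was set earlier; the cluster name, node names, and index values are placeholders:
oc patch lvmcluster <MY_LVMCLUSTER_NAME_HERE> --type='json' -p='[{"op": "replace", "path": "/spec/storage/deviceClasses/0/nodeSelector/nodeSelectorTerms/0/matchExpressions/0/values", "value": ["<HEALTHY_NODE_NAME_HERE>", "<RECOVERED_NODE_NAME_HERE>"]}]'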