Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New csi-lib-utils/connection.Connect logic can cause permanent CSI plugin outage #236

Closed
ejweber opened this issue Dec 19, 2023 · 6 comments · Fixed by #240
Closed

New csi-lib-utils/connection.Connect logic can cause permanent CSI plugin outage #236

ejweber opened this issue Dec 19, 2023 · 6 comments · Fixed by #240

Comments

@ejweber
Copy link

ejweber commented Dec 19, 2023

The livenessprobe code expects to try forever to connect with the CSI plugin via csi.sock on startup.

csiConn, err := acquireConnection(context.Background(), metricsManager)
if err != nil {
// connlib should retry forever so a returned error should mean
// the grpc client is misconfigured rather than an error on the network
klog.Fatalf("failed to establish connection to CSI driver: %v", err)
}

However, this commit recently picked up a change in csi-lib-utils that returns an error after only 30 seconds.

According to the associated PR, the goal was to avoid a deadlock in which node-driver-registrar failed permanently to connect to a CSI plugin because it was referencing an old file descriptor.

In this analysis, I described a situation in which this new behavior caused a permanent outage of the Longhorn CSI plugin. Details are there, but essentially:

  • The CSI plugin fails to start for an ephemeral reason and enters a CrashLoopBackOff.
  • livenessprobe fails to connect and enters a CrashLoopBackOff.
  • Eventually, the CSI plugin can start successfully. Since livenessprobe is not running at that time, kubelet kills it, increasing the backoff.
  • Every time livenessprobe starts, the CSI plugin is waiting in backoff, so livenessprobe crashes, increasing the backoff.

IMO, livenessprobe's previous behavior was correct. It should not crash unless it is misconfigured so it is always available to answer kubelet's liveness probes.

Assuming the csi-lib-utils change was necessary, my thinking is that we should recognize the timeout error in livenessprobe and ignore it during initialization. However, I'm not I understand the exact cause of kubernetes-csi/csi-lib-utils#131. Maybe this could similarly lead to a liveness probe stuck permanently in initialization?

cc @ConnorJC3 from the csi-lib-utils PR for any thoughts.

@ConnorJC3
Copy link

Although that change sets a reasonable default for every other sidecar, the livenessprobe is special and as you say probably needs to either actually attempt to connect infinitely or ignore this error.

Maybe this could similarly lead to a liveness probe stuck permanently in initialization?

It could, but I don't think that's as big of a deal. In the node-driver-registrar case, this bug leads to the driver never getting registered on the node - effectively permanently bricking the driver until manual intervention. In the livenessprobe case, the pod will eventually fail the check and be restarted, so the worst case scenario is a temporary delay of driver startup in a very rare race condition.

@jsafrane
Copy link
Contributor

To me it seems the liveness probe container should never crash when it can't talk to the container, it should just report failed probes. Would it be possible to achieve with the current connection lib? It would make sense even to use grpc directly, if our lib is not flexible enough.

cc @msau42

@mauriciopoppe
Copy link
Member

What about the caller sending the option grpc.WithTimeout(time.Second * 30)? livenessprobe wouldn't sent this option, all the other sidecars would send it. Looks like the timeout is also set in ConnectWithoutMetrics

@jsafrane
Copy link
Contributor

jsafrane commented Jan 4, 2024

There is connlib.Connect(..., WithTimeout(0)) that disables the default 30 second timeout. I am testing it in #236.

@ejweber
Copy link
Author

ejweber commented Jan 4, 2024

I missed the possibility of setting it to 0 in #237, so I set it to 5 minutes. Your approach makes more logical sense.

@jsafrane
Copy link
Contributor

jsafrane commented Jan 4, 2024

@ejweber hey, sorry, I missed your PR... it was almost right!

ejweber added a commit to ejweber/longhorn-manager that referenced this issue Feb 7, 2024
Longhorn 7428

This reverts commit fe405e6.
We no longer need these changes because livenessprobe was fixed upstream.
kubernetes-csi/livenessprobe#236

Signed-off-by: Eric Weber <[email protected]>
ejweber added a commit to ejweber/longhorn-manager that referenced this issue Feb 9, 2024
Longhorn 7428

This reverts commit fe405e6.
We no longer need these changes because livenessprobe was fixed upstream.
kubernetes-csi/livenessprobe#236

Signed-off-by: Eric Weber <[email protected]>
PhanLe1010 pushed a commit to longhorn/longhorn-manager that referenced this issue Feb 9, 2024
Longhorn 7428

This reverts commit fe405e6.
We no longer need these changes because livenessprobe was fixed upstream.
kubernetes-csi/livenessprobe#236

Signed-off-by: Eric Weber <[email protected]>
mergify bot pushed a commit to longhorn/longhorn-manager that referenced this issue Feb 15, 2024
Longhorn 7428

This reverts commit fe405e6.
We no longer need these changes because livenessprobe was fixed upstream.
kubernetes-csi/livenessprobe#236

Signed-off-by: Eric Weber <[email protected]>
(cherry picked from commit 68ed92b)
mergify bot pushed a commit to longhorn/longhorn-manager that referenced this issue Feb 15, 2024
Longhorn 7428

This reverts commit fe405e6.
We no longer need these changes because livenessprobe was fixed upstream.
kubernetes-csi/livenessprobe#236

Signed-off-by: Eric Weber <[email protected]>
(cherry picked from commit 68ed92b)
PhanLe1010 pushed a commit to longhorn/longhorn-manager that referenced this issue Feb 15, 2024
Longhorn 7428

This reverts commit fe405e6.
We no longer need these changes because livenessprobe was fixed upstream.
kubernetes-csi/livenessprobe#236

Signed-off-by: Eric Weber <[email protected]>
(cherry picked from commit 68ed92b)
PhanLe1010 pushed a commit to longhorn/longhorn-manager that referenced this issue Feb 15, 2024
Longhorn 7428

This reverts commit fe405e6.
We no longer need these changes because livenessprobe was fixed upstream.
kubernetes-csi/livenessprobe#236

Signed-off-by: Eric Weber <[email protected]>
(cherry picked from commit 68ed92b)
zxh326 added a commit to juicedata/charts that referenced this issue May 8, 2024
upgrade livenessprobe container to v2.12.0 

ref: kubernetes-csi/livenessprobe#236
zxh326 added a commit to juicedata/charts that referenced this issue May 8, 2024
upgrade livenessprobe container to v2.12.0 

ref: kubernetes-csi/livenessprobe#236
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants