Clustering issues leading to all nodes being downed #578

fredfp · 2023-08-17T13:05:17Z

I'm reopening here an issue that I reported at the time under the akka repo.

We had a case where an issue on a single node lead to the whole akka-cluster being taken down.

Here's a summary of what happened:

Healthy cluster made of 20ish nodes, running on k8s
Node A: encounters issues, triggers CoordinatedShutdown
Node A: experiences high CPU usage, maybe GC pause
Node A: sees B as unreachable, broadcasts it (B is certainly reachable, but detected as such because of high CPU usage, GC pause, or similar issues)
Cluster state: A Leaving, B seen unreachable by A, all the other nodes are Up
Leader can currently not perform its duties (remove A), reachability status (B seen unreachable by A)
Node A: times out some coordinated shutdown phases. Hypothesis: timed out because leader could not remove A.
Node A: finishes coordinated shutdown nonetheless.
hypothesis - Node A: quarantined associations to other cluster nodes
Nodes B, C, D, E: SBR took decision DownSelfQuarantinedByRemote and is downing [...] including myself
hypothesis - Node B, C, D, E: quarantined associations to other cluster nodes
in a few steps, all remaining cluster nodes down themselves: SBR took decision DownSelfQuarantinedByRemote
the whole cluster is down

Discussions, potential issues:

Considering the behaviour of CoordinatedShutdown (phases can time out and shutdown continues), shouldn't the leader ignore unreachabilities added by a Leaving node and be allowed to perform its duties?
At step 6 above, the Leader was blocked from removing A, but A still continued its shutdown process. The catastrophic ending could have been stopped here.

DownSelfQuarantinedByRemote: @patriknw 's comment seems spot on.
At step 9, nodes B, C, D, E should probably not take into account the Quarantined from a node that is Leaving.

DownSelfQuarantinedByRemote: another case where Patrik's comment also seems to apply, Quarantined from nodes downing themselves because of DownSelfQuarantinedByRemote should probably not be taken into account.

At steps 10 and 12. Any cluster singletons running on affected nodes wouldn't be gracefully shutdown using the configured termination message. This is probably the right thing to do but I'm adding this note here nonetheless.

The text was updated successfully, but these errors were encountered:

fredfp · 2023-08-17T13:09:40Z

I have extra logs that may be useful:

Remote ActorSystem must be restarted to recover from this situation. Reason: Cluster member removed, previous status [Down]

zhenggexia · 2024-05-16T07:34:53Z

I also encountered the same problem, which caused my cluster to keep restarting. Is there a plan to fix it? When is it expected to be repaired?

pjfanning · 2024-05-16T11:50:34Z

@fredfp Can you give us more info on this - akka/akka#31095 (comment)

On the Apache Pekko side, we can read the Akka issues but not the Akka PRs (due to the Akka license not being compatible with Apache Pekko).

The issue appears to be with split brain scenarios from my reading of akka/akka#31095 - specifically DownSelfQuarantinedByRemote events. Is it possible that we should just ignore DownSelfQuarantinedByRemote events when it comes to deciding to shut down the cluster?

fredfp · 2024-05-16T17:42:55Z

@pjfanning I think the issue can happen when a node shutsdown during a partition.

Still, DownSelfQuarantinedByRemote events cannot be ignored. The root cause is that nodes should not know they were quanrantined by others in some harmless cases.

Indeed, some quarantines are harmless (as indicated by the method argument: https://github.com/apache/pekko/blob/main/remote/src/main/scala/org/apache/pekko/remote/artery/Association.scala#L534). And the issue is that such harmless quarantine should not be be communicated to the other side i.e., the quarantined association. However, they currently always are: https://github.com/apache/pekko/blob/main/remote/src/main/scala/org/apache/pekko/remote/artery/InboundQuarantineCheck.scala#L47

zhenggexia · 2024-05-21T08:43:54Z

@pjfanning
Is there a repair plan for this issue? When is it expected to be repaired?

CruelSummerday · 2024-05-21T09:26:06Z

I also experienced the same issue, leading to continuous restarts of my cluster. Is there a scheduled resolution for this? When can we anticipate a fix?

ZDevouring · 2024-05-21T10:10:22Z

@pjfanning Can you suggest a way to fix this bug as soon as possible, thank you very much.

fredfp · 2024-05-21T10:18:12Z

This bug should hit quite seldom, if it happens often it most likely means something is not right with your cluster and you should fix that first in all cases. Especially, make sure:

there's always available CPU for the cluster managment duties (this means GC pauses need to be short)
not to use pekko's internal thread pool for your own workloads
make rolling update slower so that cluster is less unstable during rolling updates.

mmatloka · 2024-05-21T10:22:50Z

This bug should hit quite seldom, if it happens often it most likely means something is not right with your cluster and you should fix that first in all cases. Especially, make sure:

there's always available CPU for the cluster managment duties

not to use pekko's internal thread pool for your own workloads

make rolling update slower so that cluster is less unstable during rolling updates.

The issue appear also in systems with heavy memory usage and long GC pauses. It is worth to check gc strategy, gc settings, gc metrics etc

He-Pin · 2024-05-21T10:27:49Z

how about use the classical transport for now? seems the issue in lives in artery only

zhenggexia · 2024-05-21T12:46:05Z

how about use the classical transport for now? seems the issue in lives in artery only

Running Akka 2.8.5 earlier on k8s resulted in a single node restart leading to cluster down (high memory and CPU)
The above issues did not occur when running Akka 2.8.5 on the k8s cluster
The above issues did not occur when using Akka to access the Nacos registration cluster
Running Pekko 1.0.2 on k8s resulted in a single node restart causing cluster down

He-Pin · 2024-05-21T13:25:11Z

IIRC, Akka 2.8.x requires an BSL :) I don't have an env to reproduce the problem, maybe you can work out a multi-jvm test for that? and still super busy at work:(

zhenggexia · 2024-05-22T04:07:08Z

目前我的k8s集群有26个pod运行，当其中某一个pod因为资源不足重启的时候，常常会导致整个集群挂调，我们处理数据量比较大，资源占用比较高，目前在其他集群上（比如docker运行注册到nacos上），暂时没有出现这个问题

zhenggexia · 2024-07-31T08:54:02Z

Hello, has there been any progress on this issue? Is there a plan for when it will be fixed?😀

pjfanning · 2024-08-08T13:23:09Z

For Kubernetes users, we would suggest using the Kubernetes Lease described here:
https://pekko.apache.org/docs/pekko/current/split-brain-resolver.html#lease

Pekko Management 1.1.0-M1 has a 2nd implementation of the Lease - the legacy one is CRD based while the new one uses Kubernetes native leases.
apache/pekko-management#218

fredfp · 2024-08-08T14:45:24Z

For Kubernetes users, we would suggest using the Kubernetes Lease described here: https://pekko.apache.org/docs/pekko/current/split-brain-resolver.html#lease

That's what we use already and it didn't help in the current case. Do you expect it resolves (or avoids) this issue? I think the lease helps the surviving partition confirm it can indeed stay up, it hoever doesn't help the nodes downing themselves, which is the observed behaviour described above.

Pekko Management 1.1.0-M1 has a 2nd implementation of the Lease - the legacy one is CRD based while the new one uses Kubernetes native leases. apache/pekko-management#218

Thank you for pointing it out, looking forward to it!

pjfanning · 2024-08-08T14:56:01Z

@fredfp It's good to hear that using the Split Brain Resolver with a Kubernetes Lease stops all the nodes from downing themselves. When you lose some of the nodes, are you finding that you have to manually restart them or can Kubernetes handle automatically restarting them using liveness and/or readiness probes?

fredfp · 2024-08-08T15:18:47Z

@fredfp It's good to hear that using the Split Brain Resolver with a Kubernetes Lease stops all the nodes from downing themselves.

Sorry, let me be clearer: using the SBR with a Kubernetes Lease does not stop all the nodes from downing themselves.

When you lose some of the nodes, are you finding that you have to manually restart them or can Kubernetes handle automatically restarting them using liveness and/or readiness probes?

When a node downs itself, the java process (running inside the container) terminates. The container is then restarted by k8s as usual, the liveness/readiness probes do not play a part in that. Does that answer your question?

pjfanning · 2024-10-16T13:28:44Z

I think the main issue is Association.quarantine where the harmless flag is not passed on here:

pekko/remote/src/main/scala/org/apache/pekko/remote/artery/Association.scala

Line 552 in 726ddbf

    
           .publish(GracefulShutdownQuarantinedEvent(UniqueAddress(remoteAddress, u), reason))

Since GracefulShutdownQuarantinedEvent only appears to be used for harmless=true quarantine events, we might be able to find the event subscribers and have them handle GracefulShutdownQuarantinedEvent in a different way to standard QuarantinedEvent instances. For example,

pekko/remote/src/main/scala/org/apache/pekko/remote/artery/InboundQuarantineCheck.scala

Line 31 in 726ddbf

private[remote] class InboundQuarantineCheck(inboundContext: InboundContext)

I found 3 places where harmless=true quarantine events can be kicked off - but there could be more.

https://github.com/search?q=repo%3Aapache%2Fpekko%20%22harmless%20%3D%20true%22&type=code

pjfanning · 2024-10-17T10:36:11Z

I tried yesterday to write a unit test that does artificially causes a harmless quarantine and that examines the results but so far, I haven't reproduced the issue with the cluster shut down. I think having a reproducible case is the real blocker on this issue.

fredfp · 2024-11-07T11:57:53Z

Here's my understanding:

When initially marking an association as quarantined, the Quarantined control message is not sent to the remote when harmless is true:

pekko/remote/src/main/scala/org/apache/pekko/remote/artery/Association.scala

Lines 569 to 572 in 8cb7d25

    
           if (!harmless) { 
        
             // try to tell the other system that we have quarantined it 
        
             sendControl(Quarantined(localAddress, peer)) 
        
           }

now comes InboundQuarantineCheck into play (used in ArteryTransport.inboundSink and .inboundControlSink), it serves 2 purposes: a) drop messages incoming through a quarantined association and b) telling again the remote node it is quarantined using inboundContext.sendControl(association.remoteAddress, Quarantined(...)) in case it somehow didn't already get the message sent in 1.
when a node learns it is quarantined as a result of 2.b above, it will trigger the SBR to down itself via ThisActorSystemQuarantinedEvent, and this is what brings the whole cluster down.
the problem is triggered by 2.b above, which sends Quarantined control message also for harmless quarantines, when this case is carefully avoided in 1. We see in InboundQuarantineCheck that it doesn't rely on the quarantined status being passed via an event, but instead it is accessed directly via env.association and association.associationState.isQuarantined(). At this stage however, we lost whether the quarantine was harmless or not. This extra flag should be kept in the Association state so that it can be recovered in InboundQuarantineCheck.

About reproducing, I'm not sure because it's not clear to me how a node, when shutting down, can quarantine associations to others with harmless=true. However, if that can be done I'd suggest:

start a cluster with 2 nodes A, B.
shutdown A such that it quarantines the association to B with harmless=true
send messages from B to A, this should trigger InboundQuarantineCheck in A to send Quarantined to B (and B shutting down as a result), leading the whole cluster to be down.

pjfanning · 2024-11-08T12:18:05Z

@fredfp I haven't had much time to look at reproducing the issue - I checked in my initial attempt - see #1555

I found an existing test that did quarantining and added a new test. If you have time, would you be able to look at extending that test to cause the shutdown issue?

fredfp mentioned this issue Aug 17, 2023

Clustering issues leading to all nodes being downed. akka/akka#31095

Closed

pjfanning added help wanted Extra attention is needed needs-reproducible-test labels May 16, 2024

pjfanning added this to the 1.1.0-M2 milestone May 16, 2024

pjfanning added the bug Something isn't working label May 16, 2024

He-Pin pinned this issue May 22, 2024

nvollmar unpinned this issue Jun 25, 2024

pjfanning modified the milestones: 1.1.0, 1.1.1 Aug 23, 2024

pjfanning modified the milestones: 1.1.1, 1.1.2 Sep 9, 2024

pjfanning removed this from the 1.1.2 milestone Sep 27, 2024

pjfanning added this to the 1.1.x milestone Sep 27, 2024

pjfanning mentioned this issue Nov 8, 2024

[EXPERIMENT] stub test for harmless=true #1555

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Clustering issues leading to all nodes being downed #578

Clustering issues leading to all nodes being downed #578

fredfp commented Aug 17, 2023 •

edited by pjfanning

Loading

fredfp commented Aug 17, 2023

zhenggexia commented May 16, 2024

pjfanning commented May 16, 2024

fredfp commented May 16, 2024

zhenggexia commented May 21, 2024

CruelSummerday commented May 21, 2024

ZDevouring commented May 21, 2024

fredfp commented May 21, 2024 •

edited

Loading

mmatloka commented May 21, 2024

He-Pin commented May 21, 2024

zhenggexia commented May 21, 2024 •

edited

Loading

He-Pin commented May 21, 2024 •

edited

Loading

zhenggexia commented May 22, 2024

zhenggexia commented Jul 31, 2024

pjfanning commented Aug 8, 2024 •

edited

Loading

fredfp commented Aug 8, 2024

pjfanning commented Aug 8, 2024

fredfp commented Aug 8, 2024

pjfanning commented Oct 16, 2024 •

edited

Loading

pjfanning commented Oct 17, 2024

fredfp commented Nov 7, 2024

pjfanning commented Nov 8, 2024

Clustering issues leading to all nodes being downed #578

Clustering issues leading to all nodes being downed #578

Comments

fredfp commented Aug 17, 2023 • edited by pjfanning Loading

Here's a summary of what happened:

Discussions, potential issues:

fredfp commented Aug 17, 2023

zhenggexia commented May 16, 2024

pjfanning commented May 16, 2024

fredfp commented May 16, 2024

zhenggexia commented May 21, 2024

CruelSummerday commented May 21, 2024

ZDevouring commented May 21, 2024

fredfp commented May 21, 2024 • edited Loading

mmatloka commented May 21, 2024

He-Pin commented May 21, 2024

zhenggexia commented May 21, 2024 • edited Loading

He-Pin commented May 21, 2024 • edited Loading

zhenggexia commented May 22, 2024

zhenggexia commented Jul 31, 2024

pjfanning commented Aug 8, 2024 • edited Loading

fredfp commented Aug 8, 2024

pjfanning commented Aug 8, 2024

fredfp commented Aug 8, 2024

pjfanning commented Oct 16, 2024 • edited Loading

pjfanning commented Oct 17, 2024

fredfp commented Nov 7, 2024

pjfanning commented Nov 8, 2024

fredfp commented Aug 17, 2023 •

edited by pjfanning

Loading

fredfp commented May 21, 2024 •

edited

Loading

zhenggexia commented May 21, 2024 •

edited

Loading

He-Pin commented May 21, 2024 •

edited

Loading

pjfanning commented Aug 8, 2024 •

edited

Loading

pjfanning commented Oct 16, 2024 •

edited

Loading