[EXPERIMENT] stub test for harmless=true #1555

Draft
wants to merge 14 commits into
base: main

Conversation

pjfanning
Contributor

@pjfanning pjfanning commented Nov 8, 2024

  • Relates to "Clustering issues leading to all nodes being downed" (#578).
  • Adds basic tests that watch for the quarantine event, plus an experimental change that tries to suppress that event when harmless=true.
  • The suppression is not the default behaviour; enable it by setting pekko.remote.artery.propagate-harmless-quarantine-events = off.
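
For reference, a minimal sketch of how the setting named above might look in an application.conf, assuming the key ends up under pekko.remote.artery exactly as written in this description. The setting is experimental and specific to this draft PR, so the name and default may change:

```hocon
pekko.remote.artery {
  # Experimental key from this draft PR (not a released Pekko setting).
  # Per the description above, `off` suppresses the quarantine event
  # for quarantines flagged harmless=true.
  propagate-harmless-quarantine-events = off
}
```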

@fredfp

fredfp commented Nov 12, 2024

We may need to modify the harmless=true test to send a message from one cluster member to the other to cause the shutdown issue.

Without an active SBR, no node will be shut down: it is the SBR that downs its own node when it receives ThisActorSystemQuarantinedEvent. The test has no cluster setup (and therefore no SBR running), so we need to watch for ThisActorSystemQuarantinedEvent instead. That is what I did to check that the bug exists (the test passes if the bug exists):

"eliminate quarantined association when not used (harmless=true)" in withAssociation {
  (remoteSystem, remoteAddress, _, localArtery, localProbe) =>
    // Event to watch for: it is the indicator of the issue.
    remoteSystem.eventStream.subscribe(testActor, classOf[ThisActorSystemQuarantinedEvent])

    val remoteEcho = remoteSystem.actorSelection("/user/echo").resolveOne(remainingOrDefault).futureValue

    val localAddress = RARP(system).provider.getDefaultAddress

    val localEchoRef = remoteSystem
      .actorSelection(RootActorPath(localAddress) / localProbe.ref.path.elements)
      .resolveOne(remainingOrDefault)
      .futureValue
    remoteEcho.tell("ping", localEchoRef)
    localProbe.expectMsg("ping")

    val association = localArtery.association(remoteAddress)
    val remoteUid = futureUniqueRemoteAddress(association).futureValue.uid
    localArtery.quarantine(remoteAddress, Some(remoteUid), "HarmlessTest", harmless = true)
    association.associationState.isQuarantined(remoteUid) shouldBe true

    eventually {
      // Send a message from remote to local. This makes local wrongfully
      // notify remote that it is quarantined.
      remoteEcho.tell("ping", localEchoRef)
      // Remote emits this event when it learns it was quarantined by local.
      // That is incorrect and, with SBR enabled, is what kills the node.
      expectMsgType[ThisActorSystemQuarantinedEvent]
    }
}
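
To make the sequence in the test easier to follow, here is a dependency-free toy model of it. This is only a sketch: Node, QuarantinedEvent, propagateHarmless and run are hypothetical names, not Pekko's Artery classes, and the model assumes that the inbound check drops messages from a quarantined peer and notifies that peer. The flag stands in for the experimental propagate-harmless-quarantine-events setting.

```scala
// Toy model only: these types are hypothetical and are NOT Pekko's Artery classes.
object HarmlessQuarantineSketch {

  // Stand-in for the event a system publishes when it is told it is quarantined.
  final case class QuarantinedEvent(byUid: Long)

  // Stand-in for one actor system, identified by a uid.
  final class Node(val uid: Long, propagateHarmless: Boolean) {
    // Quarantined peer uid -> whether that quarantine was flagged harmless.
    private var quarantined = Map.empty[Long, Boolean]
    // Events this node has published (what the test subscribes to).
    var published = Vector.empty[QuarantinedEvent]

    def quarantine(remoteUid: Long, harmless: Boolean): Unit =
      quarantined = quarantined.updated(remoteUid, harmless)

    def isQuarantined(remoteUid: Long): Boolean =
      quarantined.contains(remoteUid)

    // Called when a peer tells this node that the peer has quarantined it.
    def onQuarantinedNotice(byUid: Long): Unit =
      published = published :+ QuarantinedEvent(byUid)

    // Inbound check for a message arriving from `sender`.
    // Returns true if the message would be delivered.
    def receive(sender: Node): Boolean =
      quarantined.get(sender.uid) match {
        case None => true
        case Some(harmless) =>
          // Assumption: the message is dropped, and the sender is notified
          // unless the quarantine is harmless and propagation is switched off.
          if (!harmless || propagateHarmless) sender.onQuarantinedNotice(uid)
          false
      }
  }

  // Local harmlessly quarantines remote, then remote sends one message.
  // Returns the events that remote published as a result.
  def run(propagateHarmless: Boolean): Vector[QuarantinedEvent] = {
    val local = new Node(1L, propagateHarmless)
    val remote = new Node(2L, propagateHarmless)
    local.quarantine(remote.uid, harmless = true)
    local.receive(remote)
    remote.published
  }

  def main(args: Array[String]): Unit = {
    println(run(propagateHarmless = true))  // the behaviour the test above detects
    println(run(propagateHarmless = false)) // the suppressed variant
  }
}
```

With propagation on, the remote node in this model publishes one event after sending a message, which mirrors what the test waits for. With propagation off, a harmless quarantine produces no event on the remote side.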

@pjfanning
Contributor Author

I added the new test case, but I am aware that it needs to be moved to the cluster or cluster-tests project, with the Split Brain Resolver added. I am busy with other tasks, so don't expect me to get back to this for a while.

@fredfp

fredfp commented Nov 12, 2024

What would moving the test to the cluster or cluster-tests projects add? To me this is a bug in the remote module and is better tested here. You could certainly test the consequences of the bug in cluster, but the root cause is here. Is that what you have in mind: to also cover the consequences?

@pjfanning
Contributor Author

I've added a change to InboundQuarantineCheck based on #578 (comment). This may not be the best solution but it seems to help in this one test case.

@fredfp

fredfp commented Nov 13, 2024

It seems good to me like that, thank you!

@pjfanning
Contributor Author

@raboof @mdedetrich @jrudolph what do you think about the runtime change? We could add a config setting that lets users control whether the new runtime check is enabled.
