
Conversation

@pjfanning (Member) commented Nov 8, 2024

  • relates to Clustering issues leading to all nodes being downed #578
  • basic tests that watch for the quarantine event, plus an experimental change that tries to suppress that quarantine event when harmless=true
  • this suppression is non-default and can be enabled by setting pekko.remote.artery.propagate-harmless-quarantine-events = off
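As a concrete illustration, enabling the suppression would look like this in application.conf (the setting name is taken from this PR; the default is on, meaning harmless quarantine events keep being propagated as before):

```hocon
# Experimental: when set to off, InboundQuarantineCheck ignores
# 'harmless' quarantine events instead of propagating them,
# which can otherwise lead to nodes downing themselves (see #578).
pekko.remote.artery.propagate-harmless-quarantine-events = off
```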

@fredfp commented Nov 12, 2024

we may need to modify the harmless=true test to send a message from one cluster member to the other to cause the shutdown issue

Without an active SBR, no node will be shut down: it is the SBR that downs itself when receiving ThisActorSystemQuarantinedEvent. Without a cluster setup in the test (and as such without an SBR running), we need to watch for the ThisActorSystemQuarantinedEvent event instead, which is what I did to check that the bug exists (the test passes if the bug exists):

"eliminate quarantined association when not used (harmless=true)" in withAssociation {
  (remoteSystem, remoteAddress, _, localArtery, localProbe) =>
    remoteSystem.eventStream.subscribe(testActor, classOf[ThisActorSystemQuarantinedEvent]) // event to watch out for, indicator of the issue

    val remoteEcho = remoteSystem.actorSelection("/user/echo").resolveOne(remainingOrDefault).futureValue

    val localAddress = RARP(system).provider.getDefaultAddress

    val localEchoRef = remoteSystem.actorSelection(RootActorPath(localAddress) / localProbe.ref.path.elements).resolveOne(remainingOrDefault).futureValue
    remoteEcho.tell("ping", localEchoRef)
    localProbe.expectMsg("ping")

    val association = localArtery.association(remoteAddress)
    val remoteUid = futureUniqueRemoteAddress(association).futureValue.uid
    localArtery.quarantine(remoteAddress, Some(remoteUid), "HarmlessTest", harmless = true)
    association.associationState.isQuarantined(remoteUid) shouldBe true

    eventually {
      remoteEcho.tell("ping", localEchoRef) // trigger sending message from remote to local, which will trigger local to wrongfully notify remote that it is quarantined
      expectMsgType[ThisActorSystemQuarantinedEvent] // this is what remote emits when it learns it is quarantined by local. This is not correct and is what (with SBR enabled) triggers killing the node.
    }
}

@pjfanning (Member, Author)

I added the new test case, but I am aware that it needs to be moved to the cluster or cluster-tests projects, with the Split Brain Resolver added. I am busy on other tasks, so I don't expect to get back to this for a while.

@fredfp commented Nov 12, 2024

What would moving the test to the cluster or cluster-tests projects add? To me this is a bug in the remote module and is better tested here. For sure, you could test the consequences of the bug in cluster, but the root cause is here. Is that what you have in mind: to also cover/test the consequences?

@pjfanning (Member, Author)

I've added a change to InboundQuarantineCheck based on #578 (comment). This may not be the best solution but it seems to help in this one test case.
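For readers who haven't opened the diff, the idea can be sketched as a small predicate (hypothetical, simplified names; the real logic lives in InboundQuarantineCheck and is gated by the ArterySettings flag): only notify the sending system that it is quarantined when the quarantine is not harmless, or when propagating harmless events is explicitly enabled.

```scala
// Hypothetical sketch, not the actual Pekko code.
// Models the decision InboundQuarantineCheck makes when a message arrives
// from a system whose association is quarantined on our side.
final case class QuarantineInfo(quarantined: Boolean, harmless: Boolean)

def shouldNotifyPeer(
    info: QuarantineInfo,
    propagateHarmlessQuarantineEvents: Boolean): Boolean =
  // Only tell the peer "you are quarantined" (which, with SBR enabled,
  // makes the peer down itself) when the quarantine is not harmless,
  // or when the operator has opted in to the old behaviour.
  info.quarantined && (!info.harmless || propagateHarmlessQuarantineEvents)
```

With propagate-harmless-quarantine-events = off, a harmless quarantine no longer triggers the notification, so the remote side never sees a ThisActorSystemQuarantinedEvent for it.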

@fredfp commented Nov 13, 2024

It looks good to me like that, thank you!

@pjfanning (Member, Author)

@raboof @mdedetrich @jrudolph what do you think about the runtime change? We could add a config setting to let users control whether the new runtime check is enabled.

@raboof (Member) left a comment:
I finally got a chance to review this change.

The fact that 'harmless' quarantines are not propagated, but are returned in InboundQuarantineCheck, indeed looks like a bug.

I think there's a good possibility that this caused the scenario in #578 and that it'd be worth releasing this new behavior. I'd even lean towards making the new behavior the default, but I'm also OK with being conservative and first testing with the reporters of #578.

Whether this fix completely prevents situations like the one described in #578 is harder to say with confidence: there are still quite a few situations that can cause non-harmless quarantines, and it's possible there are still situations where a single misbehaving node may 'take out' its surrounding peers.

ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.pekko.remote.artery.AssociationState#QuarantinedTimestamp.this")
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.pekko.remote.artery.AssociationState#QuarantinedTimestamp.apply")
ProblemFilters.exclude[IncompatibleSignatureProblem]("org.apache.pekko.remote.artery.AssociationState#QuarantinedTimestamp.unapply")
ProblemFilters.exclude[MissingTypesProblem]("org.apache.pekko.remote.artery.AssociationState$QuarantinedTimestamp$")
A Member left a review comment on these excludes:

This class is in private[remote] context, so indeed these changes are safe.

I think you could use a wildcard:

ProblemFilters.exclude[Problem]("org.apache.pekko.remote.artery.AssociationState#QuarantinedTimestamp*")

@pjfanning (Member, Author)

Thanks for the review. I was running the tests a few weeks ago and I'm not sure that they work. @fredfp tested this a few weeks ago and said it helped, but I don't think this is ready yet, at least not until the new tests are improved.

@He-Pin (Member) commented Jan 2, 2025

A workmate told me about this; as I am not using the cluster at work, thanks for taking care of this hard task.

I know a team that is using an akka/pekko cluster with a centralized node server (we call it vipserver), which decides which nodes are in the same cluster as a single source of truth, instead of the gossip approach, where one wrong message spreading through the cluster can take the whole cluster down.

Another reason I'm not using clustering at work is the chaos monkey: the SRE team schedules some random network partitions, and I think that would always require a reboot by the application owner; I am a little lazy here.

@pjfanning pjfanning marked this pull request as ready for review January 2, 2025 19:40
@pjfanning pjfanning changed the title [EXPERIMENT] stub test for harmless=true add non-default config that allows InboundQuarantineCheck to ignore 'harmless' quarantine events Jan 2, 2025
@pjfanning (Member, Author)

The tests seem to be working for me today. If reviewers are amenable, we could merge this to the main and 1.1.x branches and document that the config exists to enable an experimental fix for #578.

@pjfanning pjfanning merged commit ec5e33f into apache:main Jan 4, 2025
9 checks passed
@pjfanning pjfanning deleted the harmless branch January 4, 2025 10:04
@pjfanning pjfanning added this to the 1.2.0 milestone Jan 4, 2025
@He-Pin (Member) commented Aug 27, 2025

@pjfanning @raboof I think we should backport this to 1.1.0, for people who can't upgrade netty but need this.

@pjfanning (Member, Author) commented Aug 27, 2025

I can live with this being backported. It is a reasonably big change for a patch release, but it is disabled by default, so users who upgrade to 1.1.6 won't get this change unless they update their config settings.

@pjfanning (Member, Author)

@He-Pin doesn't this only affect users who use Artery remoting - and you say elsewhere that you are blocked by Netty issues?

@He-Pin (Member) commented Aug 27, 2025

@pjfanning Yes, she is using the Artery transport, but to fix this she has to upgrade to Pekko 1.2, and then Netty 4.2 causes another problem :(

pjfanning added a commit to pjfanning/incubator-pekko that referenced this pull request Aug 27, 2025
…harmless' quarantine events (apache#1555)

* stub test for harmless=true

Update OutboundIdleShutdownSpec.scala

Update OutboundIdleShutdownSpec.scala

Update OutboundIdleShutdownSpec.scala

* add quarantinedButHarmless check for tests

* new test case

* Update OutboundIdleShutdownSpec.scala

* try to not shutdown when quarantine is harmless

* Update OutboundIdleShutdownSpec.scala

* Create quarantine.backwards.excludes

* Update quarantine.backwards.excludes

* update log message

* try to add config

* Update ArterySettings.scala

* add tests

* Update OutboundIdleShutdownSpec.scala

* rework test
He-Pin pushed a commit to He-Pin/incubator-pekko that referenced this pull request Aug 27, 2025
…harmless' quarantine events (apache#1555)

(same commit messages as above; cherry picked from commit ec5e33f)
He-Pin added a commit that referenced this pull request Aug 27, 2025
…harmless' quarantine events (#1555) (#2100)

(same commit messages as above; cherry picked from commit ec5e33f)

Co-authored-by: PJ Fanning <[email protected]>
@He-Pin (Member) commented Aug 28, 2025

@fredfp Did you try this release? Would you like to share the results?

@fredfp commented Aug 28, 2025

We didn't try the fix (as it's only included in a milestone release). I'd be very happy to turn the needed flag on if this made it into 1.1.6. However, only time will tell, as we don't get that kind of cluster crash very often.

@He-Pin (Member) commented Aug 29, 2025

@fredfp Thanks, there is a backport in 1.1.6, and 1.2.0 will be released soon too.

@fredfp commented Sep 4, 2025

Over the last couple of days, we've successfully deployed Pekko 1.2.0 with pekko.remote.artery.propagate-harmless-quarantine-events = off in 2 clusters. Both seem stable, and more stable than before, i.e. no more nodes downed with DownSelfQuarantinedByRemote. A great success so far, thank you!

@He-Pin (Member) commented Sep 4, 2025

@fredfp Thank you for that update, great sharing!

@He-Pin (Member) commented Sep 5, 2025

@fredfp Are you using pekko-management at the same time?

@fredfp commented Sep 5, 2025

Yes, we rely on pekko-management to create and then join the cluster.

@He-Pin (Member) commented Sep 6, 2025

One of my workmates said her cluster node was killed by k8s, not sure why :( k8s said the management port was unreachable, and then killed the node.

@pjfanning (Member, Author)

One of my workmates said her cluster node was killed by k8s, not sure why :( k8s said the management port was unreachable, and then killed the node.

@He-Pin this code has nothing to do with kubernetes. If you have pekko-management issues, it would be better to raise issues in the pekko-management project.

@He-Pin (Member) commented Sep 6, 2025

@pjfanning I know, I was just asking @fredfp if he has seen that issue too.
