Replies: 2 comments 2 replies
The maxFailovers limitation of the retry policy should apply to a single request. Going through GrpcOmTransport.java and GrpcOMFailoverProxyProvider.java roughly, I believe there is room for improvement in the retry behavior here, just as @greenwich mentioned. @greenwich, would you like to submit a PR if you have a fix?
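A minimal sketch of what a per-request failover budget could look like, assuming the Hadoop RetryPolicy interface that GrpcOmTransport already calls into. The class and interface names below are hypothetical; this is an illustration rather than a proposed patch:

```java
// A minimal sketch, not the actual GrpcOmTransport code: the failover budget is
// kept on the stack of each request instead of in a shared field, so one slow
// failover cannot exhaust the retry budget for every other request in the process.
// PerRequestRetrySketch and RequestCall are hypothetical names for illustration.
import java.io.IOException;
import org.apache.hadoop.io.retry.RetryPolicy;
import org.apache.hadoop.io.retry.RetryPolicy.RetryAction;

public final class PerRequestRetrySketch {

  /** Hypothetical stand-in for the actual gRPC call to the current OM. */
  public interface RequestCall<T> {
    T run() throws IOException;
  }

  private final RetryPolicy retryPolicy; // e.g. the policy built by the failover proxy provider

  public PerRequestRetrySketch(RetryPolicy retryPolicy) {
    this.retryPolicy = retryPolicy;
  }

  /** Each call gets its own failover counter; nothing is shared across threads. */
  public <T> T submitRequest(RequestCall<T> call) throws Exception {
    int failovers = 0; // per-request, not a class field
    while (true) {
      try {
        return call.run();
      } catch (IOException ex) {
        RetryAction action = retryPolicy.shouldRetry(ex, 0, failovers, true);
        if (action.action == RetryAction.RetryDecision.FAILOVER_AND_RETRY) {
          failovers++;
          // a real transport would perform the failover to the next OM here before retrying
          continue;
        }
        throw ex; // FAIL (or any other decision) ends this request only
      }
    }
  }
}
```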
2 replies
cc @rakeshadr
We recently had a production incident with Ozone 2.0 that caused one of the S3G instances to become stuck and unable to fail over to a new leader OM.
Summary
Our Ozone cluster runs on Kubernetes: we have a number of nodes, each running one S3G and one DN. Some nodes additionally run one OM or one SCM instance. We have three OMs: om0, om1, and om2.
For some reason, one of the Kubernetes nodes running an S3G, a DN, and om1 (the leader) went into a non-Ready state for a few minutes, so om1 kept running but did not serve any traffic. That caused om2 to take over leadership. A few seconds later, om1 returned to the cluster.
All S3Gs failed over to the new OM leader except one, which got stuck endlessly retrying the failover. Restarting that S3G resolved the issue.
Investigation
Later, the investigation showed the following:
- In hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/protocolPB/GrpcOmTransport.java, line 93, we have `private int failoverCount = 0;`. All threads share this counter, and it is never reset.
- In GrpcOmTransport.shouldRetry (line 258) we run `action = retryPolicy.shouldRetry((Exception)ex, 0, failoverCount++, true);`. Is that intentional? Is it safe to do?
- In getRetryAction inside OMFailoverProxyProviderBase.getRetryPolicy we still use that global failoverCount, checking `if (failovers < maxFailovers)` (line 258), which always returns `RetryAction.FAIL` (line 263) once maxFailovers has been reached. Should failoverCount be kept per request or per thread instead of being a global counter? Or should it be reset? (See the sketch after this list.)
- There also seems to be a race around performFailoverDone, set in hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/ha/OMFailoverProxyProviderBase.java (`private boolean performFailoverDone`, line 91), and GrpcOmTransport.shouldRetry, for example (see the inline comments).
- I believe our suboptimal configuration caused these race conditions; however, they may still occur even with the default configuration.
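To make the shared-counter point concrete, here is a minimal, self-contained toy. It is not Ozone code and all names are made up; it only shows why a process-wide failover counter that is never reset gets exhausted by one burst of failures, after which every later request is denied a failover even though the cluster has recovered:

```java
// Toy illustration only, not Ozone code: a process-wide failover counter that is
// shared by all threads and never reset. Once the shared count passes the limit,
// the policy says "fail" for every later request, even though the cluster is healthy.
public final class SharedFailoverCounterDemo {

  private static final int MAX_FAILOVERS = 5; // stand-in for the configured limit
  private static int failoverCount = 0;       // shared and never reset, like the field on line 93

  // mirrors the `if (failovers < maxFailovers)` check in getRetryAction
  private static boolean allowFailover(int failovers) {
    return failovers < MAX_FAILOVERS;
  }

  public static void main(String[] args) throws InterruptedException {
    Runnable request = () -> {
      // each attempt bumps the shared counter, like shouldRetry(ex, 0, failoverCount++, true);
      // the unsynchronized increment is itself a data race between threads
      boolean allowed = allowFailover(failoverCount++);
      System.out.println(Thread.currentThread().getName()
          + ": failoverCount=" + failoverCount + ", failover allowed=" + allowed);
    };

    // a short burst of failing requests during one incident exhausts the shared budget
    Thread[] burst = new Thread[10];
    for (int i = 0; i < burst.length; i++) {
      burst[i] = new Thread(request, "request-" + i);
      burst[i].start();
    }
    for (Thread t : burst) {
      t.join();
    }

    // every later request is now denied a failover, the same symptom the stuck S3G showed
    request.run();
  }
}
```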
To reproduce the production issue, I created a small tool (actually a test) that runs mock OMs (om0, om1, om2), mimics the production failover from om1 to om2, and then bombards the transport with requests, printing the results to the console.
The results are interesting: om2 (the new leader) is never tried at all.
RunGrpcFailoverTest.java
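For context, the rough shape of such a harness might look like the sketch below. All names are hypothetical, there is no gRPC or Ozone wiring, and it deliberately omits the retry-policy logic that the real test exercises; it only shows the mock OMs, the simulated leadership change, and the per-OM attempt counts:

```java
// Hypothetical harness shape: worker threads fire requests while a leadership
// change from om1 to om2 is simulated, and the harness counts attempts per OM.
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicReference;

public final class FailoverHarnessSketch {

  public static void main(String[] args) throws InterruptedException {
    List<String> oms = List.of("om0", "om1", "om2");           // mock OM ids
    AtomicReference<String> leader = new AtomicReference<>("om1");
    Map<String, Integer> attempts = new ConcurrentHashMap<>();

    int clients = 8;
    int requestsPerClient = 50;
    CountDownLatch done = new CountDownLatch(clients);

    for (int c = 0; c < clients; c++) {
      new Thread(() -> {
        for (int i = 0; i < requestsPerClient; i++) {
          // a real harness would route this through GrpcOmTransport; here we
          // only record which OM each request was aimed at
          attempts.merge(leader.get(), 1, Integer::sum);
        }
        done.countDown();
      }, "client-" + c).start();
    }

    Thread.sleep(5);        // let the bombardment start
    leader.set("om2");      // simulate om1 losing leadership mid-run

    done.await();
    oms.forEach(om -> System.out.println(om + " attempts: " + attempts.getOrDefault(om, 0)));
  }
}
```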
Could anyone please have a look at my points above and comment on them?