[HDDS-5525] Datanode snapshot can be installed while pre-finalize actions are running #9777
ilyavolodinNSU asked this question in FAQ
Fantastic writeup! @errose28 ?
Hello, Apache Community!
I have a question about HDDS-5525. Below I describe my current understanding of the issue; I would appreciate feedback on anything I have gotten wrong.
The call to ContainerStateMachine#loadSnapshot() was only a side effect of an incorrect initialization chain in Apache Ratis 2.1.0. Specifically, creating OzoneContainer inside the DatanodeStateMachine constructor triggered XceiverServerRatis#notifyGroupAdd(), which in turn called ContainerStateMachine#initialize(), where loadSnapshot() is executed.
In other words, the exception described in HDDS-5513 was caused by Ratis behavior, which was later fixed in RATIS-1465 (commit 53a3eaa).
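To make the chain I am describing concrete, here is a minimal sketch of it. All classes below are simplified stand-ins that I made up for illustration, not the real Ozone or Ratis classes, and the `eagerNotify` flag is only my assumption about how to model the behavior before and after RATIS-1465.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins only; none of these are the real Ozone or Ratis classes.
public class InitChainSketch {
    // Records the order in which the stand-in methods are called.
    static final List<String> calls = new ArrayList<>();

    // Stand-in for ContainerStateMachine.
    static class StateMachineStub {
        void initialize() {
            calls.add("ContainerStateMachine#initialize");
            loadSnapshot(); // the snapshot is loaded as part of initialization
        }

        void loadSnapshot() {
            calls.add("ContainerStateMachine#loadSnapshot");
        }
    }

    // Stand-in for XceiverServerRatis.
    static class RatisServerStub {
        void notifyGroupAdd() {
            calls.add("XceiverServerRatis#notifyGroupAdd");
            new StateMachineStub().initialize();
        }
    }

    // Stand-in for OzoneContainer. eagerNotify is a hypothetical flag that
    // models the Ratis 2.1.0 behavior (true) versus the fixed behavior (false).
    static class OzoneContainerStub {
        OzoneContainerStub(boolean eagerNotify) {
            calls.add("OzoneContainer#<init>");
            if (eagerNotify) {
                new RatisServerStub().notifyGroupAdd();
            }
        }
    }

    // Stand-in for DatanodeStateMachine: its constructor creates the container.
    static class DatanodeStub {
        DatanodeStub(boolean eagerNotify) {
            calls.add("DatanodeStateMachine#<init>");
            new OzoneContainerStub(eagerNotify);
        }
    }

    // Builds the stand-in datanode and returns the recorded call order.
    public static List<String> trace(boolean eagerNotify) {
        calls.clear();
        new DatanodeStub(eagerNotify);
        return new ArrayList<>(calls);
    }

    public static void main(String[] args) {
        trace(true).forEach(System.out::println);
    }
}
```

With `eagerNotify` set to `true`, loadSnapshot() shows up in the trace purely as a consequence of constructing the datanode; with `false`, the constructor no longer reaches it.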
The exception from HDDS-5513 can be easily reproduced if:
After upgrading Ratis to the version containing the fix mentioned above, this reproduction is no longer possible.
This is not from the perspective of state machine consistency, but from the perspective of the public API surface. Therefore, the fix introduced in HDDS-5513 (commit d405ebf) still appears to be relevant.
Am I correct in understanding that “involving container data” refers to interactions similar to those triggered via triggerHeartbeat()?
If yes, then the problem is understandable, but the solution is less clear, since pre-finalize actions are executed before the main DatanodeStateMachine loop and before context.execute() (as discussed earlier).
If the reference is to loadSnapshot(), then the problem becomes even harder to conceptualize, because loadSnapshot() was shown above not to be directly involved in a true data race (again, the root cause was Ratis).
It would be very helpful to see a concrete example of such a scenario, because at the moment it is not entirely clear what specific situation is being referred to, especially considering the apparent disconnect between loadSnapshot() and the race condition described in point (1).
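For reference, the startup ordering I am assuming in the triggerHeartbeat() case can be sketched as follows. This is a single-threaded toy model with hypothetical method names of my own; it is not the real DatanodeStateMachine code, and it deliberately ignores any other threads that may exist in the real datanode.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the assumed ordering; names are hypothetical stand-ins.
public class StartupOrderSketch {
    private final List<String> events = new ArrayList<>();

    // Stand-in for the pre-finalize actions run during startup.
    private void runPreFinalizeActions() {
        events.add("pre-finalize actions");
    }

    // Stand-in for one context.execute() call in the main loop; in my
    // understanding, heartbeat-driven work would only begin from here.
    private void execute(int iteration) {
        events.add("context.execute() #" + iteration);
    }

    // Runs pre-finalize actions first, then a bounded number of loop iterations.
    public List<String> start(int iterations) {
        runPreFinalizeActions();
        for (int i = 1; i <= iterations; i++) {
            execute(i);
        }
        return events;
    }

    public static void main(String[] args) {
        new StartupOrderSketch().start(2).forEach(System.out::println);
    }
}
```

If this ordering is accurate, nothing started from the main loop can overlap with the pre-finalize actions, which is why I am unsure which concurrent interaction is meant.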
If I am misunderstanding anything, please correct me.