mmaintegration: add StoreStatus to mma status translation plumbing #160623
Conversation
Potential Bug(s) Detected: the three-stage Claude Code analysis has identified potential bug(s) in this PR that may warrant investigation. Note: when viewing the workflow output, scroll to the bottom to find the Final Analysis Summary. After you review the findings, please tag the issue accordingly.
sumeerbhola
left a comment
@sumeerbhola reviewed 11 files and all commit messages, and made 8 comments.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @tbg and @wenyihu6).
pkg/kv/kvserver/asim/tests/testdata/non_rand/mma/store_status_shedding.txt line 39 at r1 (raw file):
# Assert s4 (dead) should have 0 replicas.
assertion type=stat stat=replicas ticks=6 exact_bound=0 stores=(4)
is this assertion evaluated after the 6m eval below?
pkg/kv/kvserver/asim/tests/testdata/non_rand/mma/store_status_shedding.txt line 48 at r1 (raw file):
leases#1: thrash_pct: [s1=0%, s2=0%, s3=0%, s4=0%] (sum=0%)
replicas#1: first: [s1=40, s2=40, s3=40, s4=0] (stddev=17.32, mean=30.00, sum=120)
replicas#1: last: [s1=35, s2=40, s3=40, s4=5] (stddev=14.58, mean=30.00, sum=120)
why does s4 have 5 replicas?
never mind: I see this is fixed by the second commit
pkg/kv/kvserver/allocator/mmaprototype/allocator.go line 65 at r1 (raw file):
// UpdateStoreStatus updates the health and disposition for the stores in storeStatuses according to the statuses in storeStatuses.
// Stores not known to the allocator are ignored with logging.
// TODO(wenyihu6): if this is too expensive, we should only update status for stores that have changed.
why would this be expensive given it is just setting some struct fields?
Regarding expense, I suspect we'll get very far with removing all the memory allocations we have scattered all over the MMA code (or integration code, like the map allocation in StorePool.GetStoreStatuses), some with todos, and many without.
pkg/kv/kvserver/allocator/mmaprototype/allocator.go line 66 at r1 (raw file):
// Stores not known to the allocator are ignored with logging.
// TODO(wenyihu6): if this is too expensive, we should only update status for stores that have changed.
UpdateStoreStatus(ctx context.Context, storeStatuses map[roachpb.StoreID]Status)
nit: plural would be better, UpdateStoresStatus
pkg/kv/kvserver/mmaintegration/store_status.go line 31 at r1 (raw file):
| Dead            | HealthDead | Shedding | Shedding | Store is gone: shed everything (matches SMA)                                        |
| Decommissioning | HealthOK   | Shedding | Shedding | Store is leaving cluster: shed everything (SMA allows leases, MMA more aggressive) |
So when does SMA start shedding leases for decommissioning nodes/stores?
pkg/kv/kvserver/mmaintegration/store_status.go line 32 at r1 (raw file):
| Dead            | HealthDead    | Shedding | Shedding | Store is gone: shed everything (matches SMA)                                                  |
| Decommissioning | HealthOK      | Shedding | Shedding | Store is leaving cluster: shed everything (SMA allows leases, MMA more aggressive)           |
| Unknown         | HealthUnknown | Refusing | Refusing | State is unknown: don't add but don't remove either (SMA sheds leases, MMA more conservative) |
Should MMA also be shedding leases?
pkg/kv/kvserver/mmaintegration/store_status.go line 62 at r1 (raw file):
	mmaprototype.ReplicaDispositionShedding,
)
case storepool.StoreStatusUnknown:
nit: the ordering of cases doesn't match the one in the table above.
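For concreteness, here is a minimal Go sketch of the translation switch that the quoted table and code describe, with the cases in table order. The type and constant shapes below are stand-ins, not the actual storepool/mmaprototype definitions; only the names visible in this PR (the Health* values, the Shedding/Refusing dispositions, StoreStatusUnknown, and the panic on unhandled statuses mentioned later in the thread) are taken from it.

```go
package main

import "fmt"

// Stand-ins for the storepool / mmaprototype types under discussion.
type StoreStatus int

const (
	StoreStatusDead StoreStatus = iota
	StoreStatusDecommissioning
	StoreStatusUnknown
)

type Health int

const (
	HealthDead Health = iota
	HealthOK
	HealthUnknown
)

type Disposition int

const (
	DispositionShedding Disposition = iota
	DispositionRefusing
)

type Status struct {
	Health             Health
	LeaseDisposition   Disposition
	ReplicaDisposition Disposition
}

// translate mirrors the table rows quoted above, with the cases in table order.
func translate(s StoreStatus) Status {
	switch s {
	case StoreStatusDead:
		// Store is gone: shed everything (matches SMA).
		return Status{HealthDead, DispositionShedding, DispositionShedding}
	case StoreStatusDecommissioning:
		// Store is leaving the cluster: shed everything.
		return Status{HealthOK, DispositionShedding, DispositionShedding}
	case StoreStatusUnknown:
		// State is unknown: don't add, but don't remove either.
		return Status{HealthUnknown, DispositionRefusing, DispositionRefusing}
	default:
		// The thread below notes panics were added for unhandled statuses.
		panic(fmt.Sprintf("unhandled store status: %d", s))
	}
}

func main() {
	// Decommissioning translates to (HealthOK, Shedding, Shedding).
	fmt.Printf("%+v\n", translate(StoreStatusDecommissioning))
}
```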
wenyihu6
left a comment
@wenyihu6 made 6 comments and resolved 1 discussion.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @sumeerbhola and @tbg).
pkg/kv/kvserver/allocator/mmaprototype/allocator.go line 65 at r1 (raw file):
Previously, sumeerbhola wrote…
why would this be expensive given it is just setting some struct fields?
Regarding expense, I suspect we'll get very far with removing all the memory allocations we have scattered all over the MMA code (or integration code, like the map allocation in StorePool.GetStoreStatuses), some with todos, and many without.
Makes sense; I removed the TODO. I was mostly concerned about the map allocation, along with calling into the store pool every minute (and potentially more often once we move away from the replicate and lease queues entirely).
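On the allocation point, here is a minimal sketch of the reuse pattern that would avoid the per-tick map allocation, assuming a hypothetical getStatusesInto helper; the real StorePool.GetStoreStatuses signature is not shown in this PR.

```go
package main

// StoreID, Status, and pool are stand-ins, not the real storepool types.
type StoreID int32

type Status int

type pool struct {
	statuses map[StoreID]Status
}

// getStatusesInto refills out in place, so steady-state calls allocate
// nothing (assuming the store set stays roughly the same size).
func (p *pool) getStatusesInto(out map[StoreID]Status) {
	clear(out) // Go 1.21+: empties the map while keeping its buckets
	for id, s := range p.statuses {
		out[id] = s
	}
}

func main() {
	p := &pool{statuses: map[StoreID]Status{1: 0, 2: 1, 3: 0}}
	buf := make(map[StoreID]Status, len(p.statuses)) // allocated once
	for tick := 0; tick < 3; tick++ {                // e.g. the per-minute refresh
		p.getStatusesInto(buf)
		_ = buf // would be passed to UpdateStoresStatus in the real flow
	}
}
```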
pkg/kv/kvserver/allocator/mmaprototype/allocator.go line 66 at r1 (raw file):
Previously, sumeerbhola wrote…
nit: plural would be better, UpdateStoresStatus
Renamed.
pkg/kv/kvserver/mmaintegration/store_status.go line 31 at r1 (raw file):
Previously, sumeerbhola wrote…
So when does SMA starts shedding leases for decommissioning nodes/stores?
I believe they handle that indirectly by removing replicas (the lease queue does not actively shed leases). For these two cases where SMA behaves differently, I'm not entirely sure what the best approach is; I just went with my intuition.
pkg/kv/kvserver/mmaintegration/store_status.go line 32 at r1 (raw file):
Previously, sumeerbhola wrote…
Should MMA also be shedding leases?
Ditto: for these two cases where SMA behaves differently, I'm not entirely sure what the best approach is; I just went with my intuition. We could change them.
pkg/kv/kvserver/mmaintegration/store_status.go line 62 at r1 (raw file):
Previously, sumeerbhola wrote…
nit: the ordering of cases doesn't match the one in the table above.
Done.
pkg/kv/kvserver/asim/tests/testdata/non_rand/mma/store_status_shedding.txt line 39 at r1 (raw file):
Previously, sumeerbhola wrote…
is this assertion evaluated after the 6m eval below?
Yes, it fires after the evaluation.
|
@tbg re: comments on the other PR
Correct, disposition was right. I mostly followed what SMA already does. If the only remaining target for upreplication is draining, it would be handled by the replicate queue but not by mma as of now. But the replicate queue wouldn't do that for upreplication, given isStoreReadyForRoutineReplicaTransfer filters out draining stores (see the sketch below).
Yes, you are right. I updated the comment to include the definition.
Removed.
Good call, added panics.
Good catch, removed.
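For illustration, a hypothetical sketch of that draining-store filtering. isStoreReadyForRoutineReplicaTransfer is the real function name mentioned above, but its signature isn't shown here, so everything below is a stand-in.

```go
package main

import "fmt"

type StoreID int32

// filterRoutineTransferTargets drops draining stores from the candidate set,
// so routine replica transfers (including upreplication) never target them.
func filterRoutineTransferTargets(cands []StoreID, isDraining func(StoreID) bool) []StoreID {
	out := cands[:0] // filter in place, reusing the backing array
	for _, c := range cands {
		if !isDraining(c) {
			out = append(out, c)
		}
	}
	return out
}

func main() {
	draining := map[StoreID]bool{2: true}
	cands := []StoreID{1, 2, 3}
	fmt.Println(filterRoutineTransferTargets(cands, func(s StoreID) bool { return draining[s] }))
	// Prints [1 3]: the draining store s2 is excluded.
}
```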
I’ll go ahead and merge this. The translation table is still up for debate, so we can revisit it as needed. TFTRs!
sumeerbhola
left a comment
@sumeerbhola made 2 comments.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @tbg and @wenyihu6).
pkg/kv/kvserver/mmaintegration/store_status.go line 31 at r1 (raw file):
Previously, wenyihu6 (Wenyi Hu) wrote…
I believe they handle that indirectly by removing replicas (lease queue does not actively shed lease). For these two cases where SMA behaves differently, I’m not entirely sure what the best approach is. I just went with my intuition.
I see -- it won't first shed leases since it has to shed all replicas, and as part of shedding a replica it can transfer the lease. And that may mean a better location for the lease. Additionally, while it has a replica, it may as well accept a lease if it is the best location, since it is healthy.
It seems to me that the SMA behavior makes sense, and we should change it to (OK, Shedding).
pkg/kv/kvserver/mmaintegration/store_status.go line 32 at r1 (raw file):
Previously, wenyihu6 (Wenyi Hu) wrote…
Ditto - for these two cases where SMA behaves differently, I’m not entirely sure what the best approach is. I just went with my intuition. We could change them.
Do we see production issues where there are false positive transitions to Unknown? If not, shedding leases seems prudent for availability, given they are cheap to move.
My comments can be addressed in a subsequent PR.
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool.
error creating backport branch refs/heads/blathers/backport-release-26.1-160623: POST https://api.github.com/repos/wenyihu6/cockroach/git/refs: 403 Resource not accessible by integration []
Backport to branch 26.1.x failed. See errors above.
🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
Part of: #156776
Epic: CRDB-55052
mmaintegration: add StoreStatus to mma status translation plumbing
This commit adds the translation layer from StoreStatus to mma's (Health,
Disposition) model.
sma currently relies on StorePool methods (GetStoreList, LiveAndDeadReplicas),
which internally compute StoreStatus from NodeLivenessFunc (membership +
health) combined with other signals (throttling, suspect state).
To preserve sma's behavior, mma reuses StorePool's status() method and
translates it to its own (Health, Disposition) model rather than re-deriving
health independently.
Alternatives considered (and rejected):
1. Query NodeLiveness directly in mma: NodeLiveness operates at the node
   level, while StorePool tracks per-store state, so store status on the same
   node can diverge based on gossip timing and store-specific signals. In
   addition, NodeLiveness does not include other store signals such as
   throttling (snapshot backpressure) and suspect status (recently
   unavailable), which are currently used by sma to filter candidates when
   making lease/replica placement decisions.
2. Periodically poll storepool from mma: instead, statuses are plumbed before
   ComputeChanges() rather than refreshed periodically in another goroutine.
   Polling is more complex, may be stale, and is less efficient; mma currently
   only needs updated health statuses for ComputeChanges.
Note that the translation goes through allocator sync, not directly in
mmaprototype, to avoid importing storepool there and keep layering clean.
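For illustration, a library-style Go sketch of the call order this commit describes, using stand-in types. UpdateStoresStatus, ComputeChanges, and GetStoreStatuses are named in this PR; everything else below is an assumption.

```go
package allocatorsync

import "context"

type StoreID int32

// StoreStatus is the storepool-level status; Status is mma's
// (Health, Disposition) model. Both are stand-ins here.
type StoreStatus int

type Status struct{ health, leaseDisp, replicaDisp int }

type storePool interface {
	GetStoreStatuses() map[StoreID]StoreStatus
}

type mmaAllocator interface {
	UpdateStoresStatus(ctx context.Context, statuses map[StoreID]Status)
	ComputeChanges(ctx context.Context) []string // change type elided
}

// translate stands in for the StoreStatus -> (Health, Disposition) table.
func translate(in map[StoreID]StoreStatus) map[StoreID]Status {
	out := make(map[StoreID]Status, len(in))
	for id, s := range in {
		out[id] = Status{health: int(s)}
	}
	return out
}

// computeChanges is the allocator-sync step: statuses are refreshed and
// translated immediately before ComputeChanges, rather than polled in a
// separate goroutine, so mma always sees fresh health and dispositions.
func computeChanges(ctx context.Context, sp storePool, mma mmaAllocator) []string {
	mma.UpdateStoresStatus(ctx, translate(sp.GetStoreStatuses()))
	return mma.ComputeChanges(ctx)
}
```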
asim: make store rebalancer refresh store status
This commit updates asim's mma store rebalancer to refresh store status before
calling ComputeChanges(), matching production behavior.
mmaprototype: rename from UpdateStoreStatus to UpdateStoresStatuses