mmaintegration: add StoreStatus to mma status translation plumbing #160462

wenyihu6 · 2026-01-05T14:36:12Z

storepool: export GetStoreStatuses for mma health signal plumbing

Previously, storepool's StoreStatus was not exported and only used internally by
storepool. This commit exports StoreStatus so that mma integration can consume
the same store-level signals that sma relies on and translate them into mma's
(Health, Disposition) model.

An alternative approach would be to have store pool handle the translation of
the store status to mma's (Health, Disposition) model. But this would couple
storepool to mma and blur ownership; storepool should remain as the source of
truth for store status. MMA owns the interpretation.

mmaintegration: add StoreStatus to mma status translation plumbing

This commit adds the translation layer from StoreStatus to mma's (Health,
Disposition) model.

sma currently relies on StorePool methods (GetStoreList, LiveAndDeadReplicas),
which internally compute StoreStatus from NodeLivenessFunc (membership +
health) combined with other signals (throttling, suspect state).

To preserve sma's behavior, mma reuses StorePool's status() method and
translates it to its own (Health, Disposition) model rather than re-deriving
health independently.

Alternatives considered (and rejected):

1. Query NodeLiveness directly in mma: NodeLiveness operates at the node level,
   while StorePool tracks per-store state, so the statuses of stores on the same
   node can diverge based on gossip timing and store-specific signals. In
   addition, NodeLiveness does not include other store signals such as
   throttling (snapshot backpressure) and suspect status (recently unavailable),
   which sma currently uses to filter candidates when making lease/replica
   placement decisions.
2. Periodically poll storepool from mma: statuses are instead plumbed right
   before ComputeChanges() rather than refreshed periodically in another
   goroutine. Polling is more complex, may serve stale statuses, and is less
   efficient. mma currently only needs updated health statuses for
   ComputeChanges.

Note that the translation goes through allocator sync, not directly in
mmaprototype, to avoid importing storepool there and keep layering clean.

The translation follows this mapping:

| StorePool Status | MMA Health | Lease Disposition | Replica Disposition | Rationale |
|------------------|-----------------|-------------------|---------------------|-----------|
| Dead | HealthDead | Shedding | Shedding | Store is gone: shed everything |
| Unknown | HealthUnknown | Refusing | Refusing | State is unknown: don't add but don't remove either |
| Decommissioning | HealthOK | Shedding | Shedding | Store is leaving cluster: shed everything |
| Draining | HealthOK | Shedding | Refusing | Store is draining: shed leases, accept replicas |
| Throttled | HealthOK | OK | Refusing | Healthy but overloaded: accept leases but not replicas |
| Suspect | HealthUnhealthy | Shedding | Refusing | Recently unavailable: shed leases for safety and don't accept replicas |
| Available | HealthOK | OK | OK | Healthy store: accept all |
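
For illustration only, the table above could be expressed as a single switch. This is a minimal sketch, not the code in store_status.go: the type names, constant names, and the `translate` function are hypothetical stand-ins, and the values follow the Lease/Replica Disposition columns of the table. Routing unrecognized values to the same result as Unknown is an assumption.

```go
package main

import "fmt"

// StoreStatus is a hypothetical stand-in for the exported storepool status.
type StoreStatus int

const (
	StatusDead StoreStatus = iota
	StatusUnknown
	StatusDecommissioning
	StatusDraining
	StatusThrottled
	StatusSuspect
	StatusAvailable
)

// Health and Disposition are hypothetical stand-ins for mma's model.
type Health string
type Disposition string

const (
	HealthOK        Health = "HealthOK"
	HealthUnknown   Health = "HealthUnknown"
	HealthUnhealthy Health = "HealthUnhealthy"
	HealthDead      Health = "HealthDead"

	OK       Disposition = "OK"
	Refusing Disposition = "Refusing"
	Shedding Disposition = "Shedding"
)

// MMAStatus bundles a health value with the lease and replica dispositions.
type MMAStatus struct {
	Health  Health
	Lease   Disposition
	Replica Disposition
}

// translate follows the mapping table row by row.
func translate(s StoreStatus) MMAStatus {
	switch s {
	case StatusDead:
		return MMAStatus{HealthDead, Shedding, Shedding}
	case StatusDecommissioning:
		return MMAStatus{HealthOK, Shedding, Shedding}
	case StatusDraining:
		return MMAStatus{HealthOK, Shedding, Refusing}
	case StatusThrottled:
		return MMAStatus{HealthOK, OK, Refusing}
	case StatusSuspect:
		return MMAStatus{HealthUnhealthy, Shedding, Refusing}
	case StatusAvailable:
		return MMAStatus{HealthOK, OK, OK}
	default:
		// StatusUnknown, plus (by assumption) any unrecognized value:
		// do not add to the store, but do not remove from it either.
		return MMAStatus{HealthUnknown, Refusing, Refusing}
	}
}

func main() {
	for s := StatusDead; s <= StatusAvailable; s++ {
		fmt.Printf("%d -> %+v\n", s, translate(s))
	}
}
```

Keeping the mapping in one function means the table and the code can be checked against each other row by row.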

asim: make store rebalancer refresh store status

This commit updates asim's mma store rebalancer to refresh store status before
calling ComputeChanges(), matching production behavior.
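
The refresh-then-compute ordering can be sketched roughly as follows. Everything here is a hypothetical simplification: the `statusSource` interface, the return type of `GetStoreStatuses`, the `UpdateStatus` method, and the toy `ComputeChanges` are assumptions made for illustration and do not reflect the real signatures in storepool, mma, or asim.

```go
package main

import (
	"fmt"
	"sort"
)

type StoreID int
type Status string

// statusSource is a hypothetical view of the store pool; the real
// GetStoreStatuses signature may differ.
type statusSource interface {
	GetStoreStatuses() map[StoreID]Status
}

// fakePool is a map-backed statusSource used only for this sketch.
type fakePool map[StoreID]Status

func (p fakePool) GetStoreStatuses() map[StoreID]Status { return p }

// allocator is a toy stand-in for mma that remembers the last status it
// was given for each store.
type allocator struct {
	statuses map[StoreID]Status
}

func (a *allocator) UpdateStatus(id StoreID, s Status) { a.statuses[id] = s }

// ComputeChanges here simply returns the stores that may receive work,
// in ascending ID order.
func (a *allocator) ComputeChanges() []StoreID {
	var targets []StoreID
	for id, s := range a.statuses {
		if s == "available" {
			targets = append(targets, id)
		}
	}
	sort.Slice(targets, func(i, j int) bool { return targets[i] < targets[j] })
	return targets
}

// rebalanceOnce refreshes statuses synchronously and then computes changes,
// so the computation never sees statuses older than this call.
func rebalanceOnce(src statusSource, a *allocator) []StoreID {
	for id, s := range src.GetStoreStatuses() {
		a.UpdateStatus(id, s)
	}
	return a.ComputeChanges()
}

func main() {
	pool := fakePool{1: "available", 2: "draining"}
	a := &allocator{statuses: map[StoreID]Status{}}
	fmt.Println(rebalanceOnce(pool, a)) // prints [1]
	pool[2] = "available"
	fmt.Println(rebalanceOnce(pool, a)) // prints [1 2]
}
```

Because the refresh happens on the same call path as the computation, no extra goroutine or synchronization is needed.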


wenyihu6 · 2026-01-05T14:37:42Z

@tbg this still needs unit tests, but I'm putting it up to check on the high-level design

tbg

Looks good!

@tbg reviewed 12 files and all commit messages, and made 7 comments.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @wenyihu6).

-- commits line 52 at r2:
The rationale is inconsistent with the replica disposition. I'm guessing the Disposition is correct? When a store is draining, it shouldn't accept replicas. Worth thinking through the case in which a store that is the only remaining target for upreplication is draining: it still doesn't make sense to place a replica there I suppose.

-- commits line 53 at r2:
Worth double checking, but my memory is that "throttled" is when a store refuses a snapshot, right? Or is this triggered for other reasons as well? The "Healthy but overloaded" description doesn't really capture this. Let's call out specifically what "throttled" means here and also in the actual code where the translation happens.

-- commits line 54 at r2:
Isn't shedding too aggressive? Remind me how a store gets into "suspect" state according to the store pool?

pkg/kv/kvserver/mmaintegration/store_status.go line 14 at r2 (raw file):

// translateStorePoolStatusToMMA translates a StorePool status to MMA's (health,
// disposition) model.

Reminder to update this table with the one in the PR description (I'm okay with removing the one from the PR and just keeping this one as the source of truth)

pkg/kv/kvserver/mmaintegration/store_status.go line 70 at r2 (raw file):

		)
	default:
		// Unknown status - treat as unavailable.

Panic, at least in crdb test build?
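
One way to read this suggestion, sketched under assumptions: fail loudly when running a test build, and fall back to a conservative result in production. The `testBuild` variable below is a hypothetical stand-in for whatever build-tag mechanism the repository uses, and `dispositionForUnrecognized` is not a function from the PR.

```go
package main

import "fmt"

// testBuild is a hypothetical stand-in for a build-tag-controlled flag.
var testBuild = false

type Disposition string

const Refusing Disposition = "Refusing"

// dispositionForUnrecognized handles a status value the translation does
// not know about: panic in test builds, refuse new work otherwise.
func dispositionForUnrecognized(status int) Disposition {
	if testBuild {
		panic(fmt.Sprintf("unrecognized store status %d", status))
	}
	return Refusing
}

func main() {
	fmt.Println(dispositionForUnrecognized(99)) // prints Refusing
}
```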

pkg/kv/kvserver/allocator/storepool/store_pool.go line 899 at r2 (raw file):

}

// GetStoreStatus returns the store status for the given store ID.

Is this used?

wenyihu6 · 2026-01-06T15:57:48Z

Going to split this pr up - first one here #160555.

wenyihu6 · 2026-01-07T19:58:25Z

TFTR! I will reply to these in the new PR #160623.

wenyihu6 · 2026-01-07T20:37:50Z

Closing in favor of #160623.

wenyihu6 changed the title ~~storepool: export GetStoreStatuses for mma health signal plumbing~~ mmaintegration: add StoreStatus to mma status translation plumbing Jan 5, 2026

wenyihu6 requested a review from tbg January 5, 2026 14:55

wenyihu6 added 3 commits January 5, 2026 12:53

asim: make store rebalancer refresh store status

e72e52d

This commit updates asim's mma store rebalancer to refresh store status before calling ComputeChanges(), matching production behavior.

fixup! storepool: export GetStoreStatuses for mma health signal plumbing

c114675

wenyihu6 force-pushed the newhealth branch from 0ffbb83 to c114675 Compare January 5, 2026 17:53

tbg reviewed Jan 6, 2026

wenyihu6 closed this Jan 7, 2026

wenyihu6 deleted the newhealth branch January 8, 2026 15:24
