Commit 06c6a78

craig[bot] and sumeerbhola committed
Merge #150249
150249: mmaprototype: port over prototype improvements r=tbg a=wenyihu6

**mmaprototype: improve multi-dimensional rebalancing logic**

- Improved commentary on sortTargetCandidateSetAndPick.
- Most significant is the reworking of the logic in clusterState.canShedAndAddLoad, to be more aggressive when all stores are overloaded along different dimensions. The logic is now arguably more principled than before: in addition to ensuring that the overloaded dimension is not becoming worse in the target than the source (existing logic), it removes the aggregate summary logic that was stopping some rebalancing. Instead it looks at the individual resource dimensions (other than the overloaded dimension), and checks that the fraction increase in those dimensions in the target is significantly smaller than the fraction increase in the overloaded dimension. This should prevent thrashing in which the same range is moved back to the source.

Some test result changes:

mma_one_voter_skewed_cpu_skewed_write now ends with no store in an overload state:

    [n1s1,t1h28m23s,mmaid=258] 77721 evaluating s1: node load loadNoChange, store load loadNoChange, worst dim CPURate
    [n1s1,t1h28m23s,mmaid=258] 77722 evaluating s2: node load loadNormal, store load loadNormal, worst dim CPURate

mma_skewed_cpu_skewed_write_more_ranges converges much faster, even over the original 60m duration of the simulation (I've increased the duration to 90m to make it fully converge).

mma_skewed_cpu_skewed_write: two nodes are overloadSlow along WriteBandwidth. They can't shed to s2, s5, s6 since those will also become overloaded along WriteBandwidth while the src will become underloaded. They don't attempt to shed to s1 since s1 is loadNoChange along CPU, so is in a later equivalence class based on aggregate load. This may be a deficiency of sortTargetCandidateSetAndPick.

    [n6s6,t59m59.5s,mmaid=452] 59570 evaluating s2: node load loadNormal, store load loadNormal, worst dim CPURate
    [n6s6,t59m59.5s,mmaid=452] 59571 evaluating s3: node load loadNormal, store load overloadSlow, worst dim WriteBandwidth
    [n6s6,t59m59.5s,mmaid=452] 59573 evaluating s4: node load loadNormal, store load overloadSlow, worst dim WriteBandwidth
    [n6s6,t59m59.5s,mmaid=452] 59575 evaluating s5: node load loadNormal, store load loadNormal, worst dim CPURate
    [n6s6,t59m59.5s,mmaid=452] 59576 evaluating s6: node load loadLow, store load loadNormal, worst dim WriteBandwidth
    [n6s6,t59m59.5s,mmaid=452] 59577 evaluating s1: node load loadNoChange, store load loadNoChange, worst dim CPURate

Epic: none
Release note: None

---

**mmaprototype: canShedAndAddLoad must not make target overloadUrgent**

Epic: none
Release note: none

---

**mmaprototype: reduce minWriteBandwidthGranularity to 128KiB**

Epic: none
Release note: none

---

**mmaprototype: when ignoreHigherThanLoadThreshold is set, extend beyond first equivalence class in sortTargetCandidateSetAndPick**

This behavior is needed to handle cases where the first equivalence class has no candidates that can accept the load, but later ones can, because they have lower load in the overloadedDim.

Epic: none
Release note: none

---

**mmaprototype: fix bug in sortTargetCandidateSetAndPick**

The intention of the code was to be structured around sets representing equivalence classes, such that a later set is only considered if none of the earlier sets had a member that was discarded and had pending changes. Prior to this change, this criterion was arbitrarily applied in the middle of a set.
So if two stores {s1, s2} were in the same set and s1 was discarded and had pending changes, we would not consider s2. Now we will include s2 and stop when the next set starts.

Epic: none
Release note: none

Co-authored-by: sumeerbhola <[email protected]>
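As a rough illustration of the canShedAndAddLoad change described above, the following standalone Go sketch compares the per-dimension fraction increases on the target against the fraction increase in the overloaded dimension. Everything here is an assumption made for illustration: loadVector, fractionIncrease, canShedAndAddLoadSketch, and the 0.5 ratio are invented names and placeholders, not the actual mmaprototype types, fields, or constants.

```go
package main

import "fmt"

// loadVector is a toy load vector: one value per resource dimension.
// It is an illustrative stand-in, not the mmaprototype representation.
type loadVector []float64

// fractionIncrease returns the relative growth of dimension d on a store
// whose current load is cur if delta were added to it.
func fractionIncrease(cur, delta loadVector, d int) float64 {
	if cur[d] == 0 {
		return 1.0 // treat growth on an idle dimension as a full fraction
	}
	return delta[d] / cur[d]
}

// canShedAndAddLoadSketch mimics the idea from the commit message: the
// overloaded dimension must not become worse on the target than on the
// source, and every other dimension may only grow by a fraction that is
// significantly smaller than the fractional growth in the overloaded
// dimension on the target.
func canShedAndAddLoadSketch(src, tgt, delta loadVector, overloadedDim int) bool {
	// Existing logic (as interpreted here): after the move, the target must
	// still be better off than the source in the overloaded dimension.
	if tgt[overloadedDim]+delta[overloadedDim] >= src[overloadedDim] {
		return false
	}
	overloadedFrac := fractionIncrease(tgt, delta, overloadedDim)
	// New logic (sketch): other dimensions must grow much less, relative to
	// their current load, than the overloaded dimension does. The 0.5 ratio
	// is an assumed placeholder for "significantly smaller".
	const maxOtherToOverloadedRatio = 0.5
	for d := range delta {
		if d == overloadedDim {
			continue
		}
		if fractionIncrease(tgt, delta, d) > maxOtherToOverloadedRatio*overloadedFrac {
			return false
		}
	}
	return true
}

func main() {
	src := loadVector{900, 100}  // CPU-overloaded source (dim 0 = CPU here)
	tgt := loadVector{300, 400}  // target with headroom on CPU
	delta := loadVector{100, 20} // load of the range being moved
	fmt.Println(canShedAndAddLoadSketch(src, tgt, delta, 0)) // true
}
```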
2 parents 7fca691 + e189ba7 commit 06c6a78

13 files changed: +578 -451 lines

pkg/kv/kvserver/allocator/mmaprototype/allocator_state.go

Lines changed: 91 additions & 36 deletions
@@ -512,7 +512,11 @@ func (a *allocatorState) rebalanceStores(
 ss.NodeID, ss.StoreID, rangeID, candsPL)
 continue
 }
-// Have candidates.
+// Have candidates. We set ignoreLevel to
+// ignoreHigherThanLoadThreshold since this is the only allocator that
+// can shed leases for this store, and lease shedding is cheap, and it
+// will only add CPU to the target store (so it is ok to ignore other
+// dimensions on the target).
 targetStoreID := sortTargetCandidateSetAndPick(
 ctx, candsSet, sls.sls, ignoreHigherThanLoadThreshold, CPURate, a.rand)
 if targetStoreID == 0 {
@@ -1022,15 +1026,30 @@ func (i ignoreLevel) SafeFormat(s interfaces.SafePrinter, verb rune) {
 // something that can handle the multidimensional load of this range, (b)
 // random picking in a large set results in different stores with allocators
 // making different decisions, which reduces thrashing. The con is that we may
-// not select the candidate that is the very best.
+// not select the candidate that is the very best. Normally, we only consider
+// candidates in the best equivalence class defined by the loadSummary
+// aggregated across all dimensions, which is already coarse as mentioned
+// above. However, when ignoreHigherThanLoadThreshold is set and an
+// overloadedDim is provided, we extend beyond the first equivalence class, to
+// consider all candidates that are underloaded in the overloadedDim.
 //
 // The caller must not exclude any candidates based on load or
 // maxFractionPendingIncrease. That filtering must happen here. Depending on
 // the value of ignoreLevel, only candidates < loadThreshold may be
 // considered.
 //
 // overloadDim, if not set to NumLoadDimensions, represents the dimension that
-// is overloaded in the source.
+// is overloaded in the source. It is used to narrow down the candidates to
+// those that are most underloaded in that dimension, when all the candidates
+// have an aggregate load summary (across all dimensions) that is >=
+// loadNoChange. This function guarantees that when overloadedDim is set, all
+// candidates returned will be < loadNoChange in that dimension.
+//
+// overloadDim will be set to NumLoadDimensions when the source is not
+// shedding due to overload (say due to (impending) failure). In this case the
+// caller should set loadThreshold to overloadSlow and ignoreLevel to
+// ignoreHigherThanLoadThreshold, to maximize the probability of finding a
+// candidate.
 func sortTargetCandidateSetAndPick(
 ctx context.Context,
 cands candidateSet,
@@ -1096,22 +1115,41 @@ func sortTargetCandidateSetAndPick(
 // Consider the series of sets of candidates that have the same sls. The
 // only reason we will consider a set later than the first one is if the
 // earlier sets get fully discarded solely because of nls and have no
-// pending changes.
-lowestLoad := cands.candidates[0].sls
+// pending changes, or because of ignoreHigherThanLoadThreshold.
+lowestLoadSet := cands.candidates[0].sls
+currentLoadSet := lowestLoadSet
 discardedCandsHadNoPendingChanges := true
 for _, cand := range cands.candidates {
-if cand.sls > lowestLoad {
-if j == 0 && discardedCandsHadNoPendingChanges {
+if cand.sls > currentLoadSet {
+if !discardedCandsHadNoPendingChanges {
+// Never go to the next set if we have discarded candidates that have
+// pending changes. We will wait for those to have no pending changes
+// before we consider later sets.
+break
+}
+currentLoadSet = cand.sls
+}
+if cand.sls > lowestLoadSet {
+if j == 0 {
 // This is the lowestLoad set being considered now.
-lowestLoad = cand.sls
-} else {
+lowestLoadSet = cand.sls
+} else if ignoreLevel < ignoreHigherThanLoadThreshold || overloadedDim == NumLoadDimensions {
 // Past the lowestLoad set. We don't care about these.
 break
 }
+// Else ignoreLevel >= ignoreHigherThanLoadThreshold && overloadedDim !=
+// NumLoadDimensions, so keep going and consider all candidates with
+// cand.sls <= loadThreshold.
+}
+if cand.sls > loadThreshold {
+break
 }
 candDiscardedByNLS := cand.nls > loadThreshold ||
-(cand.nls == loadThreshold && ignoreLevel != ignoreHigherThanLoadThreshold)
-if candDiscardedByNLS || cand.maxFractionPendingIncrease >= maxFractionPendingThreshold {
+(cand.nls == loadThreshold && ignoreLevel < ignoreHigherThanLoadThreshold)
+candDiscardedByOverloadDim := overloadedDim != NumLoadDimensions &&
+cand.dimSummary[overloadedDim] >= loadNoChange
+if candDiscardedByNLS || candDiscardedByOverloadDim ||
+cand.maxFractionPendingIncrease >= maxFractionPendingThreshold {
 // Discard this candidate.
 if cand.maxFractionPendingIncrease > epsilon && discardedCandsHadNoPendingChanges {
 discardedCandsHadNoPendingChanges = false
@@ -1127,41 +1165,56 @@ func sortTargetCandidateSetAndPick(
 log.VInfof(ctx, 2, "sortTargetCandidateSetAndPick: no candidates due to load")
 return 0
 }
+lowestLoadSet = cands.candidates[0].sls
+highestLoadSet := cands.candidates[j-1].sls
 cands.candidates = cands.candidates[:j]
-// The set of candidates we will consider all have lowestLoad.
+// The set of candidates we will consider all have load <= loadThreshold.
+// They may all be lowestLoad, or we may have allowed additional candidates
+// because of ignoreHigherThanLoadThreshold and a specified overloadedDim.
+// When the overloadedDim is specified, all these candidates will be <
+// loadNoChange in that dimension.
 //
-// If this set has load >= loadNoChange, we have a set that we would not
-// ordinarily consider as candidates. But we are willing to shed to from
-// overloadUrgent => {overloadSlow, loadNoChange} or overloadSlow =>
-// loadNoChange, when absolutely necessary. This necessity is defined by the
-// fact that we didn't have any candidate in an earlier or this set that was
-// ignored because of pending changes. Because if a candidate was ignored
-// because of pending work, we want to wait for that pending work to finish
-// and then see if we can transfer to those. Note that we used the condition
-// cand.maxFractionPendingIncrease>epsilon and not
-// cand.maxFractionPendingIncrease>=maxFractionPendingThreshold when setting
-// discardedCandsHadNoPendingChanges. This is an additional conservative
-// choice, since pending added work is slightly inflated in size, and we
-// want to have a true picture of all of these potential candidates before
-// we start using the ones with load >= loadNoChange.
-if lowestLoad > loadThreshold {
-log.VInfof(ctx, 2, "sortTargetCandidateSetAndPick: no candidates due to exceeding loadThreshold")
-return 0
+// If this set has some members that are load >= loadNoChange, we have a set
+// that we would not ordinarily consider as candidates. But we are willing
+// to shed to from overloadUrgent => {overloadSlow, loadNoChange} or
+// overloadSlow => loadNoChange, when absolutely necessary. This necessity
+// is defined by the fact that we didn't have any candidate in an earlier or
+// this set that was ignored because of pending changes. Because if a
+// candidate was ignored because of pending work, we want to wait for that
+// pending work to finish and then see if we can transfer to those. Note
+// that we used the condition cand.maxFractionPendingIncrease>epsilon and
+// not cand.maxFractionPendingIncrease>=maxFractionPendingThreshold when
+// setting discardedCandsHadNoPendingChanges. This is an additional
+// conservative choice, since pending added work is slightly inflated in
+// size, and we want to have a true picture of all of these potential
+// candidates before we start using the ones with load >= loadNoChange.
+if lowestLoadSet > loadThreshold {
+panic("candidates should not have lowestLoad > loadThreshold")
 }
-if lowestLoad == loadThreshold && ignoreLevel != ignoreHigherThanLoadThreshold {
+// INVARIANT: lowestLoad <= loadThreshold.
+if lowestLoadSet == loadThreshold && ignoreLevel < ignoreHigherThanLoadThreshold {
 log.VInfof(ctx, 2, "sortTargetCandidateSetAndPick: no candidates due to equal to loadThreshold")
 return 0
 }
+// INVARIANT: lowestLoad < loadThreshold ||
+// (lowestLoad <= loadThreshold && ignoreLevel >= ignoreHigherThanLoadThreshold).
+
 // < loadNoChange is fine. We need to check whether the following cases can continue.
 // [loadNoChange, loadThreshold), or loadThreshold && ignoreHigherThanLoadThreshold.
-if lowestLoad >= loadNoChange &&
+if lowestLoadSet >= loadNoChange &&
 (!discardedCandsHadNoPendingChanges || ignoreLevel == ignoreLoadNoChangeAndHigher) {
 log.VInfof(ctx, 2, "sortTargetCandidateSetAndPick: no candidates due to loadNoChange")
 return 0
 }
-// Candidates have equal load value and sorted by non-decreasing
-// leasePreferenceIndex. Eliminate ones that have
-// notMatchedLeasePreferenceIndex.
+if lowestLoadSet != highestLoadSet {
+slices.SortFunc(cands.candidates, func(a, b candidateInfo) int {
+return cmp.Or(
+cmp.Compare(a.leasePreferenceIndex, b.leasePreferenceIndex),
+cmp.Compare(a.StoreID, b.StoreID))
+})
+}
+// Candidates are sorted by non-decreasing leasePreferenceIndex. Eliminate
+// ones that have notMatchedLeasePreferenceIndex.
 j = 0
 for _, cand := range cands.candidates {
 if cand.leasePreferenceIndex == notMatchedLeasePreferencIndex {
@@ -1174,8 +1227,10 @@ func sortTargetCandidateSetAndPick(
 return 0
 }
 cands.candidates = cands.candidates[:j]
-if lowestLoad >= loadNoChange && overloadedDim != NumLoadDimensions {
-// Sort candidates from lowest to highest along overloaded dimension.
+if lowestLoadSet != highestLoadSet || (lowestLoadSet >= loadNoChange && overloadedDim != NumLoadDimensions) {
+// Sort candidates from lowest to highest along overloaded dimension. We
+// limit when we do this, since this will further restrict the pool of
+// candidates and in general we don't want to restrict the pool.
 slices.SortFunc(cands.candidates, func(a, b candidateInfo) int {
 return cmp.Compare(a.dimSummary[overloadedDim], b.dimSummary[overloadedDim])
 })
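To make the set-boundary semantics fixed in these hunks easier to follow, here is a reduced, standalone Go sketch of the iteration order. The candidate type and filterBySet function are simplified stand-ins invented for illustration, not the real candidateInfo handling in sortTargetCandidateSetAndPick.

```go
package main

import "fmt"

// candidate is a simplified stand-in for candidateInfo: sls groups candidates
// into equivalence classes (lower means less loaded), and pendingChanges marks
// a candidate that is discarded while it still has pending work.
type candidate struct {
	storeID        int
	sls            int
	discarded      bool
	pendingChanges bool
}

// filterBySet mimics the fixed iteration order: within a set (equal sls) every
// candidate is considered, and we only advance to the next set if none of the
// candidates discarded so far had pending changes. The input is assumed sorted
// by sls.
func filterBySet(cands []candidate) []int {
	var picked []int
	discardedHadNoPendingChanges := true
	currentSet := cands[0].sls
	for _, c := range cands {
		if c.sls > currentSet {
			if !discardedHadNoPendingChanges {
				// Stop at the set boundary, not in the middle of a set.
				break
			}
			currentSet = c.sls
		}
		if c.discarded {
			if c.pendingChanges {
				discardedHadNoPendingChanges = false
			}
			continue
		}
		picked = append(picked, c.storeID)
	}
	return picked
}

func main() {
	// s1 and s2 share a set; s1 is discarded with pending changes. Before the
	// fix, s2 would have been skipped; now it is still considered, and only
	// the next set (containing s3) is cut off.
	cands := []candidate{
		{storeID: 1, sls: 1, discarded: true, pendingChanges: true},
		{storeID: 2, sls: 1},
		{storeID: 3, sls: 2},
	}
	fmt.Println(filterBySet(cands)) // [2]
}
```

Running this prints [2]: s2 is still considered even though s1 in the same set was discarded with pending changes, and only the later set is cut off, which is the behavior change described in the final commit message above.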
