Commit 88a26d6

jimwwalker authored and daverigby committed
MB-29928: Implement auto controller logic for the defragmenter
With changes in 7.0 to memory tracking, we now have visibility of an individual bucket's fragmentation, whereas pre 7.0 we only had visibility of the entire process. This commit uses the bucket fragmentation to calculate the sleep interval of the defragger. The overall idea is that as a bucket's fragmentation gets worse, the sleep time reduces. The defragger then runs more frequently, visiting more items and bringing the fragmentation down.

The commit introduces two new modes of automatic calculation. Two are provided because the second, PID mode, is more experimental. Ultimately, once it has had some soak time, one mode can remain in the code. The modes are as follows and can be selected in the bucket config (a future patch makes them runtime switchable via cbepctl).

1) auto_linear - Use a 'static' and predictable calculation for converting fragmentation into a reduction in sleep time.
2) auto_pid - Use a PID controller to calculate reductions in sleep time. This is less predictable because real time is a factor in the calculation, so scheduling delays etc. result in unpredictable outputs.

The existing mode (just use defragmenter_interval) is named "static".

Both auto modes work by taking the bucket fragmentation as a percentage and scaling it against the bucket's low-water mark to create a 'score', which then determines how the sleep interval is calculated. The result is that when fragmentation may be high but rss is actually small (lots of headroom before the low-water mark) the score is low, whilst as rss approaches the low-water mark the score increases.

E.g. fragmentation 23% (allocated:500, rss:650), then with a low-water mark of n the value used in calculations (score):

n    | score
600  | 23 (rss > low-water)
1000 | 14.95
2000 | 7.4
3000 | 4.98
5000 | 2.99

A spreadsheet with numerous scenarios and the score can be found here: https://docs.google.com/spreadsheets/d/1W72N2vbrfa5xOVFmS0e3tpFCcEyd8kPk8fqMNmuM1k8/edit#gid=0

auto_linear:

This mode takes the score and a range. Below the range the maximum sleep is used; above the range the minimum sleep is used. When the score is within the range we find how far into the range the score is, e.g. 20%, and map that to a point 20% of the way from max sleep towards min sleep.

Here the following configuration parameters are being used:

* defragmenter_auto_min_sleep 0.0
* defragmenter_auto_max_sleep 10.0
* defragmenter_auto_lower_threshold 0.07
* defragmenter_auto_upper_threshold 0.25

auto_pid:

This mode uses a single configurable threshold, and when the score exceeds that threshold the PID calculates an output. The returned sleep time is the maximum minus the output, but never less than the configured minimum.

The PID itself is configured at runtime and the commit uses values for P, I, D and dt based on examination of the "pathogen" performance test and use of the `pid_runner` program, which allows for some examination of P, I and D. The assumption is that fragmentation doesn't increase quickly, hence the I and dt terms force the PID to only recalculate every 30 seconds with a 'slow' output.

Here the following configuration parameters are being used:

* defragmenter_auto_min_sleep 0.0
* defragmenter_auto_max_sleep 10.0
* defragmenter_auto_lower_threshold 0.07
* defragmenter_auto_pid_p 0.3
* defragmenter_auto_pid_i 0.0000197
* defragmenter_auto_pid_d 0.0
* defragmenter_auto_pid_dt 30000

These values have been used in the pid_runner test and were chosen based on the observation that fragmentation in real workloads increases slowly.

The pathogen test is useful for testing defragmentation, but may not be truly representative of real fragmentation growth. For example, that test achieves fragmentation greater than 35% in a very short time, but is operating on a small amount of data: mem_used ranges from ~200MB to ~600MB.

First dt: given the observation that fragmentation generally increases slowly, the dt term controls the rate at which the PID reads the Process Variable (PV, or in our case scored fragmentation) and reacts. Thus 30 seconds will elapse before the PID computes a new output value. If the PV were changing at faster rates, the dt term would be reduced.

P I D values: using pid_runner (in its committed state), a number of scenarios were compared where the PV is at a fixed multiple of the set-point (SP). These scenarios guided the current values of P, I and D. For example, when the PV is 1.1x of SP it would take the PID ~20 hours to reduce the sleep interval to min (0.0). When the PV is 2.6x of SP it would take the PID ~75 minutes to reduce the sleep interval to min (0.0).

PV x | time to min sleep
1.1  | 20h:8m:31s
1.2  | 10h:4m:31s
1.5  | 4h:1m:31s
1.8  | 2h:31m:1s
2.0  | 2h:1m:1s
2.3  | 1h:33m:1s
2.6  | 1h:15m:31s
2.9  | 1h:3m:31s
3.0  | 1h:0m:31s
3.3  | 0h:52m:31s
3.5  | 0h:48m:31s

A final note on the use of a PID. Typical use of a PID would be in systems where the 'process variable' can be influenced in positive and negative ways. E.g. a temperature could be controlled by heating or not heating (or forced cooling). In our use-case we can influence fragmentation down (by running the defragger), but we cannot raise fragmentation to the set-point, i.e. our use of a PID cannot maintain a level of fragmentation. This is why in the code, once the fragmentation (score) drops below the lower threshold, the PID just resets and the max sleep is used.

Change-Id: Ia67d789dc38e0c649d2e7cf8cea945f8f67b711e
Reviewed-on: http://review.couchbase.org/c/kv_engine/+/155961
Tested-by: Build Bot <[email protected]>
Reviewed-by: Dave Rigby <[email protected]>
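
The score calculation described in the message can be sketched as below. This is a minimal standalone illustration, not the committed code: `FragStats` and `scoredFragmentation` are hypothetical names, and the fragmentation ratio is assumed to be (rss - allocated) / rss, which agrees with the 23% quoted for allocated:500, rss:650.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the allocator's fragmentation statistics.
struct FragStats {
    std::size_t allocatedBytes;
    std::size_t residentBytes;

    // Assumed definition: the share of resident memory that is not allocated.
    double fragmentationRatio() const {
        assert(residentBytes > 0 && residentBytes >= allocatedBytes);
        return double(residentBytes - allocatedBytes) / double(residentBytes);
    }
};

// Scale the fragmentation ratio by how close rss is to the low-water mark.
// rss is clamped to the low-water mark, so the scale factor never exceeds 1.
double scoredFragmentation(const FragStats& stats, std::size_t lowWater) {
    assert(lowWater > 0);
    const std::size_t rss = std::min(stats.residentBytes, lowWater);
    return stats.fragmentationRatio() * (double(rss) / double(lowWater));
}
```

For allocated 500 and rss 650 this gives roughly 0.2308 with a low-water mark of 600 (rss is clamped, so the score equals the raw ratio) and 0.15 with a low-water mark of 1000. The table's 14.95 arises from rounding the ratio to 23% before scaling.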
1 parent 6d42ccb commit 88a26d6

9 files changed, +786 -20 lines changed

engines/ep/CMakeLists.txt

Lines changed: 13 additions & 0 deletions
@@ -92,6 +92,9 @@ add_executable(gencode tools/gencode.cc)
 kv_enable_pch(gencode)
 add_executable(genconfig tools/genconfig.cc)
 kv_enable_pch(genconfig)
+
+add_executable(pid_runner tools/pid_runner.cc $<TARGET_OBJECTS:ep_objs>)
+
 if (WIN32)
     # windows need getopt
     target_link_libraries(gencode PRIVATE platform)
@@ -128,6 +131,15 @@ target_link_libraries(kvstore_gen PRIVATE
         ${LIBEVENT_LIBRARIES}
 )

+target_link_libraries(pid_runner
+        ep-engine_collections
+        mcd_executor
+        mcbp
+        mcd_time
+        mcd_tracing
+        xattr
+        ${EP_STORAGE_LIBS})
+
 ADD_CUSTOM_COMMAND(OUTPUT ${CMAKE_CURRENT_BINARY_DIR}/src/stats-info.c
                           ${CMAKE_CURRENT_BINARY_DIR}/src/stats-info.h
                   COMMAND
@@ -350,6 +362,7 @@ ADD_LIBRARY(ep_objs OBJECT
             src/mutation_log_entry.cc
             src/paging_visitor.cc
             src/persistence_callback.cc
+            src/pid_controller.cc
             src/pre_link_document_context.cc
             src/pre_link_document_context.h
             src/progress_tracker.cc

engines/ep/configuration.json

Lines changed: 64 additions & 4 deletions
@@ -315,10 +315,9 @@
     "dcp_noop_mandatory_for_v5_features": {
         "default": "true",
         "descr": "Forces clients to enable noop for v5 features",
-        "dynamic": true,
+        "dynamic": true,
         "type": "bool"
     },
-
     "defragmenter_enabled": {
         "default": "true",
         "descr": "True if defragmenter task is enabled",
@@ -328,7 +327,7 @@
     "defragmenter_interval": {
         "default": "10.0",
         "descr": "How often defragmenter task should be run (in seconds).",
-        "dynamic": true,
+        "dynamic": true,
         "type": "float"
     },
     "defragmenter_age_threshold": {
@@ -346,14 +345,75 @@
     "defragmenter_chunk_duration": {
         "default": "20",
         "descr": "Maximum time (in ms) defragmentation task will run for before being paused (and resumed at the next defragmenter_interval).",
-        "dynamic": true,
+        "dynamic": true,
         "type": "size_t",
         "validator": {
             "range": {
                 "min": 1
             }
         }
     },
+    "defragmenter_mode" : {
+        "default": "auto_pid",
+        "descr": "Determines how the defragmenter controls its sleep interval. When static defragmenter_interval is used. When auto_linear, scale the sleep time using a scored defragmentation when it falls between defragmenter_auto_lower_trigger and defragmenter_auto_upper_trigger. When auto_pid use a PID controller to computer reductions in the sleep interval when scored fragmentation is above defragmenter_auto_lower_trigger.",
+        "dynamic": false,
+        "type": "std::string",
+        "validator": {
+            "enum": [
+                "static",
+                "auto_linear",
+                "auto_pid"
+            ]
+        }
+    },
+    "defragmenter_auto_lower_threshold" : {
+        "default": "0.07",
+        "descr": "When mode is not static and scored fragmentation is above this value, a sleep time between defragmenter_auto_min_sleep and defragmenter_auto_max_sleep will be used",
+        "dynamic": false,
+        "type": "float"
+    },
+    "defragmenter_auto_upper_threshold" : {
+        "default": "0.25",
+        "descr": "When mode is auto_linear and scored fragmentation is above this value, the defragmenter will use defragmenter_auto_min_sleep",
+        "dynamic": false,
+        "type": "float"
+    },
+    "defragmenter_auto_max_sleep" : {
+        "default": "10.0",
+        "descr": "The maximum sleep that the auto controller can set",
+        "dynamic": false,
+        "type": "float"
+    },
+    "defragmenter_auto_min_sleep" : {
+        "default": "0.0",
+        "descr": "The minimum sleep that the auto controller can set",
+        "dynamic": false,
+        "type": "float"
+    },
+    "defragmenter_auto_pid_p" : {
+        "default": "0.3",
+        "descr": "The p term for the PID controller",
+        "dynamic": false,
+        "type": "float"
+    },
+    "defragmenter_auto_pid_i" : {
+        "default": "0.0000197",
+        "descr": "The i term for the PID controller",
+        "dynamic": false,
+        "type": "float"
+    },
+    "defragmenter_auto_pid_d" : {
+        "default": "0.0",
+        "descr": "The d term for the PID controller",
+        "dynamic": false,
+        "type": "float"
+    },
+    "defragmenter_auto_pid_dt" : {
+        "default": "30000",
+        "descr": "The dt (interval) term for the PID controller. Value represents milliseconds",
+        "dynamic": false,
+        "type": "size_t"
+    },
     "durability_timeout_task_interval": {
         "default": "25",
         "descr": "Interval (in ms) between subsequent runs of the DurabilityTimeoutTask",
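
To illustrate how the auto_linear defaults above interact, here is a minimal standalone sketch of the score-to-sleep mapping. `LinearConfig`, `SleepAndRun` and `linearSleep` are hypothetical names chosen for this example, not part of the commit. One deliberate difference, stated as an assumption: this sketch treats a score exactly equal to the lower threshold as the start of the range (maximum sleep, defragger runs), whereas the committed `calculateSleepLinear` uses strict comparisons and sends that exact value to its final `else` branch.

```cpp
#include <cassert>

// Hypothetical holder for the four auto_linear settings, using the
// defaults from configuration.json.
struct LinearConfig {
    double minSleep = 0.0;   // defragmenter_auto_min_sleep
    double maxSleep = 10.0;  // defragmenter_auto_max_sleep
    double lower = 0.07;     // defragmenter_auto_lower_threshold
    double upper = 0.25;     // defragmenter_auto_upper_threshold
};

struct SleepAndRun {
    double sleepSeconds;
    bool runDefragger;
};

// Map a fragmentation score onto a sleep time by linear interpolation.
SleepAndRun linearSleep(double score, const LinearConfig& c = {}) {
    assert(c.upper > c.lower && c.maxSleep >= c.minSleep);
    if (score < c.lower) {
        // Fragmentation is acceptable: sleep longest and skip this run.
        return {c.maxSleep, false};
    }
    if (score >= c.upper) {
        // Fragmentation is at or beyond the range: sleep as little as allowed.
        return {c.minSleep, true};
    }
    // Fraction of the way from lower to upper, in [0, 1).
    const double fraction = (score - c.lower) / (c.upper - c.lower);
    return {c.maxSleep - fraction * (c.maxSleep - c.minSleep), true};
}
```

With the defaults, a score of 0.16 sits halfway between 0.07 and 0.25, so the sketch returns a sleep of about 5 seconds with the defragger enabled.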

engines/ep/src/defragmenter.cc

Lines changed: 122 additions & 12 deletions
@@ -25,22 +25,40 @@ DefragmenterTask::DefragmenterTask(EventuallyPersistentEngine* e,
                                    EPStats& stats_)
     : GlobalTask(e, TaskId::DefragmenterTask, 0, false),
       stats(stats_),
-      epstore_position(engine->getKVBucket()->startPosition()) {
+      epstore_position(engine->getKVBucket()->startPosition()),
+      pid(engine->getConfiguration().getDefragmenterAutoLowerThreshold(),
+          engine->getConfiguration().getDefragmenterAutoPidP(),
+          engine->getConfiguration().getDefragmenterAutoPidI(),
+          engine->getConfiguration().getDefragmenterAutoPidD(),
+          std::chrono::milliseconds{
+                  engine->getConfiguration().getDefragmenterAutoPidDt()}) {
 }

 bool DefragmenterTask::run() {
     TRACE_EVENT0("ep-engine/task", "DefragmenterTask");
+    std::chrono::duration<double> sleepTime;
     if (engine->getConfiguration().isDefragmenterEnabled()) {
-        defrag();
+        sleepTime = defrag();
+    } else {
+        sleepTime = std::chrono::duration<double>{
+                engine->getConfiguration().getDefragmenterInterval()};
     }
-    snooze(getSleepTime());
+    snooze(sleepTime.count());
     if (engine->getEpStats().isShutdown) {
         return false;
     }
     return true;
 }

-void DefragmenterTask::defrag() {
+std::chrono::duration<double> DefragmenterTask::defrag() {
+    auto currentFragStats = cb::ArenaMalloc::getFragmentationStats(
+            engine->getArenaMallocClient());
+
+    auto sleepAndRun = calculateSleepTimeAndRunState(currentFragStats);
+    if (!sleepAndRun.runDefragger) {
+        return sleepAndRun.sleepTime;
+    }
+
     // Get our pause/resume visitor. If we didn't finish the previous pass,
     // then resume from where we last were, otherwise create a new visitor
     // starting from the beginning.
@@ -61,11 +79,9 @@ void DefragmenterTask::defrag() {
             ss << " resuming from " << epstore_position << ", ";
             ss << prAdapter->getHashtablePosition() << ".";
         }
-        auto fragStats = cb::ArenaMalloc::getFragmentationStats(
-                engine->getArenaMallocClient());
         ss << " Using chunk_duration=" << getChunkDuration().count() << " ms."
            << " mem_used=" << stats.getEstimatedTotalMemoryUsed() << ", "
-           << fragStats;
+           << currentFragStats;
         EP_LOG_DEBUG("{}", ss.str());
     }

@@ -119,20 +135,22 @@ void DefragmenterTask::defrag() {
         std::chrono::microseconds duration =
                 std::chrono::duration_cast<std::chrono::microseconds>(end -
                                                                       start);
-        auto fragStats = cb::ArenaMalloc::getFragmentationStats(
-                engine->getArenaMallocClient());
+
         ss << " Took " << duration.count() << " us."
            << " moved " << visitor.getDefragCount() << "/"
            << visitor.getVisitedCount() << " visited documents."
            << " mem_used=" << stats.getEstimatedTotalMemoryUsed() << ", "
-           << fragStats << ". Sleeping for " << getSleepTime() << " seconds.";
+           << cb::ArenaMalloc::getFragmentationStats(
+                      engine->getArenaMallocClient())
+           << ". Sleeping for " << sleepAndRun.sleepTime.count() << " seconds.";
         EP_LOG_DEBUG("{}", ss.str());
     }

     // Delete(reset) visitor if it finished.
     if (completed) {
         prAdapter.reset();
     }
+    return sleepAndRun.sleepTime;
 }

 void DefragmenterTask::stop() {
@@ -154,8 +172,17 @@ std::chrono::microseconds DefragmenterTask::maxExpectedDuration() const {
     return getChunkDuration() * 10;
 }

-double DefragmenterTask::getSleepTime() const {
-    return engine->getConfiguration().getDefragmenterInterval();
+DefragmenterTask::SleepTimeAndRunState
+DefragmenterTask::calculateSleepTimeAndRunState(
+        const cb::FragmentationStats& fragStats) {
+    if (engine->getConfiguration().getDefragmenterMode() == "auto_linear") {
+        return calculateSleepLinear(fragStats);
+    } else if (engine->getConfiguration().getDefragmenterMode() == "auto_pid") {
+        return calculateSleepPID(fragStats);
+    }
+    return {std::chrono::duration<double>{
+                    engine->getConfiguration().getDefragmenterInterval()},
+            true};
 }

 size_t DefragmenterTask::getAgeThreshold() const {
@@ -197,3 +224,86 @@ std::chrono::milliseconds DefragmenterTask::getChunkDuration() const {
 DefragmentVisitor& DefragmenterTask::getDefragVisitor() {
     return dynamic_cast<DefragmentVisitor&>(prAdapter->getHTVisitor());
 }
+
+float DefragmenterTask::getScoredFragmentation(
+        const cb::FragmentationStats& fragStats) const {
+    auto lowWater = stats.mem_low_wat.load();
+    auto rss = fragStats.getResidentBytes() > lowWater
+                       ? lowWater
+                       : fragStats.getResidentBytes();
+    return fragStats.getFragmentationRatio() * (double(rss) / double(lowWater));
+}
+
+DefragmenterTask::SleepTimeAndRunState DefragmenterTask::calculateSleepLinear(
+        const cb::FragmentationStats& fragStats) {
+    auto score = getScoredFragmentation(fragStats);
+    bool runDefragger = true;
+
+    const auto& conf = engine->getConfiguration();
+    double rv = 0.0;
+    auto maxSleep = conf.getDefragmenterAutoMaxSleep();
+    auto minSleep = conf.getDefragmenterAutoMinSleep();
+    auto lower = conf.getDefragmenterAutoLowerThreshold();
+    auto upper = conf.getDefragmenterAutoUpperThreshold();
+
+    // Is the 'score' in the range where we will look to reduce sleep by
+    // some amount in relation to how 'bad' the score is?
+    if (score > lower && score < upper) {
+        // Calculate the error (distance from lower)
+        auto error = (score - lower);
+
+        // How many % of our error range is that?
+        auto ePerc = (error / (upper - lower)) * 100.0;
+
+        // And now find the % of the sleep range
+        auto t = ((maxSleep - minSleep) / 100) * ePerc;
+
+        // Finally we will return maxSleep - t. As t gets larger the sleep time
+        // is smaller
+        rv = maxSleep - t;
+    } else if (score < lower) {
+        rv = maxSleep;
+        runDefragger = false;
+    } else {
+        rv = minSleep;
+    }
+
+    return {std::chrono::duration<double>{rv}, runDefragger};
+}
+
+DefragmenterTask::SleepTimeAndRunState DefragmenterTask::calculateSleepPID(
+        const cb::FragmentationStats& fragStats) {
+    auto score = getScoredFragmentation(fragStats);
+    const auto& conf = engine->getConfiguration();
+    auto maxSleep = conf.getDefragmenterAutoMaxSleep();
+    auto minSleep = conf.getDefragmenterAutoMinSleep();
+
+    // If fragmentation goes below our set-point (SP), we can't continue to use
+    // the PID. More general usage and it would be used to "speed up/slow down"
+    // to reach the SP. We can't now force defragmentation up, we're just happy
+    // it's below the SP. In this case reset and when we go over again begin
+    // the ramping
+    if (score < conf.getDefragmenterAutoLowerThreshold()) {
+        // Reset the PID ready for the next time fragmentation increases
+        pid.reset();
+        return {std::chrono::duration<double>{maxSleep}, false};
+    }
+
+    // Above setpoint, use the PID to calculate a correction. This will return
+    // a negative value
+    auto correction = stepPid(score);
+
+    // Add the negative to produce a sleep time
+    auto rv = maxSleep + correction;
+
+    // Don't go below the minimum sleep
+    if (rv < minSleep) {
+        rv = minSleep;
+    }
+
+    return {std::chrono::duration<double>{rv}, true};
+}
+
+float DefragmenterTask::stepPid(float pv) {
+    return pid.step(pv);
+}
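
The PID behaviour used by `calculateSleepPID` can be explored with a small standalone simulation. The sketch below is an assumption-laden illustration, not the `PIDController` added by this commit (its source is not shown here): `SimplePid` and `stepsToMinSleep` are hypothetical names, the controller is the textbook discrete form with error = SP - PV and the integral accumulated as error x dt in milliseconds, and each call to `step` is assumed to represent one dt interval rather than reading a real clock.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical textbook PID controller; one step() call models one dt tick.
class SimplePid {
public:
    SimplePid(double setPoint, double kp, double ki, double kd, double dtMs)
        : setPoint(setPoint), kp(kp), ki(ki), kd(kd), dtMs(dtMs) {
        assert(dtMs > 0.0);
    }

    // Returns the correction; negative whenever pv is above the set-point.
    double step(double pv) {
        const double error = setPoint - pv;
        integral += error * dtMs;
        const double derivative = (error - previousError) / dtMs;
        previousError = error;
        return kp * error + ki * integral + kd * derivative;
    }

    void reset() {
        integral = 0.0;
        previousError = 0.0;
    }

private:
    double setPoint;
    double kp;
    double ki;
    double kd;
    double dtMs;
    double integral = 0.0;
    double previousError = 0.0;
};

// Count the recalculations needed before maxSleep + correction reaches
// minSleep, with the PV held at a fixed multiple of the set-point and the
// default P, I, D and dt values from configuration.json.
std::size_t stepsToMinSleep(double pvMultiple) {
    assert(pvMultiple > 1.0); // at or below the set-point the loop never ends
    const double sp = 0.07;
    const double maxSleep = 10.0;
    const double minSleep = 0.0;
    SimplePid pid(sp, 0.3, 0.0000197, 0.0, 30000.0);

    std::size_t steps = 0;
    double sleep = maxSleep;
    while (sleep > minSleep) {
        sleep = maxSleep + pid.step(sp * pvMultiple);
        ++steps;
    }
    return steps;
}
```

Under these assumptions a PV held at 1.1x the set-point needs 2417 recalculations, which at one recalculation per 30 seconds is a little over 20 hours, and 2.0x needs 242 (about 2 hours). Both are close to the figures quoted in the commit message, which suggests the simplified model captures the dominant integral behaviour, though it is not a substitute for `pid_runner`.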
