+Summary+
Adds a new event-driven mode for aborting SyncWrites which have
exceeded their durability timeout. This has a much lower idle overhead
compared to the current polling method.
The default mode remains "polling"; a subsequent patch will change the
default to "event-driven".
+Background+
When SyncWrites were introduced in 6.5.0, each SyncWrite request was
given an associated timeout - if the SyncWrite cannot be completed
(Committed or Aborted) within that time, it is aborted and the client
is informed that it was not successful.
This was implemented with simple (naive?) polling - a per-Bucket
NonIO task is scheduled to run every 25ms (by default), and each time
it runs it checks every vBucket for any pending SyncWrites which have
now exceeded their timeout.
Functionally this works fine; however, it is relatively expensive -
every 25ms we must iterate across every vBucket of every Bucket and
call into the DurabilityMonitor to check for SyncWrites which should
be timed out. This happens irrespective of whether any SyncWrites are
overdue, or indeed whether there are any SyncWrites at all.
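To make that polling shape concrete, here is a minimal, self-contained
sketch. The names (SyncWriteModel, VBucketModel, pollDurabilityTimeouts)
are hypothetical stand-ins, not kv_engine's actual DurabilityTimeoutTask
or DurabilityMonitor classes; it only illustrates the "wake every
interval and scan every vBucket" behaviour described above.

```cpp
#include <chrono>
#include <vector>

using Clock = std::chrono::steady_clock;

struct SyncWriteModel {
    Clock::time_point deadline; // point at which this SyncWrite must be aborted
};

struct VBucketModel {
    std::vector<SyncWriteModel> trackedWrites; // oldest first

    // Abort (drop) any SyncWrites whose deadline has passed.
    void abortExpired(Clock::time_point now) {
        while (!trackedWrites.empty() &&
               trackedWrites.front().deadline <= now) {
            trackedWrites.erase(trackedWrites.begin());
        }
    }
};

// The polling model: runs every durability_timeout_task_interval (25ms by
// default) and touches every vBucket, even when no SyncWrites exist at all.
void pollDurabilityTimeouts(std::vector<VBucketModel>& vbuckets) {
    const auto now = Clock::now();
    for (auto& vb : vbuckets) {
        vb.abortExpired(now);
    }
}
```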
For example, an idle node with 10 Buckets shows 35% CPU utilization -
the vast majority of which is in NonIO threads running the
DurabilityTimeoutTask.
This is obviously undesirable - and the issue scales with even larger
bucket counts.
+Solution+
To reduce the idle CPU usage, change from a polling to an event-driven
model - have a per-vBucket task which is scheduled to run only when
the next SyncWrite for that vBucket is due to time out. We only need 1
task per vBucket (and not 1 per SyncWrite) because SyncWrites (within
a vBucket) must always complete in-order; therefore we only need to
consider the timeout of the oldest SyncWrite in the
ActiveDurabilityMonitor for a given vBucket.
This task will only be executed _if_ the next SyncWrite isn't
otherwise Committed before the timeout - when the SyncWrite is
Committed, the task will be re-scheduled to run when the _next_
SyncWrite is due - or cancelled if there are no more SyncWrites in
progress (for the vBucket).
As such, the CPU cost for SyncWrite timeout handling when the Bucket
is idle goes to zero - nothing is executed.
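By contrast with the polling sketch earlier, a minimal sketch of the
event-driven shape (again using hypothetical names rather than
kv_engine's real classes): one timeout deadline is armed per vBucket,
and it is re-armed or cancelled whenever the head of the tracked-writes
queue changes, so an idle vBucket has nothing scheduled at all.

```cpp
#include <chrono>
#include <deque>
#include <optional>

using Clock = std::chrono::steady_clock;

struct EventDrivenTimeoutModel {
    std::deque<Clock::time_point> trackedDeadlines; // oldest (head) first
    std::optional<Clock::time_point> taskWakeTime;  // armed wake-up, if any

    // A new SyncWrite has been accepted with the given timeout deadline.
    void addSyncWrite(Clock::time_point deadline) {
        trackedDeadlines.push_back(deadline);
        rescheduleFromHead();
    }

    // The oldest SyncWrite was Committed (or Aborted) before its deadline.
    // Precondition: at least one SyncWrite is in flight.
    void completeOldest() {
        trackedDeadlines.pop_front();
        rescheduleFromHead();
    }

    // The executor fires the task only when the armed wake-up time passes.
    void onTaskFired(Clock::time_point now) {
        while (!trackedDeadlines.empty() && trackedDeadlines.front() <= now) {
            trackedDeadlines.pop_front(); // abort the timed-out SyncWrite
        }
        rescheduleFromHead();
    }

private:
    // Arm the per-vBucket wake-up for the oldest deadline, or cancel it when
    // nothing is in flight - so an idle vBucket costs nothing.
    void rescheduleFromHead() {
        if (trackedDeadlines.empty()) {
            taskWakeTime.reset(); // cancel
        } else {
            taskWakeTime = trackedDeadlines.front(); // (re)schedule
        }
    }
};
```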
There are some additional costs with the event-driven approach:
1. Additional CPU cost whenever the ActiveDM::trackedWrites container
changes (specifically when the head changes), as we must reschedule
or cancel the new per-vBucket task. However, that costs less than 1
microsecond with the default FollyExecutorPool, so it is likely
dwarfed by the other activity around adding / Committing SyncWrites.
2. Additional memory footprint for 1024 Tasks instead of 1 (per
Bucket). Note that this is relatively insignificant - each
ExpiredSWCallback task is 96 bytes, so we have only increased each
Bucket by at most 96KB (only active vBuckets have an
ExpiredSWCallback).
Change-Id: Ia70a68f4d1551a3407c8bdbb56e91eb5f5f995e2
Reviewed-on: http://review.couchbase.org/c/kv_engine/+/130419
Reviewed-by: Ben Huddleston <[email protected]>
Reviewed-by: Paolo Cocchi <[email protected]>
Tested-by: Build Bot <[email protected]>
engines/ep/configuration.json (17 additions & 1 deletion)
@@ -467,11 +467,27 @@
         "dynamic": true,
         "type": "size_t"
     },
+    "durability_timeout_mode": {
+        "default": "polling",
+        "descr": "How should durability timeouts be scheduled? polling=periodic task running every 'durability_timeout_task_interval'; event-driven=per-VBucket tasks scheduled based on when next SyncWrite will time out.",
+        "dynamic": false,
+        "type": "std::string",
+        "validator": {
+            "enum": [
+                "polling",
+                "event-driven"
+            ]
+        }
+    },
     "durability_timeout_task_interval": {
         "default": "25",
         "descr": "Interval (in ms) between subsequent runs of the DurabilityTimeoutTask",