# Periodic Full Compaction

Author: [Alex Feinberg](https://github.com/afeinberg)

Reviewers: [Connor](https://github.com/Connor1996),
[Tony](https://github.com/tonyxuqqi), [Andy](https://github.com/v01dstar),
others
## Introduction

**Periodic full compaction** is a scheduled task that starts at specified times
of the day on each TiKV node and compacts the column families of all regions *at
all levels, including the bottommost (L6)*. To reduce the impact that running
such a heavy-weight task would have on the cluster's ability to serve traffic,
full compaction is incremental: before the next range (presently a region) is
compacted, we check whether the load is below a certain threshold; if the
threshold is exceeded, we pause the task until the load falls below the
threshold again.
## Motivation

Presently, we have no way to periodically execute a RocksDB compaction across
all levels. This has a number of implications: compaction filters (used to
remove tombstones) only run when a bottommost-level (L6) compaction is
executed, so a system with a high number of deletes and a heavy read-only
workload (with little or no writes) might accumulate tombstone markers that
are never deleted.

Using `tikv-ctl compact-cluster` is not suitable for this goal: while it
executes a full compaction, it may impact online user traffic, which makes it
non-viable for production usage without downtime. Periodic full compaction
provides a way to achieve the same goal in a more controllable way.
## Detailed design

### Periodic scheduling

#### Configuration
To enable periodic full compaction, specify in TiKV's configuration the hours
during which full compaction should be scheduled to run, along with a maximum
CPU utilization threshold. CPU utilization is calculated from process stats in
`/proc` over a 10-minute window. (See *Conditions for full compaction to run*
below.)
> `tikv.toml` setting to run compaction at 03:00 and 23:00
> (3am and 11pm respectively) in the TiKV nodes' local timezone if CPU
> usage is below 90%:
>
> ```toml
> [raftstore]
> periodic-full-compact-start-max-cpu = 0.9
> periodic-full-compact-start-times = ["03:00", "23:00"]
> ```
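To make the schedule concrete, here is a minimal, self-contained sketch (not TiKV's actual parsing code; the helper names are hypothetical) of how `periodic-full-compact-start-times` entries such as `"03:00"` could be parsed and matched against the current wall-clock time, expressed as minutes since local midnight:

```rust
/// Hypothetical helper: parse "HH:MM" into minutes since midnight.
/// Returns None for malformed or out-of-range input.
fn parse_start_time(s: &str) -> Option<u32> {
    let (h, m) = s.split_once(':')?;
    let (h, m): (u32, u32) = (h.parse().ok()?, m.parse().ok()?);
    if h < 24 && m < 60 { Some(h * 60 + m) } else { None }
}

/// Hypothetical helper: true if `now_minutes` (minutes since local midnight)
/// matches one of the configured full-compaction start times.
fn is_scheduled_start(start_times: &[&str], now_minutes: u32) -> bool {
    start_times
        .iter()
        .filter_map(|s| parse_start_time(s))
        .any(|t| t == now_minutes)
}

fn main() {
    let times = ["03:00", "23:00"];
    // 03:00 is 180 minutes after midnight.
    assert!(is_scheduled_start(&times, 180));
    assert!(!is_scheduled_start(&times, 181));
    println!("schedule check ok");
}
```

The real implementation also has to deal with only firing once per configured hour; that bookkeeping is omitted here.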
### Executing a `PeriodicFullCompact` task

> `compact.rs`:
>
> ```rust
> pub enum Task {
>     PeriodicFullCompact {
>         ranges: Vec<(Key, Key)>,
>         compact_load_controller: FullCompactController,
>     },
>     // ...
> }
> ```
#### Choosing ranges

We use the ranges defined by the start and end keys of all of the store's
regions as increments:

> See `StoreFsmDelegate::ranges_for_full_compact` in `store.rs`:
>
> ```rust
>     /// Use ranges assigned to each region as increments for full compaction.
>     fn ranges_for_full_compact(&self) -> Vec<(Vec<u8>, Vec<u8>)> {
> ```
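As a rough illustration of the idea (a hypothetical helper, not the actual `ranges_for_full_compact` body), per-region increments can be derived from the sorted start keys of a store's regions, with the empty key standing in for an open boundary:

```rust
/// Hypothetical sketch: given the sorted start keys of a store's regions,
/// build per-region (start, end) ranges to use as full-compaction increments.
/// An empty key represents an open boundary (negative/positive infinity).
fn ranges_from_region_starts(starts: &[&[u8]]) -> Vec<(Vec<u8>, Vec<u8>)> {
    let mut ranges = Vec::with_capacity(starts.len());
    for (i, start) in starts.iter().enumerate() {
        // Each region ends where the next region begins; the last is open.
        let end: &[u8] = starts.get(i + 1).copied().unwrap_or(&[]);
        ranges.push((start.to_vec(), end.to_vec()));
    }
    ranges
}

fn main() {
    // Three regions: ["", "k2"), ["k2", "k5"), ["k5", "").
    let starts: [&[u8]; 3] = [b"", b"k2", b"k5"];
    let ranges = ranges_from_region_starts(&starts);
    assert_eq!(ranges.len(), 3);
    assert_eq!(ranges[1], (b"k2".to_vec(), b"k5".to_vec()));
    println!("built {} increments", ranges.len());
}
```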
#### Controlling full compaction

> `compact.rs`:
>
> ```rust
> pub struct FullCompactController {
>     /// Initial delay between retries for `FullCompactController::pause`.
>     pub initial_pause_duration_secs: u64,
>     /// Max delay between retries.
>     pub max_pause_duration_secs: u64,
>     /// Predicate function that indicates whether we can proceed with
>     /// full compaction.
>     pub incremental_compaction_pred: CompactPredicateFn,
> }
> ```
##### Conditions for full compaction to run

###### Compact predicate function (`CompactPredicateFn`)

> `CompactPredicateFn` is defined in `compact.rs` as an `Fn()` that returns true
> if it is safe to start compaction or to compact the next range in `ranges`.
>
> ```rust
> type CompactPredicateFn = Box<dyn Fn() -> bool + Send + Sync>;
> ```
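To illustrate the shape of this alias (illustrative only: the gauge, the threshold handling, and the helper name below are not TiKV's real implementation, which is described in the next subsection), a load-based predicate can be built as a boxed closure over a shared CPU gauge:

```rust
use std::sync::{
    atomic::{AtomicU64, Ordering},
    Arc,
};

// Same shape as the alias defined in compact.rs.
type CompactPredicateFn = Box<dyn Fn() -> bool + Send + Sync>;

/// Hypothetical constructor: the predicate returns true while the shared
/// CPU gauge (usage in percent) stays below `max_pct`.
fn low_cpu_predicate(cpu_pct: Arc<AtomicU64>, max_pct: u64) -> CompactPredicateFn {
    Box::new(move || cpu_pct.load(Ordering::Relaxed) < max_pct)
}

fn main() {
    let cpu = Arc::new(AtomicU64::new(25));
    let pred = low_cpu_predicate(Arc::clone(&cpu), 90);
    assert!(pred()); // 25% < 90%: safe to compact the next range
    cpu.store(95, Ordering::Relaxed);
    assert!(!pred()); // load too high: the caller should pause
    println!("predicate behaves as expected");
}
```

The `Send + Sync` bounds matter because the predicate is captured by the task and evaluated from the background worker pool.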
###### Using `CompactPredicateFn`

We evaluate the compaction predicate function in the following cases:

1. Before starting a full compaction task.

   See `StoreFsmDelegate::on_full_compact_tick` in `store.rs`, where we return
   early if the predicate returns false:

   ```rust
   // Do not start if the load is high.
   if !compact_predicate_fn() {
       return;
   }
   ```
2. After finishing an incremental full compaction for a range, if more ranges
   remain.

   See `CompactRunner::full_compact` in `compact.rs`, where we pause (see
   *Pausing* below) if the predicate returns false:

   ```rust
   if let Some(next_range) = ranges.front() {
       if !(compact_controller.incremental_compaction_pred)() {
           // ...
           compact_controller.pause().await?;
   ```
###### Load-based `CompactPredicateFn` implementation

> This is returned by `StoreFsmDelegate::is_low_load_for_full_compact` in
> `store.rs`, which checks that the raftstore is not busy and that the CPU
> usage over the last 10-minute window is within the threshold specified
> by `periodic-full-compact-start-max-cpu`.
>
> ```rust
> fn is_low_load_for_full_compact(&self) -> impl Fn() -> bool {
> ```
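The real implementation derives the 10-minute figure from process stats in `/proc`. As a simplified, self-contained sketch of the windowing idea (the type and its API are hypothetical), CPU samples can be kept in a fixed-size sliding window and averaged:

```rust
use std::collections::VecDeque;

/// Hypothetical sketch of a fixed-size sliding window over periodic CPU
/// samples, in the spirit of the 10-minute usage window described above.
struct CpuWindow {
    samples: VecDeque<f64>,
    capacity: usize,
}

impl CpuWindow {
    fn new(capacity: usize) -> Self {
        Self { samples: VecDeque::with_capacity(capacity), capacity }
    }

    /// Record a new sample, evicting the oldest one once the window is full.
    fn record(&mut self, usage: f64) {
        if self.samples.len() == self.capacity {
            self.samples.pop_front();
        }
        self.samples.push_back(usage);
    }

    /// Mean usage over the window; 0.0 when no samples have been recorded.
    fn average(&self) -> f64 {
        if self.samples.is_empty() {
            return 0.0;
        }
        self.samples.iter().sum::<f64>() / self.samples.len() as f64
    }
}

fn main() {
    // One sample per minute over a 10-minute window.
    let mut w = CpuWindow::new(10);
    for s in [0.2, 0.4, 0.6] {
        w.record(s);
    }
    assert!((w.average() - 0.4).abs() < 1e-9);
    println!("10-minute window average: {:.2}", w.average());
}
```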
##### Pausing

Full compaction tasks are intended to be long-running and may spend up to 15
minutes at a time waiting for `CompactPredicateFn` to evaluate to true. As in
other places in TiKV where we need to pause or sleep, we call
`GLOBAL_TIMER_HANDLE.delay` in an async context.
> `compact.rs`:
>
> ```rust
> impl FullCompactController {
>     pub async fn pause(&self) -> Result<(), Error> {
>         let mut duration_secs = self.initial_pause_duration_secs;
>         loop {
>             box_try!(
>                 GLOBAL_TIMER_HANDLE
>                     .delay(std::time::Instant::now() + Duration::from_secs(duration_secs))
>                     .compat()
>                     .await
>             );
>             if (self.incremental_compaction_pred)() {
>                 break;
>             };
>             // Double the delay on each retry, capped at the configured max.
>             duration_secs = self
>                 .max_pause_duration_secs
>                 .min(duration_secs * 2);
>         }
>         Ok(())
>     }
> }
> ```
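The backoff arithmetic can be isolated into a tiny helper to show the intended behavior (a sketch, not the actual code; the initial and maximum values below are illustrative): each retry doubles the delay, and `max_pause_duration_secs` caps it.

```rust
/// Sketch of the pause/backoff step: double the current delay,
/// but never exceed the configured maximum.
fn next_pause_secs(current: u64, max_pause: u64) -> u64 {
    max_pause.min(current * 2)
}

fn main() {
    // Illustrative values: start at 1 minute, cap at 15 minutes.
    let (initial, max) = (60u64, 900u64);
    let mut d = initial;
    let mut seen = vec![d];
    for _ in 0..5 {
        d = next_pause_secs(d, max);
        seen.push(d);
    }
    // 60 -> 120 -> 240 -> 480 -> 900 (capped) -> 900
    assert_eq!(seen, vec![60, 120, 240, 480, 900, 900]);
    println!("backoff sequence: {:?}", seen);
}
```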
##### Using the background worker to execute compaction

Since `FullCompactController::pause` is asynchronous (see *Pausing* above),
`PeriodicFullCompact` tasks are scheduled using the background worker pool. This
means that other cleanup and compaction tasks can run while full compaction is
paused.
> `store.rs`:
>
> ```rust
> impl RaftBatchSystem {
>     pub fn spawn(
>         &mut self,
>         // ...
>         background_worker: Worker,
>         // ...
>     ) -> Result<()> {
>         // ...
>         let bg_remote = background_worker.remote();
>         // ...
>         let compact_runner = CompactRunner::new(engines.kv.clone(), bg_remote);
>         // ...
> ```
>
> Using `Remote::spawn` to asynchronously execute full compaction in
> `yatp_pool::FuturePool`:
>
> ```rust
> impl Runnable for CompactRunner {
>     fn run(&mut self, task: Task) {
>         match task {
>             Task::PeriodicFullCompact {
>                 ranges,
>                 compact_load_controller,
>             } => {
>                 // ...
>                 let engine = self.engine.clone();
>                 self.remote.spawn(async move { // NOTE the use of `self.remote`
>                     if let Err(e) = Self::full_compact(engine, ranges, compact_load_controller).await {
>                         // ...
> ```
Note that `CompactRunner::full_compact` is an `async fn`, yet it invokes
RocksDB's manual compaction API, which blocks the current thread: this is
supported by `FuturePool` and happens in other places in our code.
### Full compaction of a range

We use `CompactExt::compact_range` to perform the compaction of each region,
which calls `compact_range_cf` on all the column families. Note that
`exclusive_manual` is `false` and `subcompactions` is `1` (meaning all
sub-compactions are executed on one RocksDB thread) to limit resource usage.
> From `full_compact` in `compact.rs`:
>
> ```rust
>     box_try!(engine.compact_range(
>         range.0, range.1,
>         false, // non-exclusive
>         1,     // number of sub-compaction threads
>     ));
> ```
>
> From `CompactExt` in `components/engine_traits/src/compact.rs`:
>
> ```rust
>     fn compact_range(
>         &self,
>         start_key: Option<&[u8]>,
>         end_key: Option<&[u8]>,
>         exclusive_manual: bool,
>         max_subcompactions: u32, // Controls the number of engine worker threads.
>     ) -> Result<()> {
>         for cf in self.cf_names() {
>             self.compact_range_cf(cf, start_key, end_key, exclusive_manual, max_subcompactions)?;
>         }
> ```
>
> The `RocksEngine` implementation of `compact_range_cf`. See [Manual
> Compaction](https://github.com/facebook/rocksdb/wiki/Manual-Compaction) in
> the RocksDB documentation for more info.
>
> ```rust
> let mut compact_opts = CompactOptions::new();
> compact_opts.set_exclusive_manual_compaction(exclusive_manual);
> compact_opts.set_max_subcompactions(max_subcompactions as i32);
> db.compact_range_cf_opt(handle, &compact_opts, start_key, end_key);
> ```
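Putting the pieces together, the incremental drive loop can be sketched as follows. This is a synchronous stand-in, not the real code: the actual `full_compact` is async, the engine call and `pause` are the ones shown earlier, and the function signature here is invented for illustration.

```rust
use std::collections::VecDeque;

/// Sketch of the incremental full-compaction driver: compact one range at a
/// time, and between increments consult the load predicate, pausing while
/// the store is busy. `compact_range` and `pause` are stand-ins for the
/// engine call and the async backoff described above.
fn drive_full_compaction(
    mut ranges: VecDeque<(Vec<u8>, Vec<u8>)>,
    can_proceed: &dyn Fn() -> bool,
    compact_range: &mut dyn FnMut(&[u8], &[u8]),
    pause: &mut dyn FnMut(),
) {
    while let Some((start, end)) = ranges.pop_front() {
        compact_range(&start, &end);
        // Only pause between increments, never after the last one.
        if !ranges.is_empty() && !can_proceed() {
            pause(); // in TiKV this awaits until the load drops again
        }
    }
}

fn main() {
    let ranges: VecDeque<_> = vec![
        (b"a".to_vec(), b"m".to_vec()),
        (b"m".to_vec(), b"z".to_vec()),
    ]
    .into();
    let mut compacted = 0;
    let mut pauses = 0;
    drive_full_compaction(
        ranges,
        &|| false, // always "busy": forces a pause between increments
        &mut |_start, _end| compacted += 1,
        &mut || pauses += 1,
    );
    assert_eq!((compacted, pauses), (2, 1));
    println!("compacted {} ranges with {} pause(s)", compacted, pauses);
}
```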
### Implementation

| Sub-task | Status |
| - | - |
| Periodically schedule full compaction | **Merged** [tikv/tikv#15853](https://github.com/tikv/tikv/pull/15853) |
| Incrementalism, pausing | **Merged** [tikv/tikv#15995](https://github.com/tikv/tikv/pull/15995) |
#### Alternatives considered

##### Compacting by file or level instead of a range

*Not applicable*: doing so would not guarantee that the compaction filters are
able to run.

##### Using a rate limiter to control load during full compaction

*Not applicable*: compaction happens in increments of regions (`512MB` at a
time), which would not work with the current token-bucket-based rate limiter
APIs.
##### Using metrics other than CPU load

*Future work*. This is feasible but would need additional implementation and
load tuning. See `io_load.rs`.
### Metrics

| Metric | Description |
|--------|-------------|
| `tikv_storage_full_compact_duration_seconds` | Bucketed histogram of periodic full compaction run duration |
| `tikv_storage_full_compact_increment_duration_seconds` | Bucketed histogram of full compaction *increments* |
| `tikv_storage_full_compact_pause_duration_seconds` | Bucketed histogram of full compaction pauses |
| `tikv_storage_process_stat_cpu_usage` | CPU usage over a 10-minute window |
## Demonstration

### Setup

#### Configure periodic full compaction

```toml
[raftstore]
periodic-full-compact-start-max-cpu = 0.33
periodic-full-compact-start-times = ["11:00", "22:00", "23:00"]
```

> **Note:** set `periodic-full-compact-start-max-cpu` to a low value in order
> to observe pauses.
#### Test periodic full compaction

##### Populate a table

```sql
create database compact_test;
-- Query OK, 0 rows affected (0.10 sec)
use compact_test;
-- Database changed
create table t1(f1 integer primary key auto_increment, f2 integer);
-- Query OK, 0 rows affected (0.09 sec)
insert into t1(f2) values(1),(2),(3);
-- Query OK, 3 rows affected (0.01 sec)
-- Records: 3 Duplicates: 0 Warnings: 0

-- repeat the command below N times
insert into t1(f2) select f1+f2 from t1;
-- Query OK, 3 rows affected (0.00 sec)
-- Records: 3 Duplicates: 0 Warnings: 0
-- ...
--
-- Query OK, 6291456 rows affected (36.25 sec)
-- Records: 6291456 Duplicates: 0 Warnings: 0
```
##### Generate deletes

```sql
delete from t1 where f1 in (select f1 from t1 where mod(f2,3) = 0);
-- Query OK, 6428311 rows affected (32.15 sec)
```
##### Observe metrics and logs

```log
[2024/01/10 11:27:07.280 -08:00] [INFO] [compact.rs:236] ["full compaction started"] [thread_id=0x5]
[2024/01/10 11:27:45.910 -08:00] [INFO] [compact.rs:291] ["full compaction finished"] [time_takes=38.630410698s] [thread_id=0x5]
```

![Metrics showing full compaction running without a pause](../media/periodic-full-compaction-1.png)

> **Note that** in the screenshot above, CPU usage was below 33%: this allowed
> full compaction to run without any pauses between increments.
## Future work

### Additional load criteria

* Incorporate other load statistics besides CPU, such as disk or network I/O.
* Incorporating disk seek time, throughput, utilization, and/or file-sync
  latency statistics specifically will further limit the impact of full
  compaction runs on read and write latency. **Note that** with the existing
  implementation, we suggest using `raftstore.periodic-full-compact-start-times`
  to configure full compaction to only start during off-peak periods.
### Smarter range selection

Possible options:

* Do not compact an entire region all at once.
* Compact the ranges with the most versions first.
### Stopping compaction

* Add a mechanism to monitor which full compactions are in progress.
* Manual pausing: allow all or some manual compaction tasks to be paused for a
  specified amount of time.
* Manual stopping: allow setting a flag for an individual full compaction task
  that terminates the task as soon as possible instead of starting the next
  increment.
### Manual invocation

* Support manually starting a full compaction for the entire store. This could
  be done via the CLI.
* Support manually compacting all or some regions of a given table. This would
  need to be integrated into TiDB syntax.
