# Periodic Full Compaction

Author: [Alex Feinberg](https://github.com/afeinberg)

Reviewers: [Connor](https://github.com/Connor1996),
[Tony](https://github.com/tonyxuqqi), [Andy](https://github.com/v01dstar),
others

## Introduction

**Periodic full compaction** is a scheduled task that starts at specified times
of the day on each TiKV node and compacts the column families of all regions *at
all levels, including the bottommost (L6)*. To reduce the impact that running
such a heavy-weight task would have on the cluster's ability to serve traffic,
full compaction is incremental: before the next range (presently a region) is
compacted, we check whether the load is below a certain threshold; if the
threshold is exceeded, we pause the task until the load falls below the
threshold again.

## Motivation

Presently, we have no way to periodically execute a RocksDB compaction across
all levels. This has a number of implications: compaction filters (used to
remove tombstones) only run when a bottommost (L6) compaction is executed, so a
system with a high number of deletes that serves a mostly read-only workload
(with little or no writes) may accumulate tombstone markers that are never
removed.

Using `tikv-ctl compact-cluster` is not suitable for this goal: while it executes
a full compaction, it may impact online user traffic, which makes it non-viable
for production use without downtime. Periodic full compaction provides a way to
achieve the same goal in a more controllable way.

## Detailed design

### Periodic scheduling

#### Configuration

To enable periodic full compaction, specify in tikv's configuration the times of
day at which full compaction should start, along with a maximum CPU utilization
threshold. CPU utilization is calculated from process stats in `/proc` over a
10-minute window. (See *Conditions for full compaction to run* below.)

>
> `tikv.toml` setting to run compaction at 03:00 and 23:00
> (3am and 11pm respectively) in the tikv nodes' local timezone if CPU
> usage is below 90%:
>
> ```toml
> [raftstore]
> periodic-full-compact-start-max-cpu = 0.9
> periodic-full-compact-start-times = ["03:00", "23:00"]
> ```

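For illustration only, the sketch below shows the kind of time-of-day check a
periodic tick could perform against `periodic-full-compact-start-times`. It
assumes the `chrono` crate and hypothetical helper names; TiKV's actual check
lives in the raftstore tick handling and uses its own time utilities.

```rust
// Hypothetical sketch (not TiKV code): decide whether the current local time
// falls on one of the configured start times, at hour granularity.
use chrono::{Local, Timelike};

/// Parse configured start times such as ["03:00", "23:00"] into hours.
fn parse_start_hours(times: &[&str]) -> Vec<u32> {
    times
        .iter()
        .filter_map(|t| t.split(':').next().and_then(|h| h.parse().ok()))
        .collect()
}

/// Returns true if the current local hour matches one of the configured hours.
fn is_scheduled_now(start_hours: &[u32]) -> bool {
    start_hours.contains(&Local::now().hour())
}

fn main() {
    let hours = parse_start_hours(&["03:00", "23:00"]);
    println!("start full compaction now? {}", is_scheduled_now(&hours));
}
```
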
### Executing a `PeriodicFullCompact` task

> `compact.rs`:
>
> ```rust
> pub enum Task {
>     PeriodicFullCompact {
>         ranges: Vec<(Key, Key)>,
>         compact_load_controller: FullCompactController,
>     },
> ```

#### Choosing ranges

We use ranges defined by the start and end keys of all of the store's regions as
increments:

> See `StoreFsmDelegate::ranges_for_full_compact` in `store.rs`:
>
> ```rust
>     /// Use ranges assigned to each region as increments for full compaction.
>     fn ranges_for_full_compact(&self) -> Vec<(Vec<u8>, Vec<u8>)> {
> ```

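As an illustration of the idea, a minimal sketch of turning a store's region
boundaries into compaction increments is shown below; `RegionMeta` here is a
hypothetical stand-in for the store's region metadata, not a TiKV type.

```rust
// Hypothetical sketch: each region's (start_key, end_key) pair becomes one
// full-compaction increment. `RegionMeta` stands in for real region metadata.
struct RegionMeta {
    start_key: Vec<u8>,
    end_key: Vec<u8>,
}

fn ranges_for_full_compact(regions: &[RegionMeta]) -> Vec<(Vec<u8>, Vec<u8>)> {
    regions
        .iter()
        .map(|r| (r.start_key.clone(), r.end_key.clone()))
        .collect()
}

fn main() {
    let regions = vec![
        RegionMeta { start_key: b"".to_vec(), end_key: b"k2".to_vec() },
        RegionMeta { start_key: b"k2".to_vec(), end_key: b"".to_vec() },
    ];
    // Two increments: ["", "k2") and ["k2", "") (an empty key means unbounded).
    assert_eq!(ranges_for_full_compact(&regions).len(), 2);
}
```
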
#### Controlling full compaction

> `compact.rs`:
>
> ```rust
> pub struct FullCompactController {
>     /// Initial delay between retries for `FullCompactController::pause`.
>     pub initial_pause_duration_secs: u64,
>     /// Max delay between retries.
>     pub max_pause_duration_secs: u64,
>     /// Predicate function to evaluate that indicates if we can proceed with
>     /// full compaction.
>     pub incremental_compaction_pred: CompactPredicateFn,
> }
> ```

##### Conditions for full compaction to run

###### Compact predicate function (`CompactPredicateFn`)

>
> `CompactPredicateFn` is defined in `compact.rs` as an `Fn()` that returns true
> if it is safe to start compaction or to compact the next range in `ranges`.
>
> ```rust
> type CompactPredicateFn = Box<dyn Fn() -> bool + Send + Sync>;
> ```
>

###### Using `CompactPredicateFn`

We evaluate the compaction predicate function in the following cases:

1. Before starting a full compaction task.

   See `StoreFsmDelegate::on_full_compact_tick` in `store.rs`, where we return
   early if the predicate returns false.

   ```rust
   // Do not start if the load is high.
   if !compact_predicate_fn() {
       return;
   }
   ```

2. After finishing an incremental full compaction for a range, if more ranges remain.

   See `CompactRunner::full_compact` in `compact.rs`, where we pause (see
   *Pausing* below) if the predicate returns false:

   ```rust
   if let Some(next_range) = ranges.front() {
       if !(compact_controller.incremental_compaction_pred)() {
           // ...
           compact_controller.pause().await?;
   ```

###### Load-based `CompactPredicateFn` implementation

> This is returned by `StoreFsmDelegate::is_low_load_for_full_compact` in
> `store.rs`, which checks that the raftstore is not busy and that the CPU
> usage over the last 10-minute window is below the threshold specified by
> `periodic-full-compact-start-max-cpu`.
>
> ```rust
> fn is_low_load_for_full_compact(&self) -> impl Fn() -> bool {
> ```

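For illustration, a sketch of what such a load-based predicate closure could
look like is shown below. The helpers `is_raftstore_busy` and
`recent_cpu_usage_ratio` are assumptions standing in for TiKV's real busy-state
and `/proc`-based CPU sampling; only the shape of the closure mirrors the design.

```rust
// Hypothetical sketch of a load-based compact predicate; not TiKV's code.
type CompactPredicateFn = Box<dyn Fn() -> bool + Send + Sync>;

// Stand-in: would consult the raftstore's busy state.
fn is_raftstore_busy() -> bool {
    false
}

// Stand-in: would average process CPU stats from /proc over a 10-minute window.
fn recent_cpu_usage_ratio() -> f64 {
    0.25
}

/// Build a predicate that allows compaction only when the store is not busy
/// and CPU usage is below `periodic-full-compact-start-max-cpu`.
fn is_low_load_for_full_compact(max_start_cpu: f64) -> CompactPredicateFn {
    Box::new(move || !is_raftstore_busy() && recent_cpu_usage_ratio() < max_start_cpu)
}

fn main() {
    let pred = is_low_load_for_full_compact(0.9);
    println!("can start full compaction: {}", pred());
}
```
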
##### Pausing

Full compaction tasks are intended to be long-running and may spend up to 15
minutes at a time waiting for `CompactPredicateFn` to evaluate to true. As in
other places in tikv where we need to pause or sleep, we call
`GLOBAL_TIMER_HANDLE.delay` in an async context.

> `compact.rs`:
>
> ```rust
> impl FullCompactController {
>     pub async fn pause(&self) -> Result<(), Error> {
>         let mut duration_secs = self.initial_pause_duration_secs;
>         loop {
>             box_try!(
>                 GLOBAL_TIMER_HANDLE
>                     .delay(std::time::Instant::now() + Duration::from_secs(duration_secs))
>                     .compat()
>                     .await
>             );
>             if (self.incremental_compaction_pred)() {
>                 break;
>             };
>             // Exponential backoff, capped at `max_pause_duration_secs`.
>             duration_secs = self.max_pause_duration_secs.min(duration_secs * 2);
>         }
>         Ok(())
>     }
> }
> ```

##### Using a background worker to execute compaction

Since `FullCompactController::pause` is asynchronous (see *Pausing* above),
`PeriodicFullCompact` tasks are scheduled using the background worker pool. This
means that other cleanup and compaction tasks can run while full compaction is
paused.

> `store.rs`:
>
> ```rust
> impl RaftBatchSystem {
>     pub fn spawn(
>         &mut self,
>         // ...
>         background_worker: Worker,
>         // ...
>     ) -> Result<()> {
>         // ...
>         let bg_remote = background_worker.remote();
>         // ...
>         let compact_runner = CompactRunner::new(engines.kv.clone(), bg_remote);
>         // ...
> ```
>
> We use `Remote::spawn` to asynchronously execute full compaction in a
> `yatp_pool::FuturePool`:
>
> ```rust
> impl Runnable for CompactRunner {
>     fn run(&mut self, task: Task) {
>         match task {
>             Task::PeriodicFullCompact {
>                 ranges,
>                 compact_load_controller,
>             } => {
>                 // ...
>                 let engine = self.engine.clone();
>                 self.remote.spawn(async move { // NOTE the use of `self.remote`
>                     if let Err(e) = Self::full_compact(engine, ranges, compact_load_controller).await {
>                         // ...
> ```

Note that `CompactRunner::full_compact` is an `async fn`, yet it invokes
RocksDB's manual compaction API, which blocks the current thread; this is
supported by `FuturePool` and also happens in other places in our code.

### Full compaction of a range

We use `CompactExt::compact_range` to perform the compaction of each region,
which calls `compact_range_cf` on all the column families. Note that
`exclusive_manual` is `false` and `max_subcompactions` is `1` (all
sub-compactions are executed on one RocksDB thread) to limit resource usage.

> From `full_compact` in `compact.rs`:
>
> ```rust
>     box_try!(engine.compact_range(
>         range.0, range.1,
>         false, // non-exclusive
>         1,     // number of sub-compaction threads
>     ));
> ```
>
> From `CompactExt` in `components/engine_traits/src/compact.rs`:
>
> ```rust
>     fn compact_range(
>         &self,
>         start_key: Option<&[u8]>,
>         end_key: Option<&[u8]>,
>         exclusive_manual: bool,
>         max_subcompactions: u32, // Controls the number of engine worker threads.
>     ) -> Result<()> {
>         for cf in self.cf_names() {
>             self.compact_range_cf(cf, start_key, end_key, exclusive_manual, max_subcompactions)?;
>         }
> ```
>
> The `RocksEngine` implementation of `compact_range_cf`. See [Manual
> Compaction](https://github.com/facebook/rocksdb/wiki/Manual-Compaction) in the
> RocksDB documentation for more info.
>
> ```rust
> let mut compact_opts = CompactOptions::new();
> compact_opts.set_exclusive_manual_compaction(exclusive_manual);
> compact_opts.set_max_subcompactions(max_subcompactions as i32);
> db.compact_range_cf_opt(handle, &compact_opts, start_key, end_key);
> ```

### Implementation

| Sub-task | Status |
| - | - |
| Periodically schedule full compaction | **Merged** [tikv/tikv#15853](https://github.com/tikv/tikv/pull/15853) |
| Incrementalism, pausing | **Merged** [tikv/tikv#15995](https://github.com/tikv/tikv/pull/15995) |

#### Alternatives considered

##### Compacting by file or level instead of a range

*Not applicable*: doing so would not guarantee that the compaction filters are able to run.

##### Using a rate limiter to control load during full compaction

*Not applicable*: compaction happens in increments of regions (`512MB` at a
time), which would not work with the current token-bucket-based rate limiter
APIs.

##### Using metrics other than CPU load

*Future work*. This is feasible but would need additional implementation and
tuning work. See `io_load.rs`.

### Metrics

| Metric | Description |
|--------|-------------|
| `tikv_storage_full_compact_duration_seconds` | Bucketed histogram of periodic full compaction run duration |
| `tikv_storage_full_compact_increment_duration_seconds` | Bucketed histogram of full compaction *increments* |
| `tikv_storage_full_compact_pause_duration_seconds` | Bucketed histogram of full compaction pauses |
| `tikv_storage_process_stat_cpu_usage` | CPU usage over a 10-minute window |

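As a sketch of how one of these histograms might be recorded around a compaction
increment, the snippet below uses the `prometheus` crate (which TiKV relies on
for metrics). The metric name matches the table above, but the bucket layout and
wiring here are assumptions, not the actual TiKV metric definitions.

```rust
// Hypothetical sketch of recording an increment-duration histogram.
use lazy_static::lazy_static;
use prometheus::{exponential_buckets, histogram_opts, register_histogram, Histogram};

lazy_static! {
    // Assumed bucket layout; the real metric is defined in TiKV's metrics module.
    static ref FULL_COMPACT_INCREMENT_DURATION: Histogram =
        register_histogram!(histogram_opts!(
            "tikv_storage_full_compact_increment_duration_seconds",
            "Bucketed histogram of full compaction increments",
            exponential_buckets(0.05, 2.0, 20).unwrap()
        ))
        .unwrap();
}

fn compact_one_increment() {
    // The timer observes the elapsed time when dropped at the end of the scope.
    let _timer = FULL_COMPACT_INCREMENT_DURATION.start_timer();
    // ... perform the range compaction here ...
}
```
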
## Demonstration

### Setup

#### Configure periodic full compaction

```toml
[raftstore]
periodic-full-compact-start-max-cpu = 0.33
periodic-full-compact-start-times = ["11:00", "22:00", "23:00"]
```

> **Note:** set `periodic-full-compact-start-max-cpu` to a low value (here
> `0.33`) so that pauses can be observed.

#### Test periodic full compaction

##### Populate a table

```sql
create database compact_test;
-- Query OK, 0 rows affected (0.10 sec)
use compact_test;
-- Database changed
create table t1(f1 integer primary key auto_increment, f2 integer);
-- Query OK, 0 rows affected (0.09 sec)
insert into t1(f2) values(1),(2),(3);
-- Query OK, 3 rows affected (0.01 sec)
-- Records: 3 Duplicates: 0 Warnings: 0

-- repeat the command below N times
insert into t1(f2) select f1+f2 from t1;
-- Query OK, 3 rows affected (0.00 sec)
-- Records: 3 Duplicates: 0 Warnings: 0
-- ...
--
-- Query OK, 6291456 rows affected (36.25 sec)
-- Records: 6291456 Duplicates: 0 Warnings: 0
```

##### Generate deletes

```sql
delete from t1 where f1 in (select f1 from t1 where mod(f2,3) = 0);
-- Query OK, 6428311 rows affected (32.15 sec)
```

##### Observe metrics and logs

```log
[2024/01/10 11:27:07.280 -08:00] [INFO] [compact.rs:236] ["full compaction started"] [thread_id=0x5]
[2024/01/10 11:27:45.910 -08:00] [INFO] [compact.rs:291] ["full compaction finished"] [time_takes=38.630410698s] [thread_id=0x5]
```

> **Note that** during the run shown in the logs above, CPU usage was below 33%:
> this allowed full compaction to run without any pauses between increments.

## Future work

### Additional load criteria

* Incorporate other load statistics besides CPU, such as disk or network I/O.
  * Incorporating disk seek time, throughput, utilization, and/or file-sync
    latency statistics specifically will further limit the impact of full
    compaction runs on read and write latency. **Note that** the existing
    implementation suggests using `raftstore.periodic-full-compact-start-times`
    to configure full compaction to only start during off-peak periods.

### Smarter range selection

Possible options:

* Do not compact an entire region all at once.
* Compact the ranges with the most versions first.

### Stopping compaction

* Add a mechanism to monitor which full compactions are in progress.
* Manual pausing: allow all or some manual compaction tasks to be paused for a
  specified amount of time.
* Manual stopping: allow setting a flag for an individual full compaction task
  that would terminate the task as soon as possible instead of starting the next
  increment.

### Manual invocation

* Support manually starting a full compaction for the entire store. This can be
  done via the CLI.
* Support manually compacting all or some regions of a given table. This would
  need to be integrated into TiDB syntax.