306 changes: 306 additions & 0 deletions pkg/kv/kvserver/kvadmission/doc.go
@@ -0,0 +1,306 @@
// Copyright 2025 The Cockroach Authors.
//
// Use of this software is governed by the CockroachDB Software License
// included in the /LICENSE file.

// Package kvadmission is the integration layer between KV and admission
// control.
//
// # Overview
//
// This package provides admission control for all KV layer work. The primary
// entry point is Controller.AdmitKVWork, which is called before processing any
// BatchRequest. Depending on the request type and properties, work may pass
// through multiple admission control queues before execution.
//
// # Request Flow
//
// When a BatchRequest arrives, AdmitKVWork determines which admission control
// mechanism(s) to apply based on the request properties. A single request may
// pass through multiple queues in sequence:
//
// 1. Replication Admission Control (RACv2) - For replicated writes
// Token-based flow control integrated with Raft replication.
// Applies when kvflowcontrol is enabled and work is not bypassing admission.
//
// 2. Store Admission Queue - For leaseholder writes
// Token-based IO admission control per store, monitoring Pebble LSM health.
// Applies to writes when RACv2 is disabled or did not admit the work.
//
// 3. Elastic CPU Work Queue - For CPU-intensive background work
// Token-based admission with cooperative scheduling for elastic work.
// Applies to export requests and low-priority internal bulk operations.
//
// 4. KV Work Queue (slots) - For regular KV operations
// Slot-based concurrency control, dynamically adjusted based on CPU load.
// This is the default path for work not covered by the above mechanisms.
//
// # Request Property Mapping
//
// The workInfoForBatch function extracts admission control properties from
// incoming requests:
//
// ## Tenant ID
//
// The tenant ID determines fairness scheduling within work queues:
//
// - For non-system tenants: Always uses the authenticated requestTenantID.
// - For system tenant requests: Uses rangeTenantID (if set and non-admin)
// to attribute work to the tenant owning the data. This enables proper
// accounting when the system tenant operates on behalf of other tenants.
// Controlled by kvadmission.use_range_tenant_id_for_non_admin.enabled.
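//
// A minimal sketch of that selection, with illustrative variable and setting
// names (the real logic lives in workInfoForBatch):
//
//	tenantID := requestTenantID
//	if requestTenantID.IsSystem() && rangeTenantID.IsSet() && !ba.IsAdmin() &&
//		useRangeTenantIDForNonAdmin.Get(&settings.SV) {
//		tenantID = rangeTenantID // attribute work to the tenant owning the data
//	}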
//
// ## Priority
//
// Priority is extracted from BatchRequest.AdmissionHeader.Priority and uses
// the admissionpb.WorkPriority enum:
//
// - HighPri (127): High-priority system work
// - LockingUserHighPri (100): User high-priority transactions with locks
// - UserHighPri (50): High-priority user queries
// - LockingNormalPri (10): Normal-priority transactions with locks
// - NormalPri (0): Default priority for most user work
// - BulkNormalPri (-30): User-initiated bulk operations (backups, changefeeds)
// and necessary background maintenance (index backfill, MVCC GC)
// - UserLowPri (-50): Low-priority user work
// - BulkLowPri (-100): Internal system maintenance (TTL deletion, schema
// change cleanup, table metadata updates, SQL activity stats)
// - LowPri (-128): Lowest priority background work
//
// Priority affects ordering within a tenant's work queue. Higher priority work
// can starve lower priority work within the same tenant.
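//
// For illustration, a caller tagging a batch as bulk work might populate the
// header like this (a sketch; fields follow kvpb.AdmissionHeader):
//
//	ba.AdmissionHeader = kvpb.AdmissionHeader{
//		Priority:   int32(admissionpb.BulkNormalPri),
//		CreateTime: timeutil.Now().UnixNano(),
//		Source:     kvpb.AdmissionHeader_FROM_SQL,
//	}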
//
// ## Priority to Admission Queue Mapping
//
// Priorities map to admission control concepts in two ways:
//
// 1. WorkClass: Determines Store IO token requirements (for writes)
// - ElasticWorkClass (priority < NormalPri): Requires L0 tokens, elastic
// L0 tokens, AND disk bandwidth tokens
// - RegularWorkClass (priority >= NormalPri): Requires only L0 tokens
//
// 2. Admission Path: The sequence of admission queues work passes through
// - Store IO: Token-based per-store IO admission (writes only currently)
// - CPU: Elastic CPU work queue, uses CPU time tokens (e.g., 100ms grants)
// and cooperative yielding when work exceeds its allotted time
// - Slots: KV work queue, uses concurrency slots (e.g., max 8 concurrent ops)
// that are held for the duration of the operation
//
// The mapping varies by request type and priority (defaults shown):
//
// ┌──────────────────┬────────┬───────────┬──────────────────────────────┐
// │ Priority │ Value │ WorkClass │ Admission Path │
// ├──────────────────┼────────┼───────────┼──────────────────────────────┤
// │ Export requests³ │ -30 │ Elastic │ IO(Elastic) → CPU(100ms)¹ │
// ├──────────────────┼────────┼───────────┼──────────────────────────────┤
// │ LowPri │ -128 │ Elastic │ IO(Elastic) → CPU(10ms)² │
// ├──────────────────┼────────┼───────────┼──────────────────────────────┤
// │ BulkLowPri │ -100 │ Elastic │ IO(Elastic) → CPU(10ms)² │
// ├──────────────────┼────────┼───────────┼──────────────────────────────┤
// │ UserLowPri │ -50 │ Elastic │ IO(Elastic) → Slots(1) │
// ├──────────────────┼────────┼───────────┼──────────────────────────────┤
// │ BulkNormalPri │ -30 │ Elastic │ IO(Elastic) → Slots(1) │
// ├──────────────────┼────────┼───────────┼──────────────────────────────┤
// │ NormalPri+ │ >= 0 │ Regular │ IO(Regular) → Slots(1) │
// └──────────────────┴────────┴───────────┴──────────────────────────────┘
//
// Notes:
// - IO(Elastic) requires 3 token types: L0, elastic L0, and disk bandwidth
// (see "Disk Bandwidth Tokens" section for details)
// - IO(Regular) requires 1 token type: L0 only
// - Store IO currently only applies to writes; read bandwidth limiting is planned
// - CPU(Xms) grants X milliseconds of CPU time; work cooperatively yields when
// it exceeds its grant
// - Slots(1) requests 1 concurrency slot; typically ~8 total slots available,
// dynamically adjusted based on CPU load
//
// Footnotes:
//
// ¹ kvadmission.export_request_elastic_control.enabled (default: true)
// When disabled, export requests become IO(Elastic) → Slots(1).
//
// ² kvadmission.elastic_control_bulk_low_priority.enabled (default: true)
// When disabled, BulkLowPri and LowPri become IO(Elastic) → Slots(1) instead
// of IO(Elastic) → CPU(10ms).
//
// ³ Export requests (used by BACKUP) are always assigned BulkNormalPri (-30)
// for admission control, regardless of the backup job's user priority.
//
// Additionally, kvadmission.elastic_cpu.duration_per_export_request (default: 100ms)
// and kvadmission.elastic_cpu.duration_per_low_pri_read (default: 10ms) control
// the CPU time tokens granted to elastic CPU work.
//
// Key observations:
//
// - All work follows the IO → {CPU|Slots} pattern, with Store IO admission
// currently applying only to writes (read bandwidth limiting planned).
// - Export requests bypass normal priority-based routing to use IO → CPU.
// - The boundary at NormalPri (0) separates ElasticWorkClass (can tolerate
// delays) from RegularWorkClass (latency-sensitive); see the sketch after
// this list.
// - IO(Elastic) is more restrictive than IO(Regular): 3 token types vs 1.
// The additional disk bandwidth token requirement provides backpressure
// protection, limiting elastic writes to ~80% of provisioned disk bandwidth.
// - Only BulkLowPri and LowPri end at the CPU queue (by default). UserLowPri
// and BulkNormalPri end at Slots despite being ElasticWorkClass.
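//
// The WorkClass boundary reduces to a single comparison, mirroring
// admissionpb.WorkClassFromPri (a minimal sketch):
//
//	func workClassFromPri(pri admissionpb.WorkPriority) admissionpb.WorkClass {
//		if pri < admissionpb.NormalPri {
//			return admissionpb.ElasticWorkClass
//		}
//		return admissionpb.RegularWorkClass
//	}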
//
// ## Bypass Conditions
//
// Work may bypass admission control entirely under these conditions:
//
// - Admin requests: ba.IsAdmin() returns true (e.g., splits, merges)
// - Source is OTHER: AdmissionHeader.Source == OTHER
// - LeaseInfo requests: Used as health probes by circuit breakers
// - Legacy bulk-only mode: When KVBulkOnlyAdmissionControlEnabled is set,
// work at NormalPri or above bypasses admission
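//
// As a sketch, the bypass predicate is roughly the following (helper names
// illustrative):
//
//	bypass := ba.IsAdmin() ||
//		ba.AdmissionHeader.Source == kvpb.AdmissionHeader_OTHER ||
//		isSingleLeaseInfoRequest(ba) ||
//		(bulkOnlyAdmissionEnabled && pri >= admissionpb.NormalPri)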
//
// ## Create Time
//
// The admission create time (AdmissionHeader.CreateTime) tracks request
// queueing latency. If unset and work is not bypassing admission, it's set to
// the current timestamp. This enables admission control to prioritize older
// requests and track queueing delays.
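//
// A sketch of the defaulting, assuming CreateTime is UNIX nanos as in
// kvpb.AdmissionHeader:
//
//	if ah.CreateTime == 0 && !bypass {
//		ah.CreateTime = timeutil.Now().UnixNano() // older work sorts first within a tenant/priority
//	}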
//
// # Admission Control Mechanisms
//
// ## KV Work Queue (Slots)
//
// The default path for most KV operations uses slot-based admission:
//
// - Slots represent concurrency limits (e.g., 8 concurrent KV operations)
// - Dynamically adjusted by kvSlotAdjuster based on CPU load (runnable goroutines)
// - Work blocks until a slot is available
// - Slot is explicitly returned via AdmittedKVWorkDone
// - Enables tracking of work completion for better admission decisions
//
// Used for: All KV work except writes (which use store queue) and elastic work.
//
// ## Store Admission Queue (Tokens)
//
// Per-store token-based IO admission control:
//
// - Tokens represent estimated disk IO cost
// - Dynamically adjusted based on Pebble LSM health metrics:
// - L0 file count and compaction debt
// - Write stalls and flush backlog
// - Disk bandwidth utilization
// - Tokens are consumed but not returned (unlike slots)
// - Actual bytes written are reported for token adjustment
//
// Used for: Leaseholder writes when RACv2 is disabled or does not admit the work.
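//
// The token lifecycle for one write, as a sketch (method names approximate
// the admission.StoreWorkQueue API):
//
//	h, err := storeQ.Admit(ctx, storeWorkInfo) // deducts estimated tokens; may block
//	// ... perform the write ...
//	storeQ.AdmittedWorkDone(h, doneInfo) // reports actual bytes; tunes future estimates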
//
// ### Disk Bandwidth Tokens
//
// In addition to L0 tokens, Store Admission uses disk bandwidth tokens to limit
// elastic work based on the underlying storage's provisioned bandwidth:
//
// Token Budget (per 15s interval):
// elastic_write_tokens = (provisioned_bandwidth × max_util × 15s) - observed_reads
//
// Where:
// - provisioned_bandwidth: Storage capacity (e.g., 1 GB/s from disk specs)
// - max_util: kvadmission.store.elastic_disk_bandwidth_max_util (default: 0.8)
// - observed_reads: Actual read bytes from OS (e.g., Linux /proc/diskstats)
//
// Key properties:
//
// - One token = one byte of LSM ingestion capacity (not raw disk bytes)
// - Write amplification is modeled: 1MB write may consume 10MB of tokens
// - Reads are observed (not admitted) and reduce the write budget
// - Regular work requires only L0 tokens
// - Elastic work requires L0 tokens AND disk bandwidth tokens
// - The max_util cap (80%) is static: elastic work won't exceed it even
// when regular work is idle, providing headroom for bursts of regular traffic
//
// Example calculation (15s interval, 1 GB/s disk, 80% max util, 2 GB reads):
//
// Budget = (1 GB/s × 0.8 × 15s) - 2 GB = 12 GB - 2 GB = 10 GB for elastic writes
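//
// The same calculation in code (a self-contained sketch of the arithmetic,
// not the actual implementation):
//
//	const intervalSecs = 15
//	provisionedBps := int64(1 << 30)    // 1 GB/s, from disk specs
//	maxUtil := 0.8                      // elastic_disk_bandwidth_max_util
//	observedReadBytes := int64(2 << 30) // from /proc/diskstats over the interval
//	budget := int64(float64(provisionedBps)*maxUtil*intervalSecs) - observedReadBytes
//	// budget = 12 GB - 2 GB = 10 GB of elastic write tokens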
//
// ## Replication Admission Control (RACv2)
//
// Below-raft replication flow control:
//
// - Token-based with per-stream flow tokens
// - Integrated with Raft replication protocol
// - Provides backpressure before proposals enter the Raft log
// - Tokens are returned when followers apply and acknowledge entries
//
// Used for: Replicated writes when kvflowcontrol.enabled is true.
//
// ## Elastic CPU Work Queue (Tokens)
//
// Cooperative scheduling for elastic, CPU-intensive work:
//
// - Token represents allotted CPU time (default 100ms for exports)
// - Work checks ElasticCPUWorkHandle.OverLimit() in tight loops (see the
// sketch at the end of this section)
// - When over limit, work yields and re-admits with a new handle
// - Total elastic CPU capacity adjusts based on scheduler latency
// - Provides latency isolation between elastic and regular work
//
// Used for:
// - Export requests (backups) when kvadmission.export_request_elastic_control.enabled
// - Internal low-priority reads (TTL) at BulkLowPri when
// kvadmission.elastic_control_bulk_low_priority.enabled
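//
// A minimal sketch of the cooperative-yield pattern (resumption details
// elided; the OverLimit signature is approximate):
//
//	for iter.Valid() {
//		if overLimit, _ := handle.OverLimit(); overLimit {
//			return resumeSpan, nil // yield; the caller re-admits with a fresh handle
//		}
//		processKV(iter.Key(), iter.Value())
//		iter.Next()
//	}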
//
// # Admission Queue Ordering
//
// For writes (non-heartbeat), admission happens in this order:
//
// 1. RACv2 flow control (if enabled and not bypassing)
// 2. Store IO admission (if RACv2 disabled or didn't admit)
// 3. KV slots or Elastic CPU (depending on request type)
//
// For reads and other operations:
//
// 1. Elastic CPU admission (if export or low-priority bulk)
// 2. KV slots admission (otherwise)
//
// Each stage may block waiting for tokens/slots. Work bypassing admission
// still goes through the queues for accounting but doesn't block.
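//
// Put together, the write-path ordering looks roughly like this (helper
// names illustrative):
//
//	admittedByRACv2 := false
//	if flowControlEnabled && !bypass {
//		admittedByRACv2 = flowHandle.Admit(ctx, pri, createTime) // 1. RACv2 flow tokens
//	}
//	if !admittedByRACv2 {
//		storeQ.Admit(ctx, storeWorkInfo) // 2. per-store IO tokens
//	}
//	kvAdmissionQ.Admit(ctx, workInfo) // 3. slots (elastic work goes to the CPU queue instead)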
//
// # Example: Normal Read Request
//
// A BatchRequest with a Get operation at NormalPri:
//
// 1. Tenant ID: Extracted from requestTenantID (or rangeTenantID for system tenant)
// 2. Priority: NormalPri (0) from AdmissionHeader
// 3. Bypass: false (not an admin request)
// 4. Admission: kvAdmissionQ.Admit() - blocks until a slot is available
// 5. Execution: KV work executes
// 6. Completion: AdmittedKVWorkDone() returns the slot
//
// # Example: Write Request
//
// A BatchRequest with a Put operation at NormalPri:
//
// 1. Tenant ID: Extracted as above
// 2. Priority: NormalPri (0)
// 3. Bypass: false
// 4. RACv2 Admission: kvflowHandle.Admit() - waits for replication flow tokens
// 5. Store Admission: Skipped (RACv2 admitted)
// 6. KV Admission: kvAdmissionQ.Admit() - waits for a KV slot
// 7. Execution: KV work executes, Raft replication happens
// 8. Completion: AdmittedKVWorkDone() releases slot and IO tokens
//
// # Example: Export Request
//
// A backup export request at BulkNormalPri:
//
// 1. Tenant ID: Extracted as above
// 2. Priority: BulkNormalPri (-30)
// 3. Bypass: false
// 4. Elastic CPU Admission: ElasticCPUWorkQueue.Admit() - waits for 100ms of CPU time
// 5. Execution: Export loop checks OverLimit() periodically and yields when exceeded
// 6. Completion: AdmittedKVWorkDone() returns elastic CPU handle
//
// # Integration Points
//
// The Controller is initialized in server.go during node startup:
//
// - kvAdmissionController receives the KVWork WorkQueue from GrantCoordinator
// - elasticCPUGrantCoordinator is the separate elastic CPU coordinator
// - storeGrantCoords provides per-store IO admission queues
// - kvflowHandles tracks replication flow control handles per range
//
// Callers invoke AdmitKVWork before processing any BatchRequest and must call
// AdmittedKVWorkDone after completion (if a non-nil handle is returned).
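//
// The caller contract, as a sketch (signatures approximate):
//
//	handle, err := admissionController.AdmitKVWork(ctx, tenantID, ba)
//	if err != nil {
//		return nil, err // rejected by admission control
//	}
//	br, pErr := executeBatch(ctx, ba)
//	admissionController.AdmittedKVWorkDone(handle, writeBytes)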
package kvadmission
8 changes: 4 additions & 4 deletions pkg/kv/kvserver/kvadmission/kvadmission.go
@@ -222,10 +222,10 @@ type controllerImpl struct {

// Admission control queues and coordinators. All three should be nil or
// non-nil.
-	kvAdmissionQ               *admission.WorkQueue
-	storeGrantCoords           *admission.StoreGrantCoordinators
-	elasticCPUGrantCoordinator *admission.ElasticCPUGrantCoordinator
-	kvflowHandles              kvflowcontrol.ReplicationAdmissionHandles
+	kvAdmissionQ               *admission.WorkQueue                       // foreground CPU admission control
+	storeGrantCoords           *admission.StoreGrantCoordinators          // store-level (IO) admission control
+	elasticCPUGrantCoordinator *admission.ElasticCPUGrantCoordinator      // background (elastic) CPU admission control
+	kvflowHandles              kvflowcontrol.ReplicationAdmissionHandles  // quorum-replicated flow control

settings *cluster.Settings
every log.EveryN