| 1 | +--- |
| 2 | +name: profiling-statement-fingerprints |
| 3 | +description: Ranks and analyzes statement fingerprints using aggregated SQL statistics from crdb_internal.statement_statistics to identify slow, resource-intensive, or error-prone query patterns. Use when investigating historical performance trends, identifying optimization opportunities, or diagnosing recurring slowness without DB Console access. |
| 4 | +compatibility: Requires SQL access with VIEWACTIVITY or VIEWACTIVITYREDACTED cluster privilege. Uses crdb_internal.statement_statistics (production-safe). Execution statistics fields are sampled and may be sparse. |
| 5 | +metadata: |
| 6 | + author: cockroachdb |
| 7 | + version: "1.0" |
| 8 | +--- |
| 9 | + |
| 10 | +# Profiling Statement Fingerprints |
| 11 | + |
| 12 | +Analyzes historical statement performance patterns using aggregated SQL statistics to identify slow, resource-intensive, or error-prone query fingerprints. Uses `crdb_internal.statement_statistics` for time-windowed analysis of latency, CPU, contention, admission delays, and failure rates - entirely via SQL without requiring DB Console access. |
| 13 | + |
| 14 | +**Complement to triaging-live-sql-activity:** This skill analyzes historical patterns; for immediate triage of currently running queries, see [triaging-live-sql-activity](../triaging-live-sql-activity/SKILL.md). |
| 15 | + |
| 16 | +## When to Use This Skill |
| 17 | + |
| 18 | +- Identify slowest statement fingerprints over past hours/days/weeks |
| 19 | +- Find queries with high CPU consumption, contention, or admission waits |
| 20 | +- Investigate performance regressions or plan changes |
| 21 | +- Locate full table scans or missing indexes via index recommendations |
| 22 | +- Analyze resource consumption by application or database |
| 23 | +- SQL-only historical analysis without DB Console access |
| 24 | + |
| 25 | +**For immediate incident response:** Use [triaging-live-sql-activity](../triaging-live-sql-activity/SKILL.md) to triage currently running queries and cancel runaway work. |
| 26 | + |
| 27 | +## Prerequisites |
| 28 | + |
| 29 | +- SQL connection to CockroachDB cluster |
| 30 | +- `VIEWACTIVITY` or `VIEWACTIVITYREDACTED` cluster privilege for cluster-wide visibility |
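| 31 | +- Statement statistics collection enabled (default): `sql.metrics.statement_details.enabled = true`, with persistence via `sql.stats.flush.enabled = true` (note: `sql.stats.automatic_collection.enabled` governs optimizer table statistics, not statement statistics) |
| 32 | + |
| 33 | +**Check collection status:** |
| 34 | +```sql |
| 35 | +SHOW CLUSTER SETTING sql.metrics.statement_details.enabled; -- Should return: true |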
| 36 | +``` |
| 37 | + |
| 38 | +See [triaging-live-sql-activity permissions reference](../triaging-live-sql-activity/references/permissions.md) for RBAC setup (same privileges). |
| 39 | + |
| 40 | +## Core Concepts |
| 41 | + |
| 42 | +### Statement Fingerprints vs Live Queries |
| 43 | + |
| 44 | +**Statement fingerprint:** Normalized SQL pattern with constants replaced by placeholders (e.g., `SELECT * FROM users WHERE id = 123` is recorded as `SELECT * FROM users WHERE id = _`) |
| 45 | + |
| 46 | +**Key differences:** |
| 47 | +- **Time scope:** Historical hourly buckets vs real-time current state |
| 48 | +- **Granularity:** Aggregated pattern statistics vs individual execution instances |
| 49 | + |
| 50 | +### Time-Series Bucketing |
| 51 | + |
| 52 | +**aggregated_ts:** Hourly UTC buckets (e.g., `2026-02-21 14:00:00` = 14:00-14:59 executions) |
| 53 | +**Data retention:** Roughly 7 days on typical workloads, but bounded by row count rather than age (`sql.stats.persisted_rows.max`); high statement diversity shortens it |
| 54 | +**Best practice:** Always filter by time window: `WHERE aggregated_ts > now() - INTERVAL '24 hours'` |
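| | + |
| | +The bucketing above can be seen directly by grouping on `aggregated_ts`. A minimal sketch, assuming the same JSON paths used elsewhere in this skill; the count-weighted mean is an illustration, not an official metric: |
| | + |
| | +```sql |
| | +-- One row per hourly bucket: total executions and a count-weighted mean run latency. |
| | +SELECT |
| | + aggregated_ts, |
| | + SUM((statistics->'statistics'->>'cnt')::INT) AS executions, |
| | + SUM((statistics->'statistics'->'runLat'->>'mean')::FLOAT8 |
| | + * (statistics->'statistics'->>'cnt')::FLOAT8) |
| | + / NULLIF(SUM((statistics->'statistics'->>'cnt')::FLOAT8), 0) AS weighted_mean_run_lat_seconds |
| | +FROM crdb_internal.statement_statistics |
| | +WHERE aggregated_ts > now() - INTERVAL '24 hours' |
| | +GROUP BY aggregated_ts |
| | +ORDER BY aggregated_ts; |
| | +``` |
| | + |
| | +Weighting each fingerprint's mean by its execution count keeps rarely executed statements from dominating the per-bucket average. |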
| 55 | + |
| 56 | +### Aggregated vs Sampled Metrics |
| 57 | + |
| 58 | +| Metric Category | JSON Path | Scope | Use Case | |
| 59 | +|-----------------|-----------|-------|----------| |
| 60 | +| **Aggregated** | `statistics.statistics.*` | All executions | Latency, row counts, execution counts | |
| 61 | +| **Sampled** | `statistics.execution_statistics.*` | Sampled subset of executions (rate set by `sql.txn_stats.sample_rate`; default varies by version) | CPU, contention, admission wait, memory/disk | |
| 62 | + |
| 63 | +**Critical:** Always check sampled metrics presence: `WHERE (statistics->'execution_statistics'->>'cnt') IS NOT NULL` |
| 64 | + |
| 65 | +### JSON Field Extraction |
| 66 | + |
| 67 | +**Operators:** |
| 68 | +- `->`: Extract JSON object (returns JSON) |
| 69 | +- `->>`: Extract as text (returns text) |
| 70 | +- `::TYPE`: Cast to specific type |
| 71 | + |
| 72 | +**Examples:** |
| 73 | +```sql |
| 74 | +metadata->>'db' -- Database name |
| 75 | +(statistics->'statistics'->>'cnt')::INT -- Execution count |
| 76 | +(statistics->'statistics'->'runLat'->>'mean')::FLOAT8 -- Mean latency (seconds) |
| 77 | +(statistics->'execution_statistics'->'cpuSQLNanos'->>'mean')::FLOAT8 / 1e9 -- CPU (convert nanos to seconds) |
| 78 | +``` |
| 79 | + |
| 80 | +**Units:** Latency and contention = seconds, CPU/admission = nanoseconds (÷ 1e9), Memory/disk = bytes (÷ 1048576 for MB) |
| 81 | + |
| 82 | +See [JSON field reference](references/json-field-reference.md) for complete schema. |
| 83 | + |
| 84 | +## Core Diagnostic Queries |
| 85 | + |
| 86 | +### Query 1: Top Statements by Mean Run Latency |
| 87 | + |
| 88 | +```sql |
| 89 | +SELECT |
| 90 | + fingerprint_id, |
| 91 | + metadata->>'db' AS database, |
| 92 | + metadata->>'query' AS query_text, |
| 93 | + (statistics->'statistics'->>'cnt')::INT AS execution_count, |
| 94 | + (statistics->'statistics'->'runLat'->>'mean')::FLOAT8 AS mean_run_lat_seconds, |
| 95 | + (statistics->'statistics'->'runLat'->>'max')::FLOAT8 AS max_run_lat_seconds, |
| 96 | + (metadata->>'fullScan')::BOOL AS full_scan, |
| 97 | + index_recommendations, -- top-level column, not a metadata key |
| 98 | + aggregated_ts |
| 99 | +FROM crdb_internal.statement_statistics |
| 100 | +WHERE aggregated_ts > now() - INTERVAL '24 hours' |
| 101 | + AND (statistics->'statistics'->'runLat'->>'mean')::FLOAT8 > 1.0 -- > 1 second mean latency |
| 102 | +ORDER BY (statistics->'statistics'->'runLat'->>'mean')::FLOAT8 DESC |
| 103 | +LIMIT 20; |
| 104 | +``` |
| 105 | + |
| 106 | +**Focus:** Slowest queries; check `full_scan` and `index_recommendations` for optimization opportunities. |
| 107 | + |
| 108 | +### Query 2: Admission Control Impact |
| 109 | + |
| 110 | +```sql |
| 111 | +SELECT |
| 112 | + fingerprint_id, |
| 113 | + metadata->>'db' AS database, |
| 114 | + metadata->>'query' AS query_text, |
| 115 | + (statistics->'statistics'->>'cnt')::INT AS execution_count, |
| 116 | + (statistics->'execution_statistics'->'admissionWaitTime'->>'mean')::FLOAT8 / 1e9 AS mean_admission_wait_seconds, |
| 117 | + (statistics->'statistics'->'runLat'->>'mean')::FLOAT8 AS mean_run_lat_seconds, |
| 118 | + aggregated_ts |
| 119 | +FROM crdb_internal.statement_statistics |
| 120 | +WHERE aggregated_ts > now() - INTERVAL '24 hours' |
| 121 | + AND (statistics->'execution_statistics'->>'cnt') IS NOT NULL |
| 122 | + AND (statistics->'execution_statistics'->'admissionWaitTime'->>'mean')::FLOAT8 > 0 |
| 123 | +ORDER BY (statistics->'execution_statistics'->'admissionWaitTime'->>'mean')::FLOAT8 DESC |
| 124 | +LIMIT 20; |
| 125 | +``` |
| 126 | + |
| 127 | +**Interpretation:** High admission wait = cluster at resource limits (CPU, memory, I/O). A wait-to-runtime ratio (`mean_admission_wait_seconds / mean_run_lat_seconds`) above 1.0 indicates severe queueing. |
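| | + |
| | +Query 2 returns both inputs but not the ratio itself. A minimal sketch of computing it inline, assuming `admissionWaitTime` is populated in your CockroachDB version as Query 2 expects; the alias `admission_wait_ratio` is illustrative: |
| | + |
| | +```sql |
| | +-- Ratio of mean admission wait to mean run latency; NULLIF guards against division by zero. |
| | +SELECT |
| | + fingerprint_id, |
| | + (statistics->'execution_statistics'->'admissionWaitTime'->>'mean')::FLOAT8 / 1e9 AS mean_admission_wait_seconds, |
| | + (statistics->'statistics'->'runLat'->>'mean')::FLOAT8 AS mean_run_lat_seconds, |
| | + ((statistics->'execution_statistics'->'admissionWaitTime'->>'mean')::FLOAT8 / 1e9) |
| | + / NULLIF((statistics->'statistics'->'runLat'->>'mean')::FLOAT8, 0) AS admission_wait_ratio |
| | +FROM crdb_internal.statement_statistics |
| | +WHERE aggregated_ts > now() - INTERVAL '24 hours' |
| | + AND (statistics->'execution_statistics'->'admissionWaitTime'->>'mean')::FLOAT8 > 0 |
| | +ORDER BY admission_wait_ratio DESC |
| | +LIMIT 20; |
| | +``` |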
| 128 | + |
| 129 | +### Query 3: Plan Hash Diversity |
| 130 | + |
| 131 | +```sql |
| 132 | +SELECT |
| 133 | + fingerprint_id, |
| 134 | + metadata->>'db' AS database, |
| 135 | + metadata->>'query' AS query_text, |
| 136 | + COUNT(DISTINCT plan_hash) AS distinct_plan_count, |
| 137 | + array_agg(DISTINCT plan_hash ORDER BY plan_hash) AS plan_hashes, |
| 138 | + SUM((statistics->'statistics'->>'cnt')::INT) AS total_executions |
| 139 | +FROM crdb_internal.statement_statistics |
| 140 | +WHERE aggregated_ts > now() - INTERVAL '7 days' |
| 141 | +GROUP BY fingerprint_id, metadata->>'db', metadata->>'query' |
| 142 | +HAVING COUNT(DISTINCT plan_hash) > 1 |
| 143 | +ORDER BY COUNT(DISTINCT plan_hash) DESC, SUM((statistics->'statistics'->>'cnt')::INT) DESC |
| 144 | +LIMIT 20; |
| 145 | +``` |
| 146 | + |
| 147 | +**Interpretation:** Multiple plans indicate instability from schema changes, statistics updates, or routing changes. Performance can vary significantly between plans. |
| 148 | + |
| 149 | +### Query 4: High Contention Statements |
| 150 | + |
| 151 | +```sql |
| 152 | +SELECT |
| 153 | + fingerprint_id, |
| 154 | + metadata->>'db' AS database, |
| 155 | + app_name AS application, |
| 156 | + substring(metadata->>'query', 1, 150) AS query_preview, |
| 157 | + (statistics->'statistics'->>'cnt')::INT AS execution_count, |
| 158 | + (statistics->'execution_statistics'->'contentionTime'->>'mean')::FLOAT8 AS mean_contention_seconds, -- contentionTime is already recorded in seconds |
| 159 | + ROUND( |
| 160 | + (statistics->'execution_statistics'->'contentionTime'->>'mean')::FLOAT8 / |
| 161 | + NULLIF((statistics->'statistics'->'runLat'->>'mean')::FLOAT8, 0) * 100, 2 |
| 162 | + ) AS contention_pct_of_runtime, |
| 163 | + aggregated_ts |
| 164 | +FROM crdb_internal.statement_statistics |
| 165 | +WHERE aggregated_ts > now() - INTERVAL '24 hours' |
| 166 | + AND (statistics->'execution_statistics'->>'cnt') IS NOT NULL |
| 167 | + AND (statistics->'execution_statistics'->'contentionTime'->>'mean')::FLOAT8 > 0 |
| 168 | +ORDER BY (statistics->'execution_statistics'->'contentionTime'->>'mean')::FLOAT8 DESC |
| 169 | +LIMIT 20; |
| 170 | +``` |
| 171 | + |
| 172 | +**Interpretation:** >20% contention = transaction conflicts, hot row access. Remediate with batching, transaction boundary changes, or schema redesign. |
| 173 | + |
| 174 | +### Query 5: High CPU Consumers |
| 175 | + |
| 176 | +```sql |
| 177 | +SELECT |
| 178 | + fingerprint_id, |
| 179 | + metadata->>'db' AS database, |
| 180 | + substring(metadata->>'query', 1, 150) AS query_preview, |
| 181 | + (statistics->'execution_statistics'->'cpuSQLNanos'->>'mean')::FLOAT8 / 1e9 AS mean_cpu_seconds, |
| 182 | + (statistics->'statistics'->>'cnt')::INT AS total_executions, |
| 183 | + ROUND( |
| 184 | + ((statistics->'execution_statistics'->'cpuSQLNanos'->>'mean')::FLOAT8 / 1e9) * |
| 185 | + (statistics->'statistics'->>'cnt')::FLOAT8, 2 -- FLOAT8 so both operands of * share a type |
| 186 | + ) AS estimated_total_cpu_seconds, |
| 187 | + (metadata->>'fullScan')::BOOL AS full_scan, |
| 188 | + aggregated_ts |
| 189 | +FROM crdb_internal.statement_statistics |
| 190 | +WHERE aggregated_ts > now() - INTERVAL '24 hours' |
| 191 | + AND (statistics->'execution_statistics'->>'cnt') IS NOT NULL |
| 192 | + AND (statistics->'execution_statistics'->'cpuSQLNanos'->>'mean')::FLOAT8 > 0 |
| 193 | +ORDER BY estimated_total_cpu_seconds DESC |
| 194 | +LIMIT 20; |
| 195 | +``` |
| 196 | + |
| 197 | +**Focus:** `estimated_total_cpu_seconds` shows cluster impact. High mean CPU often correlates with `full_scan = true`. |
| 198 | + |
| 199 | +### Query 6: Memory and Disk Spill Detection |
| 200 | + |
| 201 | +```sql |
| 202 | +SELECT |
| 203 | + fingerprint_id, |
| 204 | + metadata->>'db' AS database, |
| 205 | + substring(metadata->>'query', 1, 150) AS query_preview, |
| 206 | + (statistics->'execution_statistics'->'maxMemUsage'->>'mean')::FLOAT8 / 1048576 AS mean_mem_mb, |
| 207 | + (statistics->'execution_statistics'->'maxMemUsage'->>'max')::FLOAT8 / 1048576 AS max_mem_mb, |
| 208 | + (statistics->'execution_statistics'->'maxDiskUsage'->>'mean')::FLOAT8 / 1048576 AS mean_disk_mb, |
| 209 | + (statistics->'execution_statistics'->'maxDiskUsage'->>'max')::FLOAT8 / 1048576 AS max_disk_mb, |
| 210 | + metadata->>'stmtType' AS statement_type, |
| 211 | + aggregated_ts |
| 212 | +FROM crdb_internal.statement_statistics |
| 213 | +WHERE aggregated_ts > now() - INTERVAL '24 hours' |
| 214 | + AND (statistics->'execution_statistics'->>'cnt') IS NOT NULL |
| 215 | + AND (statistics->'execution_statistics'->'maxDiskUsage'->>'mean')::FLOAT8 > 0 -- Has disk spills |
| 216 | +ORDER BY (statistics->'execution_statistics'->'maxDiskUsage'->>'mean')::FLOAT8 DESC |
| 217 | +LIMIT 20; |
| 218 | +``` |
| 219 | + |
| 220 | +**Interpretation:** Disk usage > 0 = the statement exceeded its memory budget and spilled to disk, which is far slower than in-memory execution. Common for large aggregations, sorts, hash joins. Fix with indexes or increased `sql.distsql.temp_storage.workmem`. |
| 221 | + |
| 222 | +### Query 7: Error-Prone Statements |
| 223 | + |
| 224 | +```sql |
| 225 | +SELECT |
| 226 | + fingerprint_id, |
| 227 | + metadata->>'db' AS database, |
| 228 | + substring(metadata->>'query', 1, 150) AS query_preview, |
| 229 | + (statistics->'statistics'->>'cnt')::INT AS total_executions, |
| 230 | + COALESCE((statistics->'statistics'->>'failureCount')::INT, 0) AS failure_count, |
| 231 | + ROUND( |
| 232 | + COALESCE((statistics->'statistics'->>'failureCount')::INT, 0)::NUMERIC / |
| 233 | + NULLIF((statistics->'statistics'->>'cnt')::INT, 0) * 100, 2 |
| 234 | + ) AS failure_rate_pct, |
| 235 | + aggregated_ts |
| 236 | +FROM crdb_internal.statement_statistics |
| 237 | +WHERE aggregated_ts > now() - INTERVAL '24 hours' |
| 238 | + AND (statistics->'statistics'->>'cnt')::INT > 10 |
| 239 | + AND COALESCE((statistics->'statistics'->>'failureCount')::INT, 0) > 0 |
| 240 | +ORDER BY failure_rate_pct DESC, failure_count DESC |
| 241 | +LIMIT 20; |
| 242 | +``` |
| 243 | + |
| 244 | +**Common causes:** Constraint violations, query timeouts, transaction retry errors (40001), permission denied. |
| 245 | + |
| 246 | +## Common Workflows |
| 247 | + |
| 248 | +### Workflow 1: Slowness Investigation |
| 249 | + |
| 250 | +1. **Identify slow fingerprints:** Run Query 1 with 24h window, focus on `mean_run_lat_seconds > 5` and high execution counts |
| 251 | +2. **Check for full scans:** Filter `full_scan = true`, review `index_recommendations` |
| 252 | +3. **Correlate to applications:** Group by `app_name`, contact teams with specific patterns |
| 253 | +4. **Cross-reference live activity:** If ongoing, use triaging-live-sql-activity to cancel runaway queries |
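| | + |
| | +Step 3 can be sketched as a per-application rollup. This assumes the top-level `app_name` column of `crdb_internal.statement_statistics` and reuses the 1-second threshold from Query 1; adjust both to your environment: |
| | + |
| | +```sql |
| | +-- Count slow fingerprints and their executions per application over the last 24 hours. |
| | +SELECT |
| | + app_name, |
| | + COUNT(DISTINCT fingerprint_id) AS slow_fingerprints, |
| | + SUM((statistics->'statistics'->>'cnt')::INT) AS executions |
| | +FROM crdb_internal.statement_statistics |
| | +WHERE aggregated_ts > now() - INTERVAL '24 hours' |
| | + AND (statistics->'statistics'->'runLat'->>'mean')::FLOAT8 > 1.0 |
| | +GROUP BY app_name |
| | +ORDER BY executions DESC |
| | +LIMIT 20; |
| | +``` |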
| 254 | + |
| 255 | +### Workflow 2: Contention Analysis |
| 256 | + |
| 257 | +1. **Find high-contention statements:** Run Query 4, focus on `contention_pct_of_runtime > 20%` |
| 258 | +2. **Check plan stability:** Run Query 3 for contending fingerprints (plan changes affect lock order) |
| 259 | +3. **Remediate:** Batch operations, use `SELECT ... FOR UPDATE`, partition hot tables, denormalize schema |
| 260 | + |
| 261 | +### Workflow 3: Admission Control Debugging |
| 262 | + |
| 263 | +1. **Identify admission waits:** Run Query 2, calculate wait ratio |
| 264 | +2. **Correlate with CPU:** Run Query 5 for same window, cross-reference fingerprint IDs |
| 265 | +3. **Analyze time patterns:** Group by `aggregated_ts` to find peak periods |
| 266 | +4. **Triage:** Short-term: spread batch jobs; Long-term: add capacity, optimize queries |
| 267 | + |
| 268 | +### Workflow 4: Memory Spill Investigation |
| 269 | + |
| 270 | +1. **Find spilling statements:** Run Query 6, focus on `max_disk_mb > 100` |
| 271 | +2. **Analyze patterns:** Identify large `GROUP BY`, `ORDER BY`, hash joins |
| 272 | +3. **Remediate:** Add indexes, increase workmem (with caution), rewrite queries, use materialized views |
| 273 | + |
| 274 | +## Safety Considerations |
| 275 | + |
| 276 | +**Read-only operations:** All queries are `SELECT` statements against production-approved `crdb_internal.statement_statistics`. |
| 277 | + |
| 278 | +**Performance impact:** |
| 279 | + |
| 280 | +| Consideration | Impact | Mitigation | |
| 281 | +|---------------|--------|------------| |
| 282 | +| Large table | Many rows with high statement diversity | Always use time filters and `LIMIT` | |
| 283 | +| JSON parsing | CPU overhead | Use specific time windows, avoid tight loops | |
| 284 | +| Broad windows | 7-day queries = more rows | Default to 24h; expand only when needed | |
| 285 | + |
| 286 | +**Privacy:** Use `VIEWACTIVITYREDACTED` to redact query constants in multi-tenant environments. |
| 287 | + |
| 288 | +## Troubleshooting |
| 289 | + |
| 290 | +| Issue | Cause | Fix | |
| 291 | +|-------|-------|-----| |
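| 292 | +| Empty results | No data in the window, or statement stats collection disabled | Check that `sql.metrics.statement_details.enabled` and `sql.stats.flush.enabled` are `true` | |
| 293 | +| `column does not exist` | Misspelled column name or version mismatch (a mistyped JSON key returns NULL rather than an error) | Verify column names against your CockroachDB version | |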
| 294 | +| NULL in sampled metrics | Metric not sampled in bucket | Filter: `WHERE (statistics->'execution_statistics'->>'cnt') IS NOT NULL` | |
| 295 | +| Query text shows `<hidden>` | Using VIEWACTIVITYREDACTED | Expected; use VIEWACTIVITY if authorized | |
| 296 | +| "invalid input syntax for type json" | Malformed JSON path | Check operators: `->` for JSON, `->>` for text | |
| 297 | +| Very slow query | Large table, no time filter | Always add time window and LIMIT | |
| 298 | +| Empty `index_recommendations` | No recommendations or optimal | Normal if indexes exist | |
| 299 | + |
| 300 | +## Key Considerations |
| 301 | + |
| 302 | +- **Time windows:** Default to 24h; expand to 7d for trends |
| 303 | +- **Sampled metrics:** Not all executions captured; check sample size (`cnt`) |
| 304 | +- **JSON safety:** Use defensive NULL checks; handle type casting errors |
| 305 | +- **Privacy:** Use VIEWACTIVITYREDACTED in production |
| 306 | +- **Performance:** Always include time filters and LIMIT |
| 307 | +- **Complement to live triage:** Use together for complete coverage (historical + real-time) |
| 308 | +- **Data retention:** Roughly 7 days, but bounded by row count (`sql.stats.persisted_rows.max`) rather than age |
| 309 | +- **Plan instability:** Multiple plan hashes indicate optimizer/schema changes |
| 310 | + |
| 311 | +## References |
| 312 | + |
| 313 | +**Skill references:** |
| 314 | +- [JSON field schema and extraction](references/json-field-reference.md) |
| 315 | +- [Metrics catalog and units](references/metrics-and-units.md) |
| 316 | +- [SQL query variations](references/sql-query-variations.md) |
| 317 | +- [RBAC and privileges](../triaging-live-sql-activity/references/permissions.md) (shared with triaging-live-sql-activity) |
| 318 | + |
| 319 | +**Official CockroachDB Documentation:** |
| 320 | +- [crdb_internal](https://www.cockroachlabs.com/docs/stable/crdb-internal.html) |
| 321 | +- [Statements Page (DB Console)](https://www.cockroachlabs.com/docs/stable/ui-statements-page.html) |
| 322 | +- [Monitor and Analyze Transaction Contention](https://www.cockroachlabs.com/docs/stable/monitor-and-analyze-transaction-contention.html) |
| 323 | +- [VIEWACTIVITY privilege](https://www.cockroachlabs.com/docs/stable/security-reference/authorization.html#supported-privileges) |
| 324 | + |
| 325 | +**Related skills:** |
| 326 | +- [triaging-live-sql-activity](../triaging-live-sql-activity/SKILL.md) - For immediate triage of currently running queries |