Commit 0a9254e

Kevin Ngo authored and committed
add profiling-statement-fingerprints skill
Adds new skill for analyzing historical statement performance patterns using crdb_internal.statement_statistics. Includes 7 diagnostic queries covering latency, CPU, contention, admission control, memory spills, and errors. Also updates triaging-live-sql-activity with bidirectional cross-references to clarify the complementary use cases (historical vs live analysis).
1 parent 963b635 commit 0a9254e

5 files changed: +1910 -0 lines changed
Lines changed: 326 additions & 0 deletions
@@ -0,0 +1,326 @@
---
name: profiling-statement-fingerprints
description: Ranks and analyzes statement fingerprints using aggregated SQL statistics from crdb_internal.statement_statistics to identify slow, resource-intensive, or error-prone query patterns. Use when investigating historical performance trends, identifying optimization opportunities, or diagnosing recurring slowness without DB Console access.
compatibility: Requires SQL access with VIEWACTIVITY or VIEWACTIVITYREDACTED cluster privilege. Uses crdb_internal.statement_statistics (production-safe). Execution statistics fields are sampled and may be sparse.
metadata:
  author: cockroachdb
  version: "1.0"
---

# Profiling Statement Fingerprints

Analyzes historical statement performance patterns using aggregated SQL statistics to identify slow, resource-intensive, or error-prone query fingerprints. Uses `crdb_internal.statement_statistics` for time-windowed analysis of latency, CPU, contention, admission delays, and failure rates - entirely via SQL without requiring DB Console access.

**Complement to triaging-live-sql-activity:** This skill analyzes historical patterns; for immediate triage of currently running queries, see [triaging-live-sql-activity](../triaging-live-sql-activity/SKILL.md).

## When to Use This Skill

- Identify slowest statement fingerprints over past hours/days/weeks
- Find queries with high CPU consumption, contention, or admission waits
- Investigate performance regressions or plan changes
- Locate full table scans or missing indexes via index recommendations
- Analyze resource consumption by application or database
- SQL-only historical analysis without DB Console access

**For immediate incident response:** Use [triaging-live-sql-activity](../triaging-live-sql-activity/SKILL.md) to triage currently running queries and cancel runaway work.

## Prerequisites

- SQL connection to CockroachDB cluster
- `VIEWACTIVITY` or `VIEWACTIVITYREDACTED` cluster privilege for cluster-wide visibility
- Statement statistics collection enabled (default): `sql.metrics.statement_details.enabled = true` (not `sql.stats.automatic_collection.enabled`, which governs optimizer table statistics)

**Check collection status:**
```sql
SHOW CLUSTER SETTING sql.metrics.statement_details.enabled; -- Should return: true
```

See [triaging-live-sql-activity permissions reference](../triaging-live-sql-activity/references/permissions.md) for RBAC setup (same privileges).

## Core Concepts

### Statement Fingerprints vs Live Queries

**Statement fingerprint:** Normalized SQL pattern with constants replaced by placeholders (e.g., `SELECT * FROM users WHERE id = 123` is recorded as `SELECT * FROM users WHERE id = _`)

**Key differences:**
- **Time scope:** Historical hourly buckets vs real-time current state
- **Granularity:** Aggregated pattern statistics vs individual execution instances

### Time-Series Bucketing

- **aggregated_ts:** Hourly UTC buckets (e.g., `2026-02-21 14:00:00` = 14:00-14:59 executions)
- **Data retention:** Default ~7 days; persisted rows are also capped by `sql.stats.persisted_rows.max`, so high statement diversity can shorten the effective window
- **Best practice:** Always filter by time window: `WHERE aggregated_ts > now() - INTERVAL '24 hours'`

### Aggregated vs Sampled Metrics

| Metric Category | JSON Path | Scope | Use Case |
|-----------------|-----------|-------|----------|
| **Aggregated** | `statistics.statistics.*` | All executions | Latency, row counts, execution counts |
| **Sampled** | `statistics.execution_statistics.*` | Sampled executions only (rate governed by `sql.txn_stats.sample_rate`) | CPU, contention, admission wait, memory/disk |

**Critical:** Always check sampled metrics presence: `WHERE (statistics->'execution_statistics'->>'cnt') IS NOT NULL`

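To judge how much weight a sampled metric deserves, it can help to compare the sampled execution count with the total execution count for the same bucket. The query below is a minimal sketch built only from the fields used elsewhere in this skill; the column aliases are illustrative.

```sql
-- Compare how many executions were sampled against the total per bucket.
-- A fingerprint with thousands of executions but one or two samples gives
-- a weak basis for CPU, contention, or memory conclusions.
SELECT
  fingerprint_id,
  substring(metadata->>'query', 1, 150) AS query_preview,
  (statistics->'statistics'->>'cnt')::INT AS total_executions,
  COALESCE((statistics->'execution_statistics'->>'cnt')::INT, 0) AS sampled_executions,
  aggregated_ts
FROM crdb_internal.statement_statistics
WHERE aggregated_ts > now() - INTERVAL '24 hours'
ORDER BY (statistics->'statistics'->>'cnt')::INT DESC
LIMIT 20;
```

Rows where `sampled_executions` is 0 or very small relative to `total_executions` are best treated as having no reliable execution statistics.
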
### JSON Field Extraction

**Operators:**
- `->`: Extract JSON object (returns JSON)
- `->>`: Extract as text (returns text)
- `::TYPE`: Cast to specific type

**Examples:**
```sql
metadata->>'db'                                                              -- Database name
(statistics->'statistics'->>'cnt')::INT                                      -- Execution count
(statistics->'statistics'->'runLat'->>'mean')::FLOAT8                        -- Mean latency (seconds)
(statistics->'execution_statistics'->'cpuSQLNanos'->>'mean')::FLOAT8 / 1e9   -- CPU (convert nanos to seconds)
```

**Units:** Latency and contention = seconds, CPU/admission = nanoseconds (÷ 1e9), Memory/disk = bytes (÷ 1048576 for MB)

See [JSON field reference](references/json-field-reference.md) for complete schema.

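The operators above can be tried without touching `crdb_internal` at all. The following sketch runs them against a hand-written JSONB literal; the literal is illustrative sample data shaped like the `statistics` column, not real cluster output.

```sql
-- Illustrative only: a hand-written JSONB value standing in for the statistics column.
WITH sample AS (
  SELECT '{"statistics": {"cnt": 42, "runLat": {"mean": 0.25}}, "execution_statistics": {"cnt": 4, "cpuSQLNanos": {"mean": 1500000}}}'::JSONB AS statistics
)
SELECT
  statistics->'statistics' AS stats_object,                 -- -> keeps the value as JSON
  statistics->'statistics'->>'cnt' AS cnt_text,             -- ->> returns text
  (statistics->'statistics'->>'cnt')::INT AS cnt_int,       -- cast text to INT for arithmetic
  (statistics->'execution_statistics'->'cpuSQLNanos'->>'mean')::FLOAT8 / 1e9 AS mean_cpu_seconds
FROM sample;
```

With this literal, `cnt_int` is 42 and `mean_cpu_seconds` is 0.0015 (1,500,000 ns ÷ 1e9). A misspelled key such as `'cpuSqlNanos'` yields NULL rather than an error, which is why the queries in this skill guard with NULL checks.
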
## Core Diagnostic Queries

### Query 1: Top Statements by Mean Run Latency

```sql
SELECT
  fingerprint_id,
  metadata->>'db' AS database,
  metadata->>'query' AS query_text,
  (statistics->'statistics'->>'cnt')::INT AS execution_count,
  (statistics->'statistics'->'runLat'->>'mean')::FLOAT8 AS mean_run_lat_seconds,
  (statistics->'statistics'->'runLat'->>'max')::FLOAT8 AS max_run_lat_seconds,
  (metadata->>'fullScan')::BOOL AS full_scan,
  index_recommendations,  -- top-level column of the view, not a key inside metadata
  aggregated_ts
FROM crdb_internal.statement_statistics
WHERE aggregated_ts > now() - INTERVAL '24 hours'
  AND (statistics->'statistics'->'runLat'->>'mean')::FLOAT8 > 1.0  -- > 1 second mean latency
ORDER BY (statistics->'statistics'->'runLat'->>'mean')::FLOAT8 DESC
LIMIT 20;
```

**Focus:** Slowest queries; check `full_scan` and `index_recommendations` for optimization opportunities.

### Query 2: Admission Control Impact

```sql
SELECT
  fingerprint_id,
  metadata->>'db' AS database,
  metadata->>'query' AS query_text,
  (statistics->'statistics'->>'cnt')::INT AS execution_count,
  (statistics->'execution_statistics'->'admissionWaitTime'->>'mean')::FLOAT8 / 1e9 AS mean_admission_wait_seconds,
  (statistics->'statistics'->'runLat'->>'mean')::FLOAT8 AS mean_run_lat_seconds,
  aggregated_ts
FROM crdb_internal.statement_statistics
WHERE aggregated_ts > now() - INTERVAL '24 hours'
  AND (statistics->'execution_statistics'->>'cnt') IS NOT NULL
  AND (statistics->'execution_statistics'->'admissionWaitTime'->>'mean')::FLOAT8 > 0
ORDER BY (statistics->'execution_statistics'->'admissionWaitTime'->>'mean')::FLOAT8 DESC
LIMIT 20;
```

**Interpretation:** High admission wait = cluster at resource limits (CPU, memory, I/O). A wait-to-runtime ratio (`mean_admission_wait_seconds / mean_run_lat_seconds`) above 1.0 means statements wait longer than they run, which indicates severe queueing.

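The wait-to-runtime ratio mentioned in the interpretation can be computed directly in SQL rather than by hand. This is a sketch that assumes the `admissionWaitTime` field used in Query 2 is present in your CockroachDB version; the `wait_to_runtime_ratio` alias is illustrative.

```sql
-- Rank fingerprints by how long they queue relative to how long they run.
SELECT
  fingerprint_id,
  substring(metadata->>'query', 1, 150) AS query_preview,
  (statistics->'execution_statistics'->'admissionWaitTime'->>'mean')::FLOAT8 / 1e9 AS mean_admission_wait_seconds,
  (statistics->'statistics'->'runLat'->>'mean')::FLOAT8 AS mean_run_lat_seconds,
  ROUND(
    ((statistics->'execution_statistics'->'admissionWaitTime'->>'mean')::FLOAT8 / 1e9) /
    NULLIF((statistics->'statistics'->'runLat'->>'mean')::FLOAT8, 0), 2  -- NULLIF avoids division by zero
  ) AS wait_to_runtime_ratio,
  aggregated_ts
FROM crdb_internal.statement_statistics
WHERE aggregated_ts > now() - INTERVAL '24 hours'
  AND (statistics->'execution_statistics'->>'cnt') IS NOT NULL
  AND (statistics->'execution_statistics'->'admissionWaitTime'->>'mean')::FLOAT8 > 0
ORDER BY wait_to_runtime_ratio DESC
LIMIT 20;
```

Because the numerator comes from sampled executions and the denominator from all executions, the ratio is an approximation and is most trustworthy for fingerprints with many samples.
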
### Query 3: Plan Hash Diversity

```sql
SELECT
  fingerprint_id,
  metadata->>'db' AS database,
  metadata->>'query' AS query_text,
  COUNT(DISTINCT plan_hash) AS distinct_plan_count,
  array_agg(DISTINCT plan_hash ORDER BY plan_hash) AS plan_hashes,
  SUM((statistics->'statistics'->>'cnt')::INT) AS total_executions
FROM crdb_internal.statement_statistics
WHERE aggregated_ts > now() - INTERVAL '7 days'
GROUP BY fingerprint_id, metadata->>'db', metadata->>'query'
HAVING COUNT(DISTINCT plan_hash) > 1
ORDER BY COUNT(DISTINCT plan_hash) DESC, SUM((statistics->'statistics'->>'cnt')::INT) DESC
LIMIT 20;
```

**Interpretation:** Multiple plans indicate instability from schema changes, statistics updates, or routing changes. Performance can vary significantly between plans.

### Query 4: High Contention Statements

```sql
SELECT
  fingerprint_id,
  metadata->>'db' AS database,
  app_name AS application,  -- top-level column of the view, not a key inside metadata
  substring(metadata->>'query', 1, 150) AS query_preview,
  (statistics->'statistics'->>'cnt')::INT AS execution_count,
  -- contentionTime is recorded in seconds (unlike cpuSQLNanos), so no nanosecond conversion is applied
  (statistics->'execution_statistics'->'contentionTime'->>'mean')::FLOAT8 AS mean_contention_seconds,
  ROUND(
    (statistics->'execution_statistics'->'contentionTime'->>'mean')::FLOAT8 /
    NULLIF((statistics->'statistics'->'runLat'->>'mean')::FLOAT8, 0) * 100, 2
  ) AS contention_pct_of_runtime,
  aggregated_ts
FROM crdb_internal.statement_statistics
WHERE aggregated_ts > now() - INTERVAL '24 hours'
  AND (statistics->'execution_statistics'->>'cnt') IS NOT NULL
  AND (statistics->'execution_statistics'->'contentionTime'->>'mean')::FLOAT8 > 0
ORDER BY (statistics->'execution_statistics'->'contentionTime'->>'mean')::FLOAT8 DESC
LIMIT 20;
```

**Interpretation:** >20% contention = transaction conflicts, hot row access. Remediate with batching, transaction boundary changes, or schema redesign.

### Query 5: High CPU Consumers

```sql
SELECT
  fingerprint_id,
  metadata->>'db' AS database,
  substring(metadata->>'query', 1, 150) AS query_preview,
  (statistics->'execution_statistics'->'cpuSQLNanos'->>'mean')::FLOAT8 / 1e9 AS mean_cpu_seconds,
  (statistics->'statistics'->>'cnt')::INT AS total_executions,
  ROUND(
    ((statistics->'execution_statistics'->'cpuSQLNanos'->>'mean')::FLOAT8 / 1e9) *
    (statistics->'statistics'->>'cnt')::INT, 2
  ) AS estimated_total_cpu_seconds,
  (metadata->>'fullScan')::BOOL AS full_scan,
  aggregated_ts
FROM crdb_internal.statement_statistics
WHERE aggregated_ts > now() - INTERVAL '24 hours'
  AND (statistics->'execution_statistics'->>'cnt') IS NOT NULL
  AND (statistics->'execution_statistics'->'cpuSQLNanos'->>'mean')::FLOAT8 > 0
ORDER BY estimated_total_cpu_seconds DESC
LIMIT 20;
```

**Focus:** `estimated_total_cpu_seconds` shows cluster impact. High mean CPU often correlates with `full_scan = true`.

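Query 5 returns one row per hourly bucket, so a fingerprint that is moderately expensive every hour can be outranked by a single spike. One possible variation, sketched below, sums the per-bucket estimate across the whole window to rank fingerprints by total estimated CPU. The estimate still extrapolates sampled CPU means to all executions, so treat it as a rough ordering rather than an exact figure.

```sql
-- Roll up the per-bucket CPU estimate to one row per fingerprint.
SELECT
  fingerprint_id,
  metadata->>'db' AS database,
  substring(metadata->>'query', 1, 150) AS query_preview,
  SUM((statistics->'statistics'->>'cnt')::INT) AS total_executions,
  ROUND(
    SUM(
      (statistics->'execution_statistics'->'cpuSQLNanos'->>'mean')::FLOAT8 / 1e9 *
      (statistics->'statistics'->>'cnt')::INT
    ), 2
  ) AS estimated_total_cpu_seconds
FROM crdb_internal.statement_statistics
WHERE aggregated_ts > now() - INTERVAL '24 hours'
  AND (statistics->'execution_statistics'->>'cnt') IS NOT NULL
  AND (statistics->'execution_statistics'->'cpuSQLNanos'->>'mean')::FLOAT8 > 0
GROUP BY fingerprint_id, metadata->>'db', substring(metadata->>'query', 1, 150)
ORDER BY estimated_total_cpu_seconds DESC
LIMIT 20;
```
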
### Query 6: Memory and Disk Spill Detection

```sql
SELECT
  fingerprint_id,
  metadata->>'db' AS database,
  substring(metadata->>'query', 1, 150) AS query_preview,
  (statistics->'execution_statistics'->'maxMemUsage'->>'mean')::FLOAT8 / 1048576 AS mean_mem_mb,
  (statistics->'execution_statistics'->'maxMemUsage'->>'max')::FLOAT8 / 1048576 AS max_mem_mb,
  (statistics->'execution_statistics'->'maxDiskUsage'->>'mean')::FLOAT8 / 1048576 AS mean_disk_mb,
  (statistics->'execution_statistics'->'maxDiskUsage'->>'max')::FLOAT8 / 1048576 AS max_disk_mb,
  metadata->>'stmtType' AS statement_type,
  aggregated_ts
FROM crdb_internal.statement_statistics
WHERE aggregated_ts > now() - INTERVAL '24 hours'
  AND (statistics->'execution_statistics'->>'cnt') IS NOT NULL
  AND (statistics->'execution_statistics'->'maxDiskUsage'->>'mean')::FLOAT8 > 0  -- Has disk spills
ORDER BY (statistics->'execution_statistics'->'maxDiskUsage'->>'mean')::FLOAT8 DESC
LIMIT 20;
```

**Interpretation:** Disk usage > 0 = memory spill (~100-1000x slower than in-memory). Common for large aggregations, sorts, hash joins. Fix with indexes or increased `sql.distsql.temp_storage.workmem`.

### Query 7: Error-Prone Statements

```sql
SELECT
  fingerprint_id,
  metadata->>'db' AS database,
  substring(metadata->>'query', 1, 150) AS query_preview,
  (statistics->'statistics'->>'cnt')::INT AS total_executions,
  COALESCE((statistics->'statistics'->>'failureCount')::INT, 0) AS failure_count,
  ROUND(
    COALESCE((statistics->'statistics'->>'failureCount')::INT, 0)::NUMERIC /
    NULLIF((statistics->'statistics'->>'cnt')::INT, 0) * 100, 2
  ) AS failure_rate_pct,
  aggregated_ts
FROM crdb_internal.statement_statistics
WHERE aggregated_ts > now() - INTERVAL '24 hours'
  AND (statistics->'statistics'->>'cnt')::INT > 10
  AND COALESCE((statistics->'statistics'->>'failureCount')::INT, 0) > 0
ORDER BY failure_rate_pct DESC, failure_count DESC
LIMIT 20;
```

**Common causes:** Constraint violations, query timeouts, transaction retry errors (40001), permission denied.

## Common Workflows

### Workflow 1: Slowness Investigation

1. **Identify slow fingerprints:** Run Query 1 with 24h window, focus on `mean_run_lat_seconds > 5` and high execution counts
2. **Check for full scans:** Filter `full_scan = true`, review `index_recommendations`
3. **Correlate to applications:** Group by the `app_name` column, contact teams with specific patterns
4. **Cross-reference live activity:** If ongoing, use triaging-live-sql-activity to cancel runaway queries

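The application correlation step in this workflow can be sketched as a grouped query. It assumes the application name is exposed through the view's `app_name` column, and it reuses the 5-second threshold from step 1; both the threshold and the aliases are illustrative.

```sql
-- Summarize slow fingerprints per application over the last 24 hours.
SELECT
  app_name AS application,
  COUNT(DISTINCT fingerprint_id) AS slow_fingerprints,
  SUM((statistics->'statistics'->>'cnt')::INT) AS total_executions,
  MAX((statistics->'statistics'->'runLat'->>'mean')::FLOAT8) AS worst_mean_run_lat_seconds
FROM crdb_internal.statement_statistics
WHERE aggregated_ts > now() - INTERVAL '24 hours'
  AND (statistics->'statistics'->'runLat'->>'mean')::FLOAT8 > 5.0
GROUP BY app_name
ORDER BY total_executions DESC
LIMIT 20;
```

Applications with many slow fingerprints and high execution counts are usually the first teams worth contacting.
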
### Workflow 2: Contention Analysis

1. **Find high-contention statements:** Run Query 4, focus on `contention_pct_of_runtime > 20`
2. **Check plan stability:** Run Query 3 for contending fingerprints (plan changes affect lock order)
3. **Remediate:** Batch operations, use `SELECT FOR UPDATE`, partition hot tables, denormalize schema

### Workflow 3: Admission Control Debugging

1. **Identify admission waits:** Run Query 2, calculate the wait ratio (`mean_admission_wait_seconds / mean_run_lat_seconds`)
2. **Correlate with CPU:** Run Query 5 for same window, cross-reference fingerprint IDs
3. **Analyze time patterns:** Group by `aggregated_ts` to find peak periods
4. **Triage:** Short-term: spread batch jobs; Long-term: add capacity, optimize queries

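The time-pattern step of the admission control workflow can be sketched by collapsing all fingerprints into one row per hourly bucket. This assumes the same `admissionWaitTime` field as Query 2; the aliases are illustrative.

```sql
-- One row per hourly bucket: total load and the worst mean admission wait seen in that hour.
SELECT
  aggregated_ts,
  SUM((statistics->'statistics'->>'cnt')::INT) AS total_executions,
  ROUND(
    MAX((statistics->'execution_statistics'->'admissionWaitTime'->>'mean')::FLOAT8) / 1e9, 4
  ) AS worst_mean_admission_wait_seconds
FROM crdb_internal.statement_statistics
WHERE aggregated_ts > now() - INTERVAL '24 hours'
  AND (statistics->'execution_statistics'->>'cnt') IS NOT NULL
GROUP BY aggregated_ts
ORDER BY aggregated_ts;
```

Hours where both columns peak together suggest load-driven queueing, which points toward spreading batch jobs or adding capacity.
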
### Workflow 4: Memory Spill Investigation

1. **Find spilling statements:** Run Query 6, focus on `max_disk_mb > 100`
2. **Analyze patterns:** Identify large `GROUP BY`, `ORDER BY`, hash joins
3. **Remediate:** Add indexes, increase workmem (with caution), rewrite queries, use materialized views

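Before changing the cluster-wide work memory setting, the effect of a larger limit can be tested for a single session. The sketch below assumes the `distsql_workmem` session variable is available in your version; the `128MiB` value is an arbitrary illustration, not a recommendation.

```sql
-- Check the current per-operator memory limit for this session.
SHOW distsql_workmem;

-- Raise it for this session only (illustrative value).
SET distsql_workmem = '128MiB';

-- Re-run the spilling statement here and compare its latency and disk usage.

-- Restore the default when finished.
RESET distsql_workmem;
```

Scoping the change to one session limits the risk of memory pressure on the node, since the limit applies per operator and many concurrent queries can each use it.
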
## Safety Considerations

**Read-only operations:** All queries are `SELECT` statements against production-approved `crdb_internal.statement_statistics`.

**Performance impact:**

| Consideration | Impact | Mitigation |
|---------------|--------|------------|
| Large table | Many rows with high statement diversity | Always use time filters and `LIMIT` |
| JSON parsing | CPU overhead | Use specific time windows, avoid tight loops |
| Broad windows | 7-day queries = more rows | Default to 24h; expand only when needed |

**Privacy:** Use `VIEWACTIVITYREDACTED` to redact query constants in multi-tenant environments.

## Troubleshooting

| Issue | Cause | Fix |
|-------|-------|-----|
| Empty results | No data or stats collection disabled | Check `sql.metrics.statement_details.enabled = true` |
| `column does not exist` | Column name typo or version mismatch (a misspelled JSON key returns NULL instead) | Verify column names; check CockroachDB version |
| NULL in sampled metrics | Metric not sampled in bucket | Filter: `WHERE (statistics->'execution_statistics'->>'cnt') IS NOT NULL` |
| Query text shows `<hidden>` | Using VIEWACTIVITYREDACTED | Expected; use VIEWACTIVITY if authorized |
| "invalid input syntax for type json" | Malformed JSON path | Check operators: `->` for JSON, `->>` for text |
| Very slow query | Large table, no time filter | Always add time window and LIMIT |
| Empty `index_recommendations` | No recommendations or optimal | Normal if indexes exist |

## Key Considerations

- **Time windows:** Default to 24h; expand to 7d for trends
- **Sampled metrics:** Not all executions captured; check sample size (`execution_statistics.cnt`)
- **JSON safety:** Use defensive NULL checks; handle type casting errors
- **Privacy:** Use VIEWACTIVITYREDACTED in production
- **Performance:** Always include time filters and LIMIT
- **Complement to live triage:** Use together for complete coverage (historical + real-time)
- **Data retention:** Default ~7 days; the persisted row cap `sql.stats.persisted_rows.max` can shorten the effective window
- **Plan instability:** Multiple plan hashes indicate optimizer/schema changes

## References

**Skill references:**
- [JSON field schema and extraction](references/json-field-reference.md)
- [Metrics catalog and units](references/metrics-and-units.md)
- [SQL query variations](references/sql-query-variations.md)
- [RBAC and privileges](../triaging-live-sql-activity/references/permissions.md) (shared with triaging-live-sql-activity)

**Official CockroachDB Documentation:**
- [crdb_internal](https://www.cockroachlabs.com/docs/stable/crdb-internal.html)
- [Statements Page (DB Console)](https://www.cockroachlabs.com/docs/stable/ui-statements-page.html)
- [Monitor and Analyze Transaction Contention](https://www.cockroachlabs.com/docs/stable/monitor-and-analyze-transaction-contention.html)
- [VIEWACTIVITY privilege](https://www.cockroachlabs.com/docs/stable/security-reference/authorization.html#supported-privileges)

**Related skills:**
- [triaging-live-sql-activity](../triaging-live-sql-activity/SKILL.md) - For immediate triage of currently running queries
