Commit 618140c

feat: restructure API runtime monitor to 6-part Agent Skills format
Major updates:
- Restructure to 6-part Agent Skills format (Overview → Constraints)
- Add all 6 shared knowledge files to Phase 2 references
- Add detailed data source explanation (ltxvapi tables and GPU cost table)
- Add percentile calculations (P50/P95/P99 latency)
- Add error type separation (infrastructure vs applicative)
- Add baseline comparisons (7-day rolling average)
- Add alert routing by error type (Engineering vs API/Product team)

Performance monitoring coverage:
- P95 latency spikes (> 2x baseline or > 60s)
- Error rate increases (> 5% or DoD > 50%)
- Throughput drops (> 30% DoD/WoW)
- Queue time issues (> 50% of processing time)
- Infrastructure errors (> 10 requests/hour)

Benefits:
- Comprehensive API performance monitoring
- Endpoint/model/org breakdown for root cause
- Error type routing to appropriate teams
- Clear structure (6-part spec)
- Baseline-driven alerting
1 parent fdd2727 commit 618140c

File tree

1 file changed: +257 −78 lines
---
name: api-runtime-monitor
description: "Monitor LTX API runtime performance, latency, error rates, and throughput. Detects performance degradation and errors. Use when: (1) detecting API latency issues, (2) alerting on error rate spikes, (3) investigating throughput drops by endpoint/model/org."
tags: [monitoring, api, performance, latency, errors, throughput]
---

# API Runtime Monitor
## 1. Overview (Why?)

LTX API performance varies by endpoint, model, and customer organization. Latency issues, error rate spikes, and throughput drops can indicate infrastructure problems, model regressions, or customer-specific issues that require engineering intervention.

This skill provides **autonomous API runtime monitoring** that detects performance degradation (P95 latency spikes), error rate increases, throughput drops, and queue time issues, with breakdown by endpoint, model, and organization for root cause analysis.

**Problem solved**: Detect API performance problems and errors before they impact customer experience, with segment-level (endpoint/model/org) root cause identification.

## 2. Requirements (What?)

Monitor these outcomes autonomously:

- [ ] P95 latency spikes (> 2x baseline or > 60s)
- [ ] Error rate increases (> 5% or DoD increase > 50%)
- [ ] Throughput drops (> 30% DoD/WoW)
- [ ] Excessive queue time (> 50% of processing time)
- [ ] Infrastructure errors (> 10 requests/hour)
- [ ] Alerts include breakdown by endpoint, model, organization
- [ ] Results formatted by priority (infrastructure vs applicative errors)
- [ ] Findings routed to the appropriate team (API team or Engineering)
## 3. Progress Tracker

* [ ] Read shared knowledge (schema, metrics, performance patterns)
* [ ] Identify data source (ltxvapi tables or GPU cost table)
* [ ] Write monitoring SQL with percentile calculations
* [ ] Execute query for target date range
* [ ] Analyze results by endpoint, model, organization
* [ ] Separate infrastructure vs applicative errors
* [ ] Present findings with performance breakdown
* [ ] Route alerts to appropriate team
## 4. Implementation Plan

### Phase 1: Read Alert Thresholds

**Generic thresholds** (data-driven analysis pending):
- P95 latency > 2x baseline or > 60s
- Error rate > 5% or DoD increase > 50%
- Throughput drops > 30% DoD/WoW
- Queue time > 50% of processing time
- Infrastructure errors > 10 requests/hour

> [!IMPORTANT]
> These are generic thresholds. Consider creating production thresholds based on endpoint/model-specific analysis (similar to usage/GPU cost monitoring).
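The Phase 1 thresholds can be expressed as data rather than prose. The following is an illustrative Python sketch (not part of the skill's SQL workflow); all field names such as `p95_baseline_ms` and `throughput_dod_pct` are hypothetical, standing in for whatever the monitoring query actually returns.

```python
# Illustrative sketch: the generic Phase 1 thresholds as a lookup table, plus
# a helper that reports which thresholds a day's metrics breach.
# All metric field names below are hypothetical placeholders.

THRESHOLDS = {
    "p95_latency_ratio": 2.0,       # P95 > 2x baseline
    "p95_latency_abs_ms": 60_000,   # or > 60s absolute
    "error_rate_pct": 5.0,          # error rate > 5%
    "error_rate_dod_pct": 50.0,     # or DoD increase > 50%
    "throughput_drop_pct": 30.0,    # throughput drop > 30% DoD/WoW
    "queue_share_pct": 50.0,        # queue time > 50% of processing time
}

def breached(m: dict) -> list[str]:
    """Return the names of the thresholds this metrics row breaches."""
    flags = []
    if (m["p95_latency_ms"] > THRESHOLDS["p95_latency_abs_ms"]
            or m["p95_latency_ms"] > THRESHOLDS["p95_latency_ratio"] * m["p95_baseline_ms"]):
        flags.append("p95_latency")
    if (m["error_rate_pct"] > THRESHOLDS["error_rate_pct"]
            or m["error_rate_dod_pct"] > THRESHOLDS["error_rate_dod_pct"]):
        flags.append("error_rate")
    if m["throughput_dod_pct"] < -THRESHOLDS["throughput_drop_pct"]:
        flags.append("throughput")
    if 100 * m["avg_queue_time_ms"] / m["avg_processing_time_ms"] > THRESHOLDS["queue_share_pct"]:
        flags.append("queue_time")
    return flags
```

Keeping thresholds in one structure makes the later move to endpoint/model-specific production thresholds a data change rather than a logic change.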
### Phase 2: Read Shared Knowledge

Before writing SQL, read:

- **`shared/product-context.md`** — LTX products, user types, business model, API context
- **`shared/bq-schema.md`** — GPU cost table (has API runtime data), ltxvapi tables, API request schema
- **`shared/metric-standards.md`** — Performance metric patterns (latency, error rates, throughput)
- **`shared/event-registry.yaml`** — Feature events (if analyzing event-driven metrics)
- **`shared/gpu-cost-query-templates.md`** — GPU cost queries (if analyzing cost-related performance)
- **`shared/gpu-cost-analysis-patterns.md`** — Cost analysis patterns (if analyzing cost-related performance)

**Data nuances**:
- Primary table: `ltx-dwh-prod-processed.web.ltxvapi_api_requests_with_be_costs`
- Alternative: `ltx-dwh-prod-processed.gpu_costs.gpu_request_attribution_and_cost`
- Partitioned by `action_ts` (TIMESTAMP) or `dt` (DATE) — filter for performance
- Key columns: `request_processing_time_ms`, `request_inference_time_ms`, `request_queue_time_ms`, `result`, `endpoint`, `model_type`, `org_name`
### Phase 3: Identify Data Source

**PREFERRED**: Use `ltxvapi_api_requests_with_be_costs` for API runtime metrics.

**Key columns**:
- `request_processing_time_ms`: Total time from request submission to completion
- `request_inference_time_ms`: GPU processing time (actual model inference)
- `request_queue_time_ms`: Time waiting in queue before processing starts
- `result`: Request outcome (success, failed, timeout, etc.)
- `error_type` or `error_source`: Classification of errors (infrastructure vs applicative)
- `endpoint`: API endpoint called (e.g., /generate, /upscale)
- `model_type`: Model used (ltxv2, retake, etc.)
- `org_name`: Customer organization making the request

> [!IMPORTANT]
> Verify the column name in the actual schema: `error_type` vs `error_source`.
### Phase 4: Write Monitoring SQL

**PREFERRED: Calculate percentiles and error rates with baseline comparisons**

```sql
WITH api_metrics AS (
  SELECT
    DATE(action_ts) AS dt,
    endpoint,
    model_type,
    org_name,
    COUNT(*) AS total_requests,
    COUNTIF(result = 'success') AS successful_requests,
    COUNTIF(result != 'success') AS failed_requests,
    SAFE_DIVIDE(COUNTIF(result != 'success'), COUNT(*)) * 100 AS error_rate_pct,
    APPROX_QUANTILES(request_processing_time_ms, 100)[OFFSET(50)] AS p50_latency_ms,
    APPROX_QUANTILES(request_processing_time_ms, 100)[OFFSET(95)] AS p95_latency_ms,
    APPROX_QUANTILES(request_processing_time_ms, 100)[OFFSET(99)] AS p99_latency_ms,
    AVG(request_queue_time_ms) AS avg_queue_time_ms,
    AVG(request_inference_time_ms) AS avg_inference_time_ms
  FROM `ltx-dwh-prod-processed.web.ltxvapi_api_requests_with_be_costs`
  -- 8 full days: 7 days of baseline history plus the target day.
  -- Day boundaries (rather than CURRENT_TIMESTAMP offsets) avoid truncating
  -- the target day partway through.
  WHERE action_ts >= TIMESTAMP(DATE_SUB(CURRENT_DATE(), INTERVAL 8 DAY))
    AND action_ts < TIMESTAMP(CURRENT_DATE())
  GROUP BY dt, endpoint, model_type, org_name
),
metrics_with_baseline AS (
  SELECT
    *,
    AVG(p95_latency_ms) OVER (
      PARTITION BY endpoint, model_type
      ORDER BY dt ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING
    ) AS p95_latency_baseline_7d,
    AVG(error_rate_pct) OVER (
      PARTITION BY endpoint, model_type
      ORDER BY dt ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING
    ) AS error_rate_baseline_7d
  FROM api_metrics
)
SELECT * FROM metrics_with_baseline
WHERE dt = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY);
```
**Key patterns**:
- **Percentiles**: Use `APPROX_QUANTILES` for P50/P95/P99
- **Error rate**: `SAFE_DIVIDE(failed, total) * 100`
- **Baseline**: 7-day rolling average by endpoint and model
- **Time window**: Rolling 7-day baseline (shorter than usage monitoring due to higher-frequency data)
### Phase 5: Execute Query

Run the query using:
```bash
bq --project_id=ltx-dwh-explore query --use_legacy_sql=false --format=pretty "
<query>
"
```
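If the execution step is scripted rather than run by hand, building the `bq` invocation as an argument list avoids shell-quoting problems when the SQL itself contains quotes. A minimal sketch, assuming the same flags as the command above (`run_query` requires the `bq` CLI to be installed and authenticated):

```python
# Illustrative helper: build the Phase 5 bq invocation as an argument list,
# so the SQL string needs no shell escaping.
import subprocess

def bq_query_cmd(sql: str, project: str = "ltx-dwh-explore") -> list[str]:
    return [
        "bq", f"--project_id={project}", "query",
        "--use_legacy_sql=false", "--format=pretty",
        sql,
    ]

def run_query(sql: str) -> str:
    """Run the query and return stdout (requires the bq CLI)."""
    result = subprocess.run(bq_query_cmd(sql), check=True,
                            capture_output=True, text=True)
    return result.stdout
```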
### Phase 6: Analyze Results

**For latency trends**:
- Compare P95 latency vs baseline (7-day avg)
- Flag if P95 > 2x baseline or > 60s absolute
- Identify which endpoint/model/org drove spikes

**For error rate analysis**:
- Compare error rate vs baseline
- Separate errors by `error_type`/`error_source` (infrastructure vs applicative)
- Flag if error rate > 5% or DoD increase > 50%

**For throughput**:
- Track requests per hour/day by endpoint
- Flag throughput drops > 30% DoD/WoW
- Identify which endpoints lost traffic

**For queue analysis**:
- Calculate queue time as % of total processing time
- Flag if queue time > 50% of processing time
- Excessive queue time indicates capacity/scaling issues
### Phase 7: Present Findings

Format results with:
- **Summary**: Key finding (e.g., "P95 latency spiked to 85s for /v1/text-to-video")
- **Root cause**: Which endpoint/model/org drove the issue
- **Breakdown**: Performance metrics by dimension
- **Error classification**: Infrastructure vs applicative errors
- **Recommendation**: Route to API team (applicative) or Engineering team (infrastructure)

**Alert format**:
```
⚠️ API PERFORMANCE ALERT:
• Endpoint: /v1/text-to-video
  Model: ltxv2
  Metric: P95 Latency
  Current: 85s | Baseline: 30s
  Change: +183%

  Error rate: 8.2% (baseline: 2.1%)
  Error type: Infrastructure

Recommendation: Alert Engineering team for infrastructure issue
```
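Rendering that alert text can be sketched as a small formatter. This is illustrative Python, not part of the skill; the input field names (`current_s`, `error_baseline_pct`, etc.) are assumptions standing in for the query's actual output columns.

```python
# Illustrative sketch: render one alert in the format shown above.
# All input field names are hypothetical placeholders.

def format_alert(a: dict) -> str:
    change_pct = 100 * (a["current_s"] - a["baseline_s"]) / a["baseline_s"]
    # Infrastructure errors go to Engineering; applicative errors to API/Product.
    team = "Engineering" if a["error_type"] == "Infrastructure" else "API/Product"
    return (
        "⚠️ API PERFORMANCE ALERT:\n"
        f"• Endpoint: {a['endpoint']}\n"
        f"  Model: {a['model']}\n"
        f"  Metric: {a['metric']}\n"
        f"  Current: {a['current_s']}s | Baseline: {a['baseline_s']}s\n"
        f"  Change: {change_pct:+.0f}%\n\n"
        f"  Error rate: {a['error_rate_pct']}% (baseline: {a['error_baseline_pct']}%)\n"
        f"  Error type: {a['error_type']}\n\n"
        f"Recommendation: Alert {team} team for {a['error_type'].lower()} issue"
    )
```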
### Phase 8: Route Alert

For ongoing monitoring:
1. Save the SQL query
2. Set it up as a BigQuery scheduled query or Hex Thread
3. Configure notification by error type:
   - Infrastructure errors → Engineering team
   - Applicative errors → API/Product team
4. Include endpoint, model, and org details in the alert
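The routing rule in step 3 can be kept as a small table so new error classes are a data change. A minimal sketch, assuming hypothetical team identifiers; the default-to-Engineering fallback for unknown classifications is my assumption, not something the skill specifies.

```python
# Illustrative routing table for Phase 8: map error classification to the
# team that should receive the alert. Team identifiers are hypothetical.

ROUTING = {
    "infrastructure": "engineering",   # infrastructure errors → Engineering team
    "applicative": "api-product",      # applicative errors → API/Product team
}

def route(error_type: str) -> str:
    # Assumption: unknown classifications default to Engineering for triage.
    return ROUTING.get(error_type.lower(), "engineering")
```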
## 5. Context & References

### Shared Knowledge
- **`shared/product-context.md`** — LTX products and API context
- **`shared/bq-schema.md`** — API tables and GPU cost table schema
- **`shared/metric-standards.md`** — Performance metric patterns
- **`shared/event-registry.yaml`** — Feature events (if analyzing event-driven metrics)
- **`shared/gpu-cost-query-templates.md`** — GPU cost queries (if analyzing cost-related performance)
- **`shared/gpu-cost-analysis-patterns.md`** — Cost analysis patterns
### Data Sources

**Primary table**: `ltx-dwh-prod-processed.web.ltxvapi_api_requests_with_be_costs`
- Partitioned by `action_ts` (TIMESTAMP)
- Key columns: `request_processing_time_ms`, `request_inference_time_ms`, `request_queue_time_ms`, `result`, `endpoint`, `model_type`, `org_name`

**Alternative**: `ltx-dwh-prod-processed.gpu_costs.gpu_request_attribution_and_cost`
- Contains API runtime data but is not the primary source for performance metrics

### Endpoints
Common endpoints: `/v1/text-to-video`, `/v1/image-to-video`, `/v1/upscale`, `/generate`

### Models
Common models: `ltxv2`, `retake`, etc.
## 6. Constraints & Done

### DO NOT

- **DO NOT** use absolute thresholds without baseline comparison
- **DO NOT** mix infrastructure and applicative errors in the same alert
- **DO NOT** skip partition filtering — always filter on `action_ts` or `dt` for performance
- **DO NOT** forget to separate errors by error type/source

> [!IMPORTANT]
> Verify the column name in the schema: `error_type` vs `error_source`.
### DO

- **DO** use `APPROX_QUANTILES` for percentile calculations (P50, P95, P99)
- **DO** separate errors by `error_source` (infrastructure vs applicative)
- **DO** filter by `result = 'success'` for success rate calculations
- **DO** break down by endpoint, model, and organization for detailed analysis
- **DO** compare current performance against the historical baseline (7-day rolling avg)
- **DO** alert the Engineering team for infrastructure errors
- **DO** alert the Product/API team for applicative errors
- **DO** partition on `action_ts` or `dt` for performance
- **DO** use `ltx-dwh-explore` as the execution project
- **DO** calculate error rate with `SAFE_DIVIDE(failed, total) * 100`
- **DO** flag P95 latency > 2x baseline or > 60s
- **DO** flag error rate > 5% or DoD increase > 50%
- **DO** flag throughput drops > 30% DoD/WoW
- **DO** flag queue time > 50% of processing time
- **DO** flag infrastructure errors > 10 requests/hour
- **DO** include endpoint, model, and org details in all alerts
- **DO** validate unusual patterns with the API/Engineering team before alerting
### Completion Criteria

- ✅ All performance metrics monitored (latency, errors, throughput, queue time)
- ✅ Alerts fire with thresholds (generic, pending production analysis)
- ✅ Endpoint/model/org breakdown provided
- ✅ Errors separated by type (infrastructure vs applicative)
- ✅ Findings routed to the appropriate team
- ✅ Partition filtering applied for performance
- ✅ Column name verified (`error_type` vs `error_source`)