---
name: api-runtime-monitor
description: "Monitors LTX API runtime performance, latency, error rates, and throughput. Detects performance degradation and errors. Use when: (1) detecting API latency issues, (2) alerting on error rate spikes, (3) investigating throughput drops by endpoint/model/org."
tags: [monitoring, api, performance, latency, errors, throughput]
---

# API Runtime Monitor

## 1. Overview (Why?)

LTX API performance varies by endpoint, model, and customer organization. Latency issues, error rate spikes, and throughput drops can indicate infrastructure problems, model regressions, or customer-specific issues that require engineering intervention.

This skill provides **autonomous API runtime monitoring** that detects performance degradation (P95 latency spikes), error rate increases, throughput drops, and queue time issues — with breakdown by endpoint, model, and organization for root cause analysis.

**Problem solved**: Detect API performance problems and errors before they impact customer experience — with segment-level (endpoint/model/org) root cause identification.

## 2. Requirements (What?)

Monitor these outcomes autonomously:

- [ ] P95 latency spikes (> 2x baseline or > 60s)
- [ ] Error rate increases (> 5% or DoD increase > 50%)
- [ ] Throughput drops (> 30% DoD/WoW)
- [ ] Excessive queue time (> 50% of processing time)
- [ ] Infrastructure errors (> 10 requests/hour)
- [ ] Alerts include breakdown by endpoint, model, organization
- [ ] Results formatted by priority (infrastructure vs applicative errors)
- [ ] Findings routed to the appropriate team (API team or Engineering)

## 3. Progress Tracker

* [ ] Read shared knowledge (schema, metrics, performance patterns)
* [ ] Identify data source (ltxvapi tables or GPU cost table)
* [ ] Write monitoring SQL with percentile calculations
* [ ] Execute query for target date range
* [ ] Analyze results by endpoint, model, organization
* [ ] Separate infrastructure vs applicative errors
* [ ] Present findings with performance breakdown
* [ ] Route alerts to the appropriate team

## 4. Implementation Plan

### Phase 1: Read Alert Thresholds

**Generic thresholds** (data-driven analysis pending):
- P95 latency > 2x baseline or > 60s
- Error rate > 5% or DoD increase > 50%
- Throughput drops > 30% DoD/WoW
- Queue time > 50% of processing time
- Infrastructure errors > 10 requests/hour

> [!IMPORTANT]
> These are generic thresholds. Consider creating production thresholds based on endpoint/model-specific analysis (similar to usage/GPU cost monitoring).
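
The threshold bullets above reduce to a handful of comparisons. As a minimal sketch (not part of the skill's tooling; the metric keys are hypothetical stand-ins for whatever columns the monitoring query actually returns):

```python
def evaluate_thresholds(m: dict) -> list[str]:
    """Apply the generic alert thresholds to one endpoint/model/org row.

    `m` holds metrics for the current day plus prior-period values;
    all key names here are illustrative assumptions.
    """
    flags = []
    # P95 latency: > 2x baseline or > 60s absolute
    if m["p95_latency_ms"] > 2 * m["p95_latency_baseline_ms"] or m["p95_latency_ms"] > 60_000:
        flags.append("p95_latency")
    # Error rate: > 5% absolute, or a day-over-day increase above 50%
    if m["error_rate_pct"] > 5 or m["error_rate_pct"] > 1.5 * m["error_rate_yesterday_pct"]:
        flags.append("error_rate")
    # Throughput: > 30% drop day-over-day
    if m["total_requests"] < 0.7 * m["total_requests_yesterday"]:
        flags.append("throughput_drop")
    # Queue time: > 50% of total processing time
    if m["avg_queue_time_ms"] > 0.5 * m["avg_processing_time_ms"]:
        flags.append("queue_time")
    # Infrastructure errors: > 10 requests/hour
    if m["infra_errors_per_hour"] > 10:
        flags.append("infra_errors")
    return flags
```

Returning a list of flags (rather than a single boolean) keeps one row able to trigger several alert types at once, which matches the multi-metric alert format in Phase 7.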

### Phase 2: Read Shared Knowledge

Before writing SQL, read:
- **`shared/product-context.md`** — LTX products, user types, business model, API context
- **`shared/bq-schema.md`** — GPU cost table (has API runtime data), ltxvapi tables, API request schema
- **`shared/metric-standards.md`** — Performance metric patterns (latency, error rates, throughput)
- **`shared/event-registry.yaml`** — Feature events (if analyzing event-driven metrics)
- **`shared/gpu-cost-query-templates.md`** — GPU cost queries (if analyzing cost-related performance)
- **`shared/gpu-cost-analysis-patterns.md`** — Cost analysis patterns (if analyzing cost-related performance)

**Data nuances**:
- Primary table: `ltx-dwh-prod-processed.web.ltxvapi_api_requests_with_be_costs`
- Alternative: `ltx-dwh-prod-processed.gpu_costs.gpu_request_attribution_and_cost`
- Partitioned by `action_ts` (TIMESTAMP) or `dt` (DATE) — always filter on the partition column for performance
- Key columns: `request_processing_time_ms`, `request_inference_time_ms`, `request_queue_time_ms`, `result`, `endpoint`, `model_type`, `org_name`

### Phase 3: Identify Data Source

✅ **PREFERRED: Use `ltxvapi_api_requests_with_be_costs` for API runtime metrics**

**Key columns**:
- `request_processing_time_ms`: Total time from request submission to completion
- `request_inference_time_ms`: GPU processing time (actual model inference)
- `request_queue_time_ms`: Time waiting in queue before processing starts
- `result`: Request outcome (success, failed, timeout, etc.)
- `error_type` or `error_source`: Classification of errors (infrastructure vs applicative)
- `endpoint`: API endpoint called (e.g., /generate, /upscale)
- `model_type`: Model used (ltxv2, retake, etc.)
- `org_name`: Customer organization making the request

> [!IMPORTANT]
> Verify the column name in the actual schema: `error_type` vs `error_source`.
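
Assuming the timing columns decompose as described (queue time and inference time are components of the total processing time), the queue-share check used in the analysis phase reduces to a simple ratio. A minimal sketch:

```python
def queue_share(queue_ms: float, processing_ms: float) -> float:
    """Queue time as a fraction of total processing time.

    Assumes `request_processing_time_ms` covers the full request
    lifecycle including `request_queue_time_ms`; returns 0.0 for a
    zero or invalid total (mirroring SQL's SAFE_DIVIDE behavior).
    """
    if processing_ms <= 0:
        return 0.0
    return queue_ms / processing_ms
```

For example, a request that spent 12s queued out of a 20s lifecycle has a queue share of 0.6, above the 50% capacity-alert threshold.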

### Phase 4: Write Monitoring SQL

✅ **PREFERRED: Calculate percentiles and error rates with baseline comparisons**

```sql
WITH api_metrics AS (
  SELECT
    DATE(action_ts) AS dt,
    endpoint,
    model_type,
    org_name,
    COUNT(*) AS total_requests,
    COUNTIF(result = 'success') AS successful_requests,
    COUNTIF(result != 'success') AS failed_requests,
    SAFE_DIVIDE(COUNTIF(result != 'success'), COUNT(*)) * 100 AS error_rate_pct,
    APPROX_QUANTILES(request_processing_time_ms, 100)[OFFSET(50)] AS p50_latency_ms,
    APPROX_QUANTILES(request_processing_time_ms, 100)[OFFSET(95)] AS p95_latency_ms,
    APPROX_QUANTILES(request_processing_time_ms, 100)[OFFSET(99)] AS p99_latency_ms,
    AVG(request_queue_time_ms) AS avg_queue_time_ms,
    AVG(request_inference_time_ms) AS avg_inference_time_ms
  FROM `ltx-dwh-prod-processed.web.ltxvapi_api_requests_with_be_costs`
  -- Use complete-day boundaries so a partial day never skews percentiles,
  -- and pull 8 full days so yesterday has a full 7-day baseline window.
  WHERE action_ts >= TIMESTAMP(DATE_SUB(CURRENT_DATE(), INTERVAL 8 DAY))
    AND action_ts < TIMESTAMP(CURRENT_DATE())
  GROUP BY dt, endpoint, model_type, org_name
),
metrics_with_baseline AS (
  SELECT
    *,
    -- Partition by org_name as well: rows are per endpoint/model/org/day,
    -- so "7 PRECEDING" only means 7 days if each partition has one row per day.
    AVG(p95_latency_ms) OVER (
      PARTITION BY endpoint, model_type, org_name
      ORDER BY dt ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING
    ) AS p95_latency_baseline_7d,
    AVG(error_rate_pct) OVER (
      PARTITION BY endpoint, model_type, org_name
      ORDER BY dt ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING
    ) AS error_rate_baseline_7d
  FROM api_metrics
)
SELECT * FROM metrics_with_baseline
WHERE dt = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY);
```

**Key patterns**:
- **Percentiles**: Use `APPROX_QUANTILES` for P50/P95/P99
- **Error rate**: `SAFE_DIVIDE(failed, total) * 100`
- **Baseline**: 7-day rolling average per endpoint/model/org
- **Time window**: Last 8 complete days (shorter than usage monitoring due to higher-frequency data)

### Phase 5: Execute Query

Run the query using:
```bash
bq --project_id=ltx-dwh-explore query --use_legacy_sql=false --format=pretty "
<query>
"
```
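
When driving `bq` from a script, building the command as an argv list avoids shell-quoting issues with multi-line SQL. A sketch, assuming the `bq` CLI is installed and authenticated (the helper names are hypothetical):

```python
import subprocess

def build_bq_command(query: str, project: str = "ltx-dwh-explore") -> list[str]:
    """Build the bq invocation shown above as an argv list."""
    return [
        "bq", f"--project_id={project}", "query",
        "--use_legacy_sql=false", "--format=pretty",
        query,
    ]

def run_query(query: str) -> str:
    """Run the query and return stdout; raises CalledProcessError on failure.
    Requires the bq CLI to be installed and authenticated."""
    result = subprocess.run(
        build_bq_command(query), capture_output=True, text=True, check=True
    )
    return result.stdout
```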

### Phase 6: Analyze Results

**For latency trends**:
- Compare P95 latency vs baseline (7-day avg)
- Flag if P95 > 2x baseline or > 60s absolute
- Identify which endpoint/model/org drove spikes

**For error rate analysis**:
- Compare error rate vs baseline
- Separate errors by `error_type`/`error_source` (infrastructure vs applicative)
- Flag if error rate > 5% or DoD increase > 50%

**For throughput**:
- Track requests per hour/day by endpoint
- Flag throughput drops > 30% DoD/WoW
- Identify which endpoints lost traffic

**For queue analysis**:
- Calculate queue time as a % of total processing time
- Flag if queue time > 50% of processing time
- Sustained high queue share indicates capacity/scaling issues

### Phase 7: Present Findings

Format results with:
- **Summary**: Key finding (e.g., "P95 latency spiked to 85s for /v1/text-to-video")
- **Root cause**: Which endpoint/model/org drove the issue
- **Breakdown**: Performance metrics by dimension
- **Error classification**: Infrastructure vs applicative errors
- **Recommendation**: Route to API team (applicative) or Engineering team (infrastructure)

**Alert format**:
```
⚠️ API PERFORMANCE ALERT:
  • Endpoint: /v1/text-to-video
    Model: ltxv2
    Metric: P95 Latency
    Current: 85s | Baseline: 30s
    Change: +183%

    Error rate: 8.2% (baseline: 2.1%)
    Error type: Infrastructure

Recommendation: Alert Engineering team for infrastructure issue
```
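
A formatter for the header of this alert layout might look like the following sketch; the field names and fixed layout are illustrative, not a required implementation:

```python
def format_alert(endpoint: str, model: str, metric: str,
                 current_s: float, baseline_s: float) -> str:
    """Render the top section of the alert format above."""
    change_pct = (current_s - baseline_s) / baseline_s * 100
    return (
        "⚠️ API PERFORMANCE ALERT:\n"
        f"  • Endpoint: {endpoint}\n"
        f"    Model: {model}\n"
        f"    Metric: {metric}\n"
        f"    Current: {current_s:.0f}s | Baseline: {baseline_s:.0f}s\n"
        f"    Change: {change_pct:+.0f}%"
    )
```

Keeping the formatter separate from the detection logic makes it easy to reuse the same rendering for latency, error rate, and throughput alerts.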

### Phase 8: Route Alert

For ongoing monitoring:
1. Save the SQL query
2. Set it up as a BigQuery scheduled query or Hex Thread
3. Configure notifications by error type:
   - Infrastructure errors → Engineering team
   - Applicative errors → API/Product team
4. Include endpoint, model, and org details in the alert
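
The routing rule can be sketched as a lookup. The classification strings below are assumptions to verify against the actual `error_type`/`error_source` values in the schema:

```python
def route_alert(error_source: str) -> str:
    """Map an error classification to the owning team (team names illustrative)."""
    routes = {
        "infrastructure": "Engineering team",
        "applicative": "API/Product team",
    }
    team = routes.get(error_source.lower())
    if team is None:
        # Default unknown classifications to Engineering for triage.
        return "Engineering team (unclassified error, verify error_source values)"
    return team
```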

## 5. Context & References

### Shared Knowledge
- **`shared/product-context.md`** — LTX products and API context
- **`shared/bq-schema.md`** — API tables and GPU cost table schema
- **`shared/metric-standards.md`** — Performance metric patterns
- **`shared/event-registry.yaml`** — Feature events (if analyzing event-driven metrics)
- **`shared/gpu-cost-query-templates.md`** — GPU cost queries (if analyzing cost-related performance)
- **`shared/gpu-cost-analysis-patterns.md`** — Cost analysis patterns

### Data Sources

**Primary table**: `ltx-dwh-prod-processed.web.ltxvapi_api_requests_with_be_costs`
- Partitioned by `action_ts` (TIMESTAMP)
- Key columns: `request_processing_time_ms`, `request_inference_time_ms`, `request_queue_time_ms`, `result`, `endpoint`, `model_type`, `org_name`

**Alternative**: `ltx-dwh-prod-processed.gpu_costs.gpu_request_attribution_and_cost`
- Contains API runtime data, but is not the primary source for performance metrics

### Endpoints
Common endpoints: `/v1/text-to-video`, `/v1/image-to-video`, `/v1/upscale`, `/generate`

### Models
Common models: `ltxv2`, `retake`, etc.

## 6. Constraints & Done

### DO NOT

- **DO NOT** use absolute thresholds without a baseline comparison
- **DO NOT** mix infrastructure and applicative errors in the same alert
- **DO NOT** skip partition filtering — always filter on `action_ts` or `dt` for performance
- **DO NOT** forget to separate errors by error type/source

> [!IMPORTANT]
> Verify the column name in the schema: `error_type` vs `error_source`.

### DO

- **DO** use `APPROX_QUANTILES` for percentile calculations (P50, P95, P99)
- **DO** separate errors by `error_source` (infrastructure vs applicative)
- **DO** filter by `result = 'success'` for success rate calculations
- **DO** break down by endpoint, model, and organization for detailed analysis
- **DO** compare current performance against a historical baseline (7-day rolling avg)
- **DO** alert the Engineering team for infrastructure errors
- **DO** alert the product/API team for applicative errors
- **DO** filter on the `action_ts` or `dt` partition column for performance
- **DO** use `ltx-dwh-explore` as the execution project
- **DO** calculate error rate with `SAFE_DIVIDE(failed, total) * 100`
- **DO** flag P95 latency > 2x baseline or > 60s
- **DO** flag error rate > 5% or DoD increase > 50%
- **DO** flag throughput drops > 30% DoD/WoW
- **DO** flag queue time > 50% of processing time
- **DO** flag infrastructure errors > 10 requests/hour
- **DO** include endpoint, model, and org details in all alerts
- **DO** validate unusual patterns with the API/Engineering team before alerting

### Completion Criteria

✅ All performance metrics monitored (latency, errors, throughput, queue time)
✅ Alerts fire on the defined thresholds (generic, pending production analysis)
✅ Endpoint/model/org breakdown provided
✅ Errors separated by type (infrastructure vs applicative)
✅ Findings routed to the appropriate team
✅ Partition filtering applied for performance
✅ Column name verified (`error_type` vs `error_source`)