---
name: api-runtime-monitor
description: "Monitor LTX API runtime performance, latency, error rates, and throughput. Detects performance degradation and errors. Use when: (1) detecting API latency issues, (2) alerting on error rate spikes, (3) investigating throughput drops by endpoint/model/org."
tags: [monitoring, api, performance, latency, errors, throughput]
---

# API Runtime Monitor

## 1. Overview (Why?)

LTX API performance varies by endpoint, model, and customer organization. Latency issues, error rate spikes, and throughput drops can indicate infrastructure problems, model regressions, or customer-specific issues that require engineering intervention.

This skill provides **autonomous API runtime monitoring** that detects performance degradation (P95 latency spikes), error rate increases, throughput drops, and queue time issues — with breakdown by endpoint, model, and organization for root cause analysis.

**Problem solved**: Detect API performance problems and errors before they impact customer experience — with segment-level (endpoint/model/org) root cause identification.

## 2. Requirements (What?)

Monitor these outcomes autonomously:

- [ ] P95 latency spikes (> 2x baseline or > 60s)
- [ ] Error rate increases (> 5% or DoD increase > 50%)
- [ ] Throughput drops (> 30% DoD/WoW)
- [ ] Queue time excessive (> 50% of processing time)
- [ ] Infrastructure errors (> 10 requests/hour)
- [ ] Alerts include breakdown by endpoint, model, organization
- [ ] Results formatted by priority (infrastructure vs applicative errors)
- [ ] Findings routed to appropriate team (API team or Engineering)

## 3. Progress Tracker

* [ ] Read shared knowledge (schema, metrics, performance patterns)
* [ ] Identify data source (ltxvapi tables or GPU cost table)
* [ ] Write monitoring SQL with percentile calculations
* [ ] Execute query for target date range
* [ ] Analyze results by endpoint, model, organization
* [ ] Separate infrastructure vs applicative errors
* [ ] Present findings with performance breakdown
* [ ] Route alerts to appropriate team

## 4. Implementation Plan

### Phase 1: Read Alert Thresholds

**Generic thresholds** (data-driven analysis pending):
- P95 latency > 2x baseline or > 60s
- Error rate > 5% or DoD increase > 50%
- Throughput drops > 30% DoD/WoW
- Queue time > 50% of processing time
- Infrastructure errors > 10 requests/hour

> [!IMPORTANT]
> These are generic thresholds. Consider creating production thresholds based on endpoint/model-specific analysis (similar to usage/GPU cost monitoring).
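The generic thresholds above can be captured as a small config sketch. This is illustrative only: the dict keys and the `p95_breached` helper are our naming, not an agreed schema.

```python
# Generic alert thresholds from Phase 1, encoded as a config dict.
# Keys and structure are illustrative, not a production schema.
THRESHOLDS = {
    "p95_latency": {"baseline_ratio": 2.0, "absolute_ms": 60_000},
    "error_rate": {"absolute_pct": 5.0, "dod_increase_pct": 50.0},
    "throughput": {"drop_pct": 30.0},           # DoD/WoW
    "queue_time": {"share_of_processing": 0.5},
    "infra_errors": {"per_hour": 10},
}

def p95_breached(current_ms: float, baseline_ms: float) -> bool:
    """Flag P95 latency that is > 2x baseline or > 60s absolute."""
    t = THRESHOLDS["p95_latency"]
    return current_ms > t["baseline_ratio"] * baseline_ms or current_ms > t["absolute_ms"]
```

Keeping thresholds in one place makes the later endpoint/model-specific tuning a config change rather than a query rewrite.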

### Phase 2: Read Shared Knowledge

Before writing SQL, read:
- **`shared/product-context.md`** — LTX products, user types, business model, API context
- **`shared/bq-schema.md`** — GPU cost table (has API runtime data), ltxvapi tables, API request schema
- **`shared/metric-standards.md`** — Performance metric patterns (latency, error rates, throughput)
- **`shared/event-registry.yaml`** — Feature events (if analyzing event-driven metrics)
- **`shared/gpu-cost-query-templates.md`** — GPU cost queries (if analyzing cost-related performance)
- **`shared/gpu-cost-analysis-patterns.md`** — Cost analysis patterns (if analyzing cost-related performance)

**Data nuances**:
- Primary table: `ltx-dwh-prod-processed.web.ltxvapi_api_requests_with_be_costs`
- Alternative: `ltx-dwh-prod-processed.gpu_costs.gpu_request_attribution_and_cost`
- Partitioned by `action_ts` (TIMESTAMP) or `dt` (DATE) — filter for performance
- Key columns: `request_processing_time_ms`, `request_inference_time_ms`, `request_queue_time_ms`, `result`, `endpoint`, `model_type`, `org_name`

### Phase 3: Identify Data Source

✅ **PREFERRED: Use ltxvapi_api_requests_with_be_costs for API runtime metrics**

**Key columns**:
- `request_processing_time_ms`: Total time from request submission to completion
- `request_inference_time_ms`: GPU processing time (actual model inference)
- `request_queue_time_ms`: Time waiting in queue before processing starts
- `result`: Request outcome (success, failed, timeout, etc.)
- `error_type` or `error_source`: Classification of errors (infrastructure vs applicative)
- `endpoint`: API endpoint called (e.g., /generate, /upscale)
- `model_type`: Model used (ltxv2, retake, etc.)
- `org_name`: Customer organization making the request

> [!IMPORTANT]
> Verify the column name in the actual schema: `error_type` vs `error_source`.

### Phase 4: Write Monitoring SQL

✅ **PREFERRED: Calculate percentiles and error rates with baseline comparisons**

```sql
WITH api_metrics AS (
  SELECT
    DATE(action_ts) AS dt,
    endpoint,
    model_type,
    org_name,
    COUNT(*) AS total_requests,
    COUNTIF(result = 'success') AS successful_requests,
    COUNTIF(result != 'success') AS failed_requests,
    SAFE_DIVIDE(COUNTIF(result != 'success'), COUNT(*)) * 100 AS error_rate_pct,
    APPROX_QUANTILES(request_processing_time_ms, 100)[OFFSET(50)] AS p50_latency_ms,
    APPROX_QUANTILES(request_processing_time_ms, 100)[OFFSET(95)] AS p95_latency_ms,
    APPROX_QUANTILES(request_processing_time_ms, 100)[OFFSET(99)] AS p99_latency_ms,
    AVG(request_queue_time_ms) AS avg_queue_time_ms,
    AVG(request_inference_time_ms) AS avg_inference_time_ms
  FROM `ltx-dwh-prod-processed.web.ltxvapi_api_requests_with_be_costs`
  -- Widen the lookback if a full 7-day baseline is required for the target day.
  WHERE action_ts >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
    AND action_ts < TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  GROUP BY dt, endpoint, model_type, org_name
),
metrics_with_baseline AS (
  SELECT
    *,
    -- org_name must be in the window partition: each partition then has one
    -- row per day, so ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING is a 7-day baseline.
    AVG(p95_latency_ms) OVER (
      PARTITION BY endpoint, model_type, org_name
      ORDER BY dt ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING
    ) AS p95_latency_baseline_7d,
    AVG(error_rate_pct) OVER (
      PARTITION BY endpoint, model_type, org_name
      ORDER BY dt ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING
    ) AS error_rate_baseline_7d
  FROM api_metrics
)
SELECT * FROM metrics_with_baseline
WHERE dt = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY);
```

**Key patterns**:
- **Percentiles**: Use `APPROX_QUANTILES` for P50/P95/P99
- **Error rate**: `SAFE_DIVIDE(failed, total) * 100`
- **Baseline**: 7-day rolling window average per monitored segment
- **Time window**: Last 7 days (shorter than usage monitoring due to higher-frequency data)
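For intuition, the two BigQuery patterns can be mirrored in plain Python. Note two differences we introduce here: `quantile_offset` computes an exact quantile boundary, whereas `APPROX_QUANTILES` is approximate, and both function names are ours.

```python
def quantile_offset(values, offset, buckets=100):
    """Rough Python analogue of APPROX_QUANTILES(x, buckets)[OFFSET(offset)]:
    pick the offset-th of (buckets + 1) evenly spaced quantile boundaries."""
    s = sorted(values)
    idx = round(offset / buckets * (len(s) - 1))
    return s[idx]

def error_rate_pct(failed, total):
    """SAFE_DIVIDE-style error rate: None instead of a division-by-zero error."""
    return None if total == 0 else failed / total * 100

latencies = list(range(1, 101))       # 1..100 ms
p95 = quantile_offset(latencies, 95)  # → 95
```

`SAFE_DIVIDE` returning `NULL` (here `None`) matters for quiet segments: a day with zero requests yields no error rate rather than a query failure.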

### Phase 5: Execute Query

Run the query with the `bq` CLI:
```bash
bq --project_id=ltx-dwh-explore query --use_legacy_sql=false --format=pretty "
<query>
"
```

### Phase 6: Analyze Results

**For latency trends**:
- Compare P95 latency vs baseline (7-day avg)
- Flag if P95 > 2x baseline or > 60s absolute
- Identify which endpoint/model/org drove spikes

**For error rate analysis**:
- Compare error rate vs baseline
- Separate errors by `error_type`/`error_source` (infrastructure vs applicative)
- Flag if error rate > 5% or DoD increase > 50%

**For throughput**:
- Track requests per hour/day by endpoint
- Flag throughput drops > 30% DoD/WoW
- Identify which endpoints lost traffic

**For queue analysis**:
- Calculate queue time as % of total processing time
- Flag if queue time > 50% of processing time
- Indicates capacity/scaling issues
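The Phase 6 checks above can be sketched as a single row-level evaluation. Field names follow the monitoring SQL's output columns; thresholds are the generic Phase 1 values; the 7-day baseline stands in for the day-over-day comparison; and approximating processing time as queue + inference is an assumption, since the query does not emit average processing time.

```python
def phase6_flags(row: dict) -> list[str]:
    """Apply the Phase 6 checks to one endpoint/model/org/day metrics row."""
    flags = []
    # Latency: P95 > 2x the 7-day baseline, or > 60s absolute.
    baseline = row.get("p95_latency_baseline_7d") or 0
    if row["p95_latency_ms"] > 60_000 or (baseline and row["p95_latency_ms"] > 2 * baseline):
        flags.append("p95_latency")
    # Error rate: > 5% absolute, or > 50% above the 7-day baseline
    # (used here as a stand-in for the DoD comparison).
    err_baseline = row.get("error_rate_baseline_7d") or 0
    if row["error_rate_pct"] > 5.0 or (err_baseline and row["error_rate_pct"] > 1.5 * err_baseline):
        flags.append("error_rate")
    # Queue pressure: queue time > 50% of processing time. The SQL does not
    # emit average processing time, so queue + inference approximates it (assumption).
    processing = row["avg_queue_time_ms"] + row["avg_inference_time_ms"]
    if processing and row["avg_queue_time_ms"] / processing > 0.5:
        flags.append("queue_time")
    return flags
```

Returning a list of flags (rather than a single boolean) keeps one row's multiple simultaneous problems visible in the alert breakdown.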

### Phase 7: Present Findings

Format results with:
- **Summary**: Key finding (e.g., "P95 latency spiked to 85s for /v1/text-to-video")
- **Root cause**: Which endpoint/model/org drove the issue
- **Breakdown**: Performance metrics by dimension
- **Error classification**: Infrastructure vs applicative errors
- **Recommendation**: Route to API team (applicative) or Engineering team (infrastructure)

**Alert format**:
```
⚠️ API PERFORMANCE ALERT:
• Endpoint: /v1/text-to-video
Model: ltxv2
Metric: P95 Latency
Current: 85s | Baseline: 30s
Change: +183%

Error rate: 8.2% (baseline: 2.1%)
Error type: Infrastructure

Recommendation: Alert Engineering team for infrastructure issue
```
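A minimal renderer for this alert format, assuming a metrics dict with illustrative field names (not a fixed schema):

```python
def format_alert(m: dict) -> str:
    """Render the alert header block from a metrics dict; the change is
    computed as percent delta from baseline."""
    change_pct = (m["current"] - m["baseline"]) / m["baseline"] * 100
    return (
        "⚠️ API PERFORMANCE ALERT:\n"
        f"• Endpoint: {m['endpoint']}\n"
        f"  Model: {m['model']}\n"
        f"  Metric: {m['metric']}\n"
        f"  Current: {m['current']}s | Baseline: {m['baseline']}s\n"
        f"  Change: {change_pct:+.0f}%"
    )
```

With current 85s against a 30s baseline, the computed change is +183%, matching the example above.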

### Phase 8: Route Alert

For ongoing monitoring:
1. Save SQL query
2. Set it up as a BigQuery scheduled query or a Hex Thread
3. Configure notification by error type:
- Infrastructure errors → Engineering team
- Applicative errors → API/Product team
4. Include endpoint, model, and org details in alert
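The routing rule in step 3 reduces to a small lookup; team identifiers here are placeholders for real notification channels.

```python
ROUTING = {
    "infrastructure": "engineering",   # infrastructure errors → Engineering team
    "applicative": "api_product",      # applicative errors → API/Product team
}

def route(error_type: str) -> str:
    """Map an error classification to a notification target; unknown or new
    classifications default to engineering for triage."""
    return ROUTING.get(error_type.lower(), "engineering")
```

Defaulting unknown classifications to one team guarantees that a new `error_type` value never produces an unrouted alert.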

## 5. Context & References

### Shared Knowledge
- **`shared/product-context.md`** — LTX products and API context
- **`shared/bq-schema.md`** — API tables and GPU cost table schema
- **`shared/metric-standards.md`** — Performance metric patterns
- **`shared/event-registry.yaml`** — Feature events (if analyzing event-driven metrics)
- **`shared/gpu-cost-query-templates.md`** — GPU cost queries (if analyzing cost-related performance)
- **`shared/gpu-cost-analysis-patterns.md`** — Cost analysis patterns

### Data Sources

**Primary table**: `ltx-dwh-prod-processed.web.ltxvapi_api_requests_with_be_costs`
- Partitioned by `action_ts` (TIMESTAMP)
- Key columns: `request_processing_time_ms`, `request_inference_time_ms`, `request_queue_time_ms`, `result`, `endpoint`, `model_type`, `org_name`

**Alternative**: `ltx-dwh-prod-processed.gpu_costs.gpu_request_attribution_and_cost`
- Contains API runtime data but is not the primary source for performance metrics

### Endpoints
Common endpoints: `/v1/text-to-video`, `/v1/image-to-video`, `/v1/upscale`, `/generate`

### Models
Common models: `ltxv2`, `retake`, etc.

## 6. Constraints & Done

### DO NOT

- **DO NOT** use absolute thresholds without baseline comparison
- **DO NOT** mix infrastructure and applicative errors in same alert
- **DO NOT** skip partition filtering — always filter on `action_ts` or `dt` for performance
- **DO NOT** forget to separate errors by error type/source

> [!IMPORTANT]
> Verify the column name in the schema: `error_type` vs `error_source`.

### DO

- **DO** use `APPROX_QUANTILES` for percentile calculations (P50, P95, P99)
- **DO** separate errors by error_source (infrastructure vs applicative)
- **DO** filter by `result = 'success'` for success rate calculations
- **DO** break down by endpoint, model, and organization for detailed analysis
- **DO** compare current performance against historical baseline (7-day rolling avg)
- **DO** alert engineering team for infrastructure errors
- **DO** alert product/API team for applicative errors
- **DO** partition on `action_ts` or `dt` for performance
- **DO** use `ltx-dwh-explore` as execution project
- **DO** calculate error rate with `SAFE_DIVIDE(failed, total) * 100`
- **DO** flag P95 latency > 2x baseline or > 60s
- **DO** flag error rate > 5% or DoD increase > 50%
- **DO** flag throughput drops > 30% DoD/WoW
- **DO** flag queue time > 50% of processing time
- **DO** flag infrastructure errors > 10 requests/hour
- **DO** include endpoint, model, org details in all alerts
- **DO** validate unusual patterns with API/Engineering team before alerting

### Completion Criteria

✅ All performance metrics monitored (latency, errors, throughput, queue time)
✅ Alerts fire on thresholds (generic, pending production-specific analysis)
✅ Endpoint/model/org breakdown provided
✅ Errors separated by type (infrastructure vs applicative)
✅ Findings routed to appropriate team
✅ Partition filtering applied for performance
✅ Column name verified (error_type vs error_source)