---
name: api-runtime-monitor
description: "Monitor LTX API runtime performance, latency, error rates, and throughput. Detects performance degradation and errors. Use when: (1) detecting API latency issues, (2) alerting on error rate spikes, (3) investigating throughput drops by endpoint/model/org."
tags: [monitoring, api, performance, latency, errors, throughput]
---

# API Runtime Monitor

## 1. Overview (Why?)

LTX API performance varies by endpoint, model, and customer organization. Latency issues, error rate spikes, and throughput drops can indicate infrastructure problems, model regressions, or customer-specific issues that require engineering intervention.

This skill provides **autonomous API runtime monitoring** that detects performance degradation (P95 latency spikes), error rate increases, throughput drops, and queue time issues — with breakdown by endpoint, model, and organization for root cause analysis.

**Problem solved**: Detect API performance problems and errors before they impact customer experience — with segment-level (endpoint/model/org) root cause identification.

## 2. Requirements (What?)

Monitor these outcomes autonomously:

- [ ] P95 latency spikes (> 2x baseline or > 60s)
- [ ] Error rate increases (> 5% or DoD increase > 50%)
- [ ] Throughput drops (> 30% DoD/WoW)
- [ ] Queue time excessive (> 50% of processing time)
- [ ] Infrastructure errors (> 10 requests/hour)
- [ ] Alerts include breakdown by endpoint, model, organization
- [ ] Results formatted by priority (infrastructure vs applicative errors)
- [ ] Findings routed to appropriate team (API team or Engineering)

## 3. Progress Tracker

* [ ] Read shared knowledge (schema, metrics, performance patterns)
* [ ] Identify data source (ltxvapi tables or GPU cost table)
* [ ] Write monitoring SQL with percentile calculations
* [ ] Execute query for target date range
* [ ] Analyze results by endpoint, model, organization
* [ ] Separate infrastructure vs applicative errors
* [ ] Present findings with performance breakdown
* [ ] Route alerts to appropriate team

## 4. Implementation Plan

### Phase 1: Read Alert Thresholds

**Generic thresholds** (data-driven analysis pending):
- P95 latency > 2x baseline or > 60s
- Error rate > 5% or DoD increase > 50%
- Throughput drops > 30% DoD/WoW
- Queue time > 50% of processing time
- Infrastructure errors > 10 requests/hour

> [!IMPORTANT]
> These are generic thresholds. Consider creating production thresholds based on endpoint/model-specific analysis (similar to usage/GPU cost monitoring).
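The generic thresholds above can be captured as a small config sketch. This is illustrative only: the dict keys and the `p95_breached` helper are our naming, not an agreed schema.

```python
# Generic alert thresholds from Phase 1, encoded as a config dict.
# Keys and structure are illustrative, not a production schema.
THRESHOLDS = {
    "p95_latency": {"baseline_ratio": 2.0, "absolute_ms": 60_000},
    "error_rate": {"absolute_pct": 5.0, "dod_increase_pct": 50.0},
    "throughput": {"drop_pct": 30.0},           # DoD/WoW
    "queue_time": {"share_of_processing": 0.5},
    "infra_errors": {"per_hour": 10},
}

def p95_breached(current_ms: float, baseline_ms: float) -> bool:
    """Flag P95 latency that is > 2x baseline or > 60s absolute."""
    t = THRESHOLDS["p95_latency"]
    return current_ms > t["baseline_ratio"] * baseline_ms or current_ms > t["absolute_ms"]
```

Keeping thresholds in one place makes the later endpoint/model-specific tuning a config change rather than a query rewrite.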

### Phase 2: Read Shared Knowledge

Before writing SQL, read:
- **`shared/product-context.md`** — LTX products, user types, business model, API context
- **`shared/bq-schema.md`** — GPU cost table (has API runtime data), ltxvapi tables, API request schema
- **`shared/metric-standards.md`** — Performance metric patterns (latency, error rates, throughput)
- **`shared/event-registry.yaml`** — Feature events (if analyzing event-driven metrics)
- **`shared/gpu-cost-query-templates.md`** — GPU cost queries (if analyzing cost-related performance)
- **`shared/gpu-cost-analysis-patterns.md`** — Cost analysis patterns (if analyzing cost-related performance)

**Data nuances**:
- Primary table: `ltx-dwh-prod-processed.web.ltxvapi_api_requests_with_be_costs`
- Alternative: `ltx-dwh-prod-processed.gpu_costs.gpu_request_attribution_and_cost`
- Partitioned by `action_ts` (TIMESTAMP) or `dt` (DATE) — filter for performance
- Key columns: `request_processing_time_ms`, `request_inference_time_ms`, `request_queue_time_ms`, `result`, `endpoint`, `model_type`, `org_name`

### Phase 3: Identify Data Source

✅ **PREFERRED: Use ltxvapi_api_requests_with_be_costs for API runtime metrics**

**Key columns**:
- `request_processing_time_ms`: Total time from request submission to completion
- `request_inference_time_ms`: GPU processing time (actual model inference)
- `request_queue_time_ms`: Time waiting in queue before processing starts
- `result`: Request outcome (success, failed, timeout, etc.)
- `error_type` or `error_source`: Classification of errors (infrastructure vs applicative)
- `endpoint`: API endpoint called (e.g., /generate, /upscale)
- `model_type`: Model used (ltxv2, retake, etc.)
- `org_name`: Customer organization making the request

> [!IMPORTANT]
> Verify the column name in the actual schema: `error_type` vs `error_source`.

### Phase 4: Write Monitoring SQL

✅ **PREFERRED: Calculate percentiles and error rates with baseline comparisons**

```sql
WITH api_metrics AS (
  SELECT
    DATE(action_ts) AS dt,
    endpoint,
    model_type,
    org_name,
    COUNT(*) AS total_requests,
    COUNTIF(result = 'success') AS successful_requests,
    COUNTIF(result != 'success') AS failed_requests,
    SAFE_DIVIDE(COUNTIF(result != 'success'), COUNT(*)) * 100 AS error_rate_pct,
    APPROX_QUANTILES(request_processing_time_ms, 100)[OFFSET(50)] AS p50_latency_ms,
    APPROX_QUANTILES(request_processing_time_ms, 100)[OFFSET(95)] AS p95_latency_ms,
    APPROX_QUANTILES(request_processing_time_ms, 100)[OFFSET(99)] AS p99_latency_ms,
    AVG(request_queue_time_ms) AS avg_queue_time_ms,
    AVG(request_inference_time_ms) AS avg_inference_time_ms
  FROM `ltx-dwh-prod-processed.web.ltxvapi_api_requests_with_be_costs`
  -- Widen the lookback if a full 7-day baseline is required for the target day.
  WHERE action_ts >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
    AND action_ts < TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  GROUP BY dt, endpoint, model_type, org_name
),
metrics_with_baseline AS (
  SELECT
    *,
    -- org_name must be in the window partition: each partition then has one
    -- row per day, so ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING is a 7-day baseline.
    AVG(p95_latency_ms) OVER (
      PARTITION BY endpoint, model_type, org_name
      ORDER BY dt ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING
    ) AS p95_latency_baseline_7d,
    AVG(error_rate_pct) OVER (
      PARTITION BY endpoint, model_type, org_name
      ORDER BY dt ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING
    ) AS error_rate_baseline_7d
  FROM api_metrics
)
SELECT * FROM metrics_with_baseline
WHERE dt = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY);
```

**Key patterns**:
- **Percentiles**: Use `APPROX_QUANTILES` for P50/P95/P99
- **Error rate**: `SAFE_DIVIDE(failed, total) * 100`
- **Baseline**: 7-day rolling window average per monitored segment
- **Time window**: Last 7 days (shorter than usage monitoring due to higher-frequency data)
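For intuition, the two BigQuery patterns can be mirrored in plain Python. Note two differences we introduce here: `quantile_offset` computes an exact quantile boundary, whereas `APPROX_QUANTILES` is approximate, and both function names are ours.

```python
def quantile_offset(values, offset, buckets=100):
    """Rough Python analogue of APPROX_QUANTILES(x, buckets)[OFFSET(offset)]:
    pick the offset-th of (buckets + 1) evenly spaced quantile boundaries."""
    s = sorted(values)
    idx = round(offset / buckets * (len(s) - 1))
    return s[idx]

def error_rate_pct(failed, total):
    """SAFE_DIVIDE-style error rate: None instead of a division-by-zero error."""
    return None if total == 0 else failed / total * 100

latencies = list(range(1, 101))       # 1..100 ms
p95 = quantile_offset(latencies, 95)  # → 95
```

`SAFE_DIVIDE` returning `NULL` (here `None`) matters for quiet segments: a day with zero requests yields no error rate rather than a query failure.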

### Phase 5: Execute Query

Run the query with the `bq` CLI:
```bash
bq --project_id=ltx-dwh-explore query --use_legacy_sql=false --format=pretty "
<query>
"
```

### Phase 6: Analyze Results

**For latency trends**:
- Compare P95 latency vs baseline (7-day avg)
- Flag if P95 > 2x baseline or > 60s absolute
- Identify which endpoint/model/org drove spikes

**For error rate analysis**:
- Compare error rate vs baseline
- Separate errors by `error_type`/`error_source` (infrastructure vs applicative)
- Flag if error rate > 5% or DoD increase > 50%

**For throughput**:
- Track requests per hour/day by endpoint
- Flag throughput drops > 30% DoD/WoW
- Identify which endpoints lost traffic

**For queue analysis**:
- Calculate queue time as % of total processing time
- Flag if queue time > 50% of processing time
- Indicates capacity/scaling issues
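The Phase 6 checks above can be sketched as a single row-level evaluation. Field names follow the monitoring SQL's output columns; thresholds are the generic Phase 1 values; the 7-day baseline stands in for the day-over-day comparison; and approximating processing time as queue + inference is an assumption, since the query does not emit average processing time.

```python
def phase6_flags(row: dict) -> list[str]:
    """Apply the Phase 6 checks to one endpoint/model/org/day metrics row."""
    flags = []
    # Latency: P95 > 2x the 7-day baseline, or > 60s absolute.
    baseline = row.get("p95_latency_baseline_7d") or 0
    if row["p95_latency_ms"] > 60_000 or (baseline and row["p95_latency_ms"] > 2 * baseline):
        flags.append("p95_latency")
    # Error rate: > 5% absolute, or > 50% above the 7-day baseline
    # (used here as a stand-in for the DoD comparison).
    err_baseline = row.get("error_rate_baseline_7d") or 0
    if row["error_rate_pct"] > 5.0 or (err_baseline and row["error_rate_pct"] > 1.5 * err_baseline):
        flags.append("error_rate")
    # Queue pressure: queue time > 50% of processing time. The SQL does not
    # emit average processing time, so queue + inference approximates it (assumption).
    processing = row["avg_queue_time_ms"] + row["avg_inference_time_ms"]
    if processing and row["avg_queue_time_ms"] / processing > 0.5:
        flags.append("queue_time")
    return flags
```

Returning a list of flags (rather than a single boolean) keeps one row's multiple simultaneous problems visible in the alert breakdown.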

### Phase 7: Present Findings

Format results with:
- **Summary**: Key finding (e.g., "P95 latency spiked to 85s for /v1/text-to-video")
- **Root cause**: Which endpoint/model/org drove the issue
- **Breakdown**: Performance metrics by dimension
- **Error classification**: Infrastructure vs applicative errors
- **Recommendation**: Route to API team (applicative) or Engineering team (infrastructure)

**Alert format**:
```
⚠️ API PERFORMANCE ALERT:
• Endpoint: /v1/text-to-video
Model: ltxv2
Metric: P95 Latency
Current: 85s | Baseline: 30s
Change: +183%

Error rate: 8.2% (baseline: 2.1%)
Error type: Infrastructure

Recommendation: Alert Engineering team for infrastructure issue
```
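A minimal renderer for this alert format, assuming a metrics dict with illustrative field names (not a fixed schema):

```python
def format_alert(m: dict) -> str:
    """Render the alert header block from a metrics dict; the change is
    computed as percent delta from baseline."""
    change_pct = (m["current"] - m["baseline"]) / m["baseline"] * 100
    return (
        "⚠️ API PERFORMANCE ALERT:\n"
        f"• Endpoint: {m['endpoint']}\n"
        f"  Model: {m['model']}\n"
        f"  Metric: {m['metric']}\n"
        f"  Current: {m['current']}s | Baseline: {m['baseline']}s\n"
        f"  Change: {change_pct:+.0f}%"
    )
```

With current 85s against a 30s baseline, the computed change is +183%, matching the example above.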

### Phase 8: Route Alert

For ongoing monitoring:
1. Save SQL query
2. Set it up as a BigQuery scheduled query or a Hex Thread
3. Configure notification by error type:
- Infrastructure errors → Engineering team
- Applicative errors → API/Product team
4. Include endpoint, model, and org details in alert
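The routing rule in step 3 reduces to a small lookup; team identifiers here are placeholders for real notification channels.

```python
ROUTING = {
    "infrastructure": "engineering",   # infrastructure errors → Engineering team
    "applicative": "api_product",      # applicative errors → API/Product team
}

def route(error_type: str) -> str:
    """Map an error classification to a notification target; unknown or new
    classifications default to engineering for triage."""
    return ROUTING.get(error_type.lower(), "engineering")
```

Defaulting unknown classifications to one team guarantees that a new `error_type` value never produces an unrouted alert.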

## 5. Context & References

### Shared Knowledge
- **`shared/product-context.md`** — LTX products and API context
- **`shared/bq-schema.md`** — API tables and GPU cost table schema
- **`shared/metric-standards.md`** — Performance metric patterns
- **`shared/event-registry.yaml`** — Feature events (if analyzing event-driven metrics)
- **`shared/gpu-cost-query-templates.md`** — GPU cost queries (if analyzing cost-related performance)
- **`shared/gpu-cost-analysis-patterns.md`** — Cost analysis patterns

### Data Sources

**Primary table**: `ltx-dwh-prod-processed.web.ltxvapi_api_requests_with_be_costs`
- Partitioned by `action_ts` (TIMESTAMP)
- Key columns: `request_processing_time_ms`, `request_inference_time_ms`, `request_queue_time_ms`, `result`, `endpoint`, `model_type`, `org_name`

**Alternative**: `ltx-dwh-prod-processed.gpu_costs.gpu_request_attribution_and_cost`
- Contains API runtime data but is not the primary source for performance metrics

### Endpoints
Common endpoints: `/v1/text-to-video`, `/v1/image-to-video`, `/v1/upscale`, `/generate`

### Models
Common models: `ltxv2`, `retake`, etc.

## 6. Constraints & Done

### DO NOT

- **DO NOT** use absolute thresholds without baseline comparison
- **DO NOT** mix infrastructure and applicative errors in same alert
- **DO NOT** skip partition filtering — always filter on `action_ts` or `dt` for performance
- **DO NOT** forget to separate errors by error type/source

> [!IMPORTANT]
> Verify the column name in the schema: `error_type` vs `error_source`.

### DO

- **DO** use `APPROX_QUANTILES` for percentile calculations (P50, P95, P99)
- **DO** separate errors by error_source (infrastructure vs applicative)
- **DO** filter by `result = 'success'` for success rate calculations
- **DO** break down by endpoint, model, and organization for detailed analysis
- **DO** compare current performance against historical baseline (7-day rolling avg)
- **DO** alert engineering team for infrastructure errors
- **DO** alert product/API team for applicative errors
- **DO** partition on `action_ts` or `dt` for performance
- **DO** use `ltx-dwh-explore` as execution project
- **DO** calculate error rate with `SAFE_DIVIDE(failed, total) * 100`
- **DO** flag P95 latency > 2x baseline or > 60s
- **DO** flag error rate > 5% or DoD increase > 50%
- **DO** flag throughput drops > 30% DoD/WoW
- **DO** flag queue time > 50% of processing time
- **DO** flag infrastructure errors > 10 requests/hour
- **DO** include endpoint, model, org details in all alerts
- **DO** validate unusual patterns with API/Engineering team before alerting

### Completion Criteria

✅ All performance metrics monitored (latency, errors, throughput, queue time)
✅ Alerts fire on thresholds (generic, pending production-specific analysis)
✅ Endpoint/model/org breakdown provided
✅ Errors separated by type (infrastructure vs applicative)
✅ Findings routed to appropriate team
✅ Partition filtering applied for performance
✅ Column name verified (error_type vs error_source)