This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
This is a research and analysis repository for the Coverage Redesign project at IncludedHealth. It contains:
- Data extraction scripts (Node.js/Python) for analyzing Apollo GraphQL operations and traffic patterns
- Traffic control simulation planning for the Real-Time Eligibility (RTE) service
- Architecture analysis of the RTE service request chains and dependencies
- Client profiling tools for understanding API usage patterns
This is NOT a production codebase - it's a collection of analysis tools, data files, and design documents.
This repository is part of a multi-repository workspace managed through project.yaml:
- References 11 related repositories (coverage, realtime-eligibility, member-sponsorship, etc.)
- Uses VS Code/Cursor workspace configuration (
coverage-redesign.code-workspace) - No symlinks - repositories are accessed via relative paths in workspace folders
- See AGENTS.md for workspace management details
-
Production Services (referenced, not in this repo):
coverage- GraphQL gateway serving 59+ client servicesrealtime-eligibility- RTE service managing 15 Stedi connectionsmember-sponsorship- Alternative RTE gatewaysupport-tools- Internal support tools
-
This Repository - Analysis tools organized by purpose:
- Apollo Studio data extraction (JavaScript)
- Prometheus/Grafana metrics analysis (Python)
- Client profiling and traffic analysis (Python)
- Design documents and specifications (Markdown)
GraphQL Operation Analysis:
# Fetch Apollo Studio operation details (requires APOLLO_KEY)
export APOLLO_KEY='your-key'
./fetch-apollo-operation-details.js --all
# Match operations to persisted query manifest
node match-by-field-signature.js
# Generate client operation statistics
node operations-by-client-stats.js --improvedMetrics Analysis (Python):
# Analyze GraphQL operations by month
python graphql_operations_monthly.py
# Generate client profiles from metrics
python client_profile.py
# Inspect Prometheus metrics
python inspect_metrics.py
# Analyze trading partner histograms
python trading_partner_histogram_weekly.pyClient Profiling:
# Generate client profile for specific service
python client_profile.py --service realtime_eligibility
# Generate profiles for all services
python client_profile.py --all-servicesThis repository contains only scripts - there is no build process, no test suite, no deployment. Scripts are run directly via node or python.
This project addresses traffic control challenges for the RTE service:
System Constraints:
- 15 concurrent connection limit to Stedi eligibility service
- 90+ services depend on RTE through coverage-server gateway
- P95 latency > 10 seconds for Stedi calls
- 8 major incidents in 14 months (6 SEV-1, 2 SEV-2)
Core Problems (P1-P5):
- P1: Request Collapsing - In-memory singleflight limited to single pod (25x amplification)
- P2: Aberrant Client Protection - No rate limiting; single client can saturate all 15 connections
- P3: Payer Fairness - Slow payers monopolize connections, starving others
- P4: Backpressure - No admission control; system relies on timeouts
- P5: Priority Metadata - Cannot distinguish frontend from batch traffic
Architecture Layers ("Three Gates"):
- Gate 1: Per-member request collapsing (coverage-server, member-sponsorship)
- Gate 2: RTE response cache (realtime-eligibility)
- Gate 3: Stedi concurrency scheduler (15-slot resource manager)
See spec.md for complete architecture specification.
Read these first for context:
- spec.md - Complete traffic control architecture specification
- prd.md - Scheduler Simulation Framework requirements
- phase1-data-plan.md - Data collection strategy for simulation
- SOLUTION_SUMMARY.md - Apollo operation matching solution
- RTE_ANALYSIS_INDEX.md - Complete RTE dependency analysis index
RTE Analysis Suite (4 comprehensive documents):
- RTE_REQUEST_CHAIN_ANALYSIS.md - 7 request chains from clients to Stedi
- RTE_UPSTREAM_DEPENDENCIES.md - Complete backward trace of 59 services
- RTE_ERROR_BACKPRESSURE_ANALYSIS.md - Failure scenarios & user experience
- RTE_BACKPRESSURE_CURRENT_STATE_ANALYSIS.md - First principles analysis
Problem Solved: Apollo Studio operation IDs use different hashing than persisted query manifests, requiring field signature matching.
Pipeline Flow:
1. fetch-apollo-operation-details.js
↓ (apollo-operation-details.json)
2. match-by-field-signature.js
↓ (apollo-field-signature-matching.json)
3. integrate-improved-matching.js
↓ (coverage-clients-analysis-improved.json)
4. operations-by-client-stats.js --improved
↓ (coverage-operations-by-client.{json,csv})
Key Files:
apollo-operation-details.json- Raw Apollo Studio API responses (794 operations)apollo-field-signature-matching.json- Field signature match results (49.7% exact, 14.4% ambiguous)coverage-operations-by-client.json- Final analysis with improved match data
Phase 1 Data Collection (for simulation framework):
Prometheus Metrics → payer_profile_generator.py → payer_definitions.yaml
↘ workload_profile_generator.py → client_definitions.yaml
↘ incident_analyzer.py → timeline_events.yaml
Loki Logs → log_analyzer.py → member_distribution.yaml (Zipf parameters)
Key Metrics:
stedi_request_duration- Latency histogram by trading_partner_idstedi_http_status- HTTP status codes by trading_partner_idstedi_errors- DTE error codes (80, 42)traces_spanmetrics_calls_total- GraphQL operation call counts
See phase1-data-plan.md for complete metric collection strategy.
Data Extraction:
fetch-apollo-operation-details.js- Queries Apollo Studio API for operation signaturesidentify-coverage-clients.js- Identifies clients hitting coverage subgraphextract-*.js- Various extraction utilities for manifest/query data
Matching & Analysis:
match-by-field-signature.js- Matches Apollo ops to manifest by field signaturesintegrate-improved-matching.js- Updates analysis with improved matchesoperations-by-client-stats.js- Generates statistics with--improvedflag
Query Plan Analysis:
generate-query-plans.js- Generates query plans for operationsnormalize-query-plan-cache.js- Normalizes query plan cache filesgenerate-dependency-graph.js- Creates subgraph dependency graphs
Metrics Analysis:
graphql_operations_monthly.py- Analyzes GraphQL operations from Prometheusinspect_metrics.py- Inspects Prometheus metrics for analysistrading_partner_histogram_weekly.py- Analyzes trading partner request patternstrading_partner_profile.py- Generates trading partner profiles
Client Profiling:
client_profile.py- Main client profiling script (generates YAML definitions)- Supports
--service <name>for specific service - Supports
--all-servicesfor batch processing - Outputs to
client_definitions_<service>.yamlandclient_profiles_<service>.csv
- Supports
Scripts use environment variable APOLLO_KEY or auto-detect from rover config:
export APOLLO_KEY='your-api-key'
export APOLLO_GRAPH_REF='application-layer-graph-kwpy1@production' # optionalApollo Studio canonicalizes field order in operation signatures, so matching by field set (ignoring order) works better than exact string matching:
- Extract fields from both Apollo operations and manifest queries
- Compare as sets (order-independent)
- Classify as: exact match, ambiguous match (multiple variants), or unmatched
Payer Latency:
- Default: Log-normal distribution (per PRD spec)
- Alternative: Pareto distribution for long-tail payers (P95 > 10s)
- Parameters: P50, P90, P95, P99 from
stedi_request_durationhistogram
Member Distribution:
- Zipf distribution (models "popular member" collapsing problem)
- Extracted from Loki logs (account_id frequency analysis)
- Parameters: s (skew), N (pool size)
Arrival Rate:
- Poisson distribution (random independent arrivals)
- Parameter: meanRequestsPerSecond from
traces_spanmetrics_calls_total
Client definitions follow simulation.yaml schema from prd.md:
clientDefinitions:
- name: "Frontend-Sponsorship"
priorityTier: "gold" # gold/silver/bronze
requestType: "sponsorship"
payerDefinitions:
- name: "Payer-Fast"
latency:
type: "log-normal"
mean_ms: 800
stddev_ms: 300
errorRates:
http_5xx: 0.001
dte_80: 0.0
dte_42: 0.0- Scripts are exploratory and analysis-focused
- Output files (JSON/CSV/YAML) are checked in for reference
- No CI/CD, no deployment, no production dependencies
- Scripts may require manual configuration (API keys, environment variables)
External APIs:
- Apollo Studio GraphQL API (requires authentication)
- Grafana/Prometheus (production metrics)
- Loki (production logs)
Static Files:
persisted-query-manifest-*.json- Persisted query definitions from Apollo Studioapplication-layer-graph-*.json- GraphQL schema from Apollo Studio*.json,*.csv- Extracted data and analysis results
This repository references production services defined in separate repositories:
- Use workspace folders to navigate to service code
- Service implementations are in
coverage/,realtime-eligibility/, etc. - Platform metadata is in
platform-map/tng_descriptors/
When analyzing service behavior, cross-reference:
- This repo's analysis documents (RTE_*.md)
- Service source code (via workspace folders)
- Platform-map descriptors for GraphQL operations
Check APOLLO_KEY environment variable:
# Verify key is set
echo $APOLLO_KEY
# Or let script auto-detect from rover config
# (script looks for ~/.rover/config.toml)Queries in phase1-data-plan.md use 30-day windows. For large datasets:
- Reduce time window (e.g.,
[7d]instead of[30d]) - Use sampling (e.g., 1-hour windows across 30 days)
- Query cold storage for historical data beyond hot retention
member_id/account_id NOT available in Prometheus metrics (too high cardinality). Use:
- Primary: Loki logs via LogQL queries
- Secondary: Distributed traces (7-day hot storage, 30+ day cold storage)
- Historical: Potluck data warehouse via Lasik SQL client
See phase1-data-plan.md Section 2.3 for complete strategy.
Current Work:
- ACT-2366: Improve Traffic Control (rate limiting, priority queuing)
- CLOUD-6296: GraphQL Inbound Request Graphs by Client (caller visibility)
- CLOUD-6297: Kafka Replay Protection (prevent INC-864 recurrence)
- PHARM-1038: Update Consumer Offset Handling (medication service)
Incident References:
- INC-864 (Oct 2025): Medication Kafka replay saturated Stedi connections
- INC-824 (July 2025): Unknown client flooding coverage-server
- INC-821 (July 2025): Traffic surge outage
- See RTE_ANALYSIS_INDEX.md for complete incident history (8 incidents)