Severity: P0 (100% failure rate for a core feature) Root Cause: Lack of integration testing, performance testing, and monitoring Detection: Customer report (not caught by CI/CD or monitoring) Time to Detection: Unknown (could have been broken for weeks/months)
What we had:
# tests/unit/test_historical_resource.py
@patch('httpx.Client.request') # Mocked HTTP client
def test_get_historical_data(self, mock_request, ...):
mock_response.status_code = 200 # Always succeeds
mock_response.json.return_value = {...} # Instant responseWhat we DIDN'T test:
- ✗ Actual API endpoint existence
- ✗ Actual response times
- ✗ Actual timeout behavior
- ✗ What happens when query takes 60+ seconds
- ✗ SDK behavior with real production data volumes
Impact: Unit tests passed but production failed
Missing tests:
# Should have had this test
def test_historical_query_completes_within_timeout():
"""Verify 1 year queries complete within SDK timeout."""
client = OilPriceAPI(api_key=PROD_KEY, timeout=30)
start = time.time()
try:
historical = client.historical.get(
commodity="WTI_USD",
start_date="2024-01-01",
end_date="2024-12-31"
)
duration = time.time() - start
# FAIL: This would have caught the issue
assert duration < 30, f"Query took {duration}s, exceeds 30s timeout"
except TimeoutError:
pytest.fail("Query timed out - timeout too short!")What we should test:
- Query response times for different date ranges
- Timeout thresholds for different data volumes
- Performance degradation as data grows
- P50, P95, P99 response times
Missing validation:
# SDK assumed these endpoints existed but never verified
ENDPOINTS_ASSUMED = [
"/v1/prices/past_day",
"/v1/prices/past_week",
"/v1/prices/past_month",
"/v1/prices/past_year"
]
# Should have had this test
def test_all_historical_endpoints_exist():
"""Verify SDK endpoints actually exist in API."""
client = OilPriceAPI(api_key=TEST_KEY)
for endpoint in ENDPOINTS_ASSUMED:
response = client._client.get(f"{client.base_url}{endpoint}")
assert response.status_code != 404, f"Endpoint {endpoint} doesn't exist!"Issue: SDK v1.4.2 now uses 4 different endpoints. What if one doesn't exist?
What we DON'T monitor:
- ✗ SDK timeout rate (% of requests that timeout)
- ✗ Historical endpoint response times
- ✗ SDK error rates by version
- ✗ Customer retry rates (sign of persistent failures)
Should have alerts for:
ALERT: SDK timeout rate > 5% for 5 minutes
ALERT: Historical endpoint P95 > 60s
ALERT: Requests to /past_year endpoint taking >30s (SDK default timeout)
ALERT: Same user retrying same query >3 times in 5 minutes
How we should have detected this:
- Monitoring would show 100% timeout rate
- Alert fires to engineering
- Fix deployed before customer reports
Reality: Customer reported it first ❌
Missing end-to-end tests:
# tests/e2e/test_historical_real_api.py (DOESN'T EXIST)
@pytest.mark.e2e
@pytest.mark.slow
def test_1_year_historical_query_production():
"""Test real 1-year query against production API."""
client = OilPriceAPI(api_key=os.getenv("PROD_API_KEY"))
start_time = time.time()
historical = client.historical.get(
commodity="WTI_USD",
start_date=(datetime.now() - timedelta(days=365)).isoformat(),
end_date=datetime.now().isoformat(),
interval="daily"
)
duration = time.time() - start_time
# Validate
assert len(historical.data) > 300, "Should get ~365 daily records"
assert duration < client.timeout, f"Query took {duration}s, timeout is {client.timeout}s"
# Performance assertion
assert duration < 90, f"Query took {duration}s, should be <90s (with margin)"Run schedule:
- Every release (pre-deployment)
- Daily against production (detect degradation)
- After any backend changes to historical endpoints
Current release process (what we did):
- Write code
- Run unit tests (mocked)
- Publish to PyPI
- ❌ Hope nothing breaks
Should be:
- Write code
- Run unit tests (mocked)
- Run integration tests (real API)
- Run performance tests (measure response times)
- Test against production API (canary)
- Publish to PyPI
- Monitor error rates for 24h
Missing step: Pre-release validation against production
We never tested with realistic data volumes:
# Should have parametrized tests with production data volumes
@pytest.mark.parametrize("commodity,date_range,expected_records", [
("WTI_USD", ("2024-01-01", "2024-12-31"), 1_011_233), # Actual production volume
("BRENT_CRUDE_USD", ("2024-01-01", "2024-12-31"), 987_456),
# ... etc
])
def test_historical_with_production_volumes(commodity, date_range, expected_records):
"""Test SDK handles production-scale data volumes."""
# This would have revealed the timeout issueWe should know:
- How many records exist for each commodity/year
- Expected aggregation time for that volume
- Appropriate timeout for that volume
Missing tests:
def test_timeout_appropriateness_for_query_size():
"""Verify timeout scales with query complexity."""
test_cases = [
# (date_range_days, expected_timeout)
(1, 30), # 1 day = 30s timeout
(7, 30), # 1 week = 30s timeout
(30, 60), # 1 month = 60s timeout
(365, 120), # 1 year = 120s timeout
]
for days, expected_timeout in test_cases:
resource = HistoricalResource(mock_client)
start = datetime.now() - timedelta(days=days)
end = datetime.now()
actual_timeout = resource._calculate_timeout(start, end, None)
assert actual_timeout == expected_timeoutIssue: We never validated our timeout calculations were reasonable
SDK README should document:
- Expected response times for different date ranges
- Timeout defaults and how to override
- When to use custom timeouts
- Performance characteristics
Example missing documentation:
## Performance Guidelines
### Expected Response Times
- 1 day queries: ~1-2 seconds
- 1 week queries: ~5-10 seconds
- 1 month queries: ~15-25 seconds
- 1 year queries: ~60-90 seconds
### Custom Timeouts
For queries longer than 1 year, increase the timeout:
```python
historical = client.historical.get(
commodity="WTI_USD",
start_date="2020-01-01",
end_date="2024-12-31",
timeout=180 # 3 minutes for 5 years
)
---
### 10. Customer Feedback Loop Gaps 🟡 HIGH
**How we found out**: Customer email ❌
**How we SHOULD have found out**: Automated monitoring ✅
**Missing feedback mechanisms**:
1. SDK telemetry (opt-in error reporting)
2. Timeout event logging
3. Retry pattern detection
4. Version adoption tracking
**Should implement**:
```python
# oilpriceapi/telemetry.py (opt-in)
def log_timeout_event(endpoint, timeout, duration):
"""Log timeout events for analysis (opt-in)."""
if user_opted_in_to_telemetry():
send_to_analytics({
"event": "timeout",
"endpoint": endpoint,
"timeout": timeout,
"duration": duration,
"sdk_version": __version__,
"timestamp": datetime.utcnow()
})
-
Add integration tests that call real API
# New test file tests/integration/test_real_api_historical.py- Test each historical endpoint exists
- Test timeout handling with real queries
- Run in CI before every release
-
Add performance monitoring
- Track historical endpoint response times - Alert on P95 > 60s - Alert on timeout rate > 5% -
Add pre-release validation
# Run before every PyPI publish ./scripts/pre_release_validation.sh- Calls real API with test account
- Validates all endpoints work
- Measures response times
- Fails if any test takes > expected time
-
Add SDK telemetry (opt-in)
- Track timeout events
- Track error rates by SDK version
- Detect issues before customers report
-
Add performance regression tests
# tests/performance/test_historical_benchmarks.py- Baseline response times
- Alert if performance degrades >20%
-
Improve documentation
- Document expected response times
- Document when to use custom timeouts
- Add troubleshooting guide
-
Add contract testing
- Validate API schema matches SDK expectations
- Run on every backend deployment
-
Add canary releases
- Release new SDK versions to 5% of users first
- Monitor error rates before full rollout
-
Add synthetic monitoring
- Run realistic SDK queries every hour
- Alert if they fail
- Detect issues proactively
# tests/integration/test_historical_integration.py
@pytest.fixture
def prod_client():
"""Client configured for production API."""
return OilPriceAPI(
api_key=os.getenv("TEST_API_KEY"),
base_url="https://api.oilpriceapi.com"
)
class TestHistoricalIntegration:
"""Integration tests against real API."""
@pytest.mark.integration
def test_past_day_endpoint_exists(self, prod_client):
"""Verify /past_day endpoint exists."""
response = prod_client.historical.get(
commodity="WTI_USD",
start_date=(datetime.now() - timedelta(days=1)).date(),
end_date=datetime.now().date()
)
assert len(response.data) > 0
@pytest.mark.integration
def test_past_week_endpoint_exists(self, prod_client):
"""Verify /past_week endpoint exists."""
response = prod_client.historical.get(
commodity="WTI_USD",
start_date=(datetime.now() - timedelta(days=7)).date(),
end_date=datetime.now().date()
)
assert len(response.data) > 0
@pytest.mark.integration
@pytest.mark.slow
def test_1_year_query_completes_within_timeout(self, prod_client):
"""Verify 1 year queries complete within SDK timeout."""
start_time = time.time()
response = prod_client.historical.get(
commodity="WTI_USD",
start_date="2024-01-01",
end_date="2024-12-31",
interval="daily"
)
duration = time.time() - start_time
assert len(response.data) > 300, "Should get ~365 daily records"
assert duration < 120, f"Query took {duration}s, should be <120s timeout"
@pytest.mark.integration
def test_endpoint_selection_uses_correct_endpoint(self, prod_client, monkeypatch):
"""Verify SDK uses optimal endpoint based on date range."""
called_endpoints = []
original_request = prod_client.request
def track_request(method, path, **kwargs):
called_endpoints.append(path)
return original_request(method, path, **kwargs)
monkeypatch.setattr(prod_client, "request", track_request)
# 1 week query should use /past_week
prod_client.historical.get(
commodity="WTI_USD",
start_date=(datetime.now() - timedelta(days=7)).date(),
end_date=datetime.now().date()
)
assert "/v1/prices/past_week" in called_endpoints[-1]# tests/performance/test_historical_performance.py
class TestHistoricalPerformance:
"""Performance benchmarks for historical queries."""
@pytest.mark.performance
@pytest.mark.parametrize("days,max_duration", [
(1, 5), # 1 day should be <5s
(7, 15), # 1 week should be <15s
(30, 30), # 1 month should be <30s
(365, 90), # 1 year should be <90s
])
def test_query_performance_baseline(self, prod_client, days, max_duration):
"""Verify queries complete within expected time."""
start_date = (datetime.now() - timedelta(days=days)).date()
end_date = datetime.now().date()
start_time = time.time()
response = prod_client.historical.get(
commodity="WTI_USD",
start_date=start_date,
end_date=end_date,
interval="daily"
)
duration = time.time() - start_time
assert duration < max_duration, \
f"{days}-day query took {duration:.2f}s, expected <{max_duration}s"
assert len(response.data) > 0
@pytest.mark.performance
def test_timeout_is_appropriate_for_query_size(self):
"""Verify calculated timeouts are reasonable."""
resource = HistoricalResource(Mock())
test_cases = [
# (start, end, expected_min, expected_max)
("2024-12-15", "2024-12-16", 25, 35), # 1 day: ~30s
("2024-12-09", "2024-12-16", 25, 35), # 1 week: ~30s
("2024-11-16", "2024-12-16", 55, 65), # 1 month: ~60s
("2024-01-01", "2024-12-31", 115, 125), # 1 year: ~120s
]
for start, end, min_t, max_t in test_cases:
timeout = resource._calculate_timeout(start, end, None)
assert min_t <= timeout <= max_t, \
f"Timeout {timeout}s not in range [{min_t}, {max_t}] for {start} to {end}"# tests/unit/test_timeout_handling.py
class TestTimeoutHandling:
"""Test SDK timeout behavior."""
def test_timeout_exception_raised_on_slow_query(self):
"""Verify TimeoutError raised when query exceeds timeout."""
client = OilPriceAPI(api_key="test", timeout=1) # 1 second timeout
# Mock slow response (2 seconds)
with patch('httpx.Client.request') as mock_request:
mock_request.side_effect = httpx.TimeoutException("Request timed out")
with pytest.raises(TimeoutError) as exc_info:
client.historical.get(commodity="WTI_USD")
assert "timed out" in str(exc_info.value).lower()
def test_custom_timeout_overrides_default(self):
"""Verify custom timeout parameter is respected."""
client = OilPriceAPI(api_key="test", timeout=30)
with patch('httpx.Client.request') as mock_request:
mock_request.return_value = Mock(status_code=200, json=lambda: {"data": {"prices": []}})
client.historical.get(
commodity="WTI_USD",
start_date="2024-01-01",
end_date="2024-12-31",
timeout=180 # Custom 3 min timeout
)
# Verify request was called with 180s timeout
call_kwargs = mock_request.call_args.kwargs
assert call_kwargs["timeout"] == 180
def test_retry_behavior_on_timeout(self):
"""Verify SDK retries on timeout."""
client = OilPriceAPI(api_key="test", timeout=30, max_retries=3)
with patch('httpx.Client.request') as mock_request:
# First 2 attempts timeout, 3rd succeeds
mock_request.side_effect = [
httpx.TimeoutException("Timeout 1"),
httpx.TimeoutException("Timeout 2"),
Mock(status_code=200, json=lambda: {"data": {"prices": []}})
]
response = client.historical.get(commodity="WTI_USD")
# Should have retried 3 times
assert mock_request.call_count == 3# Track in production API
METRICS_TO_TRACK = {
"historical_endpoint_response_time": {
"metric": "histogram",
"labels": ["endpoint", "commodity", "date_range_days"],
"alert": "p95 > 60s for 5min"
},
"historical_endpoint_timeout_rate": {
"metric": "counter",
"labels": ["endpoint", "sdk_version"],
"alert": "rate > 5% for 5min"
},
"sdk_error_rate": {
"metric": "counter",
"labels": ["sdk_version", "error_type"],
"alert": "rate > 10% for 5min"
}
}# prometheus/alerts.yml
groups:
- name: sdk_health
rules:
- alert: HistoricalEndpointSlow
expr: histogram_quantile(0.95, rate(historical_response_time_bucket[5m])) > 60
annotations:
summary: "Historical endpoint P95 latency > 60s"
- alert: HighTimeoutRate
expr: rate(historical_timeout_total[5m]) / rate(historical_requests_total[5m]) > 0.05
annotations:
summary: "Historical endpoint timeout rate > 5%"
- alert: SDKVersionErrorSpike
expr: rate(sdk_errors_total{sdk_version="1.4.2"}[5m]) > 0.1
annotations:
summary: "SDK v1.4.2 error rate > 10%"# SDK Release Checklist
## Before Publishing to PyPI
### Code Quality
- [ ] All unit tests pass
- [ ] All integration tests pass (real API)
- [ ] All performance tests pass (response times)
- [ ] Code coverage > 80%
- [ ] No linting errors
### Integration Validation
- [ ] Test against production API with test account
- [ ] Verify all endpoints exist and return expected responses
- [ ] Measure response times for common queries:
- [ ] 1 day query: <5s
- [ ] 1 week query: <15s
- [ ] 1 month query: <30s
- [ ] 1 year query: <90s
- [ ] Verify timeout handling works correctly
- [ ] Test with production data volumes
### Performance Validation
- [ ] Run performance benchmarks
- [ ] Compare with previous version (no regression)
- [ ] Verify timeouts are appropriate
### Documentation
- [ ] CHANGELOG updated with all changes
- [ ] README updated if needed
- [ ] Examples tested and working
- [ ] Migration guide if breaking changes
### Post-Release
- [ ] Monitor error rates for 24h
- [ ] Check SDK download stats
- [ ] Monitor support tickets for issues
- [ ] Create GitHub release with notesBefore this issue:
- Customer impact: 100% failure rate
- Support burden: Manual investigation + customer communication
- Engineering time: 2 hours emergency fix
- Reputation risk: Early customer hit by P0 bug
With proper testing (what we should have):
- Would have caught in pre-release: ✅
- Customer impact: 0% (caught before release)
- Support burden: 0 (no customer issues)
- Engineering time: 15 min to add timeout before release
- Reputation risk: 0 (internal catch)
ROI of QA investment:
- Time to write tests: ~4 hours
- Time saved on this bug: 2 hours (investigation) + unknown reputation cost
- But: Prevents ALL future similar issues
- Real ROI: Catches issues we don't know about yet
- ✅ Integration tests calling real API (tests/integration/)
- ✅ Performance tests with realistic data volumes
- ✅ Pre-release validation script (must pass before PyPI publish)
- ✅ Monitoring for timeout rates and response times
- SDK telemetry (opt-in) to detect issues faster
- Contract tests to validate API assumptions
- Documentation of performance characteristics
- Canary releases for safer rollouts
- Synthetic monitoring running SDK queries continuously
- Automated performance regression detection
- Customer feedback loop improvements
We got lucky this was caught by an early paying customer who reported it nicely.
Could have been worse:
- Hundreds of users silently failing
- Support flood of angry customers
- Social media backlash
- Churn of paying customers
Prevention > Reaction:
- Investment: ~1 day to set up proper testing
- Payoff: Catch P0 bugs before customers see them
- Long-term: Build confidence in releases
Recommendation: Treat this as a wake-up call. Add the P0 items this week before next release.