Skip to content

Latest commit

 

History

History
697 lines (547 loc) · 20.3 KB

File metadata and controls

697 lines (547 loc) · 20.3 KB

Senior QA Engineer Assessment - Historical Timeout Issue

Executive Summary

Severity: P0 (100% failure rate for a core feature) Root Cause: Lack of integration testing, performance testing, and monitoring Detection: Customer report (not caught by CI/CD or monitoring) Time to Detection: Unknown (could have been broken for weeks/months)


What We Missed - Gap Analysis

1. Integration Testing Gaps 🔴 CRITICAL

What we had:

# tests/unit/test_historical_resource.py
@patch('httpx.Client.request')  # Mocked HTTP client
def test_get_historical_data(self, mock_request, ...):
    mock_response.status_code = 200  # Always succeeds
    mock_response.json.return_value = {...}  # Instant response

What we DIDN'T test:

  • ✗ Actual API endpoint existence
  • ✗ Actual response times
  • ✗ Actual timeout behavior
  • ✗ What happens when query takes 60+ seconds
  • ✗ SDK behavior with real production data volumes

Impact: Unit tests passed but production failed


2. Performance Testing Gaps 🔴 CRITICAL

Missing tests:

# Should have had this test
def test_historical_query_completes_within_timeout():
    """Verify 1 year queries complete within SDK timeout."""
    client = OilPriceAPI(api_key=PROD_KEY, timeout=30)

    start = time.time()
    try:
        historical = client.historical.get(
            commodity="WTI_USD",
            start_date="2024-01-01",
            end_date="2024-12-31"
        )
        duration = time.time() - start

        # FAIL: This would have caught the issue
        assert duration < 30, f"Query took {duration}s, exceeds 30s timeout"
    except TimeoutError:
        pytest.fail("Query timed out - timeout too short!")

What we should test:

  • Query response times for different date ranges
  • Timeout thresholds for different data volumes
  • Performance degradation as data grows
  • P50, P95, P99 response times

3. Endpoint Contract Testing Gaps 🟡 HIGH

Missing validation:

# SDK assumed these endpoints existed but never verified
ENDPOINTS_ASSUMED = [
    "/v1/prices/past_day",
    "/v1/prices/past_week",
    "/v1/prices/past_month",
    "/v1/prices/past_year"
]

# Should have had this test
def test_all_historical_endpoints_exist():
    """Verify SDK endpoints actually exist in API."""
    client = OilPriceAPI(api_key=TEST_KEY)

    for endpoint in ENDPOINTS_ASSUMED:
        response = client._client.get(f"{client.base_url}{endpoint}")
        assert response.status_code != 404, f"Endpoint {endpoint} doesn't exist!"

Issue: SDK v1.4.2 now uses 4 different endpoints. What if one doesn't exist?


4. Monitoring & Alerting Gaps 🔴 CRITICAL

What we DON'T monitor:

  • ✗ SDK timeout rate (% of requests that timeout)
  • ✗ Historical endpoint response times
  • ✗ SDK error rates by version
  • ✗ Customer retry rates (sign of persistent failures)

Should have alerts for:

ALERT: SDK timeout rate > 5% for 5 minutes
ALERT: Historical endpoint P95 > 60s
ALERT: Requests to /past_year endpoint taking >30s (SDK default timeout)
ALERT: Same user retrying same query >3 times in 5 minutes

How we should have detected this:

  1. Monitoring would show 100% timeout rate
  2. Alert fires to engineering
  3. Fix deployed before customer reports

Reality: Customer reported it first ❌


5. Regression Testing Gaps 🟡 HIGH

Missing end-to-end tests:

# tests/e2e/test_historical_real_api.py (DOESN'T EXIST)

@pytest.mark.e2e
@pytest.mark.slow
def test_1_year_historical_query_production():
    """Test real 1-year query against production API."""
    client = OilPriceAPI(api_key=os.getenv("PROD_API_KEY"))

    start_time = time.time()
    historical = client.historical.get(
        commodity="WTI_USD",
        start_date=(datetime.now() - timedelta(days=365)).isoformat(),
        end_date=datetime.now().isoformat(),
        interval="daily"
    )
    duration = time.time() - start_time

    # Validate
    assert len(historical.data) > 300, "Should get ~365 daily records"
    assert duration < client.timeout, f"Query took {duration}s, timeout is {client.timeout}s"

    # Performance assertion
    assert duration < 90, f"Query took {duration}s, should be <90s (with margin)"

Run schedule:

  • Every release (pre-deployment)
  • Daily against production (detect degradation)
  • After any backend changes to historical endpoints

6. SDK Release Process Gaps 🟡 HIGH

Current release process (what we did):

  1. Write code
  2. Run unit tests (mocked)
  3. Publish to PyPI
  4. ❌ Hope nothing breaks

Should be:

  1. Write code
  2. Run unit tests (mocked)
  3. Run integration tests (real API)
  4. Run performance tests (measure response times)
  5. Test against production API (canary)
  6. Publish to PyPI
  7. Monitor error rates for 24h

Missing step: Pre-release validation against production


7. Data-Driven Testing Gaps 🟡 HIGH

We never tested with realistic data volumes:

# Should have parametrized tests with production data volumes

@pytest.mark.parametrize("commodity,date_range,expected_records", [
    ("WTI_USD", ("2024-01-01", "2024-12-31"), 1_011_233),  # Actual production volume
    ("BRENT_CRUDE_USD", ("2024-01-01", "2024-12-31"), 987_456),
    # ... etc
])
def test_historical_with_production_volumes(commodity, date_range, expected_records):
    """Test SDK handles production-scale data volumes."""
    # This would have revealed the timeout issue

We should know:

  • How many records exist for each commodity/year
  • Expected aggregation time for that volume
  • Appropriate timeout for that volume

8. Timeout Configuration Testing Gaps 🟡 HIGH

Missing tests:

def test_timeout_appropriateness_for_query_size():
    """Verify timeout scales with query complexity."""

    test_cases = [
        # (date_range_days, expected_timeout)
        (1, 30),    # 1 day = 30s timeout
        (7, 30),    # 1 week = 30s timeout
        (30, 60),   # 1 month = 60s timeout
        (365, 120), # 1 year = 120s timeout
    ]

    for days, expected_timeout in test_cases:
        resource = HistoricalResource(mock_client)
        start = datetime.now() - timedelta(days=days)
        end = datetime.now()

        actual_timeout = resource._calculate_timeout(start, end, None)
        assert actual_timeout == expected_timeout

Issue: We never validated our timeout calculations were reasonable


9. Documentation Gaps 🟢 MEDIUM

SDK README should document:

  • Expected response times for different date ranges
  • Timeout defaults and how to override
  • When to use custom timeouts
  • Performance characteristics

Example missing documentation:

## Performance Guidelines

### Expected Response Times
- 1 day queries: ~1-2 seconds
- 1 week queries: ~5-10 seconds
- 1 month queries: ~15-25 seconds
- 1 year queries: ~60-90 seconds

### Custom Timeouts
For queries longer than 1 year, increase the timeout:
```python
historical = client.historical.get(
    commodity="WTI_USD",
    start_date="2020-01-01",
    end_date="2024-12-31",
    timeout=180  # 3 minutes for 5 years
)

---

### 10. Customer Feedback Loop Gaps 🟡 HIGH

**How we found out**: Customer email ❌
**How we SHOULD have found out**: Automated monitoring ✅

**Missing feedback mechanisms**:
1. SDK telemetry (opt-in error reporting)
2. Timeout event logging
3. Retry pattern detection
4. Version adoption tracking

**Should implement**:
```python
# oilpriceapi/telemetry.py (opt-in)
def log_timeout_event(endpoint, timeout, duration):
    """Log timeout events for analysis (opt-in)."""
    if user_opted_in_to_telemetry():
        send_to_analytics({
            "event": "timeout",
            "endpoint": endpoint,
            "timeout": timeout,
            "duration": duration,
            "sdk_version": __version__,
            "timestamp": datetime.utcnow()
        })

Recommended Immediate Actions

P0 - This Week 🔴

  1. Add integration tests that call real API

    # New test file
    tests/integration/test_real_api_historical.py
    • Test each historical endpoint exists
    • Test timeout handling with real queries
    • Run in CI before every release
  2. Add performance monitoring

    - Track historical endpoint response times
    - Alert on P95 > 60s
    - Alert on timeout rate > 5%
    
  3. Add pre-release validation

    # Run before every PyPI publish
    ./scripts/pre_release_validation.sh
    • Calls real API with test account
    • Validates all endpoints work
    • Measures response times
    • Fails if any test takes > expected time

P1 - Next Sprint 🟡

  1. Add SDK telemetry (opt-in)

    • Track timeout events
    • Track error rates by SDK version
    • Detect issues before customers report
  2. Add performance regression tests

    # tests/performance/test_historical_benchmarks.py
    • Baseline response times
    • Alert if performance degrades >20%
  3. Improve documentation

    • Document expected response times
    • Document when to use custom timeouts
    • Add troubleshooting guide

P2 - Future 🟢

  1. Add contract testing

    • Validate API schema matches SDK expectations
    • Run on every backend deployment
  2. Add canary releases

    • Release new SDK versions to 5% of users first
    • Monitor error rates before full rollout
  3. Add synthetic monitoring

    • Run realistic SDK queries every hour
    • Alert if they fail
    • Detect issues proactively

Specific Test Cases We Should Add

1. Integration Test Suite (NEW)

# tests/integration/test_historical_integration.py

@pytest.fixture
def prod_client():
    """Client configured for production API."""
    return OilPriceAPI(
        api_key=os.getenv("TEST_API_KEY"),
        base_url="https://api.oilpriceapi.com"
    )

class TestHistoricalIntegration:
    """Integration tests against real API."""

    @pytest.mark.integration
    def test_past_day_endpoint_exists(self, prod_client):
        """Verify /past_day endpoint exists."""
        response = prod_client.historical.get(
            commodity="WTI_USD",
            start_date=(datetime.now() - timedelta(days=1)).date(),
            end_date=datetime.now().date()
        )
        assert len(response.data) > 0

    @pytest.mark.integration
    def test_past_week_endpoint_exists(self, prod_client):
        """Verify /past_week endpoint exists."""
        response = prod_client.historical.get(
            commodity="WTI_USD",
            start_date=(datetime.now() - timedelta(days=7)).date(),
            end_date=datetime.now().date()
        )
        assert len(response.data) > 0

    @pytest.mark.integration
    @pytest.mark.slow
    def test_1_year_query_completes_within_timeout(self, prod_client):
        """Verify 1 year queries complete within SDK timeout."""
        start_time = time.time()

        response = prod_client.historical.get(
            commodity="WTI_USD",
            start_date="2024-01-01",
            end_date="2024-12-31",
            interval="daily"
        )

        duration = time.time() - start_time

        assert len(response.data) > 300, "Should get ~365 daily records"
        assert duration < 120, f"Query took {duration}s, should be <120s timeout"

    @pytest.mark.integration
    def test_endpoint_selection_uses_correct_endpoint(self, prod_client, monkeypatch):
        """Verify SDK uses optimal endpoint based on date range."""
        called_endpoints = []

        original_request = prod_client.request
        def track_request(method, path, **kwargs):
            called_endpoints.append(path)
            return original_request(method, path, **kwargs)

        monkeypatch.setattr(prod_client, "request", track_request)

        # 1 week query should use /past_week
        prod_client.historical.get(
            commodity="WTI_USD",
            start_date=(datetime.now() - timedelta(days=7)).date(),
            end_date=datetime.now().date()
        )

        assert "/v1/prices/past_week" in called_endpoints[-1]

2. Performance Test Suite (NEW)

# tests/performance/test_historical_performance.py

class TestHistoricalPerformance:
    """Performance benchmarks for historical queries."""

    @pytest.mark.performance
    @pytest.mark.parametrize("days,max_duration", [
        (1, 5),      # 1 day should be <5s
        (7, 15),     # 1 week should be <15s
        (30, 30),    # 1 month should be <30s
        (365, 90),   # 1 year should be <90s
    ])
    def test_query_performance_baseline(self, prod_client, days, max_duration):
        """Verify queries complete within expected time."""
        start_date = (datetime.now() - timedelta(days=days)).date()
        end_date = datetime.now().date()

        start_time = time.time()
        response = prod_client.historical.get(
            commodity="WTI_USD",
            start_date=start_date,
            end_date=end_date,
            interval="daily"
        )
        duration = time.time() - start_time

        assert duration < max_duration, \
            f"{days}-day query took {duration:.2f}s, expected <{max_duration}s"
        assert len(response.data) > 0

    @pytest.mark.performance
    def test_timeout_is_appropriate_for_query_size(self):
        """Verify calculated timeouts are reasonable."""
        resource = HistoricalResource(Mock())

        test_cases = [
            # (start, end, expected_min, expected_max)
            ("2024-12-15", "2024-12-16", 25, 35),    # 1 day: ~30s
            ("2024-12-09", "2024-12-16", 25, 35),    # 1 week: ~30s
            ("2024-11-16", "2024-12-16", 55, 65),    # 1 month: ~60s
            ("2024-01-01", "2024-12-31", 115, 125),  # 1 year: ~120s
        ]

        for start, end, min_t, max_t in test_cases:
            timeout = resource._calculate_timeout(start, end, None)
            assert min_t <= timeout <= max_t, \
                f"Timeout {timeout}s not in range [{min_t}, {max_t}] for {start} to {end}"

3. Timeout Behavior Tests (NEW)

# tests/unit/test_timeout_handling.py

class TestTimeoutHandling:
    """Test SDK timeout behavior."""

    def test_timeout_exception_raised_on_slow_query(self):
        """Verify TimeoutError raised when query exceeds timeout."""
        client = OilPriceAPI(api_key="test", timeout=1)  # 1 second timeout

        # Mock slow response (2 seconds)
        with patch('httpx.Client.request') as mock_request:
            mock_request.side_effect = httpx.TimeoutException("Request timed out")

            with pytest.raises(TimeoutError) as exc_info:
                client.historical.get(commodity="WTI_USD")

            assert "timed out" in str(exc_info.value).lower()

    def test_custom_timeout_overrides_default(self):
        """Verify custom timeout parameter is respected."""
        client = OilPriceAPI(api_key="test", timeout=30)

        with patch('httpx.Client.request') as mock_request:
            mock_request.return_value = Mock(status_code=200, json=lambda: {"data": {"prices": []}})

            client.historical.get(
                commodity="WTI_USD",
                start_date="2024-01-01",
                end_date="2024-12-31",
                timeout=180  # Custom 3 min timeout
            )

            # Verify request was called with 180s timeout
            call_kwargs = mock_request.call_args.kwargs
            assert call_kwargs["timeout"] == 180

    def test_retry_behavior_on_timeout(self):
        """Verify SDK retries on timeout."""
        client = OilPriceAPI(api_key="test", timeout=30, max_retries=3)

        with patch('httpx.Client.request') as mock_request:
            # First 2 attempts timeout, 3rd succeeds
            mock_request.side_effect = [
                httpx.TimeoutException("Timeout 1"),
                httpx.TimeoutException("Timeout 2"),
                Mock(status_code=200, json=lambda: {"data": {"prices": []}})
            ]

            response = client.historical.get(commodity="WTI_USD")

            # Should have retried 3 times
            assert mock_request.call_count == 3

Monitoring & Alerting We Should Add

1. SDK Metrics (Backend)

# Track in production API
METRICS_TO_TRACK = {
    "historical_endpoint_response_time": {
        "metric": "histogram",
        "labels": ["endpoint", "commodity", "date_range_days"],
        "alert": "p95 > 60s for 5min"
    },
    "historical_endpoint_timeout_rate": {
        "metric": "counter",
        "labels": ["endpoint", "sdk_version"],
        "alert": "rate > 5% for 5min"
    },
    "sdk_error_rate": {
        "metric": "counter",
        "labels": ["sdk_version", "error_type"],
        "alert": "rate > 10% for 5min"
    }
}

2. Alerts to Create

# prometheus/alerts.yml
groups:
  - name: sdk_health
    rules:
      - alert: HistoricalEndpointSlow
        expr: histogram_quantile(0.95, rate(historical_response_time_bucket[5m])) > 60
        annotations:
          summary: "Historical endpoint P95 latency > 60s"

      - alert: HighTimeoutRate
        expr: rate(historical_timeout_total[5m]) / rate(historical_requests_total[5m]) > 0.05
        annotations:
          summary: "Historical endpoint timeout rate > 5%"

      - alert: SDKVersionErrorSpike
        expr: rate(sdk_errors_total{sdk_version="1.4.2"}[5m]) > 0.1
        annotations:
          summary: "SDK v1.4.2 error rate > 10%"

Pre-Release Checklist (SHOULD HAVE HAD)

# SDK Release Checklist

## Before Publishing to PyPI

### Code Quality
- [ ] All unit tests pass
- [ ] All integration tests pass (real API)
- [ ] All performance tests pass (response times)
- [ ] Code coverage > 80%
- [ ] No linting errors

### Integration Validation
- [ ] Test against production API with test account
- [ ] Verify all endpoints exist and return expected responses
- [ ] Measure response times for common queries:
  - [ ] 1 day query: <5s
  - [ ] 1 week query: <15s
  - [ ] 1 month query: <30s
  - [ ] 1 year query: <90s
- [ ] Verify timeout handling works correctly
- [ ] Test with production data volumes

### Performance Validation
- [ ] Run performance benchmarks
- [ ] Compare with previous version (no regression)
- [ ] Verify timeouts are appropriate

### Documentation
- [ ] CHANGELOG updated with all changes
- [ ] README updated if needed
- [ ] Examples tested and working
- [ ] Migration guide if breaking changes

### Post-Release
- [ ] Monitor error rates for 24h
- [ ] Check SDK download stats
- [ ] Monitor support tickets for issues
- [ ] Create GitHub release with notes

Cost of Not Having These Tests

Before this issue:

  • Customer impact: 100% failure rate
  • Support burden: Manual investigation + customer communication
  • Engineering time: 2 hours emergency fix
  • Reputation risk: Early customer hit by P0 bug

With proper testing (what we should have):

  • Would have caught in pre-release: ✅
  • Customer impact: 0% (caught before release)
  • Support burden: 0 (no customer issues)
  • Engineering time: 15 min to add timeout before release
  • Reputation risk: 0 (internal catch)

ROI of QA investment:

  • Time to write tests: ~4 hours
  • Time saved on this bug: 2 hours (investigation) + unknown reputation cost
  • But: Prevents ALL future similar issues
  • Real ROI: Catches issues we don't know about yet

Summary - What We Must Do

Critical (This Week)

  1. Integration tests calling real API (tests/integration/)
  2. Performance tests with realistic data volumes
  3. Pre-release validation script (must pass before PyPI publish)
  4. Monitoring for timeout rates and response times

Important (Next Sprint)

  1. SDK telemetry (opt-in) to detect issues faster
  2. Contract tests to validate API assumptions
  3. Documentation of performance characteristics
  4. Canary releases for safer rollouts

Nice to Have (Future)

  1. Synthetic monitoring running SDK queries continuously
  2. Automated performance regression detection
  3. Customer feedback loop improvements

Bottom Line

We got lucky this was caught by an early paying customer who reported it nicely.

Could have been worse:

  • Hundreds of users silently failing
  • Support flood of angry customers
  • Social media backlash
  • Churn of paying customers

Prevention > Reaction:

  • Investment: ~1 day to set up proper testing
  • Payoff: Catch P0 bugs before customers see them
  • Long-term: Build confidence in releases

Recommendation: Treat this as a wake-up call. Add the P0 items this week before next release.