
Add integration tests for scrapers with cached HTML samples #73

@leblancfg

Problem

There are currently no tests that verify the scrapers parse provider pages correctly, and no way to catch regressions when those pages change.

Proposed Solution

Create integration tests with cached HTML samples:

tests/
  fixtures/
    openai_deprecations.html
    anthropic_deprecations.html
    google_changelog.html
    aws_bedrock_lifecycle.html
    azure_foundry_lifecycle.html
    cohere_deprecations.html
    xai_models.html
  integration/
    test_openai_scraper.py
    test_anthropic_scraper.py
    test_google_scraper.py
    test_aws_scraper.py
    test_azure_scraper.py
    test_cohere_scraper.py
    test_xai_scraper.py

Example Test

import re

import pytest
from pathlib import Path

from src.scrapers.openai_scraper import OpenAIScraper

def test_openai_scraper_with_fixture():
    """Test OpenAI scraper against known HTML."""
    scraper = OpenAIScraper()
    
    # Load cached HTML
    fixture_path = Path(__file__).parent.parent / "fixtures" / "openai_deprecations.html"
    with open(fixture_path) as f:
        html = f.read()
    
    # Extract deprecations
    items = scraper.extract_structured_deprecations(html)
    
    # Assertions
    assert len(items) > 0, "Should extract at least one deprecation"
    
    for item in items:
        assert item.provider == "OpenAI"
        assert item.model_id, "model_id should not be empty"
        assert item.model_id not in ["N/A", "TBD"], "model_id should be valid"
        
        # At least one date should be present
        assert item.announcement_date or item.shutdown_date, "Should have at least one date"
        
        # Dates should be ISO format if present
        if item.shutdown_date:
            assert re.match(r'^\d{4}-\d{2}-\d{2}$', item.shutdown_date)
        if item.announcement_date:
            assert re.match(r'^\d{4}-\d{2}-\d{2}$', item.announcement_date)
        
        # Should have context
        assert item.deprecation_context, "Should have deprecation context"

Benefits

  • Catch scraper breakage immediately
  • Fast tests (no network requests)
  • Document expected output format
  • Make refactoring safer
  • Enable TDD for scraper fixes

Implementation Steps

  1. Capture current HTML from each provider page
  2. Create fixture files
  3. Write tests for each scraper
  4. Run tests in CI
  5. Update fixtures when provider pages change (with git history)
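Steps 1–2 (and later fixture refreshes in step 5) could be handled by a small capture script. This is a sketch; the script name, the `PAGES` mapping, and the URLs are placeholders, and the real script should target the pages each scraper actually fetches:

```python
# scripts/capture_fixtures.py (proposed) — snapshot provider pages into tests/fixtures/.
# The URLs below are illustrative placeholders only.
from pathlib import Path
from urllib.request import urlopen

PAGES = {
    "openai_deprecations.html": "https://example.com/openai/deprecations",
    "anthropic_deprecations.html": "https://example.com/anthropic/deprecations",
}

def save_fixture(name: str, html: str,
                 fixtures_dir: Path = Path("tests/fixtures")) -> Path:
    """Write captured HTML into the fixtures directory, creating it if needed."""
    fixtures_dir.mkdir(parents=True, exist_ok=True)
    path = fixtures_dir / name
    path.write_text(html, encoding="utf-8")
    return path

def capture_all() -> None:
    """Download each provider page and store it as a committed fixture."""
    for name, url in PAGES.items():
        with urlopen(url) as resp:
            save_fixture(name, resp.read().decode("utf-8"))

if __name__ == "__main__":
    capture_all()
```

Committing the refreshed fixtures keeps the git history of provider-page changes mentioned in step 5.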
