-
-
Notifications
You must be signed in to change notification settings - Fork 0
Open
Description
Problem
Currently there are no tests to ensure scrapers work correctly or to catch regressions when provider pages change.
Proposed Solution
Create integration tests with cached HTML samples:
tests/
fixtures/
openai_deprecations.html
anthropic_deprecations.html
google_changelog.html
aws_bedrock_lifecycle.html
azure_foundry_lifecycle.html
cohere_deprecations.html
xai_models.html
integration/
test_openai_scraper.py
test_anthropic_scraper.py
test_google_scraper.py
test_aws_scraper.py
test_azure_scraper.py
test_cohere_scraper.py
test_xai_scraper.py
Example Test
import pytest
from pathlib import Path
from src.scrapers.openai_scraper import OpenAIScraper
def test_openai_scraper_with_fixture():
"""Test OpenAI scraper against known HTML."""
scraper = OpenAIScraper()
# Load cached HTML
fixture_path = Path(__file__).parent.parent / "fixtures" / "openai_deprecations.html"
with open(fixture_path) as f:
html = f.read()
# Extract deprecations
items = scraper.extract_structured_deprecations(html)
# Assertions
assert len(items) > 0, "Should extract at least one deprecation"
for item in items:
assert item.provider == "OpenAI"
assert item.model_id, "model_id should not be empty"
assert item.model_id not in ["N/A", "TBD"], "model_id should be valid"
# At least one date should be present
assert item.announcement_date or item.shutdown_date, "Should have at least one date"
# Dates should be ISO format if present
if item.shutdown_date:
assert re.match(r'^\d{4}-\d{2}-\d{2}$', item.shutdown_date)
if item.announcement_date:
assert re.match(r'^\d{4}-\d{2}-\d{2}$', item.announcement_date)
# Should have context
assert item.deprecation_context, "Should have deprecation context"Benefits
- Catch scraper breakage immediately
- Fast tests (no network requests)
- Document expected output format
- Make refactoring safer
- Enable TDD for scraper fixes
Implementation Steps
- Capture current HTML from each provider page
- Create fixture files
- Write tests for each scraper
- Run tests in CI
- Update fixtures when provider pages change (with git history)
Reactions are currently unavailable
Metadata
Metadata
Assignees
Labels
No labels