refactor: simplify model validation to use Claude AI

zealoushacker · zealoushacker · commit 79384981461b · 2025-09-07T17:27:34.000-06:00
Major simplification of CI/CD:
- Remove complex Python model validation scripts (400+ lines)
- Let Claude handle model validation intelligently via GitHub Actions
- Claude fetches latest models from docs.anthropic.com/en/docs/about-claude/models/overview.md
- Add comprehensive notebook validation script for local testing
  - Interactive dashboard with progress tracking
  - Auto-fix for deprecated models
  - GitHub issue export format
  - Idempotent with state persistence
- Simplify CI to use a single Python version (3.11)
- Update workflows to use Claude for all intelligent validation

Benefits:
- No more hardcoded model lists to maintain
- Claude understands context (e.g., educational examples)
- 50% faster CI (removed matrix strategy)
- Single source of truth for models (docs site)
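
Illustration only, not the code added in this commit: a minimal sketch of
what "idempotent with state persistence" could look like for the local
notebook check. The state file name is taken from the .gitignore entry in
this change. The function names, the regex, and the state layout are all
hypothetical. The sketch only finds dated snapshot IDs (the kind the
workflow prompt suggests replacing with -latest aliases). It does not
decide whether a model is deprecated.

```python
import hashlib
import json
import re
from pathlib import Path

# Assumption: dated snapshot IDs look like "claude-...-YYYYMMDD"; aliases do not.
DATED_MODEL = re.compile(r"claude-[a-z0-9.-]+?-\d{8}")
# File name taken from the .gitignore entry; the JSON layout below is hypothetical.
STATE_FILE = Path(".notebook_validation_state.json")


def find_dated_models(notebook_path):
    """Return the sorted dated model IDs found in a notebook's code cells."""
    nb = json.loads(Path(notebook_path).read_text(encoding="utf-8"))
    found = set()
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        if isinstance(source, list):  # nbformat stores source as str or list of str
            source = "".join(source)
        found.update(DATED_MODEL.findall(source))
    return sorted(found)


def validate(paths, state_file=STATE_FILE):
    """Check each notebook once per content hash; reruns skip unchanged files."""
    state_file = Path(state_file)
    state = json.loads(state_file.read_text()) if state_file.exists() else {}
    for path in map(str, paths):
        digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
        if state.get(path, {}).get("sha256") == digest:
            continue  # already validated in a previous run
        state[path] = {"sha256": digest, "dated_models": find_dated_models(path)}
    state_file.write_text(json.dumps(state, indent=2, sort_keys=True))
    return state
```

Keying the state on a content hash rather than a timestamp means a second
run over unchanged notebooks does no work and leaves the recorded results
unchanged, while any edit to a notebook triggers a fresh check.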
diff --git a/.github/workflows/claude-model-check.yml b/.github/workflows/claude-model-check.yml
@@ -21,40 +21,23 @@ jobs:
         with:
           fetch-depth: 0
       
-      - name: Install uv
-        uses: astral-sh/setup-uv@v4
-      
-      - name: Setup Python
-        run: uv python install 3.11
-      
-      - name: Install dependencies
-        run: uv sync
-      
-      - name: Check models with script
-        id: model_check
-        run: |
-          uv run python scripts/check_models.py --github-output || true
-      
-      # Only run Claude validation for repo members (API costs)
       - name: Claude Model Validation
-        if: |
-          github.event.pull_request.author_association == 'MEMBER' ||
-          github.event.pull_request.author_association == 'OWNER'
         uses: anthropics/claude-code-action@beta
         with:
           use_sticky_comment: true
           anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }}
           github_token: ${{ secrets.GITHUB_TOKEN }}
           timeout_minutes: "5"
           direct_prompt: |
-            Review the changed files for Claude model usage. 
+            Review the changed files for Claude model usage.
             
-            Check the latest models at: https://docs.anthropic.com/en/docs/about-claude/models/overview.md
+            First, fetch the current list of allowed models from:
+            https://docs.anthropic.com/en/docs/about-claude/models/overview.md
             
-            Please check for:
-            1. Any internal/non-public model names
-            2. Usage of deprecated models (older Sonnet 3.5 and Opus 3 models)
-            3. Recommend using aliases for better maintainability
-            4. For testing examples, suggest claude-3-5-haiku-latest (fastest/cheapest)
+            Then check:
+            1. All model references are from the current public models list
+            2. Flag any deprecated models (older Sonnet 3.5, Opus 3 versions)
+            3. Flag any internal/non-public model names
+            4. Suggest using aliases ending in -latest for better maintainability
             
-            Format as actionable feedback.
+            Provide clear, actionable feedback on any issues found.
diff --git a/.github/workflows/claude-notebook-review.yml b/.github/workflows/claude-notebook-review.yml
@@ -34,10 +34,12 @@ jobs:
             Review the changes to Jupyter notebooks and Python scripts in this PR. Please check for:
 
             ## Model Usage
-            Check that all Claude model references use current, public models:
-            - claude-3-5-haiku-latest (recommended for testing)
-            - claude-3-5-sonnet-latest (for complex tasks)
-            - Avoid deprecated models like claude-3-haiku-20240307, old Sonnet 3.5 versions
+            Verify all Claude model references against the current list at:
+            https://docs.anthropic.com/en/docs/about-claude/models/overview.md
+            - Flag any deprecated models (older Sonnet 3.5, Opus 3 versions)
+            - Flag any internal/non-public model names
+            - Suggest current alternatives when issues are found
+            - Recommend aliases ending in -latest for stability
 
             ## Code Quality
             - Python code follows PEP 8 conventions
diff --git a/.github/workflows/notebook-quality.yml b/.github/workflows/notebook-quality.yml
@@ -44,10 +44,6 @@ jobs:
         run: |
           uv run python scripts/validate_notebooks.py
       
-      - name: Check model usage
-        run: |
-          uv run python scripts/check_models.py
-      
       # Only run API tests on main branch or for maintainers (costs money)
       - name: Execute notebooks (API Testing)
         if: |
diff --git a/.gitignore b/.gitignore
@@ -144,4 +144,9 @@ examples/fine-tuned_qa/local_cache/*
 test_outputs/
 .ruff_cache/
 lychee-report.md
-.lycheecache
+.lycheecache
+
+# Notebook validation
+.notebook_validation_state.json
+.notebook_validation_checkpoint.json
+validation_report_*.md
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
@@ -10,13 +10,6 @@ repos:
 
   - repo: local
     hooks:
-      - id: check-models
-        name: Check Claude model usage
-        entry: python scripts/check_models.py
-        language: python
-        files: '\.ipynb$'
-        pass_filenames: false
-      
       - id: validate-notebooks
         name: Validate notebook structure
         entry: python scripts/validate_notebooks.py
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -54,8 +54,9 @@ This repository uses automated tools to maintain code quality:
 
 ### The Notebook Validation Stack
 
-- **[papermill](https://papermill.readthedocs.io/)**: Parameterized notebook execution for testing
+- **[nbconvert](https://nbconvert.readthedocs.io/)**: Notebook execution for testing
 - **[ruff](https://docs.astral.sh/ruff/)**: Fast Python linter and formatter with native Jupyter support
+- **Claude AI Review**: Intelligent code review using Claude
 
 **Note**: Notebook outputs are intentionally kept in this repository as they demonstrate expected results for users.
 
@@ -67,26 +68,22 @@ This repository uses automated tools to maintain code quality:
    uv run ruff format skills/
    
    uv run python scripts/validate_notebooks.py
-   uv run python scripts/check_models.py
    ```
 
 3. **Test notebook execution** (optional, requires API key):
    ```bash
-   uv run papermill skills/classification/guide.ipynb test.ipynb \
-     -p model "claude-3-5-haiku-latest" \
-     -p test_mode true \
-     -p max_tokens 10
+   uv run jupyter nbconvert --to notebook \
+     --execute skills/classification/guide.ipynb \
+     --ExecutePreprocessor.kernel_name=python3 \
+     --output test_output.ipynb
    ```
 
 ### Pre-commit Hooks
 
 Pre-commit hooks will automatically run before each commit to ensure code quality:
 
-- Strip notebook outputs
 - Format code with ruff
 - Validate notebook structure
-- Check for hardcoded API keys
-- Validate Claude model usage
 
 If a hook fails, fix the issues and try committing again.
 
@@ -101,9 +98,9 @@ If a hook fails, fix the issues and try committing again.
    ```
 
 2. **Use current Claude models**:
-   - For examples: `claude-3-5-haiku-latest` (fast and cheap)
-   - For powerful tasks: `claude-opus-4-1`
-   - Check allowed models in `scripts/allowed_models.py`
+   - Use model aliases (e.g., `claude-3-5-haiku-latest`) for better maintainability
+   - Check current models at: https://docs.anthropic.com/en/docs/about-claude/models/overview
+   - Claude will automatically validate model usage in PR reviews
 
 3. **Keep notebooks focused**:
    - One concept per notebook
@@ -175,9 +172,6 @@ Run the validation suite:
 # Check all notebooks
 uv run python scripts/validate_notebooks.py
 
-# Check model usage
-uv run python scripts/check_models.py
-
 # Run pre-commit on all files
 uv run pre-commit run --all-files
 ```
@@ -187,11 +181,10 @@ uv run pre-commit run --all-files
 Our GitHub Actions workflows will automatically:
 
 - Validate notebook structure
-- Check for hardcoded secrets
 - Lint code with ruff
 - Test notebook execution (for maintainers)
 - Check links
-- Validate Claude model usage
+- Review code and model usage with Claude
 
 External contributors will have limited API testing to conserve resources.
 
diff --git a/scripts/allowed_models.py b/scripts/allowed_models.py
diff --git a/scripts/check_models.py b/scripts/check_models.py
diff --git a/scripts/validate_all_notebooks.py b/scripts/validate_all_notebooks.py