Enable Playwright by default for all scrape sources #101

Merged
mbuckingham74 merged 7 commits into main from fix/playwright-default
Dec 2, 2025
Conversation

@mbuckingham74
Owner

Summary

  • Adds migration 019 to set use_playwright=True for all existing sources and change the column default
  • Most modern job sites render their listings with JavaScript; without Playwright, the scraper sees only the initial HTML and misses dynamically loaded job listings
  • Sources configured without Playwright accounted for approximately 50% of scraping failures

Changes

  • backend/alembic/versions/019_enable_playwright_by_default.py - Migration to enable Playwright globally
  • CLAUDE_STATUS.md - Document new default behavior
  • backend/app/templates/admin/scraper_guide.html - Clarify Playwright is enabled by default

Test plan

  • Run migration on production: migration will update all existing sources
  • Verify new sources default to use_playwright=True
  • Test scraping on a previously-failing source (e.g., Copper River Native Association)

🤖 Generated with Claude Code

Migration 019 sets use_playwright=True for all existing sources and
changes the column default so new sources also use Playwright.

Most modern job sites use JavaScript rendering, and without Playwright
the scraper only gets the initial HTML before JS executes, missing
dynamically loaded job listings. This was causing ~50% of scraping
failures.

Changes:
- Add migration 019_enable_playwright_by_default.py
- Update CLAUDE_STATUS.md with new default behavior
- Update scraper guide to clarify Playwright is enabled by default

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you:

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +34 to +37
op.alter_column(
'scrape_sources',
'use_playwright',
server_default=sa.text('1'), # MySQL uses 1 for True

P1: Align ORM default with Playwright migration

The migration sets a server default of TRUE for scrape_sources.use_playwright, but the SQLAlchemy model ScrapeSource.use_playwright still declares default=False (backend/app/models/scrape_source.py). Because the admin create/import paths instantiate ScrapeSource without setting this field, SQLAlchemy will send False on insert, overriding the new server default. New sources will therefore continue to have Playwright disabled despite the intent to enable it by default. Update the model (or omit the Python default) so inserts inherit the new default.
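
The override described above can be reproduced in miniature. The following is only a sketch: it uses Python's built-in `sqlite3` as a stand-in for MySQL and for the ORM, and the `name` column is an assumption added for illustration. It shows that a column `DEFAULT` applies only when the column is left out of the `INSERT`; a value sent explicitly (which is what a Python-side `default=False` amounts to) always wins.

```python
import sqlite3

# In-memory stand-in for the real table; only use_playwright mirrors the PR.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE scrape_sources ("
    " id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL,"                          # hypothetical column
    " use_playwright INTEGER NOT NULL DEFAULT 1)"   # server default: True
)

# Column omitted: the database fills in its own default.
conn.execute("INSERT INTO scrape_sources (name) VALUES (?)", ("omitted",))

# Column sent explicitly as False, as an ORM-side default=False would do.
conn.execute(
    "INSERT INTO scrape_sources (name, use_playwright) VALUES (?, ?)",
    ("explicit", False),
)

rows = dict(conn.execute("SELECT name, use_playwright FROM scrape_sources"))
print(rows)
```

Only the row that omits the column inherits the server default, which is why the review asks for the model default to be changed or dropped.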

mbuckingham74 and others added 6 commits December 1, 2025 21:37
Previous commit only set DB default but ORM default was still False,
and runner.py hardcoded True ignoring the database setting entirely.

Fixes:
- Change ORM default from False to True in scrape_source.py
- Runner now reads source.use_playwright (with True fallback for NULL)
- Update/add tests to verify the default behavior

This ensures:
1. New sources created via admin/CSV have use_playwright=True
2. The admin toggle can actually disable Playwright for rare httpx-only cases
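
A minimal sketch of the runner-side fallback described in this commit, assuming the source object exposes a nullable `use_playwright` attribute. The function name `resolve_use_playwright` is hypothetical and is not taken from runner.py.

```python
from types import SimpleNamespace


def resolve_use_playwright(source) -> bool:
    """Return the source's setting, treating NULL (None) as True."""
    value = getattr(source, "use_playwright", None)
    return True if value is None else bool(value)


# Stand-ins for ScrapeSource rows; the real model is not reproduced here.
legacy_row = SimpleNamespace(use_playwright=None)    # NULL in the database
httpx_only = SimpleNamespace(use_playwright=False)   # admin disabled it
default_row = SimpleNamespace(use_playwright=True)

print(resolve_use_playwright(legacy_row))
print(resolve_use_playwright(httpx_only))
print(resolve_use_playwright(default_row))
```

Checking `is None` rather than relying on truthiness matters here: `source.use_playwright or True` would turn an explicit False back into True and make the admin toggle ineffective again.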

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Prevents the Configure Source form from silently resetting
use_playwright to False on every save. The checkbox is checked
by default for new sources and preserves the existing value for
existing sources.
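
One way the checkbox behavior could be expressed, as a sketch only: the helper names and the `use_playwright` form field name are assumptions, and the actual template and handler are not shown in this PR. It relies on the fact that browsers omit unchecked checkboxes from submitted form data, so the field must be rendered on the form for its absence to mean "unchecked".

```python
from types import SimpleNamespace
from typing import Mapping, Optional


def checkbox_checked(source: Optional[object]) -> bool:
    """Initial checkbox state when rendering the Configure Source form."""
    if source is None:
        return True  # new source: checked by default
    value = getattr(source, "use_playwright", None)
    return True if value is None else bool(value)  # keep the stored value


def parse_use_playwright(form: Mapping[str, str]) -> bool:
    """Unchecked checkboxes are not submitted, so presence means checked."""
    return "use_playwright" in form


existing = SimpleNamespace(use_playwright=False)
print(checkbox_checked(None), checkbox_checked(existing))
print(parse_use_playwright({"name": "Example", "use_playwright": "on"}))
print(parse_use_playwright({"name": "Example"}))
```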

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Display "Using Playwright (headless browser)" or "Using httpx (direct HTTP)"
in the scrape modal while the scrape is running, so admins can confirm
which fetch method is being used without checking logs.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Defensive coding to handle case where the playwright text
element might not be found in the DOM.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Previously, a scrape was marked "Failed" if there were ANY errors,
even if jobs were successfully found. This was too strict.

Changes:
- last_scrape_success is now True if jobs were found OR no errors
- Auto-enable now triggers when jobs are found (ignores warnings)

This fixes sources staying in "Needs Configuration" and showing
"Failed" status even when they successfully scraped jobs.
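
The relaxed rule can be sketched as two small predicates. The function names and the shape of the inputs (a job count and a list of error strings) are assumptions for illustration; only the logic "jobs were found OR no errors" comes from this commit.

```python
from typing import Sequence


def scrape_succeeded(jobs_found: int, errors: Sequence[str]) -> bool:
    """New rule: success if any jobs were found, or if nothing went wrong."""
    return jobs_found > 0 or not errors


def should_auto_enable(jobs_found: int) -> bool:
    """Auto-enable depends only on finding jobs; warnings are ignored."""
    return jobs_found > 0


print(scrape_succeeded(12, ["timeout on page 3"]))  # jobs found despite an error
print(scrape_succeeded(0, []))                      # no jobs, but no errors
print(scrape_succeeded(0, ["selector not found"]))  # no jobs and an error
```

Under the previous rule the first case would have been marked "Failed" even though jobs were scraped.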

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Resolve conflicts by taking main's simpler Playwright indicator implementation.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@mbuckingham74 mbuckingham74 merged commit ad95d9b into main Dec 2, 2025
@mbuckingham74 mbuckingham74 deleted the fix/playwright-default branch December 2, 2025 03:33