
Commit ad95d9b

Enable Playwright by default for all scrape sources (#101)

* Enable Playwright by default for all scrape sources

  Migration 019 sets use_playwright=True for all existing sources and changes
  the column default so new sources also use Playwright. Most modern job sites
  use JavaScript rendering; without Playwright the scraper only gets the
  initial HTML before JS executes and misses dynamically loaded job listings.
  This was causing ~50% of scraping failures.

  Changes:
  - Add migration 019_enable_playwright_by_default.py
  - Update CLAUDE_STATUS.md with the new default behavior
  - Update the scraper guide to clarify that Playwright is enabled by default

* Fix use_playwright to actually default to True

  The previous commit only set the DB default; the ORM default was still
  False, and runner.py hardcoded True, ignoring the database setting entirely.

  Fixes:
  - Change the ORM default from False to True in scrape_source.py
  - Runner now reads source.use_playwright (with a True fallback for NULL)
  - Update/add tests to verify the default behavior

  This ensures:
  1. New sources created via admin/CSV have use_playwright=True
  2. The admin toggle can actually disable Playwright for rare httpx-only cases

* Add use_playwright checkbox to the configure source form

  Prevents the Configure Source form from silently resetting use_playwright
  to False on every save. The checkbox is checked by default for new sources
  and preserves the existing value for existing sources.

* Show Playwright status in the scrape modal loading state

  Display "Using Playwright (headless browser)" or "Using httpx (direct HTTP)"
  in the scrape modal while the scrape is running, so admins can confirm which
  fetch method is being used without checking the logs.

* Add a null check for the Playwright text element in the scrape modal

  Defensive coding for the case where the Playwright text element is not
  found in the DOM.

* Fix scrape success/auto-enable logic to consider jobs found

  Previously, a scrape was marked "Failed" if there were ANY errors, even if
  jobs were successfully found. This was too strict.

  Changes:
  - last_scrape_success is now True if jobs were found OR there were no errors
  - Auto-enable now triggers when jobs are found (ignoring warnings)

  This fixes sources staying in "Needs Configuration" and showing a "Failed"
  status even when they successfully scraped jobs.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude <noreply@anthropic.com>
1 parent ccac480 commit ad95d9b

File tree

8 files changed: +123 additions, −16 deletions

CLAUDE_STATUS.md

Lines changed: 6 additions & 1 deletion
```diff
@@ -349,7 +349,12 @@ cd backend && pytest tests/ -v
 - `backend/scraper/playwright_fetcher.py` - Python client for Playwright service
 - `backend/scraper/sources/generic.py` - `_fetch_page()` with Playwright/httpx logic
 - `backend/scraper/runner.py` - Always enables Playwright for all scrapers
-- `backend/alembic/versions/006_add_use_playwright.py` - Migration (legacy)
+- `backend/alembic/versions/019_enable_playwright_by_default.py` - Sets `use_playwright=True` for all sources
+
+**Database Default:**
+- `use_playwright` column defaults to `True` for new sources (migration 019)
+- All existing sources were updated to `use_playwright=True`
+- The toggle exists in admin for rare cases where httpx-only is needed
 
 **Interactive Page Features (Playwright):**
 - `selectActions` - Array of `{selector, value}` for dropdown selection before page extraction
```
backend/alembic/versions/019_enable_playwright_by_default.py

Lines changed: 52 additions & 0 deletions (new file)

```python
"""Enable Playwright by default for all sources

This migration:
1. Sets use_playwright=True for all existing sources that have it False or NULL
2. Changes the column default to True for new sources

Playwright is required for most modern job sites that use JavaScript rendering.
Without it, the scraper only gets the initial HTML before JS executes, missing
dynamically loaded job listings.

Revision ID: 019
Revises: 018
Create Date: 2025-12-01

"""
from alembic import op
import sqlalchemy as sa


# revision identifiers, used by Alembic.
revision = '019'
down_revision = '018'
branch_labels = None
depends_on = None


def upgrade() -> None:
    # Enable Playwright for all existing sources
    op.execute(
        "UPDATE scrape_sources SET use_playwright = TRUE WHERE use_playwright = FALSE OR use_playwright IS NULL"
    )

    # Change the column default to True for new sources
    op.alter_column(
        'scrape_sources',
        'use_playwright',
        server_default=sa.text('1'),  # MySQL uses 1 for True
        existing_type=sa.Boolean(),
        existing_nullable=True
    )


def downgrade() -> None:
    # Revert column default to False
    op.alter_column(
        'scrape_sources',
        'use_playwright',
        server_default=sa.text('0'),
        existing_type=sa.Boolean(),
        existing_nullable=True
    )
    # Note: We don't revert existing data as that could break working scrapers
```
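The migration's two effects (backfill existing rows, change the column default for new rows) can be sketched against an in-memory SQLite table. This is a stand-in for illustration only; the real migration targets MySQL through Alembic:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Column default mirrors server_default=sa.text('1') after migration 019
conn.execute(
    "CREATE TABLE scrape_sources (id INTEGER PRIMARY KEY, use_playwright BOOLEAN DEFAULT 1)"
)
# Existing rows with False or NULL, as before the backfill
conn.execute("INSERT INTO scrape_sources (id, use_playwright) VALUES (1, 0), (2, NULL)")
# upgrade() step 1: backfill every existing source to True
conn.execute(
    "UPDATE scrape_sources SET use_playwright = 1 "
    "WHERE use_playwright = 0 OR use_playwright IS NULL"
)
# upgrade() step 2 is the column default itself: rows inserted without the
# column now pick up True
conn.execute("INSERT INTO scrape_sources (id) VALUES (3)")
values = [row[0] for row in conn.execute(
    "SELECT use_playwright FROM scrape_sources ORDER BY id"
)]
print(values)  # [1, 1, 1]
```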

backend/app/models/scrape_source.py

Lines changed: 2 additions & 2 deletions
```diff
@@ -41,8 +41,8 @@ class ScrapeSource(Base):
     max_pages = Column(Integer, nullable=True, default=10)
 
     # Use Playwright (headless browser) instead of httpx for fetching
-    # Useful for sites with bot protection or JavaScript-rendered content
-    use_playwright = Column(Boolean, default=False)
+    # Enabled by default - most modern job sites use JavaScript rendering
+    use_playwright = Column(Boolean, default=True)
 
     # Default location to use when scraper doesn't extract location from page
     # e.g., "Bethel" for City of Bethel jobs, "Kotzebue" for City of Kotzebue
```

backend/app/routers/admin.py

Lines changed: 2 additions & 2 deletions
```diff
@@ -1074,9 +1074,9 @@ async def trigger_single_source_scrape(source_id: int, request: Request, db: Ses
 
     duration = time.time() - start_time
 
-    # Auto-enable source if it was in needs_configuration and scrape was successful
+    # Auto-enable source if it was in needs_configuration and jobs were found
     auto_enabled = False
-    if source.needs_configuration and result.jobs_found > 0 and not result.errors:
+    if source.needs_configuration and result.jobs_found > 0:
         source.is_active = True
         source.needs_configuration = False
         auto_enabled = True
```
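The relaxed auto-enable condition amounts to a small predicate. The helper name `should_auto_enable` is hypothetical; the route inlines this check:

```python
def should_auto_enable(needs_configuration: bool, jobs_found: int) -> bool:
    """After this change, entries in result.errors (e.g. per-page warnings)
    no longer block auto-enabling once any jobs were found."""
    return needs_configuration and jobs_found > 0


print(should_auto_enable(True, 5))   # True, even if the scrape also logged warnings
print(should_auto_enable(True, 0))   # False - no jobs found
print(should_auto_enable(False, 5))  # False - source already configured
```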

backend/app/templates/admin/configure_source.html

Lines changed: 11 additions & 0 deletions
```diff
@@ -243,6 +243,17 @@ <h3 class="text-lg font-semibold dark:text-white mb-2">Scraper Type</h3>
         <strong>Dynamic:</strong> Uses AI-generated custom scraper code. Only use if others don't work.
     </p>
 </div>
+<div class="mt-4">
+    <label class="flex items-center gap-3 cursor-pointer">
+        <input type="checkbox" id="use_playwright" name="use_playwright" value="1"
+            {% if source.use_playwright is not none %}{% if source.use_playwright %}checked{% endif %}{% else %}checked{% endif %}
+            class="w-4 h-4 text-blue-600 bg-gray-100 dark:bg-gray-700 border-gray-300 dark:border-gray-600 rounded focus:ring-blue-500 focus:ring-2">
+        <span class="text-sm font-medium text-gray-700 dark:text-gray-300">Use Playwright (Headless Browser)</span>
+    </label>
+    <p class="text-xs text-gray-500 dark:text-gray-400 mt-1 ml-7">
+        Enabled by default. Uses a real browser to render JavaScript-heavy pages. Disable only for simple static HTML sites.
+    </p>
+</div>
 </div>
 
 <!-- Sitemap Configuration (shown when SitemapScraper selected) -->
```

backend/app/templates/admin/scraper_guide.html

Lines changed: 4 additions & 4 deletions
```diff
@@ -507,7 +507,7 @@ <h4 class="font-semibold text-gray-900 dark:text-white mb-2">State Abbreviation<
 <!-- Playwright Features -->
 <div class="bg-white dark:bg-gray-800 rounded-lg shadow-sm border border-gray-200 dark:border-gray-700 p-6 mb-8">
     <h3 class="text-lg font-semibold text-gray-900 dark:text-white mb-2">Playwright Features</h3>
-    <p class="text-gray-600 dark:text-gray-400 mb-4">Playwright is a headless browser that renders JavaScript. It's used automatically for all scrapers but provides extra features for DynamicScrapers.</p>
+    <p class="text-gray-600 dark:text-gray-400 mb-4">Playwright is a headless browser that renders JavaScript. <strong>It's enabled by default for all sources</strong> to ensure JavaScript-rendered job listings are properly loaded. DynamicScrapers can also use these additional interactive features:</p>
 
     <div class="grid md:grid-cols-2 gap-4">
         <div class="p-4 bg-gray-50 dark:bg-gray-900 rounded">
@@ -551,7 +551,7 @@ <h3 class="text-lg font-semibold text-gray-900 dark:text-white mb-2">Special Fla
     <span class="px-2 py-1 bg-blue-100 dark:bg-blue-900 text-blue-800 dark:text-blue-200 rounded text-xs font-medium">use_playwright</span>
 </div>
 <div>
-    <p class="text-sm text-gray-600 dark:text-gray-400">Force Playwright browser rendering. Enabled by default for all scrapers, but can be explicitly set for DynamicScrapers that need it.</p>
+    <p class="text-sm text-gray-600 dark:text-gray-400"><strong>Enabled by default.</strong> All new sources use Playwright browser rendering automatically. This ensures JavaScript-rendered content is properly loaded. Only disable for rare cases where httpx-only is specifically needed.</p>
 </div>
 </div>
 <div class="flex items-start gap-4 p-4 bg-gray-50 dark:bg-gray-900 rounded">
@@ -581,10 +581,10 @@ <h3 class="text-lg font-semibold text-gray-900 dark:text-white mb-4">Troubleshoo
 <div>
     <h4 class="font-medium text-gray-900 dark:text-white mb-1">No jobs found</h4>
     <ul class="text-sm text-gray-600 dark:text-gray-400 list-disc list-inside">
-        <li>Check if the page requires JavaScript - enable Playwright</li>
-        <li>Verify CSS selectors match actual page structure</li>
+        <li>Verify CSS selectors match actual page structure (use browser DevTools)</li>
         <li>Check for robots.txt blocking in scrape history</li>
         <li>Try "Analyze Page with AI" for selector suggestions</li>
+        <li>Playwright is enabled by default - if issues persist, check Playwright service logs</li>
     </ul>
 </div>
 <div>
```

backend/scraper/runner.py

Lines changed: 16 additions & 6 deletions
```diff
@@ -272,8 +272,8 @@ def get_source_config(source: ScrapeSource) -> dict:
         "url_attribute": source.url_attribute,
         "selector_next_page": source.selector_next_page,
         "max_pages": source.max_pages,
-        # Always use Playwright - overhead is minimal vs failing on JS sites
-        "use_playwright": True,
+        # Use Playwright by default (True), but respect database setting for rare httpx-only cases
+        "use_playwright": source.use_playwright if source.use_playwright is not None else True,
         "default_location": source.default_location,
         "default_state": source.default_state,
         # SitemapScraper configuration
@@ -341,7 +341,9 @@ def _run_adp_scraper(
 
     source.last_scraped_at = datetime.now(timezone.utc)
 
-    source.last_scrape_success = len(all_errors) == 0
+    # Success if jobs were found, even with warnings
+    jobs_found = jobs_new + jobs_updated + jobs_unchanged
+    source.last_scrape_success = jobs_found > 0 or len(all_errors) == 0
     duration = time.time() - start_time
 
     logger.info(
@@ -419,7 +421,10 @@ def _run_ultipro_scraper(
         logger.exception(f"UltiPro scraper failed for {source.name} URL: {listing_url}")
 
     source.last_scraped_at = datetime.now(timezone.utc)
-    source.last_scrape_success = len(all_errors) == 0
+
+    # Success if jobs were found, even with warnings
+    jobs_found = jobs_new + jobs_updated + jobs_unchanged
+    source.last_scrape_success = jobs_found > 0 or len(all_errors) == 0
     duration = time.time() - start_time
 
     logger.info(
@@ -497,7 +502,10 @@ def _run_workday_scraper(
         logger.exception(f"Workday scraper failed for {source.name} URL: {listing_url}")
 
     source.last_scraped_at = datetime.now(timezone.utc)
-    source.last_scrape_success = len(all_errors) == 0
+
+    # Success if jobs were found, even with warnings
+    jobs_found = jobs_new + jobs_updated + jobs_unchanged
+    source.last_scrape_success = jobs_found > 0 or len(all_errors) == 0
     duration = time.time() - start_time
 
     logger.info(
@@ -678,7 +686,9 @@ def run_scraper(db: Session, source: ScrapeSource, trigger_type: str = "manual")
         all_errors.append(f"Scraper execution failed: {e}")
 
     # Update source's last_scrape_success status
-    source.last_scrape_success = len(all_errors) == 0
+    # Success if jobs were found, even with warnings
+    jobs_found = jobs_new + jobs_updated + jobs_unchanged
+    source.last_scrape_success = jobs_found > 0 or len(all_errors) == 0
 
     duration = time.time() - start_time
```
backend/tests/test_models.py

Lines changed: 30 additions & 1 deletion
```diff
@@ -521,11 +521,40 @@ def test_source_default_values(self, db):
 
         assert source.scraper_class == "GenericScraper"
         assert source.is_active is True
-        assert source.use_playwright is False
+        assert source.use_playwright is True  # Default to True for JS-rendered sites
         assert source.max_pages == 10
         assert source.url_attribute == "href"
         assert source.created_at is not None
 
+    def test_source_playwright_default_is_true(self, db):
+        """New sources should have use_playwright=True by default.
+
+        Most modern job sites use JavaScript rendering, so Playwright
+        should be enabled by default to avoid missing dynamically loaded content.
+        """
+        source = ScrapeSource(
+            name="Playwright Default Test",
+            base_url="https://example.com",
+        )
+        db.add(source)
+        db.commit()
+        db.refresh(source)
+
+        assert source.use_playwright is True, "New sources should default to use_playwright=True"
+
+    def test_source_playwright_can_be_disabled(self, db):
+        """Sources can explicitly disable Playwright for rare httpx-only cases."""
+        source = ScrapeSource(
+            name="No Playwright Source",
+            base_url="https://example.com",
+            use_playwright=False,
+        )
+        db.add(source)
+        db.commit()
+        db.refresh(source)
+
+        assert source.use_playwright is False, "Should be able to explicitly disable Playwright"
+
     def test_source_jobs_relationship(self, db):
         """ScrapeSource has jobs relationship."""
         source = ScrapeSource(
```