{% extends "base.html" %}
{% block title %}Scraper Guide - Admin - Far Reach Jobs{% endblock %}
{% block content %}
<div class="mb-6 flex justify-between items-center">
<div>
<h2 class="text-2xl font-bold text-gray-900 dark:text-white">Scraper Guide</h2>
<p class="text-gray-600 dark:text-gray-300">How to configure scrapers for different job board platforms</p>
</div>
<a href="/admin" class="px-4 py-2 bg-gray-500 hover:bg-gray-600 text-white font-medium rounded-md transition-colors">
Back to Dashboard
</a>
</div>
<!-- Quick Reference -->
<div class="bg-white dark:bg-gray-800 rounded-lg shadow-sm border border-gray-200 dark:border-gray-700 p-6 mb-8">
<h3 class="text-lg font-semibold text-gray-900 dark:text-white mb-4">Quick Reference: Which Scraper to Use?</h3>
<div class="overflow-x-auto">
<table class="w-full text-sm">
<thead>
<tr class="border-b border-gray-200 dark:border-gray-700">
<th class="text-left py-2 px-3 text-gray-700 dark:text-gray-300">URL Contains</th>
<th class="text-left py-2 px-3 text-gray-700 dark:text-gray-300">Platform</th>
<th class="text-left py-2 px-3 text-gray-700 dark:text-gray-300">Scraper</th>
<th class="text-left py-2 px-3 text-gray-700 dark:text-gray-300">Detection</th>
</tr>
</thead>
<tbody class="divide-y divide-gray-100 dark:divide-gray-700">
<tr>
<td class="py-2 px-3 font-mono text-xs text-gray-600 dark:text-gray-400">workforcenow.adp.com</td>
<td class="py-2 px-3 text-gray-900 dark:text-white">ADP WorkforceNow</td>
<td class="py-2 px-3"><span class="px-2 py-1 bg-blue-100 dark:bg-blue-900 text-blue-800 dark:text-blue-200 rounded text-xs">Auto (API)</span></td>
<td class="py-2 px-3 text-green-600 dark:text-green-400">Automatic</td>
</tr>
<tr>
<td class="py-2 px-3 font-mono text-xs text-gray-600 dark:text-gray-400">myworkdayjobs.com</td>
<td class="py-2 px-3 text-gray-900 dark:text-white">Workday</td>
<td class="py-2 px-3"><span class="px-2 py-1 bg-blue-100 dark:bg-blue-900 text-blue-800 dark:text-blue-200 rounded text-xs">Auto (API)</span></td>
<td class="py-2 px-3 text-green-600 dark:text-green-400">Automatic</td>
</tr>
<tr>
<td class="py-2 px-3 font-mono text-xs text-gray-600 dark:text-gray-400">ultipro.com</td>
<td class="py-2 px-3 text-gray-900 dark:text-white">UltiPro (Legacy)</td>
<td class="py-2 px-3"><span class="px-2 py-1 bg-blue-100 dark:bg-blue-900 text-blue-800 dark:text-blue-200 rounded text-xs">Auto (API)</span></td>
<td class="py-2 px-3 text-green-600 dark:text-green-400">Automatic</td>
</tr>
<tr>
<td class="py-2 px-3 font-mono text-xs text-gray-600 dark:text-gray-400">rec.pro.ukg.net</td>
<td class="py-2 px-3 text-gray-900 dark:text-white">UKG Pro Recruiting</td>
<td class="py-2 px-3"><span class="px-2 py-1 bg-blue-100 dark:bg-blue-900 text-blue-800 dark:text-blue-200 rounded text-xs">Auto (API)</span></td>
<td class="py-2 px-3 text-green-600 dark:text-green-400">Automatic</td>
</tr>
<tr>
<td class="py-2 px-3 font-mono text-xs text-gray-600 dark:text-gray-400">jobs.dayforcehcm.com</td>
<td class="py-2 px-3 text-gray-900 dark:text-white">Dayforce HCM</td>
<td class="py-2 px-3"><span class="px-2 py-1 bg-orange-100 dark:bg-orange-900 text-orange-800 dark:text-orange-200 rounded text-xs">DynamicScraper</span></td>
<td class="py-2 px-3 text-amber-600 dark:text-amber-400">Custom code</td>
</tr>
<tr>
<td class="py-2 px-3 font-mono text-xs text-gray-600 dark:text-gray-400">Other HTML pages</td>
<td class="py-2 px-3 text-gray-900 dark:text-white">Standard websites</td>
<td class="py-2 px-3"><span class="px-2 py-1 bg-purple-100 dark:bg-purple-900 text-purple-800 dark:text-purple-200 rounded text-xs">GenericScraper</span></td>
<td class="py-2 px-3 text-amber-600 dark:text-amber-400">Configure CSS</td>
</tr>
<tr>
<td class="py-2 px-3 font-mono text-xs text-gray-600 dark:text-gray-400">sitemap.xml with job URLs</td>
<td class="py-2 px-3 text-gray-900 dark:text-white">JS-heavy sites with sitemaps</td>
<td class="py-2 px-3"><span class="px-2 py-1 bg-teal-100 dark:bg-teal-900 text-teal-800 dark:text-teal-200 rounded text-xs">SitemapScraper</span></td>
<td class="py-2 px-3 text-amber-600 dark:text-amber-400">Configure XML</td>
</tr>
<tr>
<td class="py-2 px-3 font-mono text-xs text-gray-600 dark:text-gray-400">Complex JS pages</td>
<td class="py-2 px-3 text-gray-900 dark:text-white">Oracle EBS, custom portals</td>
<td class="py-2 px-3"><span class="px-2 py-1 bg-orange-100 dark:bg-orange-900 text-orange-800 dark:text-orange-200 rounded text-xs">DynamicScraper</span></td>
<td class="py-2 px-3 text-amber-600 dark:text-amber-400">Custom code</td>
</tr>
</tbody>
</table>
</div>
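<div class="mt-3 p-3 bg-gray-50 dark:bg-gray-900 rounded">
<p class="text-xs text-gray-500 dark:text-gray-400 mb-2">A minimal sketch of how the "URL Contains" column could drive auto-detection. The names <code>PLATFORM_RULES</code> and <code>detect_platform</code> are hypothetical and not taken from the application code, and the sketch matches on the hostname rather than the whole URL, which is an assumption.</p>
<pre class="text-xs text-gray-700 dark:text-gray-300 font-mono overflow-x-auto whitespace-pre-wrap">
```python
from urllib.parse import urlparse

# Hypothetical lookup table mirroring the "URL Contains" column above.
PLATFORM_RULES = [
    ("workforcenow.adp.com", "ADP WorkforceNow"),
    ("myworkdayjobs.com", "Workday"),
    ("ultipro.com", "UltiPro (Legacy)"),
    ("rec.pro.ukg.net", "UKG Pro Recruiting"),
    ("jobs.dayforcehcm.com", "Dayforce HCM"),
]


def detect_platform(listing_url):
    """Return the platform name for a listing URL, or None if no rule matches."""
    host = (urlparse(listing_url).hostname or "").lower()
    for domain, platform in PLATFORM_RULES:
        # Match the domain itself or any subdomain of it.
        if host == domain or host.endswith("." + domain):
            return platform
    return None
```
</pre>
<p class="text-xs text-gray-500 dark:text-gray-400 mt-2">A URL that matches no rule returns <code>None</code>, which corresponds to the GenericScraper, SitemapScraper, or DynamicScraper rows that need manual configuration.</p>
</div>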
</div>
<!-- Bulk Import Section -->
<div class="bg-white dark:bg-gray-800 rounded-lg shadow-sm border border-gray-200 dark:border-gray-700 p-6 mb-8">
<h3 class="text-lg font-semibold text-gray-900 dark:text-white mb-2">Bulk Import Sources</h3>
<p class="text-gray-600 dark:text-gray-400 mb-4">If you have a list of job sources to add, you can import them all at once using a CSV file instead of adding them one by one.</p>
<div class="border-l-4 border-purple-500 pl-4 mb-4">
<h4 class="font-semibold text-gray-900 dark:text-white mb-2">CSV Format</h4>
<div class="mt-2 p-3 bg-gray-50 dark:bg-gray-900 rounded">
<pre class="text-xs text-gray-700 dark:text-gray-300 font-mono overflow-x-auto">Source Name,Base URL,Jobs URL
City of Bethel,https://www.cityofbethel.net,https://www.cityofbethel.net/jobs
NANA Regional,https://nana.com,https://nana.com/careers</pre>
</div>
<div class="text-sm text-gray-600 dark:text-gray-400 mt-3 space-y-1">
<p><strong>Required columns:</strong> Source Name, Base URL</p>
<p><strong>Optional column:</strong> Jobs URL (defaults to Base URL if omitted)</p>
<p><strong>Column names are flexible:</strong> "Name", "Organization", "URL", "Website" also work</p>
</div>
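<div class="mt-3 p-3 bg-gray-50 dark:bg-gray-900 rounded">
<p class="text-xs text-gray-500 dark:text-gray-400 mb-2">The column rules above could be handled roughly as follows. This is a sketch, not the importer's actual code: the alias tuples and the function <code>parse_sources_csv</code> are hypothetical, and the mapping of "Name"/"Organization" to the source name and "URL"/"Website" to the base URL is an assumption.</p>
<pre class="text-xs text-gray-700 dark:text-gray-300 font-mono overflow-x-auto whitespace-pre-wrap">
```python
import csv
import io

# Hypothetical alias tables; the guide only names these example headers.
NAME_HEADERS = ("source name", "name", "organization")
BASE_URL_HEADERS = ("base url", "url", "website")
JOBS_URL_HEADERS = ("jobs url",)


def _pick(row, headers):
    """Return the first non-empty value whose header is in `headers`."""
    for header in headers:
        value = (row.get(header) or "").strip()
        if value:
            return value
    return ""


def parse_sources_csv(text):
    """Parse CSV text into source dicts, skipping rows missing a required column."""
    sources = []
    for raw in csv.DictReader(io.StringIO(text)):
        # Normalize header case and whitespace so "Base URL" and "base url" match.
        row = {(k or "").strip().lower(): v for k, v in raw.items()}
        name = _pick(row, NAME_HEADERS)
        base_url = _pick(row, BASE_URL_HEADERS)
        if not name or not base_url:
            continue
        # Jobs URL is optional and falls back to the base URL.
        jobs_url = _pick(row, JOBS_URL_HEADERS) or base_url
        sources.append(dict(name=name, base_url=base_url, jobs_url=jobs_url))
    return sources
```
</pre>
</div>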
</div>
<div class="border-l-4 border-amber-500 pl-4">
<h4 class="font-semibold text-gray-900 dark:text-white mb-2">Workflow</h4>
<ol class="list-decimal list-inside text-sm text-gray-600 dark:text-gray-400 space-y-1">
<li>Go to Admin Dashboard → Bulk Import from CSV</li>
<li>Upload your CSV file (max 1 MB)</li>
<li>Review results: sources added, duplicates skipped, errors</li>
<li>Go to <a href="/admin/sources/disabled" class="text-primary-600 dark:text-primary-400 hover:underline">Disabled Sources</a> (imported sources start disabled)</li>
<li>Configure each source with CSS selectors</li>
<li>Enable sources and test with the "Scrape" button</li>
</ol>
</div>
<div class="mt-4 p-4 bg-amber-50 dark:bg-amber-900/30 rounded-lg">
<p class="text-sm text-amber-800 dark:text-amber-200">
<strong>Note:</strong> Imported sources are created as <strong>disabled</strong> and won't be scraped until you configure their CSS selectors and enable them. This prevents errors from unconfigured sources.
</p>
</div>
</div>
<!-- API Scrapers Section -->
<div class="bg-white dark:bg-gray-800 rounded-lg shadow-sm border border-gray-200 dark:border-gray-700 p-6 mb-8">
<h3 class="text-lg font-semibold text-gray-900 dark:text-white mb-2">API Scrapers (Auto-Detected)</h3>
<p class="text-gray-600 dark:text-gray-400 mb-6">These scrapers are automatically used when the listing URL matches a known platform. They fetch jobs from the platform's JSON API instead of parsing HTML.</p>
<!-- ADP WorkforceNow -->
<div class="border-l-4 border-blue-500 pl-4 mb-6">
<h4 class="font-semibold text-gray-900 dark:text-white mb-2">ADP WorkforceNow</h4>
<div class="text-sm text-gray-600 dark:text-gray-400 space-y-2">
<p><strong>URL Pattern:</strong> <code class="bg-gray-100 dark:bg-gray-700 px-1 rounded">https://workforcenow.adp.com/mascsr/default/mdf/recruitment/recruitment.html?cid=...&ccId=...</code></p>
<p><strong>How it works:</strong> Extracts <code>cid</code> (Company ID) and <code>ccId</code> (Career Center ID) from the URL, then fetches jobs from the ADP API.</p>
<p><strong>Fields extracted:</strong> Title, location, state, job type, organization</p>
<p><strong>Setup:</strong> Just paste the careers page URL as the listing URL. No configuration needed.</p>
</div>
<div class="mt-3 p-3 bg-gray-50 dark:bg-gray-900 rounded text-xs">
<p class="text-gray-500 dark:text-gray-400 mb-1">Example URL:</p>
<code class="text-gray-700 dark:text-gray-300 break-all">https://workforcenow.adp.com/mascsr/default/mdf/recruitment/recruitment.html?cid=abc12345&ccId=19000101_000001</code>
</div>
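<div class="mt-3 p-3 bg-gray-50 dark:bg-gray-900 rounded text-xs">
<p class="text-gray-500 dark:text-gray-400 mb-2">The ID extraction step might look like the sketch below. The function name <code>extract_adp_ids</code> is hypothetical, and the ADP API request that follows is not shown.</p>
<pre class="text-gray-700 dark:text-gray-300 font-mono overflow-x-auto whitespace-pre-wrap">
```python
from urllib.parse import parse_qs, urlparse


def extract_adp_ids(listing_url):
    """Return (cid, cc_id) from an ADP WorkforceNow careers URL, or None."""
    query = parse_qs(urlparse(listing_url).query)
    # parse_qs maps each parameter to a list of values; take the first one.
    cid = query.get("cid", [None])[0]
    cc_id = query.get("ccId", [None])[0]
    if not cid or not cc_id:
        return None
    return cid, cc_id
```
</pre>
<p class="text-gray-500 dark:text-gray-400 mt-2">For the example URL above, this returns <code>abc12345</code> as the Company ID and <code>19000101_000001</code> as the Career Center ID.</p>
</div>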
</div>
<!-- Workday -->
<div class="border-l-4 border-green-500 pl-4 mb-6">
<h4 class="font-semibold text-gray-900 dark:text-white mb-2">Workday</h4>
<div class="text-sm text-gray-600 dark:text-gray-400 space-y-2">
<p><strong>URL Pattern:</strong> <code class="bg-gray-100 dark:bg-gray-700 px-1 rounded">https://{tenant}.wd1.myworkdayjobs.com/{site}</code> or <code class="bg-gray-100 dark:bg-gray-700 px-1 rounded">wd5.myworkdayjobs.com</code></p>
<p><strong>How it works:</strong> Extracts tenant and site from the URL, then fetches jobs from Workday's JSON API with pagination.</p>
<p><strong>Fields extracted:</strong> Title, location, state, organization, job ID</p>
<p><strong>Setup:</strong> Paste the careers page URL. Supports optional <code>hiringCompany</code> filter in URL.</p>
</div>
<div class="mt-3 p-3 bg-gray-50 dark:bg-gray-900 rounded text-xs">
<p class="text-gray-500 dark:text-gray-400 mb-1">Example URLs:</p>
<code class="text-gray-700 dark:text-gray-300 block break-all mb-1">https://nana.wd1.myworkdayjobs.com/NANACareers</code>
<code class="text-gray-700 dark:text-gray-300 block break-all">https://ahtna.wd5.myworkdayjobs.com/Ahtna</code>
</div>
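<div class="mt-3 p-3 bg-gray-50 dark:bg-gray-900 rounded text-xs">
<p class="text-gray-500 dark:text-gray-400 mb-2">A sketch of the tenant and site extraction, assuming the tenant is the first hostname label and the site is the first path segment, as in the URL pattern above. The function name <code>extract_workday_parts</code> is hypothetical; pagination and the <code>hiringCompany</code> filter are not covered.</p>
<pre class="text-gray-700 dark:text-gray-300 font-mono overflow-x-auto whitespace-pre-wrap">
```python
from urllib.parse import urlparse


def extract_workday_parts(listing_url):
    """Return a dict with tenant and site for a Workday careers URL, or None."""
    parsed = urlparse(listing_url)
    host = (parsed.hostname or "").lower()
    if not host.endswith(".myworkdayjobs.com"):
        return None
    # Assumption: the tenant is the first label of the hostname.
    tenant = host.split(".")[0]
    segments = [part for part in parsed.path.split("/") if part]
    if not segments:
        return None
    # Assumption: the site is the first path segment.
    return dict(tenant=tenant, site=segments[0])
```
</pre>
</div>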
</div>
<!-- UltiPro / UKG -->
<div class="border-l-4 border-purple-500 pl-4 mb-6">
<h4 class="font-semibold text-gray-900 dark:text-white mb-2">UltiPro / UKG Pro Recruiting</h4>
<div class="text-sm text-gray-600 dark:text-gray-400 space-y-2">
<p><strong>URL Patterns:</strong></p>
<ul class="list-disc list-inside ml-2">
<li>Legacy: <code class="bg-gray-100 dark:bg-gray-700 px-1 rounded">https://recruiting2.ultipro.com/{tenant}/JobBoard/{board-id}/</code></li>
<li>UKG Pro: <code class="bg-gray-100 dark:bg-gray-700 px-1 rounded">https://{tenant}.rec.pro.ukg.net/{tenant}/JobBoard/{board-id}/</code></li>
</ul>
<p><strong>How it works:</strong> Extracts tenant and board ID, then fetches jobs from the UltiPro/UKG API with pagination.</p>
<p><strong>Fields extracted:</strong> Title, location, state, job type, organization, description</p>
<p><strong>Setup:</strong> Paste the job board URL. Works with both legacy UltiPro and new UKG Pro domains.</p>
</div>
<div class="mt-3 p-3 bg-gray-50 dark:bg-gray-900 rounded text-xs">
<p class="text-gray-500 dark:text-gray-400 mb-1">Example URLs:</p>
<code class="text-gray-700 dark:text-gray-300 block break-all mb-1">https://recruiting2.ultipro.com/BRI1234/JobBoard/abc-def-123/</code>
<code class="text-gray-700 dark:text-gray-300 block break-all">https://ukg.rec.pro.ukg.net/ukg/JobBoard/abc-def-123/</code>
</div>
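<div class="mt-3 p-3 bg-gray-50 dark:bg-gray-900 rounded text-xs">
<p class="text-gray-500 dark:text-gray-400 mb-2">Both URL patterns share the same path layout, so one expression can cover them. The sketch below assumes the tenant and board ID are the path segments on either side of <code>JobBoard</code>; <code>extract_ultipro_parts</code> is a hypothetical name.</p>
<pre class="text-gray-700 dark:text-gray-300 font-mono overflow-x-auto whitespace-pre-wrap">
```python
import re
from urllib.parse import urlparse

# Path layout shared by both patterns: /tenant/JobBoard/board-id/
JOB_BOARD_PATH = re.compile(r"^/([^/]+)/JobBoard/([^/]+)", re.IGNORECASE)


def extract_ultipro_parts(listing_url):
    """Return host, tenant and board ID for an UltiPro or UKG Pro URL, or None."""
    parsed = urlparse(listing_url)
    host = (parsed.hostname or "").lower()
    if not (host.endswith(".ultipro.com") or host.endswith(".rec.pro.ukg.net")):
        return None
    match = JOB_BOARD_PATH.match(parsed.path)
    if match is None:
        return None
    return dict(host=host, tenant=match.group(1), board_id=match.group(2))
```
</pre>
<p class="text-gray-500 dark:text-gray-400 mt-2">The host is returned alongside the IDs because the legacy and UKG Pro domains differ, and an API request would presumably need to go to the same domain as the job board.</p>
</div>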
</div>
<div class="p-4 bg-blue-50 dark:bg-blue-900/30 rounded-lg">
<p class="text-sm text-blue-800 dark:text-blue-200">
<strong>Note:</strong> API scrapers bypass robots.txt checks because they use the platform's public API, not HTML crawling. This is the intended behavior.
</p>
</div>
</div>
<!-- Dayforce HCM Section -->
<div class="bg-white dark:bg-gray-800 rounded-lg shadow-sm border border-gray-200 dark:border-gray-700 p-6 mb-8">
<h3 class="text-lg font-semibold text-gray-900 dark:text-white mb-2">Dayforce HCM (Custom DynamicScraper)</h3>
<p class="text-gray-600 dark:text-gray-400 mb-6">Dayforce is a React/Next.js job board that loads job data via JavaScript after the initial page render. It requires a custom DynamicScraper with Playwright.</p>
<div class="border-l-4 border-orange-500 pl-4 mb-6">
<h4 class="font-semibold text-gray-900 dark:text-white mb-2">When to Use</h4>
<div class="text-sm text-gray-600 dark:text-gray-400 space-y-2">
<ul class="list-disc list-inside">
<li>URL contains <code class="bg-gray-100 dark:bg-gray-700 px-1 rounded">jobs.dayforcehcm.com</code></li>
<li>Common for Alaska Native corporations and holding companies</li>
<li>Often uses "ALLJOBSROLLUP" URLs to aggregate jobs from multiple subsidiaries</li>
</ul>
</div>
</div>
<div class="border-l-4 border-orange-500 pl-4 mb-6">
<h4 class="font-semibold text-gray-900 dark:text-white mb-2">How It Works</h4>
<div class="text-sm text-gray-600 dark:text-gray-400 space-y-2">
<ol class="list-decimal list-inside space-y-1">
<li>Playwright loads the React SPA and waits for JavaScript to execute</li>
<li>Job links contain identifiable keywords (<code>candidateapply</code>, <code>job</code>, etc.)</li>
<li>The scraper uses fallback link detection to find job links in the rendered HTML</li>
<li>Each link's text becomes the job title, and the href becomes the job URL</li>
</ol>
</div>
</div>
<div class="border-l-4 border-orange-500 pl-4 mb-6">
<h4 class="font-semibold text-gray-900 dark:text-white mb-2">Setup Steps</h4>
<div class="text-sm text-gray-600 dark:text-gray-400 space-y-2">
<ol class="list-decimal list-inside space-y-1">
<li>Add the source with the Dayforce jobs URL</li>
<li>Set <code class="bg-gray-100 dark:bg-gray-700 px-1 rounded">scraper_class</code> to <code class="bg-gray-100 dark:bg-gray-700 px-1 rounded">DynamicScraper</code></li>
<li>Copy the custom scraper code template below into <code class="bg-gray-100 dark:bg-gray-700 px-1 rounded">custom_scraper_code</code></li>
<li>Update the class name, source_name, base_url, and listing URL</li>
<li>Enable <code class="bg-gray-100 dark:bg-gray-700 px-1 rounded">skip_robots_check</code> (Dayforce has a restrictive robots.txt)</li>
<li>Test with the "Scrape" button</li>
</ol>
</div>
</div>
<div class="border-l-4 border-orange-500 pl-4 mb-6">
<h4 class="font-semibold text-gray-900 dark:text-white mb-2">Custom Scraper Template</h4>
<div class="mt-3 p-3 bg-gray-50 dark:bg-gray-900 rounded text-xs overflow-x-auto">
<pre class="text-gray-700 dark:text-gray-300 font-mono whitespace-pre-wrap">class DayforceScraper(BaseScraper):
    @property
    def source_name(self) -> str:
        return "Your Organization Name"

    @property
    def base_url(self) -> str:
        return "https://www.example.com"

    def get_job_listing_urls(self) -> list[str]:
        return ["https://jobs.dayforcehcm.com/en-US/yourcompany/JOBS"]

    def parse_job_listing_page(self, soup: BeautifulSoup, url: str) -> list[ScrapedJob]:
        jobs = []

        # Try standard Dayforce selectors first
        job_containers = soup.select("[data-automation-id='jobPostingItem'], .job-tile, .job-card, [class*='JobCard'], [class*='job-item']")
        for container in job_containers:
            title_elem = container.select_one("h3, h4, [class*='title'], [class*='Title'], a")
            link_elem = container.select_one("a[href*='job'], a[href*='candidateapply']")
            if title_elem and link_elem:
                title = title_elem.get_text(strip=True)
                job_url = link_elem.get("href", "")
                if job_url and not job_url.startswith("http"):
                    job_url = f"https://jobs.dayforcehcm.com{job_url}"
                if title and job_url:
                    jobs.append(ScrapedJob(
                        title=title,
                        url=job_url,
                        organization=self.source_name
                    ))

        # Fallback: look for links with job-related keywords
        if not jobs:
            all_links = soup.select("a[href]")
            seen_urls = set()
            for link in all_links:
                href = link.get("href", "")
                text = link.get_text(strip=True)
                if any(keyword in href.lower() for keyword in ["job", "career", "position", "posting", "candidateapply"]):
                    if text and len(text) > 5 and href not in seen_urls:
                        seen_urls.add(href)
                        full_url = href
                        if not href.startswith("http"):
                            full_url = f"https://jobs.dayforcehcm.com{href}"
                        jobs.append(ScrapedJob(
                            title=text,
                            url=full_url,
                            organization=self.source_name
                        ))

        return jobs</pre>
</div>
</div>
<div class="mt-3 p-3 bg-gray-50 dark:bg-gray-900 rounded text-xs mb-6">
<p class="text-gray-500 dark:text-gray-400 mb-1">Example working sources:</p>
<div class="text-gray-700 dark:text-gray-300 font-mono space-y-1">
<p><strong>Sealaska:</strong> <code class="break-all">https://jobs.dayforcehcm.com/en-US/sealaska/CANDIDATEPORTAL</code></p>
<p><strong>BBNC:</strong> <code class="break-all">https://jobs.dayforcehcm.com/en-US/brs/BBNCALLJOBSROLLUP</code></p>
</div>
</div>
<div class="p-4 bg-orange-50 dark:bg-orange-900/30 rounded-lg">
<p class="text-sm text-orange-800 dark:text-orange-200">
<strong>Why fallback link detection?</strong> Dayforce loads job data via API after React renders. Standard CSS selectors often fail because the HTML structure changes. The fallback approach finds any link with job-related keywords in the href, which is more reliable across different Dayforce configurations.
</p>
</div>
</div>
<!-- SitemapScraper Section -->
<div class="bg-white dark:bg-gray-800 rounded-lg shadow-sm border border-gray-200 dark:border-gray-700 p-6 mb-8">
<h3 class="text-lg font-semibold text-gray-900 dark:text-white mb-2">SitemapScraper (XML Sitemap Parsing)</h3>
<p class="text-gray-600 dark:text-gray-400 mb-6">Extracts jobs from XML sitemaps by parsing job data from URL structure. Ideal for JavaScript-heavy sites where direct scraping fails.</p>
<div class="border-l-4 border-teal-500 pl-4 mb-6">
<h4 class="font-semibold text-gray-900 dark:text-white mb-2">When to Use</h4>
<div class="text-sm text-gray-600 dark:text-gray-400 space-y-2">
<ul class="list-disc list-inside">
<li>Site is JavaScript-heavy (Vue, React, Angular) and returns 404 for direct URL access</li>
<li>Site has an XML sitemap at <code class="bg-gray-100 dark:bg-gray-700 px-1 rounded">/sitemap.xml</code> or similar</li>
<li>Job URLs contain structured data (e.g., <code class="bg-gray-100 dark:bg-gray-700 px-1 rounded">/kotzebue-ak/customer-service-agent/</code>)</li>
<li>No public API available (not ADP, Workday, etc.)</li>
</ul>
</div>
</div>
<div class="border-l-4 border-teal-500 pl-4 mb-6">
<h4 class="font-semibold text-gray-900 dark:text-white mb-2">How It Works</h4>
<div class="text-sm text-gray-600 dark:text-gray-400 space-y-2">
<ol class="list-decimal list-inside space-y-1">
<li>Fetches and parses the XML sitemap</li>
<li>Filters URLs by pattern (if configured)</li>
<li>Extracts job data from URL structure:
<ul class="list-disc list-inside ml-4 mt-1">
<li><strong>Location:</strong> <code class="bg-gray-100 dark:bg-gray-700 px-1 rounded">/kotzebue-ak/</code> → "Kotzebue, AK"</li>
<li><strong>Title:</strong> <code class="bg-gray-100 dark:bg-gray-700 px-1 rounded">/customer-service-agent/</code> → "Customer Service Agent"</li>
<li><strong>External ID:</strong> UUID/hex segment from URL path</li>
</ul>
</li>
<li>Handles sitemap indexes recursively (fetches child sitemaps)</li>
</ol>
</div>
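<div class="mt-3 p-3 bg-gray-50 dark:bg-gray-900 rounded text-xs">
<p class="text-gray-500 dark:text-gray-400 mb-2">Step 3 could be approximated as below. This is a sketch under stated assumptions, not the SitemapScraper's actual logic: <code>STATE_CODES</code> is a hypothetical partial list, a location slug is assumed to end in a two-letter state code, and an external ID is assumed to be a segment of at least eight hex digits or dashes that contains a digit.</p>
<pre class="text-gray-700 dark:text-gray-300 font-mono overflow-x-auto whitespace-pre-wrap">
```python
import re
from urllib.parse import urlparse

# Hypothetical, partial set of state codes used to recognize location slugs.
STATE_CODES = frozenset(["AK", "WA", "OR", "CA", "HI"])
LOCATION_SLUG = re.compile(r"^([a-z-]+)-([a-z][a-z])$", re.IGNORECASE)
# Assumption: an external ID is a segment of 8+ hex digits and dashes.
ID_SEGMENT = re.compile(r"^[0-9a-f-]{8,}$", re.IGNORECASE)


def slug_to_words(slug):
    """Turn "customer-service-agent" into "Customer Service Agent"."""
    return " ".join(word.capitalize() for word in slug.split("-") if word)


def parse_job_url(url):
    """Guess location, state, title and external ID from a job URL path."""
    job = dict(location=None, state=None, title=None, external_id=None)
    for segment in urlparse(url).path.split("/"):
        if not segment:
            continue
        location = LOCATION_SLUG.match(segment)
        if ID_SEGMENT.match(segment) and any(c.isdigit() for c in segment):
            job["external_id"] = segment
        elif location and location.group(2).upper() in STATE_CODES:
            state = location.group(2).upper()
            job["location"] = slug_to_words(location.group(1)) + ", " + state
            job["state"] = state
        elif job["title"] is None:
            # The first remaining segment is treated as the title slug.
            job["title"] = slug_to_words(segment)
    return job
```
</pre>
<p class="text-gray-500 dark:text-gray-400 mt-2">Checking the trailing code against a known list keeps a title slug that happens to end in two letters from being read as a location.</p>
</div>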
</div>
<div class="border-l-4 border-teal-500 pl-4 mb-6">
<h4 class="font-semibold text-gray-900 dark:text-white mb-2">Configuration</h4>
<div class="text-sm text-gray-600 dark:text-gray-400 space-y-2">
<p><strong>Sitemap URL (required):</strong> URL of the XML sitemap file</p>
<p><strong>URL Filter Pattern:</strong> Regex to filter URLs (e.g., <code class="bg-gray-100 dark:bg-gray-700 px-1 rounded">-ak/</code> for Alaska jobs only)</p>
<p><strong>Organization:</strong> Organization name for all jobs (defaults to source name)</p>
<p><strong>Default State:</strong> Fallback state code (e.g., "AK")</p>
</div>
<div class="mt-3 p-3 bg-gray-50 dark:bg-gray-900 rounded text-xs">
<p class="text-gray-500 dark:text-gray-400 mb-2">Example: Alaska Airlines</p>
<div class="text-gray-700 dark:text-gray-300 font-mono space-y-1">
<p>Sitemap URL: <code>https://careers.alaskaair.com/sitemaps/jobs_1.xml</code></p>
<p>URL Pattern: <code>-ak/</code></p>
<p>Organization: <code>Alaska Airlines</code></p>
<p>Default State: <code>AK</code></p>
</div>
</div>
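<div class="mt-3 p-3 bg-gray-50 dark:bg-gray-900 rounded text-xs">
<p class="text-gray-500 dark:text-gray-400 mb-2">How the Sitemap URL and URL Filter Pattern settings might be applied, sketched with the standard library. The function names are hypothetical, fetching is left out, and the input is assumed to follow the sitemaps.org namespace.</p>
<pre class="text-gray-700 dark:text-gray-300 font-mono overflow-x-auto whitespace-pre-wrap">
```python
import re
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def is_sitemap_index(sitemap_xml):
    """Return True when the document lists child sitemaps instead of pages."""
    return ET.fromstring(sitemap_xml).tag == SITEMAP_NS + "sitemapindex"


def extract_job_urls(sitemap_xml, url_pattern=None):
    """Return the loc URLs in a sitemap that match the optional regex."""
    root = ET.fromstring(sitemap_xml)
    pattern = re.compile(url_pattern) if url_pattern else None
    urls = []
    for loc in root.iter(SITEMAP_NS + "loc"):
        url = (loc.text or "").strip()
        # re.search lets a short pattern such as "-ak/" match anywhere in the URL.
        if url and (pattern is None or pattern.search(url)):
            urls.append(url)
    return urls
```
</pre>
<p class="text-gray-500 dark:text-gray-400 mt-2">For a sitemap index, the same <code>loc</code> extraction yields the child sitemap URLs, each of which would then be fetched and parsed in turn.</p>
</div>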
</div>
<div class="border-l-4 border-teal-500 pl-4 mb-6">
<h4 class="font-semibold text-gray-900 dark:text-white mb-2">Finding the Sitemap</h4>
<div class="text-sm text-gray-600 dark:text-gray-400 space-y-2">
<ol class="list-decimal list-inside space-y-1">
<li>Check <code class="bg-gray-100 dark:bg-gray-700 px-1 rounded">/robots.txt</code> for <code>Sitemap:</code> entries</li>
<li>Try common paths: <code class="bg-gray-100 dark:bg-gray-700 px-1 rounded">/sitemap.xml</code>, <code class="bg-gray-100 dark:bg-gray-700 px-1 rounded">/sitemaps/sitemap.xml</code></li>
<li>Look for job-specific sitemaps: <code class="bg-gray-100 dark:bg-gray-700 px-1 rounded">/sitemaps/jobs.xml</code></li>
<li>Use browser DevTools Network tab while browsing the site</li>
</ol>
</div>
</div>
<div class="p-4 bg-teal-50 dark:bg-teal-900/30 rounded-lg">
<p class="text-sm text-teal-800 dark:text-teal-200">
<strong>Benefits:</strong> SitemapScraper is fast (~1 second), doesn't need Playwright, and works reliably on JS-heavy sites. It bypasses the need to render JavaScript entirely.
</p>
</div>
</div>
<!-- HTML Scrapers Section -->
<div class="bg-white dark:bg-gray-800 rounded-lg shadow-sm border border-gray-200 dark:border-gray-700 p-6 mb-8">
<h3 class="text-lg font-semibold text-gray-900 dark:text-white mb-2">HTML Scrapers (Configurable)</h3>
<p class="text-gray-600 dark:text-gray-400 mb-6">These scrapers parse HTML pages using CSS selectors. Use them for standard websites that don't match the API platforms above.</p>
<!-- GenericScraper -->
<div class="border-l-4 border-purple-500 pl-4 mb-6">
<h4 class="font-semibold text-gray-900 dark:text-white mb-2">GenericScraper</h4>
<div class="text-sm text-gray-600 dark:text-gray-400 space-y-2">
<p><strong>Best for:</strong> Standard HTML job listing pages with consistent structure</p>
<p><strong>How it works:</strong> Uses CSS selectors you configure to find job containers and extract fields like title, URL, location, etc.</p>
<p><strong>Supports:</strong> Pagination (next page links), multiple listing URLs, Playwright for JS rendering</p>
</div>
<div class="mt-4">
<p class="font-medium text-gray-900 dark:text-white mb-2">Setup Steps:</p>
<ol class="list-decimal list-inside text-sm text-gray-600 dark:text-gray-400 space-y-1">
<li>Add the source with the careers page URL</li>
<li>Click "Configure" on the source</li>
<li>Use browser DevTools to inspect the page structure</li>
<li>Enter CSS selectors for job container, title, and URL (required)</li>
<li>Optionally add selectors for location, organization, salary, etc.</li>
<li>Click "Analyze Page with AI" for automatic selector suggestions</li>
<li>Save and test with the "Scrape" button</li>
</ol>
</div>
<div class="mt-3 p-3 bg-gray-50 dark:bg-gray-900 rounded">
<p class="text-xs text-gray-500 dark:text-gray-400 mb-2">Example configuration (City of Kotzebue):</p>
<div class="text-xs font-mono text-gray-700 dark:text-gray-300 space-y-1">
<p>Container: <code>tbody tr</code></p>
<p>Title: <code>td.views-field-title .tablesaw-cell-content a</code></p>
<p>URL: <code>td.views-field-title .tablesaw-cell-content a</code></p>
</div>
</div>
</div>
<!-- DynamicScraper -->
<div class="border-l-4 border-orange-500 pl-4 mb-6">
<h4 class="font-semibold text-gray-900 dark:text-white mb-2">DynamicScraper (Custom Code)</h4>
<div class="text-sm text-gray-600 dark:text-gray-400 space-y-2">
<p><strong>Best for:</strong> Complex JavaScript-heavy sites that require interactions (clicks, dropdowns) or have unusual structure</p>
<p><strong>How it works:</strong> Runs custom Python code stored in the database. The code can use Playwright features like clicking buttons and selecting dropdowns.</p>
<p><strong>When to use:</strong></p>
<ul class="list-disc list-inside ml-2">
<li>Site requires clicking a "Search" button to load results</li>
<li>Site uses JavaScript to render jobs (Oracle E-Business Suite, etc.)</li>
<li>Jobs are loaded via AJAX after page load</li>
<li>GenericScraper can't extract the data correctly</li>
</ul>
</div>
<div class="mt-4">
<p class="font-medium text-gray-900 dark:text-white mb-2">Setup Steps:</p>
<ol class="list-decimal list-inside text-sm text-gray-600 dark:text-gray-400 space-y-1">
<li>Add the source with the careers page URL</li>
<li>Click "Configure" and then "Analyze Page with AI"</li>
<li>If AI suggests "May Need Custom Scraper", click "Generate Custom Scraper"</li>
<li>Review and adjust the generated Python code</li>
<li>Set scraper_class to "DynamicScraper" in the database</li>
<li>Enable "use_playwright" and "skip_robots_check" if needed</li>
</ol>
</div>
<div class="mt-3 p-3 bg-gray-50 dark:bg-gray-900 rounded">
<p class="text-xs text-gray-500 dark:text-gray-400 mb-2">Example: Tanana Chiefs Conference (Oracle EBS)</p>
<div class="text-xs text-gray-700 dark:text-gray-300 space-y-1">
<p>Requires: Selecting "All Open Reqs" dropdown, clicking Search button</p>
<p>Flags: <code>use_playwright=True</code>, <code>skip_robots_check=True</code></p>
<p>Custom code handles dropdown selection, button click, and location data cleaning (removes Oracle-specific suffixes like "USVacancy Locations")</p>
</div>
</div>
</div>
</div>
<!-- Data Cleaning Section -->
<div class="bg-white dark:bg-gray-800 rounded-lg shadow-sm border border-gray-200 dark:border-gray-700 p-6 mb-8">
<h3 class="text-lg font-semibold text-gray-900 dark:text-white mb-2">Data Cleaning</h3>
<p class="text-gray-600 dark:text-gray-400 mb-4">Some job platforms include extra metadata in scraped fields that needs to be cleaned before display.</p>
<div class="space-y-4">
<div class="border-l-4 border-yellow-500 pl-4">
<h4 class="font-semibold text-gray-900 dark:text-white mb-2">Location Cleaning</h4>
<div class="text-sm text-gray-600 dark:text-gray-400 space-y-2">
<p><strong>Problem:</strong> Some platforms (especially Oracle EBS) include extra text in location fields like "Fairbanks, AK, USVacancy Locations" instead of just "Fairbanks, AK".</p>
<p><strong>Solution:</strong> Custom scrapers should include a <code class="bg-gray-100 dark:bg-gray-700 px-1 rounded">_clean_location()</code> method to strip these suffixes.</p>
</div>
<div class="mt-3 p-3 bg-gray-50 dark:bg-gray-900 rounded">
<p class="text-xs text-gray-500 dark:text-gray-400 mb-2">Common patterns to clean:</p>
<ul class="text-xs text-gray-700 dark:text-gray-300 list-disc list-inside space-y-1">
<li><code>, USVacancy Locations</code> - Oracle EBS artifact</li>
<li><code>, Vacancy Locations</code> - Oracle EBS variant</li>
<li><code>, US</code> - Trailing country code (when state is already present)</li>
</ul>
</div>
<div class="mt-3 p-3 bg-gray-50 dark:bg-gray-900 rounded">
<p class="text-xs text-gray-500 dark:text-gray-400 mb-2">Example cleaning code (Python):</p>
<pre class="text-xs text-gray-700 dark:text-gray-300 font-mono overflow-x-auto whitespace-pre-wrap">def _clean_location(self, location):
    if not location:
        return None
    import re
    cleaned = location.strip()
    patterns = [
        r",?\s*USVacancy Locations\s*$",
        r",?\s*Vacancy Locations\s*$",
        # Require the comma so names ending in "us" (e.g. "Columbus") stay intact
        r",\s*US\s*$",
    ]
    for pattern in patterns:
        cleaned = re.sub(pattern, "", cleaned, flags=re.IGNORECASE).strip()
    return cleaned if cleaned else None</pre>
</div>
</div>
<div class="border-l-4 border-blue-500 pl-4">
<h4 class="font-semibold text-gray-900 dark:text-white mb-2">Job Type Normalization</h4>
<div class="text-sm text-gray-600 dark:text-gray-400 space-y-2">
<p><strong>Problem:</strong> Job type fields may contain inconsistent formats like "80 Full time" or "full-time" or "FULL TIME".</p>
<p><strong>Solution:</strong> The Job model's <code class="bg-gray-100 dark:bg-gray-700 px-1 rounded">display_job_type</code> property automatically normalizes common patterns to "Full-Time" or "Part-Time".</p>
</div>
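<div class="mt-3 p-3 bg-gray-50 dark:bg-gray-900 rounded">
<p class="text-xs text-gray-500 dark:text-gray-400 mb-2">The kind of normalization described above could be sketched as follows. This is an illustration only; <code>normalize_job_type</code> is a hypothetical function and does not reproduce the actual <code>display_job_type</code> implementation.</p>
<pre class="text-xs text-gray-700 dark:text-gray-300 font-mono overflow-x-auto whitespace-pre-wrap">
```python
import re


def normalize_job_type(raw):
    """Map variants such as "80 Full time" or "FULL TIME" to one display label."""
    if not raw:
        return None
    # Collapse case, hyphens, underscores and repeated spaces before matching.
    text = re.sub(r"[\s_-]+", " ", raw).strip().lower()
    if "full time" in text or "fulltime" in text:
        return "Full-Time"
    if "part time" in text or "parttime" in text:
        return "Part-Time"
    # Unrecognized values pass through with surrounding whitespace removed.
    return raw.strip()
```
</pre>
</div>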
</div>
<div class="border-l-4 border-green-500 pl-4">
<h4 class="font-semibold text-gray-900 dark:text-white mb-2">State Abbreviation</h4>
<div class="text-sm text-gray-600 dark:text-gray-400 space-y-2">
<p><strong>Problem:</strong> Locations may include full state names like "Bristol Bay Region, Alaska" instead of "Bristol Bay Region, AK".</p>
<p><strong>Solution:</strong> The Job model's <code class="bg-gray-100 dark:bg-gray-700 px-1 rounded">display_location</code> property automatically normalizes full state names to abbreviations when a <code>default_state</code> is configured.</p>
</div>
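<div class="mt-3 p-3 bg-gray-50 dark:bg-gray-900 rounded">
<p class="text-xs text-gray-500 dark:text-gray-400 mb-2">One possible reading of this rule, sketched below: only the state matching <code>default_state</code> is abbreviated, and only when its full name is the last comma-separated part of the location. The lookup table and <code>abbreviate_state</code> are hypothetical and may differ from what <code>display_location</code> actually does.</p>
<pre class="text-xs text-gray-700 dark:text-gray-300 font-mono overflow-x-auto whitespace-pre-wrap">
```python
import re

# Hypothetical, partial lookup table; a real one would list every state.
STATE_ABBREVIATIONS = {"Alaska": "AK", "Washington": "WA", "Hawaii": "HI"}


def abbreviate_state(location, default_state=None):
    """Replace a trailing full state name with its abbreviation."""
    if not location or not default_state:
        return location
    code = default_state.upper()
    for name, abbreviation in STATE_ABBREVIATIONS.items():
        if abbreviation == code:
            # Only rewrite the name when it is the final comma-separated part.
            pattern = r",\s*" + re.escape(name) + r"\s*$"
            return re.sub(pattern, ", " + code, location, flags=re.IGNORECASE)
    return location
```
</pre>
</div>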
</div>
</div>
</div>
<!-- Playwright Features -->
<div class="bg-white dark:bg-gray-800 rounded-lg shadow-sm border border-gray-200 dark:border-gray-700 p-6 mb-8">
<h3 class="text-lg font-semibold text-gray-900 dark:text-white mb-2">Playwright Features</h3>
<p class="text-gray-600 dark:text-gray-400 mb-4">Playwright is a browser automation library that drives a headless browser to render JavaScript. <strong>It's enabled by default for all sources</strong> so that JavaScript-rendered job listings are fully loaded before parsing. DynamicScrapers can also use these additional interactive features:</p>
<div class="grid md:grid-cols-2 gap-4">
<div class="p-4 bg-gray-50 dark:bg-gray-900 rounded">
<h4 class="font-medium text-gray-900 dark:text-white mb-2">wait_for</h4>
<p class="text-sm text-gray-600 dark:text-gray-400 mb-2">Wait for a CSS selector to appear before extracting HTML.</p>
<code class="text-xs text-gray-700 dark:text-gray-300">wait_for='table.job-list'</code>
</div>
<div class="p-4 bg-gray-50 dark:bg-gray-900 rounded">
<h4 class="font-medium text-gray-900 dark:text-white mb-2">click_selector</h4>
<p class="text-sm text-gray-600 dark:text-gray-400 mb-2">Click a button or link after page loads.</p>
<code class="text-xs text-gray-700 dark:text-gray-300">click_selector='button#search'</code>
</div>
<div class="p-4 bg-gray-50 dark:bg-gray-900 rounded">
<h4 class="font-medium text-gray-900 dark:text-white mb-2">select_actions</h4>
<p class="text-sm text-gray-600 dark:text-gray-400 mb-2">Select values from dropdowns before other actions.</p>
<code class="text-xs text-gray-700 dark:text-gray-300">select_actions=[{"selector": "select#filter", "value": {"label": "All Jobs"}}]</code>
</div>
<div class="p-4 bg-gray-50 dark:bg-gray-900 rounded">
<h4 class="font-medium text-gray-900 dark:text-white mb-2">click_wait_for</h4>
<p class="text-sm text-gray-600 dark:text-gray-400 mb-2">Wait for a selector after clicking.</p>
<code class="text-xs text-gray-700 dark:text-gray-300">click_wait_for='table tr td a'</code>
</div>
</div>
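<div class="mt-4 text-sm text-gray-600 dark:text-gray-400 space-y-2">
<p>As a rough sketch, the four options above could map onto Playwright page calls like this. The function name <code>apply_interactions</code> and the exact ordering are assumptions inferred from the descriptions (selects first, then the click and its wait, then <code>wait_for</code> before extraction); the scraper's real sequence may differ. The page methods used (<code>select_option</code>, <code>click</code>, <code>wait_for_selector</code>, <code>content</code>) are part of Playwright's Python sync API.</p>
<pre class="bg-gray-100 dark:bg-gray-700 p-3 rounded text-xs overflow-x-auto"><code class="language-python">def apply_interactions(page, select_actions=None, click_selector=None,
                       click_wait_for=None, wait_for=None):
    """Run the configured interactions on a Playwright page, then return its HTML.

    Hypothetical helper: the ordering below is an assumption.
    """
    for action in select_actions or []:
        # A value of {"label": "All Jobs"} becomes select_option(selector, label="All Jobs")
        page.select_option(action["selector"], **action["value"])
    if click_selector:
        page.click(click_selector)
        if click_wait_for:
            # Wait for the content that the click is expected to reveal
            page.wait_for_selector(click_wait_for)
    if wait_for:
        # Final wait before the HTML is extracted
        page.wait_for_selector(wait_for)
    return page.content()
</code></pre>
<p>With Playwright's sync API, <code>page</code> would typically come from <code>sync_playwright()</code>, <code>chromium.launch()</code>, <code>browser.new_page()</code>, and a call to <code>page.goto(url)</code> before this helper runs.</p>
</div>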
</div>
<!-- Special Flags -->
<div class="bg-white dark:bg-gray-800 rounded-lg shadow-sm border border-gray-200 dark:border-gray-700 p-6 mb-8">
<h3 class="text-lg font-semibold text-gray-900 dark:text-white mb-2">Special Flags</h3>
<div class="space-y-4">
<div class="flex items-start gap-4 p-4 bg-gray-50 dark:bg-gray-900 rounded">
<div class="flex-shrink-0">
<span class="px-2 py-1 bg-amber-100 dark:bg-amber-900 text-amber-800 dark:text-amber-200 rounded text-xs font-medium">skip_robots_check</span>
</div>
<div>
<p class="text-sm text-gray-600 dark:text-gray-400">Bypass robots.txt restrictions. Use for public job boards with an overly restrictive robots.txt (e.g., a blanket <code>Disallow: /</code>). Enable only for sites that are clearly intended to be public.</p>
</div>
</div>
<div class="flex items-start gap-4 p-4 bg-gray-50 dark:bg-gray-900 rounded">
<div class="flex-shrink-0">
<span class="px-2 py-1 bg-blue-100 dark:bg-blue-900 text-blue-800 dark:text-blue-200 rounded text-xs font-medium">use_playwright</span>
</div>
<div>
<p class="text-sm text-gray-600 dark:text-gray-400"><strong>Enabled by default.</strong> All new sources use Playwright browser rendering automatically, so JavaScript-rendered content loads properly. Disable it only in the rare cases where plain httpx fetching is specifically needed.</p>
</div>
</div>
<div class="flex items-start gap-4 p-4 bg-gray-50 dark:bg-gray-900 rounded">
<div class="flex-shrink-0">
<span class="px-2 py-1 bg-green-100 dark:bg-green-900 text-green-800 dark:text-green-200 rounded text-xs font-medium">default_location</span>
</div>
<div>
<p class="text-sm text-gray-600 dark:text-gray-400">Fallback location when the scraper can't extract one. Useful for single-location employers (e.g., "City of Bethel" jobs are always in Bethel).</p>
</div>
</div>
<div class="flex items-start gap-4 p-4 bg-gray-50 dark:bg-gray-900 rounded">
<div class="flex-shrink-0">
<span class="px-2 py-1 bg-purple-100 dark:bg-purple-900 text-purple-800 dark:text-purple-200 rounded text-xs font-medium">default_state</span>
</div>
<div>
<p class="text-sm text-gray-600 dark:text-gray-400">Default state code for all jobs from this source. Use "AK" for Alaska-only job boards.</p>
</div>
</div>
</div>
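<div class="mt-4 text-sm text-gray-600 dark:text-gray-400 space-y-2">
<p>To illustrate what <code>skip_robots_check</code> changes, here is a minimal sketch using Python's standard <code>urllib.robotparser</code>. The function <code>may_scrape</code> is a hypothetical name, and treating the flag as "ignore robots.txt entirely" is an assumption about how the scraper behaves.</p>
<pre class="bg-gray-100 dark:bg-gray-700 p-3 rounded text-xs overflow-x-auto"><code class="language-python">from urllib.robotparser import RobotFileParser


def may_scrape(robots_txt, url, user_agent="*", skip_robots_check=False):
    """Return True if the URL may be fetched under the given robots.txt text."""
    if skip_robots_check:
        return True  # assumed behavior of the flag: ignore robots.txt entirely
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


blanket = "User-agent: *\nDisallow: /"
print(may_scrape(blanket, "https://example.com/jobs"))  # False
print(may_scrape(blanket, "https://example.com/jobs", skip_robots_check=True))  # True
</code></pre>
<p>A blanket <code>Disallow: /</code> blocks every path, which is why a public job board with that rule yields no jobs until the flag is enabled.</p>
</div>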
</div>
<!-- Troubleshooting -->
<div class="bg-white dark:bg-gray-800 rounded-lg shadow-sm border border-gray-200 dark:border-gray-700 p-6">
<h3 class="text-lg font-semibold text-gray-900 dark:text-white mb-4">Troubleshooting</h3>
<div class="space-y-4">
<div>
<h4 class="font-medium text-gray-900 dark:text-white mb-1">No jobs found</h4>
<ul class="text-sm text-gray-600 dark:text-gray-400 list-disc list-inside">
<li>Verify CSS selectors match actual page structure (use browser DevTools)</li>
<li>Check for robots.txt blocking in scrape history</li>
<li>Try "Analyze Page with AI" for selector suggestions</li>
<li>Playwright is enabled by default; if issues persist, check the Playwright service logs</li>
</ul>
</div>
<div>
<h4 class="font-medium text-gray-900 dark:text-white mb-1">Only 1 job found when page has many</h4>
<ul class="text-sm text-gray-600 dark:text-gray-400 list-disc list-inside">
<li>The container selector may be matching the header row instead of the job rows</li>
<li>Try more specific selectors like <code>td.job-title</code> instead of <code>.job-title</code></li>
</ul>
</div>
<div>
<h4 class="font-medium text-gray-900 dark:text-white mb-1">Duplicate job IDs (only 1 job saved)</h4>
<ul class="text-sm text-gray-600 dark:text-gray-400 list-disc list-inside">
<li>Job URLs may all be the same (common with JavaScript portals)</li>
<li>Use a DynamicScraper with custom <code>external_id</code> generation so each job gets a distinct ID</li>
</ul>
</div>
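<div class="text-sm text-gray-600 dark:text-gray-400 space-y-2">
<p>One way to generate a distinct ID when every job shares the same URL is to hash the job's own fields. This is a sketch under assumptions: <code>make_external_id</code> and its parameters are hypothetical, and which fields make a job unique depends on the portal.</p>
<pre class="bg-gray-100 dark:bg-gray-700 p-3 rounded text-xs overflow-x-auto"><code class="language-python">import hashlib


def make_external_id(title, location, posted=""):
    """Build a stable ID from job fields instead of the shared job URL."""
    # Normalize whitespace and case so cosmetic changes don't create a new ID.
    parts = [" ".join(str(p).split()).lower() for p in (title, location, posted)]
    digest = hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()
    return digest[:16]  # shortened for readability; use the full digest if preferred


print(make_external_id("Staff Nurse", "Bethel, AK"))
print(make_external_id("Staff  nurse", "Bethel, AK"))  # same ID as the line above
</code></pre>
<p>Because the ID depends only on the normalized fields, rescraping the same listing produces the same value, while two different listings on the same URL get different ones.</p>
</div>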
<div>
<h4 class="font-medium text-gray-900 dark:text-white mb-1">Blocked by robots.txt</h4>
<ul class="text-sm text-gray-600 dark:text-gray-400 list-disc list-inside">
<li>Check whether the site is an API platform (ADP, Workday, UltiPro); if so, use the correct URL format for that platform</li>
<li>For public job boards with restrictive robots.txt, enable <code>skip_robots_check</code></li>
</ul>
</div>
<div>
<h4 class="font-medium text-gray-900 dark:text-white mb-1">SSL certificate errors</h4>
<ul class="text-sm text-gray-600 dark:text-gray-400 list-disc list-inside">
<li>The scraper automatically retries without SSL verification</li>
<li>Check scrape logs for details</li>
</ul>
</div>
<div>
<h4 class="font-medium text-gray-900 dark:text-white mb-1">Location displays extra text (e.g., "Fairbanks, AK, USVacancy Locations")</h4>
<ul class="text-sm text-gray-600 dark:text-gray-400 list-disc list-inside">
<li>Oracle EBS and similar platforms include metadata in location fields</li>
<li>For DynamicScrapers: Add a <code>_clean_location()</code> method (see Data Cleaning section above)</li>
<li>For GenericScraper: Use a more specific CSS selector that targets only the location text</li>
</ul>
</div>
<div>
<h4 class="font-medium text-gray-900 dark:text-white mb-1">SitemapScraper: No jobs parsed from URLs</h4>
<ul class="text-sm text-gray-600 dark:text-gray-400 list-disc list-inside">
<li>The URL structure may not match the expected pattern (<code>city-state/title/</code>)</li>
<li>Check the error message for sample URLs that couldn't be parsed</li>
<li>Try adjusting the URL filter pattern or use a different scraper type</li>
</ul>
</div>
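<div class="text-sm text-gray-600 dark:text-gray-400 space-y-2">
<p>To check by hand whether a URL fits the <code>city-state/title/</code> shape, a small parser like the one below can help. It is only a sketch: <code>parse_job_url</code>, the regular expression, and the example URLs are assumptions, and the SitemapScraper's actual parsing rules may be stricter or looser.</p>
<pre class="bg-gray-100 dark:bg-gray-700 p-3 rounded text-xs overflow-x-auto"><code class="language-python">import re
from urllib.parse import urlparse

# Assumed shape: .../city-state/title/ such as /jobs/bethel-ak/staff-nurse/
JOB_PATH = re.compile(r"/([a-z-]+)-([a-z][a-z])/([a-z0-9-]+)/?$")


def parse_job_url(url):
    """Return city, state, and title parsed from the URL path, or None."""
    match = JOB_PATH.search(urlparse(url).path.lower())
    if match is None:
        return None
    city, state, title = match.groups()
    return {
        "city": city.replace("-", " ").title(),
        "state": state.upper(),
        "title": title.replace("-", " ").title(),
    }


print(parse_job_url("https://example.com/jobs/bethel-ak/staff-nurse/"))
print(parse_job_url("https://example.com/about/"))  # None: no city-state segment
</code></pre>
</div>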
<div>
<h4 class="font-medium text-gray-900 dark:text-white mb-1">SitemapScraper: No URLs found in sitemap</h4>
<ul class="text-sm text-gray-600 dark:text-gray-400 list-disc list-inside">
<li>Verify the sitemap URL is correct and accessible</li>
<li>Check whether it's a sitemap index; the scraper handles these automatically</li>
<li>Look in <code>/robots.txt</code> for alternate sitemap locations</li>
</ul>
</div>
<div>
<h4 class="font-medium text-gray-900 dark:text-white mb-1">SitemapScraper: All URLs filtered out</h4>
<ul class="text-sm text-gray-600 dark:text-gray-400 list-disc list-inside">
<li>Your URL filter pattern may be too restrictive</li>
<li>Try removing or broadening the pattern to see all jobs first</li>
<li>The pattern is a regular expression, so escape special characters like <code>.</code> and <code>?</code> when you mean them literally</li>
</ul>
</div>
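<div class="text-sm text-gray-600 dark:text-gray-400 space-y-2">
<p>The effect of an unescaped pattern can be seen in a short sketch using Python's <code>re</code> module. The helper <code>filter_urls</code> and the example URLs are hypothetical; the sketch assumes the filter keeps any URL in which the pattern is found.</p>
<pre class="bg-gray-100 dark:bg-gray-700 p-3 rounded text-xs overflow-x-auto"><code class="language-python">import re


def filter_urls(urls, pattern):
    """Keep only the URLs in which the regex pattern is found."""
    compiled = re.compile(pattern)
    return [url for url in urls if compiled.search(url)]


urls = [
    "https://example.com/jobs.php?id=1",
    "https://example.com/about",
]

# Unescaped: "." matches any character and "?" makes the preceding "p" optional,
# so the literal text "php?id=" is never matched and every URL is filtered out.
print(filter_urls(urls, "jobs.php?id="))

# Escaped: re.escape turns the text into a pattern that matches it literally.
print(filter_urls(urls, re.escape("jobs.php?id=")))
</code></pre>
<p>The first call returns an empty list and the second returns only the job URL, which is the typical signature of an unescaped <code>?</code> in a filter pattern.</p>
</div>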
</div>
</div>
{% endblock %}