If you get `Server error '503 Service Unavailable'`, that's just Amazon's anti-scraping protections. You can learn about how to overcome those in our [Anti-scraping protections](../anti_scraping/index.md) course.
sources/academy/webscraping/scraping_basics_python/11_scraping_variants.md: 108 additions, 1 deletion
@@ -307,5 +307,112 @@ Is this the end? Maybe! In the next lesson, we'll use a scraping framework to bu
<Exercises />

-TODO
### Build a scraper for watching Python jobs

You're now able to build a scraper, aren't you? Let's build another one, then! Python's official website features a [job board](https://www.python.org/jobs/). Scrape the job postings that match the following criteria:

- Tagged as Database
- Not older than 60 days

For each job posting found, use [`pp()`](https://docs.python.org/3/library/pprint.html#pprint.pp) to print a dictionary containing the following data:

- Job title
- Company
- URL to the job posting
- Date of posting

Your program should print something like the following:

```text
{'title': 'Senior Full Stack Developer',
 'company': 'Baserow',
 'url': 'https://www.python.org/jobs/7705/',
 'posted_on': datetime.date(2024, 9, 16)}
{'title': 'Senior Python Engineer',
 'company': 'Active Prime',
 'url': 'https://www.python.org/jobs/7699/',
 'posted_on': datetime.date(2024, 9, 5)}
...
```

In Python's [`datetime`](https://docs.python.org/3/library/datetime.html) module you should find everything you need for manipulating time: `date.today()`, `datetime.fromisoformat()`, `datetime.date()`, `timedelta()`.
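
As a rough sketch of how those pieces fit together, the snippet below checks whether a posting date falls within the last 60 days. It assumes the date arrives as an ISO 8601 string. The helper name `is_recent` and the sample timestamps are made up for illustration and don't come from the job board.

```python
from datetime import date, datetime, timedelta


def is_recent(posted_iso: str, today: date, max_age_days: int = 60) -> bool:
    """Return True if the posting is at most `max_age_days` old (hypothetical helper)."""
    # Parse the ISO 8601 string, then drop the time part to get a plain date
    posted_on = datetime.fromisoformat(posted_iso).date()
    # Subtracting two dates gives a timedelta, which compares with another timedelta
    return today - posted_on <= timedelta(days=max_age_days)


# Made-up timestamps; in a real scraper, pass date.today() as `today`
today = date(2024, 10, 1)
print(is_recent("2024-09-16T12:00:00+00:00", today))
print(is_recent("2024-07-01T08:30:00+00:00", today))
```

With the fixed `today` above, the first call returns `True` (the posting is 15 days old) and the second returns `False` (92 days old). Passing `today` as an argument instead of calling `date.today()` inside the function keeps the check easy to test.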

<details>
<summary>Solution</summary>

After inspecting how the job board works, we notice that job postings tagged as Database have their own URL. We'll use it as the starting point, which saves us from having to scrape and check the tags.
### Find the shortest CNN article which made it to the Sports homepage

Scrape the [CNN Sports](https://edition.cnn.com/sport) homepage. For each linked article, calculate its length in characters:

- Locate the element that holds the main content of the article.
- Use [`get_text()`](https://beautiful-soup-4.readthedocs.io/en/latest/index.html#get-text) to get all its content as plain text.
- Use `len()` to calculate the length.

Skip pages without text, e.g. those which contain only a video. Sort the results and print the URL of the shortest article which made it to the homepage.

At the time of writing this exercise, the shortest article which made it to the CNN Sports homepage is [one about a donation to the Augusta National Golf Club](https://edition.cnn.com/2024/10/03/sport/masters-donation-hurricane-helene-relief-spt-intl/). It's just 1,642 characters long.
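
The skip-and-sort step of this exercise can be sketched without any network access. The snippet assumes the text of each article has already been extracted (for example with `get_text()`) and stored in a dictionary keyed by URL. The function name `shortest_article` and the `example.com` URLs are hypothetical, and the sample texts are made up.

```python
from typing import Optional


def shortest_article(texts_by_url: dict[str, str]) -> Optional[tuple[str, int]]:
    """Return (url, length) of the shortest non-empty text, or None if there is none."""
    lengths = []
    for url, text in texts_by_url.items():
        # Skip pages without text, such as video-only pages
        if not text.strip():
            continue
        lengths.append((len(text), url))
    if not lengths:
        return None
    # Tuples sort by their first item, so the shortest article comes first
    length, url = sorted(lengths)[0]
    return url, length


# Made-up sample data standing in for scraped article texts
sample = {
    "https://example.com/long": "A long article body. " * 10,
    "https://example.com/video": "   \n",
    "https://example.com/short": "Brief.",
}
print(shortest_article(sample))
```

For the sample data, the function returns `('https://example.com/short', 6)`. Collecting `(length, url)` tuples keeps the sorting step down to a plain `sorted()` call. In a real scraper the dictionary would be filled while iterating over the article links found on the homepage.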