sources/academy/webscraping/scraping_basics_python/10_crawling.md (6 additions & 6 deletions)
@@ -192,7 +192,7 @@ In the next lesson, we'll scrape the product detail pages so that each product v
### Scrape calling codes of African countries
-This is a follow-up to an exercise from the previous lesson, so feel free to reuse code. Scrape links to Wikipedia pages of all African states and territories. Follow the links and for each country extract the calling code, which is in the info table. Print URL and the calling code for all the countries. Start with this URL:
+This is a follow-up to an exercise from the previous lesson, so feel free to reuse your code. Scrape links to Wikipedia pages for all African states and territories. Follow each link and extract the calling code from the info table. Print the URL and the calling code for each country. Start with this URL:
-Hint: Locating cells in tables is sometimes easier if you know how to [go up](https://beautiful-soup-4.readthedocs.io/en/latest/index.html#going-up) in the HTML element soup.
+Hint: Locating cells in tables is sometimes easier if you know how to [navigate up](https://beautiful-soup-4.readthedocs.io/en/latest/index.html#going-up) in the HTML element soup.
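The navigation the hint describes can be sketched with Beautiful Soup's `.parent` attribute. The snippet below uses a made-up table fragment that only mimics a Wikipedia infobox row; the real markup differs, so treat it as an illustration of going up in the soup, not as a working Wikipedia scraper.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment resembling an infobox row; the real
# Wikipedia markup is more complex.
html = """
<table class="infobox">
  <tr><th>Calling code</th><td>+20</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Find the header cell by its text, then go UP to the enclosing
# row, and read the sibling data cell from there.
header = soup.find("th", string="Calling code")
row = header.parent          # the <tr> element containing the <th>
code = row.find("td").text
print(code)                  # +20
```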
<details>
<summary>Solution</summary>
@@ -247,13 +247,13 @@ Hint: Locating cells in tables is sometimes easier if you know how to [go up](ht
### Scrape authors of F1 news articles
-This is a follow-up to an exercise from the previous lesson, so feel free to reuse code. Scrape links to Guardian's latest F1 news. Follow the link for each article and extract both the author's name and the article's title. Print the author's name and the title for all the articles. Start with this URL:
+This is a follow-up to an exercise from the previous lesson, so feel free to reuse your code. Scrape links to the Guardian's latest F1 news articles. For each article, follow the link and extract both the author's name and the article's title. Print the author's name and the title for all the articles. Start with this URL:
```text
https://www.theguardian.com/sport/formulaone
```
-Your program should print something like the following:
+Your program should print something like this:
```text
Daniel Harris: Sports quiz of the week: Johan Neeskens, Bond and airborne antics
@@ -266,8 +266,8 @@ PA Media: Lewis Hamilton reveals lifelong battle with depression after school bu
Hints:
-- You can use [attribute selectors](https://developer.mozilla.org/en-US/docs/Web/CSS/Attribute_selectors) to select HTML elements based on values of their attributes.
270
-- Notice that sometimes a person authors the article, but sometimes it's a contribution by a news agency.
+- You can use [attribute selectors](https://developer.mozilla.org/en-US/docs/Web/CSS/Attribute_selectors) to select HTML elements based on their attribute values.
+- Sometimes a person authors the article, but other times it's contributed by a news agency.
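The attribute-selector hint can be sketched with Beautiful Soup's CSS support. The markup below is invented (the Guardian's real pages use different attributes), so the `rel="author"` link is purely an assumption to demonstrate the selector syntax.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; only the selector syntax matters here.
html = """
<a rel="author" href="/profile/daniel-harris">Daniel Harris</a>
<a href="/sport/formulaone">More F1 news</a>
"""
soup = BeautifulSoup(html, "html.parser")

# a[rel="author"] matches only anchors whose rel attribute is "author",
# so the plain navigation link is ignored.
author_link = soup.select_one('a[rel="author"]')
print(author_link.text)  # Daniel Harris
```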
sources/academy/webscraping/scraping_basics_python/11_scraping_variants.md (11 additions & 11 deletions)
@@ -309,10 +309,10 @@ Is this the end? Maybe! In the next lesson, we'll use a scraping framework to bu
### Build a scraper for watching Python jobs
-You're now able to build a scraper, are you? Let's build another one, then! Python's official website features a [job board](https://www.python.org/jobs/). Scrape job postings which match the following criteria:
+You're able to build a scraper now, aren't you? Let's build another one! Python's official website has a [job board](https://www.python.org/jobs/). Scrape the job postings that match the following criteria:
-- Tagged as Database
-- Not older than 60 days
+- Tagged as "Database"
+- Posted within the last 60 days
For each job posting found, use [`pp()`](https://docs.python.org/3/library/pprint.html#pprint.pp) to print a dictionary containing the following data:
@@ -321,7 +321,7 @@ For each job posting found, use [`pp()`](https://docs.python.org/3/library/pprin
- URL to the job posting
- Date of posting
-Your program should print something like the following:
+Your output should look something like this:
```text
{'title': 'Senior Full Stack Developer',
@@ -335,12 +335,12 @@ Your program should print something like the following:
...
```
-In Python's [`datetime`](https://docs.python.org/3/library/datetime.html) module you should find everything you need for manipulating time: `date.today()`, `datetime.fromisoformat()`, `datetime.date()`, `timedelta()`.
+You can find everything you need for working with dates and times in Python's [`datetime`](https://docs.python.org/3/library/datetime.html) module, including `date.today()`, `datetime.fromisoformat()`, `datetime.date()`, and `timedelta()`.
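The date arithmetic those `datetime` helpers enable can be sketched as follows. The ISO timestamp and the fixed "today" are made up so the example is reproducible; a real scraper would parse the posting date from the page and use `date.today()` instead.

```python
from datetime import date, datetime, timedelta

# Hypothetical posting date in ISO format, as it might be scraped
# from a job posting's <time> element.
posted = datetime.fromisoformat("2024-10-01T12:00:00")

# Fixed reference date for reproducibility; use date.today() in
# a real scraper.
today = date(2024, 11, 1)

# Keep only postings from the last 60 days.
is_recent = today - posted.date() <= timedelta(days=60)
print(is_recent)  # True
```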
<details>
<summary>Solution</summary>
-After inspecting how the job board works, we can notice that job postings tagged as Database have their own URL. We'll use it as the starting point, as it'll save us from needing to scrape and check the tags.
+After inspecting the job board, you'll notice that job postings tagged as "Database" have a dedicated URL. We'll use that as our starting point, which saves us from having to scrape and check the tags manually.
```py
from pprint import pp
@@ -376,13 +376,13 @@ In Python's [`datetime`](https://docs.python.org/3/library/datetime.html) module
Scrape the [CNN Sports](https://edition.cnn.com/sport) homepage. For each linked article, calculate its length in characters:
-- Locate element which holds the main content of the article.
-- Use [`get_text()`](https://beautiful-soup-4.readthedocs.io/en/latest/index.html#get-text) to get all its content as a plain text.
-- Use `len()` to calculate the length.
+- Locate the element that holds the main content of the article.
+- Use [`get_text()`](https://beautiful-soup-4.readthedocs.io/en/latest/index.html#get-text) to extract all the content as plain text.
+- Use `len()` to calculate the character count.
-Skip pages without text, e.g. those which contain only a video. Sort the results and print URL to the shortest article which made it to the homepage.
+Skip pages without text (like those that only have a video). Sort the results and print the URL of the shortest article that made it to the homepage.
-At the time of writing this exercise, the shortest article which made it to the CNN Sports homepage is [one about a donation to the Augusta National Golf Club](https://edition.cnn.com/2024/10/03/sport/masters-donation-hurricane-helene-relief-spt-intl/). It's just 1,642 characters long.
+At the time of writing, the shortest article on the CNN Sports homepage is [about a donation to the Augusta National Golf Club](https://edition.cnn.com/2024/10/03/sport/masters-donation-hurricane-helene-relief-spt-intl/), which is just 1,642 characters long.
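The measure-the-article steps above can be sketched as below. The markup and the `.article__content` class are placeholders (CNN's real pages use different class names), so this only shows the `get_text()` plus `len()` pattern.

```python
from bs4 import BeautifulSoup

# Hypothetical article markup; CNN's real pages use different
# class names, so the selector is a placeholder.
html = (
    '<div class="article__content">'
    "<p>Short story.</p><p>The end.</p>"
    "</div>"
)
soup = BeautifulSoup(html, "html.parser")

# Locate the main content element, flatten it to plain text,
# then measure the character count.
content = soup.select_one(".article__content")
text = content.get_text(strip=True)
length = len(text)
print(length)  # 20
```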