
Commit a8cfbec

feat: lessons about crawling and scraping product detail pages in Python (#1244)
### Done

This PR introduces two new lessons to the Python course, including real-world exercises. These two lessons conclude the base of the course: by the end, the reader should be able to build their own scraper. The final exercises focus on exactly that and test the reader's ability to build a scraper independently. The PR also includes some edits to the previous parts of the course (code examples, exercises).

### Next

Before the course is done, there should be two more lessons: one about building the very same scraper using a framework (Crawlee), and one about deploying the scraper to a platform (Apify). Then I should return to the beginning and complete the three initial lessons about DevTools.
2 parents 8b29182 + 30a75bd commit a8cfbec

File tree

10 files changed: +784 −16 lines


sources/academy/webscraping/scraping_basics_python/04_downloading_html.md

Lines changed: 4 additions & 5 deletions
@@ -140,12 +140,12 @@ Letting our program visibly crash on error is enough for our purposes. Now, let'
 
 <Exercises />
 
-### Scrape Amazon
+### Scrape AliExpress
 
-Download HTML of a product listing page, but this time from a real world e-commerce website. For example this page with Amazon search results:
+Download HTML of a product listing page, but this time from a real world e-commerce website. For example this page with AliExpress search results:
 
 ```text
-https://www.amazon.com/s?k=darth+vader
+https://www.aliexpress.com/w/wholesale-darth-vader.html
 ```
 
 <details>
@@ -154,13 +154,12 @@ https://www.amazon.com/s?k=darth+vader
 ```py
 import httpx
 
-url = "https://www.amazon.com/s?k=darth+vader"
+url = "https://www.aliexpress.com/w/wholesale-darth-vader.html"
 response = httpx.get(url)
 response.raise_for_status()
 print(response.text)
 ```
 
-If you get `Server error '503 Service Unavailable'`, that's just Amazon's anti-scraping protections. You can learn about how to overcome those in our [Anti-scraping protections](../anti_scraping/index.md) course.
 </details>
 
 ### Save downloaded HTML as a file

sources/academy/webscraping/scraping_basics_python/06_locating_elements.md

Lines changed: 8 additions & 0 deletions
@@ -122,6 +122,14 @@ for product in soup.select(".product-item"):
 
 This program does the same as the one we already had, but its code is more concise.
 
+:::note Fragile code
+
+We assume that the selectors we pass to the `select()` or `select_one()` methods return at least one element. If they don't, calling `[0]` on an empty list or `.text` on `None` would crash the program. If you perform type checking on your Python program, the code examples above may even trigger warnings about this.
+
+Not handling these cases allows us to keep the code examples more succinct. Additionally, if we expect the selectors to return elements but they suddenly don't, it usually means the website has changed since we wrote our scraper. Letting the program crash in such cases is a valid way to notify ourselves that we need to fix it.
+
+:::
+
 ## Precisely locating price
 
 In the output we can see that the price isn't located precisely. For each product, our scraper also prints the text `Sale price`. Let's look at the HTML structure again. Each bit containing the price looks like this:
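Editor's note: to make the "Fragile code" warning above concrete, here is a minimal, self-contained sketch of what `select_one()` returns when a selector does or doesn't match. The HTML snippet is made up for illustration; it is not from the course's demo site.

```py
from bs4 import BeautifulSoup

html_code = '<div class="product-item"><h2 class="title">Aurora Plush</h2></div>'
soup = BeautifulSoup(html_code, "html.parser")

# The selector matches, so we get a Tag back and can read its text
title_element = soup.select_one(".title")
print(title_element.text)  # Aurora Plush

# This selector matches nothing, so select_one() returns None;
# calling .text on it would raise AttributeError
price_element = soup.select_one(".price")
if price_element is None:
    print("price not found")
else:
    print(price_element.text)
```

The course's examples skip the `None` check on purpose; this sketch only shows what explicit handling would look like if you wanted it.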

sources/academy/webscraping/scraping_basics_python/09_getting_links.md

Lines changed: 14 additions & 6 deletions
@@ -199,8 +199,12 @@ def export_json(file, data):
     json.dump(data, file, default=serialize, indent=2)
 
 listing_url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
-soup = download(listing_url)
-data = [parse_product(product) for product in soup.select(".product-item")]
+listing_soup = download(listing_url)
+
+data = []
+for product in listing_soup.select(".product-item"):
+    item = parse_product(product)
+    data.append(item)
 
 with open("products.csv", "w") as file:
     export_csv(file, data)
@@ -209,7 +213,7 @@ with open("products.json", "w") as file:
     export_json(file, data)
 ```
 
-The program is much easier to read now. With the `parse_product()` function handy, we could also replace the convoluted loop with a [list comprehension](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions).
+The program is much easier to read now. With the `parse_product()` function handy, we could also replace the convoluted loop with one that only takes up four lines of code.
 
 :::tip Refactoring
 
@@ -300,9 +304,13 @@ Now we'll pass the base URL to the function in the main body of our program:
 
 ```py
 listing_url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
-soup = download(listing_url)
-# highlight-next-line
-data = [parse_product(product, listing_url) for product in soup.select(".product-item")]
+listing_soup = download(listing_url)
+
+data = []
+for product in listing_soup.select(".product-item"):
+    # highlight-next-line
+    item = parse_product(product, listing_url)
+    data.append(item)
 ```
 
 When we run the scraper now, we should see full URLs in our exports:
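Editor's note: the reason the lesson passes `listing_url` into `parse_product()` is that hrefs on the listing page are relative. A quick standalone illustration of how the standard library's `urljoin()` resolves them (the `/products/jbl-flip-4` path is a simplified stand-in, not the site's real URL):

```py
from urllib.parse import urljoin

base_url = "https://warehouse-theme-metal.myshopify.com/collections/sales"

# A root-relative href is resolved against the base URL's origin
print(urljoin(base_url, "/products/jbl-flip-4"))
# https://warehouse-theme-metal.myshopify.com/products/jbl-flip-4

# An already-absolute href is returned unchanged
print(urljoin(base_url, "https://example.com/other"))
# https://example.com/other
```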
Lines changed: 295 additions & 5 deletions
@@ -1,15 +1,305 @@
 ---
 title: Crawling websites with Python
 sidebar_label: Crawling websites
-description: TODO
+description: Lesson about building a Python application for watching prices. Using the HTTPX library to follow links to individual product pages.
 sidebar_position: 10
 slug: /scraping-basics-python/crawling
 ---
 
-:::danger Work in progress
+import Exercises from './_exercises.mdx';
 
-This course is incomplete. As we work on adding new lessons, we would love to hear your feedback. You can comment right here under each page or [file a GitHub Issue](https://github.com/apify/apify-docs/issues) to discuss a problem.
+**In this lesson, we'll follow links to individual product pages. We'll use HTTPX to download them and BeautifulSoup to process them.**
 
-This particular page is a placeholder for several lessons which should teach crawling.
+---
+
+In previous lessons we've managed to download the HTML code of a single page, parse it with BeautifulSoup, and extract relevant data from it. We'll do the same now for each of the products.
+
+Thanks to the refactoring, we have functions ready for each of the tasks, so we won't need to repeat ourselves in our code. This is what you should see in your editor now:
+
+```py
+import httpx
+from bs4 import BeautifulSoup
+from decimal import Decimal
+import csv
+import json
+from urllib.parse import urljoin
+
+def download(url):
+    response = httpx.get(url)
+    response.raise_for_status()
+
+    html_code = response.text
+    return BeautifulSoup(html_code, "html.parser")
+
+def parse_product(product, base_url):
+    title_element = product.select_one(".product-item__title")
+    title = title_element.text.strip()
+    url = urljoin(base_url, title_element["href"])
+
+    price_text = (
+        product
+        .select_one(".price")
+        .contents[-1]
+        .strip()
+        .replace("$", "")
+        .replace(",", "")
+    )
+    if price_text.startswith("From "):
+        min_price = Decimal(price_text.removeprefix("From "))
+        price = None
+    else:
+        min_price = Decimal(price_text)
+        price = min_price
+
+    return {"title": title, "min_price": min_price, "price": price, "url": url}
+
+def export_csv(file, data):
+    fieldnames = list(data[0].keys())
+    writer = csv.DictWriter(file, fieldnames=fieldnames)
+    writer.writeheader()
+    for row in data:
+        writer.writerow(row)
+
+def export_json(file, data):
+    def serialize(obj):
+        if isinstance(obj, Decimal):
+            return str(obj)
+        raise TypeError("Object not JSON serializable")
+
+    json.dump(data, file, default=serialize, indent=2)
+
+listing_url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
+listing_soup = download(listing_url)
+
+data = []
+for product in listing_soup.select(".product-item"):
+    item = parse_product(product, listing_url)
+    data.append(item)
+
+with open("products.csv", "w") as file:
+    export_csv(file, data)
+
+with open("products.json", "w") as file:
+    export_json(file, data)
+```
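Editor's note: the least obvious part of the listing above is the price parsing. Pulled out on its own as a hypothetical helper (the function name `parse_price` and the input strings are illustrative, not part of the lesson's code), the branching works like this:

```py
from decimal import Decimal

def parse_price(price_text):
    # Listings show either an exact price, or only a "From X" minimum
    # when the product has variants with different prices
    if price_text.startswith("From "):
        return Decimal(price_text.removeprefix("From ")), None
    price = Decimal(price_text)
    return price, price

# Exact price: min_price and price are the same
print(parse_price("74.95"))
# Only a minimum is known: price is None
print(parse_price("From 1398.00"))
```

Returning `None` for `price` is what later shows up as `"price": null` in the JSON export.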
+
+## Extracting vendor name
+
+Each product URL points to a so-called _product detail page_, or PDP. If we open one of the product URLs in the browser, e.g. the one about [Sony XBR-950G BRAVIA](https://warehouse-theme-metal.myshopify.com/products/sony-xbr-65x950g-65-class-64-5-diag-bravia-4k-hdr-ultra-hd-tv), we can see that it contains a vendor name, [SKU](https://en.wikipedia.org/wiki/Stock_keeping_unit), number of reviews, product images, product variants, stock availability, description, and perhaps more.
+
+![Product detail page](./images/pdp.png)
+
+Depending on what's valuable for our use case, we can now use the same techniques as in previous lessons to extract any of the above. As a demonstration, let's scrape the vendor name. In browser DevTools, we can see that the HTML around the vendor name has the following structure:
+
+```html
+<div class="product-meta">
+  <h1 class="product-meta__title heading h1">
+    Sony XBR-950G BRAVIA 4K HDR Ultra HD TV
+  </h1>
+  <div class="product-meta__label-list">
+    ...
+  </div>
+  <div class="product-meta__reference">
+    <!-- highlight-next-line -->
+    <a class="product-meta__vendor link link--accented" href="/collections/sony">
+      <!-- highlight-next-line -->
+      Sony
+      <!-- highlight-next-line -->
+    </a>
+    <span class="product-meta__sku">
+      SKU:
+      <span class="product-meta__sku-number">SON-985594-XBR-65</span>
+    </span>
+  </div>
+  <a href="#product-reviews" class="product-meta__reviews-badge link" data-offset="30">
+    <div class="rating">
+      <div class="rating__stars" role="img" aria-label="4.0 out of 5.0 stars">
+        ...
+      </div>
+      <span class="rating__caption">3 reviews</span>
+    </div>
+  </a>
+  ...
+</div>
+```
+
+It looks like using a CSS selector to locate the element with the `product-meta__vendor` class, and then extracting its text, should be enough to get the vendor name as a string:
+
+```py
+vendor = product_soup.select_one(".product-meta__vendor").text.strip()
+```
+
+But where do we put this line in our program?
+
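Editor's note: the selector line can be verified in isolation before wiring it into the program, by pasting a trimmed-down version of the markup above as a string (this snippet keeps only the vendor and SKU parts of the real page's HTML):

```py
from bs4 import BeautifulSoup

html_code = """
<div class="product-meta">
  <div class="product-meta__reference">
    <a class="product-meta__vendor link link--accented" href="/collections/sony">
      Sony
    </a>
    <span class="product-meta__sku">
      SKU:
      <span class="product-meta__sku-number">SON-985594-XBR-65</span>
    </span>
  </div>
</div>
"""

product_soup = BeautifulSoup(html_code, "html.parser")
# Same selector as in the lesson; strip() removes the surrounding whitespace
vendor = product_soup.select_one(".product-meta__vendor").text.strip()
print(vendor)  # Sony
```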
+## Crawling product detail pages
+
+In the `data` loop we're already going through all the products. Let's expand it to include downloading the product detail page, parsing it, extracting the vendor's name, and adding it as a new key in the item's dictionary:
+
+```py
+...
+
+listing_url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
+listing_soup = download(listing_url)
+
+data = []
+for product in listing_soup.select(".product-item"):
+    item = parse_product(product, listing_url)
+    # highlight-next-line
+    product_soup = download(item["url"])
+    # highlight-next-line
+    item["vendor"] = product_soup.select_one(".product-meta__vendor").text.strip()
+    data.append(item)
+
+...
+```
+
+If you run the program now, it'll take longer to finish since it's making 24 more HTTP requests. But in the end, it should produce exports with a new field containing the vendor's name:
+
+<!-- eslint-skip -->
+```json title=products.json
+[
+  {
+    "title": "JBL Flip 4 Waterproof Portable Bluetooth Speaker",
+    "min_price": "74.95",
+    "price": "74.95",
+    "url": "https://warehouse-theme-metal.myshopify.com/products/jbl-flip-4-waterproof-portable-bluetooth-speaker",
+    "vendor": "JBL"
+  },
+  {
+    "title": "Sony XBR-950G BRAVIA 4K HDR Ultra HD TV",
+    "min_price": "1398.00",
+    "price": null,
+    "url": "https://warehouse-theme-metal.myshopify.com/products/sony-xbr-65x950g-65-class-64-5-diag-bravia-4k-hdr-ultra-hd-tv",
+    "vendor": "Sony"
+  },
+  ...
+]
+```
+
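Editor's note: in the JSON export, `min_price` comes out as a string like `"74.95"` rather than a number. That's the lesson's `serialize()` helper at work: `json` doesn't know how to encode `Decimal`, so the `default=` hook converts it. Here's the same mechanism in isolation (the `item` dictionary is a made-up sample):

```py
import json
from decimal import Decimal

def serialize(obj):
    # json can't encode Decimal natively, so we turn it into a string;
    # anything else unknown is still an error
    if isinstance(obj, Decimal):
        return str(obj)
    raise TypeError("Object not JSON serializable")

item = {"title": "JBL Flip 4", "min_price": Decimal("74.95"), "price": None}
print(json.dumps(item, default=serialize, indent=2))
```

Without the `default=serialize` argument, `json.dumps()` would raise `TypeError` on the `Decimal` value.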
+## Extracting price
+
+Scraping the vendor's name is nice, but the main reason we started checking the detail pages in the first place was to figure out how to get a price for each product. From the product listing, we could only scrape the min price, and remember—we're building a Python app to track prices!
+
+Looking at the [Sony XBR-950G BRAVIA](https://warehouse-theme-metal.myshopify.com/products/sony-xbr-65x950g-65-class-64-5-diag-bravia-4k-hdr-ultra-hd-tv), it's clear that the listing only shows min prices, because some products have variants, each with a different price. And different stock availability. And different SKUs…
+
+![Morpheus revealing the existence of product variants](images/variants.png)
+
+In the next lesson, we'll scrape the product detail pages so that each product variant is represented as a separate item in our dataset.
+
+---
+
+<Exercises />
+
+### Scrape calling codes of African countries
+
+This is a follow-up to an exercise from the previous lesson, so feel free to reuse your code. Scrape links to Wikipedia pages for all African states and territories. Follow each link and extract the calling code from the info table. Print the URL and the calling code for each country. Start with this URL:
+
+```text
+https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_in_Africa
+```
+
+Your program should print the following:
+
+```text
+https://en.wikipedia.org/wiki/Algeria +213
+https://en.wikipedia.org/wiki/Angola +244
+https://en.wikipedia.org/wiki/Benin +229
+https://en.wikipedia.org/wiki/Botswana +267
+https://en.wikipedia.org/wiki/Burkina_Faso +226
+https://en.wikipedia.org/wiki/Burundi None
+https://en.wikipedia.org/wiki/Cameroon +237
+...
+```
+
+Hint: Locating cells in tables is sometimes easier if you know how to [navigate up](https://beautiful-soup-4.readthedocs.io/en/latest/index.html#going-up) in the HTML element soup.
+
+<details>
+  <summary>Solution</summary>
+
+```py
+import httpx
+from bs4 import BeautifulSoup
+from urllib.parse import urljoin
+
+def download(url):
+    response = httpx.get(url)
+    response.raise_for_status()
+    return BeautifulSoup(response.text, "html.parser")
+
+def parse_calling_code(soup):
+    for label in soup.select("th.infobox-label"):
+        if label.text.strip() == "Calling code":
+            data = label.parent.select_one("td.infobox-data")
+            return data.text.strip()
+    return None
+
+listing_url = "https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_in_Africa"
+listing_soup = download(listing_url)
+for name_cell in listing_soup.select(".wikitable tr td:nth-child(3)"):
+    link = name_cell.select_one("a")
+    country_url = urljoin(listing_url, link["href"])
+    country_soup = download(country_url)
+    calling_code = parse_calling_code(country_soup)
+    print(country_url, calling_code)
+```
+
+</details>
+
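Editor's note: the "navigate up" hint is the crux of this solution. The `th` label and the `td` data cell are siblings, so once we've found the label by its text, `.parent` takes us to the enclosing `tr`, from which the data cell is reachable. A minimal sketch with made-up infobox markup (real Wikipedia tables are messier):

```py
from bs4 import BeautifulSoup

html_code = """
<table class="infobox">
  <tr>
    <th class="infobox-label">Calling code</th>
    <td class="infobox-data">+213</td>
  </tr>
</table>
"""

soup = BeautifulSoup(html_code, "html.parser")
for label in soup.select("th.infobox-label"):
    if label.text.strip() == "Calling code":
        # .parent is the enclosing <tr>, which also contains the data cell
        calling_code = label.parent.select_one("td.infobox-data").text.strip()
        print(calling_code)  # +213
```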
+### Scrape authors of F1 news articles
+
+This is a follow-up to an exercise from the previous lesson, so feel free to reuse your code. Scrape links to the Guardian's latest F1 news articles. For each article, follow the link and extract both the author's name and the article's title. Print the author's name and the title for all the articles. Start with this URL:
+
+```text
+https://www.theguardian.com/sport/formulaone
+```
+
+Your program should print something like this:
+
+```text
+Daniel Harris: Sports quiz of the week: Johan Neeskens, Bond and airborne antics
+Colin Horgan: The NHL is getting its own Drive to Survive. But could it backfire?
+Reuters: US GP ticket sales ‘took off’ after Max Verstappen stopped winning in F1
+Giles Richards: Liam Lawson gets F1 chance to replace Pérez alongside Verstappen at Red Bull
+PA Media: Lewis Hamilton reveals lifelong battle with depression after school bullying
+...
+```
+
+Hints:
+
+- You can use [attribute selectors](https://developer.mozilla.org/en-US/docs/Web/CSS/Attribute_selectors) to select HTML elements based on their attribute values.
+- Sometimes a person authors the article, but other times it's contributed by a news agency.
+
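Editor's note: the attribute-selector hint refers to CSS selectors like `a[rel="author"]`, which match elements by an attribute's value rather than by class or tag alone. A standalone illustration with simplified markup (this is not the Guardian's actual HTML; the hrefs are made up):

```py
from bs4 import BeautifulSoup

html_code = """
<aside>
  <a rel="author" href="/profile/giles-richards">Giles Richards</a>
  <a href="/sport/formulaone">More F1</a>
</aside>
"""

soup = BeautifulSoup(html_code, "html.parser")
# The [rel="author"] part matches only the link carrying that attribute value
author_link = soup.select_one('a[rel="author"]')
print(author_link.text)  # Giles Richards
```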
+<details>
+  <summary>Solution</summary>
+
+```py
+import httpx
+from bs4 import BeautifulSoup
+from urllib.parse import urljoin
+
+def download(url):
+    response = httpx.get(url)
+    response.raise_for_status()
+    return BeautifulSoup(response.text, "html.parser")
+
+def parse_author(article_soup):
+    link = article_soup.select_one('aside a[rel="author"]')
+    if link:
+        return link.text.strip()
+    address = article_soup.select_one('aside address')
+    if address:
+        return address.text.strip()
+    return None
+
+listing_url = "https://www.theguardian.com/sport/formulaone"
+listing_soup = download(listing_url)
+for item in listing_soup.select("#maincontent ul li"):
+    link = item.select_one("a")
+    article_url = urljoin(listing_url, link["href"])
+    article_soup = download(article_url)
+    title = article_soup.select_one("h1").text.strip()
+    author = parse_author(article_soup)
+    print(f"{author}: {title}")
+```
 
-:::
+</details>

0 commit comments
