
Commit bd0edad

draft further course development
1 parent 1ab71bf commit bd0edad

File tree

2 files changed: +151 -6 lines changed


sources/academy/webscraping/scraping_basics_python/09_getting_links.md

Lines changed: 11 additions & 5 deletions
````diff
@@ -199,8 +199,11 @@ def export_json(file, data):
     json.dump(data, file, default=serialize, indent=2)
 
 listing_url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
-soup = download(listing_url)
-data = [parse_product(product) for product in soup.select(".product-item")]
+listing_soup = download(listing_url)
+data = [
+    parse_product(product)
+    for product in listing_soup.select(".product-item")
+]
 
 with open("products.csv", "w") as file:
     export_csv(file, data)
````
````diff
@@ -300,9 +303,12 @@ Now we'll pass the base URL to the function in the main body of our program:
 
 ```py
 listing_url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
-soup = download(listing_url)
-# highlight-next-line
-data = [parse_product(product, listing_url) for product in soup.select(".product-item")]
+listing_soup = download(listing_url)
+data = [
+    # highlight-next-line
+    parse_product(product, listing_url)
+    for product in listing_soup.select(".product-item")
+]
 ```
 
 When we run the scraper now, we should see full URLs in our exports:
````
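For illustration (this snippet is not part of the commit), an entry in `products.json` should then look roughly like this, with `url` holding an absolute link; the values are taken from the JBL product that appears in the sample output later in this commit:

```json
{
  "title": "JBL Flip 4 Waterproof Portable Bluetooth Speaker",
  "min_price": "74.95",
  "price": "74.95",
  "url": "https://warehouse-theme-metal.myshopify.com/products/jbl-flip-4-waterproof-portable-bluetooth-speaker"
}
```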

sources/academy/webscraping/scraping_basics_python/10_crawling.md

Lines changed: 140 additions & 1 deletion
````diff
@@ -1,11 +1,150 @@
 ---
 title: Crawling websites with Python
 sidebar_label: Crawling websites
-description: TODO
+description: Lesson about building a Python application for watching prices. Using the HTTPX library to follow links to individual product pages.
 sidebar_position: 10
 slug: /scraping-basics-python/crawling
 ---
 
+import Exercises from './_exercises.mdx';
+
+**In this lesson, we'll follow links to individual product pages. We'll use HTTPX to download them, and BeautifulSoup to process them.**
+
+---
+
+In previous lessons we've managed to download HTML code of a single page, parse it with BeautifulSoup, and extract relevant data from it. We'll do the same now for each of the products.
+
+Thanks to the refactoring we have functions ready for each of the tasks, so we won't need to repeat ourselves in our code. This is what you should see in your editor now:
+
+```python
+import httpx
+from bs4 import BeautifulSoup
+from decimal import Decimal
+import csv
+import json
+from urllib.parse import urljoin
+
+def download(url):
+    response = httpx.get(url)
+    response.raise_for_status()
+
+    html_code = response.text
+    return BeautifulSoup(html_code, "html.parser")
+
+def parse_product(product, base_url):
+    title_element = product.select_one(".product-item__title")
+    title = title_element.text.strip()
+    url = urljoin(base_url, title_element["href"])
+
+    price_text = (
+        product
+        .select_one(".price")
+        .contents[-1]
+        .strip()
+        .replace("$", "")
+        .replace(",", "")
+    )
+    if price_text.startswith("From "):
+        min_price = Decimal(price_text.removeprefix("From "))
+        price = None
+    else:
+        min_price = Decimal(price_text)
+        price = min_price
+
+    return {"title": title, "min_price": min_price, "price": price, "url": url}
+
+def export_csv(file, data):
+    fieldnames = list(data[0].keys())
+    writer = csv.DictWriter(file, fieldnames=fieldnames)
+    writer.writeheader()
+    for row in data:
+        writer.writerow(row)
+
+def export_json(file, data):
+    def serialize(obj):
+        if isinstance(obj, Decimal):
+            return str(obj)
+        raise TypeError("Object not JSON serializable")
+
+    json.dump(data, file, default=serialize, indent=2)
+
+listing_url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
+listing_soup = download(listing_url)
+data = [
+    parse_product(product, listing_url)
+    for product in listing_soup.select(".product-item")
+]
+
+with open("products.csv", "w") as file:
+    export_csv(file, data)
+
+with open("products.json", "w") as file:
+    export_json(file, data)
+```
+
+## Crawling product URLs
+
+In a new loop below the list comprehension we'll go through the product URLs, download and parse each of them, and extract some new data, e.g. name of the vendor. Then we'll save the data to the `product` dictionary as a new key.
+
+```python
+...
+
+listing_url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
+listing_soup = download(listing_url)
+data = [
+    parse_product(product, listing_url)
+    for product in listing_soup.select(".product-item")
+]
+
+# highlight-next-line
+for product in data:
+    # highlight-next-line
+    product_soup = download(product["url"])
+    # highlight-next-line
+    product["vendor"] = product_soup.select_one(".product-meta__vendor").text.strip()
+
+...
+```
+
+If you run the program now, it will take longer to finish, but should produce exports with a new field containing the vendor:
+
+<!-- eslint-skip -->
+```json title=products.json
+[
+  {
+    "title": "JBL Flip 4 Waterproof Portable Bluetooth Speaker",
+    "min_price": "74.95",
+    "price": "74.95",
+    "url": "https://warehouse-theme-metal.myshopify.com/products/jbl-flip-4-waterproof-portable-bluetooth-speaker",
+    "vendor": "JBL"
+  },
+  {
+    "title": "Sony XBR-950G BRAVIA 4K HDR Ultra HD TV",
+    "min_price": "1398.00",
+    "price": null,
+    "url": "https://warehouse-theme-metal.myshopify.com/products/sony-xbr-65x950g-65-class-64-5-diag-bravia-4k-hdr-ultra-hd-tv",
+    "vendor": "Sony"
+  },
+  ...
+]
+```
+
+<!--
+- show image of how we figured out the vendor or have a note about devtools
+
+caveats:
+- all the info in the listing is already at the product page, so it's a bit redundant to scrape the products in the listing, we could just scrape the links
+- scrape price for the variants
+
+caveats and reasons for framework:
+- it's slow
+- logging
+- a lot of boilerplate code
+- anti-scraping protection
+- browser crawling support
+-->
+
+
 :::danger Work in progress
 
 This course is incomplete. As we work on adding new lessons, we would love to hear your feedback. You can comment right here under each page or [file a GitHub Issue](https://github.com/apify/apify-docs/issues) to discuss a problem.
````
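As an aside (not part of the commit), the `.product-meta__vendor` selector used in the new loop can be spot-checked on a single product page before crawling everything. A minimal sketch, reusing the lesson's `download()` helper and the JBL product URL from the sample output above:

```python
import httpx
from bs4 import BeautifulSoup

# Same helper as in the lesson: fetch a page and parse it with BeautifulSoup
def download(url):
    response = httpx.get(url)
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")

# Hypothetical spot-check of the vendor selector on one product page
product_url = "https://warehouse-theme-metal.myshopify.com/products/jbl-flip-4-waterproof-portable-bluetooth-speaker"
product_soup = download(product_url)
vendor = product_soup.select_one(".product-meta__vendor").text.strip()
print(vendor)  # expected to print "JBL", matching the sample JSON above
```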
