
Commit 82a9655

the crawling lesson and more

1 parent bd0edad

File tree

7 files changed: +169 -40 lines changed


sources/academy/webscraping/scraping_basics_python/06_locating_elements.md

Lines changed: 8 additions & 0 deletions
```diff
@@ -122,6 +122,14 @@ for product in soup.select(".product-item"):
 
 This program does the same as the one we already had, but its code is more concise.
 
+:::note Fragile code
+
+We assume that the selectors we pass to the `select()` or `select_one()` methods return at least one element. If they don't, calling `[0]` on an empty list or `.text` on `None` would crash the program. If you run a type checker against your Python program, the code examples above may even trigger warnings about this.
+
+Not handling these cases allows us to keep the code examples more succinct. Additionally, if we expect the selectors to return elements but they suddenly don't, it usually means the website has changed since we wrote our scraper. Letting the program crash in such cases is a valid way to notify ourselves that we need to fix it.
+
+:::
+
 ## Precisely locating price
 
 In the output we can see that the price isn't located precisely. For each product, our scraper also prints the text `Sale price`. Let's look at the HTML structure again. Each bit containing the price looks like this:
```
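
If you'd rather fail with a clearer message than a bare `AttributeError`, a defensive variant of the lookup can check for `None` explicitly. A minimal sketch, using hypothetical markup rather than the lesson's actual page:

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real listing page.
html = '<div class="product-item"><a class="title">Sony TV</a></div>'
soup = BeautifulSoup(html, "html.parser")

title = soup.select_one(".title")
if title is None:
    # The selector matched nothing, so the site has probably changed;
    # crash loudly and early instead of continuing with bad data.
    raise RuntimeError("Selector '.title' matched no element")
print(title.text.strip())  # Sony TV
```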

sources/academy/webscraping/scraping_basics_python/09_getting_links.md

Lines changed: 11 additions & 9 deletions
```diff
@@ -200,10 +200,11 @@ def export_json(file, data):
 
 listing_url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
 listing_soup = download(listing_url)
-data = [
-    parse_product(product)
-    for product in listing_soup.select(".product-item")
-]
+
+data = []
+for product in listing_soup.select(".product-item"):
+    item = parse_product(product)
+    data.append(item)
 
 with open("products.csv", "w") as file:
     export_csv(file, data)
```
````diff
@@ -212,7 +213,7 @@ with open("products.json", "w") as file:
     export_json(file, data)
 ```
 
-The program is much easier to read now. With the `parse_product()` function handy, we could also replace the convoluted loop with a [list comprehension](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions).
+The program is much easier to read now. With the `parse_product()` function handy, we could also replace the convoluted loop with one that only takes up four lines of code.
 
 :::tip Refactoring
 
````
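For reference, the four-line loop and the list comprehension it replaces are interchangeable. A sketch with stub data, since `parse_product()` and `listing_soup` here only stand in for the lesson's real ones:

```python
from bs4 import BeautifulSoup

listing_soup = BeautifulSoup(
    '<div class="product-item">A</div><div class="product-item">B</div>',
    "html.parser",
)

def parse_product(product):
    # Stub for the lesson's parse_product(); returns a small dict.
    return {"title": product.text}

# Loop form, as used in the lesson:
data = []
for product in listing_soup.select(".product-item"):
    item = parse_product(product)
    data.append(item)

# Equivalent list comprehension:
data = [parse_product(product) for product in listing_soup.select(".product-item")]
```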

````diff
@@ -304,11 +305,12 @@ Now we'll pass the base URL to the function in the main body of our program:
 ```py
 listing_url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
 listing_soup = download(listing_url)
-data = [
+
+data = []
+for product in listing_soup.select(".product-item"):
     # highlight-next-line
-    parse_product(product, listing_url)
-    for product in listing_soup.select(".product-item")
-]
+    item = parse_product(product, listing_url)
+    data.append(item)
 ```
 
 When we run the scraper now, we should see full URLs in our exports:
````
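
The full URLs presumably come from joining the base URL with each product link's relative `href`, which the standard library handles directly. A quick sketch (the `href` value here is made up):

```python
from urllib.parse import urljoin

listing_url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
href = "/products/sony-xbr-65x950g"  # hypothetical relative link

print(urljoin(listing_url, href))
# https://warehouse-theme-metal.myshopify.com/products/sony-xbr-65x950g
```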

sources/academy/webscraping/scraping_basics_python/10_crawling.md

Lines changed: 70 additions & 31 deletions
```diff
@@ -70,10 +70,11 @@ def export_json(file, data):
 
 listing_url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
 listing_soup = download(listing_url)
-data = [
-    parse_product(product, listing_url)
-    for product in listing_soup.select(".product-item")
-]
+
+data = []
+for product in listing_soup.select(".product-item"):
+    item = parse_product(product, listing_url)
+    data.append(item)
 
 with open("products.csv", "w") as file:
     export_csv(file, data)
```
````diff
@@ -82,31 +83,77 @@ with open("products.json", "w") as file:
     export_json(file, data)
 ```
 
-## Crawling product URLs
+## Extracting vendor name
+
+Each product URL points to a so-called _product detail page_, or PDP. If we open one of the product URLs in the browser, e.g. the one about [Sony XBR-950G BRAVIA](https://warehouse-theme-metal.myshopify.com/products/sony-xbr-65x950g-65-class-64-5-diag-bravia-4k-hdr-ultra-hd-tv), we can see that it contains a vendor name, [SKU](https://en.wikipedia.org/wiki/Stock_keeping_unit), number of reviews, product images, product variants, stock availability, description, and perhaps more.
+
+![Product detail page](./images/pdp.png)
+
+Depending on what's valuable for our use case, we can now use the same techniques as in previous lessons to extract any of the above. As a demonstration, let's scrape the vendor name. In browser DevTools we can see that the HTML around the vendor name has the following structure:
+
+```html
+<div class="product-meta">
+  <h1 class="product-meta__title heading h1">
+    Sony XBR-950G BRAVIA 4K HDR Ultra HD TV
+  </h1>
+  <div class="product-meta__label-list">
+    ...
+  </div>
+  <div class="product-meta__reference">
+    <!-- highlight-next-line -->
+    <a class="product-meta__vendor link link--accented" href="/collections/sony">
+    <!-- highlight-next-line -->
+      Sony
+    <!-- highlight-next-line -->
+    </a>
+    <span class="product-meta__sku">
+      SKU:
+      <span class="product-meta__sku-number">SON-985594-XBR-65</span>
+    </span>
+  </div>
+  <a href="#product-reviews" class="product-meta__reviews-badge link" data-offset="30">
+    <div class="rating">
+      <div class="rating__stars" role="img" aria-label="4.0 out of 5.0 stars">
+        ...
+      </div>
+      <span class="rating__caption">3 reviews</span>
+    </div>
+  </a>
+  ...
+</div>
+```
+
+It looks like using a CSS selector to locate the element with the `product-meta__vendor` class and extracting its text should be enough to get the vendor name as a string:
+
+```python
+vendor = product_soup.select_one(".product-meta__vendor").text.strip()
+```
+
+But where do we put this line in our program?
 
-In a new loop below the list comprehension we'll go through the product URLs, download and parse each of them, and extract some new data, e.g. name of the vendor. Then we'll save the data to the `product` dictionary as a new key.
+## Crawling product detail pages
+
+In the `data` loop we already go through all the products. Let's expand it so that it also downloads the product detail page, parses it, extracts the vendor's name, and adds it to the item as a new dictionary key:
 
 ```python
 ...
 
 listing_url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
 listing_soup = download(listing_url)
-data = [
-    parse_product(product, listing_url)
-    for product in listing_soup.select(".product-item")
-]
 
-# highlight-next-line
-for product in data:
+data = []
+for product in listing_soup.select(".product-item"):
+    item = parse_product(product, listing_url)
     # highlight-next-line
-    product_soup = download(product["url"])
+    product_soup = download(item["url"])
     # highlight-next-line
-    product["vendor"] = product_soup.select_one(".product-meta__vendor").text.strip()
+    item["vendor"] = product_soup.select_one(".product-meta__vendor").text.strip()
+    data.append(item)
 
 ...
 ```
 
-If you run the program now, it will take longer to finish, but should produce exports with a new field containing the vendor:
+If you run the program now, it will take longer to finish, as it now makes 24 more HTTP requests, but in the end it should produce exports with a new field containing the vendor:
 
 <!-- eslint-skip -->
 ```json title=products.json
````
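
Each pass through the loop now costs one extra HTTP request, which is where the additional requests come from. The `download()` helper itself is defined earlier in the lesson; a minimal sketch of what such a helper might look like, assuming the `requests` library rather than whatever the course actually uses:

```python
import requests
from bs4 import BeautifulSoup

def download(url):
    # One HTTP GET per call, returning the page parsed and ready for scraping.
    response = requests.get(url)
    response.raise_for_status()  # crash early on 4xx/5xx responses
    return BeautifulSoup(response.text, "html.parser")
```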
````diff
@@ -129,26 +176,18 @@ If you run the program now, it will take longer to finish, but should produce ex
 ]
 ```
 
-<!--
-- show image of how we figured out the vendor or have a note about devtools
+## Extracting price
 
-caveats:
-- all the info in the listing is already at the product page, so it's a bit redundant to scrape the products in the listing, we could just scrape the links
-- scrape price for the variants
+Being able to scrape the vendor name is nice, but the main reason we started peeking at the detail pages in the first place was to figure out how to get a price for each product, because from the product listing we could only scrape the minimum price. And we're building a Python application for watching prices, remember?
 
-caveats and reasons for framework:
-- it's slow
-- logging
-- a lot of boilerplate code
-- anti-scraping protection
-- browser crawling support
--->
+Looking at [Sony XBR-950G BRAVIA](https://warehouse-theme-metal.myshopify.com/products/sony-xbr-65x950g-65-class-64-5-diag-bravia-4k-hdr-ultra-hd-tv), it's apparent that the listing features only minimum prices, because some of the products have variants, each with a different price. And different stock availability. And a different SKU…
 
+![Morpheus revealing the existence of product variants](images/variants.png)
 
-:::danger Work in progress
+In the next lesson we'll scrape the product detail pages in such a way that each product variant gets represented as a separate item in our dataset.
 
-This course is incomplete. As we work on adding new lessons, we would love to hear your feedback. You can comment right here under each page or [file a GitHub Issue](https://github.com/apify/apify-docs/issues) to discuss a problem.
+---
 
-This particular page is a placeholder for several lessons which should teach crawling.
+<Exercises />
 
-:::
+TODO
````
Lines changed: 47 additions & 0 deletions
````diff
@@ -0,0 +1,47 @@
+---
+title: Parsing product variants with Python
+sidebar_label: Parsing product variants
+description: Lesson about building a Python application for watching prices. Using browser DevTools to figure out how to parse product variants and export them as separate items.
+sidebar_position: 11
+slug: /scraping-basics-python/parsing-variants
+---
+
+:::danger Work in progress
+
+This course is incomplete. As we work on adding new lessons, we would love to hear your feedback. You can comment right here under each page or [file a GitHub Issue](https://github.com/apify/apify-docs/issues) to discuss a problem.
+
+:::
+
+<!--
+
+import Exercises from './_exercises.mdx';
+
+**Blah blah.**
+
+---
+
+We'll need to change our code so that instead of having one item per product in the listing, we let the code which handles product detail pages decide how many items it generates.
+
+But first let's see if we can
+
+The design of our program now assumes that a single URL from the product listing represents a single product. As it turns out, each URL from the product listing can represent one or more products. Instead of having one item per product in the listing, we should let the code which handles product detail pages decide how many items it generates.
+
+```python
+...
+
+listing_url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
+listing_soup = download(listing_url)
+
+data = []
+for product in listing_soup.select(".product-item"):
+    item = parse_product(product, listing_url)
+    # highlight-next-line
+    product_soup = download(item["url"])
+    # highlight-next-line
+    item["vendor"] = product_soup.select_one(".product-meta__vendor").text.strip()
+    data.append(item)
+
+...
+```
+-->
````
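
The commented-out draft hints at letting the detail-page code decide how many items it yields. One way that could look is a generator; a hypothetical sketch with made-up selectors, not the site's actual markup:

```python
def parse_variants(item, product_soup):
    # Hypothetical: one <option> element per product variant.
    variants = product_soup.select(".product-form__option option")
    if not variants:
        # No variants, so the product stays a single item.
        yield item
        return
    for variant in variants:
        # Emit a copy of the item per variant (dict union needs Python 3.9+).
        yield item | {"variant": variant.text.strip()}
```

The calling loop would then extend `data` with whatever the generator produces instead of appending exactly one item per URL.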
Lines changed: 33 additions & 0 deletions
```diff
@@ -0,0 +1,33 @@
+---
+title: Using a scraping framework with Python
+sidebar_label: Using a framework
+description: Lesson about building a Python application for watching prices. Using the Crawlee framework to simplify creating a scraper.
+sidebar_position: 11
+slug: /scraping-basics-python/framework
+---
+
+:::danger Work in progress
+
+This course is incomplete. As we work on adding new lessons, we would love to hear your feedback. You can comment right here under each page or [file a GitHub Issue](https://github.com/apify/apify-docs/issues) to discuss a problem.
+
+:::
+
+<!--
+
+import Exercises from './_exercises.mdx';
+
+**Blah blah.**
+
+---
+
+caveats:
+- all the info in the listing is already at the product page, so it's a bit redundant to scrape the products in the listing, we could just scrape the links
+
+caveats and reasons for framework:
+- it's slow
+- logging
+- a lot of boilerplate code
+- anti-scraping protection
+- browser crawling support
+
+-->
```
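
The caveats list is essentially the pitch for a framework: logging, retries, concurrency, anti-scraping measures, and browser crawling come out of the box. For a flavor of what that looks like, a sketch along the lines of Crawlee for Python's quick start; module paths and method names follow its docs at the time and may since have changed:

```python
import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

async def main():
    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def handle_listing(context: BeautifulSoupCrawlingContext):
        # Crawlee handles HTTP, retries, concurrency, and logging for us.
        for product in context.soup.select(".product-item"):
            await context.push_data({"title": product.text.strip()})

    await crawler.run(["https://warehouse-theme-metal.myshopify.com/collections/sales"])

asyncio.run(main())
```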
images/pdp.png (1.11 MB, binary file not shown)

images/variants.png (1.38 MB, binary file not shown)
