
Commit 310d837

lesson about scraping variants

1 parent 82a9655 commit 310d837

File tree

3 files changed (+312 -50 lines changed)


sources/academy/webscraping/scraping_basics_python/10_crawling.md

Lines changed: 3 additions & 3 deletions
@@ -16,7 +16,7 @@ In previous lessons we've managed to download HTML code of a single page, parse
Thanks to the refactoring we have functions ready for each of the tasks, so we won't need to repeat ourselves in our code. This is what you should see in your editor now:

-```python
+```py
import httpx
from bs4 import BeautifulSoup
from decimal import Decimal
@@ -125,7 +125,7 @@ Depending on what's valuable for our use case, we can now use the same technique
It looks like using a CSS selector to locate the element having the `product-meta__vendor` class and extracting its text should be enough to get the vendor name as a string:

-```python
+```py
vendor = product_soup.select_one(".product-meta__vendor").text.strip()
```

@@ -135,7 +135,7 @@ But where do we put this line in our program?
In the `data` loop we already go through all the products. Let's expand it so it also includes downloading the product detail page, parsing it, extracting the name of the vendor, and adding it as a new dictionary key to the item:

-```python
+```py
...

listing_url = "https://warehouse-theme-metal.myshopify.com/collections/sales"

sources/academy/webscraping/scraping_basics_python/11_parsing_variants.md

Lines changed: 0 additions & 47 deletions
This file was deleted.
Lines changed: 309 additions & 0 deletions
@@ -0,0 +1,309 @@
---
title: Scraping product variants with Python
sidebar_label: Scraping product variants
description: Lesson about building a Python application for watching prices. Using browser DevTools to figure out how to extract product variants and exporting them as separate items.
sidebar_position: 11
slug: /scraping-basics-python/scraping-variants
---

import Exercises from './_exercises.mdx';

**In this lesson, we'll scrape the product detail pages to represent each product variant as a separate item in our dataset.**

---

We'll need to figure out how to extract variants from the product detail page, and then change the way we add items to the data list, so that we can add multiple items after scraping one product URL.

## Locating variants

First, let's extract information about the variants. If we go to [Sony XBR-950G BRAVIA](https://warehouse-theme-metal.myshopify.com/products/sony-xbr-65x950g-65-class-64-5-diag-bravia-4k-hdr-ultra-hd-tv) and open the DevTools, we can see that the buttons for switching between variants look like this:
```html
<div class="block-swatch-list">
  <div class="block-swatch">
    <input class="block-swatch__radio product-form__single-selector is-filled" type="radio" name="template--14851594125363__main-1916221128755-1" id="template--14851594125363__main-1916221128755-1-1" value="55&quot;" checked="" data-option-position="1">
    <label class="block-swatch__item" for="template--14851594125363__main-1916221128755-1-1" title="55&quot;">
      <!-- highlight-next-line -->
      <span class="block-swatch__item-text">55"</span>
    </label>
  </div>
  <div class="block-swatch">
    <input class="block-swatch__radio product-form__single-selector" type="radio" name="template--14851594125363__main-1916221128755-1" id="template--14851594125363__main-1916221128755-1-2" value="65&quot;" data-option-position="1">
    <label class="block-swatch__item" for="template--14851594125363__main-1916221128755-1-2" title="65&quot;">
      <!-- highlight-next-line -->
      <span class="block-swatch__item-text">65"</span>
    </label>
  </div>
</div>
```
Nice, we can extract the names of the variants! But we also need to extract the price for each variant. Clicking the buttons, we can see that the HTML changes dynamically. This means the page uses JavaScript to display information about the variants.

If we can't find a workaround, we'd need our scraper to run JavaScript. That's not impossible: scrapers can spin up their own browser instance and automate clicking on buttons, but it's slow and resource-intensive. Ideally, we want to stick to plain HTTP requests and Beautiful Soup as much as possible.

After a bit of detective work, we notice that not far below the `block-swatch-list` there's also a block of HTML with the class `no-js`, which contains all the data!

```html
<div class="no-js product-form__option">
  <label class="product-form__option-name text--strong" for="product-select-1916221128755">Variant</label>
  <div class="select-wrapper select-wrapper--primary is-filled">
    <select id="product-select-1916221128755" name="id">
      <!-- highlight-next-line -->
      <option value="17550242349107" data-sku="SON-695219-XBR-55">
        <!-- highlight-next-line -->
        55" - $1,398.00
      </option>
      <!-- highlight-next-line -->
      <option value="17550242414643" data-sku="SON-985594-XBR-65" selected="selected">
        <!-- highlight-next-line -->
        65" - $2,198.00
      </option>
    </select>
  </div>
</div>
```

These elements aren't visible to a regular visitor. They're there as a fallback in case JavaScript fails to work; otherwise they're hidden. This is a great find, as it allows our scraper to stay lean.
## Extracting variants

Using our knowledge of Beautiful Soup, we can locate the options and extract the data we need:

```py
...

listing_url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
listing_soup = download(listing_url)

data = []
for product in listing_soup.select(".product-item"):
    item = parse_product(product, listing_url)
    product_soup = download(item["url"])
    vendor = product_soup.select_one(".product-meta__vendor").text.strip()
    item["vendor"] = vendor

    if variants := product_soup.select(".product-form__option.no-js option"):
        for variant in variants:
            data.append(item | {"variant_name": variant.text.strip()})
    else:
        item["variant_name"] = None
        data.append(item)

...
```

The CSS selector `.product-form__option.no-js` matches elements that have both the `product-form__option` and `no-js` classes. We then use the [descendant combinator](https://developer.mozilla.org/en-US/docs/Web/CSS/Descendant_combinator) to match all `option` elements anywhere inside the `.product-form__option.no-js` wrapper.
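To get a feel for the selector, here's a standalone sketch that runs it against a trimmed-down version of the `no-js` markup above (IDs and extra attributes omitted for brevity):

```python
from bs4 import BeautifulSoup

# A shortened version of the hidden no-js block from the product page.
sample_html = """
<div class="no-js product-form__option">
  <select name="id">
    <option value="17550242349107"> 55" - $1,398.00 </option>
    <option value="17550242414643"> 65" - $2,198.00 </option>
  </select>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
# Both classes must be present on the wrapper; any descendant option matches.
options = soup.select(".product-form__option.no-js option")
print([option.text.strip() for option in options])
# ['55" - $1,398.00', '65" - $2,198.00']
```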
Python dictionaries are mutable, so if we assigned the variant with `item["variant_name"] = ...`, we would always overwrite the values. Instead of saving an item for each variant, we'd end up with the last variant saved several times. To avoid this pitfall, we create a new dictionary for each variant and merge it with the `item` data before adding it to `data`. In case we don't find any variants, we add the `item` as is, with the `variant_name` key set to `None`.
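Here's a minimal, self-contained sketch of the pitfall, using a made-up `item` dictionary:

```python
# Mutating one shared dict: every list entry points to the same object,
# so all of them end up with the last variant name.
item = {"title": "TV"}
broken = []
for name in ['55"', '65"']:
    item["variant_name"] = name
    broken.append(item)
print([i["variant_name"] for i in broken])  # ['65"', '65"']

# Merging instead: `|` creates a new dict each time,
# so each entry keeps its own variant name.
item = {"title": "TV"}
fixed = [item | {"variant_name": name} for name in ['55"', '65"']]
print([i["variant_name"] for i in fixed])  # ['55"', '65"']
```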
:::tip Python syntax you might not know

Since Python 3.8 you can use `:=` to simplify checking whether an assignment resulted in a non-empty value. It's called an _assignment expression_ or the _walrus operator_, and you can learn more about it in the [docs](https://docs.python.org/3/reference/expressions.html#assignment-expressions) or in the [proposal document](https://peps.python.org/pep-0572/).

Since Python 3.9 you can use `|` to merge two dictionaries. If the [docs](https://docs.python.org/3/library/stdtypes.html#dict) don't feel explanatory enough, there's again a whole [proposal document](https://peps.python.org/pep-0584/) about it.

:::
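For a quick taste of both features, here's a tiny standalone example. The `found_variants` list is just a stand-in for the result of `product_soup.select()`:

```python
# Walrus operator: assign and test in a single expression.
# An empty list is falsy, so the else branch runs for products without variants.
found_variants = ['55"', '65"']
if variants := found_variants:
    print(f"{len(variants)} variants")  # 2 variants
else:
    print("no variants")

# Dict merge with `|` (Python 3.9+): the right-hand side wins on key conflicts.
defaults = {"price": None, "vendor": "unknown"}
print(defaults | {"vendor": "Sony"})  # {'price': None, 'vendor': 'Sony'}
```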
If you run the program, you should see 34 items in total. Some items should have no variant:

<!-- eslint-skip -->
```json title=products.json
[
  ...
  {
    "variant_name": null,
    "title": "Klipsch R-120SW Powerful Detailed Home Speaker - Unit",
    "min_price": "324.00",
    "price": "324.00",
    "url": "https://warehouse-theme-metal.myshopify.com/products/klipsch-r-120sw-powerful-detailed-home-speaker-set-of-1",
    "vendor": "Klipsch"
  },
  ...
]
```
Some products where we're missing the actual price should now have several variants:

<!-- eslint-skip -->
```json title=products.json
[
  ...
  {
    "variant_name": "Red - $178.00",
    "title": "Sony XB-950B1 Extra Bass Wireless Headphones with App Control",
    "min_price": "128.00",
    "price": null,
    "url": "https://warehouse-theme-metal.myshopify.com/products/sony-xb950-extra-bass-wireless-headphones-with-app-control",
    "vendor": "Sony"
  },
  {
    "variant_name": "Black - $178.00",
    "title": "Sony XB-950B1 Extra Bass Wireless Headphones with App Control",
    "min_price": "128.00",
    "price": null,
    "url": "https://warehouse-theme-metal.myshopify.com/products/sony-xb950-extra-bass-wireless-headphones-with-app-control",
    "vendor": "Sony"
  },
  ...
]
```
However, some products with variants will have the `price` field set. That's because the shop sells all these variants for the same price, so the product listing displayed the price as an exact number:

<!-- eslint-skip -->
```json title=products.json
[
  ...
  {
    "variant_name": "Red - $74.95",
    "title": "JBL Flip 4 Waterproof Portable Bluetooth Speaker",
    "min_price": "74.95",
    "price": "74.95",
    "url": "https://warehouse-theme-metal.myshopify.com/products/jbl-flip-4-waterproof-portable-bluetooth-speaker",
    "vendor": "JBL"
  },
  ...
]
```
## Parsing price

The items now contain the variant as text, which is good for a start, but it would be more useful if we could set the price to the `price` key. Let's introduce a new function which will take care of that:

```py
def parse_variant(variant):
    text = variant.text.strip()
    name, price_text = text.split(" - ")
    price = Decimal(
        price_text
        .replace("$", "")
        .replace(",", "")
    )
    return {"variant_name": name, "price": price}
```

First we split the text into two parts, then we parse the price as a decimal number. That part is similar to what we already have for parsing the product listing prices. The function then returns a dictionary which we can merge with `item`.
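To sanity-check the parsing logic in isolation, here's a variant of the same code that takes a plain string instead of a Beautiful Soup element (the `parse_variant_text` name is made up, and we assume variant names never contain `" - "` themselves):

```python
from decimal import Decimal

def parse_variant_text(text):
    # Same logic as parse_variant(), minus the .text attribute access.
    name, price_text = text.strip().split(" - ")
    price = Decimal(price_text.replace("$", "").replace(",", ""))
    return {"variant_name": name, "price": price}

print(parse_variant_text('65" - $2,198.00'))
# {'variant_name': '65"', 'price': Decimal('2198.00')}
```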
## Saving price

Now if we use our new function, we should finally get a program which is able to scrape exact prices for all products, even if they have variants. The whole code should look like this now:

```py
import httpx
from bs4 import BeautifulSoup
from decimal import Decimal
import csv
import json
from urllib.parse import urljoin

def download(url):
    response = httpx.get(url)
    response.raise_for_status()

    html_code = response.text
    return BeautifulSoup(html_code, "html.parser")

def parse_product(product, base_url):
    title_element = product.select_one(".product-item__title")
    title = title_element.text.strip()
    url = urljoin(base_url, title_element["href"])

    price_text = (
        product
        .select_one(".price")
        .contents[-1]
        .strip()
        .replace("$", "")
        .replace(",", "")
    )
    if price_text.startswith("From "):
        min_price = Decimal(price_text.removeprefix("From "))
        price = None
    else:
        min_price = Decimal(price_text)
        price = min_price

    return {"title": title, "min_price": min_price, "price": price, "url": url}

def parse_variant(variant):
    text = variant.text.strip()
    name, price_text = text.split(" - ")
    price = Decimal(
        price_text
        .replace("$", "")
        .replace(",", "")
    )
    return {"variant_name": name, "price": price}

def export_csv(file, data):
    fieldnames = list(data[0].keys())
    writer = csv.DictWriter(file, fieldnames=fieldnames)
    writer.writeheader()
    for row in data:
        writer.writerow(row)

def export_json(file, data):
    def serialize(obj):
        if isinstance(obj, Decimal):
            return str(obj)
        raise TypeError("Object not JSON serializable")

    json.dump(data, file, default=serialize, indent=2)

listing_url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
listing_soup = download(listing_url)

data = []
for product in listing_soup.select(".product-item"):
    item = parse_product(product, listing_url)
    product_soup = download(item["url"])
    vendor = product_soup.select_one(".product-meta__vendor").text.strip()
    item["vendor"] = vendor

    if variants := product_soup.select(".product-form__option.no-js option"):
        for variant in variants:
            # highlight-next-line
            data.append(item | parse_variant(variant))
    else:
        item["variant_name"] = None
        data.append(item)

with open("products.csv", "w") as file:
    export_csv(file, data)

with open("products.json", "w") as file:
    export_json(file, data)
```
Run the scraper and see for yourself whether all items in the data contain prices:

<!-- eslint-skip -->
```json title=products.json
[
  ...
  {
    "variant_name": "Red",
    "title": "Sony XB-950B1 Extra Bass Wireless Headphones with App Control",
    "min_price": "128.00",
    "price": "178.00",
    "url": "https://warehouse-theme-metal.myshopify.com/products/sony-xb950-extra-bass-wireless-headphones-with-app-control",
    "vendor": "Sony"
  },
  {
    "variant_name": "Black",
    "title": "Sony XB-950B1 Extra Bass Wireless Headphones with App Control",
    "min_price": "128.00",
    "price": "178.00",
    "url": "https://warehouse-theme-metal.myshopify.com/products/sony-xb950-extra-bass-wireless-headphones-with-app-control",
    "vendor": "Sony"
  },
  ...
]
```

Success! We managed to build a Python application for watching prices! Is this the end? Maybe. In the next lesson, we'll use a scraping framework to build the same application, but with less code, faster requests, and better visibility into what's actually happening while we wait for the program to finish.
---

<Exercises />

TODO
