Commit 8736bbc

committed
feat: update first half of scraping variants to be about JS
1 parent 3c354c9 commit 8736bbc

2 files changed

+70
-38
lines changed

sources/academy/webscraping/scraping_basics_javascript2/11_scraping_variants.md

Lines changed: 69 additions & 37 deletions
Original file line numberDiff line numberDiff line change
@@ -41,7 +41,7 @@ Nice! We can extract the variant names, but we also need to extract the price fo
4141

4242
![Switching variants](images/variants-js.gif)
4343

44-
If we can't find a workaround, we'd need our scraper to run JavaScript. That's not impossible. Scrapers can spin up their own browser instance and automate clicking on buttons, but it's slow and resource-intensive. Ideally, we want to stick to plain HTTP requests and Beautiful Soup as much as possible.
44+
If we can't find a workaround, we'd need our scraper to run browser JavaScript. That's not impossible. Scrapers can spin up their own browser instance and automate clicking on buttons, but it's slow and resource-intensive. Ideally, we want to stick to plain HTTP requests and Cheerio as much as possible.
4545

4646
After a bit of detective work, we notice that not far below the `block-swatch-list` there's also a block of HTML with a class `no-js`, which contains all the data!
4747

@@ -65,41 +65,73 @@ After a bit of detective work, we notice that not far below the `block-swatch-li
6565
</div>
6666
```
6767

68-
These elements aren't visible to regular visitors. They're there just in case JavaScript fails to work, otherwise they're hidden. This is a great find because it allows us to keep our scraper lightweight.
68+
These elements aren't visible to regular visitors. They're there just in case browser JavaScript fails to work; otherwise, they're hidden. This is a great find because it allows us to keep our scraper lightweight.
6969

7070
## Extracting variants
7171

72-
Using our knowledge of Beautiful Soup, we can locate the options and extract the data we need:
72+
Using our knowledge of Cheerio, we can locate the `option` elements and extract the data we need. We'll loop over the options, extract variant names, and create a corresponding array of items for each product:
7373

74-
```py
75-
listing_url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
76-
listing_soup = download(listing_url)
74+
```js
75+
const listingURL = "https://warehouse-theme-metal.myshopify.com/collections/sales";
76+
const $ = await download(listingURL);
7777

78-
data = []
79-
for product in listing_soup.select(".product-item"):
80-
item = parse_product(product, listing_url)
81-
product_soup = download(item["url"])
82-
vendor = product_soup.select_one(".product-meta__vendor").text.strip()
78+
const $promises = $(".product-item").map(async (i, element) => {
79+
const $productItem = $(element);
80+
const item = parseProduct($productItem, listingURL);
8381

84-
if variants := product_soup.select(".product-form__option.no-js option"):
85-
for variant in variants:
86-
data.append(item | {"variant_name": variant.text.strip()})
87-
else:
88-
item["variant_name"] = None
89-
data.append(item)
82+
const $p = await download(item.url);
83+
item.vendor = $p(".product-meta__vendor").text().trim();
84+
85+
// highlight-start
86+
const $items = $p(".product-form__option.no-js option").map((j, element) => {
87+
const $option = $p(element);
88+
const variantName = $option.text().trim();
89+
return { variantName, ...item };
90+
});
91+
// highlight-end
92+
93+
return item;
94+
});
95+
const data = await Promise.all($promises.get());
9096
```
9197

92-
The CSS selector `.product-form__option.no-js` matches elements with both `product-form__option` and `no-js` classes. Then we're using the [descendant combinator](https://developer.mozilla.org/en-US/docs/Web/CSS/Descendant_combinator) to match all `option` elements somewhere inside the `.product-form__option.no-js` wrapper.
98+
The CSS selector `.product-form__option.no-js` targets elements that have both the `product-form__option` and `no-js` classes. We then use the [descendant combinator](https://developer.mozilla.org/en-US/docs/Web/CSS/Descendant_combinator) to match all `option` elements nested within the `.product-form__option.no-js` wrapper.
9399

94-
Python dictionaries are mutable, so if we assigned the variant with `item["variant_name"] = ...`, we'd always overwrite the values. Instead of saving an item for each variant, we'd end up with the last variant repeated several times. To avoid this, we create a new dictionary for each variant and merge it with the `item` data before adding it to `data`. If we don't find any variants, we add the `item` as is, leaving the `variant_name` key empty.
100+
We loop over the variants using Cheerio's `.map()` method to create a collection of item copies for each `variantName`. We now need to pass all these items onward, but the function currently returns just one item per product. And what if there are no variants?
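To see why each variant needs its own copy of the item, here is a minimal sketch in plain JavaScript, without Cheerio. The `baseItem` object and the `withVariants()` helper are hypothetical stand-ins for the parsed product and the `.map()` callback; they are not part of the lesson's code.

```javascript
// Hypothetical stand-in for the object returned by parseProduct().
const baseItem = { title: "Sony XB-950B1 Extra Bass Wireless Headphones", vendor: "Sony" };

// Hypothetical helper: builds one new object per variant name.
// The spread syntax copies the properties of `item` into a fresh object,
// so the original item is never modified.
function withVariants(item, variantNames) {
  return variantNames.map((variantName) => ({ variantName, ...item }));
}

const items = withVariants(baseItem, ["Red - $178.00", "Black - $178.00"]);
console.log(items);
```

Each element of `items` is a separate object, so changing one of them later would not affect the others or `baseItem`.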
95101

96-
:::tip Modern Python syntax
102+
Let's adjust the loop so it returns a promise that resolves to an array of items instead of a single item. If a product has no variants, we'll return an array with a single item, setting `variantName` to `null`:
97103

98-
Since Python 3.8, you can use `:=` to simplify checking if an assignment resulted in a non-empty value. It's called an _assignment expression_ or _walrus operator_. You can learn more about it in the [docs](https://docs.python.org/3/reference/expressions.html#assignment-expressions) or in the [proposal document](https://peps.python.org/pep-0572/).
104+
```js
105+
const listingURL = "https://warehouse-theme-metal.myshopify.com/collections/sales";
106+
const $ = await download(listingURL);
99107

100-
Since Python 3.9, you can use `|` to merge two dictionaries. If the [docs](https://docs.python.org/3/library/stdtypes.html#dict) aren't clear enough, check out the [proposal document](https://peps.python.org/pep-0584/) for more details.
108+
const $promises = $(".product-item").map(async (i, element) => {
109+
const $productItem = $(element);
110+
const item = parseProduct($productItem, listingURL);
101111

102-
:::
112+
const $p = await download(item.url);
113+
item.vendor = $p(".product-meta__vendor").text().trim();
114+
115+
const $items = $p(".product-form__option.no-js option").map((j, element) => {
116+
const $option = $p(element);
117+
const variantName = $option.text().trim();
118+
return { variantName, ...item };
119+
});
120+
121+
// highlight-start
122+
if ($items.length > 0) {
123+
return $items.get();
124+
}
125+
return [{ variantName: null, ...item }];
126+
// highlight-end
127+
});
128+
// highlight-start
129+
const itemLists = await Promise.all($promises.get());
130+
const data = itemLists.flat();
131+
// highlight-end
132+
```
133+
134+
After modifying the loop, we also updated how we collect the items into the `data` array. Since the loop now produces an array of items per product, the result of `await Promise.all()` is an array of arrays. We use [`.flat()`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/flat) to merge them into a single, non-nested array.
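The shape of the data is easier to follow in isolation. The sketch below mimics the same pattern without any network requests. `fetchVariantItems()` and `collectItems()` are hypothetical names, and the input objects are made up for illustration.

```javascript
// Hypothetical stand-in for downloading a product page and extracting variants.
// It always resolves to an array, even when the product has no variants.
async function fetchVariantItems(product) {
  if (product.variants.length > 0) {
    return product.variants.map((variantName) => ({ variantName, title: product.title }));
  }
  return [{ variantName: null, title: product.title }];
}

// Promise.all() resolves to an array of arrays (one inner array per product),
// and .flat() merges them into a single list of items.
async function collectItems(products) {
  const itemLists = await Promise.all(products.map((product) => fetchVariantItems(product)));
  return itemLists.flat();
}

const sampleProducts = [
  { title: "Headphones", variants: ["Red", "Black"] },
  { title: "Speaker", variants: [] },
];
collectItems(sampleProducts).then((data) => console.log(data.length)); // prints 3
```

Note that `.flat()` flattens only one level by default, which is all this pattern needs.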
103135

104136
If we run the program now, we'll see 34 items in total. Some items don't have variants, so they won't have a variant name. However, they should still have a price set—our scraper should already have that info from the product listing page.
105137

@@ -108,11 +140,11 @@ If we run the program now, we'll see 34 items in total. Some items don't have va
108140
[
109141
...
110142
{
111-
"variant_name": null,
112-
"title": "Klipsch R-120SW Powerful Detailed Home Speaker - Unit",
113-
"min_price": "324.00",
114-
"price": "324.00",
143+
"variantName": null,
115144
"url": "https://warehouse-theme-metal.myshopify.com/products/klipsch-r-120sw-powerful-detailed-home-speaker-set-of-1",
145+
"title": "Klipsch R-120SW Powerful Detailed Home Speaker - Unit",
146+
"minPrice": 32400,
147+
"price": 32400,
116148
"vendor": "Klipsch"
117149
},
118150
...
@@ -126,19 +158,19 @@ Some products will break into several items, each with a different variant name.
126158
[
127159
...
128160
{
129-
"variant_name": "Red - $178.00",
161+
"variantName": "Red - $178.00",
162+
"url": "https://warehouse-theme-metal.myshopify.com/products/sony-xb950-extra-bass-wireless-headphones-with-app-control",
130163
"title": "Sony XB-950B1 Extra Bass Wireless Headphones with App Control",
131-
"min_price": "128.00",
164+
"minPrice": 12800,
132165
"price": null,
133-
"url": "https://warehouse-theme-metal.myshopify.com/products/sony-xb950-extra-bass-wireless-headphones-with-app-control",
134166
"vendor": "Sony"
135167
},
136168
{
137-
"variant_name": "Black - $178.00",
169+
"variantName": "Black - $178.00",
170+
"url": "https://warehouse-theme-metal.myshopify.com/products/sony-xb950-extra-bass-wireless-headphones-with-app-control",
138171
"title": "Sony XB-950B1 Extra Bass Wireless Headphones with App Control",
139-
"min_price": "128.00",
172+
"minPrice": 12800,
140173
"price": null,
141-
"url": "https://warehouse-theme-metal.myshopify.com/products/sony-xb950-extra-bass-wireless-headphones-with-app-control",
142174
"vendor": "Sony"
143175
},
144176
...
@@ -152,11 +184,11 @@ Perhaps surprisingly, some products with variants will have the price field set.
152184
[
153185
...
154186
{
155-
"variant_name": "Red - $74.95",
156-
"title": "JBL Flip 4 Waterproof Portable Bluetooth Speaker",
157-
"min_price": "74.95",
158-
"price": "74.95",
187+
"variantName": "Red - $74.95",
159188
"url": "https://warehouse-theme-metal.myshopify.com/products/jbl-flip-4-waterproof-portable-bluetooth-speaker",
189+
"title": "JBL Flip 4 Waterproof Portable Bluetooth Speaker",
190+
"minPrice": 7495,
191+
"price": 7495,
160192
"vendor": "JBL"
161193
},
162194
...

sources/academy/webscraping/scraping_basics_python/11_scraping_variants.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -88,7 +88,7 @@ for product in listing_soup.select(".product-item"):
8888
data.append(item)
8989
```
9090

91-
The CSS selector `.product-form__option.no-js` matches elements with both `product-form__option` and `no-js` classes. Then we're using the [descendant combinator](https://developer.mozilla.org/en-US/docs/Web/CSS/Descendant_combinator) to match all `option` elements somewhere inside the `.product-form__option.no-js` wrapper.
91+
The CSS selector `.product-form__option.no-js` targets elements that have both the `product-form__option` and `no-js` classes. We then use the [descendant combinator](https://developer.mozilla.org/en-US/docs/Web/CSS/Descendant_combinator) to match all `option` elements nested within the `.product-form__option.no-js` wrapper.
9292

9393
Python dictionaries are mutable, so if we assigned the variant with `item["variant_name"] = ...`, we'd always overwrite the values. Instead of saving an item for each variant, we'd end up with the last variant repeated several times. To avoid this, we create a new dictionary for each variant and merge it with the `item` data before adding it to `data`. If we don't find any variants, we add the `item` as is, leaving the `variant_name` key empty.
9494
