Commit 92e2b3c

committed
feat: update crawling to be about JS
1 parent 602b7aa commit 92e2b3c

1 file changed: +104 −90 lines changed

sources/academy/webscraping/scraping_basics_javascript2/10_crawling.md

Lines changed: 104 additions & 90 deletions
@@ -12,75 +12,71 @@ import Exercises from './_exercises.mdx';

---

-In previous lessons we've managed to download the HTML code of a single page, parse it with BeautifulSoup, and extract relevant data from it. We'll do the same now for each of the products.
+In previous lessons we've managed to download the HTML code of a single page, parse it with Cheerio, and extract relevant data from it. We'll do the same now for each of the products.

Thanks to the refactoring, we have functions ready for each of the tasks, so we won't need to repeat ourselves in our code. This is what you should see in your editor now:

-```py
-import httpx
-from bs4 import BeautifulSoup
-from decimal import Decimal
-import json
-import csv
-from urllib.parse import urljoin
-
-def download(url):
-    response = httpx.get(url)
-    response.raise_for_status()
-
-    html_code = response.text
-    return BeautifulSoup(html_code, "html.parser")
-
-def parse_product(product, base_url):
-    title_element = product.select_one(".product-item__title")
-    title = title_element.text.strip()
-    url = urljoin(base_url, title_element["href"])
-
-    price_text = (
-        product
-        .select_one(".price")
-        .contents[-1]
-        .strip()
-        .replace("$", "")
-        .replace(",", "")
-    )
-    if price_text.startswith("From "):
-        min_price = Decimal(price_text.removeprefix("From "))
-        price = None
-    else:
-        min_price = Decimal(price_text)
-        price = min_price
-
-    return {"title": title, "min_price": min_price, "price": price, "url": url}
-
-def export_csv(file, data):
-    fieldnames = list(data[0].keys())
-    writer = csv.DictWriter(file, fieldnames=fieldnames)
-    writer.writeheader()
-    for row in data:
-        writer.writerow(row)
-
-def export_json(file, data):
-    def serialize(obj):
-        if isinstance(obj, Decimal):
-            return str(obj)
-        raise TypeError("Object not JSON serializable")
-
-    json.dump(data, file, default=serialize, indent=2)
-
-listing_url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
-listing_soup = download(listing_url)
-
-data = []
-for product in listing_soup.select(".product-item"):
-    item = parse_product(product, listing_url)
-    data.append(item)
-
-with open("products.csv", "w") as file:
-    export_csv(file, data)
-
-with open("products.json", "w") as file:
-    export_json(file, data)
+```js
+import * as cheerio from 'cheerio';
+import { writeFile } from 'fs/promises';
+import { AsyncParser } from '@json2csv/node';
+
+async function download(url) {
+  const response = await fetch(url);
+  if (response.ok) {
+    const html = await response.text();
+    return cheerio.load(html);
+  } else {
+    throw new Error(`HTTP ${response.status}`);
+  }
+}
+
+function parseProduct($productItem, baseURL) {
+  const $title = $productItem.find(".product-item__title");
+  const title = $title.text().trim();
+  const url = new URL($title.attr("href"), baseURL).href;
+
+  const $price = $productItem.find(".price").contents().last();
+  const priceRange = { minPrice: null, price: null };
+  const priceText = $price
+    .text()
+    .trim()
+    .replace("$", "")
+    .replace(".", "")
+    .replace(",", "");
+
+  if (priceText.startsWith("From ")) {
+    priceRange.minPrice = parseInt(priceText.replace("From ", ""));
+  } else {
+    priceRange.minPrice = parseInt(priceText);
+    priceRange.price = priceRange.minPrice;
+  }
+
+  return { url, title, ...priceRange };
+}
+
+function exportJSON(data) {
+  return JSON.stringify(data, null, 2);
+}
+
+async function exportCSV(data) {
+  const parser = new AsyncParser();
+  return await parser.parse(data).promise();
+}
+
+const listingURL = "https://warehouse-theme-metal.myshopify.com/collections/sales";
+const $ = await download(listingURL);
+
+const $items = $(".product-item").map((i, element) => {
+  const $productItem = $(element);
+  // highlight-next-line
+  const item = parseProduct($productItem, listingURL);
+  return item;
+});
+const data = $items.get();
+
+await writeFile('products.json', exportJSON(data));
+await writeFile('products.csv', await exportCSV(data));
```

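Since the new `parseProduct()` stores prices as integer cents, the string-cleaning chain is worth seeing in isolation. Below is a minimal standalone sketch of that logic, using a hypothetical `parsePriceText()` helper that is not part of the lesson code:

```javascript
// Hypothetical helper (not in the lesson code) mirroring parseProduct()'s
// price handling: strip "$", the decimal point, and the thousands separator,
// so "$1,398.00" becomes the integer 139800 (price in cents).
function parsePriceText(text) {
  const cleaned = text
    .trim()
    .replace("$", "")
    .replace(".", "")
    .replace(",", "");
  if (cleaned.startsWith("From ")) {
    // a price range: we only know the minimum
    return { minPrice: parseInt(cleaned.replace("From ", "")), price: null };
  }
  const value = parseInt(cleaned);
  return { minPrice: value, price: value };
}

console.log(parsePriceText("$74.95")); // → { minPrice: 7495, price: 7495 }
```

Because the decimal point is removed before `parseInt` runs, `$74.95` parses as `7495` cents rather than being truncated to `74`.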
## Extracting vendor name
@@ -125,51 +121,69 @@ Depending on what's valuable for our use case, we can now use the same technique
It looks like using a CSS selector to locate the element with the `product-meta__vendor` class, and then extracting its text, should be enough to get the vendor name as a string:

-```py
-vendor = product_soup.select_one(".product-meta__vendor").text.strip()
+```js
+const vendor = $(".product-meta__vendor").text().trim();
```

But where do we put this line in our program?

## Crawling product detail pages

-In the `data` loop we're already going through all the products. Let's expand it to include downloading the product detail page, parsing it, extracting the vendor's name, and adding it as a new key in the item's dictionary:
+In the `.map()` loop, we're already going through all the products. Let's expand it to include downloading the product detail page, parsing it, extracting the vendor's name, and adding it to the item object.

-```py
-...
+First, we need to make the loop asynchronous so that we can use `await download()` for each product. We'll add the `async` keyword to the inner function and rename the collection to `$promises`, since it will now store promises that resolve to items rather than the items themselves. We'll still convert the collection to a standard JavaScript array, but this time we'll pass it to `await Promise.all()` to resolve all the promises and retrieve the actual items.

-listing_url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
-listing_soup = download(listing_url)
+```js
+const listingURL = "https://warehouse-theme-metal.myshopify.com/collections/sales";
+const $ = await download(listingURL);

-data = []
-for product in listing_soup.select(".product-item"):
-    item = parse_product(product, listing_url)
-    # highlight-next-line
-    product_soup = download(item["url"])
-    # highlight-next-line
-    item["vendor"] = product_soup.select_one(".product-meta__vendor").text.strip()
-    data.append(item)
+// highlight-next-line
+const $promises = $(".product-item").map(async (i, element) => {
+  const $productItem = $(element);
+  const item = parseProduct($productItem, listingURL);
+  return item;
+});
+// highlight-next-line
+const data = await Promise.all($promises.get());
+```

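Before wiring this into Cheerio's `.map()`, the promise mechanics can be sketched with a plain array (standalone code, with a simulated download instead of a real HTTP request):

```javascript
// An async mapper returns a promise per item, not the item itself.
// Promise.all() resolves the whole array at once, preserving order.
const urls = ["/products/a", "/products/b", "/products/c"];

const promises = urls.map(async (url) => {
  // simulate awaiting a download for each product
  const vendor = await Promise.resolve(`vendor of ${url}`);
  return { url, vendor };
});

const data = await Promise.all(promises);
console.log(data.length); // → 3
```

The same shape applies to the lesson's code: `$promises.get()` yields an array of promises, and `Promise.all()` turns it into the array of finished items.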
-...
+The program behaves the same as before, but now the code is prepared to make HTTP requests from within the inner function. Let's do it:
+
+```js
+const listingURL = "https://warehouse-theme-metal.myshopify.com/collections/sales";
+const $ = await download(listingURL);
+
+const $promises = $(".product-item").map(async (i, element) => {
+  const $productItem = $(element);
+  const item = parseProduct($productItem, listingURL);
+  // highlight-next-line
+  const $p = await download(item.url);
+  // highlight-next-line
+  item.vendor = $p(".product-meta__vendor").text().trim();
+  return item;
+});
+const data = await Promise.all($promises.get());
```

+We download each product detail page and parse its HTML using Cheerio. The `$p` variable is the root of a Cheerio object tree, similar to but distinct from the `$` used for the listing page. That's why we use `$p()` instead of `$p.find()`.

If we run the program now, it'll take longer to finish since it's making 24 more HTTP requests. But in the end, it should produce exports with a new field containing the vendor's name:

<!-- eslint-skip -->
```json title=products.json
[
  {
-    "title": "JBL Flip 4 Waterproof Portable Bluetooth Speaker",
-    "min_price": "74.95",
-    "price": "74.95",
    "url": "https://warehouse-theme-metal.myshopify.com/products/jbl-flip-4-waterproof-portable-bluetooth-speaker",
+    "title": "JBL Flip 4 Waterproof Portable Bluetooth Speaker",
+    "minPrice": 7495,
+    "price": 7495,
    "vendor": "JBL"
  },
  {
+    "url": "https://warehouse-theme-metal.myshopify.com/products/sony-xbr-65x950g-65-class-64-5-diag-bravia-4k-hdr-ultra-hd-tv",
    "title": "Sony XBR-950G BRAVIA 4K HDR Ultra HD TV",
-    "min_price": "1398.00",
+    "minPrice": 139800,
    "price": null,
-    "url": "https://warehouse-theme-metal.myshopify.com/products/sony-xbr-65x950g-65-class-64-5-diag-bravia-4k-hdr-ultra-hd-tv",
    "vendor": "Sony"
  },
  ...
@@ -178,7 +192,7 @@ If we run the program now, it'll take longer to finish since it's making 24 more

## Extracting price

-Scraping the vendor's name is nice, but the main reason we started checking the detail pages in the first place was to figure out how to get a price for each product. From the product listing, we could only scrape the min price, and remember—were building a Python app to track prices!
+Scraping the vendor's name is nice, but the main reason we started checking the detail pages in the first place was to figure out how to get a price for each product. From the product listing, we could only scrape the min price, and remember—we're building a Node.js app to track prices!

Looking at the [Sony XBR-950G BRAVIA](https://warehouse-theme-metal.myshopify.com/products/sony-xbr-65x950g-65-class-64-5-diag-bravia-4k-hdr-ultra-hd-tv), it's clear that the listing only shows min prices, because some products have variants, each with a different price. And different stock availability. And different SKUs…

@@ -206,12 +220,12 @@ https://en.wikipedia.org/wiki/Angola +244
https://en.wikipedia.org/wiki/Benin +229
https://en.wikipedia.org/wiki/Botswana +267
https://en.wikipedia.org/wiki/Burkina_Faso +226
-https://en.wikipedia.org/wiki/Burundi None
+https://en.wikipedia.org/wiki/Burundi null
https://en.wikipedia.org/wiki/Cameroon +237
...
```

-Hint: Locating cells in tables is sometimes easier if you know how to [navigate up](https://beautiful-soup-4.readthedocs.io/en/latest/index.html#going-up) in the HTML element soup.
+Hint: Locating cells in tables is sometimes easier if you know how to [navigate up](https://cheerio.js.org/docs/api/classes/Cheerio#parent) in the HTML element tree.

<details>
<summary>Solution</summary>
