Commit bc2ff07
fix: update parsing to be about JS
1 parent ca304ee

4 files changed: +138 −86 lines
sources/academy/webscraping/scraping_basics_javascript2/04_downloading_html.md

Lines changed: 15 additions & 3 deletions
@@ -68,9 +68,15 @@ All is OK, mate
 
 :::info Troubleshooting
 
-If you see `ReferenceError: require is not defined in ES module scope, you can use import instead`, double check that in your `package.json` the type property is set to `module`.
+If you see errors or for any other reason cannot run the code above, it means that your environment isn't set up correctly. We're sorry, but figuring out the issue is out of scope of this course.
 
-If you see other errors or for any other reason cannot run the code above, it means that your environment isn't set up correctly. We're sorry, but figuring out the issue is out of scope of this course.
+Double check that in your `package.json` the type property is set to `module`, otherwise you'll get the following warning:
+
+```text
+[MODULE_TYPELESS_PACKAGE_JSON] Warning: Module type of file:///Users/.../product-scraper/index.js is not specified and it doesn't parse as CommonJS.
+Reparsing as ES module because module syntax was detected. This incurs a performance overhead.
+To eliminate this warning, add "type": "module" to /Users/.../product-scraper/package.json.
+```
 
 :::
 
@@ -84,6 +90,12 @@ const response = await fetch(url);
 console.log(await response.text());
 ```
 
+:::tip Asynchronous flow
+
+First time you see `await`? It's a modern syntax for working with promises. See the [JavaScript Asynchronous Programming and Callbacks](https://nodejs.org/en/learn/asynchronous-work/javascript-asynchronous-programming-and-callbacks) and [Discover Promises in Node.js](https://nodejs.org/en/learn/asynchronous-work/discover-promises-in-nodejs) tutorials in the official Node.js documentation for more.
+
+:::
+
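For readers meeting `await` here for the first time, a tiny offline sketch (not part of the commit) contrasting the `.then()` callback style with `await`. The `delayedValue` helper is a made-up stand-in for any promise-returning operation such as `fetch()`:

```javascript
// delayedValue is a hypothetical stand-in for any promise-returning call.
const delayedValue = (value) =>
  new Promise((resolve) => setTimeout(() => resolve(value), 10));

// Callback style: pass a function to .then(), which runs once the promise settles.
delayedValue('callbacks').then((result) => console.log(result));

// await style: needs an ES module (or an async function) and reads top to bottom.
const result = await delayedValue('await');
console.log(result);
```

Both lines do the same work; `await` just lets the code read sequentially.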
 If we run the program now, it should print the downloaded HTML:
 
 ```text
@@ -224,7 +236,7 @@ https://warehouse-theme-metal.myshopify.com/collections/sales
 
 ### Download an image as a file
 
-Download a product image, then save it on your disk as a file. While HTML is _textual_ content, images are _binary_. You may want to scan through the [Fetch API documentation](https://developer.mozilla.org/en-US/docs/Web/API/Fetch_API/Using_Fetch#reading_the_response_body) for guidance. Especially check `Response.arrayBuffer()`. You can use this URL pointing to an image of a TV:
+Download a product image, then save it on your disk as a file. While HTML is _textual_ content, images are _binary_. You may want to scan through the [Fetch API documentation](https://developer.mozilla.org/en-US/docs/Web/API/Fetch_API/Using_Fetch#reading_the_response_body) and the [Writing files with Node.js](https://nodejs.org/en/learn/manipulating-files/writing-files-with-nodejs) tutorial for guidance. Especially check `Response.arrayBuffer()`. You can use this URL pointing to an image of a TV:
 
 ```text
 https://warehouse-theme-metal.myshopify.com/cdn/shop/products/sonyxbr55front_f72cc8ff-fcd6-4141-b9cc-e1320f867785.jpg
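The key move in this exercise is reading the body as bytes instead of text. A sketch of just that step, with the download faked by constructing a `Response` over fixed bytes (`Response` is a global in Node.js 18+) so it runs offline; in the exercise itself, the response would come from `await fetch(url)`:

```javascript
import { writeFile, readFile } from 'node:fs/promises';

// Fake "download": a Response over four fixed bytes (the JPEG magic number),
// so this sketch needs no network. Swap in `await fetch(url)` for real.
const bytes = new Uint8Array([0xff, 0xd8, 0xff, 0xe0]);
const response = new Response(bytes);

// Binary body: arrayBuffer() instead of text(); Buffer wraps the raw bytes.
const arrayBuffer = await response.arrayBuffer();
await writeFile('tv.jpg', Buffer.from(arrayBuffer));

const saved = await readFile('tv.jpg');
console.log(saved.length); // 4
```

Using `response.text()` here would decode the bytes as UTF-8 and corrupt the image, which is why `arrayBuffer()` matters.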

sources/academy/webscraping/scraping_basics_javascript2/05_parsing_html.md

Lines changed: 111 additions & 71 deletions
@@ -19,162 +19,202 @@ As a first step, let's try counting how many products are on the listing page.
 
 ## Processing HTML
 
-After downloading, the entire HTML is available in our program as a string. We can print it to the screen or save it to a file, but not much more. However, since it's a string, could we use [string operations](https://docs.python.org/3/library/stdtypes.html#string-methods) or [regular expressions](https://docs.python.org/3/library/re.html) to count the products?
+After downloading, the entire HTML is available in our program as a string. We can print it to the screen or save it to a file, but not much more. However, since it's a string, could we use [string operations](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String#instance_methods) or [regular expressions](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_expressions) to count the products?
 
-While somewhat possible, such an approach is tedious, fragile, and unreliable. To work with HTML, we need a robust tool dedicated to the task: an _HTML parser_. It takes a text with HTML markup and turns it into a tree of Python objects.
+While somewhat possible, such an approach is tedious, fragile, and unreliable. To work with HTML, we need a robust tool dedicated to the task: an _HTML parser_. It takes a text with HTML markup and turns it into a tree of JavaScript objects.
 
 :::info Why regex can't parse HTML
 
 While [Bobince's infamous StackOverflow answer](https://stackoverflow.com/a/1732454/325365) is funny, it doesn't go much into explaining. In formal language theory, HTML's hierarchical and nested structure makes it a [context-free language](https://en.wikipedia.org/wiki/Context-free_language). Regular expressions match patterns in [regular languages](https://en.wikipedia.org/wiki/Regular_language), which are much simpler. This difference makes it hard for a regex to handle HTML's nested tags. HTML's complex syntax rules and various edge cases also add to the difficulty.
 
 :::
 
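To see why the lesson calls the string approach fragile, here's a naive counting sketch. The HTML snippet is invented for illustration, and the `product-item` class name is borrowed from later in the lesson:

```javascript
// Invented sample markup; real pages are far messier.
const html = `
  <div class="product-item">Sony TV</div>
  <div class="product-item product-item--sale">JVC Stereo</div>
  <div class="product-banner">Not a product</div>
`;

// Count products by matching the class attribute as plain text. This breaks
// as soon as the attribute uses single quotes, lists another class first,
// or spreads across lines -- which is exactly why we want a real parser.
const count = (html.match(/class="product-item[" ]/g) ?? []).length;
console.log(count); // 2
```

The regex happens to work on this tidy snippet, but any harmless change to the markup would silently skew the count.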
-We'll choose [Beautiful Soup](https://beautiful-soup-4.readthedocs.io/) as our parser, as it's a popular library renowned for its ability to process even non-standard, broken markup. This is useful for scraping, because real-world websites often contain all sorts of errors and discrepancies.
+We'll choose [Cheerio](https://cheerio.js.org/) as our parser, as it's a popular library which can process even non-standard, broken markup. This is useful for scraping, because real-world websites often contain all sorts of errors and discrepancies. In the project directory, we'll run the following to install the Cheerio package:
 
 ```text
-$ pip install beautifulsoup4
+$ npm install cheerio
+
+added 23 packages, and audited 24 packages in 1s
 ...
-Successfully installed beautifulsoup4-4.0.0 soupsieve-0.0
 ```
 
-<!--
 :::tip Installing packages
 
-Being comfortable around Python project setup and installing packages is a prerequisite of this course, but if you wouldn't say no to a recap, we recommend the [Installing Packages](https://packaging.python.org/en/latest/tutorials/installing-packages/) tutorial from the official Python Packaging User Guide.
+Being comfortable around installing Node.js packages is a prerequisite of this course, but if you wouldn't say no to a recap, we recommend the [An introduction to the npm package manager](https://nodejs.org/en/learn/getting-started/an-introduction-to-the-npm-package-manager) tutorial from the official Node.js documentation.
 
 :::
 
-:::info Troubleshooting
-
-If you see other errors or for any other reason cannot run the code above, it means that your environment isn't set up correctly. We're sorry, but figuring out the issue is out of scope of this course.
-
-:::
--->
-
-Now let's use it for parsing the HTML. The `BeautifulSoup` object allows us to work with the HTML elements in a structured way. As a demonstration, we'll first get the `<h1>` element, which represents the main heading of the page.
+Now let's import the package and use it for parsing the HTML. The `cheerio` module allows us to work with the HTML elements in a structured way. As a demonstration, we'll first get the `<h1>` element, which represents the main heading of the page.
 
 ![Element of the main heading](./images/h1.png)
 
 We'll update our code to the following:
 
-```py
-import httpx
-from bs4 import BeautifulSoup
+```js
+import * as cheerio from 'cheerio';
 
-url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
-response = httpx.get(url)
-response.raise_for_status()
+const url = "https://warehouse-theme-metal.myshopify.com/collections/sales";
+const response = await fetch(url);
 
-html_code = response.text
-soup = BeautifulSoup(html_code, "html.parser")
-print(soup.select("h1"))
+if (response.ok) {
+  const html = await response.text();
+  const $ = cheerio.load(html);
+  console.log($("h1"));
+} else {
+  throw new Error(`HTTP ${response.status}`);
+}
 ```
 
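The `response.ok` guard used in the updated snippets can be tried in isolation. A small sketch with synthetic `Response` objects (a global in Node.js 18+), so nothing touches the network:

```javascript
// Constructed directly instead of fetched, so this runs offline.
const success = new Response('body', { status: 200 });
const missing = new Response('body', { status: 404 });

// .ok is true exactly when the status is in the 200-299 range.
console.log(success.ok); // true
console.log(missing.ok); // false

// The same pattern the lesson's code uses for error handling:
if (!missing.ok) {
  console.log(`HTTP ${missing.status}`); // prints "HTTP 404"
}
```

Note that `fetch()` does not reject on HTTP errors like 404 or 500; checking `.ok` (or `.status`) is the only way to notice them.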
 Then let's run the program:
 
 ```text
-$ python main.py
-[<h1 class="collection__title heading h1">Sales</h1>]
+$ node index.js
+LoadedCheerio {
+  '0': <ref *1> Element {
+    parent: Element { ... },
+    prev: Text { ... },
+    next: Element { ... },
+    startIndex: null,
+    endIndex: null,
+    # highlight-next-line
+    children: [ [Text] ],
+    # highlight-next-line
+    name: 'h1',
+    attribs: [Object: null prototype] { class: 'collection__title heading h1' },
+    type: 'tag',
+    namespace: 'http://www.w3.org/1999/xhtml',
+    'x-attribsNamespace': [Object: null prototype] { class: undefined },
+    'x-attribsPrefix': [Object: null prototype] { class: undefined }
+  },
+  length: 1,
+  ...
+}
 ```
 
-Our code lists all `h1` elements it can find on the page. It's the case that there's just one, so in the result we can see a list with a single item. What if we want to print just the text? Let's change the end of the program to the following:
+Our code prints a Cheerio object. It's something like an array of all `h1` elements Cheerio can find in the HTML we gave it. There's just one, so we can see only a single item in the container.
+
+The item has many properties, such as references to its parent or sibling elements, but most importantly, its name is `h1` and in the `children` property, it contains a single text element. Now let's print just the text. Let's change our program to the following:
+
+```js
+import * as cheerio from 'cheerio';
 
-```py
-headings = soup.select("h1")
-first_heading = headings[0]
-print(first_heading.text)
+const url = "https://warehouse-theme-metal.myshopify.com/collections/sales";
+const response = await fetch(url);
+
+if (response.ok) {
+  const html = await response.text();
+  const $ = cheerio.load(html);
+  // highlight-next-line
+  console.log($("h1").text());
+} else {
+  throw new Error(`HTTP ${response.status}`);
+}
 ```
 
-If we run our scraper again, it prints the text of the first `h1` element:
+Thanks to the nature of the Cheerio object, we don't have to explicitly find the first element. If we call `.text()`, it automatically assumes we want to work with the first element in the collection. Thus, if we run our scraper again, it prints the text of the first `h1` element:
 
 ```text
-$ python main.py
+$ node index.js
 Sales
 ```
 
 :::note Dynamic websites
 
-The Warehouse returns full HTML in its initial response, but many other sites add content via JavaScript after the page loads or after user interaction. In such cases, what we see in DevTools may differ from `response.text` in Python. Learn how to handle these scenarios in our [API Scraping](../api_scraping/index.md) and [Puppeteer & Playwright](../puppeteer_playwright/index.md) courses.
+The Warehouse returns full HTML in its initial response, but many other sites add some content after the page loads or after user interaction. In such cases, what we'd see in DevTools could differ from `await response.text()` in Node.js. Learn how to handle these scenarios in our [API Scraping](../api_scraping/index.md) and [Puppeteer & Playwright](../puppeteer_playwright/index.md) courses.
 
 :::
 
 ## Using CSS selectors
 
-Beautiful Soup's `.select()` method runs a _CSS selector_ against a parsed HTML document and returns all the matching elements. It's like calling `document.querySelectorAll()` in browser DevTools.
+Cheerio's `$()` method runs a _CSS selector_ against a parsed HTML document and returns all the matching elements. It's like calling `document.querySelectorAll()` in browser DevTools.
 
-Scanning through [usage examples](https://beautiful-soup-4.readthedocs.io/en/latest/#css-selectors) will help us to figure out code for counting the product cards:
+Scanning through [usage examples](https://cheerio.js.org/docs/basics/selecting) will help us figure out code for counting the product cards:
 
-```py
-import httpx
-from bs4 import BeautifulSoup
+```js
+import * as cheerio from 'cheerio';
 
-url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
-response = httpx.get(url)
-response.raise_for_status()
+const url = "https://warehouse-theme-metal.myshopify.com/collections/sales";
+const response = await fetch(url);
 
-html_code = response.text
-soup = BeautifulSoup(html_code, "html.parser")
-products = soup.select(".product-item")
-print(len(products))
+if (response.ok) {
+  const html = await response.text();
+  const $ = cheerio.load(html);
+  // highlight-next-line
+  console.log($(".product-item").length);
+} else {
+  throw new Error(`HTTP ${response.status}`);
+}
 ```
 
-In CSS, `.product-item` selects all elements whose `class` attribute contains value `product-item`. We call `soup.select()` with the selector and get back a list of matching elements. Beautiful Soup handles all the complexity of understanding the HTML markup for us. On the last line, we use `len()` to count how many items there is in the list.
+In CSS, `.product-item` selects all elements whose `class` attribute contains the value `product-item`. We call `$()` with the selector and get back a container of matching elements. Cheerio handles all the complexity of understanding the HTML markup for us. Then we use `.length` to count how many items there are in the container.
 
 ```text
-$ python main.py
+$ node index.js
 24
 ```
 
 That's it! We've managed to download a product listing, parse its HTML, and count how many products it contains. In the next lesson, we'll be looking for a way to extract detailed information about individual products.
 
+:::info Cheerio and jQuery
+
+The Cheerio documentation frequently mentions something called jQuery. In the medieval days of the internet, when so-called Internet Explorers roamed the untamed plains of simple websites, developers created the first JavaScript frameworks to improve their crude tools and overcome the wild inconsistencies between browsers. Imagine a time when things like `document.querySelectorAll()` didn't even exist. jQuery was the most popular of these frameworks, granting great power to those who knew how to wield it.
+
+Cheerio was deliberately designed to mimic jQuery's interface. At the time, nearly everyone was familiar with it, and it felt like the most natural way to walk through HTML elements. jQuery was used in the browser, Cheerio in Node.js. But as time passed, jQuery gradually faded from relevance. In a twist of history, we now learn its syntax only to use Cheerio.
+
+:::
+
 ---
 
 <Exercises />
 
-### Scrape F1 teams
+### Scrape F1 Academy teams
 
-Print a total count of F1 teams listed on this page:
+Print a total count of F1 Academy teams listed on this page:
 
 ```text
-https://www.formula1.com/en/teams
+https://www.f1academy.com/Racing-Series/Teams
 ```
 
 <details>
 <summary>Solution</summary>
 
-```py
-import httpx
-from bs4 import BeautifulSoup
+```js
+import * as cheerio from 'cheerio';
 
-url = "https://www.formula1.com/en/teams"
-response = httpx.get(url)
-response.raise_for_status()
+const url = "https://www.f1academy.com/Racing-Series/Teams";
+const response = await fetch(url);
 
-html_code = response.text
-soup = BeautifulSoup(html_code, "html.parser")
-print(len(soup.select(".group")))
+if (response.ok) {
+  const html = await response.text();
+  const $ = cheerio.load(html);
+  console.log($(".teams-driver-item").length);
+} else {
+  throw new Error(`HTTP ${response.status}`);
+}
 ```
 
 </details>
 
-### Scrape F1 drivers
+### Scrape F1 Academy drivers
 
-Use the same URL as in the previous exercise, but this time print a total count of F1 drivers.
+Use the same URL as in the previous exercise, but this time print a total count of F1 Academy drivers.
 
 <details>
 <summary>Solution</summary>
 
-```py
-import httpx
-from bs4 import BeautifulSoup
+```js
+import * as cheerio from 'cheerio';
 
-url = "https://www.formula1.com/en/teams"
-response = httpx.get(url)
-response.raise_for_status()
+const url = "https://www.f1academy.com/Racing-Series/Teams";
+const response = await fetch(url);
 
-html_code = response.text
-soup = BeautifulSoup(html_code, "html.parser")
-print(len(soup.select(".f1-team-driver-name")))
+if (response.ok) {
+  const html = await response.text();
+  const $ = cheerio.load(html);
+  console.log($(".driver").length);
+} else {
+  throw new Error(`HTTP ${response.status}`);
+}
 ```
 
 </details>

sources/academy/webscraping/scraping_basics_javascript2/index.md

Lines changed: 1 addition & 1 deletion
@@ -32,7 +32,7 @@ Anyone with basic knowledge of developing programs in JavaScript who wants to st
 ## Requirements
 
 - A macOS, Linux, or Windows machine with a web browser and Node.js installed.
-- Familiarity with JavaScript basics: variables, conditions, loops, functions, strings, lists, dictionaries, files, classes, and exceptions.
+- Familiarity with JavaScript basics: variables, conditions, loops, functions, strings, arrays, objects, files, classes, promises, imports, and exceptions.
 - Comfort with building a Node.js package and installing dependencies with `npm`.
 - Familiarity with running commands in Terminal (macOS/Linux) or Command Prompt (Windows).
 
sources/academy/webscraping/scraping_basics_python/05_parsing_html.md

Lines changed: 11 additions & 11 deletions
@@ -63,7 +63,7 @@ $ python main.py
 [<h1 class="collection__title heading h1">Sales</h1>]
 ```
 
-Our code lists all `h1` elements it can find on the page. It's the case that there's just one, so in the result we can see a list with a single item. What if we want to print just the text? Let's change the end of the program to the following:
+Our code lists all `h1` elements it can find in the HTML we gave it. It's the case that there's just one, so in the result we can see a list with a single item. What if we want to print just the text? Let's change the end of the program to the following:
 
 ```py
 headings = soup.select("h1")
@@ -80,7 +80,7 @@ Sales
 
 :::note Dynamic websites
 
-The Warehouse returns full HTML in its initial response, but many other sites add content via JavaScript after the page loads or after user interaction. In such cases, what we see in DevTools may differ from `response.text` in Python. Learn how to handle these scenarios in our [API Scraping](../api_scraping/index.md) and [Puppeteer & Playwright](../puppeteer_playwright/index.md) courses.
+The Warehouse returns full HTML in its initial response, but many other sites add some content after the page loads or after user interaction. In such cases, what we'd see in DevTools could differ from `response.text` in Python. Learn how to handle these scenarios in our [API Scraping](../api_scraping/index.md) and [Puppeteer & Playwright](../puppeteer_playwright/index.md) courses.
 
 :::
 
@@ -117,12 +117,12 @@ That's it! We've managed to download a product listing, parse its HTML, and coun
 
 <Exercises />
 
-### Scrape F1 teams
+### Scrape F1 Academy teams
 
-Print a total count of F1 teams listed on this page:
+Print a total count of F1 Academy teams listed on this page:
 
 ```text
-https://www.formula1.com/en/teams
+https://www.f1academy.com/Racing-Series/Teams
 ```
 
 <details>
@@ -132,20 +132,20 @@ https://www.formula1.com/en/teams
 import httpx
 from bs4 import BeautifulSoup
 
-url = "https://www.formula1.com/en/teams"
+url = "https://www.f1academy.com/Racing-Series/Teams"
 response = httpx.get(url)
 response.raise_for_status()
 
 html_code = response.text
 soup = BeautifulSoup(html_code, "html.parser")
-print(len(soup.select(".group")))
+print(len(soup.select(".teams-driver-item")))
 ```
 
 </details>
 
-### Scrape F1 drivers
+### Scrape F1 Academy drivers
 
-Use the same URL as in the previous exercise, but this time print a total count of F1 drivers.
+Use the same URL as in the previous exercise, but this time print a total count of F1 Academy drivers.
 
 <details>
 <summary>Solution</summary>
@@ -154,13 +154,13 @@ Use the same URL as in the previous exercise, but this time print a total count
 import httpx
 from bs4 import BeautifulSoup
 
-url = "https://www.formula1.com/en/teams"
+url = "https://www.f1academy.com/Racing-Series/Teams"
 response = httpx.get(url)
 response.raise_for_status()
 
 html_code = response.text
 soup = BeautifulSoup(html_code, "html.parser")
-print(len(soup.select(".f1-team-driver-name")))
+print(len(soup.select(".driver")))
 ```
 
 </details>

0 commit comments
