Commit 2edfe14

fix: update parsing to be about JS
1 parent 084dfec commit 2edfe14

3 files changed: +128 -74 lines changed

sources/academy/webscraping/scraping_basics_javascript2/05_parsing_html.md

Lines changed: 116 additions & 62 deletions
@@ -20,148 +20,202 @@ As a first step, let's try counting how many products are on the listing page.
 
 ## Processing HTML
 
-After downloading, the entire HTML is available in our program as a string. We can print it to the screen or save it to a file, but not much more. However, since it's a string, could we use [string operations](https://docs.python.org/3/library/stdtypes.html#string-methods) or [regular expressions](https://docs.python.org/3/library/re.html) to count the products?
+After downloading, the entire HTML is available in our program as a string. We can print it to the screen or save it to a file, but not much more. However, since it's a string, could we use [string operations](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String#instance_methods) or [regular expressions](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_expressions) to count the products?
 
-While somewhat possible, such an approach is tedious, fragile, and unreliable. To work with HTML, we need a robust tool dedicated to the task: an _HTML parser_. It takes a text with HTML markup and turns it into a tree of Python objects.
+While somewhat possible, such an approach is tedious, fragile, and unreliable. To work with HTML, we need a robust tool dedicated to the task: an _HTML parser_. It takes a text with HTML markup and turns it into a tree of JavaScript objects.
 
 :::info Why regex can't parse HTML
 
 While [Bobince's infamous StackOverflow answer](https://stackoverflow.com/a/1732454/325365) is funny, it doesn't explain much. In formal language theory, HTML's hierarchical and nested structure makes it a [context-free language](https://en.wikipedia.org/wiki/Context-free_language). Regular expressions match patterns in [regular languages](https://en.wikipedia.org/wiki/Regular_language), which are much simpler. This difference makes it hard for a regex to handle HTML's nested tags. HTML's complex syntax rules and various edge cases also add to the difficulty.
 
 :::
 
-We'll choose [Beautiful Soup](https://beautiful-soup-4.readthedocs.io/) as our parser, as it's a popular library renowned for its ability to process even non-standard, broken markup. This is useful for scraping, because real-world websites often contain all sorts of errors and discrepancies.
+We'll choose [Cheerio](https://cheerio.js.org/) as our parser, as it's a popular library which can process even non-standard, broken markup. This is useful for scraping, because real-world websites often contain all sorts of errors and discrepancies. In the project directory, we'll run the following to install the Cheerio package:
 
 ```text
-$ pip install beautifulsoup4
+$ npm install cheerio
+
+added 23 packages, and audited 24 packages in 1s
 ...
-Successfully installed beautifulsoup4-4.0.0 soupsieve-0.0
 ```
 
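Review note, not part of the commit: the snippets added below rely on top-level `await`, which requires the project to be an ES module (`"type": "module"` in `package.json`), and on the built-in `fetch()`, which ships with Node.js 18 and newer. A small sanity check along these lines can catch an outdated runtime early:

```js
// Hypothetical runtime check; the course itself assumes Node.js is installed.
const [major] = process.versions.node.split('.').map(Number);
if (major < 18) {
  throw new Error(`Node.js 18+ required, found ${process.versions.node}`);
}
console.log(`Node.js version OK: ${process.versions.node}`);
```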
-Now let's use it for parsing the HTML. The `BeautifulSoup` object allows us to work with the HTML elements in a structured way. As a demonstration, we'll first get the `<h1>` element, which represents the main heading of the page.
+:::tip Installing packages
+
+Being comfortable around installing Node.js packages is a prerequisite of this course, but if you wouldn't say no to a recap, we recommend the [An introduction to the npm package manager](https://nodejs.org/en/learn/getting-started/an-introduction-to-the-npm-package-manager) tutorial from the official Node.js documentation.
+
+:::
+
+Now let's import the package and use it for parsing the HTML. The `cheerio` module allows us to work with the HTML elements in a structured way. As a demonstration, we'll first get the `<h1>` element, which represents the main heading of the page.
 
 ![Element of the main heading](./images/h1.png)
 
 We'll update our code to the following:
 
-```py
-import httpx
-from bs4 import BeautifulSoup
+```js
+import * as cheerio from 'cheerio';
 
-url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
-response = httpx.get(url)
-response.raise_for_status()
+const url = "https://warehouse-theme-metal.myshopify.com/collections/sales";
+const response = await fetch(url);
 
-html_code = response.text
-soup = BeautifulSoup(html_code, "html.parser")
-print(soup.select("h1"))
+if (response.ok) {
+  const html = await response.text();
+  const $ = cheerio.load(html);
+  console.log($("h1"));
+} else {
+  throw new Error(`HTTP ${response.status}`);
+}
 ```
 
 Then let's run the program:
 
 ```text
-$ python main.py
-[<h1 class="collection__title heading h1">Sales</h1>]
+$ node index.js
+LoadedCheerio {
+  '0': <ref *1> Element {
+    parent: Element { ... },
+    prev: Text { ... },
+    next: Element { ... },
+    startIndex: null,
+    endIndex: null,
+    # highlight-next-line
+    children: [ [Text] ],
+    # highlight-next-line
+    name: 'h1',
+    attribs: [Object: null prototype] { class: 'collection__title heading h1' },
+    type: 'tag',
+    namespace: 'http://www.w3.org/1999/xhtml',
+    'x-attribsNamespace': [Object: null prototype] { class: undefined },
+    'x-attribsPrefix': [Object: null prototype] { class: undefined }
+  },
+  length: 1,
+  ...
+}
 ```
 
-Our code lists all `h1` elements it can find on the page. It's the case that there's just one, so in the result we can see a list with a single item. What if we want to print just the text? Let's change the end of the program to the following:
+Our code prints a Cheerio object. It's something like an array of all `h1` elements Cheerio can find in the HTML we gave it. There's just one, so we can see only a single item in the collection.
+
+The item has many properties, such as references to its parent or sibling elements, but most importantly, its name is `h1`, and in the `children` property it contains a single text element. Now let's print just the text. Let's change our program to the following:
+
+```js
+import * as cheerio from 'cheerio';
 
-```py
-headings = soup.select("h1")
-first_heading = headings[0]
-print(first_heading.text)
+const url = "https://warehouse-theme-metal.myshopify.com/collections/sales";
+const response = await fetch(url);
+
+if (response.ok) {
+  const html = await response.text();
+  const $ = cheerio.load(html);
+  // highlight-next-line
+  console.log($("h1").text());
+} else {
+  throw new Error(`HTTP ${response.status}`);
+}
 ```
 
-If we run our scraper again, it prints the text of the first `h1` element:
+Thanks to the nature of the Cheerio object, we don't have to explicitly find the first element: calling `.text()` returns the text of the matched elements, and here the collection holds just the single `h1`. Thus, if we run our scraper again, it prints its text:
 
 ```text
-$ python main.py
+$ node index.js
 Sales
 ```
 
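Review note, not part of the commit: when being explicit helps, Cheerio also mirrors jQuery's traversal methods, so the first matched element can be picked by hand. A minimal sketch, reusing the `$` object from the added program above:

```js
// `$` is the function returned by cheerio.load(html) in the program above.
const heading = $('h1').first(); // wrap only the first matched element
console.log(heading.text()); // 'Sales'
console.log(heading.attr('class')); // 'collection__title heading h1'
```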
 :::note Dynamic websites
 
-The Warehouse returns full HTML in its initial response, but many other sites add content via JavaScript after the page loads or after user interaction. In such cases, what we see in DevTools may differ from `response.text` in Python. Learn how to handle these scenarios in our [API Scraping](../api_scraping/index.md) and [Puppeteer & Playwright](../puppeteer_playwright/index.md) courses.
+The Warehouse returns full HTML in its initial response, but many other sites add some content after the page loads or after user interaction. In such cases, what we'd see in DevTools could differ from `await response.text()` in Node.js. Learn how to handle these scenarios in our [API Scraping](../api_scraping/index.md) and [Puppeteer & Playwright](../puppeteer_playwright/index.md) courses.
 
 :::
 
 ## Using CSS selectors
 
-Beautiful Soup's `.select()` method runs a _CSS selector_ against a parsed HTML document and returns all the matching elements. It's like calling `document.querySelectorAll()` in browser DevTools.
+Cheerio's `$()` method runs a _CSS selector_ against a parsed HTML document and returns all the matching elements. It's like calling `document.querySelectorAll()` in browser DevTools.
 
-Scanning through [usage examples](https://beautiful-soup-4.readthedocs.io/en/latest/#css-selectors) will help us to figure out code for counting the product cards:
+Scanning through [usage examples](https://cheerio.js.org/docs/basics/selecting) will help us figure out code for counting the product cards:
 
-```py
-import httpx
-from bs4 import BeautifulSoup
+```js
+import * as cheerio from 'cheerio';
 
-url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
-response = httpx.get(url)
-response.raise_for_status()
+const url = "https://warehouse-theme-metal.myshopify.com/collections/sales";
+const response = await fetch(url);
 
-html_code = response.text
-soup = BeautifulSoup(html_code, "html.parser")
-products = soup.select(".product-item")
-print(len(products))
+if (response.ok) {
+  const html = await response.text();
+  const $ = cheerio.load(html);
+  // highlight-next-line
+  console.log($(".product-item").length);
+} else {
+  throw new Error(`HTTP ${response.status}`);
+}
 ```
 
-In CSS, `.product-item` selects all elements whose `class` attribute contains value `product-item`. We call `soup.select()` with the selector and get back a list of matching elements. Beautiful Soup handles all the complexity of understanding the HTML markup for us. On the last line, we use `len()` to count how many items there is in the list.
+In CSS, `.product-item` selects all elements whose `class` attribute contains the value `product-item`. We call `$()` with the selector and get back a collection of matching elements. Cheerio handles all the complexity of understanding the HTML markup for us. Then we use `.length` to count how many items there are in the collection.
 
 ```text
-$ python main.py
+$ node index.js
 24
 ```
 
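Review note, not part of the commit: since the added text likens `$()` to `document.querySelectorAll()`, the count is easy to cross-check. Running the line below in the DevTools console on the same listing page should print the same number as the scraper:

```js
// Paste into the browser's DevTools console on the Sales listing page.
document.querySelectorAll('.product-item').length; // 24 at the time of writing
```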
 That's it! We've managed to download a product listing, parse its HTML, and count how many products it contains. In the next lesson, we'll be looking for a way to extract detailed information about individual products.
 
+:::info Cheerio and jQuery
+
+The Cheerio documentation frequently mentions something called jQuery. In the medieval days of the internet, when so-called Internet Explorers roamed the untamed plains of simple websites, developers created the first JavaScript frameworks to improve their crude tools and overcome the wild inconsistencies between browsers. Imagine a time when things like `document.querySelectorAll()` didn't even exist. jQuery was the most popular of these frameworks, granting great power to those who knew how to wield it.
+
+Cheerio was deliberately designed to mimic jQuery's interface. At the time, nearly everyone was familiar with it, and it felt like the most natural way to walk through HTML elements. jQuery was used in the browser, Cheerio in Node.js. But as time passed, jQuery gradually faded from relevance. In a twist of history, we now learn its syntax only to use Cheerio.
+
+:::
+
 ---
 
 <Exercises />
 
-### Scrape F1 teams
+### Scrape F1 Academy teams
 
-Print a total count of F1 teams listed on this page:
+Print a total count of F1 Academy teams listed on this page:
 
 ```text
-https://www.formula1.com/en/teams
+https://www.f1academy.com/Racing-Series/Teams
 ```
 
 <details>
 <summary>Solution</summary>
 
-```py
-import httpx
-from bs4 import BeautifulSoup
+```js
+import * as cheerio from 'cheerio';
 
-url = "https://www.formula1.com/en/teams"
-response = httpx.get(url)
-response.raise_for_status()
+const url = "https://www.f1academy.com/Racing-Series/Teams";
+const response = await fetch(url);
 
-html_code = response.text
-soup = BeautifulSoup(html_code, "html.parser")
-print(len(soup.select(".group")))
+if (response.ok) {
+  const html = await response.text();
+  const $ = cheerio.load(html);
+  console.log($(".teams-driver-item").length);
+} else {
+  throw new Error(`HTTP ${response.status}`);
+}
 ```
 
 </details>
 
-### Scrape F1 drivers
+### Scrape F1 Academy drivers
 
-Use the same URL as in the previous exercise, but this time print a total count of F1 drivers.
+Use the same URL as in the previous exercise, but this time print a total count of F1 Academy drivers.
 
 <details>
 <summary>Solution</summary>
 
-```py
-import httpx
-from bs4 import BeautifulSoup
+```js
+import * as cheerio from 'cheerio';
 
-url = "https://www.formula1.com/en/teams"
-response = httpx.get(url)
-response.raise_for_status()
+const url = "https://www.f1academy.com/Racing-Series/Teams";
+const response = await fetch(url);
 
-html_code = response.text
-soup = BeautifulSoup(html_code, "html.parser")
-print(len(soup.select(".f1-team-driver-name")))
+if (response.ok) {
+  const html = await response.text();
+  const $ = cheerio.load(html);
+  console.log($(".driver").length);
+} else {
+  throw new Error(`HTTP ${response.status}`);
+}
 ```
 
 </details>

sources/academy/webscraping/scraping_basics_javascript2/index.md

Lines changed: 1 addition & 1 deletion
@@ -33,7 +33,7 @@ Anyone with basic knowledge of developing programs in JavaScript who wants to st
 ## Requirements
 
 - A macOS, Linux, or Windows machine with a web browser and Node.js installed.
-- Familiarity with JavaScript basics: variables, conditions, loops, functions, strings, lists, dictionaries, files, classes, and exceptions.
+- Familiarity with JavaScript basics: variables, conditions, loops, functions, strings, arrays, objects, files, classes, promises, imports, and exceptions.
 - Comfort with building a Node.js package and installing dependencies with `npm`.
 - Familiarity with running commands in Terminal (macOS/Linux) or Command Prompt (Windows).
 
sources/academy/webscraping/scraping_basics_python/05_parsing_html.md

Lines changed: 11 additions & 11 deletions
@@ -63,7 +63,7 @@ $ python main.py
 [<h1 class="collection__title heading h1">Sales</h1>]
 ```
 
-Our code lists all `h1` elements it can find on the page. It's the case that there's just one, so in the result we can see a list with a single item. What if we want to print just the text? Let's change the end of the program to the following:
+Our code lists all `h1` elements it can find in the HTML we gave it. It's the case that there's just one, so in the result we can see a list with a single item. What if we want to print just the text? Let's change the end of the program to the following:
 
 ```py
 headings = soup.select("h1")
@@ -80,7 +80,7 @@ Sales
 
 :::note Dynamic websites
 
-The Warehouse returns full HTML in its initial response, but many other sites add content via JavaScript after the page loads or after user interaction. In such cases, what we see in DevTools may differ from `response.text` in Python. Learn how to handle these scenarios in our [API Scraping](../api_scraping/index.md) and [Puppeteer & Playwright](../puppeteer_playwright/index.md) courses.
+The Warehouse returns full HTML in its initial response, but many other sites add some content after the page loads or after user interaction. In such cases, what we'd see in DevTools could differ from `response.text` in Python. Learn how to handle these scenarios in our [API Scraping](../api_scraping/index.md) and [Puppeteer & Playwright](../puppeteer_playwright/index.md) courses.
 
 :::
 
@@ -117,12 +117,12 @@ That's it! We've managed to download a product listing, parse its HTML, and coun
 
 <Exercises />
 
-### Scrape F1 teams
+### Scrape F1 Academy teams
 
-Print a total count of F1 teams listed on this page:
+Print a total count of F1 Academy teams listed on this page:
 
 ```text
-https://www.formula1.com/en/teams
+https://www.f1academy.com/Racing-Series/Teams
 ```
 
 <details>
@@ -132,20 +132,20 @@ https://www.formula1.com/en/teams
 import httpx
 from bs4 import BeautifulSoup
 
-url = "https://www.formula1.com/en/teams"
+url = "https://www.f1academy.com/Racing-Series/Teams"
 response = httpx.get(url)
 response.raise_for_status()
 
 html_code = response.text
 soup = BeautifulSoup(html_code, "html.parser")
-print(len(soup.select(".group")))
+print(len(soup.select(".teams-driver-item")))
 ```
 
 </details>
 
-### Scrape F1 drivers
+### Scrape F1 Academy drivers
 
-Use the same URL as in the previous exercise, but this time print a total count of F1 drivers.
+Use the same URL as in the previous exercise, but this time print a total count of F1 Academy drivers.
 
 <details>
 <summary>Solution</summary>
@@ -154,13 +154,13 @@ Use the same URL as in the previous exercise, but this time print a total count
 import httpx
 from bs4 import BeautifulSoup
 
-url = "https://www.formula1.com/en/teams"
+url = "https://www.f1academy.com/Racing-Series/Teams"
 response = httpx.get(url)
 response.raise_for_status()
 
 html_code = response.text
 soup = BeautifulSoup(html_code, "html.parser")
-print(len(soup.select(".f1-team-driver-name")))
+print(len(soup.select(".driver")))
 ```
 
 </details>
