
Commit 30d0148

honzajavorek and TC-MO authored
feat: update the parsing lesson of the JS2 course to be about JavaScript (#1760)
Part of #1584, fixes #1648 ⚠️ 🐍 Includes respective changes also to the original `scraping_basics_python` lesson

Co-authored-by: Michał Olender <[email protected]>
1 parent 0f2d0e3 commit 30d0148

File tree: 3 files changed, +136 −76 lines changed


sources/academy/webscraping/scraping_basics_javascript2/05_parsing_html.md

Lines changed: 120 additions & 63 deletions
@@ -20,148 +20,205 @@ As a first step, let's try counting how many products are on the listing page.

## Processing HTML

-After downloading, the entire HTML is available in our program as a string. We can print it to the screen or save it to a file, but not much more. However, since it's a string, could we use [string operations](https://docs.python.org/3/library/stdtypes.html#string-methods) or [regular expressions](https://docs.python.org/3/library/re.html) to count the products?
+After downloading, the entire HTML is available in our program as a string. We can print it to the screen or save it to a file, but not much more. However, since it's a string, could we use [string operations](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String#instance_methods) or [regular expressions](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_expressions) to count the products?

-While somewhat possible, such an approach is tedious, fragile, and unreliable. To work with HTML, we need a robust tool dedicated to the task: an _HTML parser_. It takes a text with HTML markup and turns it into a tree of Python objects.
+While somewhat possible, such an approach is tedious, fragile, and unreliable. To work with HTML, we need a robust tool dedicated to the task: an _HTML parser_. It takes a text with HTML markup and turns it into a tree of JavaScript objects.

:::info Why regex can't parse HTML

-While [Bobince's infamous StackOverflow answer](https://stackoverflow.com/a/1732454/325365) is funny, it doesn't go much into explaining. In formal language theory, HTML's hierarchical and nested structure makes it a [context-free language](https://en.wikipedia.org/wiki/Context-free_language). Regular expressions match patterns in [regular languages](https://en.wikipedia.org/wiki/Regular_language), which are much simpler. This difference makes it hard for a regex to handle HTML's nested tags. HTML's complex syntax rules and various edge cases also add to the difficulty.
+While [Bobince's infamous StackOverflow answer](https://stackoverflow.com/a/1732454/325365) is funny, it doesn't go very deep into the reasoning:
+
+- In **formal language theory**, HTML's hierarchical, nested structure makes it a [context-free language](https://en.wikipedia.org/wiki/Context-free_language). **Regular expressions**, by contrast, match patterns in [regular languages](https://en.wikipedia.org/wiki/Regular_language), which are much simpler.
+- Because of this difference, regex alone struggles with HTML's nested tags. On top of that, HTML has **complex syntax rules** and countless **edge cases**, which only add to the difficulty.

:::
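
As an aside (our illustration, not part of the diff), here's a minimal sketch of why the regex route is fragile. The class name `product-item` matches the page used later in the lesson, but the pattern silently misses markup variants that a parser handles for free:

```js
// Naive regex count — intentionally fragile, for illustration only
const html = '<div class="product-item">A</div> <div class=product-item>B</div>';

// Misses the unquoted variant, different attribute order, extra classes, etc.
const matches = html.match(/class="product-item"/g) ?? [];
console.log(matches.length); // 1, even though there are two product cards
```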

-We'll choose [Beautiful Soup](https://beautiful-soup-4.readthedocs.io/) as our parser, as it's a popular library renowned for its ability to process even non-standard, broken markup. This is useful for scraping, because real-world websites often contain all sorts of errors and discrepancies.
+We'll choose [Cheerio](https://cheerio.js.org/) as our parser, as it's a popular library which can process even non-standard, broken markup. This is useful for scraping, because real-world websites often contain all sorts of errors and discrepancies. In the project directory, we'll run the following to install the Cheerio package:

```text
-$ pip install beautifulsoup4
+$ npm install cheerio --save
+
+added 23 packages, and audited 24 packages in 1s
...
-Successfully installed beautifulsoup4-4.0.0 soupsieve-0.0
```

-Now let's use it for parsing the HTML. The `BeautifulSoup` object allows us to work with the HTML elements in a structured way. As a demonstration, we'll first get the `<h1>` element, which represents the main heading of the page.
+:::tip Installing packages
+
+Being comfortable with installing Node.js packages is a prerequisite of this course, but if you wouldn't say no to a recap, we recommend the [An introduction to the npm package manager](https://nodejs.org/en/learn/getting-started/an-introduction-to-the-npm-package-manager) tutorial from the official Node.js documentation.
+
+:::
+
+Now let's import the package and use it for parsing the HTML. The `cheerio` module allows us to work with the HTML elements in a structured way. As a demonstration, we'll first get the `<h1>` element, which represents the main heading of the page.

![Element of the main heading](./images/h1.png)

We'll update our code to the following:

-```py
-import httpx
-from bs4 import BeautifulSoup
+```js
+import * as cheerio from 'cheerio';

-url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
-response = httpx.get(url)
-response.raise_for_status()
+const url = "https://warehouse-theme-metal.myshopify.com/collections/sales";
+const response = await fetch(url);

-html_code = response.text
-soup = BeautifulSoup(html_code, "html.parser")
-print(soup.select("h1"))
+if (response.ok) {
+  const html = await response.text();
+  const $ = cheerio.load(html);
+  console.log($("h1"));
+} else {
+  throw new Error(`HTTP ${response.status}`);
+}
```

Then let's run the program:

```text
-$ python main.py
-[<h1 class="collection__title heading h1">Sales</h1>]
+$ node index.js
+LoadedCheerio {
+  '0': <ref *1> Element {
+    parent: Element { ... },
+    prev: Text { ... },
+    next: Element { ... },
+    startIndex: null,
+    endIndex: null,
+    # highlight-next-line
+    children: [ [Text] ],
+    # highlight-next-line
+    name: 'h1',
+    attribs: [Object: null prototype] { class: 'collection__title heading h1' },
+    type: 'tag',
+    namespace: 'http://www.w3.org/1999/xhtml',
+    'x-attribsNamespace': [Object: null prototype] { class: undefined },
+    'x-attribsPrefix': [Object: null prototype] { class: undefined }
+  },
+  length: 1,
+  ...
+}
```

-Our code lists all `h1` elements it can find on the page. It's the case that there's just one, so in the result we can see a list with a single item. What if we want to print just the text? Let's change the end of the program to the following:
+Our code prints a Cheerio object. It's something like an array of all `h1` elements Cheerio can find in the HTML we gave it. There's just one, so we can see only a single item in the selection.
+
+The item has many properties, such as references to its parent or sibling elements, but most importantly, its name is `h1`, and in the `children` property it contains a single text element. Now let's print just the text. Let's change our program to the following:

-```py
-headings = soup.select("h1")
-first_heading = headings[0]
-print(first_heading.text)
+```js
+import * as cheerio from 'cheerio';
+
+const url = "https://warehouse-theme-metal.myshopify.com/collections/sales";
+const response = await fetch(url);
+
+if (response.ok) {
+  const html = await response.text();
+  const $ = cheerio.load(html);
+  // highlight-next-line
+  console.log($("h1").text());
+} else {
+  throw new Error(`HTTP ${response.status}`);
+}
```

-If we run our scraper again, it prints the text of the first `h1` element:
+Thanks to the nature of the Cheerio object, we don't have to explicitly find the first element. Calling `.text()` combines the text of all elements in the selection. If we run our scraper again, it prints the text of the `h1` element:

```text
-$ python main.py
+$ node index.js
Sales
```
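
A quick aside (ours, not part of the diff): the Cheerio selection is array-like, so `.text()` concatenates the text of every matched element, and methods like `.first()` or `.eq()` narrow the selection when that's not what you want. A minimal sketch:

```js
import * as cheerio from 'cheerio';

// Two matching elements, loaded from an inline HTML string
const $ = cheerio.load('<ul><li>Sales</li><li>New arrivals</li></ul>');

console.log($('li').text());         // 'SalesNew arrivals' — all matches, concatenated
console.log($('li').first().text()); // 'Sales' — narrow the selection first
console.log($('li').eq(1).text());   // 'New arrivals'
```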

:::note Dynamic websites

-The Warehouse returns full HTML in its initial response, but many other sites add content via JavaScript after the page loads or after user interaction. In such cases, what we see in DevTools may differ from `response.text` in Python. Learn how to handle these scenarios in our [API Scraping](../api_scraping/index.md) and [Puppeteer & Playwright](../puppeteer_playwright/index.md) courses.
+The Warehouse returns full HTML in its initial response, but many other sites add some content after the page loads or after user interaction. In such cases, what we'd see in DevTools could differ from `await response.text()` in Node.js. Learn how to handle these scenarios in our [API Scraping](../api_scraping/index.md) and [Puppeteer & Playwright](../puppeteer_playwright/index.md) courses.

:::

## Using CSS selectors

-Beautiful Soup's `.select()` method runs a _CSS selector_ against a parsed HTML document and returns all the matching elements. It's like calling `document.querySelectorAll()` in browser DevTools.
+Cheerio's `$()` method runs a _CSS selector_ against a parsed HTML document and returns all the matching elements. It's like calling `document.querySelectorAll()` in browser DevTools.

-Scanning through [usage examples](https://beautiful-soup-4.readthedocs.io/en/latest/#css-selectors) will help us to figure out code for counting the product cards:
+Scanning through [usage examples](https://cheerio.js.org/docs/basics/selecting) will help us figure out code for counting the product cards:

-```py
-import httpx
-from bs4 import BeautifulSoup
+```js
+import * as cheerio from 'cheerio';

-url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
-response = httpx.get(url)
-response.raise_for_status()
+const url = "https://warehouse-theme-metal.myshopify.com/collections/sales";
+const response = await fetch(url);

-html_code = response.text
-soup = BeautifulSoup(html_code, "html.parser")
-products = soup.select(".product-item")
-print(len(products))
+if (response.ok) {
+  const html = await response.text();
+  const $ = cheerio.load(html);
+  // highlight-next-line
+  console.log($(".product-item").length);
+} else {
+  throw new Error(`HTTP ${response.status}`);
+}
```

-In CSS, `.product-item` selects all elements whose `class` attribute contains value `product-item`. We call `soup.select()` with the selector and get back a list of matching elements. Beautiful Soup handles all the complexity of understanding the HTML markup for us. On the last line, we use `len()` to count how many items there is in the list.
+In CSS, `.product-item` selects all elements whose `class` attribute contains the value `product-item`. We call `$()` with the selector and get back the matching elements. Cheerio handles all the complexity of understanding the HTML markup for us. Then we use `.length` to count how many items there are in the selection.

```text
-$ python main.py
+$ node index.js
24
```
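
Another aside (ours, not part of the diff): `$()` accepts any CSS selector, not just a single class. A sketch of two common patterns — the `.product-item__title` class is an assumption about the Warehouse page's markup, so verify it in DevTools first:

```js
import * as cheerio from 'cheerio';

const response = await fetch('https://warehouse-theme-metal.myshopify.com/collections/sales');
const $ = cheerio.load(await response.text());

// Descendant combinator: titles inside product cards (assumed class name)
console.log($('.product-item .product-item__title').length);

// Attribute selector: links whose href points at a product detail page
console.log($('a[href*="/products/"]').length);
```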

That's it! We've managed to download a product listing, parse its HTML, and count how many products it contains. In the next lesson, we'll be looking for a way to extract detailed information about individual products.

+:::info Cheerio and jQuery
+
+The Cheerio documentation frequently mentions jQuery. Back when browsers were wildly inconsistent and basic DOM methods like `document.querySelectorAll()` didn't exist, jQuery was the most popular JavaScript framework for web development. It provided a consistent API that worked across all browsers.
+
+Cheerio was designed to mimic jQuery's interface because nearly every developer knew jQuery at the time. jQuery worked in browsers, Cheerio in Node.js. While jQuery has largely faded from modern web development, we now learn its syntax specifically to use Cheerio for server-side HTML manipulation.
+
+:::
+
---

<Exercises />

-### Scrape F1 teams
+### Scrape F1 Academy teams

-Print a total count of F1 teams listed on this page:
+Print a total count of F1 Academy teams listed on this page:

```text
-https://www.formula1.com/en/teams
+https://www.f1academy.com/Racing-Series/Teams
```

<details>
<summary>Solution</summary>

-```py
-import httpx
-from bs4 import BeautifulSoup
+```js
+import * as cheerio from 'cheerio';

-url = "https://www.formula1.com/en/teams"
-response = httpx.get(url)
-response.raise_for_status()
+const url = "https://www.f1academy.com/Racing-Series/Teams";
+const response = await fetch(url);

-html_code = response.text
-soup = BeautifulSoup(html_code, "html.parser")
-print(len(soup.select(".group")))
+if (response.ok) {
+  const html = await response.text();
+  const $ = cheerio.load(html);
+  console.log($(".teams-driver-item").length);
+} else {
+  throw new Error(`HTTP ${response.status}`);
+}
```

</details>
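
A small tip from us (not in the diff): before wiring a selector into the scraper, you can sanity-check the expected count directly in the browser, since Cheerio's `$()` behaves like `document.querySelectorAll()` here:

```js
// Run in the DevTools Console on https://www.f1academy.com/Racing-Series/Teams
// (assumes the markup still uses the class from the solution above)
document.querySelectorAll('.teams-driver-item').length;
```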

-### Scrape F1 drivers
+### Scrape F1 Academy drivers

-Use the same URL as in the previous exercise, but this time print a total count of F1 drivers.
+Use the same URL as in the previous exercise, but this time print a total count of F1 Academy drivers.

<details>
<summary>Solution</summary>

-```py
-import httpx
-from bs4 import BeautifulSoup
+```js
+import * as cheerio from 'cheerio';

-url = "https://www.formula1.com/en/teams"
-response = httpx.get(url)
-response.raise_for_status()
+const url = "https://www.f1academy.com/Racing-Series/Teams";
+const response = await fetch(url);

-html_code = response.text
-soup = BeautifulSoup(html_code, "html.parser")
-print(len(soup.select(".f1-team-driver-name")))
+if (response.ok) {
+  const html = await response.text();
+  const $ = cheerio.load(html);
+  console.log($(".driver").length);
+} else {
+  throw new Error(`HTTP ${response.status}`);
+}
```

</details>

sources/academy/webscraping/scraping_basics_javascript2/index.md

Lines changed: 1 addition & 1 deletion
@@ -33,7 +33,7 @@ Anyone with basic knowledge of developing programs in JavaScript who wants to st

## Requirements

- A macOS, Linux, or Windows machine with a web browser and Node.js installed.
-- Familiarity with JavaScript basics: variables, conditions, loops, functions, strings, lists, dictionaries, files, classes, and exceptions.
+- Familiarity with JavaScript basics: variables, conditions, loops, functions, strings, arrays, objects, files, classes, promises, imports, and exceptions.
- Comfort with building a Node.js package and installing dependencies with `npm`.
- Familiarity with running commands in Terminal (macOS/Linux) or Command Prompt (Windows).

sources/academy/webscraping/scraping_basics_python/05_parsing_html.md

Lines changed: 15 additions & 12 deletions
@@ -25,7 +25,10 @@ While somewhat possible, such an approach is tedious, fragile, and unreliable. T

:::info Why regex can't parse HTML

-While [Bobince's infamous StackOverflow answer](https://stackoverflow.com/a/1732454/325365) is funny, it doesn't go much into explaining. In formal language theory, HTML's hierarchical and nested structure makes it a [context-free language](https://en.wikipedia.org/wiki/Context-free_language). Regular expressions match patterns in [regular languages](https://en.wikipedia.org/wiki/Regular_language), which are much simpler. This difference makes it hard for a regex to handle HTML's nested tags. HTML's complex syntax rules and various edge cases also add to the difficulty.
+While [Bobince's infamous StackOverflow answer](https://stackoverflow.com/a/1732454/325365) is funny, it doesn't go very deep into the reasoning:
+
+- In **formal language theory**, HTML's hierarchical, nested structure makes it a [context-free language](https://en.wikipedia.org/wiki/Context-free_language). **Regular expressions**, by contrast, match patterns in [regular languages](https://en.wikipedia.org/wiki/Regular_language), which are much simpler.
+- Because of this difference, regex alone struggles with HTML's nested tags. On top of that, HTML has **complex syntax rules** and countless **edge cases**, which only add to the difficulty.

:::

@@ -63,7 +66,7 @@ $ python main.py
[<h1 class="collection__title heading h1">Sales</h1>]
```

-Our code lists all `h1` elements it can find on the page. It's the case that there's just one, so in the result we can see a list with a single item. What if we want to print just the text? Let's change the end of the program to the following:
+Our code lists all `h1` elements it can find in the HTML we gave it. There's just one, so in the result we see a list with a single item. What if we want to print just the text? Let's change the end of the program to the following:

```py
headings = soup.select("h1")
@@ -80,7 +83,7 @@ Sales

:::note Dynamic websites

-The Warehouse returns full HTML in its initial response, but many other sites add content via JavaScript after the page loads or after user interaction. In such cases, what we see in DevTools may differ from `response.text` in Python. Learn how to handle these scenarios in our [API Scraping](../api_scraping/index.md) and [Puppeteer & Playwright](../puppeteer_playwright/index.md) courses.
+The Warehouse returns full HTML in its initial response, but many other sites add some content after the page loads or after user interaction. In such cases, what we'd see in DevTools could differ from `response.text` in Python. Learn how to handle these scenarios in our [API Scraping](../api_scraping/index.md) and [Puppeteer & Playwright](../puppeteer_playwright/index.md) courses.

:::

@@ -117,12 +120,12 @@ That's it! We've managed to download a product listing, parse its HTML, and coun

<Exercises />

-### Scrape F1 teams
+### Scrape F1 Academy teams

-Print a total count of F1 teams listed on this page:
+Print a total count of F1 Academy teams listed on this page:

```text
-https://www.formula1.com/en/teams
+https://www.f1academy.com/Racing-Series/Teams
```

<details>
@@ -132,20 +135,20 @@ https://www.formula1.com/en/teams
import httpx
from bs4 import BeautifulSoup

-url = "https://www.formula1.com/en/teams"
+url = "https://www.f1academy.com/Racing-Series/Teams"
response = httpx.get(url)
response.raise_for_status()

html_code = response.text
soup = BeautifulSoup(html_code, "html.parser")
-print(len(soup.select(".group")))
+print(len(soup.select(".teams-driver-item")))
```

</details>

-### Scrape F1 drivers
+### Scrape F1 Academy drivers

-Use the same URL as in the previous exercise, but this time print a total count of F1 drivers.
+Use the same URL as in the previous exercise, but this time print a total count of F1 Academy drivers.

<details>
<summary>Solution</summary>
@@ -154,13 +157,13 @@ Use the same URL as in the previous exercise, but this time print a total count
import httpx
from bs4 import BeautifulSoup

-url = "https://www.formula1.com/en/teams"
+url = "https://www.f1academy.com/Racing-Series/Teams"
response = httpx.get(url)
response.raise_for_status()

html_code = response.text
soup = BeautifulSoup(html_code, "html.parser")
-print(len(soup.select(".f1-team-driver-name")))
+print(len(soup.select(".driver")))
```

</details>
