Commit d6d4f31

fix: update parsing to be about JS

1 parent 7b63fe7 · commit d6d4f31

4 files changed: +138 −86 lines changed

sources/academy/webscraping/scraping_basics_javascript2/04_downloading_html.md

Lines changed: 15 additions & 3 deletions
@@ -69,9 +69,15 @@ All is OK, mate

:::info Troubleshooting

-If you see `ReferenceError: require is not defined in ES module scope, you can use import instead`, double check that in your `package.json` the type property is set to `module`.
+If you see errors or for any other reason cannot run the code above, it means that your environment isn't set up correctly. We're sorry, but figuring out the issue is out of scope of this course.

-If you see other errors or for any other reason cannot run the code above, it means that your environment isn't set up correctly. We're sorry, but figuring out the issue is out of scope of this course.
+Double check that in your `package.json` the `type` property is set to `module`; otherwise you'll get the following warning:
+
+```text
+[MODULE_TYPELESS_PACKAGE_JSON] Warning: Module type of file:///Users/.../product-scraper/index.js is not specified and it doesn't parse as CommonJS.
+Reparsing as ES module because module syntax was detected. This incurs a performance overhead.
+To eliminate this warning, add "type": "module" to /Users/.../product-scraper/package.json.
+```

:::
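
For reference, a minimal `package.json` with the `type` property set could look as follows — an editor's sketch only; the name and version are hypothetical placeholders:

```json
{
  "name": "product-scraper",
  "version": "1.0.0",
  "type": "module"
}
```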

@@ -85,6 +91,12 @@ const response = await fetch(url);
console.log(await response.text());
```

+:::tip Asynchronous flow
+
+First time seeing `await`? It's a modern syntax for working with promises. See the [JavaScript Asynchronous Programming and Callbacks](https://nodejs.org/en/learn/asynchronous-work/javascript-asynchronous-programming-and-callbacks) and [Discover Promises in Node.js](https://nodejs.org/en/learn/asynchronous-work/discover-promises-in-nodejs) tutorials in the official Node.js documentation for more.
+
+:::
+
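
For comparison, here's what the same download looks like with explicit promise chaining instead of `await` — an editor's sketch, equivalent to the program above:

```js
// Without await, each step runs inside a .then() callback.
fetch("https://warehouse-theme-metal.myshopify.com/collections/sales")
  .then((response) => response.text())
  .then((html) => console.log(html));
```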
If we run the program now, it should print the downloaded HTML:

```text
@@ -225,7 +237,7 @@ https://warehouse-theme-metal.myshopify.com/collections/sales

### Download an image as a file

-Download a product image, then save it on your disk as a file. While HTML is _textual_ content, images are _binary_. You may want to scan through the [Fetch API documentation](https://developer.mozilla.org/en-US/docs/Web/API/Fetch_API/Using_Fetch#reading_the_response_body) for guidance. Especially check `Response.arrayBuffer()`. You can use this URL pointing to an image of a TV:
+Download a product image, then save it on your disk as a file. While HTML is _textual_ content, images are _binary_. You may want to scan through the [Fetch API documentation](https://developer.mozilla.org/en-US/docs/Web/API/Fetch_API/Using_Fetch#reading_the_response_body) and the [Writing files with Node.js](https://nodejs.org/en/learn/manipulating-files/writing-files-with-nodejs) tutorial for guidance. Especially check `Response.arrayBuffer()`. You can use this URL pointing to an image of a TV:

```text
https://warehouse-theme-metal.myshopify.com/cdn/shop/products/sonyxbr55front_f72cc8ff-fcd6-4141-b9cc-e1320f867785.jpg
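```

One possible shape of a solution, as an editor's sketch rather than the course's official answer — the `tv.jpg` filename is an arbitrary choice:

```js
import { writeFile } from 'node:fs/promises';

const url = "https://warehouse-theme-metal.myshopify.com/cdn/shop/products/sonyxbr55front_f72cc8ff-fcd6-4141-b9cc-e1320f867785.jpg";
const response = await fetch(url);

if (response.ok) {
  // Images are binary, so read the body as an ArrayBuffer instead of text.
  const buffer = Buffer.from(await response.arrayBuffer());
  await writeFile("tv.jpg", buffer);
} else {
  throw new Error(`HTTP ${response.status}`);
}
```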

sources/academy/webscraping/scraping_basics_javascript2/05_parsing_html.md

Lines changed: 111 additions & 71 deletions
@@ -20,162 +20,202 @@ As a first step, let's try counting how many products are on the listing page.

## Processing HTML

-After downloading, the entire HTML is available in our program as a string. We can print it to the screen or save it to a file, but not much more. However, since it's a string, could we use [string operations](https://docs.python.org/3/library/stdtypes.html#string-methods) or [regular expressions](https://docs.python.org/3/library/re.html) to count the products?
+After downloading, the entire HTML is available in our program as a string. We can print it to the screen or save it to a file, but not much more. However, since it's a string, could we use [string operations](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String#instance_methods) or [regular expressions](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_expressions) to count the products?

-While somewhat possible, such an approach is tedious, fragile, and unreliable. To work with HTML, we need a robust tool dedicated to the task: an _HTML parser_. It takes a text with HTML markup and turns it into a tree of Python objects.
+While somewhat possible, such an approach is tedious, fragile, and unreliable. To work with HTML, we need a robust tool dedicated to the task: an _HTML parser_. It takes a text with HTML markup and turns it into a tree of JavaScript objects.

:::info Why regex can't parse HTML

While [Bobince's infamous StackOverflow answer](https://stackoverflow.com/a/1732454/325365) is funny, it doesn't go much into explaining. In formal language theory, HTML's hierarchical and nested structure makes it a [context-free language](https://en.wikipedia.org/wiki/Context-free_language). Regular expressions match patterns in [regular languages](https://en.wikipedia.org/wiki/Regular_language), which are much simpler. This difference makes it hard for a regex to handle HTML's nested tags. HTML's complex syntax rules and various edge cases also add to the difficulty.

:::
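
To make the fragility concrete, here's a naive counting attempt — an editor's sketch with a made-up snippet of markup; the regex silently depends on attribute order and quoting:

```js
const html = '<div class="product-item">A</div><div class="product-item product-item--sale">B</div>';

// Count occurrences of the class in the raw string. This breaks as soon as
// the attribute uses single quotes, extra whitespace, or a different order,
// and it can also match the string inside embedded JavaScript or comments.
const matches = html.match(/class="[^"]*product-item[^"]*"/g) ?? [];
console.log(matches.length); // 2
```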

-We'll choose [Beautiful Soup](https://beautiful-soup-4.readthedocs.io/) as our parser, as it's a popular library renowned for its ability to process even non-standard, broken markup. This is useful for scraping, because real-world websites often contain all sorts of errors and discrepancies.
+We'll choose [Cheerio](https://cheerio.js.org/) as our parser, as it's a popular library which can process even non-standard, broken markup. This is useful for scraping, because real-world websites often contain all sorts of errors and discrepancies. In the project directory, we'll run the following to install the Cheerio package:

```text
-$ pip install beautifulsoup4
+$ npm install cheerio
+
+added 23 packages, and audited 24 packages in 1s
...
-Successfully installed beautifulsoup4-4.0.0 soupsieve-0.0
```

-<!--
:::tip Installing packages

-Being comfortable around Python project setup and installing packages is a prerequisite of this course, but if you wouldn't say no to a recap, we recommend the [Installing Packages](https://packaging.python.org/en/latest/tutorials/installing-packages/) tutorial from the official Python Packaging User Guide.
+Being comfortable around installing Node.js packages is a prerequisite of this course, but if you wouldn't say no to a recap, we recommend the [An introduction to the npm package manager](https://nodejs.org/en/learn/getting-started/an-introduction-to-the-npm-package-manager) tutorial from the official Node.js documentation.

:::

-:::info Troubleshooting
-
-If you see other errors or for any other reason cannot run the code above, it means that your environment isn't set up correctly. We're sorry, but figuring out the issue is out of scope of this course.
-
-:::
--->
-
-Now let's use it for parsing the HTML. The `BeautifulSoup` object allows us to work with the HTML elements in a structured way. As a demonstration, we'll first get the `<h1>` element, which represents the main heading of the page.
+Now let's import the package and use it for parsing the HTML. The `cheerio` module allows us to work with the HTML elements in a structured way. As a demonstration, we'll first get the `<h1>` element, which represents the main heading of the page.

![Element of the main heading](./images/h1.png)

We'll update our code to the following:

-```py
-import httpx
-from bs4 import BeautifulSoup
+```js
+import * as cheerio from 'cheerio';

-url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
-response = httpx.get(url)
-response.raise_for_status()
+const url = "https://warehouse-theme-metal.myshopify.com/collections/sales";
+const response = await fetch(url);

-html_code = response.text
-soup = BeautifulSoup(html_code, "html.parser")
-print(soup.select("h1"))
+if (response.ok) {
+  const html = await response.text();
+  const $ = cheerio.load(html);
+  console.log($("h1"));
+} else {
+  throw new Error(`HTTP ${response.status}`);
+}
```

Then let's run the program:

```text
-$ python main.py
-[<h1 class="collection__title heading h1">Sales</h1>]
+$ node index.js
+LoadedCheerio {
+  '0': <ref *1> Element {
+    parent: Element { ... },
+    prev: Text { ... },
+    next: Element { ... },
+    startIndex: null,
+    endIndex: null,
+    # highlight-next-line
+    children: [ [Text] ],
+    # highlight-next-line
+    name: 'h1',
+    attribs: [Object: null prototype] { class: 'collection__title heading h1' },
+    type: 'tag',
+    namespace: 'http://www.w3.org/1999/xhtml',
+    'x-attribsNamespace': [Object: null prototype] { class: undefined },
+    'x-attribsPrefix': [Object: null prototype] { class: undefined }
+  },
+  length: 1,
+  ...
+}
```

-Our code lists all `h1` elements it can find on the page. It's the case that there's just one, so in the result we can see a list with a single item. What if we want to print just the text? Let's change the end of the program to the following:
+Our code prints a Cheerio object. It's something like an array of all `h1` elements Cheerio can find in the HTML we gave it. It's the case that there's just one, so we can see only a single item in the container.
+
+The item has many properties, such as references to its parent or sibling elements, but most importantly, its name is `h1`, and in the `children` property it contains a single text element. Now let's print just the text. Let's change our program to the following:
+
+```js
+import * as cheerio from 'cheerio';

-```py
-headings = soup.select("h1")
-first_heading = headings[0]
-print(first_heading.text)
+const url = "https://warehouse-theme-metal.myshopify.com/collections/sales";
+const response = await fetch(url);
+
+if (response.ok) {
+  const html = await response.text();
+  const $ = cheerio.load(html);
+  // highlight-next-line
+  console.log($("h1").text());
+} else {
+  throw new Error(`HTTP ${response.status}`);
+}
```

-If we run our scraper again, it prints the text of the first `h1` element:
+Thanks to the nature of the Cheerio object, we don't have to explicitly find the first element. Calling `.text()` returns the combined text of all the elements in the collection, and since there's just one `h1`, we get its text alone. Thus, if we run our scraper again, it prints the text of the `h1` element:

```text
-$ python main.py
+$ node index.js
Sales
```
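
If you're curious about the `name` and `children` properties from the dump above, you can unwrap the raw document node that Cheerio works with — a quick editor's sketch using a made-up piece of HTML:

```js
import * as cheerio from 'cheerio';

const $ = cheerio.load('<h1 class="heading">Sales</h1>');

// .get(0) returns the underlying DOM node of the first matched element.
const node = $("h1").get(0);
console.log(node.name);             // 'h1'
console.log(node.children[0].data); // 'Sales' — the contents of the text node
```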

:::note Dynamic websites

-The Warehouse returns full HTML in its initial response, but many other sites add content via JavaScript after the page loads or after user interaction. In such cases, what we see in DevTools may differ from `response.text` in Python. Learn how to handle these scenarios in our [API Scraping](../api_scraping/index.md) and [Puppeteer & Playwright](../puppeteer_playwright/index.md) courses.
+The Warehouse returns full HTML in its initial response, but many other sites add some content after the page loads or after user interaction. In such cases, what we'd see in DevTools could differ from `await response.text()` in Node.js. Learn how to handle these scenarios in our [API Scraping](../api_scraping/index.md) and [Puppeteer & Playwright](../puppeteer_playwright/index.md) courses.

:::

## Using CSS selectors

-Beautiful Soup's `.select()` method runs a _CSS selector_ against a parsed HTML document and returns all the matching elements. It's like calling `document.querySelectorAll()` in browser DevTools.
+Cheerio's `$()` function runs a _CSS selector_ against a parsed HTML document and returns all the matching elements. It's like calling `document.querySelectorAll()` in browser DevTools.

-Scanning through [usage examples](https://beautiful-soup-4.readthedocs.io/en/latest/#css-selectors) will help us to figure out code for counting the product cards:
+Scanning through [usage examples](https://cheerio.js.org/docs/basics/selecting) will help us figure out the code for counting the product cards:

-```py
-import httpx
-from bs4 import BeautifulSoup
+```js
+import * as cheerio from 'cheerio';

-url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
-response = httpx.get(url)
-response.raise_for_status()
+const url = "https://warehouse-theme-metal.myshopify.com/collections/sales";
+const response = await fetch(url);

-html_code = response.text
-soup = BeautifulSoup(html_code, "html.parser")
-products = soup.select(".product-item")
-print(len(products))
+if (response.ok) {
+  const html = await response.text();
+  const $ = cheerio.load(html);
+  // highlight-next-line
+  console.log($(".product-item").length);
+} else {
+  throw new Error(`HTTP ${response.status}`);
+}
```

122-
In CSS, `.product-item` selects all elements whose `class` attribute contains value `product-item`. We call `soup.select()` with the selector and get back a list of matching elements. Beautiful Soup handles all the complexity of understanding the HTML markup for us. On the last line, we use `len()` to count how many items there is in the list.
150+
In CSS, `.product-item` selects all elements whose `class` attribute contains value `product-item`. We call `$()` with the selector and get back a container of matching elements. Cheerio handles all the complexity of understanding the HTML markup for us. Then we use `.length` to count how many items there is in the container.
123151

124152
```text
125-
$ python main.py
153+
$ node index.js
126154
24
127155
```
128156

129157
That's it! We've managed to download a product listing, parse its HTML, and count how many products it contains. In the next lesson, we'll be looking for a way to extract detailed information about individual products.
130158

159+
:::info Cheerio and jQuery
160+
161+
The Cheerio documentation frequently mentions something called jQuery. In the medieval days of the internet, when so-called Internet Explorers roamed the untamed plains of simple websites, developers created the first JavaScript frameworks to improve their crude tools and overcome the wild inconsistencies between browsers. Imagine a time when things like `document.querySelectorAll()` didn't even exist. jQuery was the most popular of these frameworks, granting great power to those who knew how to wield it.
162+
163+
Cheerio was deliberately designed to mimic jQuery's interface. At the time, nearly everyone was familiar with it, and it felt like the most natural way to walk through HTML elements. jQuery was used in the browser, Cheerio in Node.js. But as time passed, jQuery gradually faded from relevance. In a twist of history, we now learn its syntax only to use Cheerio.
164+
165+
:::
166+
131167
---
132168

133169
<Exercises />
134170

135-
### Scrape F1 teams
171+
### Scrape F1 Academy teams
136172

137-
Print a total count of F1 teams listed on this page:
173+
Print a total count of F1 Academy teams listed on this page:
138174

139175
```text
140-
https://www.formula1.com/en/teams
176+
https://www.f1academy.com/Racing-Series/Teams
141177
```
142178

143179
<details>
144180
<summary>Solution</summary>
145181

146-
```py
147-
import httpx
148-
from bs4 import BeautifulSoup
182+
```js
183+
import * as cheerio from 'cheerio';
149184

150-
url = "https://www.formula1.com/en/teams"
151-
response = httpx.get(url)
152-
response.raise_for_status()
185+
const url = "https://www.f1academy.com/Racing-Series/Teams";
186+
const response = await fetch(url);
153187

154-
html_code = response.text
155-
soup = BeautifulSoup(html_code, "html.parser")
156-
print(len(soup.select(".group")))
188+
if (response.ok) {
189+
const html = await response.text();
190+
const $ = cheerio.load(html);
191+
console.log($(".teams-driver-item").length);
192+
} else {
193+
throw new Error(`HTTP ${response.status}`);
194+
}
157195
```
158196

159197
</details>
160198

161-
### Scrape F1 drivers
199+
### Scrape F1 Academy drivers
162200

163-
Use the same URL as in the previous exercise, but this time print a total count of F1 drivers.
201+
Use the same URL as in the previous exercise, but this time print a total count of F1 Academy drivers.
164202

165203
<details>
166204
<summary>Solution</summary>
167205

168-
```py
169-
import httpx
170-
from bs4 import BeautifulSoup
206+
```js
207+
import * as cheerio from 'cheerio';
171208

172-
url = "https://www.formula1.com/en/teams"
173-
response = httpx.get(url)
174-
response.raise_for_status()
209+
const url = "https://www.f1academy.com/Racing-Series/Teams";
210+
const response = await fetch(url);
175211

176-
html_code = response.text
177-
soup = BeautifulSoup(html_code, "html.parser")
178-
print(len(soup.select(".f1-team-driver-name")))
212+
if (response.ok) {
213+
const html = await response.text();
214+
const $ = cheerio.load(html);
215+
console.log($(".driver").length);
216+
} else {
217+
throw new Error(`HTTP ${response.status}`);
218+
}
179219
```
180220

181221
</details>

sources/academy/webscraping/scraping_basics_javascript2/index.md

Lines changed: 1 addition & 1 deletion
@@ -33,7 +33,7 @@ Anyone with basic knowledge of developing programs in JavaScript who wants to st

## Requirements

- A macOS, Linux, or Windows machine with a web browser and Node.js installed.
-- Familiarity with JavaScript basics: variables, conditions, loops, functions, strings, lists, dictionaries, files, classes, and exceptions.
+- Familiarity with JavaScript basics: variables, conditions, loops, functions, strings, arrays, objects, files, classes, promises, imports, and exceptions.
- Comfort with building a Node.js package and installing dependencies with `npm`.
- Familiarity with running commands in Terminal (macOS/Linux) or Command Prompt (Windows).

sources/academy/webscraping/scraping_basics_python/05_parsing_html.md

Lines changed: 11 additions & 11 deletions
@@ -63,7 +63,7 @@ $ python main.py
[<h1 class="collection__title heading h1">Sales</h1>]
```

-Our code lists all `h1` elements it can find on the page. It's the case that there's just one, so in the result we can see a list with a single item. What if we want to print just the text? Let's change the end of the program to the following:
+Our code lists all `h1` elements it can find in the HTML we gave it. It's the case that there's just one, so in the result we can see a list with a single item. What if we want to print just the text? Let's change the end of the program to the following:

```py
headings = soup.select("h1")
@@ -80,7 +80,7 @@ Sales

:::note Dynamic websites

-The Warehouse returns full HTML in its initial response, but many other sites add content via JavaScript after the page loads or after user interaction. In such cases, what we see in DevTools may differ from `response.text` in Python. Learn how to handle these scenarios in our [API Scraping](../api_scraping/index.md) and [Puppeteer & Playwright](../puppeteer_playwright/index.md) courses.
+The Warehouse returns full HTML in its initial response, but many other sites add some content after the page loads or after user interaction. In such cases, what we'd see in DevTools could differ from `response.text` in Python. Learn how to handle these scenarios in our [API Scraping](../api_scraping/index.md) and [Puppeteer & Playwright](../puppeteer_playwright/index.md) courses.

:::

@@ -117,12 +117,12 @@ That's it! We've managed to download a product listing, parse its HTML, and coun

<Exercises />

-### Scrape F1 teams
+### Scrape F1 Academy teams

-Print a total count of F1 teams listed on this page:
+Print a total count of F1 Academy teams listed on this page:

```text
-https://www.formula1.com/en/teams
+https://www.f1academy.com/Racing-Series/Teams
```

<details>
@@ -132,20 +132,20 @@ https://www.formula1.com/en/teams
import httpx
from bs4 import BeautifulSoup

-url = "https://www.formula1.com/en/teams"
+url = "https://www.f1academy.com/Racing-Series/Teams"
response = httpx.get(url)
response.raise_for_status()

html_code = response.text
soup = BeautifulSoup(html_code, "html.parser")
-print(len(soup.select(".group")))
+print(len(soup.select(".teams-driver-item")))
```

</details>

-### Scrape F1 drivers
+### Scrape F1 Academy drivers

-Use the same URL as in the previous exercise, but this time print a total count of F1 drivers.
+Use the same URL as in the previous exercise, but this time print a total count of F1 Academy drivers.

<details>
<summary>Solution</summary>

@@ -154,13 +154,13 @@ Use the same URL as in the previous exercise, but this time print a total count
import httpx
from bs4 import BeautifulSoup

-url = "https://www.formula1.com/en/teams"
+url = "https://www.f1academy.com/Racing-Series/Teams"
response = httpx.get(url)
response.raise_for_status()

html_code = response.text
soup = BeautifulSoup(html_code, "html.parser")
-print(len(soup.select(".f1-team-driver-name")))
+print(len(soup.select(".driver")))
```

</details>
