You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
@@ -20,148 +20,202 @@ As a first step, let's try counting how many products are on the listing page.
20
20
21
21
## Processing HTML
22
22
23
-
After downloading, the entire HTML is available in our program as a string. We can print it to the screen or save it to a file, but not much more. However, since it's a string, could we use [string operations](https://docs.python.org/3/library/stdtypes.html#string-methods) or [regular expressions](https://docs.python.org/3/library/re.html) to count the products?
23
+
After downloading, the entire HTML is available in our program as a string. We can print it to the screen or save it to a file, but not much more. However, since it's a string, could we use [string operations](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String#instance_methods) or [regular expressions](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_expressions) to count the products?
24
24
25
-
While somewhat possible, such an approach is tedious, fragile, and unreliable. To work with HTML, we need a robust tool dedicated to the task: an _HTML parser_. It takes a text with HTML markup and turns it into a tree of Python objects.
25
+
While somewhat possible, such an approach is tedious, fragile, and unreliable. To work with HTML, we need a robust tool dedicated to the task: an _HTML parser_. It takes a text with HTML markup and turns it into a tree of JavaScript objects.
26
26
27
27
:::info Why regex can't parse HTML
28
28
29
29
While [Bobince's infamous StackOverflow answer](https://stackoverflow.com/a/1732454/325365) is funny, it doesn't go much into explaining. In formal language theory, HTML's hierarchical and nested structure makes it a [context-free language](https://en.wikipedia.org/wiki/Context-free_language). Regular expressions match patterns in [regular languages](https://en.wikipedia.org/wiki/Regular_language), which are much simpler. This difference makes it hard for a regex to handle HTML's nested tags. HTML's complex syntax rules and various edge cases also add to the difficulty.
30
30
31
31
:::
32
32
33
-
We'll choose [Beautiful Soup](https://beautiful-soup-4.readthedocs.io/) as our parser, as it's a popular library renowned for its ability to process even non-standard, broken markup. This is useful for scraping, because real-world websites often contain all sorts of errors and discrepancies.
33
+
We'll choose [Cheerio](https://cheerio.js.org/) as our parser, as it's a popular library which can process even non-standard, broken markup. This is useful for scraping, because real-world websites often contain all sorts of errors and discrepancies. In the project directory, we'll run the following to install the Cheerio package:
Now let's use it for parsing the HTML. The `BeautifulSoup` object allows us to work with the HTML elements in a structured way. As a demonstration, we'll first get the `<h1>` element, which represents the main heading of the page.
42
+
:::tip Installing packages
43
+
44
+
Being comfortable around installing Node.js packages is a prerequisite of this course, but if you wouldn't say no to a recap, we recommend [An introduction to the npm package manager](https://nodejs.org/en/learn/getting-started/an-introduction-to-the-npm-package-manager) tutorial from the official Node.js documentation.
45
+
46
+
:::
47
+
48
+
Now let's import the package and use it for parsing the HTML. The `cheerio` module allows us to work with the HTML elements in a structured way. As a demonstration, we'll first get the `<h1>` element, which represents the main heading of the page.
Our code lists all `h1` elements it can find on the page. It's the case that there's just one, so in the result we can see a list with a single item. What if we want to print just the text? Let's change the end of the program to the following:
95
+
Our code prints a Cheerio object. It's something like an array of all `h1` elements Cheerio can find in the HTML we gave it. It's the case that there's just one, so we can see only a single item in the container.
96
+
97
+
The item has many properties, such as references to its parent or sibling elements, but most importantly, its name is `h1` and in the `children` property, it contains a single text element. Now let's print just the text. Let's change our program to the following:
If we run our scraper again, it prints the text of the first `h1` element:
115
+
Thanks to the nature of the Cheerio object we don't have to explicitly find the first element. If we call `.text()`, it automatically assumes we want to work with the first element in the collection. Thus, if we run our scraper again, it prints the text of the first `h1` element:
76
116
77
117
```text
78
-
$ python main.py
118
+
$ node index.js
79
119
Sales
80
120
```
81
121
82
122
:::note Dynamic websites
83
123
84
-
The Warehouse returns full HTML in its initial response, but many other sites add content via JavaScript after the page loads or after user interaction. In such cases, what we see in DevTools may differ from `response.text` in Python. Learn how to handle these scenarios in our [API Scraping](../api_scraping/index.md) and [Puppeteer & Playwright](../puppeteer_playwright/index.md) courses.
124
+
The Warehouse returns full HTML in its initial response, but many other sites add some content after the page loads or after user interaction. In such cases, what we'd see in DevTools could differ from `await response.text()` in Node.js. Learn how to handle these scenarios in our [API Scraping](../api_scraping/index.md) and [Puppeteer & Playwright](../puppeteer_playwright/index.md) courses.
85
125
86
126
:::
87
127
88
128
## Using CSS selectors
89
129
90
-
Beautiful Soup's `.select()` method runs a _CSS selector_ against a parsed HTML document and returns all the matching elements. It's like calling `document.querySelectorAll()` in browser DevTools.
130
+
Cheerio's `$()` method runs a _CSS selector_ against a parsed HTML document and returns all the matching elements. It's like calling `document.querySelectorAll()` in browser DevTools.
91
131
92
-
Scanning through [usage examples](https://beautiful-soup-4.readthedocs.io/en/latest/#css-selectors) will help us to figure out code for counting the product cards:
132
+
Scanning through [usage examples](https://cheerio.js.org/docs/basics/selecting) will help us to figure out code for counting the product cards:
In CSS, `.product-item` selects all elements whose `class` attribute contains value `product-item`. We call `soup.select()` with the selector and get back a list of matching elements. Beautiful Soup handles all the complexity of understanding the HTML markup for us. On the last line, we use `len()` to count how many items there is in the list.
150
+
In CSS, `.product-item` selects all elements whose `class` attribute contains value `product-item`. We call `$()` with the selector and get back a container of matching elements. Cheerio handles all the complexity of understanding the HTML markup for us. Then we use `.length` to count how many items there is in the container.
109
151
110
152
```text
111
-
$ python main.py
153
+
$ node index.js
112
154
24
113
155
```
114
156
115
157
That's it! We've managed to download a product listing, parse its HTML, and count how many products it contains. In the next lesson, we'll be looking for a way to extract detailed information about individual products.
116
158
159
+
:::info Cheerio and jQuery
160
+
161
+
The Cheerio documentation frequently mentions something called jQuery. In the medieval days of the internet, when so-called Internet Explorers roamed the untamed plains of simple websites, developers created the first JavaScript frameworks to improve their crude tools and overcome the wild inconsistencies between browsers. Imagine a time when things like `document.querySelectorAll()` didn't even exist. jQuery was the most popular of these frameworks, granting great power to those who knew how to wield it.
162
+
163
+
Cheerio was deliberately designed to mimic jQuery's interface. At the time, nearly everyone was familiar with it, and it felt like the most natural way to walk through HTML elements. jQuery was used in the browser, Cheerio in Node.js. But as time passed, jQuery gradually faded from relevance. In a twist of history, we now learn its syntax only to use Cheerio.
164
+
165
+
:::
166
+
117
167
---
118
168
119
169
<Exercises />
120
170
121
-
### Scrape F1 teams
171
+
### Scrape F1 Academy teams
122
172
123
-
Print a total count of F1 teams listed on this page:
173
+
Print a total count of F1 Academy teams listed on this page:
Our code lists all `h1` elements it can find on the page. It's the case that there's just one, so in the result we can see a list with a single item. What if we want to print just the text? Let's change the end of the program to the following:
66
+
Our code lists all `h1` elements it can find in the HTML we gave it. It's the case that there's just one, so in the result we can see a list with a single item. What if we want to print just the text? Let's change the end of the program to the following:
67
67
68
68
```py
69
69
headings = soup.select("h1")
@@ -80,7 +80,7 @@ Sales
80
80
81
81
:::note Dynamic websites
82
82
83
-
The Warehouse returns full HTML in its initial response, but many other sites add content via JavaScript after the page loads or after user interaction. In such cases, what we see in DevTools may differ from `response.text` in Python. Learn how to handle these scenarios in our [API Scraping](../api_scraping/index.md) and [Puppeteer & Playwright](../puppeteer_playwright/index.md) courses.
83
+
The Warehouse returns full HTML in its initial response, but many other sites add some content after the page loads or after user interaction. In such cases, what we'd see in DevTools could differ from `response.text` in Python. Learn how to handle these scenarios in our [API Scraping](../api_scraping/index.md) and [Puppeteer & Playwright](../puppeteer_playwright/index.md) courses.
84
84
85
85
:::
86
86
@@ -117,12 +117,12 @@ That's it! We've managed to download a product listing, parse its HTML, and coun
117
117
118
118
<Exercises />
119
119
120
-
### Scrape F1 teams
120
+
### Scrape F1 Academy teams
121
121
122
-
Print a total count of F1 teams listed on this page:
122
+
Print a total count of F1 Academy teams listed on this page:
0 commit comments