apify · honzajavorek · Aug 25, 2025 · Jun 24, 2025 · Jun 27, 2025 · Aug 5, 2025
@@ -20,148 +20,205 @@ As a first step, let's try counting how many products are on the listing page.
 
 ## Processing HTML
 
-After downloading, the entire HTML is available in our program as a string. We can print it to the screen or save it to a file, but not much more. However, since it's a string, could we use [string operations](https://docs.python.org/3/library/stdtypes.html#string-methods) or [regular expressions](https://docs.python.org/3/library/re.html) to count the products?
+After downloading, the entire HTML is available in our program as a string. We can print it to the screen or save it to a file, but not much more. However, since it's a string, could we use [string operations](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String#instance_methods) or [regular expressions](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_expressions) to count the products?
 
-While somewhat possible, such an approach is tedious, fragile, and unreliable. To work with HTML, we need a robust tool dedicated to the task: an _HTML parser_. It takes a text with HTML markup and turns it into a tree of Python objects.
+While somewhat possible, such an approach is tedious, fragile, and unreliable. To work with HTML, we need a robust tool dedicated to the task: an _HTML parser_. It takes a text with HTML markup and turns it into a tree of JavaScript objects.
 
 :::info Why regex can't parse HTML
 
-While [Bobince's infamous StackOverflow answer](https://stackoverflow.com/a/1732454/325365) is funny, it doesn't go much into explaining. In formal language theory, HTML's hierarchical and nested structure makes it a [context-free language](https://en.wikipedia.org/wiki/Context-free_language). Regular expressions match patterns in [regular languages](https://en.wikipedia.org/wiki/Regular_language), which are much simpler. This difference makes it hard for a regex to handle HTML's nested tags. HTML's complex syntax rules and various edge cases also add to the difficulty.
+While [Bobince's infamous StackOverflow answer](https://stackoverflow.com/a/1732454/325365) is funny, it doesn't go very deep into the reasoning:
+
+- In **formal language theory**, HTML's hierarchical, nested structure makes it a [context-free language](https://en.wikipedia.org/wiki/Context-free_language). **Regular expressions**, by contrast, match patterns in [regular languages](https://en.wikipedia.org/wiki/Regular_language), which are much simpler.
+- Because of this difference, regex alone struggles with HTML's nested tags. On top of that, HTML has **complex syntax rules** and countless **edge cases**, which only add to the difficulty.
 
 :::
 
-We'll choose [Beautiful Soup](https://beautiful-soup-4.readthedocs.io/) as our parser, as it's a popular library renowned for its ability to process even non-standard, broken markup. This is useful for scraping, because real-world websites often contain all sorts of errors and discrepancies.
+We'll choose [Cheerio](https://cheerio.js.org/) as our parser, as it's a popular library which can process even non-standard, broken markup. This is useful for scraping, because real-world websites often contain all sorts of errors and discrepancies. In the project directory, we'll run the following to install the Cheerio package:
 
 ```text
-$ pip install beautifulsoup4
+$ npm install cheerio --save
+
+added 23 packages, and audited 24 packages in 1s
 ...
-Successfully installed beautifulsoup4-4.0.0 soupsieve-0.0
 ```
 
-Now let's use it for parsing the HTML. The `BeautifulSoup` object allows us to work with the HTML elements in a structured way. As a demonstration, we'll first get the `<h1>` element, which represents the main heading of the page.
+:::tip Installing packages
+
+Being comfortable installing Node.js packages is a prerequisite of this course, but if you'd like a recap, we recommend the [An introduction to the npm package manager](https://nodejs.org/en/learn/getting-started/an-introduction-to-the-npm-package-manager) tutorial from the official Node.js documentation.
+
+:::
+
+Now let's import the package and use it for parsing the HTML. The `cheerio` module allows us to work with the HTML elements in a structured way. As a demonstration, we'll first get the `<h1>` element, which represents the main heading of the page.
 
 ![Element of the main heading](./images/h1.png)
 
 We'll update our code to the following:
 
-```py
-import httpx
-from bs4 import BeautifulSoup
+```js
+import * as cheerio from 'cheerio';
 
-url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
-response = httpx.get(url)
-response.raise_for_status()
+const url = "https://warehouse-theme-metal.myshopify.com/collections/sales";
+const response = await fetch(url);
 
-html_code = response.text
-soup = BeautifulSoup(html_code, "html.parser")
-print(soup.select("h1"))
+if (response.ok) {
+  const html = await response.text();
+  const $ = cheerio.load(html);
+  console.log($("h1"));
+} else {
+  throw new Error(`HTTP ${response.status}`);
+}
 ```
 
 Then let's run the program:
 
 ```text
-$ python main.py
-[<h1 class="collection__title heading h1">Sales</h1>]
+$ node index.js
+LoadedCheerio {
+  '0': <ref *1> Element {
+    parent: Element { ... },
+    prev: Text { ... },
+    next: Element { ... },
+    startIndex: null,
+    endIndex: null,
+# highlight-next-line
+    children: [ [Text] ],
+# highlight-next-line
+    name: 'h1',
+    attribs: [Object: null prototype] { class: 'collection__title heading h1' },
+    type: 'tag',
+    namespace: 'http://www.w3.org/1999/xhtml',
+    'x-attribsNamespace': [Object: null prototype] { class: undefined },
+    'x-attribsPrefix': [Object: null prototype] { class: undefined }
+  },
+  length: 1,
+  ...
+}
 ```
 
-Our code lists all `h1` elements it can find on the page. It's the case that there's just one, so in the result we can see a list with a single item. What if we want to print just the text? Let's change the end of the program to the following:
+Our code prints a Cheerio object. It's something like an array of all `h1` elements Cheerio can find in the HTML we gave it. There happens to be just one, so we see only a single item in the selection.
+
+The item has many properties, such as references to its parent or sibling elements, but most importantly, its name is `h1` and its `children` property contains a single text element. To print just the text, let's change our program to the following:
 
-```py
-headings = soup.select("h1")
-first_heading = headings[0]
-print(first_heading.text)
+```js
+import * as cheerio from 'cheerio';
+
+const url = "https://warehouse-theme-metal.myshopify.com/collections/sales";
+const response = await fetch(url);
+
+if (response.ok) {
+  const html = await response.text();
+  const $ = cheerio.load(html);
+  // highlight-next-line
+  console.log($("h1").text());
+} else {
+  throw new Error(`HTTP ${response.status}`);
+}
 ```
 
-If we run our scraper again, it prints the text of the first `h1` element:
+Thanks to the nature of the Cheerio object, we don't have to explicitly find the first element. Calling `.text()` combines the text of all elements in the selection. If we run our scraper again, it prints the text of the `h1` element:
 
 ```text
-$ python main.py
+$ node index.js
 Sales
 ```
 
 :::note Dynamic websites
 
-The Warehouse returns full HTML in its initial response, but many other sites add content via JavaScript after the page loads or after user interaction. In such cases, what we see in DevTools may differ from `response.text` in Python. Learn how to handle these scenarios in our [API Scraping](../api_scraping/index.md) and [Puppeteer & Playwright](../puppeteer_playwright/index.md) courses.
+The Warehouse returns full HTML in its initial response, but many other sites add some content after the page loads or after user interaction. In such cases, what we'd see in DevTools could differ from `await response.text()` in Node.js. Learn how to handle these scenarios in our [API Scraping](../api_scraping/index.md) and [Puppeteer & Playwright](../puppeteer_playwright/index.md) courses.
 
 :::
 
 ## Using CSS selectors
 
-Beautiful Soup's `.select()` method runs a _CSS selector_ against a parsed HTML document and returns all the matching elements. It's like calling `document.querySelectorAll()` in browser DevTools.
+Cheerio's `$()` function runs a _CSS selector_ against a parsed HTML document and returns all the matching elements. It's like calling `document.querySelectorAll()` in browser DevTools.
 
-Scanning through [usage examples](https://beautiful-soup-4.readthedocs.io/en/latest/#css-selectors) will help us to figure out code for counting the product cards:
+Scanning through [usage examples](https://cheerio.js.org/docs/basics/selecting) will help us to figure out code for counting the product cards:
 
-```py
-import httpx
-from bs4 import BeautifulSoup
+```js
+import * as cheerio from 'cheerio';
 
-url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
-response = httpx.get(url)
-response.raise_for_status()
+const url = "https://warehouse-theme-metal.myshopify.com/collections/sales";
+const response = await fetch(url);
 
-html_code = response.text
-soup = BeautifulSoup(html_code, "html.parser")
-products = soup.select(".product-item")
-print(len(products))
+if (response.ok) {
+  const html = await response.text();
+  const $ = cheerio.load(html);
+  // highlight-next-line
+  console.log($(".product-item").length);
+} else {
+  throw new Error(`HTTP ${response.status}`);
+}
 ```
 
-In CSS, `.product-item` selects all elements whose `class` attribute contains value `product-item`. We call `soup.select()` with the selector and get back a list of matching elements. Beautiful Soup handles all the complexity of understanding the HTML markup for us. On the last line, we use `len()` to count how many items there is in the list.
+In CSS, `.product-item` selects all elements whose `class` attribute contains the value `product-item`. We call `$()` with the selector and get back the matching elements. Cheerio handles all the complexity of understanding the HTML markup for us. Then we use `.length` to count how many items there are in the selection.
 
 ```text
-$ python main.py
+$ node index.js
 24
 ```
 
 That's it! We've managed to download a product listing, parse its HTML, and count how many products it contains. In the next lesson, we'll be looking for a way to extract detailed information about individual products.
 
+:::info Cheerio and jQuery
+
+The Cheerio documentation frequently mentions jQuery. Back when browsers were wildly inconsistent and basic DOM methods like `document.querySelectorAll()` didn't exist, jQuery was the most popular JavaScript framework for web development. It provided a consistent API that worked across all browsers.
+
+Cheerio was designed to mimic jQuery's interface because nearly every developer knew jQuery at the time. jQuery worked in browsers, Cheerio in Node.js. While jQuery has largely faded from modern web development, we now learn its syntax specifically to use Cheerio for server-side HTML manipulation.
+
+:::
+
 ---
 
 <Exercises />
 
-### Scrape F1 teams
+### Scrape F1 Academy teams
 
-Print a total count of F1 teams listed on this page:
+Print a total count of F1 Academy teams listed on this page:
 
 ```text
-https://www.formula1.com/en/teams
+https://www.f1academy.com/Racing-Series/Teams
 ```
 
 <details>
   <summary>Solution</summary>
 
-  ```py
-  import httpx
-  from bs4 import BeautifulSoup
+  ```js
+  import * as cheerio from 'cheerio';
 
-  url = "https://www.formula1.com/en/teams"
-  response = httpx.get(url)
-  response.raise_for_status()
+  const url = "https://www.f1academy.com/Racing-Series/Teams";
+  const response = await fetch(url);
 
-  html_code = response.text
-  soup = BeautifulSoup(html_code, "html.parser")
-  print(len(soup.select(".group")))
+  if (response.ok) {
+    const html = await response.text();
+    const $ = cheerio.load(html);
+    console.log($(".teams-driver-item").length);
+  } else {
+    throw new Error(`HTTP ${response.status}`);
+  }
   ```
 
 </details>
 
-### Scrape F1 drivers
+### Scrape F1 Academy drivers
 
-Use the same URL as in the previous exercise, but this time print a total count of F1 drivers.
+Use the same URL as in the previous exercise, but this time print a total count of F1 Academy drivers.
 
 <details>
   <summary>Solution</summary>
 
-  ```py
-  import httpx
-  from bs4 import BeautifulSoup
+  ```js
+  import * as cheerio from 'cheerio';
 
-  url = "https://www.formula1.com/en/teams"
-  response = httpx.get(url)
-  response.raise_for_status()
+  const url = "https://www.f1academy.com/Racing-Series/Teams";
+  const response = await fetch(url);
 
-  html_code = response.text
-  soup = BeautifulSoup(html_code, "html.parser")
-  print(len(soup.select(".f1-team-driver-name")))
+  if (response.ok) {
+    const html = await response.text();
+    const $ = cheerio.load(html);
+    console.log($(".driver").length);
+  } else {
+    throw new Error(`HTTP ${response.status}`);
+  }
   ```
 
 </details>
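
The lesson text above calls counting products with string operations or regular expressions "tedious, fragile, and unreliable". A minimal sketch of that fragility, using only built-in JavaScript, might look like the following. The HTML snippet and the `countWithRegex` helper are hypothetical and made up for illustration; they are not taken from the real listing page or from the course code.

```js
// Hypothetical markup for illustration; not copied from the real listing page.
const html = `
  <div class="product-item">Speaker</div>
  <div class="product-item product-item--vertical">TV</div>
  <div class='product-item'>Headphones</div>
  <!-- <div class="product-item">Discontinued</div> -->
  <div class="product-item__title">Just a title</div>
`;

// Naive approach: count exact occurrences of the attribute text.
function countWithRegex(markup) {
  const matches = markup.match(/class="product-item"/g);
  return matches === null ? 0 : matches.length;
}

// The snippet has three product cards outside the comment,
// yet the pattern reports 2, and one of those two is the commented-out element.
console.log(countWithRegex(html));
```

The pattern misses the card with an extra class and the card with single-quoted attributes, while counting the commented-out one. A parser with CSS selector support avoids this whole class of problems, which is the motivation for reaching for Cheerio in the lesson.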
@@ -33,7 +33,7 @@ Anyone with basic knowledge of developing programs in JavaScript who wants to st
 ## Requirements
 
 - A macOS, Linux, or Windows machine with a web browser and Node.js installed.
-- Familiarity with JavaScript basics: variables, conditions, loops, functions, strings, lists, dictionaries, files, classes, and exceptions.
+- Familiarity with JavaScript basics: variables, conditions, loops, functions, strings, arrays, objects, files, classes, promises, imports, and exceptions.
 - Comfort with building a Node.js package and installing dependencies with `npm`.
 - Familiarity with running commands in Terminal (macOS/Linux) or Command Prompt (Windows).
 

@@ -25,7 +25,10 @@ While somewhat possible, such an approach is tedious, fragile, and unreliable. T
 
 :::info Why regex can't parse HTML
 
-While [Bobince's infamous StackOverflow answer](https://stackoverflow.com/a/1732454/325365) is funny, it doesn't go much into explaining. In formal language theory, HTML's hierarchical and nested structure makes it a [context-free language](https://en.wikipedia.org/wiki/Context-free_language). Regular expressions match patterns in [regular languages](https://en.wikipedia.org/wiki/Regular_language), which are much simpler. This difference makes it hard for a regex to handle HTML's nested tags. HTML's complex syntax rules and various edge cases also add to the difficulty.
+While [Bobince's infamous StackOverflow answer](https://stackoverflow.com/a/1732454/325365) is funny, it doesn't go very deep into the reasoning:
+
+- In **formal language theory**, HTML's hierarchical, nested structure makes it a [context-free language](https://en.wikipedia.org/wiki/Context-free_language). **Regular expressions**, by contrast, match patterns in [regular languages](https://en.wikipedia.org/wiki/Regular_language), which are much simpler.
+- Because of this difference, regex alone struggles with HTML's nested tags. On top of that, HTML has **complex syntax rules** and countless **edge cases**, which only add to the difficulty.
 
 :::
 
@@ -63,7 +66,7 @@ $ python main.py
 [<h1 class="collection__title heading h1">Sales</h1>]
 ```
 
-Our code lists all `h1` elements it can find on the page. It's the case that there's just one, so in the result we can see a list with a single item. What if we want to print just the text? Let's change the end of the program to the following:
+Our code lists all `h1` elements it can find in the HTML we gave it. There happens to be just one, so the result is a list with a single item. What if we want to print just the text? Let's change the end of the program to the following:
 
 ```py
 headings = soup.select("h1")
@@ -80,7 +83,7 @@ Sales
 
 :::note Dynamic websites
 
-The Warehouse returns full HTML in its initial response, but many other sites add content via JavaScript after the page loads or after user interaction. In such cases, what we see in DevTools may differ from `response.text` in Python. Learn how to handle these scenarios in our [API Scraping](../api_scraping/index.md) and [Puppeteer & Playwright](../puppeteer_playwright/index.md) courses.
+The Warehouse returns full HTML in its initial response, but many other sites add some content after the page loads or after user interaction. In such cases, what we'd see in DevTools could differ from `response.text` in Python. Learn how to handle these scenarios in our [API Scraping](../api_scraping/index.md) and [Puppeteer & Playwright](../puppeteer_playwright/index.md) courses.
 
 :::
 
@@ -117,12 +120,12 @@ That's it! We've managed to download a product listing, parse its HTML, and coun
 
 <Exercises />
 
-### Scrape F1 teams
+### Scrape F1 Academy teams
 
-Print a total count of F1 teams listed on this page:
+Print a total count of F1 Academy teams listed on this page:
 
 ```text
-https://www.formula1.com/en/teams
+https://www.f1academy.com/Racing-Series/Teams
 ```
 
 <details>
@@ -132,20 +135,20 @@ https://www.formula1.com/en/teams
   import httpx
   from bs4 import BeautifulSoup
 
-  url = "https://www.formula1.com/en/teams"
+  url = "https://www.f1academy.com/Racing-Series/Teams"
   response = httpx.get(url)
   response.raise_for_status()
 
   html_code = response.text
   soup = BeautifulSoup(html_code, "html.parser")
-  print(len(soup.select(".group")))
+  print(len(soup.select(".teams-driver-item")))
   ```
 
 </details>
 
-### Scrape F1 drivers
+### Scrape F1 Academy drivers
 
-Use the same URL as in the previous exercise, but this time print a total count of F1 drivers.
+Use the same URL as in the previous exercise, but this time print a total count of F1 Academy drivers.
 
 <details>
   <summary>Solution</summary>
@@ -154,13 +157,13 @@ Use the same URL as in the previous exercise, but this time print a total count
   import httpx
   from bs4 import BeautifulSoup
 
-  url = "https://www.formula1.com/en/teams"
+  url = "https://www.f1academy.com/Racing-Series/Teams"
   response = httpx.get(url)
   response.raise_for_status()
 
   html_code = response.text
   soup = BeautifulSoup(html_code, "html.parser")
-  print(len(soup.select(".f1-team-driver-name")))
+  print(len(soup.select(".driver")))
   ```
 
 </details>
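
The Python solutions in this diff call `response.raise_for_status()`, while the JavaScript versions branch on `response.ok` and throw in the `else` branch. As a sketch only, the same check could be kept in one place with a small helper. The name `ensureOk` is hypothetical and not part of the course or of the Fetch API, and the plain objects below merely stand in for real `fetch()` responses so that the example runs without network access.

```js
// Hypothetical helper: throws for unsuccessful responses, otherwise returns the response.
function ensureOk(response) {
  if (!response.ok) {
    throw new Error(`HTTP ${response.status}`);
  }
  return response;
}

// Stand-in objects with the same `ok` and `status` fields a fetch() response has.
const success = { ok: true, status: 200 };
const failure = { ok: false, status: 404 };

console.log(ensureOk(success).status); // prints 200

try {
  ensureOk(failure);
} catch (error) {
  console.log(error.message); // prints "HTTP 404"
}
```

With a real request, this would presumably be used as `const response = ensureOk(await fetch(url));`, which keeps the happy path unindented. Whether that reads better than the explicit `if`/`else` used in the lesson is a matter of taste.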