diff --git a/sources/academy/webscraping/scraping_basics_javascript2/05_parsing_html.md b/sources/academy/webscraping/scraping_basics_javascript2/05_parsing_html.md
index 6f96ed2c7..750bf4f1b 100644
--- a/sources/academy/webscraping/scraping_basics_javascript2/05_parsing_html.md
+++ b/sources/academy/webscraping/scraping_basics_javascript2/05_parsing_html.md
@@ -20,9 +20,9 @@ As a first step, let's try counting how many products are on the listing page.
## Processing HTML
-After downloading, the entire HTML is available in our program as a string. We can print it to the screen or save it to a file, but not much more. However, since it's a string, could we use [string operations](https://docs.python.org/3/library/stdtypes.html#string-methods) or [regular expressions](https://docs.python.org/3/library/re.html) to count the products?
+After downloading, the entire HTML is available in our program as a string. We can print it to the screen or save it to a file, but not much more. However, since it's a string, could we use [string operations](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String#instance_methods) or [regular expressions](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_expressions) to count the products?
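+
+Just to illustrate, a naive attempt might count occurrences of a known class name in the raw HTML string. This is only a sketch, assuming the downloaded HTML already sits in a string called `html`:
+
+```js
+// Fragile: the same substring also appears in child elements such as
+// "product-item__title", so the count is likely wrong.
+const matches = html.match(/product-item/g) ?? [];
+console.log(matches.length);
+```
+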
-While somewhat possible, such an approach is tedious, fragile, and unreliable. To work with HTML, we need a robust tool dedicated to the task: an _HTML parser_. It takes a text with HTML markup and turns it into a tree of Python objects.
+While somewhat possible, such an approach is tedious, fragile, and unreliable. To work with HTML, we need a robust tool dedicated to the task: an _HTML parser_. It takes a text with HTML markup and turns it into a tree of JavaScript objects.
:::info Why regex can't parse HTML
@@ -30,138 +30,192 @@ While [Bobince's infamous StackOverflow answer](https://stackoverflow.com/a/1732
:::
-We'll choose [Beautiful Soup](https://beautiful-soup-4.readthedocs.io/) as our parser, as it's a popular library renowned for its ability to process even non-standard, broken markup. This is useful for scraping, because real-world websites often contain all sorts of errors and discrepancies.
+We'll choose [Cheerio](https://cheerio.js.org/) as our parser, as it's a popular library which can process even non-standard, broken markup. This is useful for scraping, because real-world websites often contain all sorts of errors and discrepancies. In the project directory, we'll run the following to install the Cheerio package:
```text
-$ pip install beautifulsoup4
+$ npm install cheerio --save
+
+added 123 packages, and audited 123 packages in 0s
...
-Successfully installed beautifulsoup4-4.0.0 soupsieve-0.0
```
-Now let's use it for parsing the HTML. The `BeautifulSoup` object allows us to work with the HTML elements in a structured way. As a demonstration, we'll first get the `<h1>` element, which represents the main heading of the page.
+:::tip Installing packages
+
+Being comfortable with installing Node.js packages is a prerequisite of this course, but if you wouldn't say no to a recap, we recommend the [An introduction to the npm package manager](https://nodejs.org/en/learn/getting-started/an-introduction-to-the-npm-package-manager) tutorial from the official Node.js documentation.
+
+:::
+
+Now let's import the package and use it for parsing the HTML. The `cheerio` module allows us to work with the HTML elements in a structured way. As a demonstration, we'll first get the `<h1>` element, which represents the main heading of the page.

We'll update our code to the following:
-```py
-import httpx
-from bs4 import BeautifulSoup
+```js
+import * as cheerio from 'cheerio';
-url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
-response = httpx.get(url)
-response.raise_for_status()
+const url = "https://warehouse-theme-metal.myshopify.com/collections/sales";
+const response = await fetch(url);
-html_code = response.text
-soup = BeautifulSoup(html_code, "html.parser")
-print(soup.select("h1"))
+if (response.ok) {
+ const html = await response.text();
+ const $ = cheerio.load(html);
+ console.log($("h1"));
+} else {
+ throw new Error(`HTTP ${response.status}`);
+}
```
Then let's run the program:
```text
-$ python main.py
-[<h1 class="collection__title heading h1">Sales</h1>]
+$ node index.js
+LoadedCheerio {
+ '0': [ Element {
+ parent: Element { ... },
+ prev: Text { ... },
+ next: Element { ... },
+ startIndex: null,
+ endIndex: null,
+# highlight-next-line
+ children: [ [Text] ],
+# highlight-next-line
+ name: 'h1',
+ attribs: [Object: null prototype] { class: 'collection__title heading h1' },
+ type: 'tag',
+ namespace: 'http://www.w3.org/1999/xhtml',
+ 'x-attribsNamespace': [Object: null prototype] { class: undefined },
+ 'x-attribsPrefix': [Object: null prototype] { class: undefined }
+ },
+ length: 1,
+ ...
+}
```
-Our code lists all `h1` elements it can find on the page. It's the case that there's just one, so in the result we can see a list with a single item. What if we want to print just the text? Let's change the end of the program to the following:
+Our code prints a Cheerio object. It behaves like an array of all the `h1` elements Cheerio could find in the HTML we gave it. There's just one, so the selection contains a single item.
+
+The item has many properties, such as references to its parent or sibling elements, but most importantly, its name is `h1`, and its `children` property contains a single text node. Now let's print just that text. We'll change our program to the following:
+
+```js
+import * as cheerio from 'cheerio';
-```py
-headings = soup.select("h1")
-first_heading = headings[0]
-print(first_heading.text)
+const url = "https://warehouse-theme-metal.myshopify.com/collections/sales";
+const response = await fetch(url);
+
+if (response.ok) {
+ const html = await response.text();
+ const $ = cheerio.load(html);
+ // highlight-next-line
+ console.log($("h1").text());
+} else {
+ throw new Error(`HTTP ${response.status}`);
+}
```
-If we run our scraper again, it prints the text of the first `h1` element:
+Thanks to the nature of the Cheerio object, we don't have to explicitly find the first element. Calling `.text()` combines the text of all elements in the selection. If we run our scraper again, it prints the text of the `h1` element:
```text
-$ python main.py
+$ node index.js
Sales
```
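+
+To see how `.text()` behaves when a selection contains more than one element, here's a small standalone sketch. The markup is made up and isn't part of our scraper:
+
+```js
+import * as cheerio from 'cheerio';
+
+// Two headings in the selection – .text() joins the text of both.
+const $ = cheerio.load("<h2>Sales</h2><h2>New arrivals</h2>");
+console.log($("h2").text());          // SalesNew arrivals
+console.log($("h2").first().text());  // Sales
+```
+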
:::note Dynamic websites
-The Warehouse returns full HTML in its initial response, but many other sites add content via JavaScript after the page loads or after user interaction. In such cases, what we see in DevTools may differ from `response.text` in Python. Learn how to handle these scenarios in our [API Scraping](../api_scraping/index.md) and [Puppeteer & Playwright](../puppeteer_playwright/index.md) courses.
+The Warehouse returns full HTML in its initial response, but many other sites add some content after the page loads or after user interaction. In such cases, what we'd see in DevTools could differ from `await response.text()` in Node.js. Learn how to handle these scenarios in our [API Scraping](../api_scraping/index.md) and [Puppeteer & Playwright](../puppeteer_playwright/index.md) courses.
:::
## Using CSS selectors
-Beautiful Soup's `.select()` method runs a _CSS selector_ against a parsed HTML document and returns all the matching elements. It's like calling `document.querySelectorAll()` in browser DevTools.
+Cheerio's `$()` method runs a _CSS selector_ against a parsed HTML document and returns all the matching elements. It's like calling `document.querySelectorAll()` in browser DevTools.
-Scanning through [usage examples](https://beautiful-soup-4.readthedocs.io/en/latest/#css-selectors) will help us to figure out code for counting the product cards:
+Scanning through [usage examples](https://cheerio.js.org/docs/basics/selecting) will help us to figure out code for counting the product cards:
-```py
-import httpx
-from bs4 import BeautifulSoup
+```js
+import * as cheerio from 'cheerio';
-url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
-response = httpx.get(url)
-response.raise_for_status()
+const url = "https://warehouse-theme-metal.myshopify.com/collections/sales";
+const response = await fetch(url);
-html_code = response.text
-soup = BeautifulSoup(html_code, "html.parser")
-products = soup.select(".product-item")
-print(len(products))
+if (response.ok) {
+ const html = await response.text();
+ const $ = cheerio.load(html);
+ // highlight-next-line
+ console.log($(".product-item").length);
+} else {
+ throw new Error(`HTTP ${response.status}`);
+}
```
-In CSS, `.product-item` selects all elements whose `class` attribute contains value `product-item`. We call `soup.select()` with the selector and get back a list of matching elements. Beautiful Soup handles all the complexity of understanding the HTML markup for us. On the last line, we use `len()` to count how many items there is in the list.
+In CSS, `.product-item` selects all elements whose `class` attribute contains the value `product-item`. We call `$()` with the selector and get back the matching elements. Cheerio handles all the complexity of understanding the HTML markup for us. Then we use `.length` to count how many items there are in the selection.
```text
-$ python main.py
+$ node index.js
24
```
That's it! We've managed to download a product listing, parse its HTML, and count how many products it contains. In the next lesson, we'll be looking for a way to extract detailed information about individual products.
+:::info Cheerio and jQuery
+
+The Cheerio documentation frequently mentions something called jQuery. In the medieval days of the internet, when so-called Internet Explorers roamed the untamed plains of simple websites, developers created the first JavaScript frameworks to improve their crude tools and overcome the wild inconsistencies between browsers. Imagine a time when things like `document.querySelectorAll()` didn't even exist. jQuery was the most popular of these frameworks, granting great power to those who knew how to wield it.
+
+Cheerio was deliberately designed to mimic jQuery's interface. At the time, nearly everyone was familiar with it, and it felt like the most natural way to walk through HTML elements. jQuery was used in the browser, Cheerio in Node.js. But as time passed, jQuery gradually faded from relevance. In a twist of history, we now learn its syntax only to use Cheerio.
+
+:::
+
---
-### Scrape F1 teams
+### Scrape F1 Academy teams
-Print a total count of F1 teams listed on this page:
+Print a total count of F1 Academy teams listed on this page:
```text
-https://www.formula1.com/en/teams
+https://www.f1academy.com/Racing-Series/Teams
```
Solution
- ```py
- import httpx
- from bs4 import BeautifulSoup
+ ```js
+ import * as cheerio from 'cheerio';
- url = "https://www.formula1.com/en/teams"
- response = httpx.get(url)
- response.raise_for_status()
+ const url = "https://www.f1academy.com/Racing-Series/Teams";
+ const response = await fetch(url);
- html_code = response.text
- soup = BeautifulSoup(html_code, "html.parser")
- print(len(soup.select(".group")))
+ if (response.ok) {
+ const html = await response.text();
+ const $ = cheerio.load(html);
+ console.log($(".teams-driver-item").length);
+ } else {
+ throw new Error(`HTTP ${response.status}`);
+ }
```
-### Scrape F1 drivers
+### Scrape F1 Academy drivers
-Use the same URL as in the previous exercise, but this time print a total count of F1 drivers.
+Use the same URL as in the previous exercise, but this time print a total count of F1 Academy drivers.
Solution
- ```py
- import httpx
- from bs4 import BeautifulSoup
+ ```js
+ import * as cheerio from 'cheerio';
- url = "https://www.formula1.com/en/teams"
- response = httpx.get(url)
- response.raise_for_status()
+ const url = "https://www.f1academy.com/Racing-Series/Teams";
+ const response = await fetch(url);
- html_code = response.text
- soup = BeautifulSoup(html_code, "html.parser")
- print(len(soup.select(".f1-team-driver-name")))
+ if (response.ok) {
+ const html = await response.text();
+ const $ = cheerio.load(html);
+ console.log($(".driver").length);
+ } else {
+ throw new Error(`HTTP ${response.status}`);
+ }
```
diff --git a/sources/academy/webscraping/scraping_basics_javascript2/06_locating_elements.md b/sources/academy/webscraping/scraping_basics_javascript2/06_locating_elements.md
index 2aa3100e7..9b210d527 100644
--- a/sources/academy/webscraping/scraping_basics_javascript2/06_locating_elements.md
+++ b/sources/academy/webscraping/scraping_basics_javascript2/06_locating_elements.md
@@ -12,46 +12,49 @@ import Exercises from './_exercises.mdx';
---
-In the previous lesson we've managed to print text of the page's main heading or count how many products are in the listing. Let's combine those two. What happens if we print `.text` for each product card?
-
-```py
-import httpx
-from bs4 import BeautifulSoup
-
-url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
-response = httpx.get(url)
-response.raise_for_status()
-
-html_code = response.text
-soup = BeautifulSoup(html_code, "html.parser")
-
-for product in soup.select(".product-item"):
- print(product.text)
+In the previous lesson, we managed to print the text of the page's main heading and count how many products are in the listing. Let's combine those two. What happens if we print `.text()` for each product card?
+
+```js
+import * as cheerio from 'cheerio';
+
+const url = "https://warehouse-theme-metal.myshopify.com/collections/sales";
+const response = await fetch(url);
+
+if (response.ok) {
+ const html = await response.text();
+ const $ = cheerio.load(html);
+ // highlight-start
+ for (const element of $(".product-item").toArray()) {
+ console.log($(element).text());
+ }
+ // highlight-end
+} else {
+ throw new Error(`HTTP ${response.status}`);
+}
```
-Well, it definitely prints _something_…
+Calling [`toArray()`](https://cheerio.js.org/docs/api/classes/Cheerio#toarray) converts the Cheerio selection to a standard JavaScript array. We can then loop over that array and process each selected element.
-```text
-$ python main.py
-Save $25.00
+Cheerio requires us to wrap each element with `$()` again before we can work with it further, and then we call `.text()`. If we run the code, it… well, it definitely prints _something_…
+```text
+$ node index.js
-JBL
-JBL Flip 4 Waterproof Portable Bluetooth Speaker
+ JBL
+JBL Flip 4 Waterproof Portable Bluetooth Speaker
-Black
-+7
+ Black
-Blue
+ +7
-+6
+ Blue
-Grey
+ +6
...
```
@@ -65,84 +68,54 @@ As in the browser DevTools lessons, we need to change the code so that it locate
We should be looking for elements which have the `product-item__title` and `price` classes. We already know how that translates to CSS selectors:
-```py
-import httpx
-from bs4 import BeautifulSoup
+```js
+import * as cheerio from 'cheerio';
+
+const url = "https://warehouse-theme-metal.myshopify.com/collections/sales";
+const response = await fetch(url);
-url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
-response = httpx.get(url)
-response.raise_for_status()
+if (response.ok) {
+ const html = await response.text();
+ const $ = cheerio.load(html);
-html_code = response.text
-soup = BeautifulSoup(html_code, "html.parser")
+ for (const element of $(".product-item").toArray()) {
+ const $productItem = $(element);
-for product in soup.select(".product-item"):
- titles = product.select(".product-item__title")
- first_title = titles[0].text
+ const $title = $productItem.find(".product-item__title");
+ const title = $title.text();
- prices = product.select(".price")
- first_price = prices[0].text
+ const $price = $productItem.find(".price");
+ const price = $price.text();
- print(first_title, first_price)
+ console.log(`${title} | ${price}`);
+ }
+} else {
+ throw new Error(`HTTP ${response.status}`);
+}
```
Let's run the program now:
```text
-$ python main.py
+$ node index.js
-JBL Flip 4 Waterproof Portable Bluetooth Speaker
-Sale price$74.95
-Sony XBR-950G BRAVIA 4K HDR Ultra HD TV
-Sale priceFrom $1,398.00
+JBL Flip 4 Waterproof Portable Bluetooth Speaker |
+ Sale price$74.95
+Sony XBR-950G BRAVIA 4K HDR Ultra HD TV |
+ Sale priceFrom $1,398.00
...
```
There's still some room for improvement, but it's already much better!
-## Locating a single element
-
-Often, we want to assume in our code that a certain element exists only once. It's a bit tedious to work with lists when you know you're looking for a single element. For this purpose, Beautiful Soup offers the `.select_one()` method. Like `document.querySelector()` in browser DevTools, it returns just one result or `None`. Let's simplify our code!
-
-```py
-import httpx
-from bs4 import BeautifulSoup
-
-url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
-response = httpx.get(url)
-response.raise_for_status()
-
-html_code = response.text
-soup = BeautifulSoup(html_code, "html.parser")
-
-for product in soup.select(".product-item"):
- title = product.select_one(".product-item__title").text
- price = product.select_one(".price").text
- print(title, price)
-```
-
-This program does the same as the one we already had, but its code is more concise.
-
-:::note Fragile code
+:::info Dollar sign variable names
-We assume that the selectors we pass to the `select()` or `select_one()` methods return at least one element. If they don't, calling `[0]` on an empty list or `.text` on `None` would crash the program. If you perform type checking on your Python program, the code examples above will trigger warnings about this.
-
-Not handling these cases allows us to keep the code examples more succinct. Additionally, if we expect the selectors to return elements but they suddenly don't, it usually means the website has changed since we wrote our scraper. Letting the program crash in such cases is a valid way to notify ourselves that we need to fix it.
+In jQuery and Cheerio, the core idea is a collection that wraps selected objects, usually HTML elements. To tell these wrapped selections apart from plain arrays, strings or other objects, it's common to start variable names with a dollar sign. This is just a naming convention to improve readability. The dollar sign has no special meaning and works like any other character in a variable name.
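+
+For example, in a snippet like the following, the dollar sign makes it obvious which variable holds a Cheerio selection and which holds a plain string:
+
+```js
+const $price = $(".price");   // a Cheerio selection wrapping elements
+const price = $price.text();  // a plain string extracted from it
+```
+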
:::
## Precisely locating price
-In the output we can see that the price isn't located precisely:
-
-```text
-JBL Flip 4 Waterproof Portable Bluetooth Speaker
-Sale price$74.95
-Sony XBR-950G BRAVIA 4K HDR Ultra HD TV
-Sale priceFrom $1,398.00
-...
-```
-
-For each product, our scraper also prints the text `Sale price`. Let's look at the HTML structure again. Each bit containing the price looks like this:
+In the output we can see that the price isn't located precisely. For each product, our scraper also prints the text `Sale price`. Let's look at the HTML structure again. Each bit containing the price looks like this:
```html
@@ -151,58 +124,77 @@ For each product, our scraper also prints the text `Sale price`. Let's look at t
```
-When translated to a tree of Python objects, the element with class `price` will contain several _nodes_:
+When translated to a tree of JavaScript objects, the element with class `price` will contain several _nodes_:
- Textual node with white space,
- a `span` HTML element,
- a textual node representing the actual amount and possibly also white space.
-We can use Beautiful Soup's `.contents` property to access individual nodes. It returns a list of nodes like this:
+We can use Cheerio's [`.contents()`](https://cheerio.js.org/docs/api/classes/Cheerio#contents) method to access individual nodes. It returns a Cheerio selection of the nodes, which looks like this:
-```py
-["\n", Sale price, "$74.95"]
+```text
+LoadedCheerio {
+ '0': [ Text {
+ parent: Element { ... },
+ prev: null,
+ next: Element { ... },
+ data: '\n ',
+ type: 'text'
+ },
+  '1': [ Element {
+ parent: Element { ... },
+    prev: Text { ... },
+ next: Text { ... },
+ children: [ [Text] ],
+ name: 'span',
+ type: 'tag',
+ ...
+ },
+  '2': [ Text {
+ parent: Element { ... },
+    prev: Element { ... },
+ next: null,
+ data: '$74.95',
+ type: 'text'
+ },
+ length: 3,
+ ...
+}
```
-It seems like we can read the last element to get the actual amount from a list like the above. Let's fix our program:
+It seems like we can read the last element to get the actual amount. Let's fix our program:
-```py
-import httpx
-from bs4 import BeautifulSoup
+```js
+import * as cheerio from 'cheerio';
-url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
-response = httpx.get(url)
-response.raise_for_status()
+const url = "https://warehouse-theme-metal.myshopify.com/collections/sales";
+const response = await fetch(url);
-html_code = response.text
-soup = BeautifulSoup(html_code, "html.parser")
+if (response.ok) {
+ const html = await response.text();
+ const $ = cheerio.load(html);
-for product in soup.select(".product-item"):
- title = product.select_one(".product-item__title").text
- price = product.select_one(".price").contents[-1]
- print(title, price)
-```
+ for (const element of $(".product-item").toArray()) {
+ const $productItem = $(element);
-If we run the scraper now, it should print prices as only amounts:
+ const $title = $productItem.find(".product-item__title");
+ const title = $title.text();
-```text
-$ python main.py
-JBL Flip 4 Waterproof Portable Bluetooth Speaker $74.95
-Sony XBR-950G BRAVIA 4K HDR Ultra HD TV From $1,398.00
-...
-```
-
-## Formatting output
-
-The results seem to be correct, but they're hard to verify because the prices visually blend with the titles. Let's set a different separator for the `print()` function:
+ // highlight-next-line
+ const $price = $productItem.find(".price").contents().last();
+ const price = $price.text();
-```py
-print(title, price, sep=" | ")
+ console.log(`${title} | ${price}`);
+ }
+} else {
+ throw new Error(`HTTP ${response.status}`);
+}
```
-The output is much nicer this way:
+Here we take advantage of the fact that Cheerio selections provide utility methods for accessing items, such as [`.first()`](https://cheerio.js.org/docs/api/classes/Cheerio#first) or [`.last()`](https://cheerio.js.org/docs/api/classes/Cheerio#last). If we run the scraper now, it should print only the amounts as prices:
```text
-$ python main.py
+$ node index.js
JBL Flip 4 Waterproof Portable Bluetooth Speaker | $74.95
Sony XBR-950G BRAVIA 4K HDR Ultra HD TV | From $1,398.00
...
@@ -216,7 +208,7 @@ Great! We have managed to use CSS selectors and walk the HTML tree to get a list
### Scrape Wikipedia
-Download Wikipedia's page with the list of African countries, use Beautiful Soup to parse it, and print short English names of all the states and territories mentioned in all tables. This is the URL:
+Download Wikipedia's page with the list of African countries, use Cheerio to parse it, and print short English names of all the states and territories mentioned in all tables. This is the URL:
```text
https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_in_Africa
@@ -229,30 +221,50 @@ Algeria
Angola
Benin
Botswana
+Burkina Faso
+Burundi
+Cameroon
+Cape Verde
+Central African Republic
+Chad
+Comoros
+Democratic Republic of the Congo
+Republic of the Congo
+Djibouti
...
```
Solution
- ```py
- import httpx
- from bs4 import BeautifulSoup
-
- url = "https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_in_Africa"
- response = httpx.get(url)
- response.raise_for_status()
-
- html_code = response.text
- soup = BeautifulSoup(html_code, "html.parser")
-
- for table in soup.select(".wikitable"):
- for row in table.select("tr"):
- cells = row.select("td")
- if cells:
- third_column = cells[2]
- title_link = third_column.select_one("a")
- print(title_link.text)
+ ```js
+ import * as cheerio from 'cheerio';
+
+ const url = "https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_in_Africa";
+ const response = await fetch(url);
+
+ if (response.ok) {
+ const html = await response.text();
+ const $ = cheerio.load(html);
+
+ for (const tableElement of $(".wikitable").toArray()) {
+ const $table = $(tableElement);
+ const $rows = $table.find("tr");
+
+ for (const rowElement of $rows.toArray()) {
+ const $row = $(rowElement);
+ const $cells = $row.find("td");
+
+ if ($cells.length > 0) {
+ const $thirdColumn = $($cells[2]);
+ const $link = $thirdColumn.find("a").first();
+ console.log($link.text());
+ }
+ }
+ }
+ } else {
+ throw new Error(`HTTP ${response.status}`);
+ }
```
-Because some rows contain [table headers](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/th), we skip processing a row if `table_row.select("td")` doesn't find any [table data](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/td) cells.
+Because some rows contain [table headers](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/th), we skip processing a row if `$row.find("td")` doesn't find any [table data](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/td) cells.
@@ -269,26 +281,31 @@ Simplify the code from previous exercise. Use a single for loop and a single CSS
Solution
- ```py
- import httpx
- from bs4 import BeautifulSoup
+ ```js
+ import * as cheerio from 'cheerio';
- url = "https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_in_Africa"
- response = httpx.get(url)
- response.raise_for_status()
+ const url = "https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_in_Africa";
+ const response = await fetch(url);
- html_code = response.text
- soup = BeautifulSoup(html_code, "html.parser")
+ if (response.ok) {
+ const html = await response.text();
+ const $ = cheerio.load(html);
- for name_cell in soup.select(".wikitable tr td:nth-child(3)"):
- print(name_cell.select_one("a").text)
+ for (const element of $(".wikitable tr td:nth-child(3)").toArray()) {
+ const $nameCell = $(element);
+ const $link = $nameCell.find("a").first();
+ console.log($link.text());
+ }
+ } else {
+ throw new Error(`HTTP ${response.status}`);
+ }
```
### Scrape F1 news
-Download Guardian's page with the latest F1 news, use Beautiful Soup to parse it, and print titles of all the listed articles. This is the URL:
+Download Guardian's page with the latest F1 news, use Cheerio to parse it, and print titles of all the listed articles. This is the URL:
```text
https://www.theguardian.com/sport/formulaone
@@ -306,19 +323,22 @@ Max Verstappen wins Canadian Grand Prix: F1 – as it happened
Solution
- ```py
- import httpx
- from bs4 import BeautifulSoup
+ ```js
+ import * as cheerio from 'cheerio';
- url = "https://www.theguardian.com/sport/formulaone"
- response = httpx.get(url)
- response.raise_for_status()
+ const url = "https://www.theguardian.com/sport/formulaone";
+ const response = await fetch(url);
- html_code = response.text
- soup = BeautifulSoup(html_code, "html.parser")
+ if (response.ok) {
+ const html = await response.text();
+ const $ = cheerio.load(html);
- for title in soup.select("#maincontent ul li h3"):
- print(title.text)
+ for (const element of $("#maincontent ul li h3").toArray()) {
+ console.log($(element).text());
+ }
+ } else {
+ throw new Error(`HTTP ${response.status}`);
+ }
```
diff --git a/sources/academy/webscraping/scraping_basics_javascript2/07_extracting_data.md b/sources/academy/webscraping/scraping_basics_javascript2/07_extracting_data.md
index b84685cf0..7ca821eef 100644
--- a/sources/academy/webscraping/scraping_basics_javascript2/07_extracting_data.md
+++ b/sources/academy/webscraping/scraping_basics_javascript2/07_extracting_data.md
@@ -36,14 +36,14 @@ It's because some products have variants with different prices. Later in the cou
Ideally we'd go and discuss the problem with those who are about to use the resulting data. For their purposes, is the fact that some prices are just minimum prices important? What would be the most useful representation of the range for them? Maybe they'd tell us that it's okay if we just remove the `From` prefix?
```js
-const priceText = price.text().replace("From ", "");
+const priceText = $price.text().replace("From ", "");
```
In other cases, they'd tell us the data must include the range. And in cases when we just don't know, the safest option is to include all the information we have and leave the decision on what's important to later stages. One approach could be having the exact and minimum prices as separate values. If we don't know the exact price, we leave it empty:
```js
const priceRange = { minPrice: null, price: null };
-const priceText = price.text()
+const priceText = $price.text()
if (priceText.startsWith("From ")) {
priceRange.minPrice = priceText.replace("From ", "");
} else {
@@ -70,15 +70,15 @@ if (response.ok) {
const html = await response.text();
const $ = cheerio.load(html);
- $(".product-item").each((i, element) => {
- const productItem = $(element);
+ for (const element of $(".product-item").toArray()) {
+ const $productItem = $(element);
- const title = productItem.find(".product-item__title");
- const titleText = title.text();
+ const $title = $productItem.find(".product-item__title");
+ const title = $title.text();
- const price = productItem.find(".price").contents().last();
+ const $price = $productItem.find(".price").contents().last();
const priceRange = { minPrice: null, price: null };
- const priceText = price.text();
+ const priceText = $price.text();
if (priceText.startsWith("From ")) {
priceRange.minPrice = priceText.replace("From ", "");
} else {
@@ -86,8 +86,8 @@ if (response.ok) {
priceRange.price = priceRange.minPrice;
}
- console.log(`${titleText} | ${priceRange.minPrice} | ${priceRange.price}`);
- });
+ console.log(`${title} | ${priceRange.minPrice} | ${priceRange.price}`);
+ }
} else {
throw new Error(`HTTP ${response.status}`);
}
@@ -100,9 +100,9 @@ Often, the strings we extract from a web page start or end with some amount of w
We call the operation of removing whitespace _trimming_ or _stripping_, and it's so useful in many applications that programming languages and libraries include ready-made tools for it. Let's add JavaScript's built-in [.trim()](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/trim):
```js
-const titleText = title.text().trim();
+const title = $title.text().trim();
-const priceText = price.text().trim();
+const priceText = $price.text().trim();
```
## Removing dollar sign and commas
@@ -124,7 +124,7 @@ The demonstration above is inside the Node.js' [interactive REPL](https://nodejs
We need to remove the dollar sign and the decimal commas. For this type of cleaning, [regular expressions](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_expressions) are often the best tool for the job, but in this case [`.replace()`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/replace) is also sufficient:
```js
-const priceText = price
+const priceText = $price
.text()
.trim()
.replace("$", "")
@@ -137,7 +137,7 @@ Now we should be able to add `parseFloat()`, so that we have the prices not as a
```js
const priceRange = { minPrice: null, price: null };
-const priceText = price.text()
+const priceText = $price.text()
if (priceText.startsWith("From ")) {
priceRange.minPrice = parseFloat(priceText.replace("From ", ""));
} else {
@@ -156,7 +156,7 @@ Great! Only if we didn't overlook an important pitfall called [floating-point er
These errors are small and usually don't matter, but sometimes they can add up and cause unpleasant discrepancies. That's why it's typically best to avoid floating point numbers when working with money. We won't store dollars, but cents:
```js
-const priceText = price
+const priceText = $price
.text()
.trim()
.replace("$", "")
@@ -177,15 +177,15 @@ if (response.ok) {
const html = await response.text();
const $ = cheerio.load(html);
- $(".product-item").each((i, element) => {
- const productItem = $(element);
+ for (const element of $(".product-item").toArray()) {
+ const $productItem = $(element);
- const title = productItem.find(".product-item__title");
- const titleText = title.text().trim();
+ const $title = $productItem.find(".product-item__title");
+    const title = $title.text().trim();
- const price = productItem.find(".price").contents().last();
+ const $price = $productItem.find(".price").contents().last();
const priceRange = { minPrice: null, price: null };
- const priceText = price
+ const priceText = $price
.text()
.trim()
.replace("$", "")
@@ -199,8 +199,8 @@ if (response.ok) {
priceRange.price = priceRange.minPrice;
}
- console.log(`${titleText} | ${priceRange.minPrice} | ${priceRange.price}`);
- });
+ console.log(`${title} | ${priceRange.minPrice} | ${priceRange.price}`);
+ }
} else {
throw new Error(`HTTP ${response.status}`);
}
@@ -258,17 +258,17 @@ Denon AH-C720 In-Ear Headphones | 236
const html = await response.text();
const $ = cheerio.load(html);
- $(".product-item").each((i, element) => {
- const productItem = $(element);
+ for (const element of $(".product-item").toArray()) {
+ const $productItem = $(element);
- const title = productItem.find(".product-item__title");
- const titleText = title.text().trim();
+    const $title = $productItem.find(".product-item__title");
+    const title = $title.text().trim();
- const unitsText = productItem.find(".product-item__inventory").text();
+ const unitsText = $productItem.find(".product-item__inventory").text();
const unitsCount = parseUnitsText(unitsText);
- console.log(`${titleText} | ${unitsCount}`);
- });
+ console.log(`${title} | ${unitsCount}`);
+ }
} else {
throw new Error(`HTTP ${response.status}`);
}
@@ -307,17 +307,17 @@ Simplify the code from previous exercise. Use [regular expressions](https://deve
const html = await response.text();
const $ = cheerio.load(html);
- $(".product-item").each((i, element) => {
- const productItem = $(element);
+ for (const element of $(".product-item").toArray()) {
+ const $productItem = $(element);
- const title = productItem.find(".product-item__title");
- const titleText = title.text().trim();
+ const $title = $productItem.find(".product-item__title");
+ const title = $title.text().trim();
- const unitsText = productItem.find(".product-item__inventory").text();
+ const unitsText = $productItem.find(".product-item__inventory").text();
const unitsCount = parseUnitsText(unitsText);
- console.log(`${titleText} | ${unitsCount}`);
- });
+ console.log(`${title} | ${unitsCount}`);
+ }
} else {
throw new Error(`HTTP ${response.status}`);
}
@@ -369,21 +369,21 @@ Hints:
const html = await response.text();
const $ = cheerio.load(html);
- $("#maincontent ul li").each((i, element) => {
- const article = $(element);
+ for (const element of $("#maincontent ul li").toArray()) {
+ const $article = $(element);
- const titleText = article
+ const title = $article
.find("h3")
.text()
.trim();
- const dateText = article
+ const dateText = $article
.find("time")
.attr("datetime")
.trim();
const date = new Date(dateText);
- console.log(`${titleText} | ${date.toDateString()}`);
- });
+ console.log(`${title} | ${date.toDateString()}`);
+ }
} else {
throw new Error(`HTTP ${response.status}`);
}
diff --git a/sources/academy/webscraping/scraping_basics_javascript2/08_saving_data.md b/sources/academy/webscraping/scraping_basics_javascript2/08_saving_data.md
index e1ad7365a..b98138722 100644
--- a/sources/academy/webscraping/scraping_basics_javascript2/08_saving_data.md
+++ b/sources/academy/webscraping/scraping_basics_javascript2/08_saving_data.md
@@ -13,9 +13,9 @@ unlisted: true
We managed to scrape data about products and print it, with each product separated by a new line and each field separated by the `|` character. This already produces structured text that can be parsed, i.e., read programmatically.
```text
-$ python main.py
-JBL Flip 4 Waterproof Portable Bluetooth Speaker | 74.95 | 74.95
-Sony XBR-950G BRAVIA 4K HDR Ultra HD TV | 1398.00 | None
+$ node index.js
+JBL Flip 4 Waterproof Portable Bluetooth Speaker | 7495 | 7495
+Sony XBR-950G BRAVIA 4K HDR Ultra HD TV | 139800 | null
...
```
@@ -25,222 +25,213 @@ We should use widely popular formats that have well-defined solutions for all th
## Collecting data
-Producing results line by line is an efficient approach to handling large datasets, but to simplify this lesson, we'll store all our data in one variable. This'll take three changes to our program:
-
-```py
-import httpx
-from bs4 import BeautifulSoup
-from decimal import Decimal
-
-url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
-response = httpx.get(url)
-response.raise_for_status()
-
-html_code = response.text
-soup = BeautifulSoup(html_code, "html.parser")
-
-# highlight-next-line
-data = []
-for product in soup.select(".product-item"):
- title = product.select_one(".product-item__title").text.strip()
-
- price_text = (
- product
- .select_one(".price")
- .contents[-1]
- .strip()
- .replace("$", "")
- .replace(",", "")
- )
- if price_text.startswith("From "):
- min_price = Decimal(price_text.removeprefix("From "))
- price = None
- else:
- min_price = Decimal(price_text)
- price = min_price
-
- # highlight-next-line
- data.append({"title": title, "min_price": min_price, "price": price})
-
-# highlight-next-line
-print(data)
+Producing results line by line is an efficient approach to handling large datasets, but to simplify this lesson, we'll store all our data in one variable. This'll take four changes to our program:
+
+```js
+import * as cheerio from 'cheerio';
+
+const url = "https://warehouse-theme-metal.myshopify.com/collections/sales";
+const response = await fetch(url);
+
+if (response.ok) {
+ const html = await response.text();
+ const $ = cheerio.load(html);
+
+ // highlight-next-line
+ const data = $(".product-item").toArray().map(element => {
+ const $productItem = $(element);
+
+ const $title = $productItem.find(".product-item__title");
+ const title = $title.text().trim();
+
+ const $price = $productItem.find(".price").contents().last();
+ const priceRange = { minPrice: null, price: null };
+ const priceText = $price
+ .text()
+ .trim()
+ .replace("$", "")
+ .replace(".", "")
+ .replace(",", "");
+
+ if (priceText.startsWith("From ")) {
+ priceRange.minPrice = parseInt(priceText.replace("From ", ""));
+ } else {
+ priceRange.minPrice = parseInt(priceText);
+ priceRange.price = priceRange.minPrice;
+ }
+
+ // highlight-next-line
+ return { title, ...priceRange };
+ });
+ // highlight-next-line
+ console.log(data);
+} else {
+ throw new Error(`HTTP ${response.status}`);
+}
```
-Before looping over the products, we prepare an empty list. Then, instead of printing each line, we append the data of each product to the list in the form of a Python dictionary. At the end of the program, we print the entire list at once.
+Instead of printing each line, we now return the data for each product as a JavaScript object. We've replaced the `for` loop with [`.map()`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/map), which also iterates over the selection but, in addition, collects all the results and returns them as another array. Near the end of the program, we print this entire array.
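+
+If the `.map()` call feels dense, it does roughly the same as the following explicit loop, where the hypothetical `parseItem()` helper stands in for the body of the arrow function:
+
+```js
+// Rough equivalent of the .map() call above, written as an explicit loop.
+const data = [];
+for (const element of $(".product-item").toArray()) {
+  data.push(parseItem($(element))); // parseItem() is a hypothetical helper
+}
+```
+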
-```text
-$ python main.py
-[{'title': 'JBL Flip 4 Waterproof Portable Bluetooth Speaker', 'min_price': Decimal('74.95'), 'price': Decimal('74.95')}, {'title': 'Sony XBR-950G BRAVIA 4K HDR Ultra HD TV', 'min_price': Decimal('1398.00'), 'price': None}, ...]
-```
+:::tip Advanced syntax
-:::tip Pretty print
+When returning the item object, we use [shorthand property syntax](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Operators/Object_initializer#property_definitions) to set the title, and [spread syntax](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Operators/Spread_syntax) to set the prices. It's the same as if we wrote the following:
-If you find the complex data structures printed by `print()` difficult to read, try using [`pp()`](https://docs.python.org/3/library/pprint.html#pprint.pp) from the `pprint` module instead.
+```js
+{
+ title: title,
+ minPrice: priceRange.minPrice,
+ price: priceRange.price,
+}
+```
:::
-## Saving data as CSV
+The program should now print the results as a single large JavaScript array:
-The CSV format is popular among data analysts because a wide range of tools can import it, including spreadsheets apps like LibreOffice Calc, Microsoft Excel, Apple Numbers, and Google Sheets.
+```text
+$ node index.js
+[
+ {
+ title: 'JBL Flip 4 Waterproof Portable Bluetooth Speaker',
+ minPrice: 7495,
+ price: 7495
+ },
+ {
+ title: 'Sony XBR-950G BRAVIA 4K HDR Ultra HD TV',
+ minPrice: 139800,
+ price: null
+ },
+ ...
+]
+```
-In Python, it's convenient to read and write CSV files, thanks to the [`csv`](https://docs.python.org/3/library/csv.html) standard library module. First let's try something small in the Python's interactive REPL to familiarize ourselves with the basic usage:
+## Saving data as JSON
-```py
->>> import csv
->>> with open("data.csv", "w") as file:
-... writer = csv.DictWriter(file, fieldnames=["name", "age", "hobbies"])
-... writer.writeheader()
-... writer.writerow({"name": "Alice", "age": 24, "hobbies": "kickbox, Python"})
-... writer.writerow({"name": "Bob", "age": 42, "hobbies": "reading, TypeScript"})
-...
-```
+The JSON format is popular primarily among developers. We use it for storing data, configuration files, or as a way to transfer data between programs (e.g., APIs). Its origin stems from the syntax of JavaScript objects, but people now use it across programming languages.
-We first opened a new file for writing and created a `DictWriter()` instance with the expected field names. We instructed it to write the header row first and then added two more rows containing actual data. The code produced a `data.csv` file in the same directory where we're running the REPL. It has the following contents:
+We'll begin by importing the `writeFile` function from the Node.js standard library, so that we can, well, write files:
-```csv title=data.csv
-name,age,hobbies
-Alice,24,"kickbox, Python"
-Bob,42,"reading, TypeScript"
+```js
+import * as cheerio from 'cheerio';
+// highlight-next-line
+import { writeFile } from "fs/promises";
```
-In the CSV format, if values contain commas, we should enclose them in quotes. You can see that the writer automatically handled this.
+Next, instead of printing the data, we'll finish the program by exporting it to JSON. Let's replace the line `console.log(data)` with the following:
-When browsing the directory on macOS, we can see a nice preview of the file's contents, which proves that the file is correct and that other programs can read it as well. If you're using a different operating system, try opening the file with any spreadsheet program you have.
-
-
+```js
+const jsonData = JSON.stringify(data);
+await writeFile('products.json', jsonData);
+```
-Now that's nice, but we didn't want Alice, Bob, kickbox, or TypeScript. What we actually want is a CSV containing `Sony XBR-950G BRAVIA 4K HDR Ultra HD TV`, right? Let's do this! First, let's add `csv` to our imports:
+That's it! If we run our scraper now, it won't display any output, but it will create a `products.json` file in the current working directory, which contains all the data about the listed products:
-```py
-import httpx
-from bs4 import BeautifulSoup
-from decimal import Decimal
-# highlight-next-line
-import csv
+
+```json title=products.json
+[{"title":"JBL Flip 4 Waterproof Portable Bluetooth Speaker","minPrice":7495,"price":7495},{"title":"Sony XBR-950G BRAVIA 4K HDR Ultra HD TV","minPrice":139800,"price":null},...]
```
-Next, instead of printing the data, we'll finish the program by exporting it to CSV. Replace `print(data)` with the following:
+If you skim through the data, you'll notice that the `JSON.stringify()` function handled some potential issues, such as escaping double quotes found in one of the titles by adding a backslash:
-```py
-with open("products.csv", "w") as file:
- writer = csv.DictWriter(file, fieldnames=["title", "min_price", "price"])
- writer.writeheader()
- for row in data:
- writer.writerow(row)
+```json
+{"title":"Sony SACS9 10\" Active Subwoofer","minPrice":15800,"price":15800}
```
-If we run our scraper now, it won't display any output, but it will create a `products.csv` file in the current working directory, which contains all the data about the listed products.
+:::tip Pretty JSON
-
+While a compact JSON file without any whitespace is efficient for computers, it can be difficult for humans to read. You can call `JSON.stringify(data, null, 2)` for prettier output. See the [documentation](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/JSON/stringify) for an explanation of the parameters and more examples.
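+
+For example, with illustrative values:
+
+```js
+const pretty = JSON.stringify({ title: "JBL Flip 4", price: 7495 }, null, 2);
+// {
+//   "title": "JBL Flip 4",
+//   "price": 7495
+// }
+```
+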
-## Saving data as JSON
+:::
-The JSON format is popular primarily among developers. We use it for storing data, configuration files, or as a way to transfer data between programs (e.g., APIs). Its origin stems from the syntax of objects in the JavaScript programming language, which is similar to the syntax of Python dictionaries.
+## Saving data as CSV
-In Python, there's a [`json`](https://docs.python.org/3/library/json.html) standard library module, which is so straightforward that we can start using it in our code right away. We'll need to begin with imports:
+The CSV format is popular among data analysts because a wide range of tools can import it, including spreadsheets apps like LibreOffice Calc, Microsoft Excel, Apple Numbers, and Google Sheets.
-```py
-import httpx
-from bs4 import BeautifulSoup
-from decimal import Decimal
-import csv
-# highlight-next-line
-import json
-```
+Neither JavaScript itself nor Node.js offers anything built-in to read and write CSV, so we'll need to install a library. We'll use [json2csv](https://juanjodiaz.github.io/json2csv/), a _de facto_ standard for working with CSV in JavaScript:
-Next, let’s append one more export to end of the source code of our scraper:
+```text
+$ npm install @json2csv/node --save
-```py
-with open("products.json", "w") as file:
- json.dump(data, file)
+added 123 packages, and audited 123 packages in 0s
+...
```
-That’s it! If we run the program now, it should also create a `products.json` file in the current working directory:
+Once installed, we can add the following line to our imports:
-```text
-$ python main.py
-Traceback (most recent call last):
- ...
- raise TypeError(f'Object of type {o.__class__.__name__} '
-TypeError: Object of type Decimal is not JSON serializable
+```js
+import * as cheerio from 'cheerio';
+import { writeFile } from "fs/promises";
+// highlight-next-line
+import { AsyncParser } from '@json2csv/node';
```
-Ouch! JSON supports integers and floating-point numbers, but there's no guidance on how to handle `Decimal`. To maintain precision, it's common to store monetary values as strings in JSON files. But this is a convention, not a standard, so we need to handle it manually. We'll pass a custom function to `json.dump()` to serialize objects that it can't handle directly:
+Then, let's add one more data export near the end of the source code of our scraper:
-```py
-def serialize(obj):
- if isinstance(obj, Decimal):
- return str(obj)
- raise TypeError("Object not JSON serializable")
+```js
+const jsonData = JSON.stringify(data);
+await writeFile('products.json', jsonData);
-with open("products.json", "w") as file:
- json.dump(data, file, default=serialize)
+const parser = new AsyncParser();
+const csvData = await parser.parse(data).promise();
+await writeFile("products.csv", csvData);
```
-Now the program should work as expected, producing a JSON file with the following content:
+The program should now also produce a `products.csv` file. When browsing the directory on macOS, we can see a nice preview of the file's contents, which proves that the file is correct and that other programs can read it. If you're using a different operating system, try opening the file with any spreadsheet program you have.
-
-```json title=products.json
-[{"title": "JBL Flip 4 Waterproof Portable Bluetooth Speaker", "min_price": "74.95", "price": "74.95"}, {"title": "Sony XBR-950G BRAVIA 4K HDR Ultra HD TV", "min_price": "1398.00", "price": null}, ...]
-```
+
-If you skim through the data, you'll notice that the `json.dump()` function handled some potential issues, such as escaping double quotes found in one of the titles by adding a backslash:
+In the CSV format, if a value contains commas, we should enclose it in quotes. If it contains quotes, we should double them. When we open the file in a text editor of our choice, we can see that the library automatically handled this:
-```json
-{"title": "Sony SACS9 10\" Active Subwoofer", "min_price": "158.00", "price": "158.00"}
+```csv title=products.csv
+"title","minPrice","price"
+"JBL Flip 4 Waterproof Portable Bluetooth Speaker",7495,7495
+"Sony XBR-950G BRAVIA 4K HDR Ultra HD TV",139800,
+"Sony SACS9 10"" Active Subwoofer",15800,15800
+...
+"Samsung Surround Sound Bar Home Speaker, Set of 7 (HW-NW700/ZA)",64799,64799
+...
```
-:::tip Pretty JSON
-
-While a compact JSON file without any whitespace is efficient for computers, it can be difficult for humans to read. You can pass `indent=2` to `json.dump()` for prettier output.
-
-Also, if your data contains non-English characters, set `ensure_ascii=False`. By default, Python encodes everything except [ASCII](https://en.wikipedia.org/wiki/ASCII), which means it would save [Bún bò Nam Bô](https://vi.wikipedia.org/wiki/B%C3%BAn_b%C3%B2_Nam_B%E1%BB%99) as `B\\u00fan b\\u00f2 Nam B\\u00f4`.
-
-:::
-
-We've built a Python application that downloads a product listing, parses the data, and saves it in a structured format for further use. But the data still has gaps: for some products, we only have the min price, not the actual prices. In the next lesson, we'll attempt to scrape more details from all the product pages.
+We've built a Node.js application that downloads a product listing, parses the data, and saves it in a structured format for further use. But the data still has gaps: for some products, we only have the min price, not the actual prices. In the next lesson, we'll attempt to scrape more details from all the product pages.
---
## Exercises
-In this lesson, you learned how to create export files in two formats. The following challenges are designed to help you empathize with the people who'd be working with them.
+In this lesson, we created export files in two formats. The following challenges are designed to help you empathize with the people who'd be working with them.
-### Process your CSV
+### Process your JSON
-Open the `products.csv` file in a spreadsheet app. Use the app to find all products with a min price greater than $500.
+Write a new Node.js program that reads the `products.json` file we created in the lesson, finds all products with a min price greater than $500, and prints each of them.
Solution
- Let's use [Google Sheets](https://www.google.com/sheets/about/), which is free to use. After logging in with a Google account:
+ ```js
+ import { readFile } from "fs/promises";
- 1. Go to **File > Import**, choose **Upload**, and select the file. Import the data using the default settings. You should see a table with all the data.
- 2. Select the header row. Go to **Data > Create filter**.
- 3. Use the filter icon that appears next to `min_price`. Choose **Filter by condition**, select **Greater than**, and enter **500** in the text field. Confirm the dialog. You should see only the filtered data.
-
- 
+ const jsonData = await readFile("products.json");
+ const data = JSON.parse(jsonData);
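+  // Prices are stored in cents, so $500 is 50000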
+ data
+ .filter(row => row.minPrice > 50000)
+ .forEach(row => console.log(row));
+ ```
-### Process your JSON
+### Process your CSV
-Write a new Python program that reads `products.json`, finds all products with a min price greater than $500, and prints each one using [`pp()`](https://docs.python.org/3/library/pprint.html#pprint.pp).
+In a spreadsheet application, open the `products.csv` file we created in the lesson. Then find all products with a min price greater than $500.
Solution
- ```py
- import json
- from pprint import pp
- from decimal import Decimal
+ Let's use [Google Sheets](https://www.google.com/sheets/about/), which is free to use. After logging in with a Google account:
- with open("products.json", "r") as file:
- products = json.load(file)
+ 1. Go to **File > Import**, choose **Upload**, and select the file. Import the data using the default settings. You should see a table with all the data.
+ 2. Select the header row. Go to **Data > Create filter**.
+ 3. Use the filter icon that appears next to `minPrice`. Choose **Filter by condition**, select **Greater than**, and enter **500** in the text field. Confirm the dialog. You should see only the filtered data.
- for product in products:
- if Decimal(product["min_price"]) > 500:
- pp(product)
- ```
+ 
diff --git a/sources/academy/webscraping/scraping_basics_javascript2/09_getting_links.md b/sources/academy/webscraping/scraping_basics_javascript2/09_getting_links.md
index 7eb6f6618..bf8e29714 100644
--- a/sources/academy/webscraping/scraping_basics_javascript2/09_getting_links.md
+++ b/sources/academy/webscraping/scraping_basics_javascript2/09_getting_links.md
@@ -43,16 +43,15 @@ if (response.ok) {
const html = await response.text();
const $ = cheerio.load(html);
- const data = [];
- $(".product-item").each((i, element) => {
- const productItem = $(element);
+ const data = $(".product-item").toArray().map(element => {
+ const $productItem = $(element);
- const title = productItem.find(".product-item__title");
- const titleText = title.text().trim();
+ const $title = $productItem.find(".product-item__title");
+ const title = $title.text().trim();
- const price = productItem.find(".price").contents().last();
+ const $price = $productItem.find(".price").contents().last();
const priceRange = { minPrice: null, price: null };
- const priceText = price
+ const priceText = $price
.text()
.trim()
.replace("$", "")
@@ -66,7 +65,7 @@ if (response.ok) {
priceRange.price = priceRange.minPrice;
}
- data.push({ title: titleText, ...priceRange });
+ return { title, ...priceRange };
});
const jsonData = JSON.stringify(data);
@@ -97,13 +96,13 @@ async function download(url) {
Next, we can put parsing into a `parseProduct()` function, which takes the product item element and returns the object with data:
```js
-function parseProduct(productItem) {
- const title = productItem.find(".product-item__title");
- const titleText = title.text().trim();
+function parseProduct($productItem) {
+ const $title = $productItem.find(".product-item__title");
+ const title = $title.text().trim();
- const price = productItem.find(".price").contents().last();
+ const $price = $productItem.find(".price").contents().last();
const priceRange = { minPrice: null, price: null };
- const priceText = price
+ const priceText = $price
.text()
.trim()
.replace("$", "")
@@ -117,24 +116,18 @@ function parseProduct(productItem) {
priceRange.price = priceRange.minPrice;
}
- return { title: titleText, ...priceRange };
+ return { title, ...priceRange };
}
```
Now the JSON export. For better readability, let's make a small change here and set the indentation level to two spaces:
```js
-async function exportJSON(data) {
+function exportJSON(data) {
return JSON.stringify(data, null, 2);
}
```
-:::note Why asynchronous?
-
-The `exportJSON()` function doesn't need to be `async` now, but keeping it makes future changes easier — like switching to an async JSON parser. It also stays consistent with the upcoming `exportCSV()` function, which must be asynchronous.
-
-:::
-
The last function we'll add will take care of the CSV export:
```js
@@ -161,13 +154,13 @@ async function download(url) {
}
}
-function parseProduct(productItem) {
- const title = productItem.find(".product-item__title");
- const titleText = title.text().trim();
+function parseProduct($productItem) {
+ const $title = $productItem.find(".product-item__title");
+ const title = $title.text().trim();
- const price = productItem.find(".price").contents().last();
+ const $price = $productItem.find(".price").contents().last();
const priceRange = { minPrice: null, price: null };
- const priceText = price
+ const priceText = $price
.text()
.trim()
.replace("$", "")
@@ -181,10 +174,10 @@ function parseProduct(productItem) {
priceRange.price = priceRange.minPrice;
}
- return { title: titleText, ...priceRange };
+ return { title, ...priceRange };
}
-async function exportJSON(data) {
+function exportJSON(data) {
return JSON.stringify(data, null, 2);
}
@@ -196,14 +189,13 @@ async function exportCSV(data) {
const listingURL = "https://warehouse-theme-metal.myshopify.com/collections/sales"
const $ = await download(listingURL);
-const data = []
-$(".product-item").each((i, element) => {
- const productItem = $(element);
- const item = parseProduct(productItem);
- data.push(item);
+const data = $(".product-item").toArray().map(element => {
+ const $productItem = $(element);
+ const item = parseProduct($productItem);
+ return item;
});
-await writeFile('products.json', await exportJSON(data));
+await writeFile('products.json', exportJSON(data));
await writeFile('products.csv', await exportCSV(data));
```
@@ -232,14 +224,14 @@ Several methods exist for transitioning from one page to another, but the most c
In DevTools, we can see that each product title is, in fact, also a link element. We already locate the titles, so that makes our task easier. We just need to edit the code so that it extracts not only the text of the element but also the `href` attribute. Cheerio selections support accessing attributes using the `.attr()` method:
```js
-function parseProduct(productItem) {
- const title = productItem.find(".product-item__title");
- const titleText = title.text().trim();
- const url = title.attr("href");
+function parseProduct($productItem) {
+ const $title = $productItem.find(".product-item__title");
+ const title = $title.text().trim();
+ const url = $title.attr("href");
...
- return { url, title: titleText, ...priceRange };
+ return { url, title, ...priceRange };
}
```
@@ -274,15 +266,15 @@ We'll change the `parseProduct()` function so that it also takes the base URL as
```js
// highlight-next-line
-function parseProduct(productItem, baseURL) {
- const title = productItem.find(".product-item__title");
- const titleText = title.text().trim();
+function parseProduct($productItem, baseURL) {
+ const $title = $productItem.find(".product-item__title");
+ const title = $title.text().trim();
// highlight-next-line
- const url = new URL(title.attr("href"), baseURL).href;
+ const url = new URL($title.attr("href"), baseURL).href;
...
- return { url, title: titleText, ...priceRange };
+ return { url, title, ...priceRange };
}
```
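+
+For a quick feel of how `new URL()` resolves the relative `href` against the base URL, here's a standalone sketch reusing one of the site's product paths:
+
+```js
+const base = "https://warehouse-theme-metal.myshopify.com/collections/sales";
+const relative = "/products/jbl-flip-4-waterproof-portable-bluetooth-speaker";
+console.log(new URL(relative, base).href);
+// https://warehouse-theme-metal.myshopify.com/products/jbl-flip-4-waterproof-portable-bluetooth-speaker
+```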
@@ -292,12 +284,11 @@ Now we'll pass the base URL to the function in the main body of our program:
const listingURL = "https://warehouse-theme-metal.myshopify.com/collections/sales"
const $ = await download(listingURL);
-const data = []
-$(".product-item").each((i, element) => {
- const productItem = $(element);
+const data = $(".product-item").toArray().map(element => {
+ const $productItem = $(element);
// highlight-next-line
- const item = parseProduct(productItem, listingURL);
- data.push(item);
+ const item = parseProduct($productItem, listingURL);
+ return item;
});
```
@@ -359,12 +350,12 @@ https://en.wikipedia.org/wiki/Botswana
const html = await response.text();
const $ = cheerio.load(html);
- $(".wikitable tr td:nth-child(3)").each((i, element) => {
+ for (const element of $(".wikitable tr td:nth-child(3)").toArray()) {
const nameCell = $(element);
const link = nameCell.find("a").first();
const url = new URL(link.attr("href"), listingURL).href;
console.log(url);
- });
+ }
} else {
throw new Error(`HTTP ${response.status}`);
}
@@ -403,11 +394,11 @@ https://www.theguardian.com/sport/article/2024/sep/02/max-verstappen-damns-his-u
const html = await response.text();
const $ = cheerio.load(html);
- $("#maincontent ul li").each((i, element) => {
+ for (const element of $("#maincontent ul li").toArray()) {
const link = $(element).find("a").first();
const url = new URL(link.attr("href"), listingURL).href;
console.log(url);
- });
+ }
} else {
throw new Error(`HTTP ${response.status}`);
}
diff --git a/sources/academy/webscraping/scraping_basics_javascript2/10_crawling.md b/sources/academy/webscraping/scraping_basics_javascript2/10_crawling.md
index 98d47b54e..513873f98 100644
--- a/sources/academy/webscraping/scraping_basics_javascript2/10_crawling.md
+++ b/sources/academy/webscraping/scraping_basics_javascript2/10_crawling.md
@@ -12,75 +12,70 @@ import Exercises from './_exercises.mdx';
---
-In previous lessons we've managed to download the HTML code of a single page, parse it with BeautifulSoup, and extract relevant data from it. We'll do the same now for each of the products.
+In previous lessons we've managed to download the HTML code of a single page, parse it with Cheerio, and extract relevant data from it. We'll do the same now for each of the products.
Thanks to the refactoring, we have functions ready for each of the tasks, so we won't need to repeat ourselves in our code. This is what you should see in your editor now:
-```py
-import httpx
-from bs4 import BeautifulSoup
-from decimal import Decimal
-import json
-import csv
-from urllib.parse import urljoin
-
-def download(url):
- response = httpx.get(url)
- response.raise_for_status()
-
- html_code = response.text
- return BeautifulSoup(html_code, "html.parser")
-
-def parse_product(product, base_url):
- title_element = product.select_one(".product-item__title")
- title = title_element.text.strip()
- url = urljoin(base_url, title_element["href"])
-
- price_text = (
- product
- .select_one(".price")
- .contents[-1]
- .strip()
- .replace("$", "")
- .replace(",", "")
- )
- if price_text.startswith("From "):
- min_price = Decimal(price_text.removeprefix("From "))
- price = None
- else:
- min_price = Decimal(price_text)
- price = min_price
-
- return {"title": title, "min_price": min_price, "price": price, "url": url}
-
-def export_csv(file, data):
- fieldnames = list(data[0].keys())
- writer = csv.DictWriter(file, fieldnames=fieldnames)
- writer.writeheader()
- for row in data:
- writer.writerow(row)
-
-def export_json(file, data):
- def serialize(obj):
- if isinstance(obj, Decimal):
- return str(obj)
- raise TypeError("Object not JSON serializable")
-
- json.dump(data, file, default=serialize, indent=2)
-
-listing_url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
-listing_soup = download(listing_url)
-
-data = []
-for product in listing_soup.select(".product-item"):
- item = parse_product(product, listing_url)
- data.append(item)
-
-with open("products.csv", "w") as file:
- export_csv(file, data)
-
-with open("products.json", "w") as file:
- export_json(file, data)
+```js
+import * as cheerio from 'cheerio';
+import { writeFile } from 'fs/promises';
+import { AsyncParser } from '@json2csv/node';
+
+async function download(url) {
+ const response = await fetch(url);
+ if (response.ok) {
+ const html = await response.text();
+ return cheerio.load(html);
+ } else {
+ throw new Error(`HTTP ${response.status}`);
+ }
+}
+
+function parseProduct($productItem, baseURL) {
+ const $title = $productItem.find(".product-item__title");
+ const title = $title.text().trim();
+ const url = new URL($title.attr("href"), baseURL).href;
+
+ const $price = $productItem.find(".price").contents().last();
+ const priceRange = { minPrice: null, price: null };
+ const priceText = $price
+ .text()
+ .trim()
+ .replace("$", "")
+ .replace(".", "")
+ .replace(",", "");
+
+ if (priceText.startsWith("From ")) {
+ priceRange.minPrice = parseInt(priceText.replace("From ", ""));
+ } else {
+ priceRange.minPrice = parseInt(priceText);
+ priceRange.price = priceRange.minPrice;
+ }
+
+ return { url, title, ...priceRange };
+}
+
+function exportJSON(data) {
+ return JSON.stringify(data, null, 2);
+}
+
+async function exportCSV(data) {
+ const parser = new AsyncParser();
+ return await parser.parse(data).promise();
+}
+
+const listingURL = "https://warehouse-theme-metal.myshopify.com/collections/sales"
+const $ = await download(listingURL);
+
+const data = $(".product-item").toArray().map(element => {
+ const $productItem = $(element);
+ // highlight-next-line
+ const item = parseProduct($productItem, listingURL);
+ return item;
+});
+
+await writeFile('products.json', exportJSON(data));
+await writeFile('products.csv', await exportCSV(data));
```
## Extracting vendor name
@@ -125,51 +120,71 @@ Depending on what's valuable for our use case, we can now use the same technique
It looks like using a CSS selector to locate the element with the `product-meta__vendor` class, and then extracting its text, should be enough to get the vendor name as a string:
-```py
-vendor = product_soup.select_one(".product-meta__vendor").text.strip()
+```js
+const vendor = $(".product-meta__vendor").text().trim();
```
But where do we put this line in our program?
## Crawling product detail pages
-In the `data` loop we're already going through all the products. Let's expand it to include downloading the product detail page, parsing it, extracting the vendor's name, and adding it as a new key in the item's dictionary:
+In the `.map()` loop, we're already going through all the products. Let's expand it to include downloading the product detail page, parsing it, extracting the vendor's name, and adding it to the item object.
-```py
-...
+First, we need to make the loop asynchronous so that we can use `await download()` for each product. We'll add the `async` keyword to the inner function and rename the collection to `promises`, since it will now store promises that resolve to items rather than the items themselves. We'll pass it to `await Promise.all()` to resolve all the promises and retrieve the actual items.
-listing_url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
-listing_soup = download(listing_url)
+```js
+const listingURL = "https://warehouse-theme-metal.myshopify.com/collections/sales"
+const $ = await download(listingURL);
-data = []
-for product in listing_soup.select(".product-item"):
- item = parse_product(product, listing_url)
- # highlight-next-line
- product_soup = download(item["url"])
- # highlight-next-line
- item["vendor"] = product_soup.select_one(".product-meta__vendor").text.strip()
- data.append(item)
+// highlight-next-line
+const promises = $(".product-item").toArray().map(async element => {
+ const $productItem = $(element);
+ const item = parseProduct($productItem, listingURL);
+ return item;
+});
+// highlight-next-line
+const data = await Promise.all(promises);
+```
-...
+The program behaves the same as before, but now the code is prepared to make HTTP requests from within the inner function. Let's do it:
+
+```js
+const listingURL = "https://warehouse-theme-metal.myshopify.com/collections/sales"
+const $ = await download(listingURL);
+
+const promises = $(".product-item").toArray().map(async element => {
+ const $productItem = $(element);
+ const item = parseProduct($productItem, listingURL);
+
+ // highlight-next-line
+ const $p = await download(item.url);
+ // highlight-next-line
+ item.vendor = $p(".product-meta__vendor").text().trim();
+
+ return item;
+});
+const data = await Promise.all(promises);
```
+We download each product detail page and parse its HTML with Cheerio. The `$p` variable is a new Cheerio root loaded from the detail page, similar to but distinct from the `$` we use for the listing page. Because it's a root, we query it directly with `$p(selector)` instead of calling `.find()` on an existing selection.
+
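+To make the distinction concrete, here's a tiny standalone sketch (the HTML strings are made up for illustration) showing that each call to `cheerio.load()` returns its own root function:
+
+```js
+import * as cheerio from 'cheerio';
+
+// Two independent Cheerio roots, one per parsed document
+const $ = cheerio.load('<a class="product-item__title">Sony SACS9 Subwoofer</a>');
+const $p = cheerio.load('<p class="product-meta__vendor">Sony</p>');
+
+console.log($('.product-item__title').text());   // queries the "listing" document
+console.log($p('.product-meta__vendor').text()); // queries the "detail" document
+```
+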
If we run the program now, it'll take longer to finish since it's making 24 more HTTP requests. But in the end, it should produce exports with a new field containing the vendor's name:
```json title=products.json
[
{
- "title": "JBL Flip 4 Waterproof Portable Bluetooth Speaker",
- "min_price": "74.95",
- "price": "74.95",
"url": "https://warehouse-theme-metal.myshopify.com/products/jbl-flip-4-waterproof-portable-bluetooth-speaker",
+ "title": "JBL Flip 4 Waterproof Portable Bluetooth Speaker",
+ "minPrice": 7495,
+ "price": 7495,
"vendor": "JBL"
},
{
+ "url": "https://warehouse-theme-metal.myshopify.com/products/sony-xbr-65x950g-65-class-64-5-diag-bravia-4k-hdr-ultra-hd-tv",
"title": "Sony XBR-950G BRAVIA 4K HDR Ultra HD TV",
- "min_price": "1398.00",
+ "minPrice": 139800,
"price": null,
- "url": "https://warehouse-theme-metal.myshopify.com/products/sony-xbr-65x950g-65-class-64-5-diag-bravia-4k-hdr-ultra-hd-tv",
"vendor": "Sony"
},
...
@@ -178,7 +193,7 @@ If we run the program now, it'll take longer to finish since it's making 24 more
## Extracting price
-Scraping the vendor's name is nice, but the main reason we started checking the detail pages in the first place was to figure out how to get a price for each product. From the product listing, we could only scrape the min price, and remember—we’re building a Python app to track prices!
+Scraping the vendor's name is nice, but the main reason we started checking the detail pages in the first place was to figure out how to get a price for each product. From the product listing, we could only scrape the min price, and remember—we're building a Node.js app to track prices!
Looking at the [Sony XBR-950G BRAVIA](https://warehouse-theme-metal.myshopify.com/products/sony-xbr-65x950g-65-class-64-5-diag-bravia-4k-hdr-ultra-hd-tv), it's clear that the listing only shows min prices, because some products have variants, each with a different price. And different stock availability. And different SKUs…
@@ -192,7 +207,7 @@ In the next lesson, we'll scrape the product detail pages so that each product v
### Scrape calling codes of African countries
-This is a follow-up to an exercise from the previous lesson, so feel free to reuse your code. Scrape links to Wikipedia pages for all African states and territories. Follow each link and extract the _calling code_ from the info table. Print the URL and the calling code for each country. Start with this URL:
+Scrape links to Wikipedia pages for all African states and territories. Follow each link and extract the _calling code_ from the info table. Print the URL and the calling code for each country. Start with this URL:
```text
https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_in_Africa
@@ -206,48 +221,59 @@ https://en.wikipedia.org/wiki/Angola +244
https://en.wikipedia.org/wiki/Benin +229
https://en.wikipedia.org/wiki/Botswana +267
https://en.wikipedia.org/wiki/Burkina_Faso +226
-https://en.wikipedia.org/wiki/Burundi None
+https://en.wikipedia.org/wiki/Burundi null
https://en.wikipedia.org/wiki/Cameroon +237
...
```
-Hint: Locating cells in tables is sometimes easier if you know how to [navigate up](https://beautiful-soup-4.readthedocs.io/en/latest/index.html#going-up) in the HTML element soup.
+Hint: Locating cells in tables is sometimes easier if you know how to [filter](https://cheerio.js.org/docs/api/classes/Cheerio#filter) or [navigate up](https://cheerio.js.org/docs/api/classes/Cheerio#parent) in the HTML element tree.
Solution
- ```py
- import httpx
- from bs4 import BeautifulSoup
- from urllib.parse import urljoin
-
- def download(url):
- response = httpx.get(url)
- response.raise_for_status()
- return BeautifulSoup(response.text, "html.parser")
-
- def parse_calling_code(soup):
- for label in soup.select("th.infobox-label"):
- if label.text.strip() == "Calling code":
- data = label.parent.select_one("td.infobox-data")
- return data.text.strip()
- return None
-
- listing_url = "https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_in_Africa"
- listing_soup = download(listing_url)
- for name_cell in listing_soup.select(".wikitable tr td:nth-child(3)"):
- link = name_cell.select_one("a")
- country_url = urljoin(listing_url, link["href"])
- country_soup = download(country_url)
- calling_code = parse_calling_code(country_soup)
- print(country_url, calling_code)
+ ```js
+ import * as cheerio from 'cheerio';
+
+ async function download(url) {
+ const response = await fetch(url);
+ if (response.ok) {
+ const html = await response.text();
+ return cheerio.load(html);
+ } else {
+ throw new Error(`HTTP ${response.status}`);
+ }
+ }
+
+ const listingURL = "https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_in_Africa";
+ const $ = await download(listingURL);
+
+ const $cells = $(".wikitable tr td:nth-child(3)");
+ const promises = $cells.toArray().map(async element => {
+ const $nameCell = $(element);
+ const $link = $nameCell.find("a").first();
+ const countryURL = new URL($link.attr("href"), listingURL).href;
+
+ const $c = await download(countryURL);
+ const $label = $c("th.infobox-label")
+ .filter((i, element) => $c(element).text().trim() == "Calling code")
+ .first();
+ const callingCode = $label
+ .parent()
+ .find("td.infobox-data")
+ .first()
+ .text()
+ .trim();
+
+ console.log(`${countryURL} ${callingCode || null}`);
+ });
+ await Promise.all(promises);
```
### Scrape authors of F1 news articles
-This is a follow-up to an exercise from the previous lesson, so feel free to reuse your code. Scrape links to the Guardian's latest F1 news articles. For each article, follow the link and extract both the author's name and the article's title. Print the author's name and the title for all the articles. Start with this URL:
+Scrape links to the Guardian's latest F1 news articles. For each article, follow the link and extract both the author's name and the article's title. Print the author's name and the title for all the articles. Start with this URL:
```text
https://www.theguardian.com/sport/formulaone
@@ -272,34 +298,36 @@ Hints:
Solution
- ```py
- import httpx
- from bs4 import BeautifulSoup
- from urllib.parse import urljoin
-
- def download(url):
- response = httpx.get(url)
- response.raise_for_status()
- return BeautifulSoup(response.text, "html.parser")
-
- def parse_author(article_soup):
- link = article_soup.select_one('aside a[rel="author"]')
- if link:
- return link.text.strip()
- address = article_soup.select_one('aside address')
- if address:
- return address.text.strip()
- return None
-
- listing_url = "https://www.theguardian.com/sport/formulaone"
- listing_soup = download(listing_url)
- for item in listing_soup.select("#maincontent ul li"):
- link = item.select_one("a")
- article_url = urljoin(listing_url, link["href"])
- article_soup = download(article_url)
- title = article_soup.select_one("h1").text.strip()
- author = parse_author(article_soup)
- print(f"{author}: {title}")
+ ```js
+ import * as cheerio from 'cheerio';
+
+ async function download(url) {
+ const response = await fetch(url);
+ if (response.ok) {
+ const html = await response.text();
+ return cheerio.load(html);
+ } else {
+ throw new Error(`HTTP ${response.status}`);
+ }
+ }
+
+ const listingURL = "https://www.theguardian.com/sport/formulaone";
+ const $ = await download(listingURL);
+
+ const promises = $("#maincontent ul li").toArray().map(async element => {
+ const $item = $(element);
+ const $link = $item.find("a").first();
+ const authorURL = new URL($link.attr("href"), listingURL).href;
+
+ const $a = await download(authorURL);
+ const title = $a("h1").text().trim();
+
+ const author = $a('a[rel="author"]').text().trim();
+ const address = $a('aside address').text().trim();
+
+ console.log(`${author || address || null}: ${title}`);
+ });
+ await Promise.all(promises);
```
diff --git a/sources/academy/webscraping/scraping_basics_javascript2/11_scraping_variants.md b/sources/academy/webscraping/scraping_basics_javascript2/11_scraping_variants.md
index 6cebba658..61d12e994 100644
--- a/sources/academy/webscraping/scraping_basics_javascript2/11_scraping_variants.md
+++ b/sources/academy/webscraping/scraping_basics_javascript2/11_scraping_variants.md
@@ -41,7 +41,7 @@ Nice! We can extract the variant names, but we also need to extract the price fo

-If we can't find a workaround, we'd need our scraper to run JavaScript. That's not impossible. Scrapers can spin up their own browser instance and automate clicking on buttons, but it's slow and resource-intensive. Ideally, we want to stick to plain HTTP requests and Beautiful Soup as much as possible.
+If we can't find a workaround, we'd need our scraper to run browser JavaScript. That's not impossible. Scrapers can spin up their own browser instance and automate clicking on buttons, but it's slow and resource-intensive. Ideally, we want to stick to plain HTTP requests and Cheerio as much as possible.
After a bit of detective work, we notice that not far below the `block-swatch-list` there's also a block of HTML with a class `no-js`, which contains all the data!
@@ -65,45 +65,70 @@ After a bit of detective work, we notice that not far below the `block-swatch-li
```
-These elements aren't visible to regular visitors. They're there just in case JavaScript fails to work, otherwise they're hidden. This is a great find because it allows us to keep our scraper lightweight.
+These elements aren't visible to regular visitors. They're there just in case browser JavaScript fails to work, otherwise they're hidden. This is a great find because it allows us to keep our scraper lightweight.
## Extracting variants
-Using our knowledge of Beautiful Soup, we can locate the options and extract the data we need:
+Using our knowledge of Cheerio, we can locate the `option` elements and extract the data we need. We'll loop over the options, extract variant names, and create a corresponding array of items for each product:
-```py
-...
+```js
+const listingURL = "https://warehouse-theme-metal.myshopify.com/collections/sales"
+const $ = await download(listingURL);
-listing_url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
-listing_soup = download(listing_url)
+const promises = $(".product-item").toArray().map(async element => {
+ const $productItem = $(element);
+ const item = parseProduct($productItem, listingURL);
-data = []
-for product in listing_soup.select(".product-item"):
- item = parse_product(product, listing_url)
- product_soup = download(item["url"])
- vendor = product_soup.select_one(".product-meta__vendor").text.strip()
+ const $p = await download(item.url);
+ item.vendor = $p(".product-meta__vendor").text().trim();
- if variants := product_soup.select(".product-form__option.no-js option"):
- for variant in variants:
- data.append(item | {"variant_name": variant.text.strip()})
- else:
- item["variant_name"] = None
- data.append(item)
+ // highlight-start
+ const $options = $p(".product-form__option.no-js option");
+ const items = $options.toArray().map(optionElement => {
+ const $option = $(optionElement);
+ const variantName = $option.text().trim();
+ return { variantName, ...item };
+ });
+ // highlight-end
-...
+ return item;
+});
+const data = await Promise.all(promises);
```
-The CSS selector `.product-form__option.no-js` matches elements with both `product-form__option` and `no-js` classes. Then we're using the [descendant combinator](https://developer.mozilla.org/en-US/docs/Web/CSS/Descendant_combinator) to match all `option` elements somewhere inside the `.product-form__option.no-js` wrapper.
+The CSS selector `.product-form__option.no-js` targets elements that have both the `product-form__option` and `no-js` classes. We then use the [descendant combinator](https://developer.mozilla.org/en-US/docs/Web/CSS/Descendant_combinator) to match all `option` elements nested within the `.product-form__option.no-js` wrapper.
-Python dictionaries are mutable, so if we assigned the variant with `item["variant_name"] = ...`, we'd always overwrite the values. Instead of saving an item for each variant, we'd end up with the last variant repeated several times. To avoid this, we create a new dictionary for each variant and merge it with the `item` data before adding it to `data`. If we don't find any variants, we add the `item` as is, leaving the `variant_name` key empty.
+We loop over the variants with the `.map()` method to create an array of item copies, one for each `variantName`. We now need to pass all these items onward, but the function currently returns just one item per product. And what if there are no variants?
-:::tip Modern Python syntax
+Let's adjust the loop so it returns a promise that resolves to an array of items instead of a single item. If a product has no variants, we'll return an array with a single item, setting `variantName` to `null`:
-Since Python 3.8, you can use `:=` to simplify checking if an assignment resulted in a non-empty value. It's called an _assignment expression_ or _walrus operator_. You can learn more about it in the [docs](https://docs.python.org/3/reference/expressions.html#assignment-expressions) or in the [proposal document](https://peps.python.org/pep-0572/).
+```js
+const listingURL = "https://warehouse-theme-metal.myshopify.com/collections/sales"
+const $ = await download(listingURL);
-Since Python 3.9, you can use `|` to merge two dictionaries. If the [docs](https://docs.python.org/3/library/stdtypes.html#dict) aren't clear enough, check out the [proposal document](https://peps.python.org/pep-0584/) for more details.
+const promises = $(".product-item").toArray().map(async element => {
+ const $productItem = $(element);
+ const item = parseProduct($productItem, listingURL);
-:::
+ const $p = await download(item.url);
+ item.vendor = $p(".product-meta__vendor").text().trim();
+
+ const $options = $p(".product-form__option.no-js option");
+ const items = $options.toArray().map(optionElement => {
+ const $option = $(optionElement);
+ const variantName = $option.text().trim();
+ return { variantName, ...item };
+ });
+ // highlight-next-line
+ return items.length > 0 ? items : [{ variantName: null, ...item }];
+});
+// highlight-start
+const itemLists = await Promise.all(promises);
+const data = itemLists.flat();
+// highlight-end
+```
+
+After modifying the loop, we also updated how we collect the items into the `data` array. Since the loop now produces an array of items per product, the result of `await Promise.all()` is an array of arrays. We use [`.flat()`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/flat) to merge them into a single, non-nested array.
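+
+For illustration, here's what the flattening step does with made-up, shortened items:
+
+```js
+const itemLists = [
+  [{ variantName: "Red" }, { variantName: "Black" }], // a product with two variants
+  [{ variantName: null }],                            // a product without variants
+];
+console.log(itemLists.flat());
+// [ { variantName: 'Red' }, { variantName: 'Black' }, { variantName: null } ]
+```
+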
If we run the program now, we'll see 34 items in total. Some items don't have variants, so they won't have a variant name. However, they should still have a price set—our scraper should already have that info from the product listing page.
@@ -112,11 +137,11 @@ If we run the program now, we'll see 34 items in total. Some items don't have va
[
...
{
- "variant_name": null,
- "title": "Klipsch R-120SW Powerful Detailed Home Speaker - Unit",
- "min_price": "324.00",
- "price": "324.00",
+    "variantName": null,
"url": "https://warehouse-theme-metal.myshopify.com/products/klipsch-r-120sw-powerful-detailed-home-speaker-set-of-1",
+ "title": "Klipsch R-120SW Powerful Detailed Home Speaker - Unit",
+ "minPrice": 32400,
+ "price": 32400,
"vendor": "Klipsch"
},
...
@@ -130,19 +155,19 @@ Some products will break into several items, each with a different variant name.
[
...
{
- "variant_name": "Red - $178.00",
+    "variantName": "Red - $178.00",
+ "url": "https://warehouse-theme-metal.myshopify.com/products/sony-xb950-extra-bass-wireless-headphones-with-app-control",
"title": "Sony XB-950B1 Extra Bass Wireless Headphones with App Control",
- "min_price": "128.00",
+ "minPrice": 12800,
"price": null,
- "url": "https://warehouse-theme-metal.myshopify.com/products/sony-xb950-extra-bass-wireless-headphones-with-app-control",
"vendor": "Sony"
},
{
- "variant_name": "Black - $178.00",
+    "variantName": "Black - $178.00",
+ "url": "https://warehouse-theme-metal.myshopify.com/products/sony-xb950-extra-bass-wireless-headphones-with-app-control",
"title": "Sony XB-950B1 Extra Bass Wireless Headphones with App Control",
- "min_price": "128.00",
+ "minPrice": 12800,
"price": null,
- "url": "https://warehouse-theme-metal.myshopify.com/products/sony-xb950-extra-bass-wireless-headphones-with-app-control",
"vendor": "Sony"
},
...
@@ -156,11 +181,11 @@ Perhaps surprisingly, some products with variants will have the price field set.
[
...
{
- "variant_name": "Red - $74.95",
- "title": "JBL Flip 4 Waterproof Portable Bluetooth Speaker",
- "min_price": "74.95",
- "price": "74.95",
+    "variantName": "Red - $74.95",
"url": "https://warehouse-theme-metal.myshopify.com/products/jbl-flip-4-waterproof-portable-bluetooth-speaker",
+ "title": "JBL Flip 4 Waterproof Portable Bluetooth Speaker",
+ "minPrice": 7495,
+ "price": 7495,
"vendor": "JBL"
},
...
@@ -169,110 +194,118 @@ Perhaps surprisingly, some products with variants will have the price field set.
## Parsing price
-The items now contain the variant as text, which is good for a start, but we want the price to be in the `price` key. Let's introduce a new function to handle that:
-
-```py
-def parse_variant(variant):
- text = variant.text.strip()
- name, price_text = text.split(" - ")
- price = Decimal(
- price_text
- .replace("$", "")
- .replace(",", "")
- )
- return {"variant_name": name, "price": price}
+The items now contain the variant as text, which is good for a start, but we want the price to be in the `price` property. Let's introduce a new function to handle that:
+
+```js
+function parseVariant($option) {
+ const [variantName, priceText] = $option
+ .text()
+ .trim()
+ .split(" - ");
+ const price = parseInt(
+ priceText
+ .replace("$", "")
+ .replace(".", "")
+ .replace(",", "")
+ );
+ return { variantName, price };
+}
```
-First, we split the text into two parts, then we parse the price as a decimal number. This part is similar to what we already do for parsing product listing prices. The function returns a dictionary we can merge with `item`.
+First, we split the text into two parts, then we parse the price as a number. This part is similar to what we already do for parsing product listing prices. The function returns an object we can merge with `item`.
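+
+As a quick sanity check, we can feed the function a hand-made option. This is a standalone sketch; in the real program the option comes from the parsed detail page, and `parseVariant()` is the function defined above:
+
+```js
+import * as cheerio from 'cheerio';
+
+const $demo = cheerio.load('<option>Red - $178.00</option>');
+console.log(parseVariant($demo('option')));
+// { variantName: 'Red', price: 17800 }
+```
+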
## Saving price
Now, if we use our new function, we should finally get a program that can scrape exact prices for all products, even if they have variants. The whole code should look like this now:
-```py
-import httpx
-from bs4 import BeautifulSoup
-from decimal import Decimal
-import json
-import csv
-from urllib.parse import urljoin
-
-def download(url):
- response = httpx.get(url)
- response.raise_for_status()
-
- html_code = response.text
- return BeautifulSoup(html_code, "html.parser")
-
-def parse_product(product, base_url):
- title_element = product.select_one(".product-item__title")
- title = title_element.text.strip()
- url = urljoin(base_url, title_element["href"])
-
- price_text = (
- product
- .select_one(".price")
- .contents[-1]
- .strip()
- .replace("$", "")
- .replace(",", "")
- )
- if price_text.startswith("From "):
- min_price = Decimal(price_text.removeprefix("From "))
- price = None
- else:
- min_price = Decimal(price_text)
- price = min_price
-
- return {"title": title, "min_price": min_price, "price": price, "url": url}
-
-def parse_variant(variant):
- text = variant.text.strip()
- name, price_text = text.split(" - ")
- price = Decimal(
- price_text
- .replace("$", "")
- .replace(",", "")
- )
- return {"variant_name": name, "price": price}
-
-def export_csv(file, data):
- fieldnames = list(data[0].keys())
- writer = csv.DictWriter(file, fieldnames=fieldnames)
- writer.writeheader()
- for row in data:
- writer.writerow(row)
-
-def export_json(file, data):
- def serialize(obj):
- if isinstance(obj, Decimal):
- return str(obj)
- raise TypeError("Object not JSON serializable")
-
- json.dump(data, file, default=serialize, indent=2)
-
-listing_url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
-listing_soup = download(listing_url)
-
-data = []
-for product in listing_soup.select(".product-item"):
- item = parse_product(product, listing_url)
- product_soup = download(item["url"])
- vendor = product_soup.select_one(".product-meta__vendor").text.strip()
-
- if variants := product_soup.select(".product-form__option.no-js option"):
- for variant in variants:
- # highlight-next-line
- data.append(item | parse_variant(variant))
- else:
- item["variant_name"] = None
- data.append(item)
-
-with open("products.csv", "w") as file:
- export_csv(file, data)
-
-with open("products.json", "w") as file:
- export_json(file, data)
+```js
+import * as cheerio from 'cheerio';
+import { writeFile } from 'fs/promises';
+import { AsyncParser } from '@json2csv/node';
+
+async function download(url) {
+ const response = await fetch(url);
+ if (response.ok) {
+ const html = await response.text();
+ return cheerio.load(html);
+ } else {
+ throw new Error(`HTTP ${response.status}`);
+ }
+}
+
+function parseProduct($productItem, baseURL) {
+ const $title = $productItem.find(".product-item__title");
+ const title = $title.text().trim();
+ const url = new URL($title.attr("href"), baseURL).href;
+
+ const $price = $productItem.find(".price").contents().last();
+ const priceRange = { minPrice: null, price: null };
+ const priceText = $price
+ .text()
+ .trim()
+ .replace("$", "")
+ .replace(".", "")
+ .replace(",", "");
+
+ if (priceText.startsWith("From ")) {
+ priceRange.minPrice = parseInt(priceText.replace("From ", ""));
+ } else {
+ priceRange.minPrice = parseInt(priceText);
+ priceRange.price = priceRange.minPrice;
+ }
+
+ return { url, title, ...priceRange };
+}
+
+function exportJSON(data) {
+ return JSON.stringify(data, null, 2);
+}
+
+async function exportCSV(data) {
+ const parser = new AsyncParser();
+ return await parser.parse(data).promise();
+}
+
+// highlight-start
+function parseVariant($option) {
+ const [variantName, priceText] = $option
+ .text()
+ .trim()
+ .split(" - ");
+ const price = parseInt(
+ priceText
+ .replace("$", "")
+ .replace(".", "")
+ .replace(",", "")
+ );
+ return { variantName, price };
+}
+// highlight-end
+
+const listingURL = "https://warehouse-theme-metal.myshopify.com/collections/sales"
+const $ = await download(listingURL);
+
+const promises = $(".product-item").toArray().map(async element => {
+ const $productItem = $(element);
+ const item = parseProduct($productItem, listingURL);
+
+ const $p = await download(item.url);
+ item.vendor = $p(".product-meta__vendor").text().trim();
+
+ const $options = $p(".product-form__option.no-js option");
+ const items = $options.toArray().map(optionElement => {
+ // highlight-next-line
+ const variant = parseVariant($(optionElement));
+ // highlight-next-line
+ return { ...item, ...variant };
+ });
+ return items.length > 0 ? items : [{ variantName: null, ...item }];
+});
+const itemLists = await Promise.all(promises);
+const data = itemLists.flat();
+
+await writeFile('products.json', exportJSON(data));
+await writeFile('products.csv', await exportCSV(data));
```
Let's run the scraper and see if all the items in the data contain prices:
@@ -282,26 +315,26 @@ Let's run the scraper and see if all the items in the data contain prices:
[
...
{
- "variant_name": "Red",
- "title": "Sony XB-950B1 Extra Bass Wireless Headphones with App Control",
- "min_price": "128.00",
- "price": "178.00",
"url": "https://warehouse-theme-metal.myshopify.com/products/sony-xb950-extra-bass-wireless-headphones-with-app-control",
- "vendor": "Sony"
+ "title": "Sony XB-950B1 Extra Bass Wireless Headphones with App Control",
+ "minPrice": 12800,
+ "price": 17800,
+ "vendor": "Sony",
+ "variantName": "Red"
},
{
- "variant_name": "Black",
- "title": "Sony XB-950B1 Extra Bass Wireless Headphones with App Control",
- "min_price": "128.00",
- "price": "178.00",
"url": "https://warehouse-theme-metal.myshopify.com/products/sony-xb950-extra-bass-wireless-headphones-with-app-control",
- "vendor": "Sony"
+ "title": "Sony XB-950B1 Extra Bass Wireless Headphones with App Control",
+ "minPrice": 12800,
+ "price": 17800,
+ "vendor": "Sony",
+ "variantName": "Black"
},
...
]
```
-Success! We managed to build a Python application for watching prices!
+Success! We managed to build a Node.js application for watching prices!
Is this the end? Maybe! In the next lesson, we'll use a scraping framework to build the same application, but with less code, faster requests, and better visibility into what's happening while we wait for the program to finish.
@@ -309,69 +342,108 @@ Is this the end? Maybe! In the next lesson, we'll use a scraping framework to bu
-### Build a scraper for watching Python jobs
+### Build a scraper for watching npm packages
-You're able to build a scraper now, aren't you? Let's build another one! Python's official website has a [job board](https://www.python.org/jobs/). Scrape the job postings that match the following criteria:
+You can build a scraper now, can't you? Let's build another one! From the registry at [npmjs.com](https://www.npmjs.com/), scrape information about npm packages that match the following criteria:
-- Tagged as "Database"
-- Posted within the last 60 days
+- Have the keyword "llm" (as in _large language model_)
+- Updated within the last two years ("2 years ago" is okay; "3 years ago" is too old)
-For each job posting found, use [`pp()`](https://docs.python.org/3/library/pprint.html#pprint.pp) to print a dictionary containing the following data:
+Print an array of the top 5 packages with the most dependents. Each package should be represented by an object containing the following data:
-- Job title
-- Company
-- URL to the job posting
-- Date of posting
+- Name
+- Description
+- URL to the package detail page
+- Number of dependents
+- Number of downloads
Your output should look something like this:
-```py
-{'title': 'Senior Full Stack Developer',
- 'company': 'Baserow',
- 'url': 'https://www.python.org/jobs/7705/',
- 'posted_on': datetime.date(2024, 9, 16)}
-{'title': 'Senior Python Engineer',
- 'company': 'Active Prime',
- 'url': 'https://www.python.org/jobs/7699/',
- 'posted_on': datetime.date(2024, 9, 5)}
-...
+```js
+[
+ {
+ name: 'langchain',
+ url: 'https://www.npmjs.com/package/langchain',
+ description: 'Typescript bindings for langchain',
+ dependents: 735,
+ downloads: 3938
+ },
+ {
+ name: '@langchain/core',
+ url: 'https://www.npmjs.com/package/@langchain/core',
+ description: 'Core LangChain.js abstractions and schemas',
+ dependents: 730,
+ downloads: 5994
+ },
+ ...
+]
```
-You can find everything you need for working with dates and times in Python's [`datetime`](https://docs.python.org/3/library/datetime.html) module, including `date.today()`, `datetime.fromisoformat()`, `datetime.date()`, and `timedelta()`.
-
Solution
- After inspecting the job board, you'll notice that job postings tagged as "Database" have a dedicated URL. We'll use that as our starting point, which saves us from having to scrape and check the tags manually.
-
- ```py
- from pprint import pp
- import httpx
- from bs4 import BeautifulSoup
- from urllib.parse import urljoin
- from datetime import datetime, date, timedelta
-
- today = date.today()
- jobs_url = "https://www.python.org/jobs/type/database/"
- response = httpx.get(jobs_url)
- response.raise_for_status()
- soup = BeautifulSoup(response.text, "html.parser")
-
- for job in soup.select(".list-recent-jobs li"):
- link = job.select_one(".listing-company-name a")
-
- time = job.select_one(".listing-posted time")
- posted_at = datetime.fromisoformat(time["datetime"])
- posted_on = posted_at.date()
- posted_ago = today - posted_on
-
- if posted_ago <= timedelta(days=60):
- title = link.text.strip()
- company = list(job.select_one(".listing-company-name").stripped_strings)[-1]
- url = urljoin(jobs_url, link["href"])
- pp({"title": title, "company": company, "url": url, "posted_on": posted_on})
+ After inspecting the registry, you'll notice that packages with the keyword "llm" have a dedicated URL. Also, changing the sorting dropdown results in a page with its own URL. We'll use that as our starting point, which saves us from having to scrape the whole registry and then filter by keyword or sort by the number of dependents.
+
+ ```js
+ import * as cheerio from 'cheerio';
+
+ async function download(url) {
+ const response = await fetch(url);
+ if (response.ok) {
+ const html = await response.text();
+ return cheerio.load(html);
+ } else {
+ throw new Error(`HTTP ${response.status}`);
+ }
+ }
+
+ const listingURL = "https://www.npmjs.com/search?page=0&q=keywords%3Allm&sortBy=dependent_count";
+ const $ = await download(listingURL);
+
+ const promises = $("section").toArray().map(async element => {
+ const $card = $(element);
+
+ const details = $card
+ .children()
+ .first()
+ .children()
+ .last()
+ .text()
+ .split("•");
+ const updatedText = details[2].trim();
+ const dependents = parseInt(details[3].replace("dependents", "").trim());
+
+ if (updatedText.includes("years ago")) {
+ const yearsAgo = parseInt(updatedText.replace("years ago", "").trim());
+ if (yearsAgo > 2) {
+ return null;
+ }
+ }
+
+ const $link = $card.find("a").first();
+ const name = $link.text().trim();
+ const url = new URL($link.attr("href"), listingURL).href;
+ const description = $card.find("p").text().trim();
+
+ const downloadsText = $card
+ .children()
+ .last()
+ .text()
+ .replace(",", "")
+ .trim();
+ const downloads = parseInt(downloadsText);
+
+ return { name, url, description, dependents, downloads };
+ });
+
+ const data = await Promise.all(promises);
+ console.log(data.filter(item => item !== null).splice(0, 5));
```
+ Since the HTML doesn't contain any descriptive classes, we must rely on its structure. We're using [`.children()`](https://cheerio.js.org/docs/api/classes/Cheerio#children) to carefully navigate the HTML element tree.
+
+ For items older than 2 years, we return `null` instead of an item. Before printing the results, we use [.filter()](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/filter) to remove these empty values and [.splice()](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/splice) the array down to just 5 items.
+
### Find the shortest CNN article which made it to the Sports homepage
@@ -379,8 +451,8 @@ You can find everything you need for working with dates and times in Python's [`
Scrape the [CNN Sports](https://edition.cnn.com/sport) homepage. For each linked article, calculate its length in characters:
- Locate the element that holds the main content of the article.
-- Use [`get_text()`](https://beautiful-soup-4.readthedocs.io/en/latest/index.html#get-text) to extract all the content as plain text.
-- Use `len()` to calculate the character count.
+- Use `.text()` to extract all the content as plain text.
+- Use `.length` to calculate the character count.
Skip pages without text (like those that only have a video). Sort the results and print the URL of the shortest article that made it to the homepage.
@@ -389,32 +461,38 @@ At the time of writing, the shortest article on the CNN Sports homepage is [abou
Solution
- ```py
- import httpx
- from bs4 import BeautifulSoup
- from urllib.parse import urljoin
-
- def download(url):
- response = httpx.get(url)
- response.raise_for_status()
- return BeautifulSoup(response.text, "html.parser")
-
- listing_url = "https://edition.cnn.com/sport"
- listing_soup = download(listing_url)
-
- data = []
- for card in listing_soup.select(".layout__main .card"):
- link = card.select_one(".container__link")
- article_url = urljoin(listing_url, link["href"])
- article_soup = download(article_url)
- if content := article_soup.select_one(".article__content"):
- length = len(content.get_text())
- data.append((length, article_url))
-
- data.sort()
- shortest_item = data[0]
- item_url = shortest_item[1]
- print(item_url)
+ ```js
+ import * as cheerio from 'cheerio';
+
+ async function download(url) {
+ const response = await fetch(url);
+ if (response.ok) {
+ const html = await response.text();
+ return cheerio.load(html);
+ } else {
+ throw new Error(`HTTP ${response.status}`);
+ }
+ }
+
+ const listingURL = "https://edition.cnn.com/sport";
+ const $ = await download(listingURL);
+
+ const promises = $(".layout__main .card").toArray().map(async element => {
+ const $link = $(element).find("a").first();
+ const articleURL = new URL($link.attr("href"), listingURL).href;
+
+ const $a = await download(articleURL);
+ const content = $a(".article__content").text().trim();
+
+ return { url: articleURL, length: content.length };
+ });
+
+ const data = await Promise.all(promises);
+ const nonZeroData = data.filter(({ url, length }) => length > 0);
+ nonZeroData.sort((a, b) => a.length - b.length);
+ const shortestItem = nonZeroData[0];
+
+ console.log(shortestItem.url);
```
diff --git a/sources/academy/webscraping/scraping_basics_javascript2/12_framework.md b/sources/academy/webscraping/scraping_basics_javascript2/12_framework.md
index fe80fb5fc..4532af095 100644
--- a/sources/academy/webscraping/scraping_basics_javascript2/12_framework.md
+++ b/sources/academy/webscraping/scraping_basics_javascript2/12_framework.md
@@ -15,7 +15,7 @@ import Exercises from './_exercises.mdx';
Before rewriting our code, let's point out several caveats in our current solution:
- _Hard to maintain:_ All the data we need from the listing page is also available on the product page. By scraping both, we have to maintain selectors for two HTML documents. Instead, we could scrape links from the listing page and process all data on the product pages.
-- _Slow:_ The program runs sequentially, which is generously considerate toward the target website, but extremely inefficient.
+- _Inconsiderate:_ The program sends all requests in parallel, which is efficient but inconsiderate to the target website and may result in us getting blocked.
- _No logging:_ The scraper gives no sense of progress, making it tedious to use. Debugging issues becomes even more frustrating without proper logs.
- _Boilerplate code:_ We implement downloading and parsing HTML, or exporting data to CSV, although we're not the first people to meet and solve these problems.
- _Prone to anti-scraping:_ If the target website implemented anti-scraping measures, a bare-bones program like ours would stop working.
@@ -24,390 +24,349 @@ Before rewriting our code, let's point out several caveats in our current soluti
In this lesson, we'll tackle all the above issues while keeping the code concise thanks to a scraping framework.
-:::info Why Crawlee and not Scrapy
+## Starting with Crawlee
-From the two main open-source options for Python, [Scrapy](https://scrapy.org/) and [Crawlee](https://crawlee.dev/python/), we chose the latter—not just because we're the company financing its development.
-
-We genuinely believe beginners to scraping will like it more, since it allows to create a scraper with less code and less time spent reading docs. Scrapy's long history ensures it's battle-tested, but it also means its code relies on technologies that aren't really necessary today. Crawlee, on the other hand, builds on modern Python features like asyncio and type hints.
-
-:::
-
-## Installing Crawlee
-
-When starting with the Crawlee framework, we first need to decide which approach to downloading and parsing we prefer. We want the one based on Beautiful Soup, so let's install the `crawlee` package with the `beautifulsoup` extra specified in brackets. The framework has a lot of dependencies, so expect the installation to take a while.
+First, let's install the Crawlee package. The framework has a lot of dependencies, so expect the installation to take a while.
```text
-$ pip install crawlee[beautifulsoup]
+$ npm install crawlee --save
+
+added 123 packages, and audited 123 packages in 0s
...
-Successfully installed Jinja2-0.0.0 ... ... ... crawlee-0.0.0 ... ... ...
```
-## Running Crawlee
+Now let's use the framework to create a new version of our scraper. First, let's rename the `index.js` file to `oldindex.js`, so that we can keep peeking at the original implementation while working on the new one. Then, in the same project directory, we'll create a new, empty `index.js`. The initial content will look like this:
-Now let's use the framework to create a new version of our scraper. First, let's rename the `main.py` file to `oldmain.py`, so that we can keep peeking at the original implementation while working on the new one. Then, in the same project directory, we'll create a new, empty `main.py`. The initial content will look like this:
+```js
+import { CheerioCrawler } from 'crawlee';
-```py
-import asyncio
-from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
+const crawler = new CheerioCrawler({
+ async requestHandler({ $, log }) {
+ const title = $('title').text().trim();
+ log.info(title);
+ },
+});
-async def main():
- crawler = BeautifulSoupCrawler()
-
- @crawler.router.default_handler
- async def handle_listing(context: BeautifulSoupCrawlingContext):
- if title := context.soup.title:
- print(title.text.strip())
-
- await crawler.run(["https://warehouse-theme-metal.myshopify.com/collections/sales"])
-
-if __name__ == '__main__':
- asyncio.run(main())
+await crawler.run(['https://warehouse-theme-metal.myshopify.com/collections/sales']);
```
In the code, we do the following:
-1. We import the necessary modules and define an asynchronous `main()` function.
-2. Inside `main()`, we first create a crawler object, which manages the scraping process. In this case, it's a crawler based on Beautiful Soup.
-3. Next, we define a nested asynchronous function called `handle_listing()`. It receives a `context` parameter, and Python type hints show it's of type `BeautifulSoupCrawlingContext`. Type hints help editors suggest what we can do with the object.
-4. We use a Python decorator (the line starting with `@`) to register `handle_listing()` as the _default handler_ for processing HTTP responses.
-5. Inside the handler, we extract the page title from the `soup` object and print its text without whitespace.
-6. At the end of the function, we run the crawler on a product listing URL and await its completion.
-7. The last two lines ensure that if the file is executed directly, Python will properly run the `main()` function using its asynchronous event loop.
+1. Import the necessary module.
+2. Create a crawler object, which manages the scraping process. In this case, it's a `CheerioCrawler`, which requests HTML from websites and parses it with Cheerio. Other crawlers, such as `PlaywrightCrawler`, would be suitable if we wanted to scrape by automating a real browser.
+3. Define an asynchronous `requestHandler` function. It receives a context object with Cheerio's `$` instance and a logger.
+4. Extract the page title and log it.
+5. Run the crawler on a product listing URL and await its completion.
-Don't worry if some of this is new. We don't need to know exactly how [`asyncio`](https://docs.python.org/3/library/asyncio.html), decorators, or type hints work. Let's stick to the practical side and observe what the program does when executed:
+Let's see what it does when we run it:
```text
-$ python main.py
-[BeautifulSoupCrawler] INFO Current request statistics:
-┌───────────────────────────────┬──────────┐
-│ requests_finished │ 0 │
-│ requests_failed │ 0 │
-│ retry_histogram │ [0] │
-│ request_avg_failed_duration │ None │
-│ request_avg_finished_duration │ None │
-│ requests_finished_per_minute │ 0 │
-│ requests_failed_per_minute │ 0 │
-│ request_total_duration │ 0.0 │
-│ requests_total │ 0 │
-│ crawler_runtime │ 0.010014 │
-└───────────────────────────────┴──────────┘
-[crawlee._autoscaling.autoscaled_pool] INFO current_concurrency = 0; desired_concurrency = 2; cpu = 0; mem = 0; event_loop = 0.0; client_info = 0.0
-Sales
-[crawlee._autoscaling.autoscaled_pool] INFO Waiting for remaining tasks to finish
-[BeautifulSoupCrawler] INFO Final request statistics:
-┌───────────────────────────────┬──────────┐
-│ requests_finished │ 1 │
-│ requests_failed │ 0 │
-│ retry_histogram │ [1] │
-│ request_avg_failed_duration │ None │
-│ request_avg_finished_duration │ 0.308998 │
-│ requests_finished_per_minute │ 185 │
-│ requests_failed_per_minute │ 0 │
-│ request_total_duration │ 0.308998 │
-│ requests_total │ 1 │
-│ crawler_runtime │ 0.323721 │
-└───────────────────────────────┴──────────┘
+$ node index.js
+INFO CheerioCrawler: Starting the crawler.
+INFO CheerioCrawler: Sales
+INFO CheerioCrawler: All requests from the queue have been processed, the crawler will shut down.
+INFO CheerioCrawler: Final request statistics: {"requestsFinished":1,"requestsFailed":0,"retryHistogram":[1],"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":388,"requestsFinishedPerMinute":131,"requestsFailedPerMinute":0,"requestTotalDurationMillis":388,"requestsTotal":1,"crawlerRuntimeMillis":458}
+INFO CheerioCrawler: Finished! Total 1 requests: 1 succeeded, 0 failed. {"terminal":true}
```
-If our previous scraper didn't give us any sense of progress, Crawlee feeds us with perhaps too much information for the purposes of a small program. Among all the logging, notice the line `Sales`. That's the page title! We managed to create a Crawlee scraper that downloads the product listing page, parses it with Beautiful Soup, extracts the title, and prints it.
-
-:::tip Advanced Python features
-
-You don't need to be an expert in asynchronous programming, decorators, or type hints to finish this lesson, but you might find yourself curious for more details. If so, check out [Async IO in Python: A Complete Walkthrough](https://realpython.com/async-io-python/), [Primer on Python Decorators](https://realpython.com/primer-on-python-decorators/), and [Python Type Checking](https://realpython.com/python-type-checking/).
-
-:::
+If our previous scraper didn't give us any sense of progress, Crawlee feeds us with perhaps too much information for the purposes of a small program. Among all the logging, notice the line with `Sales`. That's the page title! We managed to create a Crawlee scraper that downloads the product listing page, parses it with Cheerio, extracts the title, and prints it.
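+
+If the logging ever feels too noisy for such a small program, Crawlee's logger can be dialed down. This is an optional aside, shown only as a sketch:
+
+```js
+import { log, LogLevel } from 'crawlee';
+
+// Show only warnings and errors from the crawler
+log.setLevel(LogLevel.WARNING);
+```
+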
## Crawling product detail pages
-The code now features advanced Python concepts, so it's less accessible to beginners, and the size of the program is about the same as if we worked without a framework. The tradeoff of using a framework is that primitive scenarios may become unnecessarily complex, while complex scenarios may become surprisingly primitive. As we rewrite the rest of the program, the benefits of using Crawlee will become more apparent.
-
-For example, it takes a single line of code to extract and follow links to products. Three more lines, and we have parallel processing of all the product detail pages:
-
-```py
-import asyncio
-from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
+The code is now less accessible to beginners, and the size of the program is about the same as if we worked without a framework. The tradeoff of using a framework is that primitive scenarios may become unnecessarily complex, while complex scenarios may become surprisingly primitive. As we rewrite the rest of the program, the benefits of using Crawlee will become more apparent.
-async def main():
- crawler = BeautifulSoupCrawler()
+For example, it takes only a few changes to the code to extract and follow links to all the product detail pages:
- @crawler.router.default_handler
- async def handle_listing(context: BeautifulSoupCrawlingContext):
- # highlight-next-line
- await context.enqueue_links(label="DETAIL", selector=".product-list a.product-item__title")
+```js
+import { CheerioCrawler } from 'crawlee';
- # highlight-next-line
- @crawler.router.handler("DETAIL")
- # highlight-next-line
- async def handle_detail(context: BeautifulSoupCrawlingContext):
- # highlight-next-line
- print(context.request.url)
+const crawler = new CheerioCrawler({
+ // highlight-start
+ async requestHandler({ $, log, request, enqueueLinks }) {
+ if (request.label === 'DETAIL') {
+ log.info(request.url);
+ } else {
+ await enqueueLinks({ label: 'DETAIL', selector: '.product-list a.product-item__title' });
+ }
+ },
+ // highlight-end
+});
- await crawler.run(["https://warehouse-theme-metal.myshopify.com/collections/sales"])
-
-if __name__ == '__main__':
- asyncio.run(main())
+await crawler.run(['https://warehouse-theme-metal.myshopify.com/collections/sales']);
```
-First, it's necessary to inspect the page in browser DevTools to figure out the CSS selector that allows us to locate links to all the product detail pages. Then we can use the `enqueue_links()` method to find the links and add them to Crawlee's internal HTTP request queue. We tell the method to label all the requests as `DETAIL`.
+First, it's necessary to inspect the page in browser DevTools to figure out the CSS selector that allows us to locate links to all the product detail pages. Then we can use the `enqueueLinks()` method to find the links and add them to Crawlee's internal HTTP request queue. We tell the method to label all the requests as `DETAIL`.
-Below that, we give the crawler another asynchronous function, `handle_detail()`. We again inform the crawler that this function is a handler using a decorator, but this time it's not a default one. This handler will only take care of HTTP requests labeled as `DETAIL`. For now, all it does is print the request URL.
+For each request, Crawlee runs the same handler function, so we need to check the label of the request being processed. For requests labeled `DETAIL`, we log the URL; otherwise, we assume we're processing the listing page.
-If we run the code, we should see how Crawlee first downloads the listing page and then makes parallel requests to each of the detail pages, printing their URLs along the way:
+If we run the code, we should see how Crawlee first downloads the listing page and then makes parallel requests to each of the detail pages, logging their URLs along the way:
```text
-$ python main.py
-[BeautifulSoupCrawler] INFO Current request statistics:
-┌───────────────────────────────┬──────────┐
-...
-└───────────────────────────────┴──────────┘
-[crawlee._autoscaling.autoscaled_pool] INFO current_concurrency = 0; desired_concurrency = 2; cpu = 0; mem = 0; event_loop = 0.0; client_info = 0.0
-https://warehouse-theme-metal.myshopify.com/products/sony-xbr-65x950g-65-class-64-5-diag-bravia-4k-hdr-ultra-hd-tv
-https://warehouse-theme-metal.myshopify.com/products/jbl-flip-4-waterproof-portable-bluetooth-speaker
-https://warehouse-theme-metal.myshopify.com/products/sony-sacs9-10-inch-active-subwoofer
-https://warehouse-theme-metal.myshopify.com/products/sony-ps-hx500-hi-res-usb-turntable
+$ node index.js
+INFO CheerioCrawler: Starting the crawler.
+INFO CheerioCrawler: https://warehouse-theme-metal.myshopify.com/products/sony-xbr55a8f-55-inch-4k-ultra-hd-smart-bravia-oled-tv
+INFO CheerioCrawler: https://warehouse-theme-metal.myshopify.com/products/klipsch-r-120sw-powerful-detailed-home-speaker-set-of-1
...
-[crawlee._autoscaling.autoscaled_pool] INFO Waiting for remaining tasks to finish
-[BeautifulSoupCrawler] INFO Final request statistics:
-┌───────────────────────────────┬──────────┐
-│ requests_finished │ 25 │
-│ requests_failed │ 0 │
-│ retry_histogram │ [25] │
-│ request_avg_failed_duration │ None │
-│ request_avg_finished_duration │ 0.349434 │
-│ requests_finished_per_minute │ 318 │
-│ requests_failed_per_minute │ 0 │
-│ request_total_duration │ 8.735843 │
-│ requests_total │ 25 │
-│ crawler_runtime │ 4.713262 │
-└───────────────────────────────┴──────────┘
```
-In the final stats, we can see that we made 25 requests (1 listing page + 24 product pages) in less than 5 seconds. Your numbers might differ, but regardless, it should be much faster than making the requests sequentially.
+In the final stats, we can see that we made 25 requests (1 listing page + 24 product pages) in just a few seconds. What we cannot see is that these requests aren't fired all at once without planning. Instead, they're scheduled and sent in a way that doesn't overload the target server, and if any of them fail, Crawlee automatically retries them.
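+
+If we ever need to tweak this behavior, the crawler constructor accepts options such as `maxConcurrency`, `maxRequestsPerMinute`, and `maxRequestRetries`. A minimal sketch with arbitrary example values, just to illustrate—our scraper doesn't need any of this:
+
+```js
+import { CheerioCrawler } from 'crawlee';
+
+const crawler = new CheerioCrawler({
+  maxConcurrency: 5, // never run more than 5 requests in parallel
+  maxRequestsPerMinute: 120, // overall throttle
+  maxRequestRetries: 2, // how many times a failed request gets retried
+  async requestHandler({ request, log }) {
+    log.info(request.url);
+  },
+});
+
+await crawler.run(['https://warehouse-theme-metal.myshopify.com/collections/sales']);
+```
+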
## Extracting data
-The Beautiful Soup crawler provides handlers with the `context.soup` attribute, which contains the parsed HTML of the handled page. This is the same `soup` object we used in our previous program. Let's locate and extract the same data as before:
-
-```py
-async def main():
- ...
-
- @crawler.router.handler("DETAIL")
- async def handle_detail(context: BeautifulSoupCrawlingContext):
- item = {
- "url": context.request.url,
- "title": context.soup.select_one(".product-meta__title").text.strip(),
- "vendor": context.soup.select_one(".product-meta__vendor").text.strip(),
+The `CheerioCrawler` provides the handler with the `$` property, which contains the parsed HTML of the handled page. This is the same `$` object we used in our previous program. Let's locate and extract the same data as before:
+
+```js
+const crawler = new CheerioCrawler({
+ async requestHandler({ $, request, enqueueLinks, log }) {
+ if (request.label === 'DETAIL') {
+ const item = {
+ url: request.url,
+ title: $('.product-meta__title').text().trim(),
+ vendor: $('.product-meta__vendor').text().trim(),
+ };
+ log.info("Item scraped", item);
+ } else {
+ await enqueueLinks({ selector: '.product-list a.product-item__title', label: 'DETAIL' });
}
- print(item)
+ },
+});
```
-:::note Fragile code
-
-The code above assumes the `.select_one()` call doesn't return `None`. If your editor checks types, it might even warn that `text` is not a known attribute of `None`. This isn't robust and could break, but in our program, that's fine. We expect the elements to be there, and if they're not, we'd rather the scraper break quickly—it's a sign something's wrong and needs fixing.
-
-:::
-
-Now for the price. We're not doing anything new here—just import `Decimal` and copy-paste the code from our old scraper.
-
-The only change will be in the selector. In `main.py`, we looked for `.price` within a `product_soup` object representing a product card. Now, we're looking for `.price` within the entire product detail page. It's better to be more specific so we don't accidentally match another price on the same page:
-
-```py
-async def main():
- ...
-
- @crawler.router.handler("DETAIL")
- async def handle_detail(context: BeautifulSoupCrawlingContext):
- price_text = (
- context.soup
- # highlight-next-line
- .select_one(".product-form__info-content .price")
- .contents[-1]
- .strip()
- .replace("$", "")
- .replace(",", "")
- )
- item = {
- "url": context.request.url,
- "title": context.soup.select_one(".product-meta__title").text.strip(),
- "vendor": context.soup.select_one(".product-meta__vendor").text.strip(),
- "price": Decimal(price_text),
+Now for the price. We're not doing anything new here—just copy-paste the code from our old scraper.
+
+The only change will be in the selector. In `oldindex.js`, we looked for `.price` within a `$productItem` object representing a product card. Here, we're looking for `.price` within the entire product detail page. It's better to be more specific so we don't accidentally match another price on the same page:
+
+```js
+const crawler = new CheerioCrawler({
+ async requestHandler({ $, request, enqueueLinks, log }) {
+ if (request.label === 'DETAIL') {
+ // highlight-next-line
+ const $price = $(".product-form__info-content .price").contents().last();
+ const priceRange = { minPrice: null, price: null };
+ const priceText = $price
+ .text()
+ .trim()
+ .replace("$", "")
+ .replace(".", "")
+ .replace(",", "");
+
+ if (priceText.startsWith("From ")) {
+ priceRange.minPrice = parseInt(priceText.replace("From ", ""));
+ } else {
+ priceRange.minPrice = parseInt(priceText);
+ priceRange.price = priceRange.minPrice;
+ }
+
+ const item = {
+ url: request.url,
+ title: $(".product-meta__title").text().trim(),
+ vendor: $('.product-meta__vendor').text().trim(),
+ ...priceRange,
+ };
+ log.info("Item scraped", item);
+ } else {
+ await enqueueLinks({ selector: '.product-list a.product-item__title', label: 'DETAIL' });
}
- print(item)
+ },
+});
```
-Finally, the variants. We can reuse the `parse_variant()` function as-is, and in the handler we'll again take inspiration from what we had in `main.py`. The full program will look like this:
-
-```py
-import asyncio
-from decimal import Decimal
-from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
-
-async def main():
- crawler = BeautifulSoupCrawler()
-
- @crawler.router.default_handler
- async def handle_listing(context: BeautifulSoupCrawlingContext):
- await context.enqueue_links(selector=".product-list a.product-item__title", label="DETAIL")
-
- @crawler.router.handler("DETAIL")
- async def handle_detail(context: BeautifulSoupCrawlingContext):
- price_text = (
- context.soup
- .select_one(".product-form__info-content .price")
- .contents[-1]
- .strip()
- .replace("$", "")
- .replace(",", "")
- )
- item = {
- "url": context.request.url,
- "title": context.soup.select_one(".product-meta__title").text.strip(),
- "vendor": context.soup.select_one(".product-meta__vendor").text.strip(),
- "price": Decimal(price_text),
- "variant_name": None,
+Finally, the variants. We can reuse the `parseVariant()` function as-is. In the handler, we'll take some inspiration from what we have in `oldindex.js`, but since we're just logging the items and don't need to return them, the loop can be simpler. First, in the item data, we'll set `variantName` to `null` as a default value. If there are no variants, we'll log the item data as-is. If there are variants, we'll parse each one, merge the variant data with the item data, and log each resulting object. The full program will look like this:
+
+```js
+import { CheerioCrawler } from 'crawlee';
+
+function parseVariant($option) {
+ const [variantName, priceText] = $option
+ .text()
+ .trim()
+ .split(" - ");
+ const price = parseInt(
+ priceText
+ .replace("$", "")
+ .replace(".", "")
+ .replace(",", "")
+ );
+ return { variantName, price };
+}
+
+const crawler = new CheerioCrawler({
+ async requestHandler({ $, request, enqueueLinks, log }) {
+ if (request.label === 'DETAIL') {
+ const $price = $(".product-form__info-content .price").contents().last();
+ const priceRange = { minPrice: null, price: null };
+ const priceText = $price
+ .text()
+ .trim()
+ .replace("$", "")
+ .replace(".", "")
+ .replace(",", "");
+
+ if (priceText.startsWith("From ")) {
+ priceRange.minPrice = parseInt(priceText.replace("From ", ""));
+ } else {
+ priceRange.minPrice = parseInt(priceText);
+ priceRange.price = priceRange.minPrice;
+ }
+
+ const item = {
+ url: request.url,
+ title: $(".product-meta__title").text().trim(),
+ vendor: $('.product-meta__vendor').text().trim(),
+ ...priceRange,
+ // highlight-next-line
+ variantName: null,
+ };
+
+ // highlight-start
+ const $variants = $(".product-form__option.no-js option");
+ if ($variants.length === 0) {
+ log.info("Item scraped", item);
+ } else {
+ for (const element of $variants.toArray()) {
+ const variant = parseVariant($(element));
+ log.info("Item scraped", { ...item, ...variant });
+ }
+ }
+ // highlight-end
+ } else {
+ await enqueueLinks({ selector: '.product-list a.product-item__title', label: 'DETAIL' });
}
- if variants := context.soup.select(".product-form__option.no-js option"):
- for variant in variants:
- print(item | parse_variant(variant))
- else:
- print(item)
-
- await crawler.run(["https://warehouse-theme-metal.myshopify.com/collections/sales"])
-
-def parse_variant(variant):
- text = variant.text.strip()
- name, price_text = text.split(" - ")
- price = Decimal(
- price_text
- .replace("$", "")
- .replace(",", "")
- )
- return {"variant_name": name, "price": price}
-
-if __name__ == '__main__':
- asyncio.run(main())
+ },
+});
+
+await crawler.run(['https://warehouse-theme-metal.myshopify.com/collections/sales']);
```
-If we run this scraper, we should get the same data for the 24 products as before. Crawlee has saved us a lot of effort by managing downloading, parsing, and parallelization. The code is also cleaner, with two separate and labeled handlers.
+If we run this scraper, we should get the same data for the 24 products as before. Crawlee has saved us a lot of effort by managing downloading, parsing, and parallelization.
Crawlee doesn't do much to help with locating and extracting the data—that part of the code remains almost the same, framework or not. This is because the detective work of finding and extracting the right data is the core value of custom scrapers. With Crawlee, we can focus on just that while letting the framework take care of everything else.
## Saving data
-When we're at _letting the framework take care of everything else_, let's take a look at what it can do about saving data. As of now the product detail page handler prints each item as soon as the item is ready. Instead, we can push the item to Crawlee's default dataset:
-
-```py
-async def main():
- ...
-
- @crawler.router.handler("DETAIL")
- async def handle_detail(context: BeautifulSoupCrawlingContext):
- price_text = (
- ...
- )
- item = {
- ...
+Speaking of _letting the framework take care of everything else_, let's take a look at what it can do about saving data. As of now, the product detail page handler logs each item as soon as the item is ready. Instead, we can push the item to Crawlee's default dataset:
+
+```js
+const crawler = new CheerioCrawler({
+ // highlight-next-line
+ async requestHandler({ $, request, enqueueLinks, pushData, log }) {
+ if (request.label === 'DETAIL') {
+ ...
+
+ const $variants = $(".product-form__option.no-js option");
+ if ($variants.length === 0) {
+ // highlight-next-line
+        await pushData(item);
+ } else {
+ for (const element of $variants.toArray()) {
+ const variant = parseVariant($(element));
+ // highlight-next-line
+          await pushData({ ...item, ...variant });
}
- if variants := context.soup.select(".product-form__option.no-js option"):
- for variant in variants:
- # highlight-next-line
- await context.push_data(item | parse_variant(variant))
- else:
- # highlight-next-line
- await context.push_data(item)
+ }
+ } else {
+ ...
+ }
+ },
+});
```
-That's it! If we run the program now, there should be a `storage` directory alongside the `main.py` file. Crawlee uses it to store its internal state. If we go to the `storage/datasets/default` subdirectory, we'll see over 30 JSON files, each representing a single item.
+That's it! If we run the program now, there should be a `storage` directory alongside the `index.js` file. Crawlee uses it to store its internal state. If we go to the `storage/datasets/default` subdirectory, we'll see over 30 JSON files, each representing a single item.

-We can also export all the items to a single file of our choice. We'll do it at the end of the `main()` function, after the crawler has finished scraping:
-
-```py
-async def main():
- ...
+We can also export all the items to a single file of our choice. We'll do it at the end of the program, after the crawler has finished scraping:
- await crawler.run(["https://warehouse-theme-metal.myshopify.com/collections/sales"])
- # highlight-next-line
- await crawler.export_data_json(path='dataset.json', ensure_ascii=False, indent=2)
- # highlight-next-line
- await crawler.export_data_csv(path='dataset.csv')
+```js
+await crawler.run(['https://warehouse-theme-metal.myshopify.com/collections/sales']);
+await crawler.exportData('dataset.json');
+await crawler.exportData('dataset.csv');
```
-After running the scraper again, there should be two new files in your directory, `dataset.json` and `dataset.csv`, containing all the data. If we peek into the JSON file, it should have indentation.
+After running the scraper again, there should be two new files in your directory, `dataset.json` and `dataset.csv`, containing all the data.
## Logging
-Crawlee gives us stats about HTTP requests and concurrency, but we don't get much visibility into the pages we're crawling or the items we're saving. Let's add some custom logging:
-
-```py
-import asyncio
-from decimal import Decimal
-from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
-
-async def main():
- crawler = BeautifulSoupCrawler()
-
- @crawler.router.default_handler
- async def handle_listing(context: BeautifulSoupCrawlingContext):
- # highlight-next-line
- context.log.info("Looking for product detail pages")
- await context.enqueue_links(selector=".product-list a.product-item__title", label="DETAIL")
-
- @crawler.router.handler("DETAIL")
- async def handle_detail(context: BeautifulSoupCrawlingContext):
- # highlight-next-line
- context.log.info(f"Product detail page: {context.request.url}")
- price_text = (
- context.soup
- .select_one(".product-form__info-content .price")
- .contents[-1]
- .strip()
- .replace("$", "")
- .replace(",", "")
- )
- item = {
- "url": context.request.url,
- "title": context.soup.select_one(".product-meta__title").text.strip(),
- "vendor": context.soup.select_one(".product-meta__vendor").text.strip(),
- "price": Decimal(price_text),
- "variant_name": None,
+Crawlee gives us stats about HTTP requests and concurrency, but once we started using `pushData()` instead of `log.info()`, we lost visibility into the pages we're crawling and the items we're saving. Let's add back some custom logging:
+
+```js
+import { CheerioCrawler } from 'crawlee';
+
+function parseVariant($option) {
+ const [variantName, priceText] = $option
+ .text()
+ .trim()
+ .split(" - ");
+ const price = parseInt(
+ priceText
+ .replace("$", "")
+ .replace(".", "")
+ .replace(",", "")
+ );
+ return { variantName, price };
+}
+
+const crawler = new CheerioCrawler({
+ async requestHandler({ $, request, enqueueLinks, pushData, log }) {
+ if (request.label === 'DETAIL') {
+ // highlight-next-line
+ log.info(`Product detail page: ${request.url}`);
+
+ const $price = $(".product-form__info-content .price").contents().last();
+ const priceRange = { minPrice: null, price: null };
+ const priceText = $price
+ .text()
+ .trim()
+ .replace("$", "")
+ .replace(".", "")
+ .replace(",", "");
+
+ if (priceText.startsWith("From ")) {
+ priceRange.minPrice = parseInt(priceText.replace("From ", ""));
+ } else {
+ priceRange.minPrice = parseInt(priceText);
+ priceRange.price = priceRange.minPrice;
+ }
+
+ const item = {
+ url: request.url,
+ title: $(".product-meta__title").text().trim(),
+ vendor: $('.product-meta__vendor').text().trim(),
+ ...priceRange,
+ variantName: null,
+ };
+
+ const $variants = $(".product-form__option.no-js option");
+ if ($variants.length === 0) {
+ // highlight-next-line
+ log.info('Saving a product');
+        await pushData(item);
+ } else {
+ for (const element of $variants.toArray()) {
+ const variant = parseVariant($(element));
+ // highlight-next-line
+ log.info('Saving a product variant');
+          await pushData({ ...item, ...variant });
+ }
+ }
+ } else {
+ // highlight-next-line
+ log.info('Looking for product detail pages');
+ await enqueueLinks({ selector: '.product-list a.product-item__title', label: 'DETAIL' });
}
- if variants := context.soup.select(".product-form__option.no-js option"):
- for variant in variants:
- # highlight-next-line
- context.log.info("Saving a product variant")
- await context.push_data(item | parse_variant(variant))
- else:
- # highlight-next-line
- context.log.info("Saving a product")
- await context.push_data(item)
-
- await crawler.run(["https://warehouse-theme-metal.myshopify.com/collections/sales"])
-
- # highlight-next-line
- crawler.log.info("Exporting data")
- await crawler.export_data_json(path='dataset.json', ensure_ascii=False, indent=2)
- await crawler.export_data_csv(path='dataset.csv')
-
-def parse_variant(variant):
- text = variant.text.strip()
- name, price_text = text.split(" - ")
- price = Decimal(
- price_text
- .replace("$", "")
- .replace(",", "")
- )
- return {"variant_name": name, "price": price}
-
-if __name__ == '__main__':
- asyncio.run(main())
+ },
+});
+
+await crawler.run(['https://warehouse-theme-metal.myshopify.com/collections/sales']);
+// highlight-next-line
+crawler.log.info('Exporting data');
+await crawler.exportData('dataset.json');
+await crawler.exportData('dataset.csv');
```
-Depending on what we find helpful, we can tweak the logs to include more or less detail. The `context.log` or `crawler.log` objects are [standard Python loggers](https://docs.python.org/3/library/logging.html).
+Depending on what we find helpful, we can tweak the logs to include more or less detail. See the Crawlee docs on the [Log instance](https://crawlee.dev/js/api/core/class/Log) for more details on what you can do with it.
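+
+For instance, if the output ever gets too noisy, we can raise the threshold of Crawlee's global `log` helper so that only warnings and errors get printed. A minimal sketch, assuming we set the level before creating the crawler:
+
+```js
+import { log, LogLevel } from 'crawlee';
+
+// Suppress INFO messages from here on; only warnings and errors will be printed
+log.setLevel(LogLevel.WARNING);
+```
+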
-If we compare `main.py` and `oldmain.py` now, it's clear we've cut at least 20 lines of code compared to the original program, even with the extra logging we've added. Throughout this lesson, we've introduced features to match the old scraper's functionality, but at each phase, the code remained clean and readable. Plus, we've been able to focus on what's unique to the website we're scraping and the data we care about.
+If we compare `index.js` and `oldindex.js` now, it's clear we've cut at least 20 lines of code compared to the original program, even with the extra logging we've added. Throughout this lesson, we've introduced features to match the old scraper's functionality, but at each phase, the code remained clean and readable. Plus, we've been able to focus on what's unique to the website we're scraping and the data we care about.
In the next lesson, we'll use a scraping platform to set up our application to run automatically every day.
@@ -423,7 +382,7 @@ Scrape information about all [F1 Academy](https://en.wikipedia.org/wiki/F1_Acade
- Name
- Team
- Nationality
-- Date of birth (as a `date()` object)
+- Date of birth (as a string in `YYYY-MM-DD` format)
- Instagram URL
If you export the dataset as JSON, it should look something like this:
@@ -453,53 +412,46 @@ If you export the dataset as JSON, it should look something like this:
Hints:
-- Use Python's `datetime.strptime(text, "%d/%m/%Y").date()` to parse dates in the `DD/MM/YYYY` format. Check out the [docs](https://docs.python.org/3/library/datetime.html#datetime.datetime.strptime) for more details.
+- The website uses the `DD/MM/YYYY` format for the date of birth. You'll need to convert it to the ISO 8601 format (`YYYY-MM-DD`); see the snippet below the hints for one way to do it.
- To locate the Instagram URL, use the attribute selector `a[href*='instagram']`. Learn more about attribute selectors in the [MDN docs](https://developer.mozilla.org/en-US/docs/Web/CSS/Attribute_selectors).
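+
+One way to do the date conversion, assuming the value scraped from the page is exactly in the `DD/MM/YYYY` format:
+
+```js
+const dob = '01/02/2006'; // a made-up example value
+const isoDob = dob.split('/').reverse().join('-');
+console.log(isoDob); // 2006-02-01
+```
+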
Solution
- ```py
- import asyncio
- from datetime import datetime
-
- from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
-
- async def main():
- crawler = BeautifulSoupCrawler()
-
- @crawler.router.default_handler
- async def handle_listing(context: BeautifulSoupCrawlingContext):
- await context.enqueue_links(selector=".teams-driver-item a", label="DRIVER")
-
- @crawler.router.handler("DRIVER")
- async def handle_driver(context: BeautifulSoupCrawlingContext):
- info = {}
- for row in context.soup.select(".common-driver-info li"):
- name = row.select_one("span").text.strip()
- value = row.select_one("h4").text.strip()
- info[name] = value
-
- detail = {}
- for row in context.soup.select(".driver-detail--cta-group a"):
- name = row.select_one("p").text.strip()
- value = row.select_one("h2").text.strip()
- detail[name] = value
-
- await context.push_data({
- "url": context.request.url,
- "name": context.soup.select_one("h1").text.strip(),
- "team": detail["Team"],
- "nationality": info["Nationality"],
- "dob": datetime.strptime(info["DOB"], "%d/%m/%Y").date(),
- "instagram_url": context.soup.select_one(".common-social-share a[href*='instagram']").get("href"),
- })
-
- await crawler.run(["https://www.f1academy.com/Racing-Series/Drivers"])
- await crawler.export_data_json(path='dataset.json', ensure_ascii=False, indent=2)
-
- if __name__ == '__main__':
- asyncio.run(main())
+ ```js
+ import { CheerioCrawler } from 'crawlee';
+
+ const crawler = new CheerioCrawler({
+ async requestHandler({ $, request, enqueueLinks, pushData }) {
+ if (request.label === 'DRIVER') {
+ const info = {};
+ for (const itemElement of $('.common-driver-info li').toArray()) {
+ const name = $(itemElement).find('span').text().trim();
+ const value = $(itemElement).find('h4').text().trim();
+ info[name] = value;
+ }
+ const detail = {};
+ for (const linkElement of $('.driver-detail--cta-group a').toArray()) {
+ const name = $(linkElement).find('p').text().trim();
+ const value = $(linkElement).find('h2').text().trim();
+ detail[name] = value;
+ }
+        await pushData({
+          url: request.url,
+          name: $('h1').text().trim(),
+          team: detail['Team'],
+          nationality: info['Nationality'],
+          dob: info['DOB'].split('/').reverse().join('-'),
+          instagram_url: $(".common-social-share a[href*='instagram']").attr('href'),
+        });
+ } else {
+ await enqueueLinks({ selector: '.teams-driver-item a', label: 'DRIVER' });
+ }
+ },
+ });
+
+ await crawler.run(['https://www.f1academy.com/Racing-Series/Drivers']);
+ await crawler.exportData('dataset.json');
```
@@ -533,69 +485,60 @@ If you export the dataset as JSON, it should look something like this:
To scrape IMDb data, you'll need to construct a `Request` object with the appropriate search URL for each movie title. The following code snippet gives you an idea of how to do this:
-```py
-...
-from urllib.parse import quote_plus
+```js
+import { CheerioCrawler, Request } from 'crawlee';
+import { escape } from 'node:querystring';
-async def main():
- ...
+const imdbSearchUrl = `https://www.imdb.com/find/?q=${escape(name)}&s=tt&ttype=ft`;
+const request = new Request({ url: imdbSearchUrl, label: 'IMDB_SEARCH' });
+```
- @crawler.router.default_handler
- async def handle_netflix_table(context: BeautifulSoupCrawlingContext):
- requests = []
- for name_cell in context.soup.select(...):
- name = name_cell.text.strip()
- imdb_search_url = f"https://www.imdb.com/find/?q={quote_plus(name)}&s=tt&ttype=ft"
- requests.append(Request.from_url(imdb_search_url, label="..."))
- await context.add_requests(requests)
+Then use the `addRequests()` function to instruct Crawlee that it should follow an array of these manually constructed requests:
- ...
-...
+```js
+async requestHandler({ ..., addRequests }) {
+ ...
+ await addRequests(requests);
+},
```
-When navigating to the first search result, you might find it helpful to know that `context.enqueue_links()` accepts a `limit` keyword argument, letting you specify the max number of HTTP requests to enqueue.
+When navigating to the first IMDb search result, you might find it helpful to know that `enqueueLinks()` accepts a `limit` option, letting you specify the max number of HTTP requests to enqueue.
Solution
- ```py
- import asyncio
- from urllib.parse import quote_plus
-
- from crawlee import Request
- from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
-
- async def main():
- crawler = BeautifulSoupCrawler()
-
- @crawler.router.default_handler
- async def handle_netflix_table(context: BeautifulSoupCrawlingContext):
- requests = []
- for name_cell in context.soup.select(".list-tbl-global .tbl-cell-name"):
- name = name_cell.text.strip()
- imdb_search_url = f"https://www.imdb.com/find/?q={quote_plus(name)}&s=tt&ttype=ft"
- requests.append(Request.from_url(imdb_search_url, label="IMDB_SEARCH"))
- await context.add_requests(requests)
-
- @crawler.router.handler("IMDB_SEARCH")
- async def handle_imdb_search(context: BeautifulSoupCrawlingContext):
- await context.enqueue_links(selector=".find-result-item a", label="IMDB", limit=1)
-
- @crawler.router.handler("IMDB")
- async def handle_imdb(context: BeautifulSoupCrawlingContext):
- rating_selector = "[data-testid='hero-rating-bar__aggregate-rating__score']"
- rating_text = context.soup.select_one(rating_selector).text.strip()
- await context.push_data({
- "url": context.request.url,
- "title": context.soup.select_one("h1").text.strip(),
- "rating": rating_text,
- })
-
- await crawler.run(["https://www.netflix.com/tudum/top10"])
- await crawler.export_data_json(path='dataset.json', ensure_ascii=False, indent=2)
-
- if __name__ == '__main__':
- asyncio.run(main())
+ ```js
+ import { CheerioCrawler, Request } from 'crawlee';
+ import { escape } from 'node:querystring';
+
+ const crawler = new CheerioCrawler({
+ async requestHandler({ $, request, enqueueLinks, pushData, addRequests }) {
+ if (request.label === 'IMDB') {
+ // handle IMDB film page
+        await pushData({
+          url: request.url,
+          title: $('h1').text().trim(),
+          rating: $("[data-testid='hero-rating-bar__aggregate-rating__score']").first().text().trim(),
+        });
+ } else if (request.label === 'IMDB_SEARCH') {
+ // handle IMDB search results
+ await enqueueLinks({ selector: '.find-result-item a', label: 'IMDB', limit: 1 });
+
+ } else {
+ // handle Netflix table
+ const $buttons = $('[data-uia="top10-table-row-title"] button');
+ const requests = $buttons.toArray().map(buttonElement => {
+ const name = $(buttonElement).text().trim();
+ const imdbSearchUrl = `https://www.imdb.com/find/?q=${escape(name)}&s=tt&ttype=ft`;
+ return new Request({ url: imdbSearchUrl, label: 'IMDB_SEARCH' });
+ });
+ await addRequests(requests);
+ }
+ },
+ });
+
+ await crawler.run(['https://www.netflix.com/tudum/top10']);
+ await crawler.exportData('dataset.json');
```
diff --git a/sources/academy/webscraping/scraping_basics_javascript2/13_platform.md b/sources/academy/webscraping/scraping_basics_javascript2/13_platform.md
index 475f36a17..e4405a47d 100644
--- a/sources/academy/webscraping/scraping_basics_javascript2/13_platform.md
+++ b/sources/academy/webscraping/scraping_basics_javascript2/13_platform.md
@@ -35,9 +35,16 @@ Apify serves both as an infrastructure where to privately deploy and run own scr
## Getting access from the command line
-To control the platform from our machine and send the code of our program there, we'll need the Apify CLI. On macOS, we can install the CLI using [Homebrew](https://brew.sh), otherwise we'll first need [Node.js](https://nodejs.org/en/download).
+To control the platform from our machine and send the code of our program there, we'll need the Apify CLI. The [Apify CLI installation guide](https://docs.apify.com/cli/docs/installation) suggests we can install it with `npm` as a global package:
-After following the [Apify CLI installation guide](https://docs.apify.com/cli/docs/installation), we'll verify that we installed the tool by printing its version:
+```text
+$ npm -g install apify-cli
+
+added 440 packages in 2s
+...
+```
+
+Let's verify that we've installed the tool by printing its version:
```text
$ apify --version
@@ -52,191 +59,98 @@ $ apify login
Success: You are logged in to Apify as user1234!
```
-## Starting a real-world project
-
-Until now, we've kept our scrapers simple, each with just a single Python module like `main.py`, and we've added dependencies only by installing them with `pip` inside a virtual environment.
+## Turning our program into an Actor
-If we sent our code to a friend, they wouldn't know what to install to avoid import errors. The same goes for deploying to a cloud platform.
+Every program that runs on the Apify platform first needs to be packaged as a so-called [Actor](https://apify.com/actors)—a standardized container with designated places for input and output.
-To share our project, we need to package it. The best way is following the official [Python Packaging User Guide](https://packaging.python.org/), but for this course, we'll take a shortcut with the Apify CLI.
+Many [Actor templates](https://apify.com/templates/categories/javascript) simplify the setup for new projects. We'll skip those, as we're about to package an existing program.
-In our terminal, let's change to a directory where we usually start new projects. Then, we'll run the following command:
+Inside the project directory, we'll run the `apify init` command, followed by the name we want to give the Actor:
```text
-apify create warehouse-watchdog --template=python-crawlee-beautifulsoup
-```
-
-It will create a new subdirectory called `warehouse-watchdog` for the new project, containing all the necessary files:
-
-```text
-Info: Python version 0.0.0 detected.
-Info: Creating a virtual environment in ...
-...
-Success: Actor 'warehouse-watchdog' was created. To run it, run "cd warehouse-watchdog" and "apify run".
-Info: To run your code in the cloud, run "apify push" and deploy your code to Apify Console.
-Info: To install additional Python packages, you need to activate the virtual environment in the ".venv" folder in the actor directory.
-```
-
-## Adjusting the template
-
-Inside the `warehouse-watchdog` directory, we should see a `src` subdirectory containing several Python files, including `main.py`. This is a sample Beautiful Soup scraper provided by the template.
-
-The file contains a single asynchronous function, `main()`. At the beginning, it handles [input](https://docs.apify.com/platform/actors/running/input-and-output#input), then passes that input to a small crawler built on top of the Crawlee framework.
-
-Every program that runs on the Apify platform first needs to be packaged as a so-called [Actor](https://apify.com/actors)—a standardized container with designated places for input and output. Crawlee scrapers automatically connect their default dataset to the Actor output, but input must be handled explicitly in the code.
-
-
-
-We'll now adjust the template so that it runs our program for watching prices. As the first step, we'll create a new empty file, `crawler.py`, inside the `warehouse-watchdog/src` directory. Then, we'll fill this file with final, unchanged code from the previous lesson:
-
-```py title=warehouse-watchdog/src/crawler.py
-import asyncio
-from decimal import Decimal
-from crawlee.crawlers import BeautifulSoupCrawler
-
-async def main():
- crawler = BeautifulSoupCrawler()
-
- @crawler.router.default_handler
- async def handle_listing(context):
- context.log.info("Looking for product detail pages")
- await context.enqueue_links(selector=".product-list a.product-item__title", label="DETAIL")
-
- @crawler.router.handler("DETAIL")
- async def handle_detail(context):
- context.log.info(f"Product detail page: {context.request.url}")
- price_text = (
- context.soup
- .select_one(".product-form__info-content .price")
- .contents[-1]
- .strip()
- .replace("$", "")
- .replace(",", "")
- )
- item = {
- "url": context.request.url,
- "title": context.soup.select_one(".product-meta__title").text.strip(),
- "vendor": context.soup.select_one(".product-meta__vendor").text.strip(),
- "price": Decimal(price_text),
- "variant_name": None,
- }
- if variants := context.soup.select(".product-form__option.no-js option"):
- for variant in variants:
- context.log.info("Saving a product variant")
- await context.push_data(item | parse_variant(variant))
- else:
- context.log.info("Saving a product")
- await context.push_data(item)
-
- await crawler.run(["https://warehouse-theme-metal.myshopify.com/collections/sales"])
-
- crawler.log.info("Exporting data")
- await crawler.export_data_json(path='dataset.json', ensure_ascii=False, indent=2)
- await crawler.export_data_csv(path='dataset.csv')
-
-def parse_variant(variant):
- text = variant.text.strip()
- name, price_text = text.split(" - ")
- price = Decimal(
- price_text
- .replace("$", "")
- .replace(",", "")
- )
- return {"variant_name": name, "price": price}
-
-if __name__ == '__main__':
- asyncio.run(main())
+$ apify init warehouse-watchdog
+Success: The Actor has been initialized in the current directory.
```
-Now, let's replace the contents of `warehouse-watchdog/src/main.py` with this:
+The command creates an `.actor` directory with an `actor.json` file inside. This file serves as the Actor's configuration.
-```py title=warehouse-watchdog/src/main.py
-from apify import Actor
-from .crawler import main as crawl
+:::tip Hidden dot files
-async def main():
- async with Actor:
- await crawl()
-```
+On some systems, `.actor` might be hidden in the directory listing because it starts with a dot. Use your editor's built-in file explorer to locate it.
-We import our scraper as a function and await the result inside the Actor block. Unlike the sample scraper, the one we made in the previous lesson doesn't expect any input data, so we can omit the code that handles that part.
+:::
-Next, we'll change to the `warehouse-watchdog` directory in our terminal and verify that everything works locally before deploying the project to the cloud:
+We'll also need a few changes to our code. First, let's add the `apify` package, which is the [Apify SDK](https://docs.apify.com/sdk/js/):
```text
-$ apify run
-Run: /Users/course/Projects/warehouse-watchdog/.venv/bin/python3 -m src
-[apify] INFO Initializing Actor...
-[apify] INFO System info ({"apify_sdk_version": "0.0.0", "apify_client_version": "0.0.0", "crawlee_version": "0.0.0", "python_version": "0.0.0", "os": "xyz"})
-[BeautifulSoupCrawler] INFO Current request statistics:
-┌───────────────────────────────┬──────────┐
-│ requests_finished │ 0 │
-│ requests_failed │ 0 │
-│ retry_histogram │ [0] │
-│ request_avg_failed_duration │ None │
-│ request_avg_finished_duration │ None │
-│ requests_finished_per_minute │ 0 │
-│ requests_failed_per_minute │ 0 │
-│ request_total_duration │ 0.0 │
-│ requests_total │ 0 │
-│ crawler_runtime │ 0.016736 │
-└───────────────────────────────┴──────────┘
-[crawlee._autoscaling.autoscaled_pool] INFO current_concurrency = 0; desired_concurrency = 2; cpu = 0; mem = 0; event_loop = 0.0; client_info = 0.0
-[BeautifulSoupCrawler] INFO Looking for product detail pages
-[BeautifulSoupCrawler] INFO Product detail page: https://warehouse-theme-metal.myshopify.com/products/jbl-flip-4-waterproof-portable-bluetooth-speaker
-[BeautifulSoupCrawler] INFO Saving a product variant
-[BeautifulSoupCrawler] INFO Saving a product variant
+$ npm install apify --save
+
+added 123 packages, and audited 123 packages in 0s
...
```
-## Updating the Actor configuration
-
-The Actor configuration from the template tells the platform to expect input, so we need to update that before running our scraper in the cloud.
+Now we'll modify the program so that it initializes the Actor environment before the crawl starts and gracefully exits the Actor process once everything is done:
-Inside `warehouse-watchdog`, there's a directory called `.actor`. Within it, we'll edit the `input_schema.json` file, which looks like this by default:
+```js title="index.js"
+import { CheerioCrawler } from 'crawlee';
+// highlight-next-line
+import { Actor } from 'apify';
-```json title=warehouse-watchdog/src/.actor/input_schema.json
-{
- "title": "Python Crawlee BeautifulSoup Scraper",
- "type": "object",
- "schemaVersion": 1,
- "properties": {
- "start_urls": {
- "title": "Start URLs",
- "type": "array",
- "description": "URLs to start with",
- "prefill": [
- { "url": "https://apify.com" }
- ],
- "editor": "requestListSources"
- }
- },
- "required": ["start_urls"]
+function parseVariant($option) {
+ ...
}
-```
-:::tip Hidden dot files
+// highlight-next-line
+await Actor.init();
-On some systems, `.actor` might be hidden in the directory listing because it starts with a dot. Use your editor's built-in file explorer to locate it.
+const crawler = new CheerioCrawler({
+ ...
+});
-:::
+await crawler.run(['https://warehouse-theme-metal.myshopify.com/collections/sales']);
+crawler.log.info('Exporting data');
+await crawler.exportData('dataset.json');
+await crawler.exportData('dataset.csv');
-We'll remove the expected properties and the list of required ones. After our changes, the file should look like this:
+// highlight-next-line
+await Actor.exit();
+```
-```json title=warehouse-watchdog/src/.actor/input_schema.json
+Finally, let's tell others how to start the project. This isn't specific to Actors: JavaScript projects usually define a `start` script so that people and tools like Apify know how to run them. We'll add one to `package.json`:
+
+```json title="package.json"
{
- "title": "Python Crawlee BeautifulSoup Scraper",
- "type": "object",
- "schemaVersion": 1,
- "properties": {}
+ "name": "academy-example",
+ "version": "1.0.0",
+ ...
+ "scripts": {
+ // highlight-next-line
+ "start": "node index.js",
+ "test": "echo \"Error: no test specified\" && exit 1"
+ },
+ "dependencies": {
+ ...
+ }
}
```
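+
+With the script in place, we can give it a quick try. Running `npm start` should launch the scraper directly and print something like this (output abridged):
+
+```text
+$ npm start
+
+> academy-example@1.0.0 start
+> node index.js
+
+INFO System info {"apifyVersion":"0.0.0","apifyClientVersion":"0.0.0","crawleeVersion":"0.0.0","osType":"Darwin","nodeVersion":"v0.0.0"}
+INFO CheerioCrawler: Starting the crawler.
+...
+```
+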
-:::danger Trailing commas in JSON
+That's it! Before deploying the project to the cloud, let's verify that everything works locally:
-Make sure there's no trailing comma after `{}`, or the file won't be valid JSON.
+```text
+$ apify run
+Run: npm run start
-:::
+> academy-example@1.0.0 start
+> node index.js
+
+INFO System info {"apifyVersion":"0.0.0","apifyClientVersion":"0.0.0","crawleeVersion":"0.0.0","osType":"Darwin","nodeVersion":"v0.0.0"}
+INFO CheerioCrawler: Starting the crawler.
+INFO CheerioCrawler: Looking for product detail pages
+INFO CheerioCrawler: Product detail page: https://warehouse-theme-metal.myshopify.com/products/jbl-flip-4-waterproof-portable-bluetooth-speaker
+INFO CheerioCrawler: Saving a product variant
+INFO CheerioCrawler: Saving a product variant
+...
+```
## Deploying the scraper
@@ -263,7 +177,7 @@ When the run finishes, the interface will turn green. On the **Output** tab, we
:::info Accessing data
-We don't need to click buttons to download the data. It's possible to retrieve it also using Apify's API, the `apify datasets` CLI command, or the Python SDK. Learn more in the [Dataset docs](https://docs.apify.com/platform/storage/dataset).
+We don't need to click buttons to download the data. We can also retrieve it using Apify's API, the `apify datasets` CLI command, or the JavaScript SDK. Learn more in the [Dataset docs](https://docs.apify.com/platform/storage/dataset).
:::
@@ -279,103 +193,95 @@ From now on, the Actor will execute daily. We can inspect each run, view logs, c
If monitoring shows that our scraper frequently fails to reach the Warehouse Shop website, it's likely being blocked. To avoid this, we can [configure proxies](https://docs.apify.com/platform/proxy) so our requests come from different locations, reducing the chances of detection and blocking.
-Proxy configuration is a type of Actor input, so let's start by reintroducing the necessary code. We'll update `warehouse-watchdog/src/main.py` like this:
-
-```py title=warehouse-watchdog/src/main.py
-from apify import Actor
-from .crawler import main as crawl
+Proxy configuration is a type of [Actor input](https://docs.apify.com/platform/actors/running/input-and-output#input). Crawlee scrapers automatically connect their default dataset to the Actor output, but input must be handled manually. Inside the `.actor` directory we'll create a new file, `inputSchema.json`, with the following content:
-async def main():
- async with Actor:
- input_data = await Actor.get_input()
+```json title=".actor/inputSchema.json"
+{
+ "title": "Crawlee Cheerio Scraper",
+ "type": "object",
+ "schemaVersion": 1,
+ "properties": {
+ "proxyConfig": {
+ "title": "Proxy config",
+ "description": "Proxy configuration",
+ "type": "object",
+ "editor": "proxy",
+ "prefill": {
+ "useApifyProxy": true,
+ "apifyProxyGroups": []
+ },
+ "default": {
+ "useApifyProxy": true,
+ "apifyProxyGroups": []
+ }
+ }
+ }
+}
+```
- if actor_proxy_input := input_data.get("proxyConfig"):
- proxy_config = await Actor.create_proxy_configuration(actor_proxy_input=actor_proxy_input)
- else:
- proxy_config = None
+Now let's connect this file to the Actor configuration. In `actor.json`, we'll add one more line:
- await crawl(proxy_config)
+```json title=".actor/actor.json"
+{
+ "actorSpecification": 1,
+ "name": "warehouse-watchdog",
+ "version": "0.0",
+ "buildTag": "latest",
+ "environmentVariables": {},
+ // highlight-next-line
+ "input": "./inputSchema.json"
+}
```
-Next, we'll add `proxy_config` as an optional parameter in `warehouse-watchdog/src/crawler.py`. Thanks to the built-in integration between Apify and Crawlee, we only need to pass it to `BeautifulSoupCrawler()`, and the class will handle the rest:
+:::danger Trailing commas in JSON
-```py title=warehouse-watchdog/src/crawler.py
-import asyncio
-from decimal import Decimal
-from crawlee.crawlers import BeautifulSoupCrawler
+Make sure there's no trailing comma after the line, or the file won't be valid JSON.
-# highlight-next-line
-async def main(proxy_config = None):
- # highlight-next-line
- crawler = BeautifulSoupCrawler(proxy_configuration=proxy_config)
- # highlight-next-line
- crawler.log.info(f"Using proxy: {'yes' if proxy_config else 'no'}")
+:::
- @crawler.router.default_handler
- async def handle_listing(context):
- context.log.info("Looking for product detail pages")
- await context.enqueue_links(selector=".product-list a.product-item__title", label="DETAIL")
+That tells the platform our Actor expects a proxy configuration on input. We'll also update `index.js`. Thanks to the built-in integration between Apify and Crawlee, we can pass the proxy configuration as-is to the `CheerioCrawler`:
+```js
+...
+await Actor.init();
+// highlight-next-line
+const proxyConfiguration = await Actor.createProxyConfiguration();
+
+const crawler = new CheerioCrawler({
+ // highlight-next-line
+ proxyConfiguration,
+ async requestHandler({ $, request, enqueueLinks, pushData, log }) {
...
-```
-
-Finally, we'll modify the Actor configuration in `warehouse-watchdog/src/.actor/input_schema.json` to include the `proxyConfig` input parameter:
+ },
+});
-```json title=warehouse-watchdog/src/.actor/input_schema.json
-{
- "title": "Python Crawlee BeautifulSoup Scraper",
- "type": "object",
- "schemaVersion": 1,
- "properties": {
- "proxyConfig": {
- "title": "Proxy config",
- "description": "Proxy configuration",
- "type": "object",
- "editor": "proxy",
- "prefill": {
- "useApifyProxy": true,
- "apifyProxyGroups": []
- },
- "default": {
- "useApifyProxy": true,
- "apifyProxyGroups": []
- }
- }
- }
-}
+// highlight-next-line
+crawler.log.info(`Using proxy: ${proxyConfiguration ? 'yes' : 'no'}`);
+await crawler.run(['https://warehouse-theme-metal.myshopify.com/collections/sales']);
+...
```
To verify everything works, we'll run the scraper locally. We'll use the `apify run` command again, but this time with the `--purge` option to ensure we're not reusing data from a previous run:
```text
$ apify run --purge
-Info: All default local stores were purged.
-Run: /Users/course/Projects/warehouse-watchdog/.venv/bin/python3 -m src
-[apify] INFO Initializing Actor...
-[apify] INFO System info ({"apify_sdk_version": "0.0.0", "apify_client_version": "0.0.0", "crawlee_version": "0.0.0", "python_version": "0.0.0", "os": "xyz"})
-[BeautifulSoupCrawler] INFO Using proxy: no
-[BeautifulSoupCrawler] INFO Current request statistics:
-┌───────────────────────────────┬──────────┐
-│ requests_finished │ 0 │
-│ requests_failed │ 0 │
-│ retry_histogram │ [0] │
-│ request_avg_failed_duration │ None │
-│ request_avg_finished_duration │ None │
-│ requests_finished_per_minute │ 0 │
-│ requests_failed_per_minute │ 0 │
-│ request_total_duration │ 0.0 │
-│ requests_total │ 0 │
-│ crawler_runtime │ 0.014976 │
-└───────────────────────────────┴──────────┘
-[crawlee._autoscaling.autoscaled_pool] INFO current_concurrency = 0; desired_concurrency = 2; cpu = 0; mem = 0; event_loop = 0.0; client_info = 0.0
-[BeautifulSoupCrawler] INFO Looking for product detail pages
-[BeautifulSoupCrawler] INFO Product detail page: https://warehouse-theme-metal.myshopify.com/products/jbl-flip-4-waterproof-portable-bluetooth-speaker
-[BeautifulSoupCrawler] INFO Saving a product variant
-[BeautifulSoupCrawler] INFO Saving a product variant
+Run: npm run start
+
+> academy-example@1.0.0 start
+> node index.js
+
+INFO System info {"apifyVersion":"0.0.0","apifyClientVersion":"0.0.0","crawleeVersion":"0.0.0","osType":"Darwin","nodeVersion":"v0.0.0"}
+WARN ProxyConfiguration: The "Proxy external access" feature is not enabled for your account. Please upgrade your plan or contact support@apify.com
+INFO CheerioCrawler: Using proxy: no
+INFO CheerioCrawler: Starting the crawler.
+INFO CheerioCrawler: Looking for product detail pages
+INFO CheerioCrawler: Product detail page: https://warehouse-theme-metal.myshopify.com/products/denon-ah-c720-in-ear-headphones
+INFO CheerioCrawler: Saving a product variant
+INFO CheerioCrawler: Saving a product variant
...
```
-In the logs, we should see `Using proxy: no`, because local runs don't include proxy settings. All requests will be made from our own location, just as before. Now, let's update the cloud version of our scraper with `apify push`:
+In the logs, we should see `Using proxy: no`, because local runs don't include proxy settings. A warning informs us that it's a paid feature we don't have enabled, so all requests will be made from our own location, just as before. Now, let's update the cloud version of our scraper with `apify push`:
```text
$ apify push
@@ -394,30 +300,17 @@ Back in the Apify console, we'll go to the **Source** screen and switch to the *
We'll leave it as is and click **Start**. This time, the logs should show `Using proxy: yes`, as the scraper uses proxies provided by the platform:
```text
-(timestamp) ACTOR: Pulling Docker image of build o6vHvr5KwA1sGNxP0 from repository.
+(timestamp) ACTOR: Pulling Docker image of build o6vHvr5KwA1sGNxP0 from registry.
(timestamp) ACTOR: Creating Docker container.
(timestamp) ACTOR: Starting Docker container.
-(timestamp) [apify] INFO Initializing Actor...
-(timestamp) [apify] INFO System info ({"apify_sdk_version": "0.0.0", "apify_client_version": "0.0.0", "crawlee_version": "0.0.0", "python_version": "0.0.0", "os": "xyz"})
-(timestamp) [BeautifulSoupCrawler] INFO Using proxy: yes
-(timestamp) [BeautifulSoupCrawler] INFO Current request statistics:
-(timestamp) ┌───────────────────────────────┬──────────┐
-(timestamp) │ requests_finished │ 0 │
-(timestamp) │ requests_failed │ 0 │
-(timestamp) │ retry_histogram │ [0] │
-(timestamp) │ request_avg_failed_duration │ None │
-(timestamp) │ request_avg_finished_duration │ None │
-(timestamp) │ requests_finished_per_minute │ 0 │
-(timestamp) │ requests_failed_per_minute │ 0 │
-(timestamp) │ request_total_duration │ 0.0 │
-(timestamp) │ requests_total │ 0 │
-(timestamp) │ crawler_runtime │ 0.036449 │
-(timestamp) └───────────────────────────────┴──────────┘
-(timestamp) [crawlee._autoscaling.autoscaled_pool] INFO current_concurrency = 0; desired_concurrency = 2; cpu = 0; mem = 0; event_loop = 0.0; client_info = 0.0
-(timestamp) [crawlee.storages._request_queue] INFO The queue still contains requests locked by another client
-(timestamp) [BeautifulSoupCrawler] INFO Looking for product detail pages
-(timestamp) [BeautifulSoupCrawler] INFO Product detail page: https://warehouse-theme-metal.myshopify.com/products/jbl-flip-4-waterproof-portable-bluetooth-speaker
-(timestamp) [BeautifulSoupCrawler] INFO Saving a product variant
+(timestamp) INFO System info {"apifyVersion":"0.0.0","apifyClientVersion":"0.0.0","crawleeVersion":"0.0.0","osType":"Darwin","nodeVersion":"v0.0.0"}
+(timestamp) INFO CheerioCrawler: Using proxy: yes
+(timestamp) INFO CheerioCrawler: Starting the crawler.
+(timestamp) INFO CheerioCrawler: Looking for product detail pages
+(timestamp) INFO CheerioCrawler: Product detail page: https://warehouse-theme-metal.myshopify.com/products/sony-ps-hx500-hi-res-usb-turntable
+(timestamp) INFO CheerioCrawler: Saving a product
+(timestamp) INFO CheerioCrawler: Product detail page: https://warehouse-theme-metal.myshopify.com/products/klipsch-r-120sw-powerful-detailed-home-speaker-set-of-1
+(timestamp) INFO CheerioCrawler: Saving a product
...
```
diff --git a/sources/academy/webscraping/scraping_basics_javascript2/index.md b/sources/academy/webscraping/scraping_basics_javascript2/index.md
index c7dcb96b5..3751f05ef 100644
--- a/sources/academy/webscraping/scraping_basics_javascript2/index.md
+++ b/sources/academy/webscraping/scraping_basics_javascript2/index.md
@@ -33,7 +33,7 @@ Anyone with basic knowledge of developing programs in JavaScript who wants to st
## Requirements
- A macOS, Linux, or Windows machine with a web browser and Node.js installed.
-- Familiarity with JavaScript basics: variables, conditions, loops, functions, strings, lists, dictionaries, files, classes, and exceptions.
+- Familiarity with JavaScript basics: variables, conditions, loops, functions, strings, arrays, objects, files, classes, promises, imports, and exceptions.
- Comfort with building a Node.js package and installing dependencies with `npm`.
- Familiarity with running commands in Terminal (macOS/Linux) or Command Prompt (Windows).
diff --git a/sources/academy/webscraping/scraping_basics_python/05_parsing_html.md b/sources/academy/webscraping/scraping_basics_python/05_parsing_html.md
index 8b90a5cf1..74c399b69 100644
--- a/sources/academy/webscraping/scraping_basics_python/05_parsing_html.md
+++ b/sources/academy/webscraping/scraping_basics_python/05_parsing_html.md
@@ -63,7 +63,7 @@ $ python main.py
[Sales
]
```
-Our code lists all `h1` elements it can find on the page. It's the case that there's just one, so in the result we can see a list with a single item. What if we want to print just the text? Let's change the end of the program to the following:
+Our code lists all `h1` elements it can find in the HTML we gave it. There's just one, so in the result we can see a list with a single item. What if we want to print just the text? Let's change the end of the program to the following:
```py
headings = soup.select("h1")
@@ -80,7 +80,7 @@ Sales
:::note Dynamic websites
-The Warehouse returns full HTML in its initial response, but many other sites add content via JavaScript after the page loads or after user interaction. In such cases, what we see in DevTools may differ from `response.text` in Python. Learn how to handle these scenarios in our [API Scraping](../api_scraping/index.md) and [Puppeteer & Playwright](../puppeteer_playwright/index.md) courses.
+The Warehouse returns full HTML in its initial response, but many other sites add some content after the page loads or after user interaction. In such cases, what we'd see in DevTools could differ from `response.text` in Python. Learn how to handle these scenarios in our [API Scraping](../api_scraping/index.md) and [Puppeteer & Playwright](../puppeteer_playwright/index.md) courses.
:::
@@ -117,12 +117,12 @@ That's it! We've managed to download a product listing, parse its HTML, and coun
-### Scrape F1 teams
+### Scrape F1 Academy teams
-Print a total count of F1 teams listed on this page:
+Print a total count of F1 Academy teams listed on this page:
```text
-https://www.formula1.com/en/teams
+https://www.f1academy.com/Racing-Series/Teams
```
@@ -132,20 +132,20 @@ https://www.formula1.com/en/teams
import httpx
from bs4 import BeautifulSoup
- url = "https://www.formula1.com/en/teams"
+ url = "https://www.f1academy.com/Racing-Series/Teams"
response = httpx.get(url)
response.raise_for_status()
html_code = response.text
soup = BeautifulSoup(html_code, "html.parser")
- print(len(soup.select(".group")))
+ print(len(soup.select(".teams-driver-item")))
```
-### Scrape F1 drivers
+### Scrape F1 Academy drivers
-Use the same URL as in the previous exercise, but this time print a total count of F1 drivers.
+Use the same URL as in the previous exercise, but this time print a total count of F1 Academy drivers.
Solution
@@ -154,13 +154,13 @@ Use the same URL as in the previous exercise, but this time print a total count
import httpx
from bs4 import BeautifulSoup
- url = "https://www.formula1.com/en/teams"
+ url = "https://www.f1academy.com/Racing-Series/Teams"
response = httpx.get(url)
response.raise_for_status()
html_code = response.text
soup = BeautifulSoup(html_code, "html.parser")
- print(len(soup.select(".f1-team-driver-name")))
+ print(len(soup.select(".driver")))
```
diff --git a/sources/academy/webscraping/scraping_basics_python/06_locating_elements.md b/sources/academy/webscraping/scraping_basics_python/06_locating_elements.md
index 974f41504..4193c0b13 100644
--- a/sources/academy/webscraping/scraping_basics_python/06_locating_elements.md
+++ b/sources/academy/webscraping/scraping_basics_python/06_locating_elements.md
@@ -162,7 +162,7 @@ We can use Beautiful Soup's `.contents` property to access individual nodes. It
["\n", Sale price, "$74.95"]
```
-It seems like we can read the last element to get the actual amount from a list like the above. Let's fix our program:
+It seems like we can read the last element to get the actual amount. Let's fix our program:
```py
import httpx
@@ -228,6 +228,16 @@ Algeria
Angola
Benin
Botswana
+Burkina Faso
+Burundi
+Cameroon
+Cape Verde
+Central African Republic
+Chad
+Comoros
+Democratic Republic of the Congo
+Republic of the Congo
+Djibouti
...
```
diff --git a/sources/academy/webscraping/scraping_basics_python/08_saving_data.md b/sources/academy/webscraping/scraping_basics_python/08_saving_data.md
index 6567e24ef..c5140e8d1 100644
--- a/sources/academy/webscraping/scraping_basics_python/08_saving_data.md
+++ b/sources/academy/webscraping/scraping_basics_python/08_saving_data.md
@@ -65,7 +65,7 @@ for product in soup.select(".product-item"):
print(data)
```
-Before looping over the products, we prepare an empty list. Then, instead of printing each line, we append the data of each product to the list in the form of a Python dictionary. At the end of the program, we print the entire list at once.
+Before looping over the products, we prepare an empty list. Then, instead of printing each line, we append the data of each product to the list in the form of a Python dictionary. At the end of the program, we print the entire list. The program should now print the results as a single large Python list:
```text
$ python main.py
@@ -215,7 +215,7 @@ In this lesson, we created export files in two formats. The following challenges
### Process your JSON
-Write a new Python program that reads `products.json`, finds all products with a min price greater than $500, and prints each one using [`pp()`](https://docs.python.org/3/library/pprint.html#pprint.pp).
+Write a new Python program that reads the `products.json` file we created in the lesson, finds all products with a min price greater than $500, and prints each one using [`pp()`](https://docs.python.org/3/library/pprint.html#pprint.pp).
Solution
diff --git a/sources/academy/webscraping/scraping_basics_python/09_getting_links.md b/sources/academy/webscraping/scraping_basics_python/09_getting_links.md
index 483958c22..e208044b6 100644
--- a/sources/academy/webscraping/scraping_basics_python/09_getting_links.md
+++ b/sources/academy/webscraping/scraping_basics_python/09_getting_links.md
@@ -115,7 +115,7 @@ def parse_product(product):
return {"title": title, "min_price": min_price, "price": price}
```
-Now the JSON export. For better readability of it, let's make a small change here and set the indentation level to two spaces:
+Now the JSON export. For better readability, let's make a small change here and set the indentation level to two spaces:
```py
def export_json(file, data):
diff --git a/sources/academy/webscraping/scraping_basics_python/10_crawling.md b/sources/academy/webscraping/scraping_basics_python/10_crawling.md
index dc4d8cee2..90bbf8e19 100644
--- a/sources/academy/webscraping/scraping_basics_python/10_crawling.md
+++ b/sources/academy/webscraping/scraping_basics_python/10_crawling.md
@@ -125,7 +125,7 @@ Depending on what's valuable for our use case, we can now use the same technique
It looks like using a CSS selector to locate the element with the `product-meta__vendor` class, and then extracting its text, should be enough to get the vendor name as a string:
```py
-vendor = product_soup.select_one(".product-meta__vendor").text.strip()
+vendor = soup.select_one(".product-meta__vendor").text.strip()
```
But where do we put this line in our program?
@@ -135,8 +135,6 @@ But where do we put this line in our program?
In the `data` loop we're already going through all the products. Let's expand it to include downloading the product detail page, parsing it, extracting the vendor's name, and adding it as a new key in the item's dictionary:
```py
-...
-
listing_url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
listing_soup = download(listing_url)
@@ -148,8 +146,6 @@ for product in listing_soup.select(".product-item"):
# highlight-next-line
item["vendor"] = product_soup.select_one(".product-meta__vendor").text.strip()
data.append(item)
-
-...
```
If we run the program now, it'll take longer to finish since it's making 24 more HTTP requests. But in the end, it should produce exports with a new field containing the vendor's name:
@@ -177,7 +173,7 @@ If we run the program now, it'll take longer to finish since it's making 24 more
## Extracting price
-Scraping the vendor's name is nice, but the main reason we started checking the detail pages in the first place was to figure out how to get a price for each product. From the product listing, we could only scrape the min price, and remember—we’re building a Python app to track prices!
+Scraping the vendor's name is nice, but the main reason we started checking the detail pages in the first place was to figure out how to get a price for each product. From the product listing, we could only scrape the min price, and remember—we're building a Python app to track prices!
Looking at the [Sony XBR-950G BRAVIA](https://warehouse-theme-metal.myshopify.com/products/sony-xbr-65x950g-65-class-64-5-diag-bravia-4k-hdr-ultra-hd-tv), it's clear that the listing only shows min prices, because some products have variants, each with a different price. And different stock availability. And different SKUs…
@@ -191,7 +187,7 @@ In the next lesson, we'll scrape the product detail pages so that each product v
### Scrape calling codes of African countries
-This is a follow-up to an exercise from the previous lesson, so feel free to reuse your code. Scrape links to Wikipedia pages for all African states and territories. Follow each link and extract the _calling code_ from the info table. Print the URL and the calling code for each country. Start with this URL:
+Scrape links to Wikipedia pages for all African states and territories. Follow each link and extract the _calling code_ from the info table. Print the URL and the calling code for each country. Start with this URL:
```text
https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_in_Africa
@@ -246,7 +242,7 @@ Hint: Locating cells in tables is sometimes easier if you know how to [navigate
### Scrape authors of F1 news articles
-This is a follow-up to an exercise from the previous lesson, so feel free to reuse your code. Scrape links to the Guardian's latest F1 news articles. For each article, follow the link and extract both the author's name and the article's title. Print the author's name and the title for all the articles. Start with this URL:
+Scrape links to the Guardian's latest F1 news articles. For each article, follow the link and extract both the author's name and the article's title. Print the author's name and the title for all the articles. Start with this URL:
```text
https://www.theguardian.com/sport/formulaone
@@ -282,7 +278,7 @@ Hints:
return BeautifulSoup(response.text, "html.parser")
def parse_author(article_soup):
- link = article_soup.select_one('aside a[rel="author"]')
+ link = article_soup.select_one('a[rel="author"]')
if link:
return link.text.strip()
address = article_soup.select_one('aside address')
diff --git a/sources/academy/webscraping/scraping_basics_python/11_scraping_variants.md b/sources/academy/webscraping/scraping_basics_python/11_scraping_variants.md
index 2d8b9e822..98f04a761 100644
--- a/sources/academy/webscraping/scraping_basics_python/11_scraping_variants.md
+++ b/sources/academy/webscraping/scraping_basics_python/11_scraping_variants.md
@@ -71,8 +71,6 @@ These elements aren't visible to regular visitors. They're there just in case Ja
Using our knowledge of Beautiful Soup, we can locate the options and extract the data we need:
```py
-...
-
listing_url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
listing_soup = download(listing_url)
@@ -88,11 +86,9 @@ for product in listing_soup.select(".product-item"):
else:
item["variant_name"] = None
data.append(item)
-
-...
```
-The CSS selector `.product-form__option.no-js` matches elements with both `product-form__option` and `no-js` classes. Then we're using the [descendant combinator](https://developer.mozilla.org/en-US/docs/Web/CSS/Descendant_combinator) to match all `option` elements somewhere inside the `.product-form__option.no-js` wrapper.
+The CSS selector `.product-form__option.no-js` targets elements that have both the `product-form__option` and `no-js` classes. We then use the [descendant combinator](https://developer.mozilla.org/en-US/docs/Web/CSS/Descendant_combinator) to match all `option` elements nested within the `.product-form__option.no-js` wrapper.
Python dictionaries are mutable, so if we assigned the variant with `item["variant_name"] = ...`, we'd always overwrite the values. Instead of saving an item for each variant, we'd end up with the last variant repeated several times. To avoid this, we create a new dictionary for each variant and merge it with the `item` data before adding it to `data`. If we don't find any variants, we add the `item` as is, leaving the `variant_name` key empty.
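To make that merge concrete, here's a minimal standalone sketch with made-up values (not the lesson's data), showing why building a new dictionary per variant avoids overwriting:

```py
item = {"title": "Sony XBR-950G", "min_price": "1398.00"}
variant_names = ['55" - $1,398.00', '65" - $2,198.00']

data = []
for variant_name in variant_names:
    # The | operator (Python 3.9+) returns a new merged dictionary,
    # so each variant gets its own record instead of mutating `item`.
    data.append(item | {"variant_name": variant_name})

print(data)
```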
@@ -225,6 +221,7 @@ def parse_product(product, base_url):
return {"title": title, "min_price": min_price, "price": price, "url": url}
+# highlight-start
def parse_variant(variant):
text = variant.text.strip()
name, price_text = text.split(" - ")
@@ -234,6 +231,7 @@ def parse_variant(variant):
.replace(",", "")
)
return {"variant_name": name, "price": price}
+# highlight-end
def export_json(file, data):
def serialize(obj):
@@ -310,7 +308,7 @@ Is this the end? Maybe! In the next lesson, we'll use a scraping framework to bu
### Build a scraper for watching Python jobs
-You're able to build a scraper now, aren't you? Let's build another one! Python's official website has a [job board](https://www.python.org/jobs/). Scrape the job postings that match the following criteria:
+You can build a scraper now, can't you? Let's build another one! Python's official website has a [job board](https://www.python.org/jobs/). Scrape the job postings that match the following criteria:
- Tagged as "Database"
- Posted within the last 60 days
diff --git a/sources/academy/webscraping/scraping_basics_python/12_framework.md b/sources/academy/webscraping/scraping_basics_python/12_framework.md
index c8b5f6468..691543454 100644
--- a/sources/academy/webscraping/scraping_basics_python/12_framework.md
+++ b/sources/academy/webscraping/scraping_basics_python/12_framework.md
@@ -181,7 +181,7 @@ https://warehouse-theme-metal.myshopify.com/products/sony-ps-hx500-hi-res-usb-tu
└───────────────────────────────┴──────────┘
```
-In the final stats, we can see that we made 25 requests (1 listing page + 24 product pages) in less than 5 seconds. Your numbers might differ, but regardless, it should be much faster than making the requests sequentially.
+In the final stats, we can see that we made 25 requests (1 listing page + 24 product pages) in less than 5 seconds. Your numbers might differ, but regardless, it should be much faster than making the requests sequentially. These requests aren't fired all at once without planning. Crawlee schedules and sends them at a pace that doesn't overload the target server, and if any of them fail, it can automatically retry them.
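If you ever need to tune how aggressively Crawlee works, it exposes settings for retries and concurrency. The sketch below is only an illustration under assumptions: the `max_request_retries` and `concurrency_settings` parameters and the import paths match recent Crawlee for Python releases and may differ in yours, so check the documentation for your version:

```py
import asyncio

from crawlee import ConcurrencySettings
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main():
    # Assumed parameter names: cap parallel requests and retry failures.
    crawler = BeautifulSoupCrawler(
        max_request_retries=3,
        concurrency_settings=ConcurrencySettings(max_concurrency=5),
    )

    @crawler.router.default_handler
    async def handle_listing(context: BeautifulSoupCrawlingContext):
        context.log.info(context.soup.title.text.strip())

    await crawler.run(["https://warehouse-theme-metal.myshopify.com/collections/sales"])


if __name__ == "__main__":
    asyncio.run(main())
```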
## Extracting data
@@ -209,7 +209,7 @@ The code above assumes the `.select_one()` call doesn't return `None`. If your e
Now for the price. We're not doing anything new here—just import `Decimal` and copy-paste the code from our old scraper.
-The only change will be in the selector. In `main.py`, we looked for `.price` within a `product_soup` object representing a product card. Now, we're looking for `.price` within the entire product detail page. It's better to be more specific so we don't accidentally match another price on the same page:
+The only change will be in the selector. In `oldmain.py`, we look for `.price` within a `product_soup` object representing a product card. Here, we're looking for `.price` within the entire product detail page. It's better to be more specific so we don't accidentally match another price on the same page:
```py
async def main():
@@ -235,7 +235,7 @@ async def main():
print(item)
```
-Finally, the variants. We can reuse the `parse_variant()` function as-is, and in the handler we'll again take inspiration from what we had in `main.py`. The full program will look like this:
+Finally, the variants. We can reuse the `parse_variant()` function as-is, and in the handler we'll again take inspiration from what we have in `oldmain.py`. The full program will look like this:
```py
import asyncio
@@ -533,7 +533,6 @@ If you export the dataset as JSON, it should look something like this:
To scrape IMDb data, you'll need to construct a `Request` object with the appropriate search URL for each movie title. The following code snippet gives you an idea of how to do this:
```py
-...
from urllib.parse import quote_plus
async def main():
@@ -549,10 +548,9 @@ async def main():
await context.add_requests(requests)
...
-...
```
-When navigating to the first search result, you might find it helpful to know that `context.enqueue_links()` accepts a `limit` keyword argument, letting you specify the max number of HTTP requests to enqueue.
+When navigating to the first IMDb search result, you might find it helpful to know that `context.enqueue_links()` accepts a `limit` keyword argument, letting you specify the max number of HTTP requests to enqueue.
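For example, in a handler for the IMDb search page, following just the first result could look roughly like this (a hedged sketch: the CSS selector and the `IMDB_TITLE` label are hypothetical, and it assumes the `crawler` object from the lesson's program; only the `limit` argument is described above):

```py
@crawler.router.handler("IMDB_SEARCH")
async def handle_imdb_search(context: BeautifulSoupCrawlingContext):
    # Hypothetical selector for the first search result; limit=1 stops
    # Crawlee from enqueueing a request for every result on the page.
    await context.enqueue_links(selector=".find-result-item a", label="IMDB_TITLE", limit=1)
```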
Solution
@@ -570,7 +568,7 @@ When navigating to the first search result, you might find it helpful to know th
@crawler.router.default_handler
async def handle_netflix_table(context: BeautifulSoupCrawlingContext):
requests = []
- for name_cell in context.soup.select(".list-tbl-global .tbl-cell-name"):
+ for name_cell in context.soup.select('[data-uia="top10-table-row-title"] button'):
name = name_cell.text.strip()
imdb_search_url = f"https://www.imdb.com/find/?q={quote_plus(name)}&s=tt&ttype=ft"
requests.append(Request.from_url(imdb_search_url, label="IMDB_SEARCH"))
diff --git a/sources/academy/webscraping/scraping_basics_python/13_platform.md b/sources/academy/webscraping/scraping_basics_python/13_platform.md
index d039540a4..6cbbec6fe 100644
--- a/sources/academy/webscraping/scraping_basics_python/13_platform.md
+++ b/sources/academy/webscraping/scraping_basics_python/13_platform.md
@@ -322,7 +322,7 @@ Finally, we'll modify the Actor configuration in `warehouse-watchdog/src/.actor/
```json title=warehouse-watchdog/src/.actor/input_schema.json
{
- "title": "Python Crawlee BeautifulSoup Scraper",
+ "title": "Crawlee BeautifulSoup Scraper",
"type": "object",
"schemaVersion": 1,
"properties": {