@@ -6,7 +6,7 @@ slug: /scraping-basics-javascript2/devtools-inspecting
unlisted: true
---

-import Exercises from './_exercises.mdx';
+import Exercises from '../scraping_basics/_exercises.mdx';

**In this lesson we'll use the browser tools for developers to inspect and manipulate the structure of a website.**

@@ -28,11 +28,11 @@ Google Chrome is currently the most popular browser, and many others use the sam

Now let's peek behind the scenes of a real-world website—say, Wikipedia. We'll open Google Chrome and visit [wikipedia.org](https://www.wikipedia.org/). Then, let's press **F12**, or right-click anywhere on the page and select **Inspect**.

-![Wikipedia with Chrome DevTools open](./images/devtools-wikipedia.png)
+![Wikipedia with Chrome DevTools open](../scraping_basics/images/devtools-wikipedia.png)

Websites are built with three main technologies: HTML, CSS, and JavaScript. In the **Elements** tab, DevTools shows the HTML and CSS of the current page:

-![Elements tab in Chrome DevTools](./images/devtools-elements-tab.png)
+![Elements tab in Chrome DevTools](../scraping_basics/images/devtools-elements-tab.png)

:::warning Screen adaptations

@@ -62,17 +62,17 @@ While HTML and CSS describe what the browser should display, JavaScript adds int

If you don't see it, press <kbd>ESC</kbd> to toggle the Console. Running commands in the Console lets us manipulate the loaded page—we’ll try this shortly.

-![Console in Chrome DevTools](./images/devtools-console.png)
+![Console in Chrome DevTools](../scraping_basics/images/devtools-console.png)

## Selecting an element

In the top-left corner of DevTools, let's find the icon with an arrow pointing to a square.

-![Chrome DevTools element selection tool](./images/devtools-element-selection.png)
+![Chrome DevTools element selection tool](../scraping_basics/images/devtools-element-selection.png)

We'll click the icon and hover our cursor over Wikipedia's subtitle, **The Free Encyclopedia**. As we move the cursor, DevTools will display information about the HTML element under it. We'll click on the subtitle. In the **Elements** tab, DevTools will highlight the HTML element that represents the subtitle.

-![Chrome DevTools element hover](./images/devtools-hover.png)
+![Chrome DevTools element hover](../scraping_basics/images/devtools-hover.png)

The highlighted section should look something like this:
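For reference, the subtitle's markup looks roughly like this (a sketch only; the exact tag, classes, and attributes on Wikipedia's landing page may differ):

```html
<strong class="jsl10n localized-slogan">The Free Encyclopedia</strong>
```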

@@ -108,7 +108,7 @@ We won't be creating Node.js scrapers just yet. Let's first get familiar with wh

In the **Elements** tab, with the subtitle element highlighted, let's right-click the element to open the context menu. There, we'll choose **Store as global variable**. The **Console** should appear, with a `temp1` variable ready.

-![Global variable in Chrome DevTools Console](./images/devtools-console-variable.png)
+![Global variable in Chrome DevTools Console](../scraping_basics/images/devtools-console-variable.png)

The Console allows us to run code in the context of the loaded page. We can use it to play around with elements.

@@ -132,7 +132,7 @@ temp1.textContent = 'Hello World!';

When we change elements in the Console, those changes reflect immediately on the page!

-![Changing textContent in Chrome DevTools Console](./images/devtools-console-textcontent.png)
+![Changing textContent in Chrome DevTools Console](../scraping_basics/images/devtools-console-textcontent.png)

But don't worry—we haven't hacked Wikipedia. The change only happens in our browser. If we reload the page, the change will disappear. This, however, is an easy way to craft a screenshot with fake content. That's why screenshots shouldn't be trusted as evidence.

@@ -161,7 +161,7 @@ You're looking for an [`img`](https://developer.mozilla.org/en-US/docs/Web/HTML/
1. Send the highlighted element to the **Console** using the **Store as global variable** option from the context menu.
1. In the console, type `temp1.src` and hit **Enter**.

-![DevTools exercise result](./images/devtools-exercise-fifa.png)
+![DevTools exercise result](../scraping_basics/images/devtools-exercise-fifa.png)

</details>

@@ -178,6 +178,6 @@ Open a news website, such as [CNN](https://cnn.com). Use the Console to change t
1. Send the highlighted element to the **Console** using the **Store as global variable** option from the context menu.
1. In the console, type `temp1.textContent = 'Something something'` and hit **Enter**.

-![DevTools exercise result](./images/devtools-exercise-cnn.png)
+![DevTools exercise result](../scraping_basics/images/devtools-exercise-cnn.png)

</details>
@@ -6,7 +6,7 @@ slug: /scraping-basics-javascript2/devtools-locating-elements
unlisted: true
---

-import Exercises from './_exercises.mdx';
+import Exercises from '../scraping_basics/_exercises.mdx';

**In this lesson we'll use the browser tools for developers to manually find products on an e-commerce website.**

@@ -30,17 +30,17 @@ That said, we designed all the additional exercises to work with live websites.

As mentioned in the previous lesson, before building a scraper, we need to understand the structure of the target page and identify the specific elements our program should extract. Let's figure out how to select details for each product on the [Sales page](https://warehouse-theme-metal.myshopify.com/collections/sales).

-![Warehouse store with DevTools open](./images/devtools-warehouse.png)
+![Warehouse store with DevTools open](../scraping_basics/images/devtools-warehouse.png)

The page displays a grid of product cards, each showing a product's title and picture. Let's open DevTools and locate the title of the **Sony SACS9 Active Subwoofer**. We'll highlight it in the **Elements** tab by clicking on it.

-![Selecting an element with DevTools](./images/devtools-product-title.png)
+![Selecting an element with DevTools](../scraping_basics/images/devtools-product-title.png)

Next, let's find all the elements containing details about this subwoofer—its price, number of reviews, image, and more.

In the **Elements** tab, we'll move our cursor up from the `a` element containing the subwoofer's title. On the way, we'll hover over each element until we highlight the entire product card. Alternatively, we can use the arrow-up key. The `div` element we land on is the **parent element**, and all nested elements are its **child elements**.

-![Selecting an element with hover](./images/devtools-hover-product.png)
+![Selecting an element with hover](../scraping_basics/images/devtools-hover-product.png)

At this stage, we could use the **Store as global variable** option to send the element to the **Console**. While helpful for manual inspection, this isn't something a program can do.

@@ -64,7 +64,7 @@ document.querySelector('.product-item');

It will return the HTML element for the first product card in the listing:

-![Using querySelector() in DevTools Console](./images/devtools-queryselector.webp)
+![Using querySelector() in DevTools Console](../scraping_basics/images/devtools-queryselector.webp)

CSS selectors can get quite complex, but the basics are enough to scrape most of the Warehouse store. Let's cover two simple types and how they can combine.

@@ -114,13 +114,13 @@ The product card has four classes: `product-item`, `product-item--vertical`, `1/

This class is also unique enough in the page's context. If it were something generic like `item`, there would be a higher risk that developers of the website might use it for unrelated elements. In the **Elements** tab, we can see a parent element `product-list` that contains all the product cards marked as `product-item`. This structure aligns with the data we're after.

-![Overview of all the product cards in DevTools](./images/devtools-product-list.png)
+![Overview of all the product cards in DevTools](../scraping_basics/images/devtools-product-list.png)

## Locating all product cards

In the **Console**, hovering our cursor over objects representing HTML elements highlights the corresponding elements on the page. This way we can verify that when we query `.product-item`, the result represents the JBL Flip speaker—the first product card in the list.

-![Highlighting a querySelector() result](./images/devtools-hover-queryselector.png)
+![Highlighting a querySelector() result](../scraping_basics/images/devtools-hover-queryselector.png)

But what if we want to scrape details about the Sony subwoofer we inspected earlier? For that, we need a method that selects more than just the first match: [`querySelectorAll()`](https://developer.mozilla.org/en-US/docs/Web/API/Document/querySelectorAll). As the name suggests, it takes a CSS selector string and returns all matching HTML elements. Let's type this into the **Console**:

@@ -132,7 +132,7 @@ The returned value is a [`NodeList`](https://developer.mozilla.org/en-US/docs/We

We'll expand the result by clicking the small arrow, then hover our cursor over the third element in the list. Indexing starts at 0, so the third element is at index 2. There it is—the product card for the subwoofer!

-![Highlighting a querySelectorAll() result](./images/devtools-hover-queryselectorall.png)
+![Highlighting a querySelectorAll() result](../scraping_basics/images/devtools-hover-queryselectorall.png)
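Zero-based indexing is easy to trip over. The same logic with a plain JavaScript array (generic placeholder values, not actual store data) looks like this:

```javascript
// Indexing starts at 0, so the third item lives at index 2.
const cards = ['first card', 'second card', 'third card'];
console.log(cards[2]); // third card
```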

To save the subwoofer in a variable for further inspection, we can use index access with brackets, just like with regular JavaScript arrays:

@@ -151,7 +151,7 @@ Even though we're just playing in the browser's **Console**, we're inching close

On English Wikipedia's [Main Page](https://en.wikipedia.org/wiki/Main_Page), use CSS selectors in the **Console** to list the HTML elements representing headings of the colored boxes (including the grey ones).

-![Wikipedia's Main Page headings](./images/devtools-exercise-wikipedia.png)
+![Wikipedia's Main Page headings](../scraping_basics/images/devtools-exercise-wikipedia.png)

<details>
<summary>Solution</summary>
@@ -169,7 +169,7 @@ On English Wikipedia's [Main Page](https://en.wikipedia.org/wiki/Main_Page), use

Go to Shein's [Jewelry & Accessories](https://shein.com/RecommendSelection/Jewelry-Accessories-sc-017291431.html) category. In the **Console**, use CSS selectors to list all HTML elements representing the products.

-![Products in Shein's Jewelry & Accessories category](./images/devtools-exercise-shein.png)
+![Products in Shein's Jewelry & Accessories category](../scraping_basics/images/devtools-exercise-shein.png)

<details>
<summary>Solution</summary>
@@ -194,7 +194,7 @@ Learn about the [descendant combinator](https://developer.mozilla.org/en-US/docs

:::

-![Articles on Guardian's page about F1](./images/devtools-exercise-guardian1.png)
+![Articles on Guardian's page about F1](../scraping_basics/images/devtools-exercise-guardian1.png)

<details>
<summary>Solution</summary>
@@ -6,7 +6,7 @@ slug: /scraping-basics-javascript2/devtools-extracting-data
unlisted: true
---

-import Exercises from './_exercises.mdx';
+import Exercises from '../scraping_basics/_exercises.mdx';

**In this lesson we'll use the browser tools for developers to manually extract product data from an e-commerce website.**

@@ -31,15 +31,15 @@ subwoofer.textContent;

That indeed outputs all the text, but in a form that would be hard to break down into relevant pieces.

-![Printing text content of the parent element](./images/devtools-extracting-text.png)
+![Printing text content of the parent element](../scraping_basics/images/devtools-extracting-text.png)

We'll first need to locate the relevant child elements and extract the data from each of them individually.

## Extracting title

We'll use the **Elements** tab of DevTools to inspect all child elements of the product card for the Sony subwoofer. We can see that the title of the product is inside an `a` element with several classes. Of those, `product-item__title` seems like a great choice for locating the element.

-![Finding child elements](./images/devtools-product-details.png)
+![Finding child elements](../scraping_basics/images/devtools-product-details.png)

Browser JavaScript represents HTML elements as [Element](https://developer.mozilla.org/en-US/docs/Web/API/Element) objects. Besides properties we've already played with, such as `textContent` or `outerHTML`, each element also has a [`querySelector()`](https://developer.mozilla.org/en-US/docs/Web/API/Element/querySelector) method. Here the method looks for matches only within the children of the element:

@@ -50,13 +50,13 @@ title.textContent;

Notice we're calling `querySelector()` on the `subwoofer` variable, not `document`. And just like this, we've scraped our first piece of data! We've extracted the product title:

-![Extracting product title](./images/devtools-extracting-title.png)
+![Extracting product title](../scraping_basics/images/devtools-extracting-title.png)

## Extracting price

To figure out how to get the price, we'll use the **Elements** tab of DevTools again. We notice there are two prices, a regular price and a sale price. For the purposes of watching prices we'll need the sale price. Both are `span` elements with the `price` class.

-![Finding child elements](./images/devtools-product-details.png)
+![Finding child elements](../scraping_basics/images/devtools-product-details.png)

We could either rely on the fact that the sale price is likely always the one that's highlighted, or that it's always the first price. For now we'll rely on the latter and let `querySelector()` simply return the first result:

@@ -67,7 +67,7 @@ price.textContent;

It works, but the price isn't alone in the result. Before we'd use such data, we'd need to do some **data cleaning**:

-![Extracting product price](./images/devtools-extracting-price.png)
+![Extracting product price](../scraping_basics/images/devtools-extracting-price.png)

But for now that's okay. We're just testing the waters, so that we have an idea about what our scraper will need to do. Once we get to extracting prices in Node.js, we'll figure out how to get the values as numbers.
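As a sketch of what that cleaning might involve, assuming the extracted text looks something like `Sale price$158.00` with surrounding whitespace (the lesson's actual approach may differ):

```javascript
// Hypothetical cleanup: pull the first dollar amount out of the messy text.
// Assumes a single "$123.45"-style amount; real data may need more care.
const raw = '\n  Sale price$158.00\n';
const match = raw.match(/\$([\d,]+(?:\.\d+)?)/);
const price = match ? Number(match[1].replace(/,/g, '')) : null;
console.log(price); // 158
```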

@@ -100,7 +100,7 @@ At IKEA's [Artificial plants & flowers listing](https://www.ikea.com/se/en/cat/a

On Fandom's [Movies page](https://www.fandom.com/topics/movies), use CSS selectors and HTML element manipulation in the **Console** to extract the name of the top wiki. Use the [`trim()`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/trim) method to remove white space around the name.

-![Fandom's Movies page](./images/devtools-exercise-fandom.png)
+![Fandom's Movies page](../scraping_basics/images/devtools-exercise-fandom.png)

<details>
<summary>Solution</summary>
@@ -119,7 +119,7 @@ On Fandom's [Movies page](https://www.fandom.com/topics/movies), use CSS selecto

On the Guardian's [F1 news page](https://www.theguardian.com/sport/formulaone), use CSS selectors and HTML manipulation in the **Console** to extract details about the first post. Specifically, extract its title, lead paragraph, and URL of the associated photo.

-![F1 news page](./images/devtools-exercise-guardian2.png)
+![F1 news page](../scraping_basics/images/devtools-exercise-guardian2.png)

<details>
<summary>Solution</summary>
@@ -6,7 +6,7 @@ slug: /scraping-basics-javascript2/downloading-html
unlisted: true
---

-import Exercises from './_exercises.mdx';
+import Exercises from '../scraping_basics/_exercises.mdx';

**In this lesson we'll start building a Node.js application for watching prices. As a first step, we'll use the Fetch API to download HTML code of a product listing page.**

@@ -6,15 +6,15 @@ slug: /scraping-basics-javascript2/parsing-html
unlisted: true
---

-import Exercises from './_exercises.mdx';
+import Exercises from '../scraping_basics/_exercises.mdx';

**In this lesson we'll look for products in the downloaded HTML. We'll use Cheerio to turn the HTML into objects which we can work with in our Node.js program.**

---

From lessons about browser DevTools we know that the HTML elements representing individual products have a `class` attribute which, among other values, contains `product-item`.

-![Products have the ‘product-item’ class](./images/product-item.png)
+![Products have the ‘product-item’ class](../scraping_basics/images/product-item.png)

As a first step, let's try counting how many products are on the listing page.
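Before reaching for a parser, one crude way to sketch that count is searching the raw HTML string for the class name (a naive sanity check only, shown here with a tiny hypothetical HTML snippet; it can miscount if the substring appears in other markup):

```javascript
// Naive check: count occurrences of the class in the raw HTML string.
// The trailing space avoids matching classes like "product-item__title",
// but this is fragile—real counting belongs to a parser like Cheerio.
const html = `
  <div class="product-item product-item--vertical">Speaker A</div>
  <div class="product-item product-item--vertical">Speaker B</div>
`;
const count = html.match(/class="product-item /g)?.length ?? 0;
console.log(count); // 2
```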

@@ -50,7 +50,7 @@ Being comfortable around installing Node.js packages is a prerequisite of this c

Now let's import the package and use it for parsing the HTML. The `cheerio` module allows us to work with the HTML elements in a structured way. As a demonstration, we'll first get the `<h1>` element, which represents the main heading of the page.

-![Element of the main heading](./images/h1.png)
+![Element of the main heading](../scraping_basics/images/h1.png)

We'll update our code to the following:

@@ -6,7 +6,7 @@ slug: /scraping-basics-javascript2/locating-elements
unlisted: true
---

-import Exercises from './_exercises.mdx';
+import Exercises from '../scraping_basics/_exercises.mdx';

**In this lesson we'll locate product data in the downloaded HTML. We'll use Cheerio to find those HTML elements which contain details about each product, such as title or price.**

@@ -64,7 +64,7 @@ To get details about each product in a structured way, we'll need a different ap

As in the browser DevTools lessons, we need to change the code so that it locates child elements for each product card.

-![Product card's child elements](./images/child-elements.png)
+![Product card's child elements](../scraping_basics/images/child-elements.png)

We should be looking for elements which have the `product-item__title` and `price` classes. We already know how that translates to CSS selectors:

@@ -6,7 +6,7 @@ slug: /scraping-basics-javascript2/extracting-data
unlisted: true
---

-import Exercises from './_exercises.mdx';
+import Exercises from '../scraping_basics/_exercises.mdx';

**In this lesson we'll finish extracting product data from the downloaded HTML. With help of basic string manipulation we'll focus on cleaning and correctly representing the product price.**

@@ -178,7 +178,7 @@ await writeFile("products.csv", csvData);

The program should now also produce a `products.csv` file. When browsing the directory on macOS, we can see a nice preview of the file's contents, which proves that the file is correct and that other programs can read it. If you're using a different operating system, try opening the file with any spreadsheet program you have.

-![CSV preview](images/csv.png)
+![CSV preview](../scraping_basics/images/csv.png)

In the CSV format, if a value contains commas, we should enclose it in quotes. If it contains quotes, we should double them. When we open the file in a text editor of our choice, we can see that the library automatically handled this:
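Those two rules can be sketched in a few lines of plain JavaScript (a hand-rolled illustration only; the CSV library used in the lesson handles this for us):

```javascript
// Minimal CSV field escaping: wrap fields containing commas, quotes,
// or newlines in double quotes, and double any embedded quotes.
function escapeCsvField(value) {
  const text = String(value);
  if (/[",\n]/.test(text)) {
    return `"${text.replace(/"/g, '""')}"`;
  }
  return text;
}

console.log(escapeCsvField('JBL Flip 4, Black')); // "JBL Flip 4, Black"
console.log(escapeCsvField('a "loud" speaker'));  // "a ""loud"" speaker"
```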

@@ -232,6 +232,6 @@ Open the `products.csv` file we created in the lesson using a spreadsheet applic
1. Select the header row. Go to **Data > Create filter**.
1. Use the filter icon that appears next to `minPrice`. Choose **Filter by condition**, select **Greater than**, and enter **500** in the text field. Confirm the dialog. You should see only the filtered data.

-![CSV in Google Sheets](images/csv-sheets.png)
+![CSV in Google Sheets](../scraping_basics/images/csv-sheets.png)

</details>