As mentioned in the previous lesson, before building a scraper, we need to understand the structure of the page.

![Warehouse store with DevTools open](./images/devtools-warehouse.png)

The page displays a grid of product cards, each showing a product's title and picture. Open DevTools and locate the title of the **Sony SACS9 Active Subwoofer**. Highlight it in the **Elements** tab by clicking on it.

![Selecting an element with DevTools](./images/devtools-product-title.png)

Next, let's find all the elements containing details about this subwoofer—its price, number of reviews, image, and more.

In the **Elements** tab, move your cursor up from the `a` element containing the subwoofer's title. On the way, hover over each element until you highlight the entire product card. Alternatively, use the arrow-up key. The `div` element you land on is the **parent element**, and all nested elements are its **child elements**.

![Selecting an element with hover](./images/devtools-hover-product.png)
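Beautiful Soup, the Python library used later in this course, exposes the same parent/child relationships in code. Here's a minimal sketch, assuming the `beautifulsoup4` package is installed and using simplified stand-in markup for a product card:

```py
from bs4 import BeautifulSoup

# Simplified stand-in markup; the real product card contains more elements.
html_code = """
<div class="product-item">
  <a href="/products/sony-sacs9">Sony SACS9 Active Subwoofer</a>
  <span class="price">$158.00</span>
</div>
"""
soup = BeautifulSoup(html_code, "html.parser")

link = soup.select_one("a")
card = link.parent  # the div is the link's parent element
for child in card.children:  # children include nested elements and whitespace
    print(child)
```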

On English Wikipedia's [Main Page](https://en.wikipedia.org/wiki/Main_Page), use your DevTools to find a CSS selector which would select the headings of all the boxes.
1. Open the [Main Page](https://en.wikipedia.org/wiki/Main_Page).
1. Activate the element selection tool in your DevTools.
1. Click on several headings to examine the markup.
1. Notice that all headings are `h2` elements with the `mp-h2` class.
1. In the **Console**, execute `document.querySelectorAll('h2')`.
1. At the time of writing, this selector returns 8 headings. Each corresponds to a box, and there are no other `h2` elements on the page. Thus, the selector is sufficient as is. A Python version of this check is sketched below.
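To double-check a selector outside the browser, you can run it with Beautiful Soup. A minimal sketch, assuming the `beautifulsoup4` package and `httpx` as the HTTP client (the client choice is an assumption):

```py
import httpx
from bs4 import BeautifulSoup

response = httpx.get("https://en.wikipedia.org/wiki/Main_Page")
soup = BeautifulSoup(response.text, "html.parser")

headings = soup.select("h2")
print(len(headings))  # 8 at the time of writing; the page changes over time
for heading in headings:
    print(heading.text.strip())
```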

</details>

Go to Shein's [Jewelry & Accessories](https://shein.com/RecommendSelection/Jewelry-Accessories-sc-017291431.html) page and use your DevTools to find a CSS selector which would select all the products.
1. Visit the [Jewelry & Accessories](https://shein.com/RecommendSelection/Jewelry-Accessories-sc-017291431.html) page. Close any pop-ups or promotions.
1. Activate the element selection tool in your DevTools.
1. Click on the first product to inspect its markup. Repeat with a few others.
1. Observe that all products are `section` elements with multiple classes, including `product-card`.
1. Since `section` is a generic wrapper, focus on the `product-card` class.
1. In the **Console**, execute `document.querySelectorAll('.product-card')`.
1. At the time of writing, this selector returns 120 results, all representing products. No further narrowing is necessary. A Python check of the same selector is sketched below.
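Sites like Shein often block plain HTTP clients, so a Python double-check works best against a copy of the page saved from the browser. A sketch, assuming the page was saved as `page.html`:

```py
from pathlib import Path

from bs4 import BeautifulSoup

html_code = Path("page.html").read_text()
soup = BeautifulSoup(html_code, "html.parser")

products = soup.select(".product-card")
print(len(products))  # 120 at the time of writing
```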
Hint: Learn about the [descendant combinator](https://developer.mozilla.org/en-US/docs/Web/CSS/Descendant_combinator).
1. Open the [page about F1](https://www.theguardian.com/sport/formulaone).
1. Activate the element selection tool in your DevTools.
1. Click on an article to inspect its structure. Check several articles, including the ones with smaller cards.
1. Note that all articles are `li` elements, but their classes (e.g., `dcr-1qmyfxi`) are dynamically generated and unreliable.
1. Using `document.querySelectorAll('li')` returns too many results, including unrelated items like navigation links.
1. Inspect the page structure. The `main` element contains the primary content, including articles. Use the descendant combinator to target `li` elements within `main`.
1. In the **Console**, execute `document.querySelectorAll('main li')`. A Beautiful Soup version of this check is sketched below.
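The descendant combinator works the same way in Beautiful Soup's `.select()`. A minimal sketch, again assuming `httpx` and `beautifulsoup4`:

```py
import httpx
from bs4 import BeautifulSoup

response = httpx.get("https://www.theguardian.com/sport/formulaone")
soup = BeautifulSoup(response.text, "html.parser")

print(len(soup.select("li")))       # too many results, including navigation
print(len(soup.select("main li")))  # only list items inside the main element
```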
On the Guardian's [F1 news page](https://www.theguardian.com/sport/formulaone), use the DevTools **Console** to extract the title and the lead paragraph of the first post.
1. Open the [F1 news page](https://www.theguardian.com/sport/formulaone).
1. Activate the element selection tool in your DevTools.
1. Click on the first post.
1. Notice that the markup does not provide clear, reusable class names for this task. The structure uses generic tag names and randomized classes, requiring you to rely on the element hierarchy and order instead.
1. In the **Console**, execute `post = document.querySelector('#maincontent ul li')`. This returns the element representing the first post.
1. Extract the post's title by executing `post.querySelector('h3').textContent`.
1. Extract the lead paragraph by executing `post.querySelector('span div').textContent`. A Python version of these steps is sketched below.
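The same steps can be mirrored outside the Console. A sketch, assuming `httpx` and `beautifulsoup4`:

```py
import httpx
from bs4 import BeautifulSoup

response = httpx.get("https://www.theguardian.com/sport/formulaone")
soup = BeautifulSoup(response.text, "html.parser")

post = soup.select_one("#maincontent ul li")  # the first post
print(post.select_one("h3").text)             # its title
print(post.select_one("span div").text)       # its lead paragraph
```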
import Exercises from './_exercises.mdx';

---

From lessons about browser DevTools we know that the HTML elements representing individual products have a `class` attribute which, among other values, contains `product-item`.

![Products have the ‘product-item’ class](./images/product-item.png)

```text
$ pip install beautifulsoup4
Successfully installed beautifulsoup4-4.0.0 soupsieve-0.0
```

Now let's use it for parsing the HTML. The `BeautifulSoup` object allows us to work with the HTML elements in a structured way. As a demonstration, we'll first get the `<h1>` element, which represents the main heading of the page.

![Element of the main heading](./images/h1.png)

Update your code to the following:
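The updated code itself is collapsed in this diff. A sketch consistent with the surrounding lesson (the `httpx` download step and the URL are assumptions) might look like this:

```py
import httpx
from bs4 import BeautifulSoup

# The store's Sales page; the exact URL is an assumption in this sketch.
url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
response = httpx.get(url)

soup = BeautifulSoup(response.text, "html.parser")
print(soup.select("h1"))
```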

```text
$ python main.py
[<h1 class="collection__title heading h1">Sales</h1>]
```

Our code lists all `h1` elements it can find on the page. There happens to be just one, so the result is a list with a single item. What if we want to print just the text? Let's change the end of the program to the following:

```py
headings = soup.select("h1")
first_heading = headings[0]
print(first_heading.text)
```

If we run our scraper again, it prints the text of the first `h1` element:

```text
$ python main.py
Sales
```
Max Verstappen wins Canadian Grand Prix: F1 – as it happened 2024-06-09

Hints:

- HTML's `time` element can have an attribute `datetime`, which [contains data in a machine-readable format](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/time), such as the ISO 8601.
- Beautiful Soup gives you [access to attributes as if they were dictionary keys](https://beautiful-soup-4.readthedocs.io/en/latest/#attributes).
- In Python you can create `datetime` objects using `datetime.fromisoformat()`, a [built-in method for parsing ISO 8601 strings](https://docs.python.org/3/library/datetime.html#datetime.datetime.fromisoformat).
- To get just the date part, you can call `.date()` on any `datetime` object. A sketch combining these hints follows.
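Putting the hints together, a minimal sketch with stand-in markup (on the real page, the `time` element comes from the article cards):

```py
from datetime import datetime

from bs4 import BeautifulSoup

# Stand-in markup; the real time element sits inside an article card.
html_code = '<time datetime="2024-06-09T20:10:00+00:00">Sunday</time>'
soup = BeautifulSoup(html_code, "html.parser")

time_element = soup.select_one("time")
iso_string = time_element["datetime"]  # attributes work like dictionary keys
published_on = datetime.fromisoformat(iso_string).date()
print(published_on)  # 2024-06-09
```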
With everything in place, we can now start working on a scraper that also scrapes the product detail pages.

![Product card's child elements](./images/child-elements.png)

Several methods exist for transitioning from one page to another, but the most common is a link element, which looks like this:

```html
<a href="https://example.com">Text of the link</a>
```

In DevTools, we can see that each product title is, in fact, also a link element. Our code already locates the titles, which makes the task easier. We just need to edit it so that it extracts not only the text of the element but also the `href` attribute. Beautiful Soup elements support accessing attributes as if they were dictionary keys:

```py
def parse_product(product):
    # The rest of this block is collapsed in the diff; the lines below are a
    # sketch, and the title selector is an assumption based on earlier lessons.
    title_element = product.select_one(".product-item__title")
    title = title_element.text.strip()
    url = title_element["href"]  # attributes work like dictionary keys
    ...
```
As a scraper developer, you are not limited by whether certain data is available

### Why learn with Apify

We are [Apify](https://apify.com), a web scraping and automation platform. We do our best to build this course on top of open source technologies. That means what you learn applies to any scraping project, and you'll be able to run your scrapers on any computer. We will show you how a scraping platform can simplify your life, but that lesson is optional and designed to fit within our [free tier](https://apify.com/pricing).

## Course content
