Commit 1590092

fix: streamline the parsing lesson
1 parent 381b122 commit 1590092

4 files changed
+35 −27 lines changed

sources/academy/webscraping/scraping_basics_python/05_parsing_html.md

Lines changed: 34 additions & 26 deletions
@@ -10,17 +10,17 @@ slug: /scraping-basics-python/parsing-html
 
 ---
 
-From previous lessons we know that the HTML tags representing individual products have a `class` attribute which, among other values, contains `product-item`.
+From lessons about browser DevTools we know that the HTML tags representing individual products have a `class` attribute which, among other values, contains `product-item`.
 
-![Products have the ‘product-item’ class](./images/collection-class.png)
+![Products have the ‘product-item’ class](./images/product-item.png)
 
-As a first step, let's try counting how many products is in the listing.
+As a first step, let's try counting how many products are on the listing page.
 
 ## Treating HTML as a string
 
-Currently, the entire HTML is available in our program as a string. Our program can print it to the screen or save it to a file, but not much more. Can we use Python string operations to count the products? Each string has `.count()`, a [method for counting substrings](https://docs.python.org/3/library/stdtypes.html#str.count).
+Currently, the entire HTML is available in our program as a string. Our program can print it to the screen or save it to a file, but not much more. If it's a string, could we use Python string operations to count the products? Each Python string has `.count()`, a [method for counting substrings](https://docs.python.org/3/library/stdtypes.html#str.count).
 
-After manually inspecting the page in browser DevTools we can see that each product has the following structure:
+After manually inspecting the page in browser DevTools we can see that all products follow this structure:
 
 ```html
 <div class="product-item product-item--vertical ...">
@@ -33,7 +33,9 @@ After manually inspecting the page in browser DevTools we can see that each prod
 </div>
 ```
 
-At first sight, counting `product-item` occurances wouldn't match only products. Let's try looking for `<div class="product-item`, a substring which represents the enitre beginning of each product tag. Because the substring contains a double quote character, we need to use single quotes as string boundaries. Replace your program with the following code:
+At first sight, counting `product-item` occurrences wouldn't match only products, but also `product-item__image-wrapper`. Hmm.
+
+We could try looking for `<div class="product-item`, a substring which represents the entire beginning of each product tag, but that would also count `<div class="product-item__info`! We'll need to add a space after the class name to avoid matching those. Replace your program with the following code:
 
 ```python
 import httpx
@@ -43,26 +45,25 @@ response = httpx.get(url)
 response.raise_for_status()
 
 html_code = response.text
-count = html_code.count('<div class="product-item')
+count = html_code.count('<div class="product-item ')
 print(count)
 ```
 
+Note that because the substring contains a double quote character, we need single quotes as string boundaries.
+
 :::info Handling errors
 
-To have the code examples more concise, we're omitting error handling for now. Keeping `response.raise_for_status()` ensures that your program at least crashes and prints what happened in case there's an error.
+To keep the code examples concise, we're omitting error handling for now. Keeping `response.raise_for_status()` ensures that your program at least visibly crashes and prints what happened in case there's an error.
 
 :::
 
-Unfortunately, this doesn't seem to be sufficient. Running the code above prints 123, which is a suspiciously high number. It seems there are more `div` tags with class names starting with `product-item`.
-
-On closer look at the HTML, our substring matches also tags like `<div class="product-item__info">`. What if we force our code to count only those with a space after the class name?
+Our scraper prints 24, which is in line with the text **Showing 1–24 of 50 products** above the product listing. Phew, figuring this out was quite tedious!
 
-```python
-count = html_code.count('<div class="product-item ')
+```text
+$ python main.py
+24
 ```
 
-Now it prints number 24, which is in line with the text **Showing 1–24 of 50 products** above the product listing. Phew, that was tedious!
-
 <!-- TODO image -->
 
 While possible, we can see that processing HTML with [standard string methods](https://docs.python.org/3/library/stdtypes.html#string-methods) is difficult and fragile. Imagine we wouldn't be just counting, but trying to get the titles and prices.
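The substring-counting pitfall this hunk describes can be tried in isolation. Below is a minimal sketch using a small inline HTML sample (hypothetical markup that mirrors the lesson's class names) instead of the live page:

```python
# Sketch of the substring-counting pitfall, using inline sample HTML
# (hypothetical markup mirroring the lesson's class names).
html_code = """
<div class="product-item product-item--vertical">
  <div class="product-item__image-wrapper"></div>
  <div class="product-item__info"></div>
</div>
<div class="product-item product-item--vertical">
  <div class="product-item__info"></div>
</div>
"""

# The naive substring also matches product-item__image-wrapper
# and product-item__info, so it overcounts.
naive = html_code.count('<div class="product-item')

# A trailing space restricts matches to the product containers,
# whose class attribute continues with another class name.
fixed = html_code.count('<div class="product-item ')

print(naive, fixed)  # 5 2
```

In this sample the naive count hits five `div` tags, while the space-suffixed substring counts only the two product containers, which is exactly the discrepancy the lesson's live run (123 vs. 24) demonstrates.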
@@ -85,7 +86,11 @@ $ pip install beautifulsoup4
 Successfully installed beautifulsoup4-4.0.0 soupsieve-0.0
 ```
 
-Now let's use it for parsing the HTML:
+Now let's use it for parsing the HTML. Unlike a plain string, the `BeautifulSoup` object allows us to work with the HTML elements in a structured way. As a demonstration, we'll first get the `<h1>` tag, which represents the main heading of the page.
+
+![Tag of the main heading](./images/h1.png)
+
+Update your code to the following:
 
 ```python
 import httpx
@@ -97,24 +102,25 @@ response.raise_for_status()
 
 html_code = response.text
 soup = BeautifulSoup(html_code, "html.parser")
-print(soup.title)
+print(soup.select("h1"))
 ```
 
-The `BeautifulSoup` object contains our HTML, but unlike plain string, it allows us to work with the HTML elements in a structured way. As a demonstration, we use the shorthand `.title` for accessing the HTML `<title>` tag. Let's run the program:
+Let's run the program:
 
 ```text
 $ python main.py
-<title>Sales
-</title>
+[<h1 class="collection__title heading h1">Sales</h1>]
 ```
 
-That looks promising! What if we want just the contents of the tag? Let's change the print line to the following:
+Our code lists all `<h1>` tags it can find on the page. There's just one, so in the result we see a list with a single item. What if we want to print just its text? Let's change the end of the program to the following:
 
 ```python
-print(soup.title.text)
+headings = soup.select("h1")
+first_heading = headings[0]
+print(first_heading.text)
 ```
 
-If we run our scraper again, it prints just the actual text of the `<title>` tag:
+If we run our scraper again, it prints the text of the first `<h1>` tag:
 
 ```text
 $ python main.py
@@ -123,7 +129,9 @@ Sales
 
 ## Using CSS selectors
 
-Beautiful Soup offers a `.select()` method, which runs a CSS selector against a parsed HTML document and returns all the matching elements. Scanning through [usage examples](https://beautiful-soup-4.readthedocs.io/en/latest/#css-selectors) will help us to figure out code for counting the products:
+Beautiful Soup's `.select()` method runs a _CSS selector_ against a parsed HTML document and returns all the matching elements. It's like calling `document.querySelectorAll()` in browser DevTools.
+
+Scanning through [usage examples](https://beautiful-soup-4.readthedocs.io/en/latest/#css-selectors) will help us figure out code for counting the products:
 
 ```python
 import httpx
@@ -139,14 +147,14 @@ products = soup.select(".product-item")
 print(len(products))
 ```
 
-In CSS, `.product-item` selects all elements whose `class` attribute contains value `product-item`. We call `soup.select()` with the selector and get back a list of matching elements. On the last line, we use `len()` to count how many items is in the list. That's it!
+In CSS, `.product-item` selects all elements whose `class` attribute contains the value `product-item`. We call `soup.select()` with the selector and get back a list of matching elements. Beautiful Soup handles all the complexity of understanding the HTML markup for us. On the last line, we use `len()` to count how many items are in the list.
 
 ```text
 $ python main.py
 24
 ```
 
-We have managed to download a product listing, parse its HTML, and count how many products it contains. In the next lesson, we'll be looking for a way to extract detailed information about individual products.
+That's it! We have managed to download a product listing, parse its HTML, and count how many products it contains. In the next lesson, we'll be looking for a way to extract detailed information about individual products.
 
 ---
 
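The Beautiful Soup workflow this file's diff builds up to can be sketched end to end against a small inline HTML sample (hypothetical markup reusing the lesson's class names) rather than the live listing page:

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# Inline sample standing in for the downloaded listing page
# (hypothetical markup reusing the lesson's class names).
html_code = """
<h1 class="collection__title heading h1">Sales</h1>
<div class="product-item product-item--vertical">
  <div class="product-item__info"></div>
</div>
<div class="product-item product-item--vertical"></div>
"""
soup = BeautifulSoup(html_code, "html.parser")

# .select() runs a CSS selector and returns a list of matching
# elements, like document.querySelectorAll() in browser DevTools.
headings = soup.select("h1")
print(headings[0].text)  # Sales

# The .product-item selector matches whole class tokens only, so
# product-item__info is not counted -- unlike the substring approach.
products = soup.select(".product-item")
print(len(products))  # 2
```

Because the CSS class selector matches complete class tokens rather than string prefixes, no trailing-space trick is needed here, which is the robustness argument the lesson makes for parsing over string methods.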
sources/academy/webscraping/scraping_basics_python/06_locating_elements.md

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 ---
 title: Locating HTML elements with Python
 sidebar_label: Locating HTML elements
-description: TODO
+description: Lesson about building a Python application for watching prices. Using the Beautiful Soup library to locate products on the product listing page.
 sidebar_position: 6
 slug: /scraping-basics-python/locating-elements
 ---
2.24 MB