From previous lessons we know that the HTML tags representing individual products have a `class` attribute which, among other values, contains `product-item`.
13
+
From lessons about browser DevTools we know that the HTML tags representing individual products have a `class` attribute which, among other values, contains `product-item`.
14
14
15
-

15
+

16
16
17
-
As a first step, let's try counting how many products are in the listing.
17
+
As a first step, let's try counting how many products are on the listing page.
18
18
19
19
## Treating HTML as a string
20
20
21
-
Currently, the entire HTML is available in our program as a string. Our program can print it to the screen or save it to a file, but not much more. Can we use Python string operations to count the products? Each string has `.count()`, a [method for counting substrings](https://docs.python.org/3/library/stdtypes.html#str.count).
21
+
Currently, the entire HTML is available in our program as a string. Our program can print it to the screen or save it to a file, but not much more. If it's a string, could we use Python string operations to count the products? Each Python string has `.count()`, a [method for counting substrings](https://docs.python.org/3/library/stdtypes.html#str.count).
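To get a feel for how `.count()` behaves, here is a minimal sketch on a made-up HTML fragment. The fragment and the `product-item--vertical` class are assumptions for illustration only, the real page is much larger. The sketch shows that the result depends heavily on how precise the substring is:

```python
# Made-up HTML fragment for illustration; not the real product listing page.
html_code = (
    '<div class="product-item product-item--vertical">'
    '<div class="product-item__image-wrapper"></div>'
    '<div class="product-item__info"></div>'
    '</div>'
)

# str.count() returns the number of non-overlapping occurrences of a substring.
loose_count = html_code.count("product-item")
tag_count = html_code.count('<div class="product-item')
strict_count = html_code.count('<div class="product-item ')

print(loose_count, tag_count, strict_count)  # prints: 4 3 1
```

Only the last substring, which ends with a space, matches the single product in this fragment. Note the single quotes around substrings that contain a double quote character.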
22
22
23
-
After manually inspecting the page in browser DevTools we can see that each product has the following structure:
23
+
After manually inspecting the page in browser DevTools we can see that all products follow this structure:
@@ -33,7 +33,9 @@ After manually inspecting the page in browser DevTools we can see that each prod
33
33
</div>
34
34
```
35
35
36
-
At first sight, counting `product-item` occurrences wouldn't match only products. Let's try looking for `<div class="product-item`, a substring which represents the entire beginning of each product tag. Because the substring contains a double quote character, we need to use single quotes as string boundaries. Replace your program with the following code:
36
+
At first sight, counting `product-item` occurrences wouldn't match only products, but also `product-item__image-wrapper`. Hmm.
37
+
38
+
We could try looking for `<div class="product-item`, a substring which represents the entire beginning of each product tag, but that would also count `<div class="product-item__info`! We'll need to add a space after the class name to avoid matching those. Replace your program with the following code:
Note that because the substring contains a double quote character, we need single quotes as string boundaries.
53
+
50
54
:::info Handling errors
51
55
52
-
To keep the code examples concise, we're omitting error handling for now. Keeping `response.raise_for_status()` ensures that your program at least crashes and prints what happened in case there's an error.
56
+
To keep the code examples concise, we're omitting error handling for now. Keeping `response.raise_for_status()` ensures that your program at least visibly crashes and prints what happened in case there's an error.
53
57
54
58
:::
55
59
56
-
Unfortunately, this doesn't seem to be sufficient. Running the code above prints 123, which is a suspiciously high number. It seems there are more `div` tags with class names starting with `product-item`.
57
-
58
-
On closer look at the HTML, our substring also matches tags like `<div class="product-item__info">`. What if we force our code to count only those with a space after the class name?
60
+
Our scraper prints 24, which is in line with the text **Showing 1–24 of 50 products** above the product listing. Phew, figuring this out was quite tedious!
Now it prints 24, which is in line with the text **Showing 1–24 of 50 products** above the product listing. Phew, that was tedious!
65
-
66
67
<!-- TODO image -->
67
68
68
69
While possible, we can see that processing HTML with [standard string methods](https://docs.python.org/3/library/stdtypes.html#string-methods) is difficult and fragile. Imagine we weren't just counting, but trying to get the titles and prices.
Now let's use it for parsing the HTML. Unlike a plain string, the `BeautifulSoup` object allows us to work with the HTML elements in a structured way. As a demonstration, we'll first get the `<h1>` tag, which represents the main heading of the page.
90
+
91
+

92
+
93
+
Update your code to the following:
89
94
90
95
```python
91
96
import httpx
@@ -97,24 +102,25 @@ response.raise_for_status()
97
102
98
103
html_code = response.text
99
104
soup = BeautifulSoup(html_code, "html.parser")
100
-
print(soup.title)
105
+
print(soup.select("h1"))
101
106
```
102
107
103
-
The `BeautifulSoup` object contains our HTML, but unlike a plain string, it allows us to work with the HTML elements in a structured way. As a demonstration, we use the shorthand `.title` for accessing the HTML `<title>` tag. Let's run the program:
That looks promising! What if we want just the contents of the tag? Let's change the print line to the following:
115
+
Our code lists all `<h1>` tags it can find on the page. There happens to be just one, so the result is a list with a single item. What if we want to print just the text? Let's change the end of the program to the following:
112
116
113
117
```python
114
-
print(soup.title.text)
118
+
headings = soup.select("h1")
119
+
first_heading = headings[0]
120
+
print(first_heading.text)
115
121
```
116
122
117
-
If we run our scraper again, it prints just the actual text of the `<title>` tag:
123
+
If we run our scraper again, it prints the text of the first `<h1>` tag:
118
124
119
125
```text
120
126
$ python main.py
@@ -123,7 +129,9 @@ Sales
123
129
124
130
## Using CSS selectors
125
131
126
-
Beautiful Soup offers a `.select()` method, which runs a CSS selector against a parsed HTML document and returns all the matching elements. Scanning through [usage examples](https://beautiful-soup-4.readthedocs.io/en/latest/#css-selectors) will help us to figure out code for counting the products:
132
+
Beautiful Soup's `.select()` method runs a _CSS selector_ against a parsed HTML document and returns all the matching elements. It's like calling `document.querySelectorAll()` in browser DevTools.
133
+
134
+
Scanning through [usage examples](https://beautiful-soup-4.readthedocs.io/en/latest/#css-selectors) will help us to figure out code for counting the products:
In CSS, `.product-item` selects all elements whose `class` attribute contains the value `product-item`. We call `soup.select()` with the selector and get back a list of matching elements. On the last line, we use `len()` to count how many items are in the list. That's it!
150
+
In CSS, `.product-item` selects all elements whose `class` attribute contains the value `product-item`. We call `soup.select()` with the selector and get back a list of matching elements. Beautiful Soup handles all the complexity of understanding the HTML markup for us. On the last line, we use `len()` to count how many items there are in the list.
143
151
144
152
```text
145
153
$ python main.py
146
154
24
147
155
```
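The counting step described above can also be tried in isolation. The following is a minimal sketch, assuming Beautiful Soup is installed, and it uses a made-up HTML fragment in place of the downloaded page. The `product-item--vertical` class and the product texts are hypothetical:

```python
from bs4 import BeautifulSoup

# Made-up fragment standing in for the downloaded page (an assumption).
html_code = """
<div class="product-item product-item--vertical">
  <div class="product-item__info">First product</div>
</div>
<div class="product-item">
  <div class="product-item__info">Second product</div>
</div>
"""

soup = BeautifulSoup(html_code, "html.parser")

# .product-item matches elements that have product-item among their classes,
# so the nested product-item__info elements are not counted.
products = soup.select(".product-item")
print(len(products))  # prints: 2
```

Unlike the substring approach, the selector needs no tricks with trailing spaces or quotes, because matching happens on whole class names.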
148
156
149
-
We have managed to download a product listing, parse its HTML, and count how many products it contains. In the next lesson, we'll be looking for a way to extract detailed information about individual products.
157
+
That's it! We have managed to download a product listing, parse its HTML, and count how many products it contains. In the next lesson, we'll be looking for a way to extract detailed information about individual products.
sources/academy/webscraping/scraping_basics_python/06_locating_elements.md
Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
1
1
---
2
2
title: Locating HTML elements with Python
3
3
sidebar_label: Locating HTML elements
4
-
description: TODO
4
+
description: Lesson about building a Python application for watching prices. Using the Beautiful Soup library to locate products on the product listing page.