sources/academy/webscraping/scraping_basics_python/05_parsing_html.md
+2 -2 (2 additions & 2 deletions)
@@ -20,7 +20,7 @@ As a first step, let's try counting how many products is on the listing page.
Currently, the entire HTML is available in our program as a string. Our program can print it to the screen or save it to a file, but not much more. If it's a string, could we use Python string operations to count the products? Each Python string has `.count()`, a [method for counting substrings](https://docs.python.org/3/library/stdtypes.html#str.count).
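As a rough sketch of how `.count()` behaves, consider a made-up HTML snippet. This is not the real page, and the `product-item` class name is an assumption used only for illustration:

```py
# Hypothetical, simplified markup standing in for the downloaded page.
html = """
<div class="product-item">Speaker</div>
<div class="product-item">Turntable</div>
<div class="product-item product-item--sale">Subwoofer</div>
"""

# str.count() returns the number of non-overlapping occurrences of a substring.
card_count = html.count('<div class="product-item')
print(card_count)

# A looser substring also matches the modifier class on the third card.
loose_count = html.count("product-item")
print(loose_count)
```

The result depends entirely on which substring we pick, which hints at how fragile plain string operations are for this job.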
-
After manually inspecting the page in browser DevTools we can see that all products follow this structure:
+
After manually inspecting the page in browser DevTools, we can see that all product cards have the following structure:
Beautiful Soup's `.select()` method runs a _CSS selector_ against a parsed HTML document and returns all the matching elements. It's like calling `document.querySelectorAll()` in browser DevTools.
-
Scanning through [usage examples](https://beautiful-soup-4.readthedocs.io/en/latest/#css-selectors) will help us to figure out code for counting the products:
+
Scanning through the [usage examples](https://beautiful-soup-4.readthedocs.io/en/latest/#css-selectors) will help us figure out the code for counting the product cards:
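A minimal sketch of that counting step, assuming a tiny made-up listing in place of the real page. The `.product-item` class for a product card is an assumption here, and the snippet expects the `beautifulsoup4` package to be installed:

```py
from bs4 import BeautifulSoup

# Hypothetical markup; the real listing page is larger and more deeply nested.
html = """
<div class="product-item">JBL Flip 4</div>
<div class="product-item">Sony SACS9</div>
<p class="footer">Not a product</p>
"""

soup = BeautifulSoup(html, "html.parser")

# .select() returns a list of every element matching the CSS selector.
products = soup.select(".product-item")
print(len(products))
```

Because `.select()` returns a regular list, the built-in `len()` is enough to count the matches.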
sources/academy/webscraping/scraping_basics_python/06_locating_elements.md
+165 -3 (165 additions & 3 deletions)
@@ -6,8 +6,170 @@ sidebar_position: 6
slug: /scraping-basics-python/locating-elements
---
-
:::danger Work in progress
+
**In this lesson, we'll locate product data in the downloaded HTML. We'll use Beautiful Soup to find the HTML elements that contain details about each product, such as its title or price.**
-
This course is incomplete. As we work on adding new lessons, we would love to hear your feedback. You can comment right here under each page or [file a GitHub Issue](https://github.com/apify/apify-docs/issues) to discuss a problem.
+
---

In the previous lesson, we managed to print the text of the page's main heading and count how many products are in the listing. Let's combine those two: what happens if we print `.text` for each product card?
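To get a feel for the answer without the real page, here is a small sketch on made-up markup. The `.product-item` card class and the sample products are assumptions for illustration, and `beautifulsoup4` must be installed:

```py
from bs4 import BeautifulSoup

# Hypothetical product cards, written without extra whitespace inside each card.
html = """
<div class="product-item"><a class="product-item__title">JBL Flip 4</a><span class="price"><span class="visually-hidden">Sale price</span>$74.95</span></div>
<div class="product-item"><a class="product-item__title">Sony SACS9</a><span class="price"><span class="visually-hidden">Sale price</span>$158.00</span></div>
"""

soup = BeautifulSoup(html, "html.parser")

# .text concatenates every piece of text found anywhere inside the element.
texts = [product.text for product in soup.select(".product-item")]
for text in texts:
    print(repr(text))
```

Each card comes back as one undivided string, so the title, the hidden label, and the price all run together.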
There's still some room for improvement, but it's already much better!

## Locating a single element

Often, our code can assume that a certain element exists only once. Working with lists is tedious when we know we're looking for a single element. For this purpose, Beautiful Soup offers the `.select_one()` method. Like `document.querySelector()` in browser DevTools, it returns just the first matching element, or `None` if nothing matches. Let's simplify our code!
```py
# ...
title = product.select_one('.product-item__title').text
price = product.select_one('.price').text
print(title, price)
```

This program does the same as the one we already had, but its code is more concise.
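The difference between the two methods can be sketched on a single made-up product card. The markup and the `.product-item` class are assumptions, and `beautifulsoup4` must be installed:

```py
from bs4 import BeautifulSoup

# Hypothetical card that has a title but, on purpose, no price element.
html = '<div class="product-item"><a class="product-item__title">JBL Flip 4</a></div>'

soup = BeautifulSoup(html, "html.parser")
product = soup.select_one(".product-item")

# With .select() we get a list and must pick the first item ourselves.
title_from_list = product.select(".product-item__title")[0].text

# With .select_one() we get the element directly.
title = product.select_one(".product-item__title").text

# When nothing matches, .select_one() returns None, so guard before using .text.
missing = product.select_one(".price")
price = missing.text if missing is not None else None

print(title, price)
```

Checking for `None` is one possible way to handle cards that lack an expected element; without the check, accessing `.text` on `None` would raise an `AttributeError`.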

## Precisely locating price

In the output, we can see that the price isn't located precisely. For each product, our scraper also prints the text `Sale price`. Let's look at the HTML structure again. Each bit containing the price looks like this:

```html
<span class="price">
  <span class="visually-hidden">Sale price</span>
  $74.95
</span>
```
When translated to a tree of Python objects, the element with class `price` will contain several nodes:

- a text node with whitespace,
- a `span` HTML element,
- a text node representing the actual amount and possibly also whitespace.

We can use Beautiful Soup's `.contents` property to access individual nodes. It returns a list of nodes like this:
```py
# ...
title = product.select_one('.product-item__title').text
price = product.select_one('.price').contents[-1]
print(title, price)
```
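The node walk can be tried in isolation on the price snippet alone. This is a sketch that assumes the indented markup shown earlier, so the exact whitespace in the nodes may differ on the real page; the `.strip()` call is an addition for tidying the result, and `beautifulsoup4` must be installed:

```py
from bs4 import BeautifulSoup

# The price markup from this lesson, with indentation assumed for readability.
html = """
<span class="price">
  <span class="visually-hidden">Sale price</span>
  $74.95
</span>
"""

soup = BeautifulSoup(html, "html.parser")
price_element = soup.select_one(".price")

# .contents lists the direct children: text node, span element, text node.
nodes = price_element.contents
print(len(nodes))

# The last node is a string holding the amount plus surrounding whitespace.
print(repr(nodes[-1]))

# Strings returned by Beautiful Soup support the usual str methods.
price = nodes[-1].strip()
print(price)
```

Indexing with `[-1]` relies on the amount always being the last child node, which holds for this structure but is worth re-checking if the markup changes.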

If we run our program now, it should print only the actual amounts as prices:

```text
$ python main.py
JBL Flip 4 Waterproof Portable Bluetooth Speaker $74.95
Sony XBR-950G BRAVIA 4K HDR Ultra HD TV From $1,398.00
Sony SACS9 10" Active Subwoofer $158.00
Sony PS-HX500 Hi-Res USB Turntable $398.00
...
```
-
:::
+
Great! We have managed to use CSS selectors and walk the HTML tree to get a list of product titles and prices. But wait a second, what's `From $1,398.00`? One does not simply scrape a price! We'll need to clean that up. But that's a job for the next lesson, which is about extracting data.
This course is incomplete. As we work on adding new lessons, we would love to hear your feedback. You can comment right here under each page or [file a GitHub Issue](https://github.com/apify/apify-docs/issues) to discuss a problem.