Commit 8685196

feat: add one more lesson
1 parent 1590092 commit 8685196

File tree

4 files changed: +169 -5 lines


sources/academy/webscraping/scraping_basics_python/05_parsing_html.md

Lines changed: 2 additions & 2 deletions
@@ -20,7 +20,7 @@ As a first step, let's try counting how many products are on the listing page.

Currently, the entire HTML is available in our program as a string. Our program can print it to the screen or save it to a file, but not much more. If it's a string, could we use Python string operations to count the products? Each Python string has `.count()`, a [method for counting substrings](https://docs.python.org/3/library/stdtypes.html#str.count).
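
To make this concrete, here's a minimal sketch of the idea (the HTML snippet is invented for illustration; the real listing page is far larger):

```python
# Minimal sketch of counting with str.count(). The snippet below is
# invented for illustration; real product cards carry many more
# attributes and nested tags.
html_code = """
<div class="product-item product-item--vertical">Speaker</div>
<div class="product-item product-item--vertical">Subwoofer</div>
"""

# str.count() counts non-overlapping substring occurrences, so this
# over-counts: "product-item" also appears at the start of the
# "product-item--vertical" class name.
print(html_code.count("product-item"))  # prints 4, though there are 2 products
```

This is one reason naive string counting is fragile, and why parsing the HTML properly pays off.
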

- After manually inspecting the page in browser DevTools we can see that all products follow this structure:
+ After manually inspecting the page in browser DevTools we can see that all product cards have the following structure:

```html
<div class="product-item product-item--vertical ...">
```

@@ -131,7 +131,7 @@ Sales

Beautiful Soup's `.select()` method runs a _CSS selector_ against a parsed HTML document and returns all the matching elements. It's like calling `document.querySelectorAll()` in browser DevTools.
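
As a rough sketch of what such counting code might look like (the HTML is inlined here instead of downloaded, so the example runs offline; the lesson's actual code parses the response body of the live page):

```python
from bs4 import BeautifulSoup

# Inlined HTML stands in for the downloaded listing page.
html_code = """
<div class="product-item product-item--vertical">JBL Flip 4</div>
<div class="product-item product-item--vertical">Sony SACS9</div>
"""

soup = BeautifulSoup(html_code, "html.parser")

# .select() returns a list of elements matching the CSS selector,
# so the list's length is the number of product cards.
products = soup.select(".product-item")
print(len(products))  # prints 2
```

Unlike substring counting, the selector matches whole class tokens, so the `product-item--vertical` modifier class doesn't inflate the count.
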

- Scanning through [usage examples](https://beautiful-soup-4.readthedocs.io/en/latest/#css-selectors) will help us to figure out code for counting the products:
+ Scanning through [usage examples](https://beautiful-soup-4.readthedocs.io/en/latest/#css-selectors) will help us to figure out code for counting the product cards:

```python
import httpx
```

sources/academy/webscraping/scraping_basics_python/06_locating_elements.md

Lines changed: 165 additions & 3 deletions
@@ -6,8 +6,170 @@ sidebar_position: 6
slug: /scraping-basics-python/locating-elements
---

- :::danger Work in progress

**In this lesson we'll locate product data in the downloaded HTML. We'll use Beautiful Soup to find those HTML elements which contain details about each product, such as title or price.**

- This course is incomplete. As we work on adding new lessons, we would love to hear your feedback. You can comment right here under each page or [file a GitHub Issue](https://github.com/apify/apify-docs/issues) to discuss a problem.

---

In the previous lesson we managed to print the text of the page's main heading and count how many products are in the listing. Let's combine the two approaches: what happens if we print `.text` for each product card?

```python
import httpx
from bs4 import BeautifulSoup

url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
response = httpx.get(url)
response.raise_for_status()

html_code = response.text
soup = BeautifulSoup(html_code, "html.parser")
for product in soup.select(".product-item"):
    print(product.text)
```

Well, it definitely prints _something_:

```text
$ python main.py
Save $25.00



JBL
JBL Flip 4 Waterproof Portable Bluetooth Speaker



Black

+7

Blue

+6

Grey
...
```

To get details about each product in a structured way, we'll need a different approach.

## Locating child elements

As in the browser DevTools lessons, we need to change the code so that it locates child elements for each product card.

![Product card's child elements](./images/child-elements.png)

We should be looking for elements which have the `product-item__title` and `price` classes. We already know how that translates to CSS selectors:

```python
import httpx
from bs4 import BeautifulSoup

url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
response = httpx.get(url)
response.raise_for_status()

html_code = response.text
soup = BeautifulSoup(html_code, "html.parser")
for product in soup.select(".product-item"):
    titles = product.select(".product-item__title")
    first_title = titles[0].text

    prices = product.select(".price")
    first_price = prices[0].text

    print(first_title, first_price)
```

Let's run the program now:

```text
$ python main.py
JBL Flip 4 Waterproof Portable Bluetooth Speaker
Sale price$74.95
Sony XBR-950G BRAVIA 4K HDR Ultra HD TV
Sale priceFrom $1,398.00
Sony SACS9 10" Active Subwoofer
Sale price$158.00
Sony PS-HX500 Hi-Res USB Turntable
Sale price$398.00
...
```

There's still some room for improvement, but it's already much better!

## Locating a single element

Often, we want to assume in our code that a certain element exists only once. It's a bit tedious to work with lists when you know you're looking for a single element. For this purpose, Beautiful Soup offers a `.select_one()` method. Like `document.querySelector()` in browser DevTools, it returns just one result or none. Let's simplify our code!

```python
import httpx
from bs4 import BeautifulSoup

url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
response = httpx.get(url)
response.raise_for_status()

html_code = response.text
soup = BeautifulSoup(html_code, "html.parser")
for product in soup.select(".product-item"):
    title = product.select_one(".product-item__title").text
    price = product.select_one(".price").text
    print(title, price)
```

This program does the same as the one we already had, but its code is more concise.
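
One caveat worth keeping in mind: like `document.querySelector()`, `.select_one()` returns `None` when nothing matches, and calling `.text` on `None` raises `AttributeError`. A small sketch with an invented card that lacks a price:

```python
from bs4 import BeautifulSoup

# Invented snippet: a product card without a price element.
soup = BeautifulSoup('<div class="product-item">Mystery item</div>', "html.parser")

product = soup.select_one(".product-item")
price = product.select_one(".price")  # no match, so this is None

# Guard before touching .text to avoid an AttributeError.
if price is None:
    print("price missing")
else:
    print(price.text)
```
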

## Precisely locating price

In the output we can see that the price isn't located precisely. For each product, our scraper also prints the text `Sale price`. Let's look at the HTML structure again. Each bit containing the price looks like this:

```html
<span class="price">
  <span class="visually-hidden">Sale price</span>
  $74.95
</span>
```

When translated to a tree of Python objects, the element with class `price` will contain several nodes:

- a textual node with white space,
- a `span` HTML element,
- a textual node representing the actual amount and possibly also white space.

We can use Beautiful Soup's `.contents` property to access individual nodes. It returns a list of nodes like this:

```
['\n', <span class="visually-hidden">Sale price</span>, '$74.95']
```
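
To see such a list for ourselves, we can call `.contents` on the price markup shown above (inlined here so the sketch runs offline; the exact white space on the live page may differ):

```python
from bs4 import BeautifulSoup

# The price markup from above, inlined so this runs offline.
html_code = """<span class="price">
<span class="visually-hidden">Sale price</span>
$74.95
</span>"""

soup = BeautifulSoup(html_code, "html.parser")
price = soup.select_one(".price")

# .contents lists the element's direct child nodes: a whitespace
# text node, the hidden span, and the text node with the amount.
print(price.contents)
print(price.contents[-1].strip())  # prints $74.95
```
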

It seems we can get the actual amount by reading the last element of such a list. Let's fix our program:

```python
import httpx
from bs4 import BeautifulSoup

url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
response = httpx.get(url)
response.raise_for_status()

html_code = response.text
soup = BeautifulSoup(html_code, "html.parser")
for product in soup.select(".product-item"):
    title = product.select_one(".product-item__title").text
    price = product.select_one(".price").contents[-1]
    print(title, price)
```

If we run our program now, it should print just the actual amounts:

```text
$ python main.py
JBL Flip 4 Waterproof Portable Bluetooth Speaker $74.95
Sony XBR-950G BRAVIA 4K HDR Ultra HD TV From $1,398.00
Sony SACS9 10" Active Subwoofer $158.00
Sony PS-HX500 Hi-Res USB Turntable $398.00
...
```

- :::

Great! We have managed to use CSS selectors and walk the HTML tree to get a list of product titles and prices. But wait a second: what's `From $1,398.00`? One does not simply scrape a price! We'll need to clean that up, but that's a job for the next lesson, which is about extracting data.

sources/academy/webscraping/scraping_basics_python/07_extracting_data.md

Lines changed: 2 additions & 0 deletions
@@ -11,3 +11,5 @@ slug: /scraping-basics-python/extracting-data
This course is incomplete. As we work on adding new lessons, we would love to hear your feedback. You can comment right here under each page or [file a GitHub Issue](https://github.com/apify/apify-docs/issues) to discuss a problem.

:::

<!-- .strings, .stripped_strings, string manipulation, regexp -->
917 KB binary file (preview not shown)
