Changed file: sources/academy/webscraping/scraping_basics_python/04_downloading_html.md (7 additions, 7 deletions)
@@ -30,7 +30,7 @@ Being comfortable around Python project setup and installing packages is a prere
 
 Now let's test that everything works. Inside the project directory create a new file called `main.py` with the following code:
 
-```python
+```py
 import httpx
 
 print("OK")
@@ -53,7 +53,7 @@ If you see errors or for any other reason cannot run the code above, we're sorry
 
 Now onto coding! Let's change our code so it downloads the HTML of the product listing instead of printing OK. The [documentation of the HTTPX library](https://www.python-httpx.org/) provides us with examples of how to use it. Inspired by those, our code will look like this:
@@ -106,7 +106,7 @@ Sometimes websites return all kinds of errors. Most often because:
 
 In HTTP, each response has a three-digit _status code_, which tells us whether it's an error or a success. Let's change the last line of our program to print the status code of the response we get:
 
-```python
+```py
 print(response.status_code)
 ```
@@ -140,7 +140,7 @@ A robust scraper skips or retries requests when errors occur, but let's start si
 
 We also want to play along with the conventions of the operating system, so we'll print to the [standard error output](https://en.wikipedia.org/wiki/Standard_streams#Standard_error_(stderr)) and exit our program with a non-zero [status code](https://en.wikipedia.org/wiki/Exit_status):
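A minimal sketch of those two conventions, using a hypothetical `fail()` helper:

```python
import sys

def fail(message: str) -> None:
    # Error reports belong on standard error, not standard output
    print(message, file=sys.stderr)
    # A non-zero status code tells the operating system the program failed
    sys.exit(1)

# Calling fail("Download failed!") would end the program with exit code 1
```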
If you want to use Python instead, it offers several ways to create files. The solution below uses [pathlib](https://docs.python.org/3/library/pathlib.html):
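For illustration, writing a downloaded page to disk with pathlib could look like this (the `products.html` filename and the placeholder HTML are made up):

```python
from pathlib import Path

html_code = "<!DOCTYPE html><html><body>...</body></html>"  # placeholder content

# write_text() creates (or overwrites) the file in a single call
Path("products.html").write_text(html_code, encoding="utf-8")
```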
Changed file: sources/academy/webscraping/scraping_basics_python/05_parsing_html.md (6 additions, 6 deletions)
@@ -37,7 +37,7 @@ At first sight, counting `product-item` occurrences wouldn't match only products,
 
 We could try looking for `<div class="product-item`, a substring which represents the entire beginning of each product tag, but that would also count `<div class="product-item__info`! We'll need to add a space after the class name to avoid matching those. Replace your program with the following code:
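The effect of that trailing space can be sketched with a made-up HTML fragment:

```python
# Made-up fragment with one product card and one nested info element
html_code = """
<div class="product-item product-item--vertical">
  <div class="product-item__info">...</div>
</div>
"""

# Without the space, both divs match; with it, only the product card does
print(html_code.count('<div class="product-item'))   # → 2
print(html_code.count('<div class="product-item '))  # → 1
```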
@@ -92,7 +92,7 @@ Now let's use it for parsing the HTML. Unlike plain string, the `BeautifulSoup`
 
 Update your code to the following:
 
-```python
+```py
 import httpx
 from bs4 import BeautifulSoup
@@ -114,7 +114,7 @@ $ python main.py
 
 Our code lists all `<h1>` tags it can find on the page. There happens to be just one, so the result is a list with a single item. What if we want to print just the text? Let's change the end of the program to the following:
 
-```python
+```py
 headings = soup.select("h1")
 first_heading = headings[0]
 print(first_heading.text)
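The same mechanics can be tried on a tiny made-up document:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><body><h1>Sales</h1></body></html>", "html.parser")

headings = soup.select("h1")  # always a list, even for a single match
print(headings)               # → [<h1>Sales</h1>]
print(headings[0].text)       # → Sales
```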
@@ -133,7 +133,7 @@ Beautiful Soup's `.select()` method runs a _CSS selector_ against a parsed HTML
 
 Scanning through [usage examples](https://beautiful-soup-4.readthedocs.io/en/latest/#css-selectors) will help us figure out the code for counting the product cards:
 In the previous lesson we've managed to print the text of the page's main heading or count how many products are in the listing. Let's combine those two: what happens if we print `.text` for each product card?
 
-```python
+```py
 import httpx
 from bs4 import BeautifulSoup
@@ -62,7 +62,7 @@ As in the browser DevTools lessons, we need to change the code so that it locate
 
 We should be looking for elements which have the `product-item__title` and `price` classes. We already know how that translates to CSS selectors:
 
-```python
+```py
 import httpx
 from bs4 import BeautifulSoup
@@ -73,10 +73,10 @@ response.raise_for_status()
 html_code = response.text
 soup = BeautifulSoup(html_code, "html.parser")
 for product in soup.select(".product-item"):
-    titles = product.select('.product-item__title')
+    titles = product.select(".product-item__title")
     first_title = titles[0].text
 
-    prices = product.select('.price')
+    prices = product.select(".price")
     first_price = prices[0].text
 
     print(first_title, first_price)
@@ -103,7 +103,7 @@ There's still some room for improvement, but it's already much better!
 
 Often, we want to assume in our code that a certain element exists only once. It's a bit tedious to work with lists when you know you're looking for a single element. For this purpose, Beautiful Soup offers a `.select_one()` method. Like `document.querySelector()` in browser DevTools, it returns just one result or none. Let's simplify our code!
 
-```python
+```py
 import httpx
 from bs4 import BeautifulSoup
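The difference between the two methods can be sketched on made-up markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="price">$10</p><p class="price">$20</p>', "html.parser")

print(soup.select(".price"))           # a list of all matches
print(soup.select_one(".price").text)  # → $10 (first match only)
print(soup.select_one(".nope"))        # → None (no match)
```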
@@ -114,8 +114,8 @@ response.raise_for_status()
 html_code = response.text
 soup = BeautifulSoup(html_code, "html.parser")
 for product in soup.select(".product-item"):
-    title = product.select_one('.product-item__title').text
-    price = product.select_one('.price').text
+    title = product.select_one(".product-item__title").text
+    price = product.select_one(".price").text
     print(title, price)
 ```
@@ -131,7 +131,7 @@ In the output we can see that the price isn't located precisely. For each produc
     $74.95
   </span>
 ```
-When translated to a tree of Python objects, the element with class `price` will contain several nodes:
+When translated to a tree of Python objects, the element with class `price` will contain several _nodes_:
 
 - Textual node with white space,
 - a `span` HTML element,
@@ -140,12 +140,12 @@ When translated to a tree of Python objects, the element with class `price` will
 
 We can use Beautiful Soup's `.contents` property to access individual nodes. It returns a list of nodes like this:
 
 It seems like we can read the last element to get the actual amount from a list like the above. Let's fix our program:
 
-```python
+```py
 import httpx
 from bs4 import BeautifulSoup
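What `.contents` returns can be illustrated with simplified markup mimicking the price element (the exact structure here is an assumption):

```python
from bs4 import BeautifulSoup

html = '<div class="price"><span class="visually-hidden">Sale price</span>$74.95</div>'
soup = BeautifulSoup(html, "html.parser")

price = soup.select_one(".price")
print(price.contents)      # a list: the <span> element, then a text node
print(price.contents[-1])  # → $74.95
```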
@@ -156,12 +156,12 @@ response.raise_for_status()
 html_code = response.text
 soup = BeautifulSoup(html_code, "html.parser")
 for product in soup.select(".product-item"):
-    title = product.select_one('.product-item__title').text
-    price = product.select_one('.price').contents[-1]
+    title = product.select_one(".product-item__title").text
+    price = product.select_one(".price").contents[-1]
     print(title, price)
 ```
-If we run our program now, it should print prices just as the actual amounts:
+If we run the scraper now, it should print prices as only amounts:
 
 ```text
 $ python main.py
@@ -173,3 +173,11 @@ Sony PS-HX500 Hi-Res USB Turntable $398.00
 ```
 
 Great! We have managed to use CSS selectors and walk the HTML tree to get a list of product titles and prices. But wait a second—what's `From $1,398.00`? One does not simply scrape a price! We'll need to clean that. But that's a job for the next lesson, which is about extracting data.
+
+---
+
+## Exercises
+
+These challenges are here to help you test what you’ve learned in this lesson. Try to resist the urge to peek at the solutions right away. Remember, the best learning happens when you dive in and do it yourself!