`sources/academy/webscraping/scraping_basics_python/04_downloading_html.md` (+3 −10)
## Handling errors
Websites can return various errors, such as when the server is temporarily down, is applying anti-scraping protections, or is simply buggy. In HTTP, each response has a three-digit _status code_ that indicates whether it is an error or a success.
:::tip All status codes
If you've never worked with HTTP response status codes before, briefly scan their [full list](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status) to get at least a basic idea of what you might encounter. For further education on the topic, we recommend [HTTP Cats](https://http.cat/) as a highly professional resource.
:::
A robust scraper skips or retries requests on errors. Given the complexity of this task, it's best to use libraries or frameworks. For now, we'll at least make sure that our program visibly crashes and prints what happened in case there's an error.
First, let's ask for trouble. We'll change the URL in our code to a page that doesn't exist, so that we get a response with [status code 404](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/404). This could happen, for example, when the product we are scraping is no longer available:
We could check the value of `response.status_code` against a list of allowed numbers, but HTTPX already provides `response.raise_for_status()`, a method that analyzes the number and raises the `httpx.HTTPError` exception if our request wasn't successful:
`sources/academy/webscraping/scraping_basics_python/05_parsing_html.md` (+2 −2)

As a first step, let's try counting how many products are on the listing page.
## Processing HTML
After downloading, the entire HTML is available in our program as a string. We can print it to the screen or save it to a file, but not much more. However, since it's a string, could we use [string operations](https://docs.python.org/3/library/stdtypes.html#string-methods) or [regular expressions](https://docs.python.org/3/library/re.html) to count the products?
While somewhat possible, such an approach is tedious, fragile, and unreliable. To work with HTML, we need a robust tool dedicated to the task: an _HTML parser_. It takes text with HTML markup and turns it into a tree of Python objects.
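As a tiny sketch of what a parser gives us, here's Beautiful Soup counting elements in a made-up HTML fragment (the class name mirrors the one used on the real listing page):

```py
from bs4 import BeautifulSoup

# A made-up fragment standing in for the downloaded page.
html = """
<div class="product-item">JBL Flip 4</div>
<div class="product-item">Sony XBR-950G</div>
"""

soup = BeautifulSoup(html, "html.parser")

# The parser turns the markup into a tree we can query with CSS selectors.
products = soup.select(".product-item")
print(len(products))  # → 2
```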
There's still some room for improvement, but it's already much better!
## Locating a single element
Often, we want to assume in our code that a certain element exists only once. It's a bit tedious to work with lists when you know you're looking for a single element. For this purpose, Beautiful Soup offers the `.select_one()` method. Like `document.querySelector()` in browser DevTools, it returns just one result or `None`. Let's simplify our code!
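In isolation, `.select_one()` behaves like this (the snippet is made up for illustration):

```py
from bs4 import BeautifulSoup

html = '<h1 class="product-title">JBL Flip 4</h1>'
soup = BeautifulSoup(html, "html.parser")

# Returns the first matching element...
print(soup.select_one(".product-title").text)  # → JBL Flip 4

# ...or None when nothing matches, just like document.querySelector().
print(soup.select_one(".missing"))  # → None
```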
```py
import httpx
# ...
```
If we run the scraper now, it should print prices as only amounts:
```text
$ python main.py
JBL Flip 4 Waterproof Portable Bluetooth Speaker $74.95
Sony XBR-950G BRAVIA 4K HDR Ultra HD TV From $1,398.00
...
```
## Formatting output
The results seem to be correct, but they're hard to verify because the prices visually blend with the titles. Let's set a different separator for the `print()` function:
```py
print(title, price, sep=" | ")
```
The output is much nicer this way:
```text
$ python main.py
JBL Flip 4 Waterproof Portable Bluetooth Speaker | $74.95
Sony XBR-950G BRAVIA 4K HDR Ultra HD TV | From $1,398.00
...
```
`sources/academy/webscraping/scraping_basics_python/07_extracting_data.md` (+20 −21)

Locating the right HTML elements is the first step of a successful data extraction.
```text
$ python main.py
JBL Flip 4 Waterproof Portable Bluetooth Speaker | $74.95
Sony XBR-950G BRAVIA 4K HDR Ultra HD TV | From $1,398.00
...
```
The last bullet point is the most important to figure out before we start coding.
It's because some products have variants with different prices. Later in the course we'll get to crawling, i.e. following links and scraping data from more than just one page. That will allow us to get exact prices for all the products, but for now let's extract just what's in the listing.
Ideally we'd go and discuss the problem with those who are about to use the resulting data. For their purposes, is the fact that some prices are just minimum prices important? What would be the most useful representation of the range for them? Maybe they'd tell us that it's okay if we just remove the `From` prefix?
If you're not proficient in Python's string methods, [.startswith()](https://docs.python.org/3/library/stdtypes.html#str.startswith) checks the beginning of a given string, and [.removeprefix()](https://docs.python.org/3/library/stdtypes.html#str.removeprefix) removes something from the beginning of a given string.
:::
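A sketch of how these two methods apply to our prices (the sample value is assumed from the listing page):

```py
# Sample value as it appears on the listing page.
price_text = "From $1,398.00"

if price_text.startswith("From "):
    min_price = price_text.removeprefix("From ")
else:
    min_price = price_text

print(min_price)  # → $1,398.00
```

Note that `.removeprefix()` requires Python 3.9 or newer.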
The whole program would look like this:
```py
import httpx

# ...

for product in soup.select(".product-item"):
    # ...
    min_price = price_text
    price = min_price

    print(title, min_price, price, sep=" | ")
```
## Removing white space
Often, the strings we extract from a web page start or end with some amount of whitespace, typically space characters or newline characters, which come from the [indentation](https://en.wikipedia.org/wiki/Indentation_(typesetting)#Indentation_in_programming) of the HTML tags.
We call the operation of removing whitespace _stripping_ or _trimming_, and it's so useful in many applications that programming languages and libraries include ready-made tools for it. Let's add Python's built-in [.strip()](https://docs.python.org/3/library/stdtypes.html#str.strip):
```py
title = product.select_one(".product-item__title").text.strip()
```
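For example, `.strip()` removes the surrounding whitespace on its own (the sample string below is assumed):

```py
# Whitespace like this typically comes from the HTML indentation.
raw_title = "\n      JBL Flip 4 Waterproof Portable Bluetooth Speaker\n    "
print(raw_title.strip())  # → JBL Flip 4 Waterproof Portable Bluetooth Speaker
```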
:::info Handling strings in Beautiful Soup
Beautiful Soup offers several attributes when it comes to working with strings:
- `.string`, which often is like `.text`,
- `.strings`, which [returns a list of all nested textual nodes](https://beautiful-soup-4.readthedocs.io/en/latest/#strings-and-stripped-strings),
- `.stripped_strings`, which does the same but with whitespace removed.

These might be useful in some complex scenarios, but in our case, they won't make scraping the title or price any shorter or more elegant.
These might be useful in some complex scenarios, but in our case, they won't make scraping the title or price any shorter or more elegant.
:::
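A quick illustration of the difference between these attributes (made-up snippet):

```py
from bs4 import BeautifulSoup

html = "<p>  Hello <b>world</b>\n</p>"
soup = BeautifulSoup(html, "html.parser")

# .strings yields every nested text node, whitespace included.
print(list(soup.strings))
# .stripped_strings strips whitespace and skips whitespace-only nodes.
print(list(soup.stripped_strings))  # → ['Hello', 'world']
```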
## Removing dollar sign and commas
We got rid of the `From` and possible whitespace, but we still can't save the price as a number in our Python program:
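A sketch of the problem (the sample value is assumed from the listing):

```py
price_text = "$1,398.00"

try:
    price = float(price_text)
except ValueError as error:
    # The dollar sign and comma make the string unparseable as a number.
    print(error)  # → could not convert string to float: '$1,398.00'
```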
We need to remove the dollar sign and the commas separating thousands. For this type of cleaning, [regular expressions](https://docs.python.org/3/library/re.html) are often the best tool for the job, but in this case [`.replace()`](https://docs.python.org/3/library/stdtypes.html#str.replace) is also sufficient:
```py
price_text = (
    ...
)
```
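Concretely, with a sample value assumed from the listing:

```py
# Chain .replace() to drop the dollar sign, then the comma.
price_text = "$1,398.00".replace("$", "").replace(",", "")
print(price_text)  # → 1398.00
```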
Great! Only if we didn't overlook an important pitfall called _floating-point error_:

```py
>>> 0.1 + 0.2
0.30000000000000004
```
These errors are small and usually don't matter, but sometimes they can add up and cause unpleasant discrepancies. That's why it's typically best to avoid `float()` when working with money. Let's instead use Python's built-in [`Decimal()`](https://docs.python.org/3/library/decimal.html) type:
```py
from decimal import Decimal

# ...

for product in soup.select(".product-item"):
    # ...
    min_price = Decimal(price_text)
    price = min_price

    print(title, min_price, price, sep=" | ")
```
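A quick comparison of the two approaches:

```py
from decimal import Decimal

# Binary floats accumulate rounding error...
print(0.1 + 0.2)  # → 0.30000000000000004

# ...while Decimal represents decimal fractions exactly.
print(Decimal("0.1") + Decimal("0.2"))  # → 0.3
```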
196
197
197
198
If we run the code above, we have nice, clean data about all the products!
198
199
199
200
```text
$ python main.py
JBL Flip 4 Waterproof Portable Bluetooth Speaker | 74.95 | 74.95
Sony XBR-950G BRAVIA 4K HDR Ultra HD TV | 1398.00 | None
...
```