
Commit c878f30

feat: streamline the section about handling errors
1 parent 96b47e9 commit c878f30

2 files changed: +27 -71 lines changed

sources/academy/webscraping/scraping_basics_python/04_downloading_html.md

Lines changed: 27 additions & 65 deletions
@@ -60,7 +60,6 @@ import httpx
 
 url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
 response = httpx.get(url)
-
 print(response.text)
 ```
 

@@ -93,75 +92,56 @@ HTTP is a network protocol powering the internet. Understanding it well is an im
 
 :::
 
-## Checking status codes
+## Handling errors
 
-Sometimes websites return all kinds of errors. Most often because:
+Sometimes websites return all kinds of errors, most often because:
 
 - The server is temporarily down.
 - The server breaks under a heavy load of requests.
 - The server applies anti-scraping protections.
 - The server application is buggy and just couldn't handle our request.
 
-In HTTP, each response has a three-digit _status code_, which tells us whether it's an error or success. Let's change the last line of our program to print the code of the response we get:
-
-```py
-print(response.status_code)
-```
-
-If we run the program, it should print number 200, which means the server understood our request and was happy to respond with what we asked for:
-
-```text
-$ python main.py
-200
-```
-
-Good! Now let's fix our code so that it can handle a situation when the server doesn't return 200.
+In HTTP, each response has a three-digit _status code_, which tells us whether it's an error or a success. A robust scraper skips or retries requests on errors. It's a big task though, and it's best to use libraries or frameworks for that.
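As an aside on the added paragraph above: the lesson defers retrying to libraries and frameworks, but the underlying idea is small. The sketch below is our own illustration, not part of the course material; `flaky_fetch` is a hypothetical stand-in for a network call such as `httpx.get()`:

```python
import time

def get_with_retries(fetch, max_attempts=3, delay=0.1):
    """Call fetch() until it succeeds or the attempts run out."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts:
                raise  # give up; let the caller handle the error
            time.sleep(delay)  # back off a little before retrying

# Demo with a stand-in for a network call that fails twice, then succeeds:
attempts = []

def flaky_fetch():
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("temporary server error")
    return "<html>...</html>"

print(get_with_retries(flaky_fetch))  # the third attempt succeeds
```

Real-world retry logic also caps total time and only retries errors that are likely transient, which is why the course points to dedicated libraries.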
 
 :::tip All status codes
 
-If you're curious, sneak a peek at the list of all [HTTP response status codes](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status). There's plenty of them and they're categorized according to their first digit. If you're even more curious, we recommend browsing the [HTTP Cats](https://http.cat/) as a highly professional resource on the topic.
+If you've never worked with HTTP response status codes before, briefly scan their [full list](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status) to get at least a basic idea of what you might encounter. For further education on the topic, we recommend [HTTP Cats](https://http.cat/) as a highly professional resource.
 
 :::
 
-## Handling errors
+For now, we'll at least make sure that our program visibly crashes and prints what happened in case there's an error.
 
-It's time to ask for trouble! Let's change the URL in our code to a page which doesn't exist, so that we get a response with [status code 404](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/404). This could happen, for example, when the product we are scraping is no longer available:
+First, let's ask for trouble. We'll change the URL in our code to a page that doesn't exist, so that we get a response with [status code 404](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/404). This could happen, for example, when the product we are scraping is no longer available:
 
 ```text
 https://warehouse-theme-metal.myshopify.com/does/not/exist
 ```
 
-We could check the value of `response.status_code` against a list of allowed numbers, but HTTPX also provides `response.raise_for_status()`, a method which analyzes the number and raises the `httpx.HTTPError` exception in case our request wasn't successful.
-
-A robust scraper skips or retries requests when errors occur, but let's start simple. Our program will print an error message and stop further processing of the response.
-
-
-We also want to play along with the conventions of the operating system, so we'll print to the [standard error output](https://en.wikipedia.org/wiki/Standard_streams#Standard_error_(stderr)) and exit our program with a non-zero [status code](https://en.wikipedia.org/wiki/Exit_status):
+We could check the value of `response.status_code` against a list of allowed numbers, but HTTPX also provides `response.raise_for_status()`, a method that analyzes the number and raises the `httpx.HTTPError` exception if our request wasn't successful:
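Conceptually, such a check boils down to a range test on the status code: codes 400-599 are client or server errors. The snippet below is our simplified sketch of that idea, not HTTPX's actual implementation; the `HTTPError` class and `check_status` function here are hypothetical names:

```python
class HTTPError(Exception):
    """Stand-in for an HTTP error exception in this sketch."""

def check_status(status_code):
    # 4xx = client errors, 5xx = server errors; everything else passes.
    if 400 <= status_code < 600:
        raise HTTPError(f"Error response {status_code}")

check_status(200)  # success, returns quietly
try:
    check_status(404)
except HTTPError as error:
    print(error)  # prints: Error response 404
```

The real `raise_for_status()` does more, such as attaching the request and response objects to the exception, but the decision itself is this simple.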
 
 ```py
-import sys
 import httpx
 
 url = "https://warehouse-theme-metal.myshopify.com/does/not/exist"
-try:
-    response = httpx.get(url)
-    response.raise_for_status()
-    print(response.text)
-
-except httpx.HTTPError as error:
-    print(error, file=sys.stderr)
-    sys.exit(1)
+response = httpx.get(url)
+response.raise_for_status()
+print(response.text)
 ```
 
-If you run the code above, you should see a nice error message:
+If you run the code above, the program should crash:
 
 ```text
 $ python main.py
-Client error '404 Not Found' for url 'https://warehouse-theme-metal.myshopify.com/does/not/exist'
+Traceback (most recent call last):
+  File "/Users/.../main.py", line 5, in <module>
+    response.raise_for_status()
+  File "/Users/.../.venv/lib/python3/site-packages/httpx/_models.py", line 761, in raise_for_status
+    raise HTTPStatusError(message, request=request, response=self)
+httpx.HTTPStatusError: Client error '404 Not Found' for url 'https://warehouse-theme-metal.myshopify.com/does/not/exist'
 For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/404
 ```
 
-Done! We have managed to apply basic error handling. Now let's get back to our primary goal. In the next lesson, we'll be looking for a way to extract information about products from the downloaded HTML.
+Letting our program visibly crash on error is enough for our purposes. Now, let's return to our primary goal. In the next lesson, we'll be looking for a way to extract information about products from the downloaded HTML.
 
 ---
 
@@ -179,18 +159,12 @@ https://www.amazon.com/s?k=darth+vader
 <summary>Solution</summary>
 
 ```py
-import sys
 import httpx
 
 url = "https://www.amazon.com/s?k=darth+vader"
-try:
-    response = httpx.get(url)
-    response.raise_for_status()
-    print(response.text)
-
-except httpx.HTTPError as error:
-    print(error, file=sys.stderr)
-    sys.exit(1)
+response = httpx.get(url)
+response.raise_for_status()
+print(response.text)
 ```
 
 If you get `Server error '503 Service Unavailable'`, that's just Amazon's anti-scraping protections. You can learn about how to overcome those in our [Anti-scraping protections](../anti_scraping/index.md) course.
@@ -216,19 +190,13 @@ https://warehouse-theme-metal.myshopify.com/collections/sales
 If you want to use Python instead, it offers several ways how to create files. The solution below uses [pathlib](https://docs.python.org/3/library/pathlib.html):
 
 ```py
-import sys
 import httpx
 from pathlib import Path
 
 url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
-try:
-    response = httpx.get(url)
-    response.raise_for_status()
-    Path("products.html").write_text(response.text)
-
-except httpx.HTTPError as error:
-    print(error, file=sys.stderr)
-    sys.exit(1)
+response = httpx.get(url)
+response.raise_for_status()
+Path("products.html").write_text(response.text)
 ```
 
 </details>
@@ -248,18 +216,12 @@ https://warehouse-theme-metal.myshopify.com/cdn/shop/products/sonyxbr55front_f72
 
 ```py
 from pathlib import Path
-import sys
 import httpx
 
 url = "https://warehouse-theme-metal.myshopify.com/cdn/shop/products/sonyxbr55front_f72cc8ff-fcd6-4141-b9cc-e1320f867785.jpg"
-try:
-    response = httpx.get(url)
-    response.raise_for_status()
-    Path("tv.jpg").write_bytes(response.content)
-except httpx.HTTPError as e:
-    print(e, file=sys.stderr)
-    sys.exit(1)
-
+response = httpx.get(url)
+response.raise_for_status()
+Path("tv.jpg").write_bytes(response.content)
 ```
 
 </details>
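The two solutions above differ in `write_text` versus `write_bytes`: HTML is text, while an image is raw binary data. A small illustration of the distinction (our own example; the file names and byte values are made up):

```python
from pathlib import Path
import tempfile

# write_text() takes a str and handles encoding for us;
# write_bytes() takes raw bytes, which is what image data is.
folder = Path(tempfile.mkdtemp())

(folder / "page.html").write_text("<html>Sales</html>", encoding="utf-8")
(folder / "tiny.jpg").write_bytes(b"\xff\xd8\xff")  # JPEG files start with these bytes

print((folder / "page.html").read_text(encoding="utf-8"))  # prints: <html>Sales</html>
```

Passing `response.text` to `write_bytes` (or `response.content` to `write_text`) would raise a `TypeError`, which is why each solution picks the matching pair.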

sources/academy/webscraping/scraping_basics_python/05_parsing_html.md

Lines changed: 0 additions & 6 deletions
@@ -52,12 +52,6 @@ count = html_code.count('<div class="product-item ')
 print(count)
 ```
 
-:::info Handling errors
-
-To have the code examples more concise, we're omitting error handling for now. Keeping `response.raise_for_status()` ensures that your program at least visibly crashes and prints what happened in case there's an error.
-
-:::
-
 Our scraper prints 24, which is in line with the text **Showing 1–24 of 50 products** above the product listing. Phew, figuring this out was quite tedious!
 
 ```text
