:::
## Handling errors
Sometimes websites return all kinds of errors, most often because:
- The server is temporarily down.
- The server breaks under a heavy load of requests.
- The server applies anti-scraping protections.
- The server application is buggy and just couldn't handle our request.
In HTTP, each response has a three-digit _status code_, which tells us whether it's an error or a success. A robust scraper skips or retries requests on errors. That's a big task, though, so it's best left to libraries or frameworks.
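To illustrate the retry idea, here's a minimal, library-agnostic sketch. The function name and parameters are our own invention, and real projects should reach for a dedicated library or framework instead:

```py
import time

def fetch_with_retries(fetch, attempts=3, backoff=1.0):
    """Call fetch() until it succeeds, waiting longer after each failure."""
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts, let the error propagate
            time.sleep(backoff * 2 ** attempt)  # exponential backoff
```

Here `fetch` would be any function that raises on error, such as a wrapper around `httpx.get()` followed by `response.raise_for_status()`.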
:::tip All status codes
If you've never worked with HTTP response status codes before, briefly scan their [full list](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status) to get at least a basic idea of what you might encounter. For further education on the topic, we recommend [HTTP Cats](https://http.cat/) as a highly professional resource.
:::
For now, we'll at least make sure that our program visibly crashes and prints what happened in case there's an error.
First, let's ask for trouble. We'll change the URL in our code to a page that doesn't exist, so that we get a response with [status code 404](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/404). This could happen, for example, when the product we are scraping is no longer available:
We could check the value of `response.status_code` against a list of allowed numbers, but HTTPX also provides `response.raise_for_status()`, a method that analyzes the number and raises the `httpx.HTTPError` exception if our request wasn't successful:
```text
httpx.HTTPStatusError: Client error '404 Not Found' for url 'https://warehouse-theme-metal.myshopify.com/does/not/exist'
For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/404
```
Letting our program visibly crash on error is enough for our purposes. Now, let's return to our primary goal. In the next lesson, we'll be looking for a way to extract information about products from the downloaded HTML.
If you get `Server error '503 Service Unavailable'`, that's just Amazon's anti-scraping protections. You can learn about how to overcome those in our [Anti-scraping protections](../anti_scraping/index.md) course.
If you want to use Python instead, it offers several ways to create files. The solution below uses [pathlib](https://docs.python.org/3/library/pathlib.html):
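For instance, creating a file with `pathlib` could look like this sketch. The filename and initial content here are placeholders, not taken from this text:

```py
from pathlib import Path

# placeholder filename; adjust to whatever the surrounding steps expect
path = Path("products.json")
path.write_text("[]", encoding="utf-8")  # creates the file with initial content
```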
To keep the code examples concise, we're omitting error handling for now. Keeping `response.raise_for_status()` ensures that your program at least visibly crashes and prints what happened in case there's an error.
:::
Our scraper prints 24, which is in line with the text **Showing 1–24 of 50 products** above the product listing. Phew, figuring this out was quite tedious!