And that's it! It's not particularly useful yet, but it's a good start for our scraper.
## About HTTP
Running `httpx.get(url)`, we made our first HTTP request and received our first response. HTTP is a network protocol powering most of the internet. Understanding it well is an important foundation for successful scraping, but for now it's enough to know the basic flow and terminology.
HTTP is an exchange of two participants. The _client_ sends a _request_ to the _server_, which replies with a _response_. In our case, `main.py` is the client, and the technology running at `warehouse-theme-metal.myshopify.com` replies to our request as the server.
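To make the exchange concrete, here's a heavily trimmed sketch of what such a request and response can look like on the wire. The exact path and headers are illustrative, not copied from a real exchange:

```text
GET /collections/sales HTTP/1.1
Host: warehouse-theme-metal.myshopify.com

HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8

<!doctype html>
...
```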
<!-- TODO image basic HTTP chart -->
:::tip Deep dive into HTTP
The HTTP protocol is defined by several documents called RFCs, such as [RFC 7230: HTTP Message Syntax and Routing](https://www.rfc-editor.org/rfc/rfc7230) or [RFC 7231: HTTP Semantics and Content](https://www.rfc-editor.org/rfc/rfc7231). While these technical specifications are surprisingly digestible, you may also like [HTTP tutorials by MDN](https://developer.mozilla.org/en-US/docs/Web/HTTP).
:::
## Checking status codes
Sometimes websites return all kinds of errors, most often because:
- The server is temporarily down.
- The server breaks under a heavy load of requests.
- The server applies anti-scraping protections.
- The server application is buggy and just couldn't handle our request.
In HTTP, each response has a three-digit _status code_, which tells us whether it's an error or success. Let's change the last line of our program to print the code of the response we get:
Good! Now let's fix our code so that it can handle a situation when the server doesn't return 200.
:::tip

If you're curious, sneak a peek at the list of all [HTTP response status codes](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status). There's plenty of them, and they're categorized according to their first digit. If you're even more curious, we recommend browsing [HTTP Cats](https://http.cat/) as a highly professional resource on the topic.

:::
## Handling errors
It's time to ask for trouble! Let's change the URL in our code to a page which doesn't exist, so that we get a response with [status code 404](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/404):
We could check the value of `response.status_code` against a list of allowed numbers, but HTTPX also provides `response.raise_for_status()`, a method which analyzes the number and raises the `httpx.HTTPError` exception in case our request wasn't successful.
A robust scraper skips or retries requests when errors occur, but we'll start simple. Our program will print an error message and stop further processing of the response.
We also want to play along with the conventions of the operating system, so let's print to the [standard error output](https://en.wikipedia.org/wiki/Standard_streams#Standard_error_(stderr)) and exit our program with non-zero [status code](https://en.wikipedia.org/wiki/Exit_status):
If you run the code above, you should see a nice error message:
```text
$ python main.py
Client error '404 Not Found' for url 'https://warehouse-theme-metal.myshopify.com/does/not/exist'
For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/404
```
Done! We have managed to apply basic error handling. Now let's get back to our primary goal. In the next lesson, we'll be looking for a way to extract information about products from the downloaded HTML.
## Exercises
These challenges should help you verify that you can apply knowledge acquired in this lesson. Resist the temptation to look at the solutions right away. Learn by doing, not by copying and pasting!
### Scrape Amazon
Download the HTML of a product listing page, but this time from a real-world e-commerce website. For example, this page with Amazon search results:
```text
https://www.amazon.com/s?k=darth+vader
```
<details>
<summary>Solution</summary>

If you want your program to play well with the conventions of the operating system, you can print errors to the so-called _standard error output_ and exit your program with a non-zero status code:

```python
import sys

if response.status_code != 200:
    print(f"Failed to fetch {url}: ERROR {response.status_code}", file=sys.stderr)
    sys.exit(1)
```

</details>

### Download product image

Download a product image, then save it on your disk as a file. While HTML is _textual_ content, images are _binary_. You may want to scan through the [HTTPX QuickStart](https://www.python-httpx.org/quickstart/) for guidance. You can use this URL pointing to an image of a TV:

<details>
<summary>Solution</summary>

If you want to use Python, it offers several ways to create files. The solution below uses [pathlib](https://docs.python.org/3/library/pathlib.html):

</details>
:::caution

This course is incomplete. As we work on adding new lessons, we would love to hear your feedback. You can comment right here under each page or [file a GitHub Issue](https://github.com/apify/apify-docs/issues) to discuss a problem.

This lesson contains just a fraction of what it should contain. In the end, the current content might get rewritten. Everything on this page is subject to change!

:::
## Treating HTML as a string
Yay! The entire HTML is now available in our program as a string. For now, we are just printing it to the screen, but once it's a string, we can manipulate it using any Python string operations.
Let's try counting how many products are in the listing. Manually inspecting the page in the browser's developer tools, we can see that the HTML code of each product has roughly the following structure:
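As a simplified, illustrative sketch of that structure — only the `product-item` and `product-item__info` class names below come from this lesson; everything else stands in for the real markup:

```html
<div class="product-item product-item--vertical">
  <a href="..." class="product-item__image-wrapper">
    <!-- product image -->
  </a>
  <div class="product-item__info">
    <!-- product title, price, and other details -->
  </div>
</div>
```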
At first sight, counting `product-item` occurrences wouldn't match only products. Let's try looking for `<div class="product-item`, a substring which represents the entire beginning of each product tag. Because the substring contains a double quote character, we need to use single quotes as string boundaries.
Unfortunately, this doesn't seem to be sufficient. Running the code above prints 123, which is a suspiciously high number. It seems there are more div elements with class names starting with `product-item`.
On closer look at the HTML, our substring also matches tags like `<div class="product-item__info">`. What if we force our code to count only those with a space after the class name?
Now our program prints number 24, which is in line with the text **Showing 1–24 of 50 products** above the product listing.
<!-- TODO image -->
Oof, that was tedious! While we eventually succeeded, it's clear that processing HTML with [standard string methods](https://docs.python.org/3/library/stdtypes.html#string-methods) is difficult and fragile. Imagine we weren't just counting products, but trying to extract their titles and prices.
In fact, HTML can be so complex that even [regular expressions](https://docs.python.org/3/library/re.html) can't process it reliably. In the next lesson, we'll meet a tool dedicated to the task: an HTML parser.