
Commit 5856f4f

feat: rewrite and complete the lesson about downloading

1 parent 85a25da

File tree

3 files changed: +190 −47 lines

sources/academy/webscraping/scraping_basics_python/04_downloading_html.md

Lines changed: 141 additions & 46 deletions
@@ -79,63 +79,103 @@ $ python main.py
 </html>
 ```

-Yay! The entire HTML is now available in our program as a string. For now, we are just printing it to the screen, but once it's a string, we can manipulate it using any Python string operations.
+And that's it! It's not particularly useful yet, but it's a good start for our scraper.

-## Treating HTML as a string
+## About HTTP

-Let's try counting how many products are in the listing. Manually inspecting the page in browser developer tools, we can see that the HTML code of each product has roughly the following structure:
+Running `httpx.get(url)`, we made our first HTTP request and received our first response. HTTP is a network protocol powering most of the internet. Understanding it well is an important foundation for successful scraping, but for now it's enough to know the basic flow and terminology.

-```html
-<div class="product-item product-item--vertical ...">
-  <a href="/products/..." class="product-item__image-wrapper">
-    ...
-  </a>
-  <div class="product-item__info">
-    ...
-  </div>
-</div>
-```
+HTTP is an exchange between two participants. The _client_ sends a _request_ to the _server_, which replies with a _response_. In our case, `main.py` is the client, and the technology running at `warehouse-theme-metal.myshopify.com` replies to our request as the server.
+
+<!-- TODO image basic HTTP chart -->
+
+:::tip Deep dive into HTTP
+
+The HTTP protocol is defined by several documents called RFCs, such as [RFC 7230: HTTP Message Syntax and Routing](https://www.rfc-editor.org/rfc/rfc7230) or [RFC 7231: HTTP Semantics and Content](https://www.rfc-editor.org/rfc/rfc7231). While these technical specifications are surprisingly digestible, you may also like the [HTTP tutorials by MDN](https://developer.mozilla.org/en-US/docs/Web/HTTP).
+
+:::
+
+## Checking status codes
+
+Sometimes websites return all kinds of errors. Most often because:
+
+- The server is temporarily down.
+- The server breaks under a heavy load of requests.
+- The server applies anti-scraping protections.
+- The server application is buggy and just couldn't handle our request.

-At first sight, counting `product-item` occurrences wouldn't match only products. Let's try looking for `<div class="product-item`, a substring which represents the entire beginning of each product tag. Because the substring contains a double quote character, we need to use single quotes as string boundaries.
+In HTTP, each response has a three-digit _status code_, which tells us whether it's an error or a success. Let's change the last line of our program to print the code of the response we get:

 ```python
-import httpx
+print(response.status_code)
+```

-url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
-response = httpx.get(url)
+If we run the program, it should print the number 200, which means the server understood our request and was happy to respond with what we asked for:

-html_code = response.text
-count = html_code.count('<div class="product-item')
-print(count)
+```text
+$ python main.py
+200
 ```

-Unfortunately, this doesn't seem to be sufficient. Running the code above prints 123, which is a suspiciously high number. It seems there are more div elements with class names starting with `product-item`.
+Good! Now let's fix our code so that it can handle a situation when the server doesn't return 200.

-On closer look at the HTML, our substring also matches tags like `<div class="product-item__info">`. What if we force our code to count only those with a space after the class name?
+:::tip All status codes

-```python
-count = html_code.count('<div class="product-item ')
+If you're curious, sneak a peek at the list of all [HTTP response status codes](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status). There are plenty of them, and they're categorized according to their first digit. If you're even more curious, we recommend browsing the [HTTP Cats](https://http.cat/) as a highly professional resource on the topic.
+
+:::
+
+## Handling errors
+
+It's time to ask for trouble! Let's change the URL in our code to a page which doesn't exist, so that we get a response with [status code 404](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/404):
+
+```text
+https://warehouse-theme-metal.myshopify.com/does/not/exist
 ```

-Now our program prints the number 24, which is in line with the text **Showing 1–24 of 50 products** above the product listing.
+We could check the value of `response.status_code` against a list of allowed numbers, but HTTPX also provides `response.raise_for_status()`, a method which analyzes the number and raises the `httpx.HTTPError` exception in case our request wasn't successful.

-<!-- TODO image -->
+A robust scraper skips or retries requests when errors occur, but we'll start simple. Our program will print an error message and stop further processing of the response.

-Oof, that was tedious! While successful, we can see that processing HTML with [standard string methods](https://docs.python.org/3/library/stdtypes.html#string-methods) is difficult and fragile. Imagine we weren't just counting, but trying to get titles and prices.

-In fact, HTML can be so complex that even [regular expressions](https://docs.python.org/3/library/re.html) aren't able to process it reliably. In the next lesson we'll meet a tool dedicated to the task, an HTML parser.
+We also want to play along with the conventions of the operating system, so let's print to the [standard error output](https://en.wikipedia.org/wiki/Standard_streams#Standard_error_(stderr)) and exit our program with a non-zero [status code](https://en.wikipedia.org/wiki/Exit_status):

-## Exercises
+```python
+import sys
+import httpx

-### Handle errors
+url = "https://warehouse-theme-metal.myshopify.com/does/not/exist"
+try:
+    response = httpx.get(url)
+    response.raise_for_status()
+    print(response.text)
+
+except httpx.HTTPError as error:
+    print(error, file=sys.stderr)
+    sys.exit(1)
+```

-Sometimes websites return all kinds of strange errors, most often because they're temporarily down, or because they employ anti-scraping protections. Change the URL in your code to the following:
+If you run the code above, you should see a nice error message:

 ```text
-https://example.com/does/not/exist
+$ python main.py
+Client error '404 Not Found' for url 'https://warehouse-theme-metal.myshopify.com/does/not/exist'
+For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/404
 ```

-The page doesn't exist, which means the response will be [error 404](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/404). Explore the [HTTPX documentation](https://www.python-httpx.org/) on how to adjust your code to handle such an error. In case of an error response, your program should print an error message to the user and stop further processing of the response.
+Done! We have managed to apply basic error handling. Now let's get back to our primary goal. In the next lesson, we'll be looking for a way to extract information about products from the downloaded HTML.
+
+## Exercises
+
+These challenges should help you verify that you can apply knowledge acquired in this lesson. Resist the temptation to look at the solutions right away. Learn by doing, not by copying and pasting!
+
+### Scrape Amazon
+
+Download the HTML of a product listing page, but this time from a real-world e-commerce website. For example this page with Amazon search results:
+
+```text
+https://www.amazon.com/s?k=darth+vader
+```

 <details>
 <summary>Solution</summary>
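Editor's aside: the tip in this hunk mentions that status codes are categorized according to their first digit. As a quick illustration of that rule (the `status_category` helper is made up for this note, not part of the lesson), a minimal sketch:

```python
def status_category(code):
    """Map a three-digit HTTP status code to its category, by first digit."""
    categories = {
        1: "informational",
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    # Integer division by 100 extracts the first digit of a three-digit code
    return categories.get(code // 100, "unknown")

print(status_category(200))  # success
print(status_category(404))  # client error
```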
@@ -144,25 +184,80 @@ The page doesn't exist, which means the response will be [error 404](https://dev
 import sys
 import httpx

-url = "https://warehouse-theme-metal.myshopify.com/does/not/exist"
-response = httpx.get(url)
+url = "https://www.amazon.com/s?k=darth+vader"
+try:
+    response = httpx.get(url)
+    response.raise_for_status()
+    print(response.text)
+
+except httpx.HTTPError as error:
+    print(error, file=sys.stderr)
+    sys.exit(1)
+```
+</details>
+
+### Save downloaded HTML as a file
+
+Download HTML, then save it on your disk as a `products.html` file. You can use the URL we've already been playing with:
+
+```text
+https://warehouse-theme-metal.myshopify.com/collections/sales
+```
+
+<details>
+<summary>Solution</summary>
+
+Right in your Terminal or Command Prompt, you can create files by _redirecting output_ of command line programs:

-if response.status_code != 200:
-    print(f"Failed to fetch {url}: ERROR {response.status_code}")
-else:
-    html_code = response.text
-    count = html_code.count('<div class="product-item ')
-    print(count)
+```text
+$ python main.py > products.html
 ```

-If you want your program to play well with the conventions of the operating system, you can print errors to the so-called _standard error output_ and exit your program with a non-zero status code:
+If you want to use Python, it offers several ways to create files. The solution below uses [pathlib](https://docs.python.org/3/library/pathlib.html):

 ```python
-if response.status_code != 200:
-    print(f"Failed to fetch {url}: ERROR {response.status_code}", file=sys.stderr)
+import sys
+import httpx
+from pathlib import Path
+
+url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
+try:
+    response = httpx.get(url)
+    response.raise_for_status()
+    Path("products.html").write_text(response.text)
+
+except httpx.HTTPError as error:
+    print(error, file=sys.stderr)
     sys.exit(1)
 ```
 </details>

-- Two
-- Three
+### Download an image as a file
+
+Download a product image, then save it on your disk as a file. While HTML is _textual_ content, images are _binary_. You may want to scan through the [HTTPX QuickStart](https://www.python-httpx.org/quickstart/) for guidance. You can use this URL pointing to an image of a TV:
+
+```text
+https://warehouse-theme-metal.myshopify.com/cdn/shop/products/sonyxbr55front_f72cc8ff-fcd6-4141-b9cc-e1320f867785.jpg
+```
+
+<details>
+<summary>Solution</summary>
+
+Python offers several ways to create files. The solution below uses [pathlib](https://docs.python.org/3/library/pathlib.html):
+
+```python
+from pathlib import Path
+import sys
+import httpx
+
+url = "https://warehouse-theme-metal.myshopify.com/cdn/shop/products/sonyxbr55front_f72cc8ff-fcd6-4141-b9cc-e1320f867785.jpg"
+try:
+    response = httpx.get(url)
+    response.raise_for_status()
+    Path("tv.jpg").write_bytes(response.content)
+except httpx.HTTPError as e:
+    print(e, file=sys.stderr)
+    sys.exit(1)
+```
+</details>
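Editor's aside: the lesson notes that a robust scraper retries requests when errors occur, but leaves retrying for later. As a minimal sketch of that idea (the `fetch_with_retries` helper and its parameters are illustrative, not part of the lesson; it deliberately takes any fetch function so it isn't tied to HTTPX):

```python
import time

def fetch_with_retries(fetch, url, attempts=3, delay=0.1):
    """Call fetch(url), retrying with exponential backoff on any exception."""
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception as error:
            last_error = error
            if attempt < attempts - 1:
                # Wait longer after each failed attempt: delay, 2*delay, 4*delay, ...
                time.sleep(delay * 2 ** attempt)
    raise last_error

# With HTTPX installed, this could hypothetically be used as:
# response = fetch_with_retries(lambda url: httpx.get(url), url)
```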

sources/academy/webscraping/scraping_basics_python/05_parsing_html.md

Lines changed: 48 additions & 0 deletions
@@ -10,8 +10,56 @@ slug: /scraping-basics-python/parsing-html

 This course is incomplete. As we work on adding new lessons, we would love to hear your feedback. You can comment right here under each page or [file a GitHub Issue](https://github.com/apify/apify-docs/issues) to discuss a problem.

+This lesson contains just a fraction of what it should contain. In the end, the current content might get rewritten. Everything on this page is subject to change!
+
 :::

+## Treating HTML as a string
+
+Yay! The entire HTML is now available in our program as a string. For now, we are just printing it to the screen, but once it's a string, we can manipulate it using any Python string operations.
+
+Let's try counting how many products are in the listing. Manually inspecting the page in browser developer tools, we can see that the HTML code of each product has roughly the following structure:
+
+```html
+<div class="product-item product-item--vertical ...">
+  <a href="/products/..." class="product-item__image-wrapper">
+    ...
+  </a>
+  <div class="product-item__info">
+    ...
+  </div>
+</div>
+```
+
+At first sight, counting `product-item` occurrences wouldn't match only products. Let's try looking for `<div class="product-item`, a substring which represents the entire beginning of each product tag. Because the substring contains a double quote character, we need to use single quotes as string boundaries.
+
+```python
+import httpx
+
+url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
+response = httpx.get(url)
+
+html_code = response.text
+count = html_code.count('<div class="product-item')
+print(count)
+```
+
+Unfortunately, this doesn't seem to be sufficient. Running the code above prints 123, which is a suspiciously high number. It seems there are more div elements with class names starting with `product-item`.
+
+On closer look at the HTML, our substring also matches tags like `<div class="product-item__info">`. What if we force our code to count only those with a space after the class name?
+
+```python
+count = html_code.count('<div class="product-item ')
+```
+
+Now our program prints the number 24, which is in line with the text **Showing 1–24 of 50 products** above the product listing.
+
+<!-- TODO image -->
+
+Oof, that was tedious! While successful, we can see that processing HTML with [standard string methods](https://docs.python.org/3/library/stdtypes.html#string-methods) is difficult and fragile. Imagine we weren't just counting, but trying to get titles and prices.
+
+In fact, HTML can be so complex that even [regular expressions](https://docs.python.org/3/library/re.html) aren't able to process it reliably. In the next lesson we'll meet a tool dedicated to the task, an HTML parser.
+
 ## Exercises

 - One
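Editor's aside: the hunk above shows why counting substrings is fragile, and teases an HTML parser as the fix. As a preview of what a parser buys us, here is a sketch using Python's built-in `html.parser` module (the `ProductCounter` class and the sample HTML are made up for illustration; the lesson itself will likely use a dedicated parsing library):

```python
from html.parser import HTMLParser

class ProductCounter(HTMLParser):
    """Count div tags whose class list contains exactly the word 'product-item'."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; split the class attribute
        # into individual class names instead of matching raw substrings
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and "product-item" in classes:
            self.count += 1

# Made-up sample resembling the listing's structure
html_code = """
<div class="product-item product-item--vertical">
  <a href="/products/tv" class="product-item__image-wrapper"></a>
  <div class="product-item__info"></div>
</div>
"""

counter = ProductCounter()
counter.feed(html_code)
print(counter.count)  # 1 — the inner product-item__info div doesn't match
```

Unlike the substring search, the parser sees class names as a list, so `product-item__info` can never be confused with `product-item`.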

sources/academy/webscraping/scraping_basics_python/index.md

Lines changed: 1 addition & 1 deletion

@@ -40,7 +40,7 @@ Anyone with basic knowledge of developing programs in Python who wants to start
 ## Requirements

 - macOS, Linux or Windows machine with a web browser and Python installed
-- Familiar with Python basics: variables, conditions, loops, functions, strings, lists, dictionaries, files, classes
+- Familiar with Python basics: variables, conditions, loops, functions, strings, lists, dictionaries, files, classes, exceptions
 - Comfortable importing from the Python standard library, using virtual environments, and installing dependencies with `pip`
 - Running commands in Terminal or Command Prompt
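Editor's aside: the requirements now list exceptions among the assumed Python basics, since the rewritten lesson relies on `try`/`except`. As a quick refresher of the kind of knowledge assumed (the `parse_price` helper is made up for illustration):

```python
def parse_price(text):
    """Return the price as a float, or None when the text isn't a number."""
    try:
        return float(text.replace("$", ""))
    except ValueError:
        # Raised when the remaining text can't be converted to a number
        return None

print(parse_price("$74.95"))  # 74.95
print(parse_price("Sold out"))  # None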
