Commit 85a25da

feat: small edits and adding one exercise
1 parent 618db4a commit 85a25da

1 file changed

+42
-2
lines changed

sources/academy/webscraping/scraping_basics_python/04_downloading_html.md

Lines changed: 42 additions & 2 deletions
@@ -117,12 +117,52 @@ On closer look at the HTML, our substring matches also tags like `<div class="pr
117117
count = html_code.count('<div class="product-item ')
118118
```
119119

120-
Now our program prints number 24, which is in line with the text _Showing 1 - 24 of 50 products_ above the product listing. Oof, that was tedious! While successful, we can see that processing HTML with [standard string methods](https://docs.python.org/3/library/stdtypes.html#string-methods) is difficult and fragile.
120+
Now our program prints number 24, which is in line with the text **Showing 1–24 of 50 products** above the product listing.
121+
122+
<!-- TODO image -->
123+
124+
Oof, that was tedious! While successful, we can see that processing HTML with [standard string methods](https://docs.python.org/3/library/stdtypes.html#string-methods) is difficult and fragile. Imagine we weren't just counting, but trying to extract titles and prices.
121125

122126
In fact, HTML can be so complex that even [regular expressions](https://docs.python.org/3/library/re.html) aren't able to process it reliably. In the next lesson we'll meet a tool dedicated to the task, an HTML parser.
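To see why substring counting is fragile, here is a minimal sketch. The three HTML snippets are made up for illustration and are not taken from the real page; they describe the same product card with trivially different markup, yet only the first one matches the substring used in the lesson.

```python
# Three hypothetical snippets describing the same product card.
html_a = '<div class="product-item product-item--vertical">Speaker</div>'
# An extra attribute placed before class breaks the match.
html_b = '<div id="p1" class="product-item product-item--vertical">Speaker</div>'
# Single quotes instead of double quotes break it as well.
html_c = "<div class='product-item product-item--vertical'>Speaker</div>"

needle = '<div class="product-item '


def count_products(html_code: str) -> int:
    """Count occurrences of the exact substring, as the lesson does."""
    return html_code.count(needle)


counts = [count_products(html) for html in (html_a, html_b, html_c)]
print(counts)  # [1, 0, 0]
```

All three snippets are equivalent to a browser, which is why a tool that understands HTML structure is a better fit than exact text matching.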
123127

124128
## Exercises
125129

126-
- One
130+
### Handle errors
131+
132+
Sometimes websites return all kinds of strange errors, most often because they're temporarily down, or because they employ anti-scraping protections. Change the URL in your code to the following:
133+
134+
```text
135+
https://warehouse-theme-metal.myshopify.com/does/not/exist
136+
```
137+
138+
The page doesn't exist, which means the response will be [error 404](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/404). Explore the [HTTPX documentation](https://www.python-httpx.org/) to find out how to adjust your code to handle such an error. In case of an error response, your program should print an error message to the user and stop further processing of the response.
139+
140+
<details>
141+
<summary>Solution</summary>
142+
143+
```python
144+
import sys
145+
import httpx
146+
147+
url = "https://warehouse-theme-metal.myshopify.com/does/not/exist"
148+
response = httpx.get(url)
149+
150+
if response.status_code != 200:
151+
    print(f"Failed to fetch {url}: ERROR {response.status_code}")
152+
else:
153+
    html_code = response.text
154+
    count = html_code.count('<div class="product-item ')
155+
    print(count)
156+
```
157+
158+
If you want your program to play well with the conventions of the operating system, you can print errors to the so-called _standard error output_ and exit your program with a non-zero status code:
159+
160+
```python
161+
if response.status_code != 200:
162+
    print(f"Failed to fetch {url}: ERROR {response.status_code}", file=sys.stderr)
163+
    sys.exit(1)
164+
```
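
The exit-code convention can also be tried without making a live request. The following is only a sketch: `report_status()` is a hypothetical helper, not part of HTTPX or of the lesson. It takes a plain status code, so in a real program you would pass it `response.status_code` and hand the returned value to `sys.exit()`.

```python
import io
import sys


def report_status(url: str, status_code: int, stream=sys.stderr) -> int:
    """Hypothetical helper: print an error if needed and return an exit code."""
    if status_code != 200:
        print(f"Failed to fetch {url}: ERROR {status_code}", file=stream)
        return 1  # non-zero exit code signals failure to the operating system
    return 0  # zero means success


# For demonstration, capture the message in memory instead of the real stderr.
buffer = io.StringIO()
url = "https://warehouse-theme-metal.myshopify.com/does/not/exist"
exit_code = report_status(url, 404, stream=buffer)
print(exit_code, buffer.getvalue().strip())
```

Keeping the decision in a small function like this makes the error path easy to check on its own, while the actual download stays a single `httpx.get()` call as in the solution above.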
165+
</details>
166+
127167
- Two
128168
- Three
