Commit 7f59f82

feat: kick off the Python course (upstream branch) (#1197)

Closes #1023. It's the same PR, but from an upstream branch, not from my fork. As discussed on Slack, this works around the unauthorized npm token on CI.

Additional changes:

- I addressed @TC-MO's comment from the previous PR and added a link to the Beautiful Soup docs.
- After the Docusaurus upgrade, I moved the code responsible for skipping front matter modifications for partials from patches to the Docusaurus config.
- I fixed two typos, as we now have typos checking on the repo.
- Although I don't think it provides a better user experience than plain HTML, I now use the new `<Details />` component (because otherwise I've been getting build errors). I filed #1199.

2 parents b04820e + 8dea575 commit 7f59f82

16 files changed: +1216 −6 lines changed

docusaurus.config.js

Lines changed: 6 additions & 6 deletions

```diff
@@ -193,12 +193,12 @@ module.exports = {
         mermaid: true,
         parseFrontMatter: async (params) => {
             const result = await params.defaultParseFrontMatter(params);
-            const ogImageURL = new URL('https://apify.com/og-image/docs-article');
-            ogImageURL.searchParams.set('title', result.frontMatter.title);
-            result.frontMatter.image ??= ogImageURL.toString();
+            const isPartial = params.filePath.split('/').pop()[0] === '_';
+            if (!isPartial) {
+                const ogImageURL = new URL('https://apify.com/og-image/docs-article');
+                ogImageURL.searchParams.set('title', result.frontMatter.title);
+                result.frontMatter.image ??= ogImageURL.toString();
+            }
            return result;
        },
    },
```
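The config change skips Open Graph image generation for partials, which Docusaurus identifies by a leading underscore in the file name (e.g. `_exercises.mdx`). As an illustrative sketch only (this Python helper is not part of the PR), the check amounts to:

```python
from pathlib import Path

def is_partial(file_path: str) -> bool:
    # Partials are files whose names start with an underscore,
    # mirroring the `isPartial` check in docusaurus.config.js.
    return Path(file_path).name.startswith("_")

print(is_partial("sources/scraping_basics_python/_exercises.mdx"))  # True
print(is_partial("sources/scraping_basics_python/index.md"))        # False
```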
Lines changed: 23 additions & 0 deletions

---
title: Inspecting web pages with browser DevTools
sidebar_label: "DevTools: Inspecting"
description: Lesson about using the browser tools for developers to inspect and manipulate the structure of an e-commerce website.
sidebar_position: 1
slug: /scraping-basics-python/devtools-inspecting
---

**In this lesson we'll use the browser tools for developers to inspect and manipulate the structure of an e-commerce website.**

---

:::danger Work in Progress

This lesson is under development. Please read [Starting with browser DevTools](../scraping_basics_javascript/data_extraction/browser_devtools.md) in the meantime so you can follow the upcoming lessons.

:::

<!--
https://developer.chrome.com/docs/devtools/
https://firefox-dev.tools/
https://developer.apple.com/documentation/safari-developer-tools/web-inspector
-->
Lines changed: 17 additions & 0 deletions

---
title: Locating HTML elements on a web page with browser DevTools
sidebar_label: "DevTools: Locating HTML elements"
description: Lesson about using the browser tools for developers to manually find products on an e-commerce website.
sidebar_position: 2
slug: /scraping-basics-python/devtools-locating-elements
---

**In this lesson we'll use the browser tools for developers to manually find products on an e-commerce website.**

---

:::danger Work in Progress

This lesson is under development. Please read [Finding elements with DevTools](../scraping_basics_javascript/data_extraction/using_devtools.md) in the meantime so you can follow the upcoming lessons.

:::
Lines changed: 17 additions & 0 deletions

---
title: Extracting data from a web page with browser DevTools
sidebar_label: "DevTools: Extracting data"
description: Lesson about using the browser tools for developers to manually extract product data from an e-commerce website.
sidebar_position: 3
slug: /scraping-basics-python/devtools-extracting-data
---

**In this lesson we'll use the browser tools for developers to manually extract product data from an e-commerce website.**

---

:::danger Work in Progress

This lesson is under development. Please read [Extracting data with DevTools](../scraping_basics_javascript/data_extraction/devtools_continued.md) in the meantime so you can follow the upcoming lessons.

:::
Lines changed: 221 additions & 0 deletions

---
title: Downloading HTML with Python
sidebar_label: Downloading HTML
description: Lesson about building a Python application for watching prices. Using the HTTPX library to download HTML code of a product listing page.
sidebar_position: 4
slug: /scraping-basics-python/downloading-html
---

import Exercises from './_exercises.mdx';
import Details from '@theme/Details';

**In this lesson we'll start building a Python application for watching prices. As a first step, we'll use the HTTPX library to download the HTML code of a product listing page.**

---

Using browser tools for developers is crucial for understanding the structure of a particular page, but it's a manual task. Let's start building our first automation: a Python program which downloads the HTML code of the product listing.

## Starting a Python project

Before we start coding, we need to set up a Python project. Create a new directory with a virtual environment. Then, inside the directory and with the environment activated, install the HTTPX library:
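If you'd like a concrete starting point, the setup could look like this on macOS or Linux (the directory name is just an example):

```shell
mkdir product-watcher
cd product-watcher
python3 -m venv .venv        # create a virtual environment in .venv
. .venv/bin/activate         # on Windows: .venv\Scripts\activate
```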
```text
$ pip install httpx
...
Successfully installed ... httpx-0.0.0
```
:::tip Installing packages

Being comfortable around Python project setup and installing packages is a prerequisite of this course, but if you wouldn't say no to a recap, we recommend the [Installing Packages](https://packaging.python.org/en/latest/tutorials/installing-packages/) tutorial from the official Python Packaging User Guide.

:::

Now let's test that everything works. Inside the project directory, create a new file called `main.py` with the following code:

```py
import httpx

print("OK")
```

Running it as a Python program will verify that your setup is okay and that you've installed HTTPX:

```text
$ python main.py
OK
```

:::info Troubleshooting

If you see errors or for any other reason cannot run the code above, it means that your environment isn't set up correctly. We're sorry, but figuring out the issue is out of scope of this course.

:::

## Downloading product listing

Now onto coding! Let's change our code so that it downloads the HTML of the product listing instead of printing `OK`. The [documentation of the HTTPX library](https://www.python-httpx.org/) provides us with examples of how to use it. Inspired by those, our code will look like this:

```py
import httpx

url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
response = httpx.get(url)
print(response.text)
```

If you run the program now, it should print the downloaded HTML:

```text
$ python main.py
<!doctype html>
<html class="no-js" lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0, height=device-height, minimum-scale=1.0, maximum-scale=1.0">
<meta name="theme-color" content="#00badb">
<meta name="robots" content="noindex">
<title>Sales</title>
...
</body>
</html>
```

By running `httpx.get(url)`, we made an HTTP request and received a response. It's not particularly useful yet, but it's a good start for our scraper.

:::tip Client and server, request and response

HTTP is a network protocol powering the internet. Understanding it well is an important foundation for successful scraping, but for this course, it's enough to know just the basic flow and terminology:

- HTTP is an exchange between two participants.
- The _client_ sends a _request_ to the _server_, which replies with a _response_.
- In our case, `main.py` is the client, and the technology running at `warehouse-theme-metal.myshopify.com` replies to our request as the server.

:::

## Handling errors

Websites can return various errors: the server may be temporarily down, may apply anti-scraping protections, or may simply be buggy. In HTTP, each response has a three-digit _status code_ that indicates whether it is an error or a success.

:::tip All status codes

If you've never worked with HTTP response status codes before, briefly scan their [full list](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status) to get at least a basic idea of what you might encounter. For further education on the topic, we recommend [HTTP Cats](https://http.cat/) as a highly professional resource.

:::

A robust scraper skips or retries requests on errors. Given the complexity of this task, it's best to use libraries or frameworks. For now, we'll at least make sure that our program visibly crashes and prints what happened in case there's an error.
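To make the idea of retrying concrete, here is a minimal sketch. This helper is our own illustration, not HTTPX functionality; production scrapers should rely on a framework or a dedicated retry library instead:

```py
import time

def fetch_with_retries(fetch, url, max_attempts=3, delay=1.0):
    """Call `fetch(url)`, retrying on any exception.

    A hypothetical helper for illustration only.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_attempts:
                raise  # give up after the last attempt
            time.sleep(delay)  # wait a bit before trying again

# Usage with a fake fetch function that fails twice, then succeeds:
attempts = []

def flaky_fetch(url):
    attempts.append(url)
    if len(attempts) < 3:
        raise RuntimeError("temporary error")
    return "<html>...</html>"

print(fetch_with_retries(flaky_fetch, "https://example.com", delay=0))
# <html>...</html>
```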
First, let's ask for trouble. We'll change the URL in our code to a page that doesn't exist, so that we get a response with [status code 404](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/404). This could happen, for example, when the product we are scraping is no longer available:

```text
https://warehouse-theme-metal.myshopify.com/does/not/exist
```

We could check the value of `response.status_code` against a list of allowed numbers, but HTTPX already provides `response.raise_for_status()`, a method that analyzes the number and raises the `httpx.HTTPError` exception if our request wasn't successful:

```py
import httpx

url = "https://warehouse-theme-metal.myshopify.com/does/not/exist"
response = httpx.get(url)
response.raise_for_status()
print(response.text)
```

If you run the code above, the program should crash:

```text
$ python main.py
Traceback (most recent call last):
  File "/Users/.../main.py", line 5, in <module>
    response.raise_for_status()
  File "/Users/.../.venv/lib/python3/site-packages/httpx/_models.py", line 761, in raise_for_status
    raise HTTPStatusError(message, request=request, response=self)
httpx.HTTPStatusError: Client error '404 Not Found' for url 'https://warehouse-theme-metal.myshopify.com/does/not/exist'
For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/404
```
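As an aside, the manual check mentioned earlier boils down to comparing the status code against the successful range. A rough sketch of our own, not something the lesson's scraper needs:

```py
def is_success(status_code: int) -> bool:
    # 2xx status codes mean success; 4xx and 5xx signal client
    # and server errors respectively.
    return 200 <= status_code < 300

print(is_success(200))  # True
print(is_success(404))  # False
```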
Letting our program visibly crash on error is enough for our purposes. Now, let's return to our primary goal. In the next lesson, we'll be looking for a way to extract information about products from the downloaded HTML.

---

<Exercises />

### Scrape Amazon

Download the HTML of a product listing page, but this time from a real-world e-commerce website. For example, this page with Amazon search results:

```text
https://www.amazon.com/s?k=darth+vader
```

<Details>
<summary>Solution</summary>

```py
import httpx

url = "https://www.amazon.com/s?k=darth+vader"
response = httpx.get(url)
response.raise_for_status()
print(response.text)
```

If you get `Server error '503 Service Unavailable'`, that's just Amazon's anti-scraping protections. You can learn about how to overcome those in our [Anti-scraping protections](../anti_scraping/index.md) course.
</Details>
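If you want to experiment before taking that course, one common first step is to send browser-like headers, which HTTPX accepts via the `headers` keyword argument of `httpx.get()`. The header values below are illustrative, and there is no guarantee this is enough for Amazon:

```py
# Browser-like headers; the exact values are illustrative.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept-Language": "en-US,en;q=0.9",
}

# With HTTPX, the headers would be attached to the request like this:
#     response = httpx.get(url, headers=headers)
print(headers["User-Agent"])
```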
### Save downloaded HTML as a file

Download the HTML, then save it on your disk as a `products.html` file. You can use the URL we've already been playing with:

```text
https://warehouse-theme-metal.myshopify.com/collections/sales
```

<Details>
<summary>Solution</summary>

Right in your Terminal or Command Prompt, you can create files by _redirecting output_ of command line programs:

```text
python main.py > products.html
```

If you want to use Python instead, it offers several ways to create files. The solution below uses [pathlib](https://docs.python.org/3/library/pathlib.html):

```py
import httpx
from pathlib import Path

url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
response = httpx.get(url)
response.raise_for_status()
Path("products.html").write_text(response.text)
```

</Details>

### Download an image as a file

Download a product image, then save it on your disk as a file. While HTML is _textual_ content, images are _binary_. You may want to scan through the [HTTPX QuickStart](https://www.python-httpx.org/quickstart/) for guidance. You can use this URL pointing to an image of a TV:

```text
https://warehouse-theme-metal.myshopify.com/cdn/shop/products/sonyxbr55front_f72cc8ff-fcd6-4141-b9cc-e1320f867785.jpg
```

<Details>
<summary>Solution</summary>

Python offers several ways to create files. The solution below uses [pathlib](https://docs.python.org/3/library/pathlib.html):

```py
from pathlib import Path
import httpx

url = "https://warehouse-theme-metal.myshopify.com/cdn/shop/products/sonyxbr55front_f72cc8ff-fcd6-4141-b9cc-e1320f867785.jpg"
response = httpx.get(url)
response.raise_for_status()
Path("tv.jpg").write_bytes(response.content)
```

</Details>
