Commit 7b63fe7

fix: update downloading to be about JS
1 parent acb3750 commit 7b63fe7

2 files changed

+116
-68
lines changed

sources/academy/webscraping/scraping_basics_javascript2/04_downloading_html.md

Lines changed: 102 additions & 68 deletions
@@ -12,61 +12,83 @@ import Exercises from './_exercises.mdx';
1212

1313
---
1414

15-
Using browser tools for developers is crucial for understanding the structure of a particular page, but it's a manual task. Let's start building our first automation, a Python program which downloads HTML code of the product listing.
15+
Using browser tools for developers is crucial for understanding the structure of a particular page, but it's a manual task. Let's start building our first automation, a JavaScript program which downloads HTML code of the product listing.
1616

17-
## Starting a Python project
17+
## Starting a Node.js project
1818

19-
Before we start coding, we need to set up a Python project. Let's create new directory with a virtual environment. Inside the directory and with the environment activated, we'll install the HTTPX library:
19+
Before we start coding, we need to set up a Node.js project. Let's create a new directory and name it `product-scraper`. Inside the directory, we'll initialize a new project:
2020

2121
```text
22-
$ pip install httpx
22+
$ npm init
23+
This utility will walk you through creating a package.json file.
2324
...
24-
Successfully installed ... httpx-0.0.0
25-
```
26-
27-
:::tip Installing packages
28-
29-
Being comfortable around Python project setup and installing packages is a prerequisite of this course, but if you wouldn't say no to a recap, we recommend the [Installing Packages](https://packaging.python.org/en/latest/tutorials/installing-packages/) tutorial from the official Python Packaging User Guide.
3025
31-
:::
26+
Press ^C at any time to quit.
27+
package name: (product-scraper)
28+
version: (1.0.0)
29+
description: Product scraper
30+
entry point: (index.js)
31+
test command:
32+
git repository:
33+
keywords:
34+
author:
35+
license: (ISC)
36+
# highlight-next-line
37+
type: (commonjs) module
38+
About to write to /Users/.../product-scraper/package.json:
39+
40+
{
41+
  "name": "product-scraper",
42+
  "version": "1.0.0",
43+
  "description": "Product scraper",
44+
  "main": "index.js",
45+
  "scripts": {
46+
    "test": "echo \"Error: no test specified\" && exit 1"
47+
  },
48+
  "author": "",
49+
  "license": "ISC",
50+
# highlight-next-line
51+
  "type": "module"
52+
}
53+
```
3254

33-
Now let's test that all works. Inside the project directory we'll create a new file called `main.py` with the following code:
55+
The above creates a `package.json` file with the configuration of our project. While most of the values are arbitrary, it's important that the project's type is set to `module`. Now let's test that everything works. Inside the project directory, we'll create a new file called `index.js` with the following code:
3456

35-
```py
36-
import httpx
57+
```js
58+
import process from 'node:process';
3759

38-
print("OK")
60+
console.log(`All is OK, ${process.argv[2]}`);
3961
```
4062

41-
Running it as a Python program will verify that our setup is okay and we've installed HTTPX:
63+
Running it as a Node.js program will verify that our setup is okay and we've correctly set the type to `module`. The program takes a single word as an argument and will address us with it, so let's pass it "mate", for example:
4264

4365
```text
44-
$ python main.py
45-
OK
66+
$ node index.js mate
67+
All is OK, mate
4668
```
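For reference, `process.argv` is an array whose first two items are the path to the Node.js executable and the path to the script, so the first word we pass on the command line ends up at index 2. Below is a minimal sketch of that layout. The `greet` function, the sample paths, and the `stranger` fallback are hypothetical and serve only as an illustration:

```js
// Hypothetical helper, not part of the lesson's index.js.
// The argv parameter mirrors process.argv: [node path, script path, ...arguments].
function greet(argv) {
  const name = argv[2] ?? 'stranger'; // fallback used when no argument is passed
  return `All is OK, ${name}`;
}

// Simulates running `node index.js mate`
console.log(greet(['/usr/bin/node', '/path/to/index.js', 'mate']));
```

In the real program we'd call `greet(process.argv)` instead of passing a hand-made array.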
4769

4870
:::info Troubleshooting
4971

50-
If you see errors or for any other reason cannot run the code above, it means that your environment isn't set up correctly. We're sorry, but figuring out the issue is out of scope of this course.
72+
If you see `SyntaxError: Cannot use import statement outside a module`, double check that in your `package.json` the type property is set to `module`.
73+
74+
If you see other errors or for any other reason cannot run the code above, it means that your environment isn't set up correctly. We're sorry, but figuring out the issue is out of scope of this course.
5175

5276
:::
5377

5478
## Downloading product listing
5579

56-
Now onto coding! Let's change our code so it downloads HTML of the product listing instead of printing `OK`. The [documentation of the HTTPX library](https://www.python-httpx.org/) provides us with examples how to use it. Inspired by those, our code will look like this:
57-
58-
```py
59-
import httpx
80+
Now onto coding! Let's change our code so it downloads HTML of the product listing instead of printing `All is OK`. The [documentation of the Fetch API](https://developer.mozilla.org/en-US/docs/Web/API/Fetch_API/Using_Fetch) provides us with examples of how to use it. Inspired by those, our code will look like this:
6081

61-
url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
62-
response = httpx.get(url)
63-
print(response.text)
82+
```js
83+
const url = "https://warehouse-theme-metal.myshopify.com/collections/sales";
84+
const response = await fetch(url);
85+
console.log(await response.text());
6486
```
6587

6688
If we run the program now, it should print the downloaded HTML:
6789

6890
```text
69-
$ python main.py
91+
$ node index.js
7092
<!doctype html>
7193
<html class="no-js" lang="en">
7294
<head>
@@ -80,15 +102,15 @@ $ python main.py
80102
</html>
81103
```
82104

83-
Running `httpx.get(url)`, we made a HTTP request and received a response. It's not particularly useful yet, but it's a good start of our scraper.
105+
By running `await fetch(url)`, we made an HTTP request and received a response. It's not particularly useful yet, but it's a good start for our scraper.
84106

85107
:::tip Client and server, request and response
86108

87109
HTTP is a network protocol powering the internet. Understanding it well is an important foundation for successful scraping, but for this course, it's enough to know just the basic flow and terminology:
88110

89111
- HTTP is an exchange between two participants.
90112
- The _client_ sends a _request_ to the _server_, which replies with a _response_.
91-
- In our case, `main.py` is the client, and the technology running at `warehouse-theme-metal.myshopify.com` replies to our request as the server.
113+
- In our case, `index.js` is the client, and the technology running at `warehouse-theme-metal.myshopify.com` replies to our request as the server.
92114

93115
:::
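
To make the terminology more tangible, here is a minimal sketch of what a response object carries besides the body. The `summarizeResponse` function is a hypothetical helper, and the sample `Response` is constructed by hand only so that the example runs without network access. With `fetch()`, we'd get the same kind of object back from the server:

```js
// Hypothetical helper for illustration: picks a few basic facts
// from a Fetch API Response object.
function summarizeResponse(response) {
  return {
    status: response.status, // numeric HTTP status code, e.g. 200 or 404
    ok: response.ok, // true for status codes in the 200-299 range
    contentType: response.headers.get('content-type'),
  };
}

// Hand-made response standing in for what a server would send back.
const sample = new Response('<!doctype html>', {
  status: 200,
  headers: { 'content-type': 'text/html; charset=utf-8' },
});
console.log(summarizeResponse(sample));
```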
94116

@@ -110,28 +132,30 @@ First, let's ask for trouble. We'll change the URL in our code to a page that do
110132
https://warehouse-theme-metal.myshopify.com/does/not/exist
111133
```
112134

113-
We could check the value of `response.status_code` against a list of allowed numbers, but HTTPX already provides `response.raise_for_status()`, a method that analyzes the number and raises the `httpx.HTTPError` exception if our request wasn't successful:
135+
We could check the value of `response.status` against a list of allowed numbers, but the Fetch API already provides `response.ok`, a property which returns `false` if our request wasn't successful:
114136

115-
```py
116-
import httpx
137+
```js
138+
const url = "https://warehouse-theme-metal.myshopify.com/does/not/exist";
139+
const response = await fetch(url);
117140

118-
url = "https://warehouse-theme-metal.myshopify.com/does/not/exist"
119-
response = httpx.get(url)
120-
response.raise_for_status()
121-
print(response.text)
141+
if (response.ok) {
142+
  console.log(await response.text());
143+
} else {
144+
  throw new Error(`HTTP ${response.status}`);
145+
}
122146
```
123147

124148
If you run the code above, the program should crash:
125149

126150
```text
127-
$ python main.py
128-
Traceback (most recent call last):
129-
File "/Users/.../main.py", line 5, in <module>
130-
response.raise_for_status()
131-
File "/Users/.../.venv/lib/python3/site-packages/httpx/_models.py", line 761, in raise_for_status
132-
raise HTTPStatusError(message, request=request, response=self)
133-
httpx.HTTPStatusError: Client error '404 Not Found' for url 'https://warehouse-theme-metal.myshopify.com/does/not/exist'
134-
For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/404
151+
$ node index.js
152+
file:///Users/.../index.js:7
153+
  throw new Error(`HTTP ${response.status}`);
154+
        ^
155+
156+
Error: HTTP 404
157+
    at file:///Users/.../index.js:7:9
158+
    at process.processTicksAndRejections (node:internal/process/task_queues:105:5)
135159
```
136160

137161
Letting our program visibly crash on error is enough for our purposes. Now, let's return to our primary goal. In the next lesson, we'll be looking for a way to extract information about products from the downloaded HTML.
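
If the same check is needed in more than one place, it can be wrapped in a small function. This is only a sketch: `ensureOk` is a hypothetical name, not something the Fetch API provides.

```js
// Hypothetical helper: returns the response unchanged if the request
// succeeded, otherwise throws so that the program crashes visibly.
function ensureOk(response) {
  if (!response.ok) {
    throw new Error(`HTTP ${response.status}`);
  }
  return response;
}
```

In `index.js`, it could then be used as `const response = ensureOk(await fetch(url));`.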
@@ -151,13 +175,15 @@ https://www.aliexpress.com/w/wholesale-darth-vader.html
151175
<details>
152176
<summary>Solution</summary>
153177

154-
```py
155-
import httpx
178+
```js
179+
const url = "https://www.aliexpress.com/w/wholesale-darth-vader.html";
180+
const response = await fetch(url);
156181

157-
url = "https://www.aliexpress.com/w/wholesale-darth-vader.html"
158-
response = httpx.get(url)
159-
response.raise_for_status()
160-
print(response.text)
182+
if (response.ok) {
183+
  console.log(await response.text());
184+
} else {
185+
  throw new Error(`HTTP ${response.status}`);
186+
}
161187
```
162188

163189
</details>
@@ -176,26 +202,30 @@ https://warehouse-theme-metal.myshopify.com/collections/sales
176202
Right in your Terminal or Command Prompt, you can create files by _redirecting output_ of command line programs:
177203

178204
```text
179-
python main.py > products.html
205+
node index.js > products.html
180206
```
181207

182-
If you want to use Python instead, it offers several ways how to create files. The solution below uses [pathlib](https://docs.python.org/3/library/pathlib.html):
208+
If you want to use Node.js instead, it offers several ways to create files. The solution below uses the [Promises API](https://nodejs.org/api/fs.html#promises-api):
183209

184-
```py
185-
import httpx
186-
from pathlib import Path
210+
```js
211+
import { writeFile } from 'node:fs/promises';
187212

188-
url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
189-
response = httpx.get(url)
190-
response.raise_for_status()
191-
Path("products.html").write_text(response.text)
213+
const url = "https://warehouse-theme-metal.myshopify.com/collections/sales";
214+
const response = await fetch(url);
215+
216+
if (response.ok) {
217+
  const html = await response.text();
218+
  await writeFile('products.html', html);
219+
} else {
220+
  throw new Error(`HTTP ${response.status}`);
221+
}
192222
```
193223

194224
</details>
195225

196226
### Download an image as a file
197227

198-
Download a product image, then save it on your disk as a file. While HTML is _textual_ content, images are _binary_. You may want to scan through the [HTTPX QuickStart](https://www.python-httpx.org/quickstart/) for guidance. You can use this URL pointing to an image of a TV:
228+
Download a product image, then save it on your disk as a file. While HTML is _textual_ content, images are _binary_. You may want to scan through the [Fetch API documentation](https://developer.mozilla.org/en-US/docs/Web/API/Fetch_API/Using_Fetch#reading_the_response_body) for guidance. In particular, check `Response.arrayBuffer()`. You can use this URL pointing to an image of a TV:
199229

200230
```text
201231
https://warehouse-theme-metal.myshopify.com/cdn/shop/products/sonyxbr55front_f72cc8ff-fcd6-4141-b9cc-e1320f867785.jpg
@@ -204,16 +234,20 @@ https://warehouse-theme-metal.myshopify.com/cdn/shop/products/sonyxbr55front_f72
204234
<details>
205235
<summary>Solution</summary>
206236

207-
Python offers several ways how to create files. The solution below uses [pathlib](https://docs.python.org/3/library/pathlib.html):
237+
Node.js offers several ways to create files. The solution below uses the [Promises API](https://nodejs.org/api/fs.html#promises-api):
238+
239+
```js
240+
import { writeFile } from 'node:fs/promises';
208241

209-
```py
210-
from pathlib import Path
211-
import httpx
242+
const url = "https://warehouse-theme-metal.myshopify.com/cdn/shop/products/sonyxbr55front_f72cc8ff-fcd6-4141-b9cc-e1320f867785.jpg";
243+
const response = await fetch(url);
212244

213-
url = "https://warehouse-theme-metal.myshopify.com/cdn/shop/products/sonyxbr55front_f72cc8ff-fcd6-4141-b9cc-e1320f867785.jpg"
214-
response = httpx.get(url)
215-
response.raise_for_status()
216-
Path("tv.jpg").write_bytes(response.content)
245+
if (response.ok) {
246+
  const buffer = Buffer.from(await response.arrayBuffer());
247+
  await writeFile('tv.jpg', buffer);
248+
} else {
249+
  throw new Error(`HTTP ${response.status}`);
250+
}
217251
```
218252

219253
</details>
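
The conversion step in the solution deserves a closer look: `response.arrayBuffer()` resolves to a plain `ArrayBuffer`, and `Buffer.from()` wraps it in a Node.js `Buffer`, which `writeFile()` accepts. Below is a sketch of that step as a reusable function. The name `readBinaryBody` is hypothetical:

```js
// Hypothetical helper: reads the body of a successful response as raw bytes.
async function readBinaryBody(response) {
  if (!response.ok) {
    throw new Error(`HTTP ${response.status}`);
  }
  // arrayBuffer() keeps the bytes untouched, unlike text(), which decodes them.
  return Buffer.from(await response.arrayBuffer());
}
```

With such a function, saving the image could shrink to `await writeFile('tv.jpg', await readBinaryBody(await fetch(url)));`.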

sources/academy/webscraping/scraping_basics_javascript2/05_parsing_html.md

Lines changed: 14 additions & 0 deletions
@@ -38,6 +38,20 @@ $ pip install beautifulsoup4
3838
Successfully installed beautifulsoup4-4.0.0 soupsieve-0.0
3939
```
4040

41+
<!--
42+
:::tip Installing packages
43+
44+
Being comfortable around Python project setup and installing packages is a prerequisite of this course, but if you wouldn't say no to a recap, we recommend the [Installing Packages](https://packaging.python.org/en/latest/tutorials/installing-packages/) tutorial from the official Python Packaging User Guide.
45+
46+
:::
47+
48+
:::info Troubleshooting
49+
50+
If you see other errors or for any other reason cannot run the code above, it means that your environment isn't set up correctly. We're sorry, but figuring out the issue is out of scope of this course.
51+
52+
:::
53+
-->
54+
4155
Now let's use it for parsing the HTML. The `BeautifulSoup` object allows us to work with the HTML elements in a structured way. As a demonstration, we'll first get the `<h1>` element, which represents the main heading of the page.
4256

4357
![Element of the main heading](./images/h1.png)
