@@ -12,61 +12,101 @@ import Exercises from './_exercises.mdx';
12
12
13
13
---
14
14
15
-
Using browser tools for developers is crucial for understanding the structure of a particular page, but it's a manual task. Let's start building our first automation, a Python program which downloads HTML code of the product listing.
15
+
Using browser tools for developers is crucial for understanding the structure of a particular page, but it's a manual task. Let's start building our first automation, a JavaScript program that downloads the HTML code of the product listing.
16
16
17
-
## Starting a Python project
17
+
## Starting a Node.js project
18
18
19
-
Before we start coding, we need to set up a Python project. Let's create new directory with a virtual environment. Inside the directory and with the environment activated, we'll install the HTTPX library:
19
+
Before we start coding, we need to set up a Node.js project. Let's create a new directory and name it `product-scraper`. Inside the directory, we'll initialize a new project:
20
20
21
21
```text
22
-
$ pip install httpx
22
+
$ npm init
23
+
This utility will walk you through creating a package.json file.
23
24
...
24
-
Successfully installed ... httpx-0.0.0
25
-
```
26
25
27
-
:::tip Installing packages
26
+
Press ^C at any time to quit.
27
+
package name: (product-scraper)
28
+
version: (1.0.0)
29
+
description: Product scraper
30
+
entry point: (index.js)
31
+
test command:
32
+
git repository:
33
+
keywords:
34
+
author:
35
+
license: (ISC)
36
+
# highlight-next-line
37
+
type: (commonjs) module
38
+
About to write to /Users/.../product-scraper/package.json:
39
+
40
+
{
41
+
"name": "product-scraper",
42
+
"version": "1.0.0",
43
+
"description": "Product scraper",
44
+
"main": "index.js",
45
+
"scripts": {
46
+
"test": "echo \"Error: no test specified\" && exit 1"
47
+
},
48
+
"author": "",
49
+
"license": "ISC",
50
+
# highlight-next-line
51
+
"type": "module"
52
+
}
53
+
```
28
54
29
-
Being comfortable around Python project setup and installing packages is a prerequisite of this course, but if you wouldn't say no to a recap, we recommend the [Installing Packages](https://packaging.python.org/en/latest/tutorials/installing-packages/) tutorial from the official Python Packaging User Guide.
55
+
The above creates a `package.json` file with the configuration of our project. While most of the values are arbitrary, it's important that the project's type is set to `module`. Now let's test that everything works. Inside the project directory we'll create a new file called `index.js` with the following code:
30
56
31
-
:::
57
+
```js
58
+
import process from 'node:process';
32
59
33
-
Now let's test that all works. Inside the project directory we'll create a new file called `main.py` with the following code:
60
+
console.log(`All is OK, ${process.argv[2]}`);
61
+
```
34
62
35
-
```py
36
-
import httpx
63
+
Running it as a Node.js program will verify that our setup is okay and we've correctly set the type to `module`. The program takes a single word as an argument and will address us with it, so let's pass it "mate", for example:
37
64
38
-
print("OK")
65
+
```text
66
+
$ node index.js mate
67
+
All is OK, mate
39
68
```
40
69
41
-
Running it as a Python program will verify that our setup is okay and we've installed HTTPX:
70
+
:::info Troubleshooting
71
+
72
+
If you see errors or are otherwise unable to run the code above, it likely means your environment isn't set up correctly. Unfortunately, diagnosing the issue is out of scope for this course.
73
+
74
+
Make sure that in your `package.json` the type property is set to `module`, otherwise you'll get the following warning:
42
75
43
76
```text
44
-
$ python main.py
45
-
OK
77
+
[MODULE_TYPELESS_PACKAGE_JSON] Warning: Module type of file:///Users/.../product-scraper/index.js is not specified and it doesn't parse as CommonJS.
78
+
Reparsing as ES module because module syntax was detected. This incurs a performance overhead.
79
+
To eliminate this warning, add "type": "module" to /Users/.../product-scraper/package.json.
46
80
```
47
81
48
-
:::info Troubleshooting
82
+
In older versions of Node.js, you may even encounter this error:
49
83
50
-
If you see errors or for any other reason cannot run the code above, it means that your environment isn't set up correctly. We're sorry, but figuring out the issue is out of scope of this course.
84
+
```text
85
+
SyntaxError: Cannot use import statement outside a module
86
+
```
51
87
52
88
:::
53
89
54
90
## Downloading product listing
55
91
56
-
Now onto coding! Let's change our code so it downloads HTML of the product listing instead of printing `OK`. The [documentation of the HTTPX library](https://www.python-httpx.org/) provides us with examples how to use it. Inspired by those, our code will look like this:
57
-
58
-
```py
59
-
import httpx
92
+
Now onto coding! Let's change our code so it downloads the HTML of the product listing instead of printing `All is OK`. The [documentation of the Fetch API](https://developer.mozilla.org/en-US/docs/Web/API/Fetch_API/Using_Fetch) provides us with examples of how to use it. Inspired by those, our code will look like this:
First time you see `await`? It's a modern syntax for working with promises. See the [JavaScript Asynchronous Programming and Callbacks](https://nodejs.org/en/learn/asynchronous-work/javascript-asynchronous-programming-and-callbacks) and [Discover Promises in Node.js](https://nodejs.org/en/learn/asynchronous-work/discover-promises-in-nodejs) tutorials in the official Node.js documentation for more.
103
+
104
+
:::
105
+
66
106
If we run the program now, it should print the downloaded HTML:
67
107
68
108
```text
69
-
$ python main.py
109
+
$ node index.js
70
110
<!doctype html>
71
111
<html class="no-js" lang="en">
72
112
<head>
@@ -80,15 +120,15 @@ $ python main.py
80
120
</html>
81
121
```
82
122
83
-
Running `httpx.get(url)`, we made a HTTP request and received a response. It's not particularly useful yet, but it's a good start of our scraper.
123
+
Running `await fetch(url)`, we made an HTTP request and received a response. It's not particularly useful yet, but it's a good start for our scraper.
84
124
85
125
:::tip Client and server, request and response
86
126
87
127
HTTP is a network protocol powering the internet. Understanding it well is an important foundation for successful scraping, but for this course, it's enough to know just the basic flow and terminology:
88
128
89
129
- HTTP is an exchange between two participants.
90
130
- The _client_ sends a _request_ to the _server_, which replies with a _response_.
91
-
- In our case, `main.py` is the client, and the technology running at `warehouse-theme-metal.myshopify.com` replies to our request as the server.
131
+
- In our case, `index.js` is the client, and the technology running at `warehouse-theme-metal.myshopify.com` replies to our request as the server.
92
132
93
133
:::
94
134
@@ -110,28 +150,30 @@ First, let's ask for trouble. We'll change the URL in our code to a page that do
We could check the value of `response.status_code` against a list of allowed numbers, but HTTPX already provides `response.raise_for_status()`, a method that analyzes the number and raises the `httpx.HTTPError` exception if our request wasn't successful:
153
+
We could check the value of `response.status` against a list of allowed numbers, but the Fetch API already provides `response.ok`, a property which returns `false` if our request wasn't successful:
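A minimal sketch of how such a check could look, assuming Node.js 18 or newer, where `fetch` and `Response` are available as globals. The helper names `ensureOk` and `downloadText` are hypothetical and aren't part of the Fetch API:

```js
// Hypothetical helper: returns the response untouched if the request
// succeeded, otherwise throws so that the program visibly crashes.
function ensureOk(response) {
  if (!response.ok) {
    throw new Error(`HTTP ${response.status}`);
  }
  return response;
}

// Hypothetical helper: downloads the given URL and returns the body as text.
async function downloadText(url) {
  const response = ensureOk(await fetch(url));
  return await response.text();
}
```

Calling `console.log(await downloadText(url))` with a URL of an existing page would print its HTML, while a URL of a page that doesn't exist would end with an error such as `Error: HTTP 404`.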
httpx.HTTPStatusError: Client error '404 Not Found' for url 'https://warehouse-theme-metal.myshopify.com/does/not/exist'
134
-
For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/404
169
+
$ node index.js
170
+
file:///Users/.../index.js:7
171
+
throw new Error(`HTTP ${response.status}`);
172
+
^
173
+
174
+
Error: HTTP 404
175
+
at file:///Users/.../index.js:7:9
176
+
at process.processTicksAndRejections (node:internal/process/task_queues:105:5)
135
177
```
136
178
137
179
Letting our program visibly crash on error is enough for our purposes. Now, let's return to our primary goal. In the next lesson, we'll be looking for a way to extract information about products from the downloaded HTML.
Right in your Terminal or Command Prompt, you can create files by _redirecting output_ of command line programs:
177
221
178
222
```text
179
-
python main.py > products.html
223
+
node index.js > products.html
180
224
```
181
225
182
-
If you want to use Python instead, it offers several ways how to create files. The solution below uses [pathlib](https://docs.python.org/3/library/pathlib.html):
226
+
If you want to use Node.js instead, it offers several ways to create files. The solution below uses the [Promises API](https://nodejs.org/api/fs.html#promises-api):
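As a rough sketch of that approach, and not necessarily the course's own solution, the `writeFile()` function from `node:fs/promises` can save a string to disk. The helper name `saveHtml` is hypothetical:

```js
import { writeFile } from 'node:fs/promises';

// Hypothetical helper: saves a string of HTML to the given path as UTF-8 text.
async function saveHtml(path, html) {
  await writeFile(path, html, 'utf-8');
}
```

With the HTML of the product listing already downloaded into a variable, `await saveHtml('products.html', html)` would create the same file as the redirection above.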
Download a product image, then save it on your disk as a file. While HTML is _textual_ content, images are _binary_. You may want to scan through the [HTTPX QuickStart](https://www.python-httpx.org/quickstart/)for guidance. You can use this URL pointing to an image of a TV:
246
+
Download a product image, then save it on your disk as a file. While HTML is _textual_ content, images are _binary_. You may want to scan through the [Fetch API documentation](https://developer.mozilla.org/en-US/docs/Web/API/Fetch_API/Using_Fetch#reading_the_response_body) and the [Writing files with Node.js](https://nodejs.org/en/learn/manipulating-files/writing-files-with-nodejs) tutorial for guidance. Especially check `Response.arrayBuffer()`. You can use this URL pointing to an image of a TV:
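One possible direction, sketched under the assumption that `fetch` and `Response` are available as globals (Node.js 18 or newer): read the body with `Response.arrayBuffer()` and hand the bytes to `writeFile()`. The helper name `saveBinary` is hypothetical:

```js
import { writeFile } from 'node:fs/promises';

// Hypothetical helper: saves the body of a Fetch API response as a binary file.
async function saveBinary(path, response) {
  // arrayBuffer() reads the whole body as raw bytes, without decoding it as text
  const arrayBuffer = await response.arrayBuffer();
  // writeFile() accepts typed arrays, so we wrap the bytes in a Uint8Array
  await writeFile(path, new Uint8Array(arrayBuffer));
}
```

Used as `await saveBinary('tv.jpg', await fetch(url))`, where `tv.jpg` is an arbitrary file name and `url` points to the image, this would store the downloaded bytes unchanged.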