sources/academy/webscraping/scraping_basics_python/04_downloading_html.md
7 additions & 11 deletions
@@ -53,7 +53,7 @@ If you see errors or for any other reason cannot run the code above, it means th
 
 ## Downloading product listing
 
-Now onto coding! Let's change our code so it downloads HTML of the product listing instead of printing OK. The [documentation of the HTTPX library](https://www.python-httpx.org/) provides us with examples how to use it. Inspired by those, our code will look like this:
+Now onto coding! Let's change our code so it downloads HTML of the product listing instead of printing `OK`. The [documentation of the HTTPX library](https://www.python-httpx.org/) provides us with examples how to use it. Inspired by those, our code will look like this:
 
 ```py
 import httpx
@@ -81,19 +81,15 @@ $ python main.py
 </html>
 ```
 
-And that's it! It's not particularly useful yet, but it's a good start of our scraper.
+Running `httpx.get(url)`, we made a HTTP request and received a response. It's not particularly useful yet, but it's a good start of our scraper.
 
-## About HTTP
+:::tip Client and server, request and response
 
-Running `httpx.get(url)`, we made our first HTTP request and received our first response. HTTP is a network protocol powering most of the internet. Understanding it well is an important foundation for successful scraping, but for now it's enough to know the basic flow and terminology.
+HTTP is a network protocol powering the internet. Understanding it well is an important foundation for successful scraping, but for this course, it's enough to know just the basic flow and terminology:
 
-HTTP is an exchange of two participants. The _client_ sends a _request_ to the _server_, which replies with a _response_. In our case, `main.py` is the client, and the technology running at `warehouse-theme-metal.myshopify.com` replies to our request as the server.
-
-<!-- TODO image basic HTTP chart -->
-
-:::tip Deep dive to HTTP
-
-The HTTP protocol is defined by several documents called RFCs, such as [RFC 7230: HTTP Message Syntax and Routing](https://www.rfc-editor.org/rfc/rfc7230) or [RFC 7231: HTTP Semantics and Content](https://www.rfc-editor.org/rfc/rfc7231). While these technical specifications are surprisingly digestible, you may also like [HTTP tutorials by MDN](https://developer.mozilla.org/en-US/docs/Web/HTTP).
+- HTTP is an exchange between two participants.
+- The _client_ sends a _request_ to the _server_, which replies with a _response_.
+- In our case, `main.py` is the client, and the technology running at `warehouse-theme-metal.myshopify.com` replies to our request as the server.
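
The client and server exchange described in the tip can be sketched in a few lines. This is an illustration only and not part of the diff or the lesson: it assumes Python's standard library rather than HTTPX, runs a throwaway server on the local machine so that no network access is needed, and the names `HelloHandler` and `fetch_from_local_server` are hypothetical, chosen just for this example.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Server side (hypothetical): answers every GET request with a tiny HTML page."""

    def do_GET(self):
        body = b"<html><body>Hello</body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence the default per-request log lines


def fetch_from_local_server():
    """Client side (hypothetical): sends one request and returns the response."""
    # Port 0 asks the operating system to pick any free port.
    server = HTTPServer(("127.0.0.1", 0), HelloHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        port = server.server_address[1]
        url = f"http://127.0.0.1:{port}/"
        # An empty ProxyHandler makes the request ignore any proxy settings.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        with opener.open(url, timeout=5) as response:
            return response.status, response.read().decode("utf-8")
    finally:
        server.shutdown()
        server.server_close()


status, html = fetch_from_local_server()
print(status)
print(html)
```

Here the script plays both roles: `HelloHandler` is the server that builds the response, and the code inside `fetch_from_local_server` is the client that sends the request. In the lesson itself, `main.py` is only the client, and the server is run by someone else.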