|
1 | 1 | ---
|
2 | 2 | title: Downloading HTML with Python
|
3 | 3 | sidebar_label: Downloading HTML
|
4 |
| -description: TODO |
| 4 | +description: Lesson about building a Python application for watching prices and using the HTTPX library to download HTML code of a product listing page. |
5 | 5 | sidebar_position: 4
|
6 | 6 | slug: /scraping-basics-python/downloading-html
|
7 | 7 | ---
|
8 | 8 |
|
9 |
| -:::danger Work in progress |
| 9 | +**In this lesson we'll start building a Python application for watching prices. As a first step, we'll use the HTTPX library to download the HTML code of a product listing page.** |
10 | 10 |
|
11 |
| -This course is incomplete. As we work on adding new lessons, we would love to hear your feedback. You can comment right here under each page or [file a GitHub Issue](https://github.com/apify/apify-docs/issues) to discuss a problem. |
| 11 | +--- |
| 12 | + |
| 13 | +Using browser developer tools is crucial for understanding the structure of a particular page, but it's a manual task. Now let's start building our first automation, a Python program that downloads the HTML code of the product listing. |
| 14 | + |
| 15 | +## Starting a Python project |
| 16 | + |
| 17 | +Before we start coding, we need to set up a Python project. Create a new directory with a virtual environment, then, inside the directory and with the environment activated, install the HTTPX library: |
| 18 | + |
| 19 | +```text |
| 20 | +$ pip install httpx |
| 21 | +... |
| 22 | +Successfully installed ... httpx-0.0.0 |
| 23 | +``` |
| 24 | + |
| 25 | +:::tip Installing packages |
| 26 | + |
| 27 | +Being comfortable around Python project setup and installing packages is a prerequisite of this course, but if you wouldn't say no to a recap, we recommend the [Installing Packages](https://packaging.python.org/en/latest/tutorials/installing-packages/) tutorial from the official Python Packaging User Guide. |
| 28 | + |
| 29 | +::: |
| 30 | + |
| 31 | +Now let's test that everything works. In the project directory, create a new file called `main.py` with the following code: |
| 32 | + |
| 33 | +```python |
| 34 | +import httpx |
| 35 | + |
| 36 | +print("OK") |
| 37 | +``` |
| 38 | + |
| 39 | +Running it as a Python program will verify that your setup is okay and you've installed HTTPX: |
| 40 | + |
| 41 | +```text |
| 42 | +$ python main.py |
| 43 | +OK |
| 44 | +``` |
| 45 | + |
| 46 | +:::info Troubleshooting |
| 47 | + |
| 48 | +If you see errors or cannot run the code above for any other reason, we're sorry, but figuring out the issue is outside the scope of this course. |
12 | 49 |
|
13 | 50 | :::
|
14 | 51 |
|
| 52 | +## Downloading product listing |
| 53 | + |
| 54 | +Now onto coding! Let's change our code so it downloads the HTML of the product listing instead of printing OK. The [documentation of the HTTPX library](https://www.python-httpx.org/) provides examples of how to use it. Inspired by those, our code will look like this: |
| 55 | + |
| 56 | +```python |
| 57 | +import httpx |
| 58 | + |
| 59 | +url = "https://warehouse-theme-metal.myshopify.com/collections/sales" |
| 60 | +response = httpx.get(url) |
| 61 | + |
| 62 | +print(response.text) |
| 63 | +``` |
| 64 | + |
| 65 | +If you run the program now, it should print the downloaded HTML: |
| 66 | + |
| 67 | +```text |
| 68 | +$ python main.py |
| 69 | +<!doctype html> |
| 70 | +<html class="no-js" lang="en"> |
| 71 | + <head> |
| 72 | + <meta charset="utf-8"> |
| 73 | + <meta name="viewport" content="width=device-width, initial-scale=1.0, height=device-height, minimum-scale=1.0, maximum-scale=1.0"> |
| 74 | + <meta name="theme-color" content="#00badb"> |
| 75 | + <meta name="robots" content="noindex"> |
| 76 | + <title>Sales</title> |
| 77 | + ... |
| 78 | + </body> |
| 79 | +</html> |
| 80 | +``` |
| 81 | + |
| 82 | +Yay! The entire HTML is now available in our program as a string. For now, we are just printing it to the screen, but once it's a string, we can manipulate it using any Python string operations. |
| 83 | + |
| 84 | +## Treating HTML as a string |
| 85 | + |
| 86 | +Let's try counting how many products are in the listing. Manually inspecting the page in browser developer tools, we can see that the HTML code of each product has roughly the following structure: |
| 87 | + |
| 88 | +```html |
| 89 | +<div class="product-item product-item--vertical ..."> |
| 90 | + <a href="/products/..." class="product-item__image-wrapper"> |
| 91 | + ... |
| 92 | + </a> |
| 93 | + <div class="product-item__info"> |
| 94 | + ... |
| 95 | + </div> |
| 96 | +</div> |
| 97 | +``` |
| 98 | + |
| 99 | +At first sight, counting `product-item` occurrences wouldn't match only products, because the same text also appears inside other class names such as `product-item__info`. Let's try looking for `<div class="product-item`, a substring which represents the entire beginning of each product tag. Because the substring contains a double quote character, we need to use single quotes as string boundaries. |
| 100 | + |
| 101 | +```python |
| 102 | +import httpx |
| 103 | + |
| 104 | +url = "https://warehouse-theme-metal.myshopify.com/collections/sales" |
| 105 | +response = httpx.get(url) |
| 106 | + |
| 107 | +html_code = response.text |
| 108 | +count = html_code.count('<div class="product-item') |
| 109 | +print(count) |
| 110 | +``` |
| 111 | + |
| 112 | +Unfortunately, this doesn't seem to be sufficient. Running the code above prints 123, which is a suspiciously high number. It seems there are more div elements with class names starting with `product-item`. |
| 113 | + |
| 114 | +On closer look at the whole HTML, our substring also matches tags like `<div class="product-item__info">`. What if we force our code to count only those with a space after the class name? |
| 115 | + |
| 116 | +```python |
| 117 | +count = html_code.count('<div class="product-item ') |
| 118 | +``` |
| 119 | + |
| 120 | +Now our program prints the number 24, which is in line with the text _Showing 1 - 24 of 50 products_ above the product listing. Oof, that was tedious! While successful, we can see that processing HTML with [standard string methods](https://docs.python.org/3/library/stdtypes.html#string-methods) is difficult and fragile. |
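
To make the fragility concrete, here is a minimal sketch that runs the same counts on a small hand-written snippet instead of the real page. The `sample_html` and `changed_html` strings are made up for illustration and only mimic the structure shown earlier, so the numbers differ from the real listing.

```python
# A made-up snippet mimicking the listing's structure (not the real page)
sample_html = """
<div class="product-item product-item--vertical">
  <div class="product-item__info">First product</div>
</div>
<div class="product-item product-item--vertical">
  <div class="product-item__info">Second product</div>
</div>
"""

# Without the trailing space, the nested info elements are counted too
loose_count = sample_html.count('<div class="product-item')

# With the trailing space, only the outer product elements match
strict_count = sample_html.count('<div class="product-item ')

# The stricter substring stops matching as soon as the markup changes,
# for example when an element has product-item as its only class
changed_html = '<div class="product-item">Third product</div>'
missed_count = changed_html.count('<div class="product-item ')

print(loose_count, strict_count, missed_count)
```

The third count shows the catch: a perfectly valid product element goes uncounted just because its `class` attribute is written slightly differently.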
| 121 | + |
| 122 | +In fact, HTML can be so complex that even [regular expressions](https://docs.python.org/3/library/re.html) aren't able to process it reliably. In the next lesson, we'll meet a tool dedicated to the task, an HTML parser. |
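
As a rough illustration of that claim, the sketch below tries one hypothetical regular expression against a few equivalent ways of writing a product tag. Both the pattern and the `variants` list are assumptions made up for this example, not something taken from the real page.

```python
import re

# Hypothetical pattern: an opening div whose class attribute starts with
# product-item, followed by a space or the closing double quote
pattern = re.compile(r'<div class="product-item[ "]')

variants = [
    # Written the same way as on the page
    '<div class="product-item product-item--vertical">',
    # Single quotes around the attribute value, still valid HTML
    "<div class='product-item product-item--vertical'>",
    # Another attribute placed before class, still valid HTML
    '<div id="p1" class="product-item">',
    # A nested element which is not a product
    '<div class="product-item__info">',
]

# True where the pattern finds a match in the given tag
matches = [bool(pattern.search(tag)) for tag in variants]
print(matches)
```

The pattern correctly skips the nested element, but it also misses the second and third variants, even though a browser would treat them as the same kind of product element.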
| 123 | + |
15 | 124 | ## Exercises
|
16 | 125 |
|
17 | 126 | - One
|
|