Commit 7184e71

feat: add lesson descriptions and write the downloading HTML lesson
1 parent 8943a58 commit 7184e71

5 files changed: 141 additions & 14 deletions

sources/academy/webscraping/scraping_basics_python/01_devtools_inspecting.md

Lines changed: 5 additions & 1 deletion

@@ -1,11 +1,15 @@
---
title: Inspecting web pages with browser DevTools
sidebar_label: "DevTools: Inspecting"
-description: TODO
+description: Lesson about using the browser tools for developers to inspect and manipulate the structure of an e-commerce website.
sidebar_position: 1
slug: /scraping-basics-python/devtools-inspecting
---

**In this lesson we'll use the browser tools for developers to inspect and manipulate the structure of an e-commerce website.**

---

:::danger Work in progress

This lesson doesn't exist yet, but it's going to be similar to [Starting with browser DevTools](../scraping_basics_javascript/data_extraction/browser_devtools.md).

sources/academy/webscraping/scraping_basics_python/02_devtools_locating_elements.md

Lines changed: 5 additions & 1 deletion

@@ -1,11 +1,15 @@
---
title: Locating HTML elements on a web page with browser DevTools
sidebar_label: "DevTools: Locating HTML elements"
-description: TODO
+description: Lesson about using the browser tools for developers to manually find products on an e-commerce website.
sidebar_position: 2
slug: /scraping-basics-python/devtools-locating-elements
---

**In this lesson we'll use the browser tools for developers to manually find products on an e-commerce website.**

---

:::danger Work in progress

This lesson doesn't exist yet, but it's going to be similar to [Finding elements with DevTools](../scraping_basics_javascript/data_extraction/using_devtools.md).

sources/academy/webscraping/scraping_basics_python/03_devtools_extracting_data.md

Lines changed: 5 additions & 1 deletion

@@ -1,11 +1,15 @@
---
title: Extracting data from a web page with browser DevTools
sidebar_label: "DevTools: Extracting data"
-description: TODO
+description: Lesson about using the browser tools for developers to manually extract product data from an e-commerce website.
sidebar_position: 3
slug: /scraping-basics-python/devtools-extracting-data
---

**In this lesson we'll use the browser tools for developers to manually extract product data from an e-commerce website.**

---

:::danger Work in progress

This lesson doesn't exist yet, but it's going to be similar to [Extracting data with DevTools](../scraping_basics_javascript/data_extraction/devtools_continued.md).

sources/academy/webscraping/scraping_basics_python/04_downloading_html.md

Lines changed: 112 additions & 3 deletions

@@ -1,17 +1,126 @@
---
title: Downloading HTML with Python
sidebar_label: Downloading HTML
-description: TODO
+description: Lesson about building a Python application for watching prices and using the HTTPX library to download the HTML code of a product listing page.
sidebar_position: 4
slug: /scraping-basics-python/downloading-html
---

-:::danger Work in progress
+**In this lesson we'll start building a Python application for watching prices. As a first step, we'll use the HTTPX library to download the HTML code of a product listing page.**

-This course is incomplete. As we work on adding new lessons, we would love to hear your feedback. You can comment right here under each page or [file a GitHub Issue](https://github.com/apify/apify-docs/issues) to discuss a problem.
+---

Using browser tools for developers is crucial for understanding the structure of a particular page, but it's a manual task. Now let's start building our first automation, a Python program that downloads the HTML code of the product listing.

## Starting a Python project

Before we start coding, we need to set up a Python project. Create a new directory with a virtual environment, then, inside the directory and with the environment activated, install the HTTPX library:

```text
$ pip install httpx
...
Successfully installed ... httpx-0.0.0
```

:::tip Installing packages

Being comfortable around Python project setup and installing packages is a prerequisite of this course, but if you wouldn't say no to a recap, we recommend the [Installing Packages](https://packaging.python.org/en/latest/tutorials/installing-packages/) tutorial from the official Python Packaging User Guide.

:::
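
For reference, on macOS or Linux such a setup might look roughly as follows; the directory name is only an example, and the commands differ slightly on Windows:

```text
$ mkdir product-scraper  # any directory name works, this one is just an example
$ cd product-scraper
$ python3 -m venv .venv
$ source .venv/bin/activate
(.venv) $ pip install httpx
```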

Now let's test that everything works. In the project directory, create a new file called `main.py` with the following code:

```python
import httpx

print("OK")
```

Running it as a Python program will verify that your setup is okay and that you've installed HTTPX:

```text
$ python main.py
OK
```
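
If you'd like an extra sanity check that the import resolves to the package you just installed, printing its version is a quick option; the exact number depends on the release pip picked:

```python
import httpx

# prints the version of the installed HTTPX package, e.g. 0.27.0 (yours may differ)
print(httpx.__version__)
```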

:::info Troubleshooting

If you see errors or cannot run the code above for any other reason, we're sorry, but figuring out the issue is out of scope of this course.

:::

## Downloading product listing

Now onto coding! Let's change our code so that it downloads the HTML of the product listing instead of printing OK. The [documentation of the HTTPX library](https://www.python-httpx.org/) provides us with examples of how to use it. Inspired by those, our code will look like this:

```python
import httpx

url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
response = httpx.get(url)

print(response.text)
```
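
As an optional tweak while experimenting (not something this lesson requires), we can also ask HTTPX to fail loudly if the server returns an error status, so we don't accidentally work with an error page:

```python
import httpx

url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
response = httpx.get(url)

# raises httpx.HTTPStatusError for 4xx and 5xx responses
# instead of silently handing us an error page to process
response.raise_for_status()

print(response.text)
```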

If you run the program now, it should print the downloaded HTML:

```text
$ python main.py
<!doctype html>
<html class="no-js" lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0, height=device-height, minimum-scale=1.0, maximum-scale=1.0">
<meta name="theme-color" content="#00badb">
<meta name="robots" content="noindex">
<title>Sales</title>
...
</body>
</html>
```

Yay! The entire HTML is now available in our program as a string. For now, we are just printing it to the screen, but once it's a string, we can manipulate it using any Python string operations.
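
Just to illustrate, here's a throwaway sketch, not part of the lesson's final program, that treats the downloaded HTML as an ordinary string and asks a few basic questions about it:

```python
import httpx

url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
response = httpx.get(url)
html_code = response.text

# the page is just text, so standard string operations apply
print(len(html_code))            # total number of characters downloaded
print("Sales" in html_code)      # True, the page title appears in the markup
print(html_code.count("<meta"))  # a rough count of meta tags in the page
```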

## Treating HTML as a string

Let's try counting how many products are in the listing. Manually inspecting the page in the browser's developer tools, we can see that the HTML code of each product has roughly the following structure:

```html
<div class="product-item product-item--vertical ...">
  <a href="/products/..." class="product-item__image-wrapper">
    ...
  </a>
  <div class="product-item__info">
    ...
  </div>
</div>
```

At first sight, counting occurrences of `product-item` wouldn't match only products. Let's try looking for `<div class="product-item`, a substring that represents the entire beginning of each product tag. Because the substring contains a double quote character, we need to use single quotes as string boundaries.

```python
import httpx

url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
response = httpx.get(url)

html_code = response.text
count = html_code.count('<div class="product-item')
print(count)
```

Unfortunately, this doesn't seem to be sufficient. Running the code above prints 123, which is a suspiciously high number. It seems there are more div elements with class names starting with `product-item`.
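
If you're curious what exactly we're matching, a quick throwaway snippet can list the distinct class attributes that begin with `product-item`; it uses the `re` module purely for exploration and isn't part of the lesson's program:

```python
import re

# list the distinct class attribute values starting with "product-item"
variants = set(re.findall(r'<div class="(product-item[^"]*)"', html_code))
for variant in sorted(variants):
    print(variant)
```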

On a closer look at the whole HTML, our substring also matches tags like `<div class="product-item__info">`. What if we force our code to count only those with a space after the class name?

```python
count = html_code.count('<div class="product-item ')
```

Now our program prints the number 24, which is in line with the text _Showing 1 - 24 of 50 products_ above the product listing. Oof, that was tedious! While successful, we can see that processing HTML with [standard string methods](https://docs.python.org/3/library/stdtypes.html#string-methods) is difficult and fragile.

In fact, HTML can be so complex that even [regular expressions](https://docs.python.org/3/library/re.html) aren't able to process it reliably. In the next lesson we'll meet a tool dedicated to the task, an HTML parser.
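
To give a taste of why, here is roughly what a regular-expression version of our count could look like; it should produce the same number for this particular page, but it still depends on the exact attribute order, quoting, and whitespace in the markup:

```python
import re

# same count as before, expressed as a regular expression
# (still fragile: any change in attribute order or quoting breaks it)
count = len(re.findall(r'<div\s+class="product-item\s', html_code))
print(count)
```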

## Exercises

- One

sources/academy/webscraping/scraping_basics_python/index.md

Lines changed: 14 additions & 8 deletions

@@ -18,14 +18,20 @@ This course is incomplete. As we work on adding new lessons, we would love to he

:::

-## What you'll learn
+In this course we'll use Python to create an application for watching prices. It'll be able to scrape all product pages of an e-commerce website and record prices. Data sets from several runs of such a program would be useful for seeing trends in price changes, detecting discounts, etc.

-- Inspecting pages using browser DevTools
-- Downloading web pages using the HTTPX library
-- Extracting data from web pages using the Beautiful Soup library
-- Saving extracted data in various formats, e.g. CSV which MS Excel or Google Sheets can open
-- Following links programatically (crawling)
-- Saving time and effort with frameworks, such as Scrapy, and scraping platforms, such as Apify
+<!--
+TODO image of warehouse with some CSV or JSON exported, similar to sources/academy/webscraping/scraping_basics_javascript/images/beginners-data-collection.png, which is for some reason the same as sources/academy/webscraping/scraping_basics_javascript/images/beginners-data-extraction.png
+-->
+
+## What you'll do
+
+- Inspect pages using browser DevTools
+- Download web pages using the HTTPX library
+- Extract data from web pages using the Beautiful Soup library
+- Save extracted data in various formats, e.g. CSV, which MS Excel or Google Sheets can open
+- Follow links programmatically (crawling)
+- Save time and effort with frameworks, such as Scrapy, and scraping platforms, such as Apify

## Who this course is for

@@ -62,7 +68,7 @@ As a scraper developer, you are not limited by whether certain data is available

### Why learn with Apify

-We are [Apify](https://apify.com), a web scraping and automation platform, but we built this course on top of open source technologies. The skills you'll learn are applicable to any scraping project, and you'll be able to run your scrapers on any computer. We will show you how scraping platforms can simplify your life, but those lessons are optional and designed to fit within our [free tier](https://apify.com/pricing).
+We are [Apify](https://apify.com), a web scraping and automation platform, but we built this course on top of open source technologies. The skills you can learn are applicable to any scraping project, and you'll be able to run your scrapers on any computer. We will show you how scraping platforms can simplify your life, but those lessons are optional and designed to fit within our [free tier](https://apify.com/pricing).

## Course content
