Commit 8d5114f

feat: finish the parsing lesson
1 parent 29acc1b commit 8d5114f

File tree

5 files changed

+168
-27
lines changed


sources/academy/webscraping/scraping_basics_python/04_downloading_html.md

Lines changed: 8 additions & 6 deletions
@@ -10,7 +10,7 @@ slug: /scraping-basics-python/downloading-html

---

-Using browser tools for developers is crucial for understanding structure of a particular page, but it's a manual task. Now let's start building our first automation, a Python program which downloads HTML code of the product listing.
+Using browser tools for developers is crucial for understanding the structure of a particular page, but it's a manual task. Let's start building our first automation, a Python program which downloads the HTML code of the product listing.

## Starting a Python project

@@ -28,7 +28,7 @@ Being comfortable around Python project setup and installing packages is a prere

:::

-Now let's test that all works. In the project directory create a new file called `main.py` with the following code:
+Now let's test that everything works. Inside the project directory create a new file called `main.py` with the following code:

```python
import httpx
@@ -135,10 +135,10 @@ https://warehouse-theme-metal.myshopify.com/does/not/exist

We could check the value of `response.status_code` against a list of allowed numbers, but HTTPX also provides `response.raise_for_status()`, a method which analyzes the number and raises the `httpx.HTTPError` exception in case our request wasn't successful.

-A robust scraper skips or retries requests when errors occur, but we'll start simple. Our program will print an error message and stop further processing of the response.
+A robust scraper skips or retries requests when errors occur, but let's start simple. Our program will print an error message and stop further processing of the response.

-We also want to play along with the conventions of the operating system, so let's print to the [standard error output](https://en.wikipedia.org/wiki/Standard_streams#Standard_error_(stderr)) and exit our program with non-zero [status code](https://en.wikipedia.org/wiki/Exit_status):
+We also want to play along with the conventions of the operating system, so we'll print to the [standard error output](https://en.wikipedia.org/wiki/Standard_streams#Standard_error_(stderr)) and exit our program with a non-zero [status code](https://en.wikipedia.org/wiki/Exit_status):

```python
import sys
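Editor's note: the stderr-and-exit convention this hunk describes can be sketched with the standard library alone. The `fail` helper below is our illustration, not code from the lesson:

```python
import sys

def fail(message, code=1):
    """Print an error message to standard error and exit with a non-zero status."""
    print(message, file=sys.stderr)
    # sys.exit() raises SystemExit; the process terminates with the given code
    sys.exit(code)
```

A shell can then read the exit status with `echo $?`, so other scripts and tools can detect that the scraper failed.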
@@ -165,9 +165,11 @@ For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/St

Done! We have managed to apply basic error handling. Now let's get back to our primary goal. In the next lesson, we'll be looking for a way to extract information about products from the downloaded HTML.

+---
+
## Exercises

-These challenges should help you verify that you can apply knowledge acquired in this lesson. Resist the temptation to look at the solutions right away. Learn by doing, not by copying and pasting!
+These challenges are here to help you test what you’ve learned in this lesson. Try to resist the urge to peek at the solutions right away. Remember, the best learning happens when you dive in and do it yourself!

### Scrape Amazon

@@ -214,7 +216,7 @@ https://warehouse-theme-metal.myshopify.com/collections/sales
$ python main.py > products.html
```

-If you want to use Python, it offers several ways how to create files. The solution below uses [pathlib](https://docs.python.org/3/library/pathlib.html):
+If you want to use Python instead, it offers several ways to create files. The solution below uses [pathlib](https://docs.python.org/3/library/pathlib.html):

```python
import sys
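Editor's note: the hunk above cuts off right after `import sys`, so for context, a minimal sketch of the pathlib approach it mentions might look like this (the filename and the placeholder HTML string are ours, for illustration only):

```python
from pathlib import Path

html_code = "<html>...</html>"  # placeholder for the downloaded HTML

# write_text() opens the file, writes the string, and closes it in one call
Path("products.html").write_text(html_code, encoding="utf-8")
```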

sources/academy/webscraping/scraping_basics_python/05_parsing_html.md

Lines changed: 150 additions & 14 deletions
@@ -6,19 +6,21 @@ sidebar_position: 5
slug: /scraping-basics-python/parsing-html
---

-:::danger Work in progress
+**In this lesson we'll look for products in the downloaded HTML. We'll use Beautiful Soup to turn the HTML into objects which we can work with in our Python program.**

-This course is incomplete. As we work on adding new lessons, we would love to hear your feedback. You can comment right here under each page or [file a GitHub Issue](https://github.com/apify/apify-docs/issues) to discuss a problem.
+---

-This lesson contains just a fraction of what it should contain. In the end, the current content might get rewritten. Everything on this page is a subject to change!
+From previous lessons we know that the HTML tags representing individual products have a `class` attribute which, among other values, contains `product-item`.

-:::
+![Products have the ‘product-item’ class](./images/collection-class.png)
+
+As a first step, let's try counting how many products are in the listing.

## Treating HTML as a string

-Yay! The entire HTML is now available in our program as a string. For now, we are just printing it to the screen, but once it's a string, we can manipulate it using any Python string operations.
+Currently, the entire HTML is available in our program as a string. Our program can print it to the screen or save it to a file, but not much more. Can we use Python string operations to count the products? Each string has `.count()`, a [method for counting substrings](https://docs.python.org/3/library/stdtypes.html#str.count).

-Let's try counting how many products is in the listing. Manually inspecting the page in browser developer tools, we can see that HTML code of each product has roughly the following structure:
+After manually inspecting the page in browser DevTools, we can see that each product has the following structure:

```html
<div class="product-item product-item--vertical ...">
@@ -31,37 +33,171 @@ Let's try counting how many products is in the listing. Manually inspecting the
</div>
```

-At first sight, counting `product-item` occurances wouldn't match only products. Let's try looking for `<div class="product-item`, a substring which represents the enitre beginning of each product tag. Because the substring contains a double quote character, we need to use single quotes as string boundaries.
+At first sight, counting `product-item` occurrences wouldn't match only products. Let's try looking for `<div class="product-item`, a substring which represents the entire beginning of each product tag. Because the substring contains a double quote character, we need to use single quotes as string boundaries. Replace your program with the following code:

```python
import httpx

url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
response = httpx.get(url)
+response.raise_for_status()

html_code = response.text
count = html_code.count('<div class="product-item')
print(count)
```

-Unfortunately, this doesn't seem to be sufficient. Running the code above prints 123, which is a suspiciously high number. It seems there are more div elements with class names starting with `product-item`.
+:::info Handling errors
+
+To keep the code examples concise, we're omitting error handling for now. Keeping `response.raise_for_status()` ensures that your program at least crashes and prints what happened in case there's an error.
+
+:::
+
+Unfortunately, this doesn't seem to be sufficient. Running the code above prints 123, which is a suspiciously high number. It seems there are more `div` tags with class names starting with `product-item`.
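Editor's note: the 123-vs-24 discrepancy is easy to reproduce on a toy snippet. This is our own illustration (not code from the lesson) of why a trailing space in the substring matters:

```python
# A miniature version of the listing's markup (our own sample, not the real page)
html_code = """
<div class="product-item product-item--vertical">
  <div class="product-item__info">Info</div>
</div>
"""

# Without the trailing space, the substring also matches `product-item__info`
print(html_code.count('<div class="product-item'))   # prints 2
# With the trailing space, only the real product wrapper matches
print(html_code.count('<div class="product-item '))  # prints 1
```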

On closer look at the HTML, our substring matches also tags like `<div class="product-item__info">`. What if we force our code to count only those with a space after the class name?

```python
count = html_code.count('<div class="product-item ')
```

-Now our program prints number 24, which is in line with the text **Showing 1–24 of 50 products** above the product listing.
+Now it prints 24, which is in line with the text **Showing 1–24 of 50 products** above the product listing. Phew, that was tedious!

<!-- TODO image -->

-Oof, that was tedious! While successful, we can see that processing HTML with [standard string methods](https://docs.python.org/3/library/stdtypes.html#string-methods) is difficult and fragile. Imagine we wouldn't be just counting, but trying to get titles and prices.
+While possible, we can see that processing HTML with [standard string methods](https://docs.python.org/3/library/stdtypes.html#string-methods) is difficult and fragile. Imagine if we weren't just counting, but trying to get the titles and prices.
+
+In fact, HTML can be so complex that even [regular expressions](https://docs.python.org/3/library/re.html) aren't able to process it reliably. To work with HTML we need a robust tool dedicated to the task.
+
+:::tip Why regex can't parse HTML
+
+While [Bobince's infamous StackOverflow answer](https://stackoverflow.com/a/1732454/325365) is funny, it doesn't go much into explaining. In formal language theory, HTML's hierarchical and nested structure makes it a [context-free language](https://en.wikipedia.org/wiki/Context-free_language). Regular expressions match patterns in [regular languages](https://en.wikipedia.org/wiki/Regular_language), which are much simpler. This difference makes it hard for a regex to handle HTML's nested tags. HTML's complex syntax rules and various edge cases also add to the difficulty.
+
+:::
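Editor's note: the tip above can be demonstrated in a few lines. This toy example (ours, not from the lesson) shows a non-greedy regex cutting a nested element short:

```python
import re

# Two nested divs; a real HTML parser would pair the tags correctly
html = '<div class="product">outer <div>inner</div> tail</div>'

# The non-greedy match stops at the FIRST closing tag it meets,
# so the outer element's content is truncated mid-way
match = re.search(r'<div class="product">(.*?)</div>', html)
print(match.group(1))  # prints 'outer <div>inner' (the ' tail' part is lost)
```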
+
+## Using HTML parser
+
+An HTML parser takes a text with HTML markup and turns it into a tree of Python objects. We'll choose Beautiful Soup as our parser, as it's a popular library renowned for its ability to process even non-standard, broken markup. This is useful for scraping, because real-world websites often contain all sorts of errors and discrepancies.
+
+```text
+$ pip install beautifulsoup4
+...
+Successfully installed beautifulsoup4-4.0.0 soupsieve-0.0
+```
+
+Now let's use it for parsing the HTML:
+
+```python
+import httpx
+from bs4 import BeautifulSoup
+
+url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
+response = httpx.get(url)
+response.raise_for_status()
+
+html_code = response.text
+soup = BeautifulSoup(html_code, "html.parser")
+print(soup.title)
+```
+
+The `BeautifulSoup` object contains our HTML, but unlike a plain string, it allows us to work with the HTML elements in a structured way. As a demonstration, we use the shorthand `.title` for accessing the HTML `<title>` tag. Let's run the program:
+
+```text
+$ python main.py
+<title>Sales
+</title>
+```
+
+That looks promising! What if we want just the contents of the tag? Let's change the print line to the following:
+
+```python
+print(soup.title.text)
+```
+
+If we run our scraper again, it prints just the actual text of the `<title>` tag:

-In fact HTML can be so complex that even [regular expressions](https://docs.python.org/3/library/re.html) aren't able to process it reliably. In the next lesson we'll meet a tool dedicated for the task, a HTML parser.
+```text
+$ python main.py
+Sales
+```
+
+## Using CSS selectors
+
+Beautiful Soup offers a `.select()` method, which runs a CSS selector against a parsed HTML document and returns all the matching elements. Scanning through [usage examples](https://beautiful-soup-4.readthedocs.io/en/latest/#css-selectors) will help us figure out code for counting the products:
+
+```python
+import httpx
+from bs4 import BeautifulSoup
+
+url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
+response = httpx.get(url)
+response.raise_for_status()
+
+html_code = response.text
+soup = BeautifulSoup(html_code, "html.parser")
+products = soup.select(".product-item")
+print(len(products))
+```
+
+In CSS, `.product-item` selects all elements whose `class` attribute contains the value `product-item`. We call `soup.select()` with the selector and get back a list of matching elements. On the last line, we use `len()` to count how many items are in the list. That's it!
+
+```text
+$ python main.py
+24
+```
+
+We have managed to download a product listing, parse its HTML, and count how many products it contains. In the next lesson, we'll be looking for a way to extract detailed information about individual products.
+
+---

## Exercises

-- One
-- Two
-- Three
+These challenges are here to help you test what you’ve learned in this lesson. Try to resist the urge to peek at the solutions right away. Remember, the best learning happens when you dive in and do it yourself!
+
+### Scrape F1 teams
+
+Print a total count of F1 teams listed on this page:
+
+```text
+https://www.formula1.com/en/teams
+```
+
+<details>
+<summary>Solution</summary>
+
+```python
+import httpx
+from bs4 import BeautifulSoup
+
+url = "https://www.formula1.com/en/teams"
+response = httpx.get(url)
+response.raise_for_status()
+
+html_code = response.text
+soup = BeautifulSoup(html_code, "html.parser")
+print(len(soup.select(".outline")))
+```
+
+</details>
+
+### Scrape F1 drivers
+
+Use the same URL as in the previous exercise, but this time print a total count of F1 drivers.
+
+<details>
+<summary>Solution</summary>
+
+```python
+import httpx
+from bs4 import BeautifulSoup
+
+url = "https://www.formula1.com/en/teams"
+response = httpx.get(url)
+response.raise_for_status()
+
+html_code = response.text
+soup = BeautifulSoup(html_code, "html.parser")
+print(len(soup.select(".f1-grid")))
+```
+
+</details>
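Editor's note: the `"html.parser"` argument used throughout this lesson names Python's built-in parser, which Beautiful Soup drives under the hood. For the curious, the counting task can be done with that stdlib module directly, though far less conveniently. The sample HTML below is our own, not the real page:

```python
from html.parser import HTMLParser

class ProductCounter(HTMLParser):
    """Counts div elements whose class attribute contains 'product-item'."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; class is a space-separated string
        classes = dict(attrs).get("class", "").split()
        if tag == "div" and "product-item" in classes:
            self.count += 1

parser = ProductCounter()
parser.feed("""
<div class="product-item product-item--vertical">A</div>
<div class="product-item__info">not a product</div>
<div class="product-item">B</div>
""")
print(parser.count)  # prints 2
```

The event-driven stdlib API makes even this simple task verbose, which is a good argument for the tree-based interface Beautiful Soup provides.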
(2 binary image files changed: 506 KB and 89.4 KB)

sources/academy/webscraping/scraping_basics_python/index.md

Lines changed: 10 additions & 7 deletions
@@ -14,16 +14,18 @@ import DocCardList from '@theme/DocCardList';

:::danger Work in progress

-This course is incomplete. As we work on adding new lessons, we would love to hear your feedback. You can comment right here under each page or [file a GitHub Issue](https://github.com/apify/apify-docs/issues) to discuss a problem.
+This course is incomplete. As we work on adding new lessons, we would love to hear your feedback. Comment right here under each page or [file a GitHub Issue](https://github.com/apify/apify-docs/issues) to discuss a problem.

:::

-In this course we'll use Python to create an application for watching prices. It'll be able to scrape all product pages of an e-commerce website and record prices. Data sets from several rans of such program would be useful for seeing trends in price changes, detecting discounts, etc.
+In this course we'll use Python to create an application for watching prices. It'll be able to scrape all product pages of an e-commerce website and record prices. Data from several runs of such a program would be useful for seeing trends in price changes, detecting discounts, etc.

<!--
TODO image of warehouse with some CVS or JSON exported, similar to sources/academy/webscraping/scraping_basics_javascript/images/beginners-data-collection.png, which is for some reason the same as sources/academy/webscraping/scraping_basics_javascript/images/beginners-data-extraction.png
-->

+![E-commerce listing on the left, JSON with data on the right](./images/scraping.png)
+
## What you'll do

- Inspect pages using browser DevTools
@@ -42,7 +44,7 @@ Anyone with basic knowledge of developing programs in Python who wants to start

- macOS, Linux or Windows machine with a web browser and Python installed
- Familiar with Python basics: variables, conditions, loops, functions, strings, lists, dictionaries, files, classes, exceptions
- Comfortable importing from the Python standard library, using virtual environments, and installing dependencies with `pip`
-- Running commands in Terminal or Command Prompt
+- Familiar with how to run commands in Terminal or Command Prompt

## You may want to know

@@ -52,23 +54,24 @@ Let's explore the key reasons to take this course. What is web scraping good for

The internet is full of useful data, but most of it isn't offered in a structured way that is easy to process programmatically. That's why you need scraping, a set of approaches to download websites and extract data from them.

-Scraper development is also a fun and challenging way to learn web development, web technologies, and understand the internet. You'll reverse-engineer websites, understand how they work internally, discover what technologies they use, and learn how they communicate with their servers. You'll also master your chosen programming language and core programming concepts. Understanding web scraping gives you a head start in learning web technologies such as HTML, CSS, JavaScript, frontend frameworks (like React or Next.js), HTTP, REST APIs, GraphQL APIs, and more.
+Scraper development is also a fun and challenging way to learn web development, web technologies, and understand the internet. You'll reverse-engineer websites, understand how they work internally, discover what technologies they use, and learn how they communicate with servers. You'll also master your chosen programming language and core programming concepts. Understanding web scraping gives you a head start in learning web technologies such as HTML, CSS, JavaScript, frontend frameworks (like React or Next.js), HTTP, REST APIs, GraphQL APIs, and more.

### Why build your own scrapers

Scrapers are programs specifically designed to mine data from the internet. Point-and-click or no-code scraping solutions do exist, but they only take you so far. While simple to use, they lack the flexibility and optimization needed to handle advanced cases. Only custom-built scrapers can tackle more difficult challenges. And unlike ready-made solutions, they can be fine-tuned to perform tasks more efficiently, at a lower cost, or with greater precision.

### Why become a scraper dev

-As a scraper developer, you are not limited by whether certain data is available programmatically through an official API—the entire web becomes your API. Here are some things you can do if you understand scraping:
+As a scraper developer, you are not limited by whether certain data is available programmatically through an official API—the entire web becomes your API! Here are some things you can do if you understand scraping:

- Improve your productivity by building personal tools, such as your own real estate or rare sneakers watchdog.
-- Companies can hire you to build custom scrapers to mine data important for their business.
+- Companies can hire you to build custom scrapers mining data important for their business.
+- Become an invaluable asset to data journalism, data science, or nonprofit teams working to make the world a better place.
- You can publish your scrapers on platforms like the [Apify Store](https://apify.com/store) and earn money by renting them out to others.

### Why learn with Apify

-We are [Apify](https://apify.com), a web scraping and automation platform, but we built this course on top of open source technologies. The skills you can learn are applicable to any scraping project, and you'll be able to run your scrapers on any computer. We will show you how scraping platforms can simplify your life, but those lessons are optional and designed to fit within our [free tier](https://apify.com/pricing).
+We are [Apify](https://apify.com), a web scraping and automation platform. We did our best to build this course on top of open source technologies. That means what you learn applies to any scraping project, and you'll be able to run your scrapers on any computer. We will show you how scraping platforms can simplify your life, but those lessons are optional and designed to fit within our [free tier](https://apify.com/pricing).

## Course content