Using browser tools for developers is crucial for understanding the structure of a particular page, but it's a manual task. Now let's start building our first automation, a Python program which downloads the HTML code of the product listing.
## Starting a Python project
Being comfortable around Python project setup and installing packages is a prerequisite for this course.
Now let's test that everything works. Inside the project directory, create a new file called `main.py` with the following code:
We could check the value of `response.status_code` against a list of allowed numbers, but HTTPX also provides `response.raise_for_status()`, a method which analyzes the number and raises the `httpx.HTTPError` exception in case our request wasn't successful.
A robust scraper skips or retries requests when errors occur, but let's start simple. Our program will print an error message and stop further processing of the response.
We also want to play along with the conventions of the operating system, so we'll print to the [standard error output](https://en.wikipedia.org/wiki/Standard_streams#Standard_error_(stderr)) and exit our program with a non-zero [status code](https://en.wikipedia.org/wiki/Exit_status):
```python
import sys
import httpx

# the product listing URL used in the previous lessons
url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
response = httpx.get(url)

try:
    response.raise_for_status()
except httpx.HTTPError as error:
    # follow OS conventions: print to standard error, exit with a non-zero status
    print(f"Error: {error}", file=sys.stderr)
    sys.exit(1)

print(response.text)
```
Done! We have managed to apply basic error handling. Now let's get back to our primary goal. In the next lesson, we'll be looking for a way to extract information about products from the downloaded HTML.
---
## Exercises
These challenges are here to help you test what you’ve learned in this lesson. Try to resist the urge to peek at the solutions right away. Remember, the best learning happens when you dive in and do it yourself!
If you want to use Python instead, it offers several ways to create files. The solution below uses [pathlib](https://docs.python.org/3/library/pathlib.html):
**In this lesson we'll look for products in the downloaded HTML. We'll use BeautifulSoup to turn the HTML into objects which we can work with in our Python program.**
---
From previous lessons we know that the HTML tags representing individual products have a `class` attribute which, among other values, contains `product-item`.

As a first step, let's try counting how many products are in the listing.
## Treating HTML as a string
Currently, the entire HTML is available in our program as a string. Our program can print it to the screen or save it to a file, but not much more. Can we use Python string operations to count the products? Each string has `.count()`, a [method for counting substrings](https://docs.python.org/3/library/stdtypes.html#str.count).
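For example, a quick demonstration of `.count()` on a toy string:

```python
# .count() returns the number of non-overlapping occurrences of a substring
text = "parsers parse HTML"
print(text.count("pars"))  # prints 2
```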
After manually inspecting the page in browser DevTools, we can see that each product has the following structure:
```html
<div class="product-item ...">
  ...
  <div class="product-item__info">
    ...
  </div>
</div>
```
At first sight, counting `product-item` occurrences wouldn't match only products. Let's try looking for `<div class="product-item`, a substring which represents the entire beginning of each product tag. Because the substring contains a double quote character, we need to use single quotes as string boundaries. Replace your program with the following code:
:::info Handling errors
To keep the code examples more concise, we're omitting error handling for now. Keeping `response.raise_for_status()` ensures that your program at least crashes and prints what happened in case there's an error.
:::
Unfortunately, this doesn't seem to be sufficient. Running the code above prints 123, which is a suspiciously high number. It seems there are more `div` tags with class names starting with `product-item`.
On a closer look at the HTML, our substring also matches tags like `<div class="product-item__info">`. What if we force our code to count only those with a space after the class name?
Now it prints 24, which is in line with the text **Showing 1–24 of 50 products** above the product listing. Phew, that was tedious!
<!-- TODO image -->
While possible, we can see that processing HTML with [standard string methods](https://docs.python.org/3/library/stdtypes.html#string-methods) is difficult and fragile. Imagine we weren't just counting, but trying to get the titles and prices.
In fact, HTML can be so complex that even [regular expressions](https://docs.python.org/3/library/re.html) aren't able to process it reliably. To work with HTML, we need a robust tool dedicated to the task.
:::tip Why regex can't parse HTML
While [Bobince's infamous StackOverflow answer](https://stackoverflow.com/a/1732454/325365) is funny, it doesn't explain much. In formal language theory, HTML's hierarchical and nested structure makes it a [context-free language](https://en.wikipedia.org/wiki/Context-free_language). Regular expressions match patterns in [regular languages](https://en.wikipedia.org/wiki/Regular_language), which are much simpler. This difference makes it hard for a regex to handle HTML's nested tags. HTML's complex syntax rules and various edge cases also add to the difficulty.
:::
## Using an HTML parser
An HTML parser takes a text with HTML markup and turns it into a tree of Python objects. We'll choose Beautiful Soup as our parser, as it's a popular library renowned for its ability to process even non-standard, broken markup. This is useful for scraping, because real-world websites often contain all sorts of errors and discrepancies.
The `BeautifulSoup` object contains our HTML, but unlike a plain string, it allows us to work with the HTML elements in a structured way. As a demonstration, we use the shorthand `.title` for accessing the HTML `<title>` tag. Let's run the program:
```text
$ python main.py
<title>Sales
</title>
```
That looks promising! What if we want just the contents of the tag? Let's change the print line to the following:
```python
print(soup.title.text)
```
If we run our scraper again, it prints just the actual text of the `<title>` tag:
```text
$ python main.py
Sales
```
## Using CSS selectors
Beautiful Soup offers a `.select()` method, which runs a CSS selector against a parsed HTML document and returns all the matching elements. Scanning through [usage examples](https://beautiful-soup-4.readthedocs.io/en/latest/#css-selectors) will help us figure out the code for counting the products:
In CSS, `.product-item` selects all elements whose `class` attribute contains the value `product-item`. We call `soup.select()` with the selector and get back a list of matching elements. On the last line, we use `len()` to count how many items are in the list. That's it!
```text
$ python main.py
24
```
We have managed to download a product listing, parse its HTML, and count how many products it contains. In the next lesson, we'll be looking for a way to extract detailed information about individual products.
---
## Exercises
These challenges are here to help you test what you’ve learned in this lesson. Try to resist the urge to peek at the solutions right away. Remember, the best learning happens when you dive in and do it yourself!
### Scrape F1 teams
Print a total count of F1 teams listed on this page:
```text
https://www.formula1.com/en/teams
```
<details>
<summary>Solution</summary>

```python
import httpx
from bs4 import BeautifulSoup

url = "https://www.formula1.com/en/teams"
response = httpx.get(url)
response.raise_for_status()

html_code = response.text
soup = BeautifulSoup(html_code, "html.parser")
print(len(soup.select(".outline")))
```

</details>
### Scrape F1 drivers
Use the same URL as in the previous exercise, but this time print a total count of F1 drivers.
---

`sources/academy/webscraping/scraping_basics_python/index.md`
:::danger Work in progress
This course is incomplete. As we work on adding new lessons, we would love to hear your feedback. Comment right here under each page or [file a GitHub Issue](https://github.com/apify/apify-docs/issues) to discuss a problem.
:::
In this course we'll use Python to create an application for watching prices. It'll be able to scrape all product pages of an e-commerce website and record prices. Data from several runs of such a program would be useful for seeing trends in price changes, detecting discounts, etc.
<!--
TODO image of warehouse with some CVS or JSON exported, similar to sources/academy/webscraping/scraping_basics_javascript/images/beginners-data-collection.png, which is for some reason the same as sources/academy/webscraping/scraping_basics_javascript/images/beginners-data-extraction.png
-->

## What you'll do
- Inspect pages using browser DevTools
- macOS, Linux or Windows machine with a web browser and Python installed
- Comfortable importing from the Python standard library, using virtual environments, and installing dependencies with `pip`
- Familiar with running commands in Terminal or Command Prompt
## You may want to know
Let's explore the key reasons to take this course. What is web scraping good for?
The internet is full of useful data, but most of it isn't offered in a structured way that is easy to process programmatically. That's why you need scraping, a set of approaches to download websites and extract data from them.
Scraper development is also a fun and challenging way to learn web development, web technologies, and understand the internet. You'll reverse-engineer websites, understand how they work internally, discover what technologies they use, and learn how they communicate with servers. You'll also master your chosen programming language and core programming concepts. Understanding web scraping gives you a head start in learning web technologies such as HTML, CSS, JavaScript, frontend frameworks (like React or Next.js), HTTP, REST APIs, GraphQL APIs, and more.
### Why build your own scrapers
Scrapers are programs specifically designed to mine data from the internet. Point-and-click or no-code scraping solutions do exist, but they only take you so far. While simple to use, they lack the flexibility and optimization needed to handle advanced cases. Only custom-built scrapers can tackle more difficult challenges. And unlike ready-made solutions, they can be fine-tuned to perform tasks more efficiently, at a lower cost, or with greater precision.
### Why become a scraper dev
As a scraper developer, you are not limited by whether certain data is available programmatically through an official API—the entire web becomes your API! Here are some things you can do if you understand scraping:
- Improve your productivity by building personal tools, such as your own real estate or rare sneakers watchdog.
- Companies can hire you to build custom scrapers mining data important for their business.
- Become an invaluable asset to data journalism, data science, or nonprofit teams working to make the world a better place.
- You can publish your scrapers on platforms like the [Apify Store](https://apify.com/store) and earn money by renting them out to others.
### Why learn with Apify
We are [Apify](https://apify.com), a web scraping and automation platform. We did our best to build this course on top of open source technologies. That means what you learn applies to any scraping project, and you'll be able to run your scrapers on any computer. We will show you how scraping platforms can simplify your life, but those lessons are optional and designed to fit within our [free tier](https://apify.com/pricing).