Commit 4760e4f

feat: skip processing HTML as a string, reach the goal faster
1 parent c878f30 commit 4760e4f

1 file changed: +6 -49 lines changed

sources/academy/webscraping/scraping_basics_python/05_parsing_html.md

Lines changed: 6 additions & 49 deletions
@@ -18,70 +18,27 @@ From lessons about browser DevTools we know that the HTML tags representing indi
 
 As a first step, let's try counting how many products are on the listing page.
 
-## Treating HTML as a string
+## Processing HTML
 
-Currently, the entire HTML is available in our program as a string. Our program can print it to the screen or save it to a file, but not much more. If it's a string, could we use Python string operations to count the products? Each Python string has [`.count()`](https://docs.python.org/3/library/stdtypes.html#str.count), a method for counting substrings.
+After downloading, the entire HTML is available in our program as a string. We can print it to the screen or save it to a file, but not much more. But if it's a string, could we use [string operations](https://docs.python.org/3/library/stdtypes.html#string-methods) or [regular expressions](https://docs.python.org/3/library/re.html) to count the products?
 
-After manually inspecting the page in browser DevTools we can see that all product cards have the following structure:
-
-```html
-<div class="product-item product-item--vertical ...">
-<a href="/products/..." class="product-item__image-wrapper">
-...
-</a>
-<div class="product-item__info">
-...
-</div>
-</div>
-```
-
-At first sight, counting `product-item` occurrences wouldn't match only products, but also `product-item__image-wrapper`. Hmm.
-
-We could try looking for `<div class="product-item`, a substring which represents the enitre beginning of each product tag, but that would also count `<div class="product-item__info`! We'll need to add a space after the class name to avoid matching those. Replace your program with the following code:
-
-```py
-import httpx
-
-url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
-response = httpx.get(url)
-response.raise_for_status()
-
-html_code = response.text
-# use single quotes as string boundaries, because the substring contains a double quote character
-count = html_code.count('<div class="product-item ')
-print(count)
-```
-
-Our scraper prints 24, which is in line with the text **Showing 1–24 of 50 products** above the product listing. Phew, figuring this out was quite tedious!
-
-```text
-$ python main.py
-24
-```
-
-<!-- TODO image -->
-
-While possible, we can see that processing HTML with [standard string methods](https://docs.python.org/3/library/stdtypes.html#string-methods) is difficult and fragile. Imagine we wouldn't be just counting, but trying to get the titles and prices.
-
-In fact HTML can be so complex that even [regular expressions](https://docs.python.org/3/library/re.html) aren't able to process it reliably. To work with HTML we need a robust tool dedicated for the task.
+While somewhat possible, such approach is tedious, fragile, and unreliable. To work with HTML we need a robust tool dedicated for the task. An _HTML parser_ takes a text with HTML markup and turns it into a tree of Python objects.
 
 :::info Why regex can't parse HTML
 
 While [Bobince's infamous StackOverflow answer](https://stackoverflow.com/a/1732454/325365) is funny, it doesn't go much into explaining. In formal language theory, HTML's hierarchical and nested structure makes it a [context-free language](https://en.wikipedia.org/wiki/Context-free_language). Regular expressions match patterns in [regular languages](https://en.wikipedia.org/wiki/Regular_language), which are much simpler. This difference makes it hard for a regex to handle HTML's nested tags. HTML's complex syntax rules and various edge cases also add to the difficulty.
 
 :::
 
-## Using HTML parser
-
-An HTML parser takes a text with HTML markup and turns it into a tree of Python objects. We'll choose Beautiful Soup as our parser, as it's a popular library renowned for its ability to process even non-standard, broken markup. This is useful for scraping, because real-world websites often contain all sorts of errors and discrepancies.
+We'll choose _Beautiful Soup_ as our parser, as it's a popular library renowned for its ability to process even non-standard, broken markup. This is useful for scraping, because real-world websites often contain all sorts of errors and discrepancies.
 
 ```text
 $ pip install beautifulsoup4
 ...
 Successfully installed beautifulsoup4-4.0.0 soupsieve-0.0
 ```
 
-Now let's use it for parsing the HTML. Unlike plain string, the `BeautifulSoup` object allows us to work with the HTML elements in a structured way. As a demonstration, we'll first get the `<h1>` tag, which represents the main heading of the page.
+Now let's use it for parsing the HTML. The `BeautifulSoup` object allows us to work with the HTML elements in a structured way. As a demonstration, we'll first get the `<h1>` tag, which represents the main heading of the page.
 
 ![Tag of the main heading](./images/h1.png)
 
@@ -149,7 +106,7 @@ $ python main.py
 24
 ```
 
-That's it! We have managed to download a product listing, parse its HTML, and count how many products it contains. In the next lesson, we'll be looking for a way to extract detailed information about individual products.
+That's it! We've managed to download a product listing, parse its HTML, and count how many products it contains. In the next lesson, we'll be looking for a way to extract detailed information about individual products.
 
 ---
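
Note on the removed section: the fragility it described can be sketched without any network access or third-party packages. The snippet below is only an illustration, not code from the lesson. It uses Python's standard-library `html.parser` as a stand-in for Beautiful Soup, and both `SAMPLE_HTML` and `FRAGILE_HTML` are hypothetical markup modeled on the product card structure shown in the removed lines.

```python
from html.parser import HTMLParser

# Hypothetical sample markup, modeled on the product cards in the removed lines.
SAMPLE_HTML = """
<div class="product-item product-item--vertical">
  <a href="/products/a" class="product-item__image-wrapper">A</a>
  <div class="product-item__info">Info A</div>
</div>
<div class="product-item product-item--vertical">
  <a href="/products/b" class="product-item__image-wrapper">B</a>
  <div class="product-item__info">Info B</div>
</div>
"""

# Hypothetical variant: same meaning, but different quoting and attribute order.
FRAGILE_HTML = "<div id='p1' class='product-item'></div>"


class ProductCounter(HTMLParser):
    """Counts start tags whose class list contains exactly `product-item`."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # `attrs` is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if "product-item" in classes:
            self.count += 1


def count_with_parser(html_code):
    counter = ProductCounter()
    counter.feed(html_code)
    return counter.count


# Counting the bare class name also matches the longer class names.
naive_count = SAMPLE_HTML.count("product-item")
# The substring from the removed code works only for this exact spelling.
substring_count = SAMPLE_HTML.count('<div class="product-item ')
parser_count = count_with_parser(SAMPLE_HTML)

fragile_substring_count = FRAGILE_HTML.count('<div class="product-item ')
fragile_parser_count = count_with_parser(FRAGILE_HTML)

print(naive_count, substring_count, parser_count)
print(fragile_substring_count, fragile_parser_count)
```

On the sample markup, the substring approach and the parser agree. On the variant, the substring no longer matches, while the parser still finds the element because it compares class names rather than raw characters.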
