Commit 1590092

fix: streamline the parsing lesson
1 parent 381b122 commit 1590092

4 files changed
+35 −27 lines changed

sources/academy/webscraping/scraping_basics_python/05_parsing_html.md

Lines changed: 34 additions & 26 deletions
@@ -10,17 +10,17 @@ slug: /scraping-basics-python/parsing-html
 
 ---
 
-From previous lessons we know that the HTML tags representing individual products have a `class` attribute which, among other values, contains `product-item`.
+From lessons about browser DevTools we know that the HTML tags representing individual products have a `class` attribute which, among other values, contains `product-item`.
 
-![Products have the ‘product-item’ class](./images/collection-class.png)
+![Products have the ‘product-item’ class](./images/product-item.png)
 
-As a first step, let's try counting how many products is in the listing.
+As a first step, let's try counting how many products are on the listing page.
 
 ## Treating HTML as a string
 
-Currently, the entire HTML is available in our program as a string. Our program can print it to the screen or save it to a file, but not much more. Can we use Python string operations to count the products? Each string has `.count()`, a [method for counting substrings](https://docs.python.org/3/library/stdtypes.html#str.count).
+Currently, the entire HTML is available in our program as a string. Our program can print it to the screen or save it to a file, but not much more. If it's a string, could we use Python string operations to count the products? Each Python string has `.count()`, a [method for counting substrings](https://docs.python.org/3/library/stdtypes.html#str.count).
 
-After manually inspecting the page in browser DevTools we can see that each product has the following structure:
+After manually inspecting the page in browser DevTools we can see that all products follow this structure:
 
 ```html
 <div class="product-item product-item--vertical ...">
@@ -33,7 +33,9 @@ After manually inspecting the page in browser DevTools we can see that each prod
 </div>
 ```
 
-At first sight, counting `product-item` occurances wouldn't match only products. Let's try looking for `<div class="product-item`, a substring which represents the enitre beginning of each product tag. Because the substring contains a double quote character, we need to use single quotes as string boundaries. Replace your program with the following code:
+At first sight, counting `product-item` occurrences wouldn't match only products, but also `product-item__image-wrapper`. Hmm.
+
+We could try looking for `<div class="product-item`, a substring which represents the entire beginning of each product tag, but that would also count `<div class="product-item__info`! We'll need to add a space after the class name to avoid matching those. Replace your program with the following code:
 
 ```python
 import httpx
@@ -43,26 +45,25 @@ response = httpx.get(url)
 response.raise_for_status()
 
 html_code = response.text
-count = html_code.count('<div class="product-item')
+count = html_code.count('<div class="product-item ')
 print(count)
 ```
 
+Note that because the substring contains a double quote character, we need single quotes as string boundaries.
+
 :::info Handling errors
 
-To have the code examples more concise, we're omitting error handling for now. Keeping `response.raise_for_status()` ensures that your program at least crashes and prints what happened in case there's an error.
+To keep the code examples concise, we're omitting error handling for now. Keeping `response.raise_for_status()` ensures that your program at least visibly crashes and prints what happened in case there's an error.
 
 :::
 
-Unfortunately, this doesn't seem to be sufficient. Running the code above prints 123, which is a suspiciously high number. It seems there are more `div` tags with class names starting with `product-item`.
-
-On closer look at the HTML, our substring matches also tags like `<div class="product-item__info">`. What if we force our code to count only those with a space after the class name?
+Our scraper prints 24, which is in line with the text **Showing 1–24 of 50 products** above the product listing. Phew, figuring this out was quite tedious!
 
-```python
-count = html_code.count('<div class="product-item ')
+```text
+$ python main.py
+24
 ```
 
-Now it prints number 24, which is in line with the text **Showing 1–24 of 50 products** above the product listing. Phew, that was tedious!
-
 <!-- TODO image -->
 
 While possible, we can see that processing HTML with [standard string methods](https://docs.python.org/3/library/stdtypes.html#string-methods) is difficult and fragile. Imagine we wouldn't be just counting, but trying to get the titles and prices.
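The substring-counting pitfall this hunk describes can be tried in isolation. Below is a minimal sketch using a small inline HTML sample (hypothetical markup that mirrors the lesson's class names) instead of the live page:

```python
# Sketch of the substring-counting pitfall, using inline sample HTML
# (hypothetical markup mirroring the lesson's class names).
html_code = """
<div class="product-item product-item--vertical">
  <div class="product-item__image-wrapper"></div>
  <div class="product-item__info"></div>
</div>
<div class="product-item product-item--vertical">
  <div class="product-item__info"></div>
</div>
"""

# The naive substring also matches product-item__image-wrapper
# and product-item__info, so it overcounts.
naive = html_code.count('<div class="product-item')

# A trailing space restricts matches to the product containers,
# whose class attribute continues with another class name.
fixed = html_code.count('<div class="product-item ')

print(naive, fixed)  # 5 2
```

In this sample the naive count hits five `div` tags, while the space-suffixed substring counts only the two product containers, which is exactly the discrepancy the lesson's live run (123 vs. 24) demonstrates.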
@@ -85,7 +86,11 @@ $ pip install beautifulsoup4
 Successfully installed beautifulsoup4-4.0.0 soupsieve-0.0
 ```
 
-Now let's use it for parsing the HTML:
+Now let's use it for parsing the HTML. Unlike a plain string, the `BeautifulSoup` object allows us to work with the HTML elements in a structured way. As a demonstration, we'll first get the `<h1>` tag, which represents the main heading of the page.
+
+![Tag of the main heading](./images/h1.png)
+
+Update your code to the following:
 
 ```python
 import httpx
@@ -97,24 +102,25 @@ response.raise_for_status()
 
 html_code = response.text
 soup = BeautifulSoup(html_code, "html.parser")
-print(soup.title)
+print(soup.select("h1"))
 ```
 
-The `BeautifulSoup` object contains our HTML, but unlike plain string, it allows us to work with the HTML elements in a structured way. As a demonstration, we use the shorthand `.title` for accessing the HTML `<title>` tag. Let's run the program:
+Let's run the program:
 
 ```text
 $ python main.py
-<title>Sales
-</title>
+[<h1 class="collection__title heading h1">Sales</h1>]
 ```
 
-That looks promising! What if we want just the contents of the tag? Let's change the print line to the following:
+Our code lists all `<h1>` tags it can find on the page. There's just one, so in the result we see a list with a single item. What if we want to print just its text? Let's change the end of the program to the following:
 
 ```python
-print(soup.title.text)
+headings = soup.select("h1")
+first_heading = headings[0]
+print(first_heading.text)
 ```
 
-If we run our scraper again, it prints just the actual text of the `<title>` tag:
+If we run our scraper again, it prints the text of the first `<h1>` tag:
 
 ```text
 $ python main.py
@@ -123,7 +129,9 @@ Sales
 
 ## Using CSS selectors
 
-Beautiful Soup offers a `.select()` method, which runs a CSS selector against a parsed HTML document and returns all the matching elements. Scanning through [usage examples](https://beautiful-soup-4.readthedocs.io/en/latest/#css-selectors) will help us to figure out code for counting the products:
+Beautiful Soup's `.select()` method runs a _CSS selector_ against a parsed HTML document and returns all the matching elements. It's like calling `document.querySelectorAll()` in browser DevTools.
+
+Scanning through [usage examples](https://beautiful-soup-4.readthedocs.io/en/latest/#css-selectors) will help us figure out code for counting the products:
 
 ```python
 import httpx
@@ -139,14 +147,14 @@ products = soup.select(".product-item")
 print(len(products))
 ```
 
-In CSS, `.product-item` selects all elements whose `class` attribute contains value `product-item`. We call `soup.select()` with the selector and get back a list of matching elements. On the last line, we use `len()` to count how many items is in the list. That's it!
+In CSS, `.product-item` selects all elements whose `class` attribute contains the value `product-item`. We call `soup.select()` with the selector and get back a list of matching elements. Beautiful Soup handles all the complexity of understanding the HTML markup for us. On the last line, we use `len()` to count how many items are in the list.
 
 ```text
 $ python main.py
 24
 ```
 
-We have managed to download a product listing, parse its HTML, and count how many products it contains. In the next lesson, we'll be looking for a way to extract detailed information about individual products.
+That's it! We have managed to download a product listing, parse its HTML, and count how many products it contains. In the next lesson, we'll be looking for a way to extract detailed information about individual products.
 
 ---
 
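The Beautiful Soup workflow this file's diff builds up to can be sketched end to end against a small inline HTML sample (hypothetical markup reusing the lesson's class names) rather than the live listing page:

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# Inline sample standing in for the downloaded listing page
# (hypothetical markup reusing the lesson's class names).
html_code = """
<h1 class="collection__title heading h1">Sales</h1>
<div class="product-item product-item--vertical">
  <div class="product-item__info"></div>
</div>
<div class="product-item product-item--vertical"></div>
"""
soup = BeautifulSoup(html_code, "html.parser")

# .select() runs a CSS selector and returns a list of matching
# elements, like document.querySelectorAll() in browser DevTools.
headings = soup.select("h1")
print(headings[0].text)  # Sales

# The .product-item selector matches whole class tokens only, so
# product-item__info is not counted -- unlike the substring approach.
products = soup.select(".product-item")
print(len(products))  # 2
```

Because the CSS class selector matches complete class tokens rather than string prefixes, no trailing-space trick is needed here, which is the robustness argument the lesson makes for parsing over string methods.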
sources/academy/webscraping/scraping_basics_python/06_locating_elements.md

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 ---
 title: Locating HTML elements with Python
 sidebar_label: Locating HTML elements
-description: TODO
+description: Lesson about building a Python application for watching prices. Using the Beautiful Soup library to locate products on the product listing page.
 sidebar_position: 6
 slug: /scraping-basics-python/locating-elements
 ---
2.24 MB