Commit 8d5114f

feat: finish the parsing lesson
1 parent 29acc1b commit 8d5114f

File tree

5 files changed

+168
-27
lines changed


sources/academy/webscraping/scraping_basics_python/04_downloading_html.md

Lines changed: 8 additions & 6 deletions
@@ -10,7 +10,7 @@ slug: /scraping-basics-python/downloading-html

---

-Using browser tools for developers is crucial for understanding structure of a particular page, but it's a manual task. Now let's start building our first automation, a Python program which downloads HTML code of the product listing.
+Using browser tools for developers is crucial for understanding the structure of a particular page, but it's a manual task. Let's start building our first automation, a Python program which downloads the HTML code of the product listing.

## Starting a Python project

@@ -28,7 +28,7 @@ Being comfortable around Python project setup and installing packages is a prere

:::

-Now let's test that all works. In the project directory create a new file called `main.py` with the following code:
+Now let's test that everything works. Inside the project directory create a new file called `main.py` with the following code:

```python
import httpx
@@ -135,10 +135,10 @@ https://warehouse-theme-metal.myshopify.com/does/not/exist

We could check the value of `response.status_code` against a list of allowed numbers, but HTTPX also provides `response.raise_for_status()`, a method which analyzes the number and raises the `httpx.HTTPError` exception in case our request wasn't successful.

-A robust scraper skips or retries requests when errors occur, but we'll start simple. Our program will print an error message and stop further processing of the response.
+A robust scraper skips or retries requests when errors occur, but let's start simple. Our program will print an error message and stop further processing of the response.

-We also want to play along with the conventions of the operating system, so let's print to the [standard error output](https://en.wikipedia.org/wiki/Standard_streams#Standard_error_(stderr)) and exit our program with non-zero [status code](https://en.wikipedia.org/wiki/Exit_status):
+We also want to play along with the conventions of the operating system, so we'll print to the [standard error output](https://en.wikipedia.org/wiki/Standard_streams#Standard_error_(stderr)) and exit our program with a non-zero [status code](https://en.wikipedia.org/wiki/Exit_status):

```python
import sys
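Editor's note: the stderr-and-exit convention this hunk describes can be sketched with the standard library alone. The `fail` helper below is our illustration, not code from the lesson:

```python
import sys

def fail(message, code=1):
    """Print an error message to standard error and exit with a non-zero status."""
    print(message, file=sys.stderr)
    # sys.exit() raises SystemExit; the process terminates with the given code
    sys.exit(code)
```

A shell can then read the exit status with `echo $?`, so other scripts and tools can detect that the scraper failed.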
@@ -165,9 +165,11 @@ For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/St

Done! We have managed to apply basic error handling. Now let's get back to our primary goal. In the next lesson, we'll be looking for a way to extract information about products from the downloaded HTML.

+---
+
## Exercises

-These challenges should help you verify that you can apply knowledge acquired in this lesson. Resist the temptation to look at the solutions right away. Learn by doing, not by copying and pasting!
+These challenges are here to help you test what you’ve learned in this lesson. Try to resist the urge to peek at the solutions right away. Remember, the best learning happens when you dive in and do it yourself!

### Scrape Amazon

@@ -214,7 +216,7 @@ https://warehouse-theme-metal.myshopify.com/collections/sales
$ python main.py > products.html
```

-If you want to use Python, it offers several ways how to create files. The solution below uses [pathlib](https://docs.python.org/3/library/pathlib.html):
+If you want to use Python instead, it offers several ways to create files. The solution below uses [pathlib](https://docs.python.org/3/library/pathlib.html):

```python
import sys
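Editor's note: the hunk above cuts off right after `import sys`, so for context, a minimal sketch of the pathlib approach it mentions might look like this (the filename and the placeholder HTML string are ours, for illustration only):

```python
from pathlib import Path

html_code = "<html>...</html>"  # placeholder for the downloaded HTML

# write_text() opens the file, writes the string, and closes it in one call
Path("products.html").write_text(html_code, encoding="utf-8")
```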

sources/academy/webscraping/scraping_basics_python/05_parsing_html.md

Lines changed: 150 additions & 14 deletions
@@ -6,19 +6,21 @@ sidebar_position: 5
slug: /scraping-basics-python/parsing-html
---

-:::danger Work in progress
+**In this lesson we'll look for products in the downloaded HTML. We'll use Beautiful Soup to turn the HTML into objects which we can work with in our Python program.**

-This course is incomplete. As we work on adding new lessons, we would love to hear your feedback. You can comment right here under each page or [file a GitHub Issue](https://github.com/apify/apify-docs/issues) to discuss a problem.
+---

-This lesson contains just a fraction of what it should contain. In the end, the current content might get rewritten. Everything on this page is a subject to change!
+From previous lessons we know that the HTML tags representing individual products have a `class` attribute which, among other values, contains `product-item`.

-:::
+![Products have the ‘product-item’ class](./images/collection-class.png)
+
+As a first step, let's try counting how many products are in the listing.

## Treating HTML as a string

-Yay! The entire HTML is now available in our program as a string. For now, we are just printing it to the screen, but once it's a string, we can manipulate it using any Python string operations.
+Currently, the entire HTML is available in our program as a string. Our program can print it to the screen or save it to a file, but not much more. Can we use Python string operations to count the products? Each string has `.count()`, a [method for counting substrings](https://docs.python.org/3/library/stdtypes.html#str.count).

-Let's try counting how many products is in the listing. Manually inspecting the page in browser developer tools, we can see that HTML code of each product has roughly the following structure:
+After manually inspecting the page in browser DevTools, we can see that each product has the following structure:

```html
<div class="product-item product-item--vertical ...">
@@ -31,37 +33,171 @@ Let's try counting how many products is in the listing. Manually inspecting the
</div>
```

-At first sight, counting `product-item` occurances wouldn't match only products. Let's try looking for `<div class="product-item`, a substring which represents the enitre beginning of each product tag. Because the substring contains a double quote character, we need to use single quotes as string boundaries.
+At first sight, counting `product-item` occurrences wouldn't match only products. Let's try looking for `<div class="product-item`, a substring which represents the entire beginning of each product tag. Because the substring contains a double quote character, we need to use single quotes as string boundaries. Replace your program with the following code:

```python
import httpx

url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
response = httpx.get(url)
+response.raise_for_status()

html_code = response.text
count = html_code.count('<div class="product-item')
print(count)
```

-Unfortunately, this doesn't seem to be sufficient. Running the code above prints 123, which is a suspiciously high number. It seems there are more div elements with class names starting with `product-item`.
+:::info Handling errors
+
+To keep the code examples concise, we're omitting error handling for now. Keeping `response.raise_for_status()` ensures that your program at least crashes and prints what happened in case there's an error.
+
+:::
+
+Unfortunately, this doesn't seem to be sufficient. Running the code above prints 123, which is a suspiciously high number. It seems there are more `div` tags with class names starting with `product-item`.
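Editor's note: the 123-vs-24 discrepancy is easy to reproduce on a toy snippet. This is our own illustration (not code from the lesson) of why a trailing space in the substring matters:

```python
# A miniature version of the listing's markup (our own sample, not the real page)
html_code = """
<div class="product-item product-item--vertical">
  <div class="product-item__info">Info</div>
</div>
"""

# Without the trailing space, the substring also matches `product-item__info`
print(html_code.count('<div class="product-item'))   # prints 2
# With the trailing space, only the real product wrapper matches
print(html_code.count('<div class="product-item '))  # prints 1
```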

On closer look at the HTML, our substring matches also tags like `<div class="product-item__info">`. What if we force our code to count only those with a space after the class name?

```python
count = html_code.count('<div class="product-item ')
```

-Now our program prints number 24, which is in line with the text **Showing 1–24 of 50 products** above the product listing.
+Now it prints 24, which is in line with the text **Showing 1–24 of 50 products** above the product listing. Phew, that was tedious!

<!-- TODO image -->

-Oof, that was tedious! While successful, we can see that processing HTML with [standard string methods](https://docs.python.org/3/library/stdtypes.html#string-methods) is difficult and fragile. Imagine we wouldn't be just counting, but trying to get titles and prices.
+While possible, we can see that processing HTML with [standard string methods](https://docs.python.org/3/library/stdtypes.html#string-methods) is difficult and fragile. Imagine if we weren't just counting, but trying to get the titles and prices.
+
+In fact, HTML can be so complex that even [regular expressions](https://docs.python.org/3/library/re.html) aren't able to process it reliably. To work with HTML we need a robust tool dedicated to the task.
+
+:::tip Why regex can't parse HTML
+
+While [Bobince's infamous StackOverflow answer](https://stackoverflow.com/a/1732454/325365) is funny, it doesn't go much into explaining. In formal language theory, HTML's hierarchical and nested structure makes it a [context-free language](https://en.wikipedia.org/wiki/Context-free_language). Regular expressions match patterns in [regular languages](https://en.wikipedia.org/wiki/Regular_language), which are much simpler. This difference makes it hard for a regex to handle HTML's nested tags. HTML's complex syntax rules and various edge cases also add to the difficulty.
+
+:::
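Editor's note: the tip above can be demonstrated in a few lines. This toy example (ours, not from the lesson) shows a non-greedy regex cutting a nested element short:

```python
import re

# Two nested divs; a real HTML parser would pair the tags correctly
html = '<div class="product">outer <div>inner</div> tail</div>'

# The non-greedy match stops at the FIRST closing tag it meets,
# so the outer element's content is truncated mid-way
match = re.search(r'<div class="product">(.*?)</div>', html)
print(match.group(1))  # prints 'outer <div>inner' (the ' tail' part is lost)
```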
+
+## Using HTML parser
+
+An HTML parser takes a text with HTML markup and turns it into a tree of Python objects. We'll choose Beautiful Soup as our parser, as it's a popular library renowned for its ability to process even non-standard, broken markup. This is useful for scraping, because real-world websites often contain all sorts of errors and discrepancies.
+
+```text
+$ pip install beautifulsoup4
+...
+Successfully installed beautifulsoup4-4.0.0 soupsieve-0.0
+```
+
+Now let's use it for parsing the HTML:
+
+```python
+import httpx
+from bs4 import BeautifulSoup
+
+url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
+response = httpx.get(url)
+response.raise_for_status()
+
+html_code = response.text
+soup = BeautifulSoup(html_code, "html.parser")
+print(soup.title)
+```
+
+The `BeautifulSoup` object contains our HTML, but unlike a plain string, it allows us to work with the HTML elements in a structured way. As a demonstration, we use the shorthand `.title` for accessing the HTML `<title>` tag. Let's run the program:
+
+```text
+$ python main.py
+<title>Sales
+</title>
+```
+
+That looks promising! What if we want just the contents of the tag? Let's change the print line to the following:
+
+```python
+print(soup.title.text)
+```
+
+If we run our scraper again, it prints just the actual text of the `<title>` tag:

-In fact HTML can be so complex that even [regular expressions](https://docs.python.org/3/library/re.html) aren't able to process it reliably. In the next lesson we'll meet a tool dedicated for the task, a HTML parser.
+```text
+$ python main.py
+Sales
+```
+
+## Using CSS selectors
+
+Beautiful Soup offers a `.select()` method, which runs a CSS selector against a parsed HTML document and returns all the matching elements. Scanning through [usage examples](https://beautiful-soup-4.readthedocs.io/en/latest/#css-selectors) will help us figure out code for counting the products:
+
+```python
+import httpx
+from bs4 import BeautifulSoup
+
+url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
+response = httpx.get(url)
+response.raise_for_status()
+
+html_code = response.text
+soup = BeautifulSoup(html_code, "html.parser")
+products = soup.select(".product-item")
+print(len(products))
+```
+
+In CSS, `.product-item` selects all elements whose `class` attribute contains the value `product-item`. We call `soup.select()` with the selector and get back a list of matching elements. On the last line, we use `len()` to count how many items are in the list. That's it!
+
+```text
+$ python main.py
+24
+```
+
+We have managed to download a product listing, parse its HTML, and count how many products it contains. In the next lesson, we'll be looking for a way to extract detailed information about individual products.
+
+---

## Exercises

-- One
-- Two
-- Three
+These challenges are here to help you test what you’ve learned in this lesson. Try to resist the urge to peek at the solutions right away. Remember, the best learning happens when you dive in and do it yourself!
+
+### Scrape F1 teams
+
+Print a total count of F1 teams listed on this page:
+
+```text
+https://www.formula1.com/en/teams
+```
+
+<details>
+<summary>Solution</summary>
+
+```python
+import httpx
+from bs4 import BeautifulSoup
+
+url = "https://www.formula1.com/en/teams"
+response = httpx.get(url)
+response.raise_for_status()
+
+html_code = response.text
+soup = BeautifulSoup(html_code, "html.parser")
+print(len(soup.select(".outline")))
+```
+
+</details>
+
+### Scrape F1 drivers
+
+Use the same URL as in the previous exercise, but this time print a total count of F1 drivers.
+
+<details>
+<summary>Solution</summary>
+
+```python
+import httpx
+from bs4 import BeautifulSoup
+
+url = "https://www.formula1.com/en/teams"
+response = httpx.get(url)
+response.raise_for_status()
+
+html_code = response.text
+soup = BeautifulSoup(html_code, "html.parser")
+print(len(soup.select(".f1-grid")))
+```
+
+</details>
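Editor's note: the `"html.parser"` argument used throughout this lesson names Python's built-in parser, which Beautiful Soup drives under the hood. For the curious, the counting task can be done with that stdlib module directly, though far less conveniently. The sample HTML below is our own, not the real page:

```python
from html.parser import HTMLParser

class ProductCounter(HTMLParser):
    """Counts div elements whose class attribute contains 'product-item'."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; class is a space-separated string
        classes = dict(attrs).get("class", "").split()
        if tag == "div" and "product-item" in classes:
            self.count += 1

parser = ProductCounter()
parser.feed("""
<div class="product-item product-item--vertical">A</div>
<div class="product-item__info">not a product</div>
<div class="product-item">B</div>
""")
print(parser.count)  # prints 2
```

The event-driven stdlib API makes even this simple task verbose, which is a good argument for the tree-based interface Beautiful Soup provides.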
(2 binary image files changed: 506 KB and 89.4 KB)

sources/academy/webscraping/scraping_basics_python/index.md

Lines changed: 10 additions & 7 deletions
@@ -14,16 +14,18 @@ import DocCardList from '@theme/DocCardList';

:::danger Work in progress

-This course is incomplete. As we work on adding new lessons, we would love to hear your feedback. You can comment right here under each page or [file a GitHub Issue](https://github.com/apify/apify-docs/issues) to discuss a problem.
+This course is incomplete. As we work on adding new lessons, we would love to hear your feedback. Comment right here under each page or [file a GitHub Issue](https://github.com/apify/apify-docs/issues) to discuss a problem.

:::

-In this course we'll use Python to create an application for watching prices. It'll be able to scrape all product pages of an e-commerce website and record prices. Data sets from several rans of such program would be useful for seeing trends in price changes, detecting discounts, etc.
+In this course we'll use Python to create an application for watching prices. It'll be able to scrape all product pages of an e-commerce website and record prices. Data from several runs of such a program would be useful for seeing trends in price changes, detecting discounts, etc.

<!--
TODO image of warehouse with some CVS or JSON exported, similar to sources/academy/webscraping/scraping_basics_javascript/images/beginners-data-collection.png, which is for some reason the same as sources/academy/webscraping/scraping_basics_javascript/images/beginners-data-extraction.png
-->

+![E-commerce listing on the left, JSON with data on the right](./images/scraping.png)
+
## What you'll do

- Inspect pages using browser DevTools
@@ -42,7 +44,7 @@ Anyone with basic knowledge of developing programs in Python who wants to start

- macOS, Linux or Windows machine with a web browser and Python installed
- Familiar with Python basics: variables, conditions, loops, functions, strings, lists, dictionaries, files, classes, exceptions
- Comfortable importing from the Python standard library, using virtual environments, and installing dependencies with `pip`
-- Running commands in Terminal or Command Prompt
+- Familiar with how to run commands in Terminal or Command Prompt

## You may want to know

@@ -52,23 +54,24 @@ Let's explore the key reasons to take this course. What is web scraping good for

The internet is full of useful data, but most of it isn't offered in a structured way that is easy to process programmatically. That's why you need scraping, a set of approaches to download websites and extract data from them.

-Scraper development is also a fun and challenging way to learn web development, web technologies, and understand the internet. You'll reverse-engineer websites, understand how they work internally, discover what technologies they use, and learn how they communicate with their servers. You'll also master your chosen programming language and core programming concepts. Understanding web scraping gives you a head start in learning web technologies such as HTML, CSS, JavaScript, frontend frameworks (like React or Next.js), HTTP, REST APIs, GraphQL APIs, and more.
+Scraper development is also a fun and challenging way to learn web development, web technologies, and understand the internet. You'll reverse-engineer websites, understand how they work internally, discover what technologies they use, and learn how they communicate with servers. You'll also master your chosen programming language and core programming concepts. Understanding web scraping gives you a head start in learning web technologies such as HTML, CSS, JavaScript, frontend frameworks (like React or Next.js), HTTP, REST APIs, GraphQL APIs, and more.

### Why build your own scrapers

Scrapers are programs specifically designed to mine data from the internet. Point-and-click or no-code scraping solutions do exist, but they only take you so far. While simple to use, they lack the flexibility and optimization needed to handle advanced cases. Only custom-built scrapers can tackle more difficult challenges. And unlike ready-made solutions, they can be fine-tuned to perform tasks more efficiently, at a lower cost, or with greater precision.

### Why become a scraper dev

-As a scraper developer, you are not limited by whether certain data is available programmatically through an official API—the entire web becomes your API. Here are some things you can do if you understand scraping:
+As a scraper developer, you are not limited by whether certain data is available programmatically through an official API—the entire web becomes your API! Here are some things you can do if you understand scraping:

- Improve your productivity by building personal tools, such as your own real estate or rare sneakers watchdog.
-- Companies can hire you to build custom scrapers to mine data important for their business.
+- Companies can hire you to build custom scrapers mining data important for their business.
+- Become an invaluable asset to data journalism, data science, or nonprofit teams working to make the world a better place.
- You can publish your scrapers on platforms like the [Apify Store](https://apify.com/store) and earn money by renting them out to others.

### Why learn with Apify

-We are [Apify](https://apify.com), a web scraping and automation platform, but we built this course on top of open source technologies. The skills you can learn are applicable to any scraping project, and you'll be able to run your scrapers on any computer. We will show you how scraping platforms can simplify your life, but those lessons are optional and designed to fit within our [free tier](https://apify.com/pricing).
+We are [Apify](https://apify.com), a web scraping and automation platform. We did our best to build this course on top of open source technologies. That means what you learn applies to any scraping project, and you'll be able to run your scrapers on any computer. We will show you how scraping platforms can simplify your life, but those lessons are optional and designed to fit within our [free tier](https://apify.com/pricing).

## Course content