
Commit 8293af8
feat: add one more lesson
1 parent 8685196

4 files changed: +237 -30 lines changed

sources/academy/webscraping/scraping_basics_python/04_downloading_html.md

Lines changed: 7 additions & 7 deletions
@@ -30,7 +30,7 @@ Being comfortable around Python project setup and installing packages is a prere
 
 Now let's test that all works. Inside the project directory create a new file called `main.py` with the following code:
 
-```python
+```py
 import httpx
 
 print("OK")
@@ -53,7 +53,7 @@ If you see errors or for any other reason cannot run the code above, we're sorry
 
 Now onto coding! Let's change our code so it downloads HTML of the product listing instead of printing OK. The [documentation of the HTTPX library](https://www.python-httpx.org/) provides us with examples how to use it. Inspired by those, our code will look like this:
 
-```python
+```py
 import httpx
 
 url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
@@ -106,7 +106,7 @@ Sometimes websites return all kinds of errors. Most often because:
 
 In HTTP, each response has a three-digit _status code_, which tells us whether it's an error or success. Let's change the last line of our program to print the code of the response we get:
 
-```python
+```py
 print(response.status_code)
 ```
 
@@ -140,7 +140,7 @@ A robust scraper skips or retries requests when errors occur, but let's start si
 
 We also want to play along with the conventions of the operating system, so we'll print to the [standard error output](https://en.wikipedia.org/wiki/Standard_streams#Standard_error_(stderr)) and exit our program with a non-zero [status code](https://en.wikipedia.org/wiki/Exit_status):
 
-```python
+```py
 import sys
 import httpx
 
@@ -182,7 +182,7 @@ https://www.amazon.com/s?k=darth+vader
 <details>
 <summary>Solution</summary>
 
-```python
+```py
 import sys
 import httpx
 
@@ -218,7 +218,7 @@ https://warehouse-theme-metal.myshopify.com/collections/sales
 
 If you want to use Python instead, it offers several ways how to create files. The solution below uses [pathlib](https://docs.python.org/3/library/pathlib.html):
 
-```python
+```py
 import sys
 import httpx
 from pathlib import Path
@@ -249,7 +249,7 @@ https://warehouse-theme-metal.myshopify.com/cdn/shop/products/sonyxbr55front_f72
 
 Python offers several ways how to create files. The solution below uses [pathlib](https://docs.python.org/3/library/pathlib.html):
 
-```python
+```py
 from pathlib import Path
 import sys
 import httpx

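For context, the pathlib-based saving step that the exercise solutions above rely on can be sketched in isolation. This is a minimal standalone sketch, not the lesson's actual solution: the placeholder string and the `products.html` filename stand in for the downloaded `response.text` and whatever path the exercise uses.

```python
from pathlib import Path

# Placeholder standing in for the HTML downloaded via httpx in the lesson.
html_code = "<!DOCTYPE html><html><body>Sales</body></html>"

# pathlib writes the whole string to disk in one call.
path = Path("products.html")
path.write_text(html_code)

# Reading it back confirms the file contains the same markup.
print(path.read_text())
```

The same `Path` object also handles reading, so no explicit `open()`/`close()` bookkeeping is needed.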
sources/academy/webscraping/scraping_basics_python/05_parsing_html.md

Lines changed: 6 additions & 6 deletions
@@ -37,7 +37,7 @@ At first sight, counting `product-item` occurances wouldn't match only products,
 
 We could try looking for `<div class="product-item`, a substring which represents the enitre beginning of each product tag, but that would also count `<div class="product-item__info`! We'll need to add a space after the class name to avoid matching those. Replace your program with the following code:
 
-```python
+```py
 import httpx
 
 url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
@@ -92,7 +92,7 @@ Now let's use it for parsing the HTML. Unlike plain string, the `BeautifulSoup`
 
 Update your code to the following:
 
-```python
+```py
 import httpx
 from bs4 import BeautifulSoup
 
@@ -114,7 +114,7 @@ $ python main.py
 
 Our code lists all `<h1>` tags it can find on the page. It's the case that there's just one, so in the result we can see a list with a single item. What if we want to print just the text? Let's change the end of the program to the following:
 
-```python
+```py
 headings = soup.select("h1")
 first_heading = headings[0]
 print(first_heading.text)
@@ -133,7 +133,7 @@ Beautiful Soup's `.select()` method runs a _CSS selector_ against a parsed HTML
 
 Scanning through [usage examples](https://beautiful-soup-4.readthedocs.io/en/latest/#css-selectors) will help us to figure out code for counting the product cards:
 
-```python
+```py
 import httpx
 from bs4 import BeautifulSoup
 
@@ -173,7 +173,7 @@ https://www.formula1.com/en/teams
 <details>
 <summary>Solution</summary>
 
-```python
+```py
 import httpx
 from bs4 import BeautifulSoup
 
@@ -195,7 +195,7 @@ Use the same URL as in the previous exercise, but this time print a total count
 <details>
 <summary>Solution</summary>
 
-```python
+```py
 import httpx
 from bs4 import BeautifulSoup
 
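The substring-counting idea this file's first hunk touches can be illustrated without any network access. This is a toy sketch: the HTML snippet is made up to mimic the store's markup, not taken from the real page.

```python
# Made-up markup mimicking the store's product listing structure.
html_code = """
<div class="product-item product-item--vertical">A</div>
<div class="product-item__info">meta</div>
<div class="product-item product-item--vertical">B</div>
"""

# Counting `<div class="product-item` alone also matches the
# `product-item__info` wrapper; the trailing space excludes it.
naive = html_code.count('<div class="product-item')
precise = html_code.count('<div class="product-item ')
print(naive, precise)  # 3 2
```

This fragility is exactly why the lesson then moves from substring counting to a real HTML parser.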
sources/academy/webscraping/scraping_basics_python/06_locating_elements.md

Lines changed: 21 additions & 13 deletions
@@ -12,7 +12,7 @@ slug: /scraping-basics-python/locating-elements
 
 In the previous lesson we've managed to print text of the page's main heading or count how many products is in the listing. Let's combine those two—what happens if we print `.text` for each product card?
 
-```python
+```py
 import httpx
 from bs4 import BeautifulSoup
 
@@ -62,7 +62,7 @@ As in the browser DevTools lessons, we need to change the code so that it locate
 
 We should be looking for elements which have the `product-item__title` and `price` classes. We already know how that translates to CSS selectors:
 
-```python
+```py
 import httpx
 from bs4 import BeautifulSoup
 
@@ -73,10 +73,10 @@ response.raise_for_status()
 html_code = response.text
 soup = BeautifulSoup(html_code, "html.parser")
 for product in soup.select(".product-item"):
-    titles = product.select('.product-item__title')
+    titles = product.select(".product-item__title")
     first_title = titles[0].text
 
-    prices = product.select('.price')
+    prices = product.select(".price")
     first_price = prices[0].text
 
     print(first_title, first_price)
@@ -103,7 +103,7 @@ There's still some room for improvement, but it's already much better!
 
 Often, we want to assume in our code that a certain element exists only once. It's a bit tedious to work with lists when you know you're looking for a single element. For this purpose, Beautiful Soup offers a `.select_one()` method. Like `document.querySelector()` in browser DevTools, it returns just one result or none. Let's simplify our code!
 
-```python
+```py
 import httpx
 from bs4 import BeautifulSoup
 
@@ -114,8 +114,8 @@ response.raise_for_status()
 html_code = response.text
 soup = BeautifulSoup(html_code, "html.parser")
 for product in soup.select(".product-item"):
-    title = product.select_one('.product-item__title').text
-    price = product.select_one('.price').text
+    title = product.select_one(".product-item__title").text
+    price = product.select_one(".price").text
     print(title, price)
 ```
 
@@ -131,7 +131,7 @@ In the output we can see that the price isn't located precisely. For each produc
 $74.95
 </span>
 ```
-When translated to a tree of Python objects, the element with class `price` will contain several nodes:
+When translated to a tree of Python objects, the element with class `price` will contain several _nodes_:
 
 - Textual node with white space,
 - a `span` HTML element,
@@ -140,12 +140,12 @@ When translated to a tree of Python objects, the element with class `price` will
 We can use Beautiful Soup's `.contents` property to access individual nodes. It returns a list of nodes like this:
 
 ```
-['\n', <span class="visually-hidden">Sale price</span>, '$74.95']
+["\n", <span class="visually-hidden">Sale price</span>, "$74.95"]
 ```
 
 It seems like we can read the last element to get the actual amount from a list like the above. Let's fix our program:
 
-```python
+```py
 import httpx
 from bs4 import BeautifulSoup
 

@@ -156,12 +156,12 @@ response.raise_for_status()
 html_code = response.text
 soup = BeautifulSoup(html_code, "html.parser")
 for product in soup.select(".product-item"):
-    title = product.select_one('.product-item__title').text
-    price = product.select_one('.price').contents[-1]
+    title = product.select_one(".product-item__title").text
+    price = product.select_one(".price").contents[-1]
     print(title, price)
 ```
 
-If we run our program now, it should print prices just as the actual amounts:
+If we run the scraper now, it should print prices as only amounts:
 
 ```text
 $ python main.py
@@ -173,3 +173,11 @@ Sony PS-HX500 Hi-Res USB Turntable $398.00
 ```
 
 Great! We have managed to use CSS selectors and walk the HTML tree to get a list of product titles and prices. But wait a second—what's `From $1,398.00`? One does not simply scrape a price! We'll need to clean that. But that's a job for the next lesson, which is about extracting data.
+
+---
+
+## Exercises
+
+These challenges are here to help you test what you’ve learned in this lesson. Try to resist the urge to peek at the solutions right away. Remember, the best learning happens when you dive in and do it yourself!
+
+TODO
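The `.contents` indexing trick this lesson's hunks adopt can be sketched without Beautiful Soup. This is a toy stand-in: the real `.contents` list holds parsed tag objects, not plain strings, but the negative-indexing idea is the same.

```python
# Stand-in for what Beautiful Soup's .contents returns for the price element:
# a whitespace text node, a child element, and the text node with the amount.
contents = ["\n", '<span class="visually-hidden">Sale price</span>', "$74.95"]

# The amount is the last node, so negative indexing picks it out.
price = contents[-1]
print(price)  # $74.95
```

Relying on position like this is brittle if the markup changes, which is why the lesson flags the remaining cleanup (such as `From $1,398.00`) for the next lesson on extracting data.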
