If you get `Server error '503 Service Unavailable'`, that's just Amazon's anti-scraping protections. You can learn about how to overcome those in our [Anti-scraping protections](../anti_scraping/index.md) course.
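In practice, such 503 responses can often be smoothed over with a simple retry loop before reaching for heavier tools. A minimal sketch, assuming the course's `httpx` client; the `fetch_with_retries()` helper is hypothetical, not part of the lesson code:

```python
import time

import httpx

def fetch_with_retries(url, attempts=3):
    """Hypothetical helper: retry on HTTP 503 with exponential backoff."""
    for attempt in range(attempts):
        response = httpx.get(url)
        if response.status_code != 503:
            response.raise_for_status()
            return response
        time.sleep(2 ** attempt)  # wait 1 s, 2 s, 4 s, ... between attempts
    response.raise_for_status()  # still 503 after all attempts: raise
```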
**File:** sources/academy/webscraping/scraping_basics_python/06_locating_elements.md (112 additions, 1 deletion)
````diff
@@ -22,6 +22,7 @@ response.raise_for_status()
 
 html_code = response.text
 soup = BeautifulSoup(html_code, "html.parser")
+
 for product in soup.select(".product-item"):
     print(product.text)
 ```
````
````diff
@@ -72,6 +73,7 @@ response.raise_for_status()
 
 html_code = response.text
 soup = BeautifulSoup(html_code, "html.parser")
+
 for product in soup.select(".product-item"):
     titles = product.select(".product-item__title")
     first_title = titles[0].text
````
````diff
@@ -113,6 +115,7 @@ response.raise_for_status()
 
 html_code = response.text
 soup = BeautifulSoup(html_code, "html.parser")
+
 for product in soup.select(".product-item"):
     title = product.select_one(".product-item__title").text
     price = product.select_one(".price").text
````
````diff
@@ -156,6 +159,7 @@ response.raise_for_status()
 
 html_code = response.text
 soup = BeautifulSoup(html_code, "html.parser")
+
 for product in soup.select(".product-item"):
     title = product.select_one(".product-item__title").text
     price = product.select_one(".price").contents[-1]
````
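The `.contents[-1]` in this hunk is easy to gloss over: `.contents` is Beautiful Soup's list of an element's child nodes, and the last one here is the bare text node holding the price. A tiny illustration with made-up HTML resembling the store's markup (the `visually-hidden` label is an assumption):

```python
from bs4 import BeautifulSoup

html = '<span class="price"><span class="visually-hidden">Sale price</span>$74.95</span>'
soup = BeautifulSoup(html, "html.parser")

price = soup.select_one(".price")
print(price.text)          # Sale price$74.95 (includes the hidden label)
print(price.contents[-1])  # $74.95 (just the last child, a text node)
```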
````diff
@@ -181,4 +185,111 @@ Great! We have managed to use CSS selectors and walk the HTML tree to get a list
 
 These challenges are here to help you test what you’ve learned in this lesson. Try to resist the urge to peek at the solutions right away. Remember, the best learning happens when you dive in and do it yourself!
 
-TODO
+### Scrape Wikipedia
+
+Download Wikipedia's page with the list of African countries, use Beautiful Soup to parse it, and print the short English names of all the states and territories mentioned in all tables. This is the URL:
…
+Because some rows contain [table headers](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/th), we skip processing a row if `table_row.select("td")` doesn't find any [table data](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/td) cells.
+
+</details>
+
+### Use CSS selectors to their max
+
+Simplify the code from the previous exercise. Use a single for loop and a single CSS selector. You may want to check out the following pages:
…
````
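Neither exercise's full solution survives in this excerpt, but a sketch of the Wikipedia one, applying the row-skipping check quoted above, might look as follows. The URL, the `.wikitable` class, and the choice of cell index are all assumptions about the page, not taken from the hunk:

```python
import httpx
from bs4 import BeautifulSoup

# Assumed URL: the exercise's actual URL block is elided from this diff.
url = "https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_in_Africa"
response = httpx.get(url)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
for table in soup.select(".wikitable"):
    for table_row in table.select("tr"):
        cells = table_row.select("td")
        # Rows made of <th> header cells have no <td> cells, so skip them.
        if cells:
            # Which column holds the short name is a guess; adjust for the real table.
            print(cells[-1].text.strip())
```

For the follow-up exercise, the same filtering can plausibly collapse into one selector, e.g. `soup.select(".wikitable tr td:last-child")`, since `td:last-child` never matches header-only rows.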
**File:** sources/academy/webscraping/scraping_basics_python/07_extracting_data.md (135 additions, 1 deletion)
````diff
@@ -71,6 +71,7 @@ response.raise_for_status()
 
 html_code = response.text
 soup = BeautifulSoup(html_code, "html.parser")
+
 for product in soup.select(".product-item"):
     title = product.select_one(".product-item__title").text
 
````
````diff
@@ -171,6 +172,7 @@ response.raise_for_status()
 
 html_code = response.text
 soup = BeautifulSoup(html_code, "html.parser")
+
 for product in soup.select(".product-item"):
     title = product.select_one(".product-item__title").text.strip()
 
````
````diff
@@ -211,4 +213,136 @@ Well, not to spoil the excitement, but in its current form, the data isn't very
 
 These challenges are here to help you test what you’ve learned in this lesson. Try to resist the urge to peek at the solutions right away. Remember, the best learning happens when you dive in and do it yourself!
 
-TODO
+### Scrape units on stock
+
+Change our scraper so that it extracts how many units of each product are in stock. Your program should print the following. Note the unit amounts at the end of each line:
+
+```text
+JBL Flip 4 Waterproof Portable Bluetooth Speaker 672
+Sony XBR-950G BRAVIA 4K HDR Ultra HD TV 77
+Sony SACS9 10" Active Subwoofer 7
+Sony PS-HX500 Hi-Res USB Turntable 15
+Klipsch R-120SW Powerful Detailed Home Speaker - Unit 0
…
````
````diff
+    title = product.select_one(".product-item__title").text.strip()
+
+    units_text = (
+        product
+        .select_one(".product-item__inventory")
+        .text
+        .removeprefix("In stock,")
+        .removeprefix("Only")
+        .removesuffix(" left")
+        .removesuffix("units")
+        .strip()
+    )
+    if "Sold out" in units_text:
+        units = 0
+    else:
+        units = int(units_text)
+
+    print(title, units)
+```
````
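The chain of `removeprefix()`/`removesuffix()` calls in that solution is order-sensitive, and it may help to trace it on sample inventory strings (the exact strings are assumptions about the store's markup):

```python
def parse_units(text):
    # Mirrors the solution's chain: strip known prefixes/suffixes, then whitespace.
    return (
        text
        .removeprefix("In stock,")
        .removeprefix("Only")
        .removesuffix(" left")
        .removesuffix("units")
        .strip()
    )

print(parse_units("In stock, 672 units"))  # '672'
print(parse_units("Only 7 left"))          # '7'
print(parse_units("Sold out"))             # 'Sold out'
```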
````diff
+
+</details>
+
+### Use regular expressions
+
+Simplify the code from the previous exercise. Use [regular expressions](https://docs.python.org/3/library/re.html) to parse the number of units. You can match digits using a range like `[0-9]` or the special sequence `\d`. To match more characters of the same type, you can use `+`.
````
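A regex version of that parsing step could reduce the whole chain to a single search. A minimal sketch (the sample strings are again assumptions):

```python
import re

def parse_units(text):
    # \d+ grabs the first run of one or more digits, wherever it appears.
    if match := re.search(r"\d+", text):
        return int(match.group())
    return 0  # e.g. "Sold out" contains no digits

print(parse_units("In stock, 672 units"))  # 672
print(parse_units("Only 7 left"))          # 7
print(parse_units("Sold out"))             # 0
```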
````diff
…
+Download the Guardian's page with the latest F1 news and use Beautiful Soup to parse it. Print the titles and publish dates of all the listed articles. This is the URL:
+
+```text
+https://www.theguardian.com/sport/formulaone
+```
+
+Your program should print something like the following. Note the dates at the end of each line:
+
+```text
+Wolff confident Mercedes are heading to front of grid after Canada improvement 2024-06-10
+Frustrated Lando Norris blames McLaren team for missed chance 2024-06-09
+Max Verstappen wins Canadian Grand Prix: F1 – as it happened 2024-06-09
+...
+```
+
+Hints:
+
+- HTML's `<time>` tag can have an attribute `datetime`, which [contains data in a machine-readable format](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/time), such as ISO 8601.
+- Beautiful Soup gives you [access to attributes as if they were dictionary keys](https://beautiful-soup-4.readthedocs.io/en/latest/#attributes).
+- In Python, you can create `datetime` objects using `datetime.fromisoformat()`, a [built-in method for parsing ISO 8601 strings](https://docs.python.org/3/library/datetime.html#datetime.datetime.fromisoformat).
+- To get just the date part, you can call `.date()` on any `datetime` object.
````
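Pulling those hints together, a sketch of this exercise might look like the following. The `li` and `h3` selectors are guesses at the Guardian's markup; only the `<time datetime>` handling comes from the hints themselves:

```python
from datetime import datetime

import httpx
from bs4 import BeautifulSoup

url = "https://www.theguardian.com/sport/formulaone"
response = httpx.get(url)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
# Assumed markup: each listed article is an <li> containing <h3> and <time>.
for item in soup.select("li"):
    time_tag = item.select_one("time[datetime]")
    heading = item.select_one("h3")
    if time_tag and heading:
        # The datetime attribute holds an ISO 8601 string; on Python 3.11+,
        # fromisoformat() also accepts a trailing "Z".
        published = datetime.fromisoformat(time_tag["datetime"])
        print(heading.text.strip(), published.date())
```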