@@ -32,13 +31,13 @@ As mentioned in the previous lesson, before building a scraper, we need to under
32
31
33
32

34
33
35
-
The page displays a grid of product cards, each showing a product's name and picture. Open DevTools and locate the name of the **Sony SACS9 Active Subwoofer**. Highlight it in the **Elements** tab by clicking on it.
34
+
The page displays a grid of product cards, each showing a product's title and picture. Open DevTools and locate the title of the **Sony SACS9 Active Subwoofer**. Highlight it in the **Elements** tab by clicking on it.
36
35
37
-

36
+

38
37
39
38
Next, let's find all the elements containing details about this subwoofer—its price, number of reviews, image, and more.
40
39
41
-
In the **Elements** tab, move your cursor up from the `a` element containing the subwoofer's name. On the way, hover over each element until you highlight the entire product card. Alternatively, use the arrow-up key. The `div` element you land on is the **parent element**, and all nested elements are its **child elements**.
40
+
In the **Elements** tab, move your cursor up from the `a` element containing the subwoofer's title. On the way, hover over each element until you highlight the entire product card. Alternatively, use the arrow-up key. The `div` element you land on is the **parent element**, and all nested elements are its **child elements**.
42
41
43
42

It will return the HTML element for the first product card in the listing:
68
67
69
-

70
-
71
-
:::note About the missing semicolon
72
-
73
-
In the screenshot, there is a missing semicolon `;` at the end of the line. In JavaScript, semicolons are optional, so it doesn't make a difference here.
74
-
75
-
:::
68
+

76
69
77
70
CSS selectors can get quite complex, but the basics are enough to scrape most of the Warehouse store. Let's cover two simple types and how they can combine.
78
71
@@ -167,9 +160,9 @@ On English Wikipedia's [Main Page](https://en.wikipedia.org/wiki/Main_Page), use
167
160
1. Open the [Main Page](https://en.wikipedia.org/wiki/Main_Page).
168
161
1. Activate the element selection tool in your DevTools.
169
162
1. Click on several headings to examine the markup.
170
-
1. Notice that all headings are `h2` tags with the `mp-h2` class.
163
+
1. Notice that all headings are `h2` elements with the `mp-h2` class.
171
164
1. In the **Console**, execute `document.querySelectorAll('h2')`.
172
-
1. At the time of writing, this selector returns 8 headings. Each corresponds to a box, and there are no other `h2` tags on the page. Thus, the selector is sufficient as is.
165
+
1. At the time of writing, this selector returns 8 headings. Each corresponds to a box, and there are no other `h2` elements on the page. Thus, the selector is sufficient as is.
173
166
174
167
</details>
175
168
@@ -185,7 +178,7 @@ Go to Shein's [Jewelry & Accessories](https://shein.com/RecommendSelection/Jewel
185
178
1. Visit the [Jewelry & Accessories](https://shein.com/RecommendSelection/Jewelry-Accessories-sc-017291431.html) page. Close any pop-ups or promotions.
186
179
1. Activate the element selection tool in your DevTools.
187
180
1. Click on the first product to inspect its markup. Repeat with a few others.
188
-
1. Observe that all products are `section` tags with multiple classes, including `product-card`.
181
+
1. Observe that all products are `section` elements with multiple classes, including `product-card`.
189
182
1. Since `section` is a generic wrapper, focus on the `product-card` class.
190
183
1. In the **Console**, execute `document.querySelectorAll('.product-card')`.
191
184
1. At the time of writing, this selector returns 120 results, all representing products. No further narrowing is necessary.
@@ -206,7 +199,7 @@ Hint: Learn about the [descendant combinator](https://developer.mozilla.org/en-U
206
199
1. Open the [page about F1](https://www.theguardian.com/sport/formulaone).
207
200
1. Activate the element selection tool in your DevTools.
208
201
1. Click on an article to inspect its structure. Check several articles, including the ones with smaller cards.
209
-
1. Note that all articles are `li` tags, but their classes (e.g., `dcr-1qmyfxi`) are dynamically generated and unreliable.
202
+
1. Note that all articles are `li` elements, but their classes (e.g., `dcr-1qmyfxi`) are dynamically generated and unreliable.
210
203
1. Using `document.querySelectorAll('li')` returns too many results, including unrelated items like navigation links.
211
204
1. Inspect the page structure. The `main` element contains the primary content, including articles. Use the descendant combinator to target `li` elements within `main`.
212
205
1. In the **Console**, execute `document.querySelectorAll('main li')`.
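The same descendant combinator also works outside the browser. Below is a minimal sketch in Python, assuming the Beautiful Soup library is installed. The HTML snippet and the variable names are made up for illustration and are not the Guardian's real markup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: list items both inside and outside of `main`
html = """
<nav><ul><li>Home</li><li>Sport</li></ul></nav>
<main><ul><li>Article one</li><li>Article two</li></ul></main>
"""

soup = BeautifulSoup(html, "html.parser")

# A bare type selector matches every `li`, including the navigation links
all_items = soup.select("li")

# The descendant combinator keeps only `li` elements nested inside `main`
main_items = soup.select("main li")
article_titles = [item.text for item in main_items]

print(len(all_items), len(main_items))
print(article_titles)
```

With this sample markup, the plain `li` selector also picks up the navigation items, while `main li` narrows the result down to the two articles.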
@@ -127,7 +126,7 @@ On the Guardian's [F1 news page](https://www.theguardian.com/sport/formulaone),
127
126
1. Open the [F1 news page](https://www.theguardian.com/sport/formulaone).
128
127
1. Activate the element selection tool in your DevTools.
129
128
1. Click on the first post.
130
-
1. Notice that the markup does not provide clear, reusable class names for this task. The structure uses generic tags and randomized classes, requiring you to rely on the element hierarchy and order instead.
129
+
1. Notice that the markup does not provide clear, reusable class names for this task. The structure uses generic tag names and randomized classes, requiring you to rely on the element hierarchy and order instead.
131
130
1. In the **Console**, execute `post = document.querySelector('#maincontent ul li')`. This returns the element representing the first post.
132
131
1. Extract the post's title by executing `post.querySelector('h3').textContent`.
133
132
1. Extract the lead paragraph by executing `post.querySelector('span div').textContent`.
sources/academy/webscraping/scraping_basics_python/05_parsing_html.md
Lines changed: 7 additions & 8 deletions
@@ -2,7 +2,6 @@
2
2
title: Parsing HTML with Python
3
3
sidebar_label: Parsing HTML
4
4
description: Lesson about building a Python application for watching prices. Using the Beautiful Soup library to parse HTML code of a product listing page.
5
-
sidebar_position: 5
6
5
slug: /scraping-basics-python/parsing-html
7
6
---
8
7
@@ -12,7 +11,7 @@ import Exercises from './_exercises.mdx';
12
11
13
12
---
14
13
15
-
From lessons about browser DevTools we know that the HTML tags representing individual products have a `class` attribute which, among other values, contains `product-item`.
14
+
From lessons about browser DevTools we know that the HTML elements representing individual products have a `class` attribute which, among other values, contains `product-item`.
16
15
17
16

Now let's use it for parsing the HTML. The `BeautifulSoup` object allows us to work with the HTML elements in a structured way. As a demonstration, we'll first get the `<h1>` tag, which represents the main heading of the page.
40
+
Now let's use it for parsing the HTML. The `BeautifulSoup` object allows us to work with the HTML elements in a structured way. As a demonstration, we'll first get the `<h1>` element, which represents the main heading of the page.
Our code lists all `<h1>` tags it can find on the page. It's the case that there's just one, so in the result we can see a list with a single item. What if we want to print just the text? Let's change the end of the program to the following:
66
+
Our code lists all `h1` elements it can find on the page. It's the case that there's just one, so in the result we can see a list with a single item. What if we want to print just the text? Let's change the end of the program to the following:
68
67
69
68
```py
headings = soup.select("h1")
first_heading = headings[0]
print(first_heading.text)
```
74
73
75
-
If we run our scraper again, it prints the text of the first `<h1>` tag:
74
+
If we run our scraper again, it prints the text of the first `h1` element:
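To try this step in isolation, here is a self-contained sketch that parses a small inline document instead of downloading the real page. The HTML string and its contents are assumptions made for illustration. Only the `select()` call and the `.text` attribute come from the lesson itself.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for the downloaded product listing page
html = """
<html>
  <body>
    <h1>Sales</h1>
    <div class="product-item">Sony SACS9 Active Subwoofer</div>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# select() always returns a list, even if only one element matches
headings = soup.select("h1")
first_heading = headings[0]

# .text gives us the text content without the surrounding markup
heading_text = first_heading.text
print(heading_text)
```

Because `select()` returns a list, we index into it before reading `.text`. If a page had no `h1` at all, `headings[0]` would raise an `IndexError`, so a real scraper may want to check the list length first.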
sources/academy/webscraping/scraping_basics_python/06_locating_elements.md
Lines changed: 0 additions & 1 deletion
@@ -2,7 +2,6 @@
2
2
title: Locating HTML elements with Python
3
3
sidebar_label: Locating HTML elements
4
4
description: Lesson about building a Python application for watching prices. Using the Beautiful Soup library to locate products on the product listing page.
sources/academy/webscraping/scraping_basics_python/07_extracting_data.md
Lines changed: 1 addition & 2 deletions
@@ -2,7 +2,6 @@
2
2
title: Extracting data from HTML with Python
3
3
sidebar_label: Extracting data from HTML
4
4
description: Lesson about building a Python application for watching prices. Using string manipulation to extract and clean data scraped from the product listing page.
5
-
sidebar_position: 7
6
5
slug: /scraping-basics-python/extracting-data
7
6
---
8
7
@@ -313,7 +312,7 @@ Max Verstappen wins Canadian Grand Prix: F1 – as it happened 2024-06-09
313
312
314
313
Hints:
315
314
316
-
- HTML's `<time>` tag can have an attribute `datetime`, which [contains data in a machine-readable format](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/time), such as the ISO 8601.
315
+
- HTML's `time` element can have an attribute `datetime`, which [contains data in a machine-readable format](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/time), such as the ISO 8601.
317
316
- Beautiful Soup gives you [access to attributes as if they were dictionary keys](https://beautiful-soup-4.readthedocs.io/en/latest/#attributes).
318
317
- In Python you can create `datetime` objects using `datetime.fromisoformat()`, a [built-in method for parsing ISO 8601 strings](https://docs.python.org/3/library/datetime.html#datetime.datetime.fromisoformat).
319
318
- To get just the date part, you can call `.date()` on any `datetime` object.
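The last two hints can be sketched using only the standard library. The string below is a made-up example of what a `datetime` attribute might contain; it is not taken from the actual page.

```python
from datetime import datetime

# Hypothetical value read from the `datetime` attribute of a `time` element
raw_value = "2024-06-09T18:02:11+00:00"

# Parse the ISO 8601 string into a timezone-aware datetime object
published_at = datetime.fromisoformat(raw_value)

# Keep just the calendar date, dropping the time and the UTC offset
published_on = published_at.date()

print(published_on)  # 2024-06-09
```

One caveat: before Python 3.11, `datetime.fromisoformat()` does not accept a trailing `Z` as the UTC designator, so such values may need `Z` replaced with `+00:00` first.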