sources/academy/webscraping/scraping_basics_python/02_devtools_locating_elements.md
Lines changed: 4 additions & 4 deletions
@@ -166,9 +166,9 @@ On English Wikipedia's [Main Page](https://en.wikipedia.org/wiki/Main_Page), use
166
166
1. Open the [Main Page](https://en.wikipedia.org/wiki/Main_Page).
167
167
1. Activate the element selection tool in your DevTools.
168
168
1. Click on several headings to examine the markup.
169
-
1. Notice that all headings are `h2` tags with the `mp-h2` class.
169
+
1. Notice that all headings are `h2` elements with the `mp-h2` class.
170
170
1. In the **Console**, execute `document.querySelectorAll('h2')`.
171
-
1. At the time of writing, this selector returns 8 headings. Each corresponds to a box, and there are no other `h2` tags on the page. Thus, the selector is sufficient as is.
171
+
1. At the time of writing, this selector returns 8 headings. Each corresponds to a box, and there are no other `h2` elements on the page. Thus, the selector is sufficient as is.
172
172
173
173
</details>
174
174
@@ -184,7 +184,7 @@ Go to Shein's [Jewelry & Accessories](https://shein.com/RecommendSelection/Jewel
184
184
1. Visit the [Jewelry & Accessories](https://shein.com/RecommendSelection/Jewelry-Accessories-sc-017291431.html) page. Close any pop-ups or promotions.
185
185
1. Activate the element selection tool in your DevTools.
186
186
1. Click on the first product to inspect its markup. Repeat with a few others.
187
-
1. Observe that all products are `section` tags with multiple classes, including `product-card`.
187
+
1. Observe that all products are `section` elements with multiple classes, including `product-card`.
188
188
1. Since `section` is a generic wrapper, focus on the `product-card` class.
189
189
1. In the **Console**, execute `document.querySelectorAll('.product-card')`.
190
190
1. At the time of writing, this selector returns 120 results, all representing products. No further narrowing is necessary.
@@ -205,7 +205,7 @@ Hint: Learn about the [descendant combinator](https://developer.mozilla.org/en-U
205
205
1. Open the [page about F1](https://www.theguardian.com/sport/formulaone).
206
206
1. Activate the element selection tool in your DevTools.
207
207
1. Click on an article to inspect its structure. Check several articles, including the ones with smaller cards.
208
-
1. Note that all articles are `li` tags, but their classes (e.g., `dcr-1qmyfxi`) are dynamically generated and unreliable.
208
+
1. Note that all articles are `li` elements, but their classes (e.g., `dcr-1qmyfxi`) are dynamically generated and unreliable.
209
209
1. Using `document.querySelectorAll('li')` returns too many results, including unrelated items like navigation links.
210
210
1. Inspect the page structure. The `main` element contains the primary content, including articles. Use the descendant combinator to target `li` elements within `main`.
211
211
1. In the **Console**, execute `document.querySelectorAll('main li')`.
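The same descendant combinator carries over to Python. A minimal sketch, assuming Beautiful Soup (`bs4`) is installed; the HTML snippet is made up for illustration and is not the Guardian's real markup:

```py
from bs4 import BeautifulSoup

# Hypothetical markup: one list in the navigation, one inside <main>.
html = """
<nav><ul><li>Home</li><li>Sport</li></ul></nav>
<main>
  <ul>
    <li>Article A</li>
    <li>Article B</li>
  </ul>
</main>
"""

soup = BeautifulSoup(html, "html.parser")

# A bare type selector matches every <li>, including navigation items.
all_items = soup.select("li")

# The descendant combinator (a space) keeps only <li> nested inside <main>.
article_items = soup.select("main li")

print(len(all_items), len(article_items))
print([item.text for item in article_items])
```

Comparing the two counts is a quick way to check whether narrowing the selector actually excluded the unrelated items.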
sources/academy/webscraping/scraping_basics_python/03_devtools_extracting_data.md
Lines changed: 1 addition & 1 deletion
@@ -126,7 +126,7 @@ On the Guardian's [F1 news page](https://www.theguardian.com/sport/formulaone),
126
126
1. Open the [F1 news page](https://www.theguardian.com/sport/formulaone).
127
127
1. Activate the element selection tool in your DevTools.
128
128
1. Click on the first post.
129
-
1. Notice that the markup does not provide clear, reusable class names for this task. The structure uses generic tags and randomized classes, requiring you to rely on the element hierarchy and order instead.
129
+
1. Notice that the markup does not provide clear, reusable class names for this task. The structure uses generic tag names and randomized classes, requiring you to rely on the element hierarchy and order instead.
130
130
1. In the **Console**, execute `post = document.querySelector('#maincontent ul li')`. This returns the element representing the first post.
131
131
1. Extract the post's title by executing `post.querySelector('h3').textContent`.
132
132
1. Extract the lead paragraph by executing `post.querySelector('span div').textContent`.
sources/academy/webscraping/scraping_basics_python/05_parsing_html.md
Lines changed: 5 additions & 5 deletions
@@ -11,7 +11,7 @@ import Exercises from './_exercises.mdx';
11
11
12
12
---
13
13
14
-
From lessons about browser DevTools we know that the HTML tags representing individual products have a `class` attribute which, among other values, contains `product-item`.
14
+
From lessons about browser DevTools we know that the HTML elements representing individual products have a `class` attribute which, among other values, contains `product-item`.
15
15
16
16

Now let's use it for parsing the HTML. The `BeautifulSoup` object allows us to work with the HTML elements in a structured way. As a demonstration, we'll first get the `<h1>` tag, which represents the main heading of the page.
40
+
Now let's use it for parsing the HTML. The `BeautifulSoup` object allows us to work with the HTML elements in a structured way. As a demonstration, we'll first get the `<h1>` element, which represents the main heading of the page.
Our code lists all `<h1>` tags it can find on the page. It's the case that there's just one, so in the result we can see a list with a single item. What if we want to print just the text? Let's change the end of the program to the following:
66
+
Our code lists all `h1` elements it can find on the page. It's the case that there's just one, so in the result we can see a list with a single item. What if we want to print just the text? Let's change the end of the program to the following:
67
67
68
68
```py
headings = soup.select("h1")
first_heading = headings[0]
print(first_heading.text)
```
73
73
74
-
If we run our scraper again, it prints the text of the first `<h1>` tag:
74
+
If we run our scraper again, it prints the text of the first `h1` element:
sources/academy/webscraping/scraping_basics_python/07_extracting_data.md
Lines changed: 1 addition & 1 deletion
@@ -312,7 +312,7 @@ Max Verstappen wins Canadian Grand Prix: F1 – as it happened 2024-06-09
312
312
313
313
Hints:
314
314
315
-
- HTML's `<time>` tag can have an attribute `datetime`, which [contains data in a machine-readable format](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/time), such as the ISO 8601.
315
+
- HTML's `time` element can have an attribute `datetime`, which [contains data in a machine-readable format](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/time), such as the ISO 8601.
316
316
- Beautiful Soup gives you [access to attributes as if they were dictionary keys](https://beautiful-soup-4.readthedocs.io/en/latest/#attributes).
317
317
- In Python you can create `datetime` objects using `datetime.fromisoformat()`, a [built-in method for parsing ISO 8601 strings](https://docs.python.org/3/library/datetime.html#datetime.datetime.fromisoformat).
318
318
- To get just the date part, you can call `.date()` on any `datetime` object.
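The last two hints can be sketched with the standard library alone. This assumes the `datetime` attribute value has already been read from the element; the function name and the sample value are hypothetical, and real pages may use a different ISO 8601 variant (for example a trailing `Z`, which older Python versions may reject):

```py
from datetime import date, datetime


def parse_published_date(datetime_attr: str) -> date:
    """Turn an ISO 8601 string from a `datetime` attribute into a date."""
    return datetime.fromisoformat(datetime_attr).date()


# Hypothetical attribute value, not copied from a real page.
published = parse_published_date("2024-06-09T20:45:08+00:00")
print(published)  # prints 2024-06-09
```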
Several methods exist for transitioning from one page to another, but the most common is a link tag, which looks like this:
231
+
Several methods exist for transitioning from one page to another, but the most common is a link element, which looks like this:
232
232
233
233
```html
<a href="https://example.com">Text of the link</a>
```
236
236
237
-
In DevTools, we can see that each product title is, in fact, also a link tag. We already locate the titles, so that makes our task easier. We just need to edit the code so that it extracts not only the text of the element but also the `href` attribute. Beautiful Soup elements support accessing attributes as if they were dictionary keys:
237
+
In DevTools, we can see that each product title is, in fact, also a link element. We already locate the titles, so that makes our task easier. We just need to edit the code so that it extracts not only the text of the element but also the `href` attribute. Beautiful Soup elements support accessing attributes as if they were dictionary keys:
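A minimal sketch of that dictionary-style access, assuming Beautiful Soup (`bs4`) is installed. The markup and URL below are invented for illustration and are not taken from the course's target site:

```py
from bs4 import BeautifulSoup

# Hypothetical product title that is also a link.
html = '<a href="/products/example-product">Example product</a>'

soup = BeautifulSoup(html, "html.parser")
link = soup.select_one("a")

title = link.text   # text content of the element
url = link["href"]  # attribute read as if it were a dictionary key

print(title, url)
```

Indexing with `link["href"]` raises `KeyError` when the attribute is missing; `link.get("href")` returns `None` instead, which may be preferable on less uniform pages.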