
Commit 374c47a

fix: use 'tag' only when talking about HTML markup, otherwise use 'element'
1 parent 46cd811 commit 374c47a

5 files changed, +13 −13 lines changed


sources/academy/webscraping/scraping_basics_python/02_devtools_locating_elements.md

Lines changed: 4 additions & 4 deletions
@@ -166,9 +166,9 @@ On English Wikipedia's [Main Page](https://en.wikipedia.org/wiki/Main_Page), use
  1. Open the [Main Page](https://en.wikipedia.org/wiki/Main_Page).
  1. Activate the element selection tool in your DevTools.
  1. Click on several headings to examine the markup.
- 1. Notice that all headings are `h2` tags with the `mp-h2` class.
+ 1. Notice that all headings are `h2` elements with the `mp-h2` class.
  1. In the **Console**, execute `document.querySelectorAll('h2')`.
- 1. At the time of writing, this selector returns 8 headings. Each corresponds to a box, and there are no other `h2` tags on the page. Thus, the selector is sufficient as is.
+ 1. At the time of writing, this selector returns 8 headings. Each corresponds to a box, and there are no other `h2` elements on the page. Thus, the selector is sufficient as is.

  </details>

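The same check can be replayed outside the browser. A minimal Python sketch, assuming `httpx` and `beautifulsoup4` are installed; the lesson itself runs the query in the DevTools Console, and the count may no longer be 8, since the page changes daily:

```py
import httpx
from bs4 import BeautifulSoup

# Download Wikipedia's Main Page and parse it.
url = "https://en.wikipedia.org/wiki/Main_Page"
response = httpx.get(url)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

# Mirrors document.querySelectorAll('h2') from the Console.
headings = soup.select("h2")
print(len(headings))
for heading in headings:
    print(heading.text.strip())
```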
@@ -184,7 +184,7 @@ Go to Shein's [Jewelry & Accessories](https://shein.com/RecommendSelection/Jewel
  1. Visit the [Jewelry & Accessories](https://shein.com/RecommendSelection/Jewelry-Accessories-sc-017291431.html) page. Close any pop-ups or promotions.
  1. Activate the element selection tool in your DevTools.
  1. Click on the first product to inspect its markup. Repeat with a few others.
- 1. Observe that all products are `section` tags with multiple classes, including `product-card`.
+ 1. Observe that all products are `section` elements with multiple classes, including `product-card`.
  1. Since `section` is a generic wrapper, focus on the `product-card` class.
  1. In the **Console**, execute `document.querySelectorAll('.product-card')`.
  1. At the time of writing, this selector returns 120 results, all representing products. No further narrowing is necessary.
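
The `.product-card` selector translates directly to Beautiful Soup. A self-contained sketch with simplified, made-up markup (the real listing is far more complex and may only load fully in a browser):

```py
from bs4 import BeautifulSoup

# Simplified stand-in for the real listing markup (hypothetical data).
html = """
<section class="S-product-card product-card j-expose">Ring</section>
<section class="product-card">Necklace</section>
<div class="banner">Not a product</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Same idea as document.querySelectorAll('.product-card') in the Console:
# select by class, regardless of the element name.
for card in soup.select(".product-card"):
    print(card.text)
```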
@@ -205,7 +205,7 @@ Hint: Learn about the [descendant combinator](https://developer.mozilla.org/en-U
  1. Open the [page about F1](https://www.theguardian.com/sport/formulaone).
  1. Activate the element selection tool in your DevTools.
  1. Click on an article to inspect its structure. Check several articles, including the ones with smaller cards.
- 1. Note that all articles are `li` tags, but their classes (e.g., `dcr-1qmyfxi`) are dynamically generated and unreliable.
+ 1. Note that all articles are `li` elements, but their classes (e.g., `dcr-1qmyfxi`) are dynamically generated and unreliable.
  1. Using `document.querySelectorAll('li')` returns too many results, including unrelated items like navigation links.
  1. Inspect the page structure. The `main` element contains the primary content, including articles. Use the descendant combinator to target `li` elements within `main`.
  1. In the **Console**, execute `document.querySelectorAll('main li')`.
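
A tiny self-contained sketch of what the descendant combinator buys here, using made-up markup in place of the Guardian's page:

```py
from bs4 import BeautifulSoup

# Made-up layout: navigation list items live outside <main>.
html = """
<nav><ul><li>Home</li><li>Sport</li></ul></nav>
<main>
  <ul>
    <li>Article about qualifying</li>
    <li>Article about the race</li>
  </ul>
</main>
"""
soup = BeautifulSoup(html, "html.parser")

print(len(soup.select("li")))       # 4 -- includes the navigation items
print(len(soup.select("main li")))  # 2 -- only list items nested inside <main>
```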

sources/academy/webscraping/scraping_basics_python/03_devtools_extracting_data.md

Lines changed: 1 addition & 1 deletion
@@ -126,7 +126,7 @@ On the Guardian's [F1 news page](https://www.theguardian.com/sport/formulaone),
  1. Open the [F1 news page](https://www.theguardian.com/sport/formulaone).
  1. Activate the element selection tool in your DevTools.
  1. Click on the first post.
- 1. Notice that the markup does not provide clear, reusable class names for this task. The structure uses generic tags and randomized classes, requiring you to rely on the element hierarchy and order instead.
+ 1. Notice that the markup does not provide clear, reusable class names for this task. The structure uses generic tag names and randomized classes, requiring you to rely on the element hierarchy and order instead.
  1. In the **Console**, execute `post = document.querySelector('#maincontent ul li')`. This returns the element representing the first post.
  1. Extract the post's title by executing `post.querySelector('h3').textContent`.
  1. Extract the lead paragraph by executing `post.querySelector('span div').textContent`.
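
The Console commands above have direct Beautiful Soup counterparts. A rough sketch, assuming `httpx` for the download; because the selectors rely on hierarchy and order rather than stable class names, they may stop matching if the Guardian changes its markup:

```py
import httpx
from bs4 import BeautifulSoup

url = "https://www.theguardian.com/sport/formulaone"
response = httpx.get(url)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

# Same selectors as in the Console solution above.
post = soup.select_one("#maincontent ul li")
if post is None:
    raise SystemExit("Selector no longer matches; the markup may have changed")

print(post.select_one("h3").text)        # the post's title
print(post.select_one("span div").text)  # the lead paragraph
```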

sources/academy/webscraping/scraping_basics_python/05_parsing_html.md

Lines changed: 5 additions & 5 deletions
@@ -11,7 +11,7 @@ import Exercises from './_exercises.mdx';

  ---

- From lessons about browser DevTools we know that the HTML tags representing individual products have a `class` attribute which, among other values, contains `product-item`.
+ From lessons about browser DevTools we know that the HTML elements representing individual products have a `class` attribute which, among other values, contains `product-item`.

  ![Products have the ‘product-item’ class](./images/product-item.png)

@@ -37,9 +37,9 @@ $ pip install beautifulsoup4
  Successfully installed beautifulsoup4-4.0.0 soupsieve-0.0
  ```

- Now let's use it for parsing the HTML. The `BeautifulSoup` object allows us to work with the HTML elements in a structured way. As a demonstration, we'll first get the `<h1>` tag, which represents the main heading of the page.
+ Now let's use it for parsing the HTML. The `BeautifulSoup` object allows us to work with the HTML elements in a structured way. As a demonstration, we'll first get the `<h1>` element, which represents the main heading of the page.

- ![Tag of the main heading](./images/h1.png)
+ ![Element of the main heading](./images/h1.png)

  Update your code to the following:

@@ -63,15 +63,15 @@ $ python main.py
  [<h1 class="collection__title heading h1">Sales</h1>]
  ```

- Our code lists all `<h1>` tags it can find on the page. It's the case that there's just one, so in the result we can see a list with a single item. What if we want to print just the text? Let's change the end of the program to the following:
+ Our code lists all `h1` elements it can find on the page. It's the case that there's just one, so in the result we can see a list with a single item. What if we want to print just the text? Let's change the end of the program to the following:

  ```py
  headings = soup.select("h1")
  first_heading = headings[0]
  print(first_heading.text)
  ```

- If we run our scraper again, it prints the text of the first `<h1>` tag:
+ If we run our scraper again, it prints the text of the first `h1` element:

  ```text
  $ python main.py
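
Pieced together, the snippets in these hunks correspond to a short script along the following lines; the store URL is a placeholder here, and `httpx` stands in for whichever HTTP client the lesson uses for the download:

```py
import httpx
from bs4 import BeautifulSoup

# Placeholder for the product listing page used in the lesson.
url = "https://example.com/collections/sales"

response = httpx.get(url)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
headings = soup.select("h1")   # all h1 elements on the page
first_heading = headings[0]    # the page has just one
print(first_heading.text)
```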

sources/academy/webscraping/scraping_basics_python/07_extracting_data.md

Lines changed: 1 addition & 1 deletion
@@ -312,7 +312,7 @@ Max Verstappen wins Canadian Grand Prix: F1 – as it happened 2024-06-09

  Hints:

- - HTML's `<time>` tag can have an attribute `datetime`, which [contains data in a machine-readable format](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/time), such as the ISO 8601.
+ - HTML's `time` element can have an attribute `datetime`, which [contains data in a machine-readable format](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/time), such as the ISO 8601.
  - Beautiful Soup gives you [access to attributes as if they were dictionary keys](https://beautiful-soup-4.readthedocs.io/en/latest/#attributes).
  - In Python you can create `datetime` objects using `datetime.fromisoformat()`, a [built-in method for parsing ISO 8601 strings](https://docs.python.org/3/library/datetime.html#datetime.datetime.fromisoformat).
  - To get just the date part, you can call `.date()` on any `datetime` object.
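
Combined, the hints amount to roughly the following sketch; the `time` markup is made up for illustration:

```py
from datetime import datetime
from bs4 import BeautifulSoup

# Made-up markup of the kind the hints describe.
html = '<time datetime="2024-06-09T21:30:00+00:00">Sunday</time>'
soup = BeautifulSoup(html, "html.parser")

time_element = soup.select_one("time")
# The attribute is accessible like a dictionary key.
published_at = datetime.fromisoformat(time_element["datetime"])
print(published_at.date())  # 2024-06-09
```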

sources/academy/webscraping/scraping_basics_python/09_getting_links.md

Lines changed: 2 additions & 2 deletions
@@ -228,13 +228,13 @@ With everything in place, we can now start working on a scraper that also scrape

  ![Product card's child elements](./images/child-elements.png)

- Several methods exist for transitioning from one page to another, but the most common is a link tag, which looks like this:
+ Several methods exist for transitioning from one page to another, but the most common is a link element, which looks like this:

  ```html
  <a href="https://example.com">Text of the link</a>
  ```

- In DevTools, we can see that each product title is, in fact, also a link tag. We already locate the titles, so that makes our task easier. We just need to edit the code so that it extracts not only the text of the element but also the `href` attribute. Beautiful Soup elements support accessing attributes as if they were dictionary keys:
+ In DevTools, we can see that each product title is, in fact, also a link element. We already locate the titles, so that makes our task easier. We just need to edit the code so that it extracts not only the text of the element but also the `href` attribute. Beautiful Soup elements support accessing attributes as if they were dictionary keys:

  ```py
  def parse_product(product):
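
For illustration, attribute access of the kind described above could look like this; the markup, class names, and URL below are simplified stand-ins, not the lesson's actual `parse_product()` code:

```py
from bs4 import BeautifulSoup

# Hypothetical product card markup, standing in for the real page.
html = """
<div class="product-item">
  <a class="product-item__title" href="https://warehouse.example/products/jbl-flip">JBL Flip 4</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

product = soup.select_one(".product-item")
title_link = product.select_one("a")
print(title_link.text)     # text of the link, as before
print(title_link["href"])  # the href attribute, accessed like a dict key
```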
