Commit fd15f5e

feat: third lesson about DevTools (#1321)
Porting [this lesson](https://docs.apify.com/academy/web-scraping-for-beginners/data-extraction/using-devtools) and [this lesson](https://docs.apify.com/academy/web-scraping-for-beginners/data-extraction/devtools-continued):

- I deliberately didn't include anything about cleaning data or iterating over the data, because it would be an unnecessary load of JavaScript for a Python dev, for no obvious reason. They'll learn everything in the following lessons. I focused only on what can be considered ad-hoc inspection of the website, which is useful even to Python devs.
- I tried to unify the style with the rest of the course. I wrote it in my own words and added the standard polishing I usually do (dictionary, proofreading by AI, etc.).
- I added several screenshots.
- I added exercises.
- There's no Python, so I'm not bothering Vláďa with this one.

![image](https://github.com/user-attachments/assets/a95db947-4f05-4cef-ba22-18437a164d47)
1 parent afd1ffb commit fd15f5e

File tree

10 files changed: +149 −37 lines


.github/styles/config/vocabularies/Docs/accept.txt

Lines changed: 25 additions & 32 deletions
```diff
@@ -84,45 +84,38 @@ preconfigured
 [Tt]rello
 [Pp]refill
 
-
 [Mm]ultiselect
 
-[Ss]crapy
 asyncio
-parallelization
-IMDb
-
-
 Langflow
-
-iPhone
-iPhones
-iPad
-iPads
-screenshotting
-Fakestore
-SKUs
-SKU
-Shopify
-learnings
-subwoofer
-captcha
-captchas
+backlinks?
+captchas?
+Chatbot
+combinator
 deduplicating
-reindexes
-READMEs
-backlink
-backlinks
-subreddit
-subreddits
-upvote
-walkthrough
-walkthroughs
+Fakestore
+Fandom('s)?
+IMDb
 influencers
+iPads?
+iPhones?
+jQuery
+learnings
 livestreams
 outro
-Chatbot
-Tripadvisor
+parallelization
+READMEs
+reindexes
 [Rr]epurpose
+screenshotting
+[Ss]crapy
+Shein('s)?
+Shopify
+SKUs?
+subreddits?
+[Ss]ubwoofer
+Tripadvisor
+upvote
+walkthroughs?
 
-jQuery
+ul
```

sources/academy/webscraping/scraping_basics_python/02_devtools_locating_elements.md

Lines changed: 2 additions & 2 deletions
```diff
@@ -120,7 +120,7 @@ Multiple approaches often exist for creating a CSS selector that targets the ele
 
 The product card has four classes: `product-item`, `product-item--vertical`, `1/3--tablet-and-up`, and `1/4--desk`. Only the first one checks all the boxes. A product card *is* a product item, after all. The others seem more about styling—defining how the element looks on the screen—and are probably tied to CSS rules.
 
-This class is also unique enough in the page's context. If it were something generic like `item`, there'd be a higher risk that developers of the website might use it for unrelated elements. In the **Elements** tab, you can see a parent element `product-list` that contains all the product cards marked as `product-item`. This structure aligns with the data we're after.
+This class is also unique enough in the page's context. If it were something generic like `item`, there would be a higher risk that developers of the website might use it for unrelated elements. In the **Elements** tab, you can see a parent element `product-list` that contains all the product cards marked as `product-item`. This structure aligns with the data we're after.
 
 ![Overview of all the product cards in DevTools](./images/devtools-product-list.png)
 
@@ -198,7 +198,7 @@ Go to Guardian's [page about F1](https://www.theguardian.com/sport/formulaone).
 
 Hint: Learn about the [descendant combinator](https://developer.mozilla.org/en-US/docs/Web/CSS/Descendant_combinator).
 
-![Articles on Guardian's page about F1](./images/devtools-exercise-guardian.png)
+![Articles on Guardian's page about F1](./images/devtools-exercise-guardian1.png)
 
 <details>
 <summary>Solution</summary>
```

sources/academy/webscraping/scraping_basics_python/03_devtools_extracting_data.md

Lines changed: 122 additions & 3 deletions
````diff
@@ -6,12 +6,131 @@ sidebar_position: 3
 slug: /scraping-basics-python/devtools-extracting-data
 ---
 
+import Exercises from './_exercises.mdx';
+
 **In this lesson we'll use the browser tools for developers to manually extract product data from an e-commerce website.**
 
 ---
 
-:::danger Work in Progress
+In our pursuit to scrape products from the [Sales page](https://warehouse-theme-metal.myshopify.com/collections/sales), we've located the parent elements that contain the relevant data. Now, how do we extract that data?
+
+## Finding product details
+
+Previously, we figured out how to save the subwoofer product card to a variable in the **Console**:
+
+```js
+products = document.querySelectorAll('.product-item');
+subwoofer = products[2];
+```
+
+The product details are within the element as text, so maybe if we extract the text, we can work out the individual values?
+
+```js
+subwoofer.textContent;
+```
+
+That indeed outputs all the text, but in a form which would be hard to break down into relevant pieces.
+
+![Printing text content of the parent element](./images/devtools-extracting-text.png)
+
+We'll first need to locate the relevant child elements and extract the data from each of them individually.
````
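The "hard to break down" claim is easy to demonstrate in plain JavaScript: even after normalizing white space, the card's text is a single undifferentiated string. A minimal sketch, using a made-up sample of what the card's `textContent` might look like:

```javascript
// Made-up sample of a product card's textContent (real markup adds much more noise)
const rawText = '\n    Sony SA-CS9\n    \n      Sale price$158.00\n  ';

// Collapse runs of white space into single spaces and trim the ends
const cleaned = rawText.replace(/\s+/g, ' ').trim();

console.log(cleaned); // "Sony SA-CS9 Sale price$158.00"
```

Even cleaned up like this, there's no reliable way to tell where the title ends and the price begins, which is why the lesson turns to child elements instead.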
````diff
+## Extracting title
+
+We'll use the **Elements** tab of DevTools to inspect all child elements of the product card for the Sony subwoofer. We can see that the title of the product is inside an `a` element with several classes. Of those, `product-item__title` seems like a great choice for locating the element.
+
+![Finding child elements](./images/devtools-product-details.png)
+
+JavaScript represents HTML elements as [Element](https://developer.mozilla.org/en-US/docs/Web/API/Element) objects. In addition to properties we've already played with, such as `textContent` or `outerHTML`, each element also provides the [`querySelector()`](https://developer.mozilla.org/en-US/docs/Web/API/Element/querySelector) method. Called on an element, the method looks for matches only among the children of that element:
+
+```js
+title = subwoofer.querySelector('.product-item__title');
+title.textContent;
+```
+
+Notice we're calling `querySelector()` on the `subwoofer` variable, not `document`. And just like that, we've scraped our first piece of data! We've extracted the product title:
+
+![Extracting product title](./images/devtools-extracting-title.png)
````
````diff
+## Extracting price
+
+To figure out how to get the price, we'll use the **Elements** tab of DevTools again. We notice there are two prices, a regular price and a sale price. For the purposes of watching prices, we'll need the sale price. Both are `span` elements with the `price` class.
+
+![Finding child elements](./images/devtools-product-details.png)
+
+We could either rely on the sale price always being the one that's highlighted, or on it always being the first price. For now, we'll rely on the latter and let `querySelector()` simply return the first match:
+
+```js
+price = subwoofer.querySelector('.price');
+price.textContent;
+```
+
+It works, but the price isn't alone in the result. Before we could use such data, we'd need to do some **data cleaning**:
+
+![Extracting product price](./images/devtools-extracting-price.png)
+
+That's okay for now. We're just testing the waters so that we have an idea of what our scraper will need to do. Once we get to extracting prices in Python, we'll figure out how to get the values as numbers.
+
+In the next lesson, we'll start with our Python project. First, we'll figure out how to download the Sales page without a browser and make it accessible in a Python program.
+
+---
+
+<Exercises />
+
````
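To illustrate what that future data cleaning might involve, here's one possible sketch. The price string below is a made-up sample, and this isn't necessarily the approach the course will take later:

```javascript
// Made-up sample of what the price element's textContent might contain
const priceText = 'Sale price$158.00';

// Keep only what follows the dollar sign and parse it as a number
const amount = parseFloat(priceText.split('$')[1]);

console.log(amount); // 158
```

Splitting on the currency symbol is fragile (it assumes a dollar sign is always present), which is exactly the kind of decision the scraper will need to make deliberately.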
````diff
+### Extract the price of IKEA's most expensive artificial plant
+
+At IKEA's [Artificial plants & flowers listing](https://www.ikea.com/se/en/cat/artificial-plants-flowers-20492/), use CSS selectors and HTML element manipulation in the **Console** to extract the price of the most expensive artificial plant (sold in Sweden, as you'll be browsing their Swedish offer). Before opening DevTools, use your judgment to adjust the page to make the task as straightforward as possible. Finally, use JavaScript's [`parseInt()`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/parseInt) function to convert the price text into a number.
+
+<details>
+<summary>Solution</summary>
+
+1. Open the [Artificial plants & flowers listing](https://www.ikea.com/se/en/cat/artificial-plants-flowers-20492/).
+1. Sort the products by price, from high to low, so the most expensive plant appears first in the listing.
+1. Activate the element selection tool in your DevTools.
+1. Click on the price of the first and most expensive plant.
+1. Notice that the price is structured into two elements, with the integer separated from the currency, under a class named `plp-price__integer`. This structure is convenient for extracting the value.
+1. In the **Console**, execute `document.querySelector('.plp-price__integer')`. This returns the element representing the first price in the listing. Since `document.querySelector()` returns the first matching element, it directly selects the most expensive plant's price.
+1. Save the element in a variable by executing `price = document.querySelector('.plp-price__integer')`.
+1. Convert the price text into a number by executing `parseInt(price.textContent)`.
+1. At the time of writing, this returns `699`, meaning [699 SEK](https://www.google.com/search?q=699%20sek).
+
+</details>
````
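One caveat worth knowing about `parseInt()`: it reads digits until it hits a character that isn't part of a number, so thousands separators can silently truncate the value. A quick sketch with made-up sample strings:

```javascript
// parseInt() parses digits up to the first non-numeric character
const exact = parseInt('699');        // 699
const withUnit = parseInt('699 kr');  // 699 (trailing text is ignored)

// Beware of thousands separators, though: parsing stops at the space
const truncated = parseInt('1 299');  // 1

// Removing white space first yields the full value
const full = parseInt('1 299'.replace(/\s/g, ''));  // 1299
```

For the 699 SEK plant this doesn't matter, but for four-digit prices formatted with separators it would.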
````diff
+### Extract the name of the top wiki on Fandom Movies
+
+On Fandom's [Movies page](https://www.fandom.com/topics/movies), use CSS selectors and HTML element manipulation in the **Console** to extract the name of the top wiki. Use JavaScript's [`trim()`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/trim) method to remove white space around the name.
+
+![Fandom's Movies page](./images/devtools-exercise-fandom.png)
+
+<details>
+<summary>Solution</summary>
+
+1. Open the [Movies page](https://www.fandom.com/topics/movies).
+1. Activate the element selection tool in your DevTools.
+1. Click on the list item for the top Fandom wiki in the category.
+1. Notice that it has a class named `topic_explore-wikis__link`.
+1. In the **Console**, execute `document.querySelector('.topic_explore-wikis__link')`. This returns the element representing the top list item. The site uses this class only for the **Top Wikis** list, and because `document.querySelector()` returns the first matching element, you're almost done.
+1. Save the element in a variable by executing `item = document.querySelector('.topic_explore-wikis__link')`.
+1. Get the element's text without the surrounding white space by executing `item.textContent.trim()`. At the time of writing, this returns `"Pixar Wiki"`.
+
+</details>
````
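For reference, `trim()` only strips white space from the ends of a string; anything inside stays untouched. A small sketch with made-up sample strings:

```javascript
// trim() removes white space only from both ends of a string
const rawName = '\n      Pixar Wiki\n    ';
const name = rawName.trim();             // "Pixar Wiki"

// Inner white space is left as-is
const inner = '  Pixar   Wiki  '.trim(); // "Pixar   Wiki"
```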
````diff
+### Extract details about the first post on Guardian's F1 news
+
+On the Guardian's [F1 news page](https://www.theguardian.com/sport/formulaone), use CSS selectors and HTML manipulation in the **Console** to extract details about the first post. Specifically, extract its title, lead paragraph, and URL of the associated photo.
+
+![F1 news page](./images/devtools-exercise-guardian2.png)
+
+<details>
+<summary>Solution</summary>
 
-This lesson is under development. Please read [Extracting data with DevTools](../scraping_basics_javascript/data_extraction/devtools_continued.md) in the meantime so you can follow the upcoming lessons.
+1. Open the [F1 news page](https://www.theguardian.com/sport/formulaone).
+1. Activate the element selection tool in your DevTools.
+1. Click on the first post.
+1. Notice that the markup does not provide clear, reusable class names for this task. The structure uses generic tags and randomized classes, requiring you to rely on the element hierarchy and order instead.
+1. In the **Console**, execute `post = document.querySelector('#maincontent ul li')`. This returns the element representing the first post.
+1. Extract the post's title by executing `post.querySelector('h3').textContent`.
+1. Extract the lead paragraph by executing `post.querySelector('span div').textContent`.
+1. Extract the photo URL by executing `post.querySelector('img').src`.
 
-:::
+</details>
````
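One detail worth knowing for the photo URL: in the browser, an `img` element's `src` property already resolves to an absolute URL, but if you ever read the raw `src` attribute from HTML, it may be relative. The standard `URL` constructor can resolve it against the page's base URL (the image path below is a made-up example):

```javascript
// Resolve a relative image path against a base URL
const photoUrl = new URL('/img/media/example.jpg', 'https://www.theguardian.com').href;

console.log(photoUrl); // "https://www.theguardian.com/img/media/example.jpg"
```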
Binary image files changed: 936 KB, 760 KB, 138 KB, 139 KB, 117 KB, 247 KB
