sources/academy/webscraping/scraping_basics_python/11_scraping_variants.md
6 additions & 4 deletions
@@ -39,6 +39,8 @@ First, let's extract information about the variants. If we go to [Sony XBR-950G
Nice! We can extract the variant names, but we also need to extract the price for each variant. Switching the variants using the buttons shows us that the HTML changes dynamically. This means the page uses JavaScript to display information about the variants.
+
+
If we can't find a workaround, we'd need our scraper to run JavaScript. That's not impossible. Scrapers can spin up their own browser instance and automate clicking on buttons, but it's slow and resource-intensive. Ideally, we want to stick to plain HTTP requests and Beautiful Soup as much as possible.
After a bit of detective work, we notice that not far below the `block-swatch-list` there's also a block of HTML with a class `no-js`, which contains all the data!
@@ -103,7 +105,7 @@ Since Python 3.9, you can use `|` to merge two dictionaries. If the [docs](https
:::
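
To illustrate the `|` operator mentioned above, here is a small sketch with made-up data. The keys and values are hypothetical and not taken from the scraper:

```python
# Hypothetical fields from the product listing page.
item = {"title": "Example TV", "price": None}

# Hypothetical fields extracted for one variant.
variant = {"variant_name": "Example variant"}

# The | operator (Python 3.9+) returns a new dictionary and leaves
# both operands untouched. On duplicate keys, the right side wins.
merged = item | variant
print(merged)
```

Because `|` builds a new dictionary, the same `item` can be merged with each variant in turn without the variants overwriting one another.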
-
If you run the program, you should see 34 items in total. Some items should have no variant:
+
If you run the program, you should see 34 items in total. Some items don't have variants, so they won't have a variant name. However, they should still have a price set—our scraper should already have that info from the product listing page.
<!-- eslint-skip -->
```json title=products.json
@@ -121,7 +123,7 @@ If you run the program, you should see 34 items in total. Some items should have
]
```
-
Some products where we're missing the actual price should now have several variants:
+
Some products will break into several items, each with a different variant name. We don't know their exact prices from the product listing, just the min price. In the next step, we should be able to parse the actual price from the variant name for those items.
<!-- eslint-skip -->
```json title=products.json
@@ -147,7 +149,7 @@ Some products where we're missing the actual price should now have several varia
]
```
-
However, some products with variants will have the `price` field set. That's because the shop sells all these variants for the same price, so the product listing displays the price as a fixed amount:
+
Perhaps surprisingly, some products with variants will have the price field set. That's because the shop sells all variants of the product for the same price, so the product listing shows the price as a fixed amount, like _$74.95_, instead of _from $74.95_.
<!-- eslint-skip -->
```json title=products.json
@@ -167,7 +169,7 @@ However, some products with variants will have the `price` field set. That's bec
## Parsing price
-
The items now contain the variant as text, which is good for a start, but it would be more useful to set the price in the `price` key. Let's introduce a new function to handle that:
+
The items now contain the variant as text, which is good for a start, but we want the price to be in the `price` key. Let's introduce a new function to handle that:
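
As a rough sketch of what such a function might look like: the excerpt doesn't show the exact variant text, so the code below assumes a hypothetical format like `Red - $178.00`, with the name and price separated by ` - `. The function name `parse_variant_text` is made up for illustration and works on a plain string.

```python
from decimal import Decimal


def parse_variant_text(text):
    """Split variant text into a name and a price.

    Assumes a hypothetical format such as 'Red - $178.00'.
    """
    # rpartition splits on the last " - ", so a name that itself
    # contains " - " stays intact.
    name, _, price_text = text.strip().rpartition(" - ")
    # Drop the currency sign and thousands separators before converting.
    price = Decimal(price_text.replace("$", "").replace(",", ""))
    return {"variant_name": name, "price": price}


parsed = parse_variant_text("Red - $178.00")
print(parsed["variant_name"], parsed["price"])
```

Using `Decimal` instead of `float` avoids rounding surprises when working with money. If the real variant text follows a different pattern, the splitting logic needs to change accordingly.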