sources/academy/webscraping/scraping_basics_python/06_locating_elements.md
+8 lines changed: 8 additions & 0 deletions
@@ -122,6 +122,14 @@ for product in soup.select(".product-item"):
This program does the same as the one we already had, but its code is more concise.
+:::note Fragile code
+
+We assume that the selectors we pass to the `select()` or `select_one()` methods return at least one element. If they don't, calling `[0]` on an empty list or `.text` on `None` would crash the program. If you perform type checking on your Python program, the code examples above may even trigger warnings about this.
+
+Not handling these cases allows us to keep the code examples more succinct. Additionally, if we expect the selectors to return elements but they suddenly don't, it usually means the website has changed since we wrote our scraper. Letting the program crash in such cases is a valid way to notify ourselves that we need to fix it.
+
+:::
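For illustration, a defensive version of such a lookup could look like the following minimal sketch; the `.price` selector and the sample HTML are hypothetical, not part of the lesson's code:

```python
from bs4 import BeautifulSoup

html = '<div class="product-item"></div>'  # a product with no price element
soup = BeautifulSoup(html, "html.parser")

for product in soup.select(".product-item"):
    price = product.select_one(".price")  # hypothetical selector
    if price is None:
        # fail loudly with our own message instead of an AttributeError
        raise RuntimeError("Price element not found, has the page changed?")
    print(price.text.strip())
```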
## Precisely locating price
In the output we can see that the price isn't located precisely. For each product, our scraper also prints the text `Sale price`. Let's look at the HTML structure again. Each bit containing the price looks like this:
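The HTML snippet itself is cut off in this diff. Since the scraper picks up the text `Sale price` together with the amount, the label is presumably wrapped in a visually hidden element. A minimal sketch of one way to skip it, where the sample markup, class names, and price value are assumptions rather than the page's verbatim HTML:

```python
from bs4 import BeautifulSoup

html = """
<span class="price">
  <span class="visually-hidden">Sale price</span>
  $74.95
</span>
"""
soup = BeautifulSoup(html, "html.parser")

# .text would include "Sale price"; the last child node of the
# element is the bare amount after the visually hidden label
price = soup.select_one(".price").contents[-1].strip()
print(price)  # $74.95
```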
-    for product in listing_soup.select(".product-item")
-]
+data = []
+for product in listing_soup.select(".product-item"):
+    item = parse_product(product)
+    data.append(item)

with open("products.csv", "w") as file:
    export_csv(file, data)
@@ -212,7 +213,7 @@ with open("products.json", "w") as file:
export_json(file, data)
```

-The program is much easier to read now. With the `parse_product()` function handy, we could also replace the convoluted loop with a [list comprehension](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions).
+The program is much easier to read now. With the `parse_product()` function handy, we could also replace the convoluted loop with one that only takes up four lines of code.
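For reference, the list comprehension that the removed sentence links to would collapse the same four-line loop into a single expression, using the same `parse_product()` and `listing_soup` as in the hunk above:

```python
data = [parse_product(product) for product in listing_soup.select(".product-item")]
```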
:::tip Refactoring
@@ -304,11 +305,12 @@ Now we'll pass the base URL to the function in the main body of our program:
-    for product in listing_soup.select(".product-item")
-]
+data = []
+for product in listing_soup.select(".product-item"):
+    item = parse_product(product, listing_url)
+    data.append(item)

with open("products.csv", "w") as file:
    export_csv(file, data)
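The hunk header above mentions passing the base URL into `parse_product()`. The function body isn't visible in this diff, but resolving relative product links against the listing URL typically looks like this sketch; the selector and the returned fields are assumptions, trimmed to the URL handling:

```python
from urllib.parse import urljoin

def parse_product(product, base_url):
    # links on the listing page are relative, so resolve them
    # against the listing page's URL
    link = product.select_one(".product-item__title")  # assumed selector
    return {
        "title": link.text.strip(),
        "url": urljoin(base_url, link["href"]),
    }
```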
@@ -82,31 +83,77 @@ with open("products.json", "w") as file:
export_json(file, data)
```

-## Crawling product URLs
+## Extracting vendor name
+
+Each product URL points to a so-called _product detail page_, or PDP. If we open one of the product URLs in the browser, e.g. the one about [Sony XBR-950G BRAVIA](https://warehouse-theme-metal.myshopify.com/products/sony-xbr-65x950g-65-class-64-5-diag-bravia-4k-hdr-ultra-hd-tv), we can see that it contains a vendor name, [SKU](https://en.wikipedia.org/wiki/Stock_keeping_unit), number of reviews, product images, product variants, stock availability, description, and perhaps more.
+
+![](./images/pdp.png)
+
+Depending on what's valuable for our use case, we can now use the same techniques as in previous lessons to extract any of the above. As a demonstration, let's scrape the vendor name. In browser DevTools we can see that the HTML around the vendor name has the following structure:
+
+```html
+<div class="product-meta">
+  <h1 class="product-meta__title heading h1">
+    Sony XBR-950G BRAVIA 4K HDR Ultra HD TV
+  </h1>
+  <div class="product-meta__label-list">
+    ...
+  </div>
+  <div class="product-meta__reference">
+    <!-- highlight-next-line -->
+    <a class="product-meta__vendor link link--accented" href="/collections/sony">
...
+      <div class="rating__stars" role="img" aria-label="4.0 out of 5.0 stars">
+        ...
+      </div>
+      <span class="rating__caption">3 reviews</span>
+    </div>
+  </a>
+  ...
+</div>
+```
+
+It looks like using a CSS selector to locate the element with the `product-meta__vendor` class and extracting its text should be enough to get the vendor name as a string:
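The code block that followed here is cut off in the diff; with Beautiful Soup, the lookup would be a one-liner along these lines, assuming `product_soup` holds the parsed detail page:

```python
vendor = product_soup.select_one(".product-meta__vendor").text.strip()
print(vendor)  # Sony
```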
-In a new loop below the list comprehension we'll go through the product URLs, download and parse each of them, and extract some new data, e.g. name of the vendor. Then we'll save the data to the `product` dictionary as a new key.
+## Crawling product detail pages
+
+In the `data` loop we already go through all the products. Let's expand it so it also downloads the product detail page, parses it, extracts the name of the vendor, and adds it to the item as a new dictionary key:
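The expanded loop itself is also cut off in this diff. A sketch of the shape it plausibly takes, downloading each detail page with `httpx` and parsing it with Beautiful Soup the same way as the listing page; it assumes each item already carries its product `url`:

```python
import httpx
from bs4 import BeautifulSoup

data = []
for product in listing_soup.select(".product-item"):
    item = parse_product(product, listing_url)
    # download and parse this product's detail page
    product_response = httpx.get(item["url"])
    product_response.raise_for_status()
    product_soup = BeautifulSoup(product_response.text, "html.parser")
    # add the vendor as a new key on the item
    item["vendor"] = product_soup.select_one(".product-meta__vendor").text.strip()
    data.append(item)
```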
-If you run the program now, it will take longer to finish, but should produce exports with a new field containing the vendor:
+If you run the program now, it will take longer to finish, as it now makes 24 more HTTP requests, but in the end it should produce exports with a new field containing the vendor:

<!-- eslint-skip -->
```json title=products.json
@@ -129,26 +176,18 @@ If you run the program now, it will take longer to finish, but should produce ex
]
```
-<!--
-- show image of how we figured out the vendor or have a note about devtools
-
-caveats:
-- all the info in the listing is already at the product page, so it's a bit redundant to scrape the products in the listing, we could just scrape the links
-- scrape price for the variants
-
-caveats and reasons for framework:
-- it's slow
-- logging
-- a lot of boilerplate code
-- anti-scraping protection
-- browser crawling support
--->
-
-:::danger Work in progress
-
-This course is incomplete. As we work on adding new lessons, we would love to hear your feedback. You can comment right here under each page or [file a GitHub Issue](https://github.com/apify/apify-docs/issues) to discuss a problem.
-
-This particular page is a placeholder for several lessons which should teach crawling.
+## Extracting price
+
+Being able to scrape the vendor name is nice, but the main reason we started peeking at the detail pages in the first place was to figure out how to get a price for each product, because from the product listing we could only scrape the min price. And we're building a Python application for watching prices, remember?
+
+Looking at [Sony XBR-950G BRAVIA](https://warehouse-theme-metal.myshopify.com/products/sony-xbr-65x950g-65-class-64-5-diag-bravia-4k-hdr-ultra-hd-tv), it's apparent that the listing features only min prices, because some of the products have variants, each with a different price. And different stock availability. And different SKU…
+
+![](images/variants.png)
+
+In the next lesson we'll scrape the product detail pages in such a way that each product variant gets represented as a separate item in our dataset.
+
+---
+description: Lesson about building a Python application for watching prices. Using browser DevTools to figure out how to parse product variants and exporting them as separate items.
+sidebar_position: 11
+slug: /scraping-basics-python/parsing-variants
+---
+
+:::danger Work in progress
+
+This course is incomplete. As we work on adding new lessons, we would love to hear your feedback. You can comment right here under each page or [file a GitHub Issue](https://github.com/apify/apify-docs/issues) to discuss a problem.
+
+:::
+
+<!--
+
+import Exercises from './_exercises.mdx';
+
+**Blah blah.**
+
+---
+
+We'll need to change our code so that instead of having one item per product in the listing, we let the code which handles product detail pages decide how many items it generates.
+
+But first let's see if we can
+
+The design of our program now assumes that a single URL from the products listing represents a single product. As it turns out, each URL from the product listing can represent one or more products. Instead of having one item per product in the listing, we should let the code which handles product detail pages decide how many items it generates.
+description: Lesson about building a Python application for watching prices. Using the Crawlee framework to simplify creating a scraper.
+sidebar_position: 11
+slug: /scraping-basics-python/framework
+---
+
+:::danger Work in progress
+
+This course is incomplete. As we work on adding new lessons, we would love to hear your feedback. You can comment right here under each page or [file a GitHub Issue](https://github.com/apify/apify-docs/issues) to discuss a problem.
+
+:::
+
+<!--
+
+import Exercises from './_exercises.mdx';
+
+**Blah blah.**
+
+---
+
+caveats:
+- all the info in the listing is already at the product page, so it's a bit redundant to scrape the products in the listing, we could just scrape the links