Commit e9088e6

Merge branch 'master' into new-api-docs
2 parents 6b965e2 + 868b576 commit e9088e6

12 files changed: +793, -28 lines

.github/styles/Apify/Apify.yml

Lines changed: 1 addition & 1 deletion
@@ -5,6 +5,6 @@ level: warning
 swap:
   Apify Dashboard: Apify Console
   apify freelancers: Apify freelancers
-  Apify Platfrom: Apify platform
+  Apify Platform: Apify platform
   '(?:[Tt]he\s)?[Aa]pify\sproxy': Apify Proxy
   circa: approx.

sources/academy/webscraping/scraping_basics_python/04_downloading_html.md

Lines changed: 4 additions & 5 deletions
@@ -140,12 +140,12 @@ Letting our program visibly crash on error is enough for our purposes. Now, let'

 <Exercises />

-### Scrape Amazon
+### Scrape AliExpress

-Download HTML of a product listing page, but this time from a real world e-commerce website. For example this page with Amazon search results:
+Download HTML of a product listing page, but this time from a real world e-commerce website. For example this page with AliExpress search results:

 ```text
-https://www.amazon.com/s?k=darth+vader
+https://www.aliexpress.com/w/wholesale-darth-vader.html
 ```

 <details>
@@ -154,13 +154,12 @@ https://www.amazon.com/s?k=darth+vader
 ```py
 import httpx

-url = "https://www.amazon.com/s?k=darth+vader"
+url = "https://www.aliexpress.com/w/wholesale-darth-vader.html"
 response = httpx.get(url)
 response.raise_for_status()
 print(response.text)
 ```

-If you get `Server error '503 Service Unavailable'`, that's just Amazon's anti-scraping protections. You can learn about how to overcome those in our [Anti-scraping protections](../anti_scraping/index.md) course.
 </details>

 ### Save downloaded HTML as a file
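
Note: the removed line about `503 Service Unavailable` pointed readers to the Anti-scraping protections course. As a rough sketch of what distinguishing such a block from other failures could look like with `httpx` (the lesson itself doesn't prescribe this handling), one might write:

```py
import httpx

url = "https://www.aliexpress.com/w/wholesale-darth-vader.html"

try:
    response = httpx.get(url)
    # Raises httpx.HTTPStatusError for 4xx and 5xx responses
    response.raise_for_status()
except httpx.HTTPStatusError as error:
    # A 503 here is often the site's anti-scraping protection rather than a real outage
    print(f"Request blocked or failed with status {error.response.status_code}")
else:
    print(response.text)
```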

sources/academy/webscraping/scraping_basics_python/06_locating_elements.md

Lines changed: 8 additions & 0 deletions
@@ -122,6 +122,14 @@ for product in soup.select(".product-item"):

 This program does the same as the one we already had, but its code is more concise.

+:::note Fragile code
+
+We assume that the selectors we pass to the `select()` or `select_one()` methods return at least one element. If they don't, calling `[0]` on an empty list or `.text` on `None` would crash the program. If you perform type checking on your Python program, the code examples above may even trigger warnings about this.
+
+Not handling these cases allows us to keep the code examples more succinct. Additionally, if we expect the selectors to return elements but they suddenly don't, it usually means the website has changed since we wrote our scraper. Letting the program crash in such cases is a valid way to notify ourselves that we need to fix it.
+
+:::
+
 ## Precisely locating price

 In the output we can see that the price isn't located precisely. For each product, our scraper also prints the text `Sale price`. Let's look at the HTML structure again. Each bit containing the price looks like this:
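
Note: the added "Fragile code" admonition argues for letting the scraper crash when selectors stop matching. For contrast, a defensive variant could look roughly like the sketch below; the `.product-item__title` class name is illustrative and may not match the lesson's markup exactly:

```py
from bs4 import BeautifulSoup

html = """
<div class="product-item">
  <a class="product-item__title">Sony XBR-950G</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

for product in soup.select(".product-item"):
    title = product.select_one(".product-item__title")
    if title is None:
        # The markup changed or the selector is wrong; skip instead of crashing on .text
        continue
    print(title.text.strip())
```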

sources/academy/webscraping/scraping_basics_python/09_getting_links.md

Lines changed: 14 additions & 6 deletions
@@ -199,8 +199,12 @@ def export_json(file, data):
     json.dump(data, file, default=serialize, indent=2)

 listing_url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
-soup = download(listing_url)
-data = [parse_product(product) for product in soup.select(".product-item")]
+listing_soup = download(listing_url)
+
+data = []
+for product in listing_soup.select(".product-item"):
+    item = parse_product(product)
+    data.append(item)

 with open("products.csv", "w") as file:
     export_csv(file, data)
@@ -209,7 +213,7 @@ with open("products.json", "w") as file:
     export_json(file, data)
 ```

-The program is much easier to read now. With the `parse_product()` function handy, we could also replace the convoluted loop with a [list comprehension](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions).
+The program is much easier to read now. With the `parse_product()` function handy, we could also replace the convoluted loop with one that only takes up four lines of code.

 :::tip Refactoring
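
Note: the four-line loop introduced above and the list comprehension from the removed sentence build the same list. A tiny self-contained comparison, with stubs standing in for `parse_product()` and the selected elements since the full lesson program isn't part of this diff:

```py
def parse_product(product):  # stub; the lesson's real function parses a BeautifulSoup tag
    return {"title": product}

selected_products = ["Product A", "Product B"]  # stand-in for listing_soup.select(".product-item")

# Loop version, as in the updated lesson
data = []
for product in selected_products:
    item = parse_product(product)
    data.append(item)

# Equivalent list comprehension, as the original wording suggested
data_comprehension = [parse_product(product) for product in selected_products]

assert data == data_comprehension
```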

@@ -300,9 +304,13 @@ Now we'll pass the base URL to the function in the main body of our program:

 ```py
 listing_url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
-soup = download(listing_url)
-# highlight-next-line
-data = [parse_product(product, listing_url) for product in soup.select(".product-item")]
+listing_soup = download(listing_url)
+
+data = []
+for product in listing_soup.select(".product-item"):
+    # highlight-next-line
+    item = parse_product(product, listing_url)
+    data.append(item)
 ```

 When we run the scraper now, we should see full URLs in our exports:
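
Note: the hunk above passes the listing page's URL into `parse_product()` so the scraper can turn relative product links into absolute ones. The diff doesn't show the inside of `parse_product()`, but this kind of resolution is typically done with `urllib.parse.urljoin`; a minimal sketch under that assumption:

```py
from urllib.parse import urljoin

listing_url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
relative_href = "/products/sony-xbr-950g"  # illustrative value of a product link's href

# Resolve the relative link against the page it was scraped from
print(urljoin(listing_url, relative_href))
# -> https://warehouse-theme-metal.myshopify.com/products/sony-xbr-950g
```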
