# feat: lessons about crawling and scraping product detail pages in Python #1244
`@@ -122,6 +122,14 @@ for product in soup.select(".product-item"):`
This program does the same as the one we already had, but its code is more concise.

:::note Fragile code
> **Review comment:** Explicitly addressing some previous concerns of @vdusek. I thought about it and I think it's better if we're upfront about this decision and the reader knows that this is expected. *(honzajavorek marked this conversation as resolved.)*
We assume that the selectors we pass to the `select()` or `select_one()` methods return at least one element. If they don't, calling `[0]` on an empty list or `.text` on `None` would crash the program. If you perform type checking on your Python program, the code examples above may even trigger warnings about this.

Not handling these cases allows us to keep the code examples more succinct. Additionally, if we expect the selectors to return elements but they suddenly don't, it usually means the website has changed since we wrote our scraper. Letting the program crash in such cases is a valid way to notify ourselves that we need to fix it.

:::
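
If you'd rather fail with a clear message than a bare `AttributeError`, a minimal sketch of a defensive helper could look like this. The helper name `select_strict` is our own invention, not part of the lesson's code:

```py
# A sketch only: raise a descriptive error instead of crashing on None.
def select_strict(soup, selector):
    element = soup.select_one(selector)
    if element is None:
        raise ValueError(f"Selector {selector!r} matched nothing; the page may have changed")
    return element
```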

## Precisely locating price

In the output we can see that the price isn't located precisely. For each product, our scraper also prints the text `Sale price`. Let's look at the HTML structure again. Each bit containing the price looks like this:
`@@ -199,8 +199,12 @@ def export_json(file, data):`

````diff
     json.dump(data, file, default=serialize, indent=2)
 
 listing_url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
-soup = download(listing_url)
-data = [parse_product(product) for product in soup.select(".product-item")]
+listing_soup = download(listing_url)
+
+data = []
+for product in listing_soup.select(".product-item"):
+    item = parse_product(product)
+    data.append(item)
 
 with open("products.csv", "w") as file:
     export_csv(file, data)
````

> **Review comment:** I made this change because this way the code is better suited for the newly added lessons.

`@@ -209,7 +213,7 @@ with open("products.json", "w") as file:`

````diff
     export_json(file, data)
 ```
 
-The program is much easier to read now. With the `parse_product()` function handy, we could also replace the convoluted loop with a [list comprehension](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions).
+The program is much easier to read now. With the `parse_product()` function handy, we could also replace the convoluted loop with one that only takes up four lines of code.
 
 :::tip Refactoring
````
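
For reference, the list comprehension that the removed sentence linked to would collapse the new loop back into a single line. A sketch using the lesson's own names:

```py
# Equivalent to the four-line loop above, written as a list comprehension.
data = [parse_product(product) for product in listing_soup.select(".product-item")]
```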
`@@ -300,9 +304,13 @@ Now we'll pass the base URL to the function in the main body of our program:`

````diff
 ```py
 listing_url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
-soup = download(listing_url)
-# highlight-next-line
-data = [parse_product(product, listing_url) for product in soup.select(".product-item")]
+listing_soup = download(listing_url)
+
+data = []
+for product in listing_soup.select(".product-item"):
+    # highlight-next-line
+    item = parse_product(product, listing_url)
+    data.append(item)
 ```
 
 When we run the scraper now, we should see full URLs in our exports:
````
`@@ -1,15 +1,305 @@`

```diff
 ---
 title: Crawling websites with Python
 sidebar_label: Crawling websites
-description: TODO
+description: Lesson about building a Python application for watching prices. Using the HTTPX library to follow links to individual product pages.
 sidebar_position: 10
 slug: /scraping-basics-python/crawling
 ---
 
-:::danger Work in progress
+import Exercises from './_exercises.mdx';
 
-This course is incomplete. As we work on adding new lessons, we would love to hear your feedback. You can comment right here under each page or [file a GitHub Issue](https://github.com/apify/apify-docs/issues) to discuss a problem.
+**In this lesson, we'll follow links to individual product pages. We'll use HTTPX to download them and BeautifulSoup to process them.**
 
-This particular page is a placeholder for several lessons which should teach crawling.
+---
```

The rest of the hunk adds the new lesson content:

In previous lessons we've managed to download the HTML code of a single page, parse it with BeautifulSoup, and extract relevant data from it. We'll do the same now for each of the products.

Thanks to the refactoring, we have functions ready for each of the tasks, so we won't need to repeat ourselves in our code. This is what you should see in your editor now:
```py
import httpx
from bs4 import BeautifulSoup
from decimal import Decimal
import csv
import json
from urllib.parse import urljoin

def download(url):
    response = httpx.get(url)
    response.raise_for_status()

    html_code = response.text
    return BeautifulSoup(html_code, "html.parser")

def parse_product(product, base_url):
    title_element = product.select_one(".product-item__title")
    title = title_element.text.strip()
    url = urljoin(base_url, title_element["href"])

    price_text = (
        product
        .select_one(".price")
        .contents[-1]
        .strip()
        .replace("$", "")
        .replace(",", "")
    )
    if price_text.startswith("From "):
        min_price = Decimal(price_text.removeprefix("From "))
        price = None
    else:
        min_price = Decimal(price_text)
        price = min_price

    return {"title": title, "min_price": min_price, "price": price, "url": url}

def export_csv(file, data):
    fieldnames = list(data[0].keys())
    writer = csv.DictWriter(file, fieldnames=fieldnames)
    writer.writeheader()
    for row in data:
        writer.writerow(row)

def export_json(file, data):
    def serialize(obj):
        if isinstance(obj, Decimal):
            return str(obj)
        raise TypeError("Object not JSON serializable")

    json.dump(data, file, default=serialize, indent=2)

listing_url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
listing_soup = download(listing_url)

data = []
for product in listing_soup.select(".product-item"):
    item = parse_product(product, listing_url)
    data.append(item)

with open("products.csv", "w") as file:
    export_csv(file, data)

with open("products.json", "w") as file:
    export_json(file, data)
```
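
To make the price-parsing branch concrete, here's our own quick illustration (not from the lesson) of the two forms the cleaned-up price text can take:

```py
from decimal import Decimal

# Our own illustration: after "$" and "," have been stripped, the price text
# is either a plain number or prefixed with "From ".
for price_text in ["74.95", "From 1398.00"]:
    if price_text.startswith("From "):
        min_price = Decimal(price_text.removeprefix("From "))
        price = None  # exact price unknown until we visit the product page
    else:
        min_price = Decimal(price_text)
        price = min_price
    print(price_text, "->", min_price, price)
# 74.95 -> 74.95 74.95
# From 1398.00 -> 1398.00 None
```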

## Extracting vendor name

Each product URL points to a so-called _product detail page_, or PDP. If we open one of the product URLs in the browser, e.g. the one about [Sony XBR-950G BRAVIA](https://warehouse-theme-metal.myshopify.com/products/sony-xbr-65x950g-65-class-64-5-diag-bravia-4k-hdr-ultra-hd-tv), we can see that it contains a vendor name, [SKU](https://en.wikipedia.org/wiki/Stock_keeping_unit), number of reviews, product images, product variants, stock availability, description, and perhaps more.

*(Screenshot: product detail page of the Sony XBR-950G BRAVIA)*

Depending on what's valuable for our use case, we can now use the same techniques as in previous lessons to extract any of the above. As a demonstration, let's scrape the vendor name. In browser DevTools, we can see that the HTML around the vendor name has the following structure:
```html
<div class="product-meta">
  <h1 class="product-meta__title heading h1">
    Sony XBR-950G BRAVIA 4K HDR Ultra HD TV
  </h1>
  <div class="product-meta__label-list">
    ...
  </div>
  <div class="product-meta__reference">
    <!-- highlight-next-line -->
    <a class="product-meta__vendor link link--accented" href="/collections/sony">
      <!-- highlight-next-line -->
      Sony
      <!-- highlight-next-line -->
    </a>
    <span class="product-meta__sku">
      SKU:
      <span class="product-meta__sku-number">SON-985594-XBR-65</span>
    </span>
  </div>
  <a href="#product-reviews" class="product-meta__reviews-badge link" data-offset="30">
    <div class="rating">
      <div class="rating__stars" role="img" aria-label="4.0 out of 5.0 stars">
        ...
      </div>
      <span class="rating__caption">3 reviews</span>
    </div>
  </a>
  ...
</div>
```

It looks like using a CSS selector to locate the element with the `product-meta__vendor` class, and then extracting its text, should be enough to get the vendor name as a string:

```py
vendor = product_soup.select_one(".product-meta__vendor").text.strip()
```

But where do we put this line in our program?

## Crawling product detail pages

In the `data` loop we're already going through all the products. Let's expand it to include downloading the product detail page, parsing it, extracting the vendor's name, and adding it as a new key in the item's dictionary:
```py
...

listing_url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
listing_soup = download(listing_url)

data = []
for product in listing_soup.select(".product-item"):
    item = parse_product(product, listing_url)
    # highlight-next-line
    product_soup = download(item["url"])
    # highlight-next-line
    item["vendor"] = product_soup.select_one(".product-meta__vendor").text.strip()
    data.append(item)

...
```

If you run the program now, it'll take longer to finish since it's making 24 more HTTP requests. But in the end, it should produce exports with a new field containing the vendor's name:
<!-- eslint-skip -->
```json title=products.json
[
  {
    "title": "JBL Flip 4 Waterproof Portable Bluetooth Speaker",
    "min_price": "74.95",
    "price": "74.95",
    "url": "https://warehouse-theme-metal.myshopify.com/products/jbl-flip-4-waterproof-portable-bluetooth-speaker",
    "vendor": "JBL"
  },
  {
    "title": "Sony XBR-950G BRAVIA 4K HDR Ultra HD TV",
    "min_price": "1398.00",
    "price": null,
    "url": "https://warehouse-theme-metal.myshopify.com/products/sony-xbr-65x950g-65-class-64-5-diag-bravia-4k-hdr-ultra-hd-tv",
    "vendor": "Sony"
  },
  ...
]
```
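
As an aside of our own (not part of the lesson): if the extra requests feel slow, reusing a single connection with `httpx.Client` usually speeds up repeated downloads from the same host. A minimal sketch, assuming we're willing to change `download()` to accept a client:

```py
import httpx
from bs4 import BeautifulSoup

# Sketch only: reuse one HTTP connection for all requests to the same host.
def download(client, url):
    response = client.get(url)
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")

with httpx.Client() as client:
    listing_soup = download(client, "https://warehouse-theme-metal.myshopify.com/collections/sales")
```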

## Extracting price

Scraping the vendor's name is nice, but the main reason we started checking the detail pages in the first place was to figure out how to get a price for each product. From the product listing, we could only scrape the min price, and remember: we're building a Python app to track prices!

Looking at the [Sony XBR-950G BRAVIA](https://warehouse-theme-metal.myshopify.com/products/sony-xbr-65x950g-65-class-64-5-diag-bravia-4k-hdr-ultra-hd-tv), it's clear that the listing only shows min prices, because some products have variants, each with a different price. And different stock availability. And different SKUs…

*(Screenshot: product variants on the Sony XBR-950G BRAVIA detail page)*

In the next lesson, we'll scrape the product detail pages so that each product variant is represented as a separate item in our dataset.

---

<Exercises />

### Scrape calling codes of African countries

This is a follow-up to an exercise from the previous lesson, so feel free to reuse your code. Scrape links to Wikipedia pages for all African states and territories. Follow each link and extract the calling code from the info table. Print the URL and the calling code for each country. Start with this URL:

```text
https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_in_Africa
```

Your program should print the following:
```text
https://en.wikipedia.org/wiki/Algeria +213
https://en.wikipedia.org/wiki/Angola +244
https://en.wikipedia.org/wiki/Benin +229
https://en.wikipedia.org/wiki/Botswana +267
https://en.wikipedia.org/wiki/Burkina_Faso +226
https://en.wikipedia.org/wiki/Burundi None
https://en.wikipedia.org/wiki/Cameroon +237
...
```

Hint: Locating cells in tables is sometimes easier if you know how to [navigate up](https://beautiful-soup-4.readthedocs.io/en/latest/index.html#going-up) in the HTML element soup.
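
If "navigating up" is new to you, here's our own toy illustration (not part of the exercise) of how `.parent` climbs from a matched element to its enclosing row:

```py
from bs4 import BeautifulSoup

# Toy example of "going up": from a <th>, .parent reaches the enclosing <tr>.
soup = BeautifulSoup(
    "<tr><th class='infobox-label'>Calling code</th>"
    "<td class='infobox-data'>+213</td></tr>",
    "html.parser",
)
label = soup.select_one("th.infobox-label")
print(label.parent.select_one("td.infobox-data").text)  # +213
```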
<details>
<summary>Solution</summary>

```py
import httpx
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def download(url):
    response = httpx.get(url)
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")

def parse_calling_code(soup):
    for label in soup.select("th.infobox-label"):
        if label.text.strip() == "Calling code":
            data = label.parent.select_one("td.infobox-data")
            return data.text.strip()
    return None

listing_url = "https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_in_Africa"
listing_soup = download(listing_url)
for name_cell in listing_soup.select(".wikitable tr td:nth-child(3)"):
    link = name_cell.select_one("a")
    country_url = urljoin(listing_url, link["href"])
    country_soup = download(country_url)
    calling_code = parse_calling_code(country_soup)
    print(country_url, calling_code)
```

</details>

### Scrape authors of F1 news articles

This is a follow-up to an exercise from the previous lesson, so feel free to reuse your code. Scrape links to the Guardian's latest F1 news articles. For each article, follow the link and extract both the author's name and the article's title. Print the author's name and the title for all the articles. Start with this URL:

```text
https://www.theguardian.com/sport/formulaone
```

Your program should print something like this:

```text
Daniel Harris: Sports quiz of the week: Johan Neeskens, Bond and airborne antics
Colin Horgan: The NHL is getting its own Drive to Survive. But could it backfire?
Reuters: US GP ticket sales ‘took off’ after Max Verstappen stopped winning in F1
Giles Richards: Liam Lawson gets F1 chance to replace Pérez alongside Verstappen at Red Bull
PA Media: Lewis Hamilton reveals lifelong battle with depression after school bullying
...
```

Hints:

- You can use [attribute selectors](https://developer.mozilla.org/en-US/docs/Web/CSS/Attribute_selectors) to select HTML elements based on their attribute values, as the sketch after this list shows.
- Sometimes a person authors the article, but other times it's contributed by a news agency.
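
For instance, here's our own toy illustration (not part of the exercise) of an attribute selector in action:

```py
from bs4 import BeautifulSoup

# Toy illustration: a[rel="author"] matches only links whose rel is "author".
soup = BeautifulSoup(
    '<a rel="author" href="/profile/jane">Jane Doe</a>'
    '<a href="/about">About</a>',
    "html.parser",
)
print(soup.select_one('a[rel="author"]').text)  # Jane Doe
```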
<details>
<summary>Solution</summary>

```py
import httpx
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def download(url):
    response = httpx.get(url)
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")

def parse_author(article_soup):
    link = article_soup.select_one('aside a[rel="author"]')
    if link:
        return link.text.strip()
    address = article_soup.select_one('aside address')
    if address:
        return address.text.strip()
    return None

listing_url = "https://www.theguardian.com/sport/formulaone"
listing_soup = download(listing_url)
for item in listing_soup.select("#maincontent ul li"):
    link = item.select_one("a")
    article_url = urljoin(listing_url, link["href"])
    article_soup = download(article_url)
    title = article_soup.select_one("h1").text.strip()
    author = parse_author(article_soup)
    print(f"{author}: {title}")
```

</details>

> **Review comment:** I decided to do this because this is like... the very first exercise, and I don't want the reader to end up with an error. I don't want to direct them to a different course when they're touching the internet for the first time with their tiny program. So hoping for better results with Ali, which is a real-world e-commerce website as well, but doesn't seem to give a damn about scraping.