---
title: Crawling websites with Python
sidebar_label: Crawling websites
description: Lesson about building a Python application for watching prices. Using the HTTPX library to follow links to individual product pages.
sidebar_position: 10
slug: /scraping-basics-python/crawling
---

import Exercises from './_exercises.mdx';

**In this lesson, we'll follow links to individual product pages. We'll use HTTPX to download them and BeautifulSoup to process them.**

---

In previous lessons, we managed to download the HTML code of a single page, parse it with BeautifulSoup, and extract relevant data from it. Now we'll do the same for each of the products.

Thanks to the refactoring, we have a function ready for each of these tasks, so we won't need to repeat ourselves in our code. This is what you should see in your editor now:
```python
import httpx
from bs4 import BeautifulSoup
from decimal import Decimal
import csv
import json
from urllib.parse import urljoin

def download(url):
    response = httpx.get(url)
    response.raise_for_status()

    html_code = response.text
    return BeautifulSoup(html_code, "html.parser")

def parse_product(product, base_url):
    title_element = product.select_one(".product-item__title")
    title = title_element.text.strip()
    url = urljoin(base_url, title_element["href"])

    price_text = (
        product
        .select_one(".price")
        .contents[-1]
        .strip()
        .replace("$", "")
        .replace(",", "")
    )
    if price_text.startswith("From "):
        min_price = Decimal(price_text.removeprefix("From "))
        price = None
    else:
        min_price = Decimal(price_text)
        price = min_price

    return {"title": title, "min_price": min_price, "price": price, "url": url}

def export_csv(file, data):
    fieldnames = list(data[0].keys())
    writer = csv.DictWriter(file, fieldnames=fieldnames)
    writer.writeheader()
    for row in data:
        writer.writerow(row)

def export_json(file, data):
    def serialize(obj):
        if isinstance(obj, Decimal):
            return str(obj)
        raise TypeError("Object not JSON serializable")

    json.dump(data, file, default=serialize, indent=2)

listing_url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
listing_soup = download(listing_url)
data = [
    parse_product(product, listing_url)
    for product in listing_soup.select(".product-item")
]

with open("products.csv", "w") as file:
    export_csv(file, data)

with open("products.json", "w") as file:
    export_json(file, data)
```
## Crawling product URLs

In a new loop below the list comprehension, we'll go through the product URLs, download and parse each of the pages, and extract new data, such as the name of the vendor. Then we'll save that data to the `product` dictionary under a new key:
```python
...

listing_url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
listing_soup = download(listing_url)
data = [
    parse_product(product, listing_url)
    for product in listing_soup.select(".product-item")
]

# highlight-next-line
for product in data:
    # highlight-next-line
    product_soup = download(product["url"])
    # highlight-next-line
    product["vendor"] = product_soup.select_one(".product-meta__vendor").text.strip()

...
```
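We found the `.product-meta__vendor` class the same way as all the previous selectors, by inspecting one of the product pages in browser DevTools.

Note that `select_one()` returns `None` when it finds no match, and calling `.text` on `None` raises an `AttributeError`. The code above assumes every product page has a vendor element. A more defensive variant of the loop could look like this sketch:

```python
for product in data:
    product_soup = download(product["url"])
    vendor_element = product_soup.select_one(".product-meta__vendor")
    # Store None instead of crashing when a page lacks a vendor element
    product["vendor"] = vendor_element.text.strip() if vendor_element else None
```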
If you run the program now, it will take longer to finish, but it should produce exports with a new field containing the vendor:

<!-- eslint-skip -->
```json title=products.json
[
  {
    "title": "JBL Flip 4 Waterproof Portable Bluetooth Speaker",
    "min_price": "74.95",
    "price": "74.95",
    "url": "https://warehouse-theme-metal.myshopify.com/products/jbl-flip-4-waterproof-portable-bluetooth-speaker",
    "vendor": "JBL"
  },
  {
    "title": "Sony XBR-950G BRAVIA 4K HDR Ultra HD TV",
    "min_price": "1398.00",
    "price": null,
    "url": "https://warehouse-theme-metal.myshopify.com/products/sony-xbr-65x950g-65-class-64-5-diag-bravia-4k-hdr-ultra-hd-tv",
    "vendor": "Sony"
  },
  ...
]
```
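The program is slower now because it makes a separate HTTP request for every single product. If you want to see what it's doing while you wait, one option is to add simple progress reporting to the loop. This is just an optional sketch, not something the lesson's code requires:

```python
for index, product in enumerate(data, start=1):
    # Report which product page is being downloaded and how far along we are
    print(f"Downloading {index}/{len(data)}: {product['url']}")
    product_soup = download(product["url"])
    product["vendor"] = product_soup.select_one(".product-meta__vendor").text.strip()
```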
<!--
- show image of how we figured out the vendor or have a note about devtools

caveats:
- all the info in the listing is already at the product page, so it's a bit redundant to scrape the products in the listing, we could just scrape the links
- scrape price for the variants

caveats and reasons for framework:
- it's slow
- logging
- a lot of boilerplate code
- anti-scraping protection
- browser crawling support
-->
:::danger Work in progress

This course is incomplete. As we work on adding new lessons, we would love to hear your feedback. You can comment right here under each page or [file a GitHub Issue](https://github.com/apify/apify-docs/issues) to discuss a problem.

:::