diff --git a/sources/academy/webscraping/scraping_basics_python/04_downloading_html.md b/sources/academy/webscraping/scraping_basics_python/04_downloading_html.md index 4928fec652..e664de2af4 100644 --- a/sources/academy/webscraping/scraping_basics_python/04_downloading_html.md +++ b/sources/academy/webscraping/scraping_basics_python/04_downloading_html.md @@ -140,12 +140,12 @@ Letting our program visibly crash on error is enough for our purposes. Now, let' -### Scrape Amazon +### Scrape AliExpress -Download HTML of a product listing page, but this time from a real world e-commerce website. For example this page with Amazon search results: +Download HTML of a product listing page, but this time from a real-world e-commerce website. For example, this page with AliExpress search results: ```text -https://www.amazon.com/s?k=darth+vader +https://www.aliexpress.com/w/wholesale-darth-vader.html ```
@@ -154,13 +154,12 @@ https://www.amazon.com/s?k=darth+vader ```py import httpx - url = "https://www.amazon.com/s?k=darth+vader" + url = "https://www.aliexpress.com/w/wholesale-darth-vader.html" response = httpx.get(url) response.raise_for_status() print(response.text) ``` - If you get `Server error '503 Service Unavailable'`, that's just Amazon's anti-scraping protections. You can learn about how to overcome those in our [Anti-scraping protections](../anti_scraping/index.md) course.
### Save downloaded HTML as a file diff --git a/sources/academy/webscraping/scraping_basics_python/06_locating_elements.md b/sources/academy/webscraping/scraping_basics_python/06_locating_elements.md index 3475fa7f2d..78fbda4ab6 100644 --- a/sources/academy/webscraping/scraping_basics_python/06_locating_elements.md +++ b/sources/academy/webscraping/scraping_basics_python/06_locating_elements.md @@ -122,6 +122,14 @@ for product in soup.select(".product-item"): This program does the same as the one we already had, but its code is more concise. +:::note Fragile code + +We assume that the selectors we pass to the `select()` or `select_one()` methods return at least one element. If they don't, calling `[0]` on an empty list or `.text` on `None` would crash the program. If you perform type checking on your Python program, the code examples above may even trigger warnings about this. + +Not handling these cases allows us to keep the code examples more succinct. Additionally, if we expect the selectors to return elements but they suddenly don't, it usually means the website has changed since we wrote our scraper. Letting the program crash in such cases is a valid way to notify ourselves that we need to fix it. + +::: + ## Precisely locating price In the output we can see that the price isn't located precisely. For each product, our scraper also prints the text `Sale price`. Let's look at the HTML structure again. Each bit containing the price looks like this: diff --git a/sources/academy/webscraping/scraping_basics_python/09_getting_links.md b/sources/academy/webscraping/scraping_basics_python/09_getting_links.md index 69c9bcc487..32e671ad8a 100644 --- a/sources/academy/webscraping/scraping_basics_python/09_getting_links.md +++ b/sources/academy/webscraping/scraping_basics_python/09_getting_links.md @@ -199,8 +199,12 @@ def export_json(file, data): json.dump(data, file, default=serialize, indent=2) listing_url = "https://warehouse-theme-metal.myshopify.com/collections/sales" -soup = download(listing_url) -data = [parse_product(product) for product in soup.select(".product-item")] +listing_soup = download(listing_url) + +data = [] +for product in listing_soup.select(".product-item"): + item = parse_product(product) + data.append(item) with open("products.csv", "w") as file: export_csv(file, data) @@ -209,7 +213,7 @@ with open("products.json", "w") as file: export_json(file, data) ``` -The program is much easier to read now. With the `parse_product()` function handy, we could also replace the convoluted loop with a [list comprehension](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions). +The program is much easier to read now. With the `parse_product()` function handy, we could also replace the convoluted loop with one that only takes up four lines of code. 
:::tip Refactoring @@ -300,9 +304,13 @@ Now we'll pass the base URL to the function in the main body of our program: ```py listing_url = "https://warehouse-theme-metal.myshopify.com/collections/sales" -soup = download(listing_url) -# highlight-next-line -data = [parse_product(product, listing_url) for product in soup.select(".product-item")] +listing_soup = download(listing_url) + +data = [] +for product in listing_soup.select(".product-item"): + # highlight-next-line + item = parse_product(product, listing_url) + data.append(item) ``` When we run the scraper now, we should see full URLs in our exports: diff --git a/sources/academy/webscraping/scraping_basics_python/10_crawling.md b/sources/academy/webscraping/scraping_basics_python/10_crawling.md index 39b5083e87..4de67279f0 100644 --- a/sources/academy/webscraping/scraping_basics_python/10_crawling.md +++ b/sources/academy/webscraping/scraping_basics_python/10_crawling.md @@ -1,15 +1,305 @@ --- title: Crawling websites with Python sidebar_label: Crawling websites -description: TODO +description: Lesson about building a Python application for watching prices. Using the HTTPX library to follow links to individual product pages. sidebar_position: 10 slug: /scraping-basics-python/crawling --- -:::danger Work in progress +import Exercises from './_exercises.mdx'; -This course is incomplete. As we work on adding new lessons, we would love to hear your feedback. You can comment right here under each page or [file a GitHub Issue](https://github.com/apify/apify-docs/issues) to discuss a problem. +**In this lesson, we'll follow links to individual product pages. We'll use HTTPX to download them and BeautifulSoup to process them.** -This particular page is a placeholder for several lessons which should teach crawling. +--- + +In previous lessons we've managed to download the HTML code of a single page, parse it with BeautifulSoup, and extract relevant data from it. We'll do the same now for each of the products. + +Thanks to the refactoring, we have functions ready for each of the tasks, so we won't need to repeat ourselves in our code. 
This is what you should see in your editor now: + +```py +import httpx +from bs4 import BeautifulSoup +from decimal import Decimal +import csv +import json +from urllib.parse import urljoin + +def download(url): + response = httpx.get(url) + response.raise_for_status() + + html_code = response.text + return BeautifulSoup(html_code, "html.parser") + +def parse_product(product, base_url): + title_element = product.select_one(".product-item__title") + title = title_element.text.strip() + url = urljoin(base_url, title_element["href"]) + + price_text = ( + product + .select_one(".price") + .contents[-1] + .strip() + .replace("$", "") + .replace(",", "") + ) + if price_text.startswith("From "): + min_price = Decimal(price_text.removeprefix("From ")) + price = None + else: + min_price = Decimal(price_text) + price = min_price + + return {"title": title, "min_price": min_price, "price": price, "url": url} + +def export_csv(file, data): + fieldnames = list(data[0].keys()) + writer = csv.DictWriter(file, fieldnames=fieldnames) + writer.writeheader() + for row in data: + writer.writerow(row) + +def export_json(file, data): + def serialize(obj): + if isinstance(obj, Decimal): + return str(obj) + raise TypeError("Object not JSON serializable") + + json.dump(data, file, default=serialize, indent=2) + +listing_url = "https://warehouse-theme-metal.myshopify.com/collections/sales" +listing_soup = download(listing_url) + +data = [] +for product in listing_soup.select(".product-item"): + item = parse_product(product, listing_url) + data.append(item) + +with open("products.csv", "w") as file: + export_csv(file, data) + +with open("products.json", "w") as file: + export_json(file, data) +``` + +## Extracting vendor name + +Each product URL points to a so-called _product detail page_, or PDP. If we open one of the product URLs in the browser, e.g. the one about [Sony XBR-950G BRAVIA](https://warehouse-theme-metal.myshopify.com/products/sony-xbr-65x950g-65-class-64-5-diag-bravia-4k-hdr-ultra-hd-tv), we can see that it contains a vendor name, [SKU](https://en.wikipedia.org/wiki/Stock_keeping_unit), number of reviews, product images, product variants, stock availability, description, and perhaps more. + +![Product detail page](./images/pdp.png) + +Depending on what's valuable for our use case, we can now use the same techniques as in previous lessons to extract any of the above. As a demonstration, let's scrape the vendor name. In browser DevTools, we can see that the HTML around the vendor name has the following structure: + +```html +
+<div class="product-meta">
+  <h1 class="product-meta__title">
+    Sony XBR-950G BRAVIA 4K HDR Ultra HD TV
+  </h1>
+  <div class="product-meta__label-list">
+    ...
+  </div>
+  <div class="product-meta__reference">
+    <a class="product-meta__vendor" href="/collections/sony" title="Sony">
+      Sony
+    </a>
+    <span class="product-meta__sku">
+      SKU:
+      <span class="product-meta__sku-number">SON-985594-XBR-65</span>
+    </span>
+  </div>
+  <a class="product-meta__reviews-badge" href="#product-reviews">
+    <span class="rating__caption">3 reviews</span>
+  </a>
+  <div class="product-meta__description">
+    ...
+  </div>
+</div>
+``` + +It looks like using a CSS selector to locate the element with the `product-meta__vendor` class, and then extracting its text, should be enough to get the vendor name as a string: + +```py +vendor = product_soup.select_one(".product-meta__vendor").text.strip() +``` + +But where do we put this line in our program? + +## Crawling product detail pages + +In the `data` loop we're already going through all the products. Let's expand it to include downloading the product detail page, parsing it, extracting the vendor's name, and adding it as a new key in the item's dictionary: + +```py +... + +listing_url = "https://warehouse-theme-metal.myshopify.com/collections/sales" +listing_soup = download(listing_url) + +data = [] +for product in listing_soup.select(".product-item"): + item = parse_product(product, listing_url) + # highlight-next-line + product_soup = download(item["url"]) + # highlight-next-line + item["vendor"] = product_soup.select_one(".product-meta__vendor").text.strip() + data.append(item) + +... +``` + +If you run the program now, it'll take longer to finish since it's making 24 more HTTP requests. But in the end, it should produce exports with a new field containing the vendor's name: + + +```json title=products.json +[ + { + "title": "JBL Flip 4 Waterproof Portable Bluetooth Speaker", + "min_price": "74.95", + "price": "74.95", + "url": "https://warehouse-theme-metal.myshopify.com/products/jbl-flip-4-waterproof-portable-bluetooth-speaker", + "vendor": "JBL" + }, + { + "title": "Sony XBR-950G BRAVIA 4K HDR Ultra HD TV", + "min_price": "1398.00", + "price": null, + "url": "https://warehouse-theme-metal.myshopify.com/products/sony-xbr-65x950g-65-class-64-5-diag-bravia-4k-hdr-ultra-hd-tv", + "vendor": "Sony" + }, + ... +] +``` + +## Extracting price + +Scraping the vendor's name is nice, but the main reason we started checking the detail pages in the first place was to figure out how to get a price for each product. From the product listing, we could only scrape the min price, and remember—we’re building a Python app to track prices! + +Looking at the [Sony XBR-950G BRAVIA](https://warehouse-theme-metal.myshopify.com/products/sony-xbr-65x950g-65-class-64-5-diag-bravia-4k-hdr-ultra-hd-tv), it's clear that the listing only shows min prices, because some products have variants, each with a different price. And different stock availability. And different SKUs… + +![Morpheus revealing the existence of product variants](images/variants.png) + +In the next lesson, we'll scrape the product detail pages so that each product variant is represented as a separate item in our dataset. + +--- + + + +### Scrape calling codes of African countries + +This is a follow-up to an exercise from the previous lesson, so feel free to reuse your code. Scrape links to Wikipedia pages for all African states and territories. Follow each link and extract the calling code from the info table. Print the URL and the calling code for each country. Start with this URL: + +```text +https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_in_Africa +``` + +Your program should print the following: + +```text +https://en.wikipedia.org/wiki/Algeria +213 +https://en.wikipedia.org/wiki/Angola +244 +https://en.wikipedia.org/wiki/Benin +229 +https://en.wikipedia.org/wiki/Botswana +267 +https://en.wikipedia.org/wiki/Burkina_Faso +226 +https://en.wikipedia.org/wiki/Burundi None +https://en.wikipedia.org/wiki/Cameroon +237 +... 
+``` + +Hint: Locating cells in tables is sometimes easier if you know how to [navigate up](https://beautiful-soup-4.readthedocs.io/en/latest/index.html#going-up) in the HTML element soup. + +
+ Solution + + ```py + import httpx + from bs4 import BeautifulSoup + from urllib.parse import urljoin + + def download(url): + response = httpx.get(url) + response.raise_for_status() + return BeautifulSoup(response.text, "html.parser") + + def parse_calling_code(soup): + for label in soup.select("th.infobox-label"): + if label.text.strip() == "Calling code": + data = label.parent.select_one("td.infobox-data") + return data.text.strip() + return None + + listing_url = "https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_in_Africa" + listing_soup = download(listing_url) + for name_cell in listing_soup.select(".wikitable tr td:nth-child(3)"): + link = name_cell.select_one("a") + country_url = urljoin(listing_url, link["href"]) + country_soup = download(country_url) + calling_code = parse_calling_code(country_soup) + print(country_url, calling_code) + ``` + +
+ +### Scrape authors of F1 news articles + +This is a follow-up to an exercise from the previous lesson, so feel free to reuse your code. Scrape links to the Guardian's latest F1 news articles. For each article, follow the link and extract both the author's name and the article's title. Print the author's name and the title for all the articles. Start with this URL: + +```text +https://www.theguardian.com/sport/formulaone +``` + +Your program should print something like this: + +```text +Daniel Harris: Sports quiz of the week: Johan Neeskens, Bond and airborne antics +Colin Horgan: The NHL is getting its own Drive to Survive. But could it backfire? +Reuters: US GP ticket sales ‘took off’ after Max Verstappen stopped winning in F1 +Giles Richards: Liam Lawson gets F1 chance to replace Pérez alongside Verstappen at Red Bull +PA Media: Lewis Hamilton reveals lifelong battle with depression after school bullying +... +``` + +Hints: + +- You can use [attribute selectors](https://developer.mozilla.org/en-US/docs/Web/CSS/Attribute_selectors) to select HTML elements based on their attribute values. +- Sometimes a person authors the article, but other times it's contributed by a news agency. + +
+ Solution + + ```py + import httpx + from bs4 import BeautifulSoup + from urllib.parse import urljoin + + def download(url): + response = httpx.get(url) + response.raise_for_status() + return BeautifulSoup(response.text, "html.parser") + + def parse_author(article_soup): + link = article_soup.select_one('aside a[rel="author"]') + if link: + return link.text.strip() + address = article_soup.select_one('aside address') + if address: + return address.text.strip() + return None + + listing_url = "https://www.theguardian.com/sport/formulaone" + listing_soup = download(listing_url) + for item in listing_soup.select("#maincontent ul li"): + link = item.select_one("a") + article_url = urljoin(listing_url, link["href"]) + article_soup = download(article_url) + title = article_soup.select_one("h1").text.strip() + author = parse_author(article_soup) + print(f"{author}: {title}") + ``` -::: +
diff --git a/sources/academy/webscraping/scraping_basics_python/11_scraping_variants.md b/sources/academy/webscraping/scraping_basics_python/11_scraping_variants.md new file mode 100644 index 0000000000..677414b310 --- /dev/null +++ b/sources/academy/webscraping/scraping_basics_python/11_scraping_variants.md @@ -0,0 +1,420 @@ +--- +title: Scraping product variants with Python +sidebar_label: Scraping product variants +description: Lesson about building a Python application for watching prices. Using browser DevTools to figure out how to extract product variants and exporting them as separate items. +sidebar_position: 11 +slug: /scraping-basics-python/scraping-variants +--- + +import Exercises from './_exercises.mdx'; + +**In this lesson, we'll scrape the product detail pages to represent each product variant as a separate item in our dataset.** + +--- + +We'll need to figure out how to extract variants from the product detail page, and then change how we add items to the data list so we can add multiple items after scraping one product URL. + +## Locating variants + +First, let's extract information about the variants. If we go to [Sony XBR-950G BRAVIA](https://warehouse-theme-metal.myshopify.com/products/sony-xbr-65x950g-65-class-64-5-diag-bravia-4k-hdr-ultra-hd-tv) and open the DevTools, we can see that the buttons for switching between variants look like this: + +```html +
+<div class="block-swatch-list">
+  <div class="block-swatch">
+    <input class="block-swatch__radio" type="radio" name="option-1" value="55&quot;" checked>
+    <label class="block-swatch__item">55"</label>
+  </div>
+  <div class="block-swatch">
+    <input class="block-swatch__radio" type="radio" name="option-1" value="65&quot;">
+    <label class="block-swatch__item">65"</label>
+  </div>
+</div>
+``` + +Nice! We can extract the variant names, but we also need to extract the price for each variant. Switching the variants using the buttons shows us that the HTML changes dynamically. This means the page uses JavaScript to display information about the variants. + +![Switching variants](images/variants-js.gif) + +If we can't find a workaround, we'd need our scraper to run JavaScript. That's not impossible. Scrapers can spin up their own browser instance and automate clicking on buttons, but it's slow and resource-intensive. Ideally, we want to stick to plain HTTP requests and Beautiful Soup as much as possible. + +After a bit of detective work, we notice that not far below the `block-swatch-list` there's also a block of HTML with a class `no-js`, which contains all the data! + +```html +
+<div class="product-form__option no-js">
+  <label class="product-form__option-name" for="option-1">Size</label>
+  <div class="select-wrapper">
+    <select id="option-1" class="select" name="option-1">
+      <option value="55&quot;" selected="selected">55" - $1,398.00</option>
+      <option value="65&quot;">65" - $2,198.00</option>
+    </select>
+  </div>
+</div>
+``` + +These elements aren't visible to regular visitors. They're there just in case JavaScript fails to work, otherwise they're hidden. This is a great find because it allows us to keep our scraper lightweight. + +## Extracting variants + +Using our knowledge of Beautiful Soup, we can locate the options and extract the data we need: + +```py +... + +listing_url = "https://warehouse-theme-metal.myshopify.com/collections/sales" +listing_soup = download(listing_url) + +data = [] +for product in listing_soup.select(".product-item"): + item = parse_product(product, listing_url) + product_soup = download(item["url"]) + vendor = product_soup.select_one(".product-meta__vendor").text.strip() + + if variants := product_soup.select(".product-form__option.no-js option"): + for variant in variants: + data.append(item | {"variant_name": variant.text.strip()}) + else: + item["variant_name"] = None + data.append(item) + +... +``` + +The CSS selector `.product-form__option.no-js` matches elements with both `product-form__option` and `no-js` classes. Then we're using the [descendant combinator](https://developer.mozilla.org/en-US/docs/Web/CSS/Descendant_combinator) to match all `option` elements somewhere inside the `.product-form__option.no-js` wrapper. + +Python dictionaries are mutable, so if we assigned the variant with `item["variant_name"] = ...`, we'd always overwrite the values. Instead of saving an item for each variant, we'd end up with the last variant repeated several times. To avoid this, we create a new dictionary for each variant and merge it with the `item` data before adding it to `data`. If we don't find any variants, we add the `item` as is, leaving the `variant_name` key empty. + +:::tip Python syntax you might not know + +Since Python 3.8, you can use `:=` to simplify checking if an assignment resulted in a non-empty value. It's called an _assignment expression_ or _walrus operator_. You can learn more about it in the [docs](https://docs.python.org/3/reference/expressions.html#assignment-expressions) or in the [proposal document](https://peps.python.org/pep-0572/). + +Since Python 3.9, you can use `|` to merge two dictionaries. If the [docs](https://docs.python.org/3/library/stdtypes.html#dict) aren't clear enough, check out the [proposal document](https://peps.python.org/pep-0584/) for more details. + +::: + +If you run the program, you should see 34 items in total. Some items don't have variants, so they won't have a variant name. However, they should still have a price set—our scraper should already have that info from the product listing page. + + +```json title=products.json +[ + ... + { + "variant_name": null, + "title": "Klipsch R-120SW Powerful Detailed Home Speaker - Unit", + "min_price": "324.00", + "price": "324.00", + "url": "https://warehouse-theme-metal.myshopify.com/products/klipsch-r-120sw-powerful-detailed-home-speaker-set-of-1", + "vendor": "Klipsch" + }, + ... +] +``` + +Some products will break into several items, each with a different variant name. We don't know their exact prices from the product listing, just the min price. In the next step, we should be able to parse the actual price from the variant name for those items. + + +```json title=products.json +[ + ... 
+ { + "variant_name": "Red - $178.00", + "title": "Sony XB-950B1 Extra Bass Wireless Headphones with App Control", + "min_price": "128.00", + "price": null, + "url": "https://warehouse-theme-metal.myshopify.com/products/sony-xb950-extra-bass-wireless-headphones-with-app-control", + "vendor": "Sony" + }, + { + "variant_name": "Black - $178.00", + "title": "Sony XB-950B1 Extra Bass Wireless Headphones with App Control", + "min_price": "128.00", + "price": null, + "url": "https://warehouse-theme-metal.myshopify.com/products/sony-xb950-extra-bass-wireless-headphones-with-app-control", + "vendor": "Sony" + }, + ... +] +``` + +Perhaps surprisingly, some products with variants will have the price field set. That's because the shop sells all variants of the product for the same price, so the product listing shows the price as a fixed amount, like _$74.95_, instead of _from $74.95_. + + +```json title=products.json +[ + ... + { + "variant_name": "Red - $74.95", + "title": "JBL Flip 4 Waterproof Portable Bluetooth Speaker", + "min_price": "74.95", + "price": "74.95", + "url": "https://warehouse-theme-metal.myshopify.com/products/jbl-flip-4-waterproof-portable-bluetooth-speaker", + "vendor": "JBL" + }, + ... +] +``` + +## Parsing price + +The items now contain the variant as text, which is good for a start, but we want the price to be in the `price` key. Let's introduce a new function to handle that: + +```py +def parse_variant(variant): + text = variant.text.strip() + name, price_text = text.split(" - ") + price = Decimal( + price_text + .replace("$", "") + .replace(",", "") + ) + return {"variant_name": name, "price": price} +``` + +First, we split the text into two parts, then we parse the price as a decimal number. This part is similar to what we already do for parsing product listing prices. The function returns a dictionary we can merge with `item`. + +## Saving price + +Now, if we use our new function, we should finally get a program that can scrape exact prices for all products, even if they have variants. 
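+Before adding the function to our loop, we can quickly check what it returns for a sample option. The following snippet is only an illustration, not part of the scraper. It assumes the `parse_variant()` function defined above and a made-up `<option>` element:
+
+```py
+from bs4 import BeautifulSoup
+
+# a made-up option element, similar to those in the no-js block
+soup = BeautifulSoup('<option value="red">Red - $178.00</option>', "html.parser")
+
+print(parse_variant(soup.select_one("option")))
+# {'variant_name': 'Red', 'price': Decimal('178.00')}
+```
+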
The whole code should look like this now: + +```py +import httpx +from bs4 import BeautifulSoup +from decimal import Decimal +import csv +import json +from urllib.parse import urljoin + +def download(url): + response = httpx.get(url) + response.raise_for_status() + + html_code = response.text + return BeautifulSoup(html_code, "html.parser") + +def parse_product(product, base_url): + title_element = product.select_one(".product-item__title") + title = title_element.text.strip() + url = urljoin(base_url, title_element["href"]) + + price_text = ( + product + .select_one(".price") + .contents[-1] + .strip() + .replace("$", "") + .replace(",", "") + ) + if price_text.startswith("From "): + min_price = Decimal(price_text.removeprefix("From ")) + price = None + else: + min_price = Decimal(price_text) + price = min_price + + return {"title": title, "min_price": min_price, "price": price, "url": url} + +def parse_variant(variant): + text = variant.text.strip() + name, price_text = text.split(" - ") + price = Decimal( + price_text + .replace("$", "") + .replace(",", "") + ) + return {"variant_name": name, "price": price} + +def export_csv(file, data): + fieldnames = list(data[0].keys()) + writer = csv.DictWriter(file, fieldnames=fieldnames) + writer.writeheader() + for row in data: + writer.writerow(row) + +def export_json(file, data): + def serialize(obj): + if isinstance(obj, Decimal): + return str(obj) + raise TypeError("Object not JSON serializable") + + json.dump(data, file, default=serialize, indent=2) + +listing_url = "https://warehouse-theme-metal.myshopify.com/collections/sales" +listing_soup = download(listing_url) + +data = [] +for product in listing_soup.select(".product-item"): + item = parse_product(product, listing_url) + product_soup = download(item["url"]) + vendor = product_soup.select_one(".product-meta__vendor").text.strip() + + if variants := product_soup.select(".product-form__option.no-js option"): + for variant in variants: + # highlight-next-line + data.append(item | parse_variant(variant)) + else: + item["variant_name"] = None + data.append(item) + +with open("products.csv", "w") as file: + export_csv(file, data) + +with open("products.json", "w") as file: + export_json(file, data) +``` + +Run the scraper and see for yourself if all the items in the data contain prices: + + +```json title=products.json +[ + ... + { + "variant_name": "Red", + "title": "Sony XB-950B1 Extra Bass Wireless Headphones with App Control", + "min_price": "128.00", + "price": "178.00", + "url": "https://warehouse-theme-metal.myshopify.com/products/sony-xb950-extra-bass-wireless-headphones-with-app-control", + "vendor": "Sony" + }, + { + "variant_name": "Black", + "title": "Sony XB-950B1 Extra Bass Wireless Headphones with App Control", + "min_price": "128.00", + "price": "178.00", + "url": "https://warehouse-theme-metal.myshopify.com/products/sony-xb950-extra-bass-wireless-headphones-with-app-control", + "vendor": "Sony" + }, + ... +] +``` + +Success! We managed to build a Python application for watching prices! + +Is this the end? Maybe! In the next lesson, we'll use a scraping framework to build the same application, but with less code, faster requests, and better visibility into what's happening while we wait for the program to finish. + +--- + + + +### Build a scraper for watching Python jobs + +You're able to build a scraper now, aren't you? Let's build another one! Python's official website has a [job board](https://www.python.org/jobs/). 
Scrape the job postings that match the following criteria: + +- Tagged as "Database" +- Posted within the last 60 days + +For each job posting found, use [`pp()`](https://docs.python.org/3/library/pprint.html#pprint.pp) to print a dictionary containing the following data: + +- Job title +- Company +- URL to the job posting +- Date of posting + +Your output should look something like this: + +```text +{'title': 'Senior Full Stack Developer', + 'company': 'Baserow', + 'url': 'https://www.python.org/jobs/7705/', + 'posted_on': datetime.date(2024, 9, 16)} +{'title': 'Senior Python Engineer', + 'company': 'Active Prime', + 'url': 'https://www.python.org/jobs/7699/', + 'posted_on': datetime.date(2024, 9, 5)} +... +``` + +You can find everything you need for working with dates and times in Python's [`datetime`](https://docs.python.org/3/library/datetime.html) module, including `date.today()`, `datetime.fromisoformat()`, `datetime.date()`, and `timedelta()`. + +
+ Solution + + After inspecting the job board, you'll notice that job postings tagged as "Database" have a dedicated URL. We'll use that as our starting point, which saves us from having to scrape and check the tags manually. + + ```py + from pprint import pp + import httpx + from bs4 import BeautifulSoup + from urllib.parse import urljoin + from datetime import datetime, date, timedelta + + today = date.today() + jobs_url = "https://www.python.org/jobs/type/database/" + response = httpx.get(jobs_url) + response.raise_for_status() + soup = BeautifulSoup(response.text, "html.parser") + + for job in soup.select(".list-recent-jobs li"): + link = job.select_one(".listing-company-name a") + + time = job.select_one(".listing-posted time") + posted_at = datetime.fromisoformat(time["datetime"]) + posted_on = posted_at.date() + posted_ago = today - posted_on + + if posted_ago <= timedelta(days=60): + title = link.text.strip() + company = list(job.select_one(".listing-company-name").stripped_strings)[-1] + url = urljoin(jobs_url, link["href"]) + pp({"title": title, "company": company, "url": url, "posted_on": posted_on}) + ``` + +
+ +### Find the shortest CNN article which made it to the Sports homepage + +Scrape the [CNN Sports](https://edition.cnn.com/sport) homepage. For each linked article, calculate its length in characters: + +- Locate the element that holds the main content of the article. +- Use [`get_text()`](https://beautiful-soup-4.readthedocs.io/en/latest/index.html#get-text) to extract all the content as plain text. +- Use `len()` to calculate the character count. + +Skip pages without text (like those that only have a video). Sort the results and print the URL of the shortest article that made it to the homepage. + +At the time of writing, the shortest article on the CNN Sports homepage is [about a donation to the Augusta National Golf Club](https://edition.cnn.com/2024/10/03/sport/masters-donation-hurricane-helene-relief-spt-intl/), which is just 1,642 characters long. + +
+ Solution + + ```py + import httpx + from bs4 import BeautifulSoup + from urllib.parse import urljoin + + def download(url): + response = httpx.get(url) + response.raise_for_status() + return BeautifulSoup(response.text, "html.parser") + + listing_url = "https://edition.cnn.com/sport" + listing_soup = download(listing_url) + + data = [] + for card in listing_soup.select(".layout__main .card"): + link = card.select_one(".container__link") + article_url = urljoin(listing_url, link["href"]) + article_soup = download(article_url) + if content := article_soup.select_one(".article__content"): + length = len(content.get_text()) + data.append((length, article_url)) + + data.sort() + shortest_item = data[0] + item_url = shortest_item[1] + print(item_url) + ``` + +
diff --git a/sources/academy/webscraping/scraping_basics_python/12_framework.md b/sources/academy/webscraping/scraping_basics_python/12_framework.md new file mode 100644 index 0000000000..4845c025b9 --- /dev/null +++ b/sources/academy/webscraping/scraping_basics_python/12_framework.md @@ -0,0 +1,30 @@ +--- +title: Using a scraping framework with Python +sidebar_label: Using a framework +description: Lesson about building a Python application for watching prices. Using the Crawlee framework to simplify creating a scraper. +sidebar_position: 12 +slug: /scraping-basics-python/framework +--- + +:::danger Work in progress + +This course is incomplete. As we work on adding new lessons, we would love to hear your feedback. You can comment right here under each page or [file a GitHub Issue](https://github.com/apify/apify-docs/issues) to discuss a problem. + +::: + + diff --git a/sources/academy/webscraping/scraping_basics_python/13_platform.md b/sources/academy/webscraping/scraping_basics_python/13_platform.md new file mode 100644 index 0000000000..28d66989ac --- /dev/null +++ b/sources/academy/webscraping/scraping_basics_python/13_platform.md @@ -0,0 +1,13 @@ +--- +title: Using a scraping platform with Python +sidebar_label: Using a platform +description: Lesson about building a Python application for watching prices. Using the Apify platform to deploy a scraper. +sidebar_position: 13 +slug: /scraping-basics-python/platform +--- + +:::danger Work in progress + +This course is incomplete. As we work on adding new lessons, we would love to hear your feedback. You can comment right here under each page or [file a GitHub Issue](https://github.com/apify/apify-docs/issues) to discuss a problem. + +::: diff --git a/sources/academy/webscraping/scraping_basics_python/images/pdp.png b/sources/academy/webscraping/scraping_basics_python/images/pdp.png new file mode 100644 index 0000000000..8f4825b0b1 Binary files /dev/null and b/sources/academy/webscraping/scraping_basics_python/images/pdp.png differ diff --git a/sources/academy/webscraping/scraping_basics_python/images/variants-js.gif b/sources/academy/webscraping/scraping_basics_python/images/variants-js.gif new file mode 100644 index 0000000000..f1f982afa5 Binary files /dev/null and b/sources/academy/webscraping/scraping_basics_python/images/variants-js.gif differ diff --git a/sources/academy/webscraping/scraping_basics_python/images/variants.png b/sources/academy/webscraping/scraping_basics_python/images/variants.png new file mode 100644 index 0000000000..17dcdd9d98 Binary files /dev/null and b/sources/academy/webscraping/scraping_basics_python/images/variants.png differ