Commit f779270

feat: complete the basic stub of the lesson
1 parent 09b3176 commit f779270

File tree

1 file changed: +256 -7 lines


sources/academy/webscraping/scraping_basics_python/13_platform.md

Lines changed: 256 additions & 7 deletions
@@ -54,7 +54,7 @@ $ apify login
Success: You are logged in to Apify as user1234!
```

-## Creating a package
## Starting a real-world project

Until now, we've kept our scrapers minimal, each represented by just a single Python module such as `main.py`, and we've added dependencies simply by installing them with `pip` inside an activated virtual environment.

@@ -74,15 +74,52 @@ Info: To run your code in the cloud, run "apify push" and deploy your code to Ap
Info: To install additional Python packages, you need to activate the virtual environment in the ".venv" folder in the actor directory.
```

-A new `warehouse-watchdog` subdirectory should appear. Inside, we should see a `src` directory containing several Python files, a `main.py` among them. That is a sample BeautifulSoup scraper provided by the template. Let's edit the file and overwrite the contents with our scraper (full code is provided in the previous lesson as the [last code example](./12_framework.md#logging)).
## Adjusting the template

-## Deploying to platform
Inside the `warehouse-watchdog` directory, we should see a `src` subdirectory containing several Python files, a `main.py` among them. That is a sample BeautifulSoup scraper provided by the template.

The file contains a single asynchronous function `main()`. At the beginning, it handles [input](https://docs.apify.com/platform/actors/running/input-and-output#input); then it passes that input to a small crawler built on top of the Crawlee framework.

Each program that should run on the Apify platform must first be packaged as a so-called Actor, a standardized box with places for input and output. Crawlee scrapers automatically connect to the Actor output, but the input needs to be explicitly handled in code.
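
For illustration, explicit input handling with the Apify SDK looks roughly like this. It's a minimal sketch rather than the template's exact code; `start_urls` is just the input field the template happens to expect:

```py
from apify import Actor

async def main():
    async with Actor:
        # Read the Actor input (on the platform it comes from the run's input form,
        # locally from the storage created by `apify run`)
        actor_input = await Actor.get_input() or {}
        start_urls = actor_input.get("start_urls", [])
        Actor.log.info(f"Received {len(start_urls)} start URLs")
```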

We'll now adjust the template so that it runs our program for watching prices. As a first step, we'll create a new empty file, `crawler.py`, inside the `warehouse-watchdog/src` directory, and we'll fill this new file with the [final code](./12_framework.md#logging) from the previous lesson:

```py title=warehouse-watchdog/src/crawler.py
import asyncio
from decimal import Decimal
from crawlee.crawlers import BeautifulSoupCrawler

async def main():
    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def handle_listing(context):
        context.log.info("Looking for product detail pages")
        await context.enqueue_links(selector=".product-list a.product-item__title", label="DETAIL")

    ...
```

Now let's change the contents of `warehouse-watchdog/src/main.py` to this:

```py title=warehouse-watchdog/src/main.py
from apify import Actor
from .crawler import main as crawl

async def main():
    async with Actor:
        await crawl()
```

We import our program as a function and await the result of that function inside the Actor block. Unlike the sample scraper, our program doesn't expect any input data, so we can delete the code handling that part.

-As a first step, we'll change directory in our terminal so that we're inside `warehouse-watchdog`. There, we'll verify that everything works on our machine before we deploy our project to the cloud:
Now we'll change directory in our terminal so that we're inside `warehouse-watchdog`. There, we'll verify that everything works on our machine before we deploy our project to the cloud:

```text
$ apify run
Run: /Users/course/Projects/warehouse-watchdog/.venv/bin/python3 -m src
[apify] INFO Initializing Actor...
[apify] INFO System info ({"apify_sdk_version": "0.0.0", "apify_client_version": "0.0.0", "crawlee_version": "0.0.0", "python_version": "0.0.0", "os": "xyz"})
[BeautifulSoupCrawler] INFO Current request statistics:
┌───────────────────────────────┬──────────┐
│ requests_finished │ 0 │
@@ -104,7 +141,56 @@ Run: /Users/course/Projects/warehouse-watchdog/.venv/bin/python3 -m src
...
```

-If the scraper run ends without errors, we can proceed to deploying:
## Deploying to the platform

The Actor configuration coming from the template instructs the platform to expect input, so we'd better change that before we attempt to run our scraper in the cloud.

Inside `warehouse-watchdog`, there is a directory called `.actor`. There, we'll edit the `input_schema.json` file. It comes out of the template like this:

```json title=warehouse-watchdog/.actor/input_schema.json
{
    "title": "Python Crawlee BeautifulSoup Scraper",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "start_urls": {
            "title": "Start URLs",
            "type": "array",
            "description": "URLs to start with",
            "prefill": [
                { "url": "https://apify.com" }
            ],
            "editor": "requestListSources"
        }
    },
    "required": ["start_urls"]
}
```

:::tip Hidden dot files

On some systems, `.actor` might be hidden in the directory listing by default, because it starts with a dot. Try to locate the file using your editor's built-in file explorer.

:::
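
If the editor doesn't show the directory, an optional way to confirm it exists is to list the project directory from Python (plain `ls -la` in a terminal works too). This is just a convenience check, not part of the lesson's code:

```py
from pathlib import Path

# Print everything in the project directory, including entries starting with a dot
for entry in sorted(Path("warehouse-watchdog").iterdir()):
    print(entry.name)
```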

We'll empty the expected properties and remove the list of required ones. After our changes, the file should look like this:

```json title=warehouse-watchdog/.actor/input_schema.json
{
    "title": "Python Crawlee BeautifulSoup Scraper",
    "type": "object",
    "schemaVersion": 1,
    "properties": {}
}
```

:::danger Trailing commas in JSON

Make sure there is no trailing comma after `{}`, otherwise the file won't be valid JSON.

:::
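
If in doubt whether the edited file is still valid JSON, an optional quick check with Python's standard library (nothing Apify-specific) looks like this:

```py
import json
from pathlib import Path

# Raises json.JSONDecodeError, including the offending line number, if the schema isn't valid JSON
schema = json.loads(Path("warehouse-watchdog/.actor/input_schema.json").read_text())
print(schema["title"])
```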

Now we can proceed to deploying:

```text
$ apify push
@@ -117,9 +203,172 @@ Actor build detail https://console.apify.com/actors/a123bCDefghiJkLMN#/builds/0.
? Do you want to open the Actor detail in your browser? (Y/n)
```

-Let's agree to opening the Actor detail in browser. There we'll find a button to **Start Actor**.
After agreeing to open the Actor detail in our browser and assuming we're logged in, we'll see a button to **Start Actor**. Hitting it brings us to a screen for specifying the Actor input and run options. Without changing anything, we'll continue by hitting **Start**, and we should immediately see the logs of the scraper run, similar to what we'd normally see in our terminal, but this time it's a remote copy of our program running on a cloud platform.

When the run finishes, the interface should turn green. On the **Output** tab, we're able to preview the results of the scraper as a table or as JSON, and there's even a button to export the data to many different formats, including CSV, XML, Excel, RSS, and more.

:::note Accessing data programmatically

You don't need to click buttons to download the data. The same can be achieved with [Apify's API](https://docs.apify.com/api/v2/dataset-items-get), the [`apify datasets`](https://docs.apify.com/cli/docs/reference#datasets) CLI command, or the [`apify`](https://docs.apify.com/api/client/python/reference/class/DatasetClientAsync) Python SDK.

:::
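
As an illustration of the Python route, fetching the items of a finished run's dataset might look roughly like this, using the `apify-client` package. The token and dataset ID are placeholders you'd take from the Apify Console:

```py
import asyncio
from apify_client import ApifyClientAsync

async def fetch_results():
    # Both values are placeholders: use your own API token and the dataset ID of a run
    client = ApifyClientAsync(token="YOUR_APIFY_TOKEN")
    dataset = client.dataset("YOUR_DATASET_ID")

    # list_items() returns a page object whose .items attribute holds the scraped records
    page = await dataset.list_items()
    for item in page.items:
        print(item)

if __name__ == "__main__":
    asyncio.run(fetch_results())
```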

## Running the scraper periodically

Let's say we want our scraper to collect data about sale prices daily. In the Apify web interface, we'll go to [Schedules](https://console.apify.com/schedules). Hitting **Create new** will open a setup screen where we can specify the periodicity (daily is the default) and select the Actors that should be started. When we're done, we can hit **Enable**. That's it!

From now on, the Actor will execute daily, and we'll be able to inspect every single run. For each run, we'll have access to its logs and the data collected. We'll see stats and monitoring charts, and we'll be able to set up alerts that send us a notification under specified conditions.
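
A schedule can also be created programmatically with the `apify-client` package. The sketch below is an assumption-heavy illustration: the parameter names and the shape of `actions` mirror the Schedules API, so double-check them against the client reference before relying on this:

```py
from apify_client import ApifyClient

# Placeholders: use your own API token and the ID of the deployed Actor
client = ApifyClient(token="YOUR_APIFY_TOKEN")

schedule = client.schedules().create(
    name="warehouse-watchdog-daily",
    cron_expression="0 8 * * *",  # every day at 8:00 UTC
    is_enabled=True,
    is_exclusive=True,  # don't start a new run while the previous one is still running
    actions=[{"type": "RUN_ACTOR", "actorId": "YOUR_ACTOR_ID"}],  # assumed action shape
)
print(schedule["id"])
```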

## Adding support for proxies

If monitoring of our scraper starts showing that it regularly fails to reach the Warehouse shop website, we're most likely getting blocked. In such a case, we can use proxies so that our requests come from different locations around the world and our scraping is less apparent in the volume of the website's standard traffic.

Proxy configuration is a type of Actor input, so let's start by re-introducing the code that takes care of it. We'll change `warehouse-watchdog/src/main.py` to this:

```py title=warehouse-watchdog/src/main.py
from apify import Actor
from .crawler import main as crawl

async def main():
    async with Actor:
        input_data = await Actor.get_input()

        if actor_proxy_input := input_data.get("proxyConfig"):
            proxy_config = await Actor.create_proxy_configuration(actor_proxy_input=actor_proxy_input)
        else:
            proxy_config = None

        await crawl(proxy_config)
```

Now we'll add `proxy_config` as an optional parameter to the scraper in `warehouse-watchdog/src/crawler.py`. Thanks to the built-in integration between Apify and Crawlee, it's enough to pass it down to `BeautifulSoupCrawler()`, and the class will do the rest:

```py title=warehouse-watchdog/src/crawler.py
import asyncio
from decimal import Decimal
from crawlee.crawlers import BeautifulSoupCrawler

# highlight-next-line
async def main(proxy_config=None):
    # highlight-next-line
    crawler = BeautifulSoupCrawler(proxy_configuration=proxy_config)
    # highlight-next-line
    crawler.log.info(f"Using proxy: {'yes' if proxy_config else 'no'}")

    @crawler.router.default_handler
    async def handle_listing(context):
        context.log.info("Looking for product detail pages")
        await context.enqueue_links(selector=".product-list a.product-item__title", label="DETAIL")

    ...
```

Last but not least, we'll modify the Actor configuration in `warehouse-watchdog/.actor/input_schema.json` so that it includes the `proxyConfig` input parameter:

```json title=warehouse-watchdog/.actor/input_schema.json
{
    "title": "Python Crawlee BeautifulSoup Scraper",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "proxyConfig": {
            "title": "Proxy config",
            "description": "Proxy configuration",
            "type": "object",
            "editor": "proxy",
            "prefill": {
                "useApifyProxy": true,
                "apifyProxyGroups": []
            },
            "default": {
                "useApifyProxy": true,
                "apifyProxyGroups": []
            }
        }
    }
}
```

Now if we run the scraper locally, it should all work without errors. We'll use the `apify run` command again, but with the `--purge` option, which makes sure we don't reuse anything from the previous run:

```text
$ apify run --purge
Info: All default local stores were purged.
Run: /Users/course/Projects/warehouse-watchdog/.venv/bin/python3 -m src
[apify] INFO Initializing Actor...
[apify] INFO System info ({"apify_sdk_version": "0.0.0", "apify_client_version": "0.0.0", "crawlee_version": "0.0.0", "python_version": "0.0.0", "os": "xyz"})
[BeautifulSoupCrawler] INFO Using proxy: no
[BeautifulSoupCrawler] INFO Current request statistics:
┌───────────────────────────────┬──────────┐
│ requests_finished │ 0 │
│ requests_failed │ 0 │
│ retry_histogram │ [0] │
│ request_avg_failed_duration │ None │
│ request_avg_finished_duration │ None │
│ requests_finished_per_minute │ 0 │
│ requests_failed_per_minute │ 0 │
│ request_total_duration │ 0.0 │
│ requests_total │ 0 │
│ crawler_runtime │ 0.014976 │
└───────────────────────────────┴──────────┘
[crawlee._autoscaling.autoscaled_pool] INFO current_concurrency = 0; desired_concurrency = 2; cpu = 0; mem = 0; event_loop = 0.0; client_info = 0.0
[BeautifulSoupCrawler] INFO Looking for product detail pages
[BeautifulSoupCrawler] INFO Product detail page: https://warehouse-theme-metal.myshopify.com/products/jbl-flip-4-waterproof-portable-bluetooth-speaker
[BeautifulSoupCrawler] INFO Saving a product variant
[BeautifulSoupCrawler] INFO Saving a product variant
...
```

In the logs, we can notice the line `Using proxy: no`. When running the scraper locally, the Actor input doesn't contain any proxy configuration, so all requests are made from our location, the same way as before. Now let's update our cloud copy of the scraper with `apify push` so that it's based on our latest edits to the code:

```text
$ apify push
Info: Deploying Actor 'warehouse-watchdog' to Apify.
Run: Updated version 0.0 for Actor warehouse-watchdog.
Run: Building Actor warehouse-watchdog
(timestamp) ACTOR: Found input schema referenced from .actor/actor.json
...
? Do you want to open the Actor detail in your browser? (Y/n)
```

After opening the Actor detail in our browser, we should see the **Source** screen. We'll switch to the **Input** tab, where we can now see the **Proxy config** input option. By default, it's set to **Datacenter - Automatic**, and we'll leave it like that. Let's hit the **Start** button! In the logs, we should see the following:

```text
(timestamp) ACTOR: Pulling Docker image of build o6vHvr5KwA1sGNxP0 from repository.
(timestamp) ACTOR: Creating Docker container.
(timestamp) ACTOR: Starting Docker container.
(timestamp) [apify] INFO Initializing Actor...
(timestamp) [apify] INFO System info ({"apify_sdk_version": "0.0.0", "apify_client_version": "0.0.0", "crawlee_version": "0.0.0", "python_version": "0.0.0", "os": "xyz"})
(timestamp) [BeautifulSoupCrawler] INFO Using proxy: yes
(timestamp) [BeautifulSoupCrawler] INFO Current request statistics:
(timestamp) ┌───────────────────────────────┬──────────┐
(timestamp) │ requests_finished │ 0 │
(timestamp) │ requests_failed │ 0 │
(timestamp) │ retry_histogram │ [0] │
(timestamp) │ request_avg_failed_duration │ None │
(timestamp) │ request_avg_finished_duration │ None │
(timestamp) │ requests_finished_per_minute │ 0 │
(timestamp) │ requests_failed_per_minute │ 0 │
(timestamp) │ request_total_duration │ 0.0 │
(timestamp) │ requests_total │ 0 │
(timestamp) │ crawler_runtime │ 0.036449 │
(timestamp) └───────────────────────────────┴──────────┘
(timestamp) [crawlee._autoscaling.autoscaled_pool] INFO current_concurrency = 0; desired_concurrency = 2; cpu = 0; mem = 0; event_loop = 0.0; client_info = 0.0
(timestamp) [crawlee.storages._request_queue] INFO The queue still contains requests locked by another client
(timestamp) [BeautifulSoupCrawler] INFO Looking for product detail pages
(timestamp) [BeautifulSoupCrawler] INFO Product detail page: https://warehouse-theme-metal.myshopify.com/products/jbl-flip-4-waterproof-portable-bluetooth-speaker
(timestamp) [BeautifulSoupCrawler] INFO Saving a product variant
...
```

The logs contain `Using proxy: yes`, which confirms that the scraper now uses proxies provided by the Apify platform.

## Congratulations!

You've reached the end of the course, congratulations! Together we've built a program that crawls a shop, extracts data about products and their prices, and exports that data. We simplified our work with a framework and deployed our scraper to a scraping platform, so it can run unassisted on a regular basis and accumulate data over time. The platform also helps us with monitoring and anti-scraping.

-## What's next
We hope this will serve as a good start for your next adventure in the world of scraping. Perhaps you'll create scrapers that you publish for a fee so that others can reuse them? 😉

---
