Commit f779270

feat: complete the basic stub of the lesson
1 parent 09b3176 commit f779270

File tree

1 file changed: +256 -7 lines


sources/academy/webscraping/scraping_basics_python/13_platform.md

Lines changed: 256 additions & 7 deletions
@@ -54,7 +54,7 @@ $ apify login
Success: You are logged in to Apify as user1234!
```

-## Creating a package
## Starting a real-world project

Until now, we've kept our scrapers minimal, each represented by just a single Python module such as `main.py`, and we've added dependencies simply by installing them with `pip` inside an activated virtual environment.

@@ -74,15 +74,52 @@ Info: To run your code in the cloud, run "apify push" and deploy your code to Ap
Info: To install additional Python packages, you need to activate the virtual environment in the ".venv" folder in the actor directory.
```

-A new `warehouse-watchdog` subdirectory should appear. Inside, we should see a `src` directory containing several Python files, a `main.py` among them. That is a sample BeautifulSoup scraper provided by the template. Let's edit the file and overwrite the contents with our scraper (full code is provided in the previous lesson as the [last code example](./12_framework.md#logging)).
## Adjusting the template

-## Deploying to platform
Inside the `warehouse-watchdog` directory, we should see a `src` subdirectory containing several Python files, a `main.py` among them. That is a sample BeautifulSoup scraper provided by the template.

The file contains a single asynchronous function `main()`. At the beginning, it handles [input](https://docs.apify.com/platform/actors/running/input-and-output#input); then it passes that input to a small crawler built on top of the Crawlee framework.

Each program that should run on the Apify platform must first be packaged as a so-called Actor, a standardized box with places for input and output. Crawlee scrapers automatically connect to the Actor output, but the input needs to be explicitly handled in code.
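
For illustration, explicit input handling with the Apify SDK looks roughly like this. It's a minimal sketch rather than the template's exact code; `start_urls` is just the input field the template happens to expect:

```py
from apify import Actor

async def main():
    async with Actor:
        # Read the Actor input (on the platform it comes from the run's input form,
        # locally from the storage created by `apify run`)
        actor_input = await Actor.get_input() or {}
        start_urls = actor_input.get("start_urls", [])
        Actor.log.info(f"Received {len(start_urls)} start URLs")
```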

We'll now adjust the template so that it runs our program for watching prices. As a first step, we'll create a new empty file, `crawler.py`, inside the `warehouse-watchdog/src` directory, and we'll fill this new file with the [final code](./12_framework.md#logging) from the previous lesson:

```py title=warehouse-watchdog/src/crawler.py
import asyncio
from decimal import Decimal
from crawlee.crawlers import BeautifulSoupCrawler

async def main():
    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def handle_listing(context):
        context.log.info("Looking for product detail pages")
        await context.enqueue_links(selector=".product-list a.product-item__title", label="DETAIL")

    ...
```

Now let's change the contents of `warehouse-watchdog/src/main.py` to this:

```py title=warehouse-watchdog/src/main.py
from apify import Actor
from .crawler import main as crawl

async def main():
    async with Actor:
        await crawl()
```

We import our program as a function and await the result of that function inside the Actor block. Unlike the sample scraper, our program doesn't expect any input data, so we can delete the code handling that part.

-As a first step, we'll change directory in our terminal so that we're inside `warehouse-watchdog`. There, we'll verify that everything works on our machine before we deploy our project to the cloud:
Now we'll change directory in our terminal so that we're inside `warehouse-watchdog`. There, we'll verify that everything works on our machine before we deploy our project to the cloud:

```text
$ apify run
Run: /Users/course/Projects/warehouse-watchdog/.venv/bin/python3 -m src
[apify] INFO Initializing Actor...
[apify] INFO System info ({"apify_sdk_version": "0.0.0", "apify_client_version": "0.0.0", "crawlee_version": "0.0.0", "python_version": "0.0.0", "os": "xyz"})
[BeautifulSoupCrawler] INFO Current request statistics:
┌───────────────────────────────┬──────────┐
│ requests_finished │ 0 │
@@ -104,7 +141,56 @@ Run: /Users/course/Projects/warehouse-watchdog/.venv/bin/python3 -m src
...
```

-If the scraper run ends without errors, we can proceed to deploying:
## Deploying to the platform

The Actor configuration coming from the template instructs the platform to expect input, so we'd better change that before we attempt to run our scraper in the cloud.

Inside `warehouse-watchdog`, there is a directory called `.actor`. There, we'll edit the `input_schema.json` file. It comes out of the template like this:

```json title=warehouse-watchdog/.actor/input_schema.json
{
    "title": "Python Crawlee BeautifulSoup Scraper",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "start_urls": {
            "title": "Start URLs",
            "type": "array",
            "description": "URLs to start with",
            "prefill": [
                { "url": "https://apify.com" }
            ],
            "editor": "requestListSources"
        }
    },
    "required": ["start_urls"]
}
```

:::tip Hidden dot files

On some systems, `.actor` might be hidden in the directory listing by default, because it starts with a dot. Try to locate the file using your editor's built-in file explorer.

:::
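
If the editor doesn't show the directory, an optional way to confirm it exists is to list the project directory from Python (plain `ls -la` in a terminal works too). This is just a convenience check, not part of the lesson's code:

```py
from pathlib import Path

# Print everything in the project directory, including entries starting with a dot
for entry in sorted(Path("warehouse-watchdog").iterdir()):
    print(entry.name)
```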

We'll empty the expected properties and remove the list of required ones. After our changes, the file should look like this:

```json title=warehouse-watchdog/.actor/input_schema.json
{
    "title": "Python Crawlee BeautifulSoup Scraper",
    "type": "object",
    "schemaVersion": 1,
    "properties": {}
}
```

:::danger Trailing commas in JSON

Make sure there is no trailing comma after `{}`, otherwise the file won't be valid JSON.

:::
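
If in doubt whether the edited file is still valid JSON, an optional quick check with Python's standard library (nothing Apify-specific) looks like this:

```py
import json
from pathlib import Path

# Raises json.JSONDecodeError, including the offending line number, if the schema isn't valid JSON
schema = json.loads(Path("warehouse-watchdog/.actor/input_schema.json").read_text())
print(schema["title"])
```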

Now we can proceed to deploying:

```text
$ apify push
@@ -117,9 +203,172 @@ Actor build detail https://console.apify.com/actors/a123bCDefghiJkLMN#/builds/0.
? Do you want to open the Actor detail in your browser? (Y/n)
```

-Let's agree to opening the Actor detail in browser. There we'll find a button to **Start Actor**.
After agreeing to open the Actor detail in our browser and assuming we're logged in, we'll see a button to **Start Actor**. Hitting it brings us to a screen for specifying the Actor input and run options. Without changing anything, we'll continue by hitting **Start**, and we should immediately see the logs of the scraper run, similar to what we'd normally see in our terminal, but this time it's a remote copy of our program running on a cloud platform.

When the run finishes, the interface should turn green. On the **Output** tab, we're able to preview the results of the scraper as a table or as JSON, and there's even a button to export the data to many different formats, including CSV, XML, Excel, RSS, and more.

:::note Accessing data programmatically

You don't need to click buttons to download the data. The same can be achieved with [Apify's API](https://docs.apify.com/api/v2/dataset-items-get), the [`apify datasets`](https://docs.apify.com/cli/docs/reference#datasets) CLI command, or the [`apify`](https://docs.apify.com/api/client/python/reference/class/DatasetClientAsync) Python SDK.

:::
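
As an illustration of the Python route, fetching the items of a finished run's dataset might look roughly like this, using the `apify-client` package. The token and dataset ID are placeholders you'd take from the Apify Console:

```py
import asyncio
from apify_client import ApifyClientAsync

async def fetch_results():
    # Both values are placeholders: use your own API token and the dataset ID of a run
    client = ApifyClientAsync(token="YOUR_APIFY_TOKEN")
    dataset = client.dataset("YOUR_DATASET_ID")

    # list_items() returns a page object whose .items attribute holds the scraped records
    page = await dataset.list_items()
    for item in page.items:
        print(item)

if __name__ == "__main__":
    asyncio.run(fetch_results())
```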

## Running the scraper periodically

Let's say we want our scraper to collect data about sale prices daily. In the Apify web interface, we'll go to [Schedules](https://console.apify.com/schedules). Hitting **Create new** will open a setup screen where we can specify the periodicity (daily is the default) and select the Actors that should be started. When we're done, we can hit **Enable**. That's it!

From now on, the Actor will execute daily, and we'll be able to inspect every single run. For each run, we'll have access to its logs and the data collected. We'll see stats and monitoring charts, and we'll be able to set up alerts that send us a notification under specified conditions.
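
A schedule can also be created programmatically with the `apify-client` package. The sketch below is an assumption-heavy illustration: the parameter names and the shape of `actions` mirror the Schedules API, so double-check them against the client reference before relying on this:

```py
from apify_client import ApifyClient

# Placeholders: use your own API token and the ID of the deployed Actor
client = ApifyClient(token="YOUR_APIFY_TOKEN")

schedule = client.schedules().create(
    name="warehouse-watchdog-daily",
    cron_expression="0 8 * * *",  # every day at 8:00 UTC
    is_enabled=True,
    is_exclusive=True,  # don't start a new run while the previous one is still running
    actions=[{"type": "RUN_ACTOR", "actorId": "YOUR_ACTOR_ID"}],  # assumed action shape
)
print(schedule["id"])
```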

## Adding support for proxies

If monitoring of our scraper starts showing that it regularly fails to reach the Warehouse shop website, we're most likely getting blocked. In such a case, we can use proxies so that our requests come from different locations around the world and our scraping is less apparent in the volume of the website's standard traffic.

Proxy configuration is a type of Actor input, so let's start by re-introducing the code that takes care of it. We'll change `warehouse-watchdog/src/main.py` to this:

```py title=warehouse-watchdog/src/main.py
from apify import Actor
from .crawler import main as crawl

async def main():
    async with Actor:
        input_data = await Actor.get_input()

        if actor_proxy_input := input_data.get("proxyConfig"):
            proxy_config = await Actor.create_proxy_configuration(actor_proxy_input=actor_proxy_input)
        else:
            proxy_config = None

        await crawl(proxy_config)
```

Now we'll add `proxy_config` as an optional parameter to the scraper in `warehouse-watchdog/src/crawler.py`. Thanks to the built-in integration between Apify and Crawlee, it's enough to pass it down to `BeautifulSoupCrawler()`, and the class will do the rest:

```py title=warehouse-watchdog/src/crawler.py
import asyncio
from decimal import Decimal
from crawlee.crawlers import BeautifulSoupCrawler

# highlight-next-line
async def main(proxy_config=None):
    # highlight-next-line
    crawler = BeautifulSoupCrawler(proxy_configuration=proxy_config)
    # highlight-next-line
    crawler.log.info(f"Using proxy: {'yes' if proxy_config else 'no'}")

    @crawler.router.default_handler
    async def handle_listing(context):
        context.log.info("Looking for product detail pages")
        await context.enqueue_links(selector=".product-list a.product-item__title", label="DETAIL")

    ...
```

Last but not least, we'll modify the Actor configuration in `warehouse-watchdog/.actor/input_schema.json` so that it includes the `proxyConfig` input parameter:

```json title=warehouse-watchdog/.actor/input_schema.json
{
    "title": "Python Crawlee BeautifulSoup Scraper",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "proxyConfig": {
            "title": "Proxy config",
            "description": "Proxy configuration",
            "type": "object",
            "editor": "proxy",
            "prefill": {
                "useApifyProxy": true,
                "apifyProxyGroups": []
            },
            "default": {
                "useApifyProxy": true,
                "apifyProxyGroups": []
            }
        }
    }
}
```

Now if we run the scraper locally, it should all work without errors. We'll use the `apify run` command again, but with the `--purge` option, which makes sure we don't reuse anything from the previous run:

```text
$ apify run --purge
Info: All default local stores were purged.
Run: /Users/course/Projects/warehouse-watchdog/.venv/bin/python3 -m src
[apify] INFO Initializing Actor...
[apify] INFO System info ({"apify_sdk_version": "0.0.0", "apify_client_version": "0.0.0", "crawlee_version": "0.0.0", "python_version": "0.0.0", "os": "xyz"})
[BeautifulSoupCrawler] INFO Using proxy: no
[BeautifulSoupCrawler] INFO Current request statistics:
┌───────────────────────────────┬──────────┐
│ requests_finished │ 0 │
│ requests_failed │ 0 │
│ retry_histogram │ [0] │
│ request_avg_failed_duration │ None │
│ request_avg_finished_duration │ None │
│ requests_finished_per_minute │ 0 │
│ requests_failed_per_minute │ 0 │
│ request_total_duration │ 0.0 │
│ requests_total │ 0 │
│ crawler_runtime │ 0.014976 │
└───────────────────────────────┴──────────┘
[crawlee._autoscaling.autoscaled_pool] INFO current_concurrency = 0; desired_concurrency = 2; cpu = 0; mem = 0; event_loop = 0.0; client_info = 0.0
[BeautifulSoupCrawler] INFO Looking for product detail pages
[BeautifulSoupCrawler] INFO Product detail page: https://warehouse-theme-metal.myshopify.com/products/jbl-flip-4-waterproof-portable-bluetooth-speaker
[BeautifulSoupCrawler] INFO Saving a product variant
[BeautifulSoupCrawler] INFO Saving a product variant
...
```

In the logs, we can notice the line `Using proxy: no`. When running the scraper locally, the Actor input doesn't contain any proxy configuration, so all requests are made from our location, the same way as before. Now let's update our cloud copy of the scraper with `apify push` so that it's based on our latest edits to the code:

```text
$ apify push
Info: Deploying Actor 'warehouse-watchdog' to Apify.
Run: Updated version 0.0 for Actor warehouse-watchdog.
Run: Building Actor warehouse-watchdog
(timestamp) ACTOR: Found input schema referenced from .actor/actor.json
...
? Do you want to open the Actor detail in your browser? (Y/n)
```

After opening the Actor detail in our browser, we should see the **Source** screen. We'll switch to the **Input** tab, where we can now see the **Proxy config** input option. By default, it's set to **Datacenter - Automatic**, and we'll leave it like that. Let's hit the **Start** button! In the logs, we should see the following:

```text
(timestamp) ACTOR: Pulling Docker image of build o6vHvr5KwA1sGNxP0 from repository.
(timestamp) ACTOR: Creating Docker container.
(timestamp) ACTOR: Starting Docker container.
(timestamp) [apify] INFO Initializing Actor...
(timestamp) [apify] INFO System info ({"apify_sdk_version": "0.0.0", "apify_client_version": "0.0.0", "crawlee_version": "0.0.0", "python_version": "0.0.0", "os": "xyz"})
(timestamp) [BeautifulSoupCrawler] INFO Using proxy: yes
(timestamp) [BeautifulSoupCrawler] INFO Current request statistics:
(timestamp) ┌───────────────────────────────┬──────────┐
(timestamp) │ requests_finished │ 0 │
(timestamp) │ requests_failed │ 0 │
(timestamp) │ retry_histogram │ [0] │
(timestamp) │ request_avg_failed_duration │ None │
(timestamp) │ request_avg_finished_duration │ None │
(timestamp) │ requests_finished_per_minute │ 0 │
(timestamp) │ requests_failed_per_minute │ 0 │
(timestamp) │ request_total_duration │ 0.0 │
(timestamp) │ requests_total │ 0 │
(timestamp) │ crawler_runtime │ 0.036449 │
(timestamp) └───────────────────────────────┴──────────┘
(timestamp) [crawlee._autoscaling.autoscaled_pool] INFO current_concurrency = 0; desired_concurrency = 2; cpu = 0; mem = 0; event_loop = 0.0; client_info = 0.0
(timestamp) [crawlee.storages._request_queue] INFO The queue still contains requests locked by another client
(timestamp) [BeautifulSoupCrawler] INFO Looking for product detail pages
(timestamp) [BeautifulSoupCrawler] INFO Product detail page: https://warehouse-theme-metal.myshopify.com/products/jbl-flip-4-waterproof-portable-bluetooth-speaker
(timestamp) [BeautifulSoupCrawler] INFO Saving a product variant
...
```

The logs contain `Using proxy: yes`, which confirms that the scraper now uses proxies provided by the Apify platform.

## Congratulations!

You've reached the end of the course, congratulations! Together we've built a program that crawls a shop, extracts data about products and their prices, and exports that data. We simplified our work with a framework and deployed our scraper to a scraping platform, so it can run unassisted on a regular basis and accumulate data over time. The platform also helps us with monitoring and anti-scraping.

-## What's next
We hope this will serve as a good start for your next adventure in the world of scraping. Perhaps you'll create scrapers that you publish for a fee so that others can reuse them? 😉

---
