sources/academy/webscraping/scraping_basics_python/13_platform.md
```text
$ apify login
...
Success: You are logged in to Apify as user1234!
```

## Starting a real-world project
Until now we've kept our scrapers minimal, each represented by a single Python module such as `main.py`, and we've added dependencies simply by installing them with `pip` inside an activated virtual environment.
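
The output shown below comes from scaffolding the project with the Apify CLI. Assuming we pick the Python template based on Crawlee and BeautifulSoup (the template choice is an inference from the rest of the lesson), the command looks roughly like this:

```text
$ apify create warehouse-watchdog
```

The CLI asks a few questions, downloads the template, and prepares a virtual environment. The end of its output looks like this: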
```text
...
Info: To run your code in the cloud, run "apify push" and deploy your code to Apify.
Info: To install additional Python packages, you need to activate the virtual environment in the ".venv" folder in the actor directory.
```

## Adjusting the template
Inside the `warehouse-watchdog` directory, we should see a `src` subdirectory containing several Python files, a `main.py` among them. That is a sample BeautifulSoup scraper provided by the template.

The file contains a single asynchronous function, `main()`. At the beginning it handles [input](https://docs.apify.com/platform/actors/running/input-and-output#input), then it passes that input to a small crawler built on top of the Crawlee framework.

Every program that should run on the Apify platform first needs to be packaged as a so-called Actor, a standardized box with places for input and output. Crawlee scrapers automatically connect to the Actor output, but the input needs to be handled explicitly in code.
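
Handling input in an Actor generally follows a simple pattern. Below is a minimal sketch of that pattern, not the template's exact code; the `start_urls` field name is an assumption for illustration:

```py
from apify import Actor


async def main():
    async with Actor:
        # Read the Actor input; fall back to an empty dict when no input is given.
        actor_input = await Actor.get_input() or {}
        # The "start_urls" field is illustrative, not necessarily the template's.
        start_urls = actor_input.get("start_urls", [])
        Actor.log.info(f"Received {len(start_urls)} start URLs")
```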

We'll now adjust the template so that it runs our program for watching prices. As a first step, we'll create a new empty file, `crawler.py`, inside the `warehouse-watchdog/src` directory, and we'll fill it with the [final code](./12_framework.md#logging) from the previous lesson:
```py title=warehouse-watchdog/src/crawler.py
import asyncio
from decimal import Decimal
from crawlee.crawlers import BeautifulSoupCrawler

async def main():
    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def handle_listing(context):
        context.log.info("Looking for product detail pages")
        ...
```

Now let's change the contents of `warehouse-watchdog/src/main.py` to this:
```py title=warehouse-watchdog/src/main.py
from apify import Actor
from .crawler import main as crawl

async def main():
    async with Actor:
        await crawl()
```

We import our program as a function and await its result inside the Actor block. Unlike the sample scraper, our program doesn't expect any input data, so we can delete the code that handled it.
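
The template itself takes care of running this asynchronous `main()` function. Typically it ships a `src/__main__.py` along the following lines (a sketch; the exact contents may differ between template versions):

```py
import asyncio

from .main import main

# Start the event loop and run the Actor's asynchronous entry point.
asyncio.run(main())
```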

Now we'll change directory in our terminal so that we're inside `warehouse-watchdog`. There, we'll verify that everything works on our machine before we deploy our project to the cloud:

```text
$ apify run
...
[apify] INFO System info ({"apify_sdk_version": "0.0.0", "apify_client_version": "0.0.0", "crawlee_version": "0.0.0", "python_version": "0.0.0", "os": "xyz"})
...
[BeautifulSoupCrawler] INFO Current request statistics:
...
```

If the scraper run ends without errors, we can proceed to deployment.

## Deploying to platform

The Actor configuration coming from the template instructs the platform to expect input, so we'd better change that before we attempt to run our scraper in the cloud.

Inside `warehouse-watchdog` there is a directory called `.actor`. Inside it, we'll edit the `input_schema.json` file. It comes out of the template like this:
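
The exact contents depend on the template version; it will be something along these lines (the field names below are illustrative, not a verbatim copy of the template):

```json
{
    "title": "Python Crawlee BeautifulSoup Scraper",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "start_urls": {
            "title": "Start URLs",
            "type": "array",
            "description": "URLs to start scraping from",
            "editor": "requestListSources",
            "prefill": [{ "url": "https://apify.com" }]
        }
    },
    "required": ["start_urls"]
}
```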

:::note

Beware that on some systems `.actor` might be hidden in the directory listing by default, because it starts with a dot. Try to locate the file using your editor's built-in file explorer.

:::
We'll empty the expected properties and remove the list of required ones. After our changes, the file should look like this:
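
A minimal sketch of the result (the title is illustrative and may differ in your project):

```json
{
    "title": "Python Crawlee BeautifulSoup Scraper",
    "type": "object",
    "schemaVersion": 1,
    "properties": {}
}
```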

:::note

Make sure there is no trailing comma after `{}`, otherwise the file won't be valid JSON.

:::
Now we can proceed to deploying:

```text
$ apify push
...
Actor build detail https://console.apify.com/actors/a123bCDefghiJkLMN#/builds/0.
...
? Do you want to open the Actor detail in your browser? (Y/n)
```

After agreeing to open the Actor detail in our browser, and assuming we're logged in, we'll see a button to **Start Actor**. Hitting it brings us to a screen for specifying Actor input and run options. Without changing anything, we'll continue by hitting **Start**, and immediately we should be presented with the logs of the scraper run, similar to what we'd normally see in our terminal, except that this time a remote copy of our program runs on a cloud platform.

When the run finishes, the interface should turn green. On the **Output** tab we can preview the results of the scraper as a table or as JSON, and there's even a button to export the data to many different formats, including CSV, XML, Excel, RSS, and more.

:::note Accessing data programmatically

You don't need to click buttons to download the data. The same can be achieved with [Apify's API](https://docs.apify.com/api/v2/dataset-items-get), the [`apify datasets`](https://docs.apify.com/cli/docs/reference#datasets) CLI command, or the [`apify`](https://docs.apify.com/api/client/python/reference/class/DatasetClientAsync) Python SDK.

:::
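
For example, with the Apify client for Python, downloading the dataset items could look roughly like this (a sketch using the synchronous client; the token and dataset ID are placeholders):

```py
from apify_client import ApifyClient

# Placeholders: use your own API token and the dataset ID of the Actor run.
client = ApifyClient(token="YOUR_API_TOKEN")
dataset_client = client.dataset("YOUR_DATASET_ID")

# Fetch the scraped items and print them one by one.
for item in dataset_client.list_items().items:
    print(item)
```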
## Running the scraper periodically

Let's say we want our scraper to collect data about sale prices daily. In the Apify web interface, we'll go to [Schedules](https://console.apify.com/schedules). Hitting **Create new** will open a setup screen where we can specify the periodicity (daily is the default) and select the Actors which should be started. When we're done, we can hit **Enable**. That's it!

From now on, the Actor will execute daily. We'll be able to inspect every single run. For each run we'll have access to its logs and the data collected. We'll see stats and monitoring charts, and we'll be able to set up alerts that send us a notification under specified conditions.
## Adding support for proxies

If monitoring of our scraper starts showing that it regularly fails to reach the Warehouse shop website, we're most likely getting blocked. In such a case we can use proxies, so that our requests come from different locations around the world and our scraping is less apparent in the volume of the website's standard traffic.

Proxy configuration is a type of Actor input, so let's start by re-introducing the code that takes care of it. We'll change `warehouse-watchdog/src/main.py` to this:
```py title=warehouse-watchdog/src/main.py
from apify import Actor
from .crawler import main as crawl

async def main():
    async with Actor:
        input_data = await Actor.get_input()

        if actor_proxy_input := input_data.get("proxyConfig"):
            ...
```
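
One way to complete the file is to build a proxy configuration from that input with the Apify SDK and hand it to the crawler function. The `create_proxy_configuration()` call and the extra parameter passed to `crawl()` are assumptions for illustration, not necessarily the lesson's exact code:

```py
from apify import Actor

from .crawler import main as crawl


async def main():
    async with Actor:
        input_data = await Actor.get_input() or {}

        proxy_config = None
        if actor_proxy_input := input_data.get("proxyConfig"):
            # Assumption: the Apify SDK helper turns the "proxyConfig" input
            # into a proxy configuration object that Crawlee understands.
            proxy_config = await Actor.create_proxy_configuration(
                actor_proxy_input=actor_proxy_input,
            )

        await crawl(proxy_config)
```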

Now we'll add `proxy_config` as an optional parameter to the scraper in `warehouse-watchdog/src/crawler.py`. Thanks to the built-in integration between Apify and Crawlee, it's enough to pass it down to `BeautifulSoupCrawler()`, and the class will do the rest:
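
A sketch of that change, assuming Crawlee's `proxy_configuration` keyword argument and showing only the relevant lines:

```py
from crawlee.crawlers import BeautifulSoupCrawler

async def main(proxy_config=None):
    # Assumption: BeautifulSoupCrawler accepts the proxy configuration via the
    # `proxy_configuration` keyword argument and uses it for all requests.
    crawler = BeautifulSoupCrawler(proxy_configuration=proxy_config)
    ...
```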

Last but not least, we'll modify the Actor configuration in `warehouse-watchdog/.actor/input_schema.json` so that it includes the `proxyConfig` input parameter:
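
A sketch of what that property could look like, using the platform's `proxy` editor (the title and description texts are illustrative):

```json
{
    "title": "Python Crawlee BeautifulSoup Scraper",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "proxyConfig": {
            "title": "Proxy config",
            "description": "Proxy configuration for the scraper's requests",
            "type": "object",
            "editor": "proxy",
            "prefill": { "useApifyProxy": true },
            "default": { "useApifyProxy": true }
        }
    }
}
```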
Now if we run the scraper locally, it should all work without error. We'll use the `apify run` command again, but with the `--purge` option, which makes sure we're not re-using anything from the previous run.

```text
$ apify run --purge
...
[apify] INFO System info ({"apify_sdk_version": "0.0.0", "apify_client_version": "0.0.0", "crawlee_version": "0.0.0", "python_version": "0.0.0", "os": "xyz"})
[BeautifulSoupCrawler] INFO Using proxy: no
[BeautifulSoupCrawler] INFO Current request statistics:
┌───────────────────────────────┬──────────┐
│ requests_finished             │ 0        │
│ requests_failed               │ 0        │
│ retry_histogram               │ [0]      │
│ request_avg_failed_duration   │ None     │
│ request_avg_finished_duration │ None     │
│ requests_finished_per_minute  │ 0        │
│ requests_failed_per_minute    │ 0        │
│ request_total_duration        │ 0.0      │
│ requests_total                │ 0        │
│ crawler_runtime               │ 0.014976 │
└───────────────────────────────┴──────────┘
[crawlee._autoscaling.autoscaled_pool] INFO current_concurrency = 0; desired_concurrency = 2; cpu = 0; mem = 0; event_loop = 0.0; client_info = 0.0
[BeautifulSoupCrawler] INFO Looking for product detail pages
[BeautifulSoupCrawler] INFO Product detail page: https://warehouse-theme-metal.myshopify.com/products/jbl-flip-4-waterproof-portable-bluetooth-speaker
[BeautifulSoupCrawler] INFO Saving a product variant
[BeautifulSoupCrawler] INFO Saving a product variant
...
```

In the logs we can notice the line `Using proxy: no`. When running the scraper locally, the Actor input doesn't contain any proxy configuration, so all requests are made from our own location, the same way as previously. Now let's update our cloud copy of the scraper with `apify push` so that it's based on our latest edits to the code:
```text
$ apify push
Info: Deploying Actor 'warehouse-watchdog' to Apify.
Run: Updated version 0.0 for Actor warehouse-watchdog.
Run: Building Actor warehouse-watchdog
(timestamp) ACTOR: Found input schema referenced from .actor/actor.json
...
? Do you want to open the Actor detail in your browser? (Y/n)
```

After opening the Actor detail in our browser, we should see the **Source** screen. We'll switch to the **Input** tab, where we can now see the **Proxy config** input option. By default it's set to **Datacenter - Automatic**, and we'll leave it like that. Let's hit the **Start** button! In the logs we should see the following:
```text
(timestamp) ACTOR: Pulling Docker image of build o6vHvr5KwA1sGNxP0 from repository.
(timestamp) ACTOR: Creating Docker container.
(timestamp) ACTOR: Starting Docker container.
(timestamp) [apify] INFO Initializing Actor...
(timestamp) [apify] INFO System info ({"apify_sdk_version": "0.0.0", "apify_client_version": "0.0.0", "crawlee_version": "0.0.0", "python_version": "0.0.0", "os": "xyz"})
(timestamp) [BeautifulSoupCrawler] INFO Using proxy: yes
(timestamp) [BeautifulSoupCrawler] INFO Current request statistics:
...
(timestamp) [crawlee._autoscaling.autoscaled_pool] INFO current_concurrency = 0; desired_concurrency = 2; cpu = 0; mem = 0; event_loop = 0.0; client_info = 0.0
(timestamp) [crawlee.storages._request_queue] INFO The queue still contains requests locked by another client
(timestamp) [BeautifulSoupCrawler] INFO Looking for product detail pages
(timestamp) [BeautifulSoupCrawler] INFO Product detail page: https://warehouse-theme-metal.myshopify.com/products/jbl-flip-4-waterproof-portable-bluetooth-speaker
(timestamp) [BeautifulSoupCrawler] INFO Saving a product variant
...
```
The logs contain `Using proxy: yes`, which confirms that the scraper now uses proxies provided by the Apify platform.

## Congratulations!

You've reached the end of the course, congratulations! Together we've built a program which crawls a shop, extracts data about products and their prices, and exports the data. We simplified our work by using a framework, and we deployed the scraper to a scraping platform so that it can run unassisted periodically and accumulate data over time. The platform also helps us with monitoring and anti-scraping.
We hope that this will serve as a good start for your next adventure in the world of scraping. Perhaps creating scrapers which you publish for a fee so that others can re-use them? 😉