
Commit 96b92a6

feat: rework the lesson flow in respect to how apify CLI works
1 parent ff22747 commit 96b92a6

File tree: 2 files changed (+85 -42 lines)


sources/academy/webscraping/scraping_basics_python/12_framework.md

Lines changed: 16 additions & 16 deletions
@@ -44,11 +44,11 @@ Successfully installed Jinja2-0.0.0 ... ... ... crawlee-0.0.0 ... ... ...

## Running Crawlee

-Now let's use the framework to create a new version of our scraper. In the same project directory where our `main.py` file lives, create a file `newmain.py`. This way, we can keep peeking at the original implementation while working on the new one. The initial content will look like this:
+Now let's use the framework to create a new version of our scraper. Rename the `main.py` file to `oldmain.py`, so that we can keep peeking at the original implementation while working on the new one. Then, in the same project directory, create a new, empty `main.py`. The initial content will look like this:

-```py title="newmain.py"
+```py
import asyncio
-from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler
+from crawlee.crawlers import BeautifulSoupCrawler

async def main():
crawler = BeautifulSoupCrawler()
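
For context, the hunk above only shows the first lines of the new `main.py`. A minimal sketch of how the rest of the initial version might look, assuming it simply prints the heading of the Sales listing page (the start URL and CSS selector are illustrative, not taken from the lesson):

```py
import asyncio

from crawlee.crawlers import BeautifulSoupCrawler


async def main():
    crawler = BeautifulSoupCrawler()

    # The default handler runs for every request that has no more specific handler.
    @crawler.router.default_handler
    async def handle_listing(context):
        # context.soup is the downloaded page parsed by BeautifulSoup.
        # The selector is an assumption; it should match the "Sales" heading
        # seen in the run output further below.
        heading = context.soup.select_one("h1")
        print(heading.text.strip())

    await crawler.run(["https://warehouse-theme-metal.myshopify.com/collections/sales"])


if __name__ == '__main__':
    asyncio.run(main())
```
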
@@ -74,8 +74,8 @@ In the code, we do the following:
Don't worry if this involves a lot of things you've never seen before. For now, you don't need to know exactly how [`asyncio`](https://docs.python.org/3/library/asyncio.html) works or what decorators do. Let's stick to the practical side and see what the program does when executed:

```text
-$ python newmain.py
-[crawlee.beautifulsoup_crawler._beautifulsoup_crawler] INFO Current request statistics:
+$ python main.py
+[BeautifulSoupCrawler] INFO Current request statistics:
┌───────────────────────────────┬──────────┐
│ requests_finished │ 0 │
│ requests_failed │ 0 │
@@ -91,7 +91,7 @@ $ python newmain.py
[crawlee._autoscaling.autoscaled_pool] INFO current_concurrency = 0; desired_concurrency = 2; cpu = 0; mem = 0; event_loop = 0.0; client_info = 0.0
Sales
[crawlee._autoscaling.autoscaled_pool] INFO Waiting for remaining tasks to finish
-[crawlee.beautifulsoup_crawler._beautifulsoup_crawler] INFO Final request statistics:
+[BeautifulSoupCrawler] INFO Final request statistics:
┌───────────────────────────────┬──────────┐
│ requests_finished │ 1 │
│ requests_failed │ 0 │
@@ -122,7 +122,7 @@ For example, it takes a single line of code to extract and follow links to produ

```py
import asyncio
-from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler
+from crawlee.crawlers import BeautifulSoupCrawler

async def main():
crawler = BeautifulSoupCrawler()
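
The hunk header above mentions that a single line of code is enough to extract and follow links to the product pages. In Crawlee that line is typically a call to `context.enqueue_links()`; a sketch of how the listing and detail handlers might fit together (the CSS selector, the `DETAIL` label, and the start URL are assumptions):

```py
import asyncio

from crawlee.crawlers import BeautifulSoupCrawler


async def main():
    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def handle_listing(context):
        # The "single line": find product links on the listing page and enqueue
        # them, labeled so that they are routed to the DETAIL handler below.
        await context.enqueue_links(selector=".product-item__title", label="DETAIL")

    @crawler.router.handler("DETAIL")
    async def handle_detail(context):
        # For now, just print the URL of each product detail page.
        print(context.request.url)

    await crawler.run(["https://warehouse-theme-metal.myshopify.com/collections/sales"])


if __name__ == '__main__':
    asyncio.run(main())
```
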
@@ -152,8 +152,8 @@ Below that, we give the crawler another asynchronous function, `handle_detail()`
If we run the code, we should see how Crawlee first downloads the listing page and then makes parallel requests to each of the detail pages, printing their URLs along the way:

```text
-$ python newmain.py
-[crawlee.beautifulsoup_crawler._beautifulsoup_crawler] INFO Current request statistics:
+$ python main.py
+[BeautifulSoupCrawler] INFO Current request statistics:
┌───────────────────────────────┬──────────┐
...
└───────────────────────────────┴──────────┘
@@ -164,7 +164,7 @@ https://warehouse-theme-metal.myshopify.com/products/sony-sacs9-10-inch-active-s
https://warehouse-theme-metal.myshopify.com/products/sony-ps-hx500-hi-res-usb-turntable
...
[crawlee._autoscaling.autoscaled_pool] INFO Waiting for remaining tasks to finish
-[crawlee.beautifulsoup_crawler._beautifulsoup_crawler] INFO Final request statistics:
+[BeautifulSoupCrawler] INFO Final request statistics:
┌───────────────────────────────┬──────────┐
│ requests_finished │ 25 │
│ requests_failed │ 0 │
@@ -232,7 +232,7 @@ Finally, the variants. We can reuse the `parse_variant()` function as-is, and in
```py
import asyncio
from decimal import Decimal
-from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler
+from crawlee.crawlers import BeautifulSoupCrawler

async def main():
crawler = BeautifulSoupCrawler()
@@ -309,7 +309,7 @@ async def main():
await context.push_data(item)
```

-That's it! If you run the program now, there should be a `storage` directory alongside the `newmain.py` file. Crawlee uses it to store its internal state. If you go to the `storage/datasets/default` subdirectory, you'll see over 30 JSON files, each representing a single item.
+That's it! If you run the program now, there should be a `storage` directory alongside the `main.py` file. Crawlee uses it to store its internal state. If you go to the `storage/datasets/default` subdirectory, you'll see over 30 JSON files, each representing a single item.

![Single dataset item](images/dataset-item.png)
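
Since each scraped item ends up as its own JSON file under `storage/datasets/default`, we can peek at the stored data with nothing but the standard library. A quick sketch (the keys inside each item depend on what the scraper pushed):

```py
import json
from pathlib import Path

# Print every item Crawlee stored in its default dataset on disk.
for path in sorted(Path("storage/datasets/default").glob("*.json")):
    item = json.loads(path.read_text())
    print(path.name, item)
```
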

@@ -335,7 +335,7 @@ Crawlee gives us stats about HTTP requests and concurrency, but we don't get muc

```py
import asyncio
from decimal import Decimal
-from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler
+from crawlee.crawlers import BeautifulSoupCrawler

async def main():
crawler = BeautifulSoupCrawler()
@@ -398,7 +398,7 @@ if __name__ == '__main__':

Depending on what we find helpful, we can tweak the logs to include more or less detail. The `context.log` or `crawler.log` objects are just [standard Python loggers](https://docs.python.org/3/library/logging.html).

-Even with the extra logging we've added, we've managed to cut at least 20 lines of code compared to the original program. Throughout this lesson, we've been adding features to match the old scraper's functionality, but the new code is still clean and readable. Plus, we've been able to focus on what's unique to the website we're scraping and the data we care about, while the framework manages the rest.
+If we compare `main.py` and `oldmain.py` now, it's clear we've cut at least 20 lines of code compared to the original program, even with the extra logging we've added. Throughout this lesson, we've introduced features to match the old scraper's functionality, but the new code is still clean and readable. Plus, we've been able to focus on what's unique to the website we're scraping and the data we care about, while the framework manages the rest.

In the next lesson, we'll use a scraping platform to set up our application to run automatically every day.
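
Because `context.log` and `crawler.log` (mentioned in the hunk above) are plain `logging.Logger` objects, anything from the standard `logging` module applies to them. A small stand-alone illustration (the logger name is arbitrary):

```py
import logging

# A stand-in for crawler.log / context.log, which are loggers just like this one.
log = logging.getLogger("my-crawler")
log.addHandler(logging.StreamHandler())  # send records to stderr
log.setLevel(logging.DEBUG)              # include debug-level detail

log.debug("This is the kind of call context.log.debug(...) would make")
log.info("And this corresponds to context.log.info(...)")
```
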

@@ -454,7 +454,7 @@ Hints:
import asyncio
from datetime import datetime

-from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler
+from crawlee.crawlers import BeautifulSoupCrawler

async def main():
crawler = BeautifulSoupCrawler()
@@ -554,7 +554,7 @@ When navigating to the first search result, you might find it helpful to know th
from urllib.parse import quote_plus

from crawlee import Request
-from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler
+from crawlee.crawlers import BeautifulSoupCrawler

async def main():
crawler = BeautifulSoupCrawler()

sources/academy/webscraping/scraping_basics_python/13_platform.md

Lines changed: 69 additions & 26 deletions
@@ -25,66 +25,109 @@ In this lesson, we'll use a platform to address all of these issues. Generic clo

Scraping platforms come in many varieties, offering a wide range of tools and approaches. As the course authors, we're obviously a bit biased toward Apify—we think it's both powerful and complete.

-That said, the main goal of this lesson is to show how deploying to **any platform** can make life easier—it's not Apify-specific. Plus, everything we cover here fits within [Apify's free tier](https://apify.com/pricing).
+That said, the main goal of this lesson is to show how deploying to **any platform** can make life easier. Plus, everything we cover here fits within [Apify's free tier](https://apify.com/pricing).

:::

-## Packaging the project
+## Registering
+
+First, let's [create a new Apify account](https://console.apify.com/sign-up). The process includes several verifications that you're a human being and that your e-mail address is valid. While annoying, these are necessary measures to prevent abuse of the platform.

-Until now we've been adding dependencies to our project by only installing them by `pip` inside an activated virtual environment. If we were to send our code to a friend, they wouldn't know what they need to install before they can run the scraper without import errors. The same applies if we were to send our code to a cloud platform.
+Apify serves both as infrastructure where you can privately deploy and run your own scrapers, and as a marketplace where anyone can offer their ready-made scrapers to others for rent. But we'll overcome our curiosity for now and leave exploring the Apify Store for later.

-In the root of the project, let's create a file called `requirements.txt`, with a single line consisting of a single word:
+## Getting access from the command line
+
+To control the platform from our machine and send the code of our program there, we'll need the Apify CLI. On macOS, we can install the CLI using [Homebrew](https://brew.sh); otherwise, we'll first need [Node.js](https://nodejs.org/en/download).

-```text title="requirements.txt"
-crawlee
+After following the [Apify CLI installation guide](https://docs.apify.com/cli/docs/installation), we'll verify that we installed the tool by printing its version:
+
+```text
+$ apify --version
+apify-cli/0.0.0 system-arch00 node-v0.0.0
```

-Each line in the file represents a single dependency, but so far our program has just one. With `requirements.txt` in place, Apify can run `pip install -r requirements.txt` to download and install all dependencies of the project before starting our program.
+Now let's connect the CLI with the cloud platform using the account we created in the previous step:

-:::tip Packaging projects
+```text
+$ apify login
+...
+Success: You are logged in to Apify as user1234!
+```

-The [requirements file](https://pip.pypa.io/en/latest/user_guide/#requirements-files) is an obsolete approach to packaging a Python project, but it still works and it's the simplest, which is convenient for the purposes of this lesson.
+## Creating a package

-For any serious work the best and future-proof approach to packaging is to create the [`pyproject.toml`](https://packaging.python.org/en/latest/guides/writing-pyproject-toml/) configuration file. We recommend the official [Python Packaging User Guide](https://packaging.python.org/) for more info.
+Until now we've kept our scrapers minimal, each represented by just a single Python module, such as `main.py`. We've also been adding dependencies to our project only by installing them with `pip` inside an activated virtual environment.

-:::
+If we were to send our code to a friend like this, they wouldn't know what they need to install before they can run the scraper without import errors. The same applies if we were to send our code to a cloud platform.

-## Registering
+To be able to share what we've built, we need a packaged Python project. The best way to do that is to follow the official [Python Packaging User Guide](https://packaging.python.org/), but for the sake of this course let's take a shortcut with the Apify CLI.

-As a second step, let's [create a new Apify account](https://console.apify.com/sign-up). The process includes several verifications that you're a human being and that your e-mail address is valid. While annoying, these are necessary measures to prevent abuse of the platform.
+Change directory in your terminal to a place where you start new projects. Then run the following command. It will create a new subdirectory called `warehouse-watchdog` for the new project, containing all the necessary files:

-Apify serves both as an infrastructure where to privately deploy and run own scrapers, and as a marketplace, where anyone can offer their ready scrapers to others for rent. We'll overcome our curiosity for now and leave exploring the Apify Store for later.
+```text
+$ apify create warehouse-watchdog --template=python-crawlee-beautifulsoup
+Info: Python version 0.0.0 detected.
+Info: Creating a virtual environment in ...
+...
+Success: Actor 'warehouse-watchdog' was created. To run it, run "cd warehouse-watchdog" and "apify run".
+Info: To run your code in the cloud, run "apify push" and deploy your code to Apify Console.
+Info: To install additional Python packages, you need to activate the virtual environment in the ".venv" folder in the actor directory.
+```

-## Getting access from the command line
+A new `warehouse-watchdog` subdirectory should appear. Inside, we should see a `src` directory containing several Python files, `main.py` among them. That's a sample BeautifulSoup scraper provided by the template. Let's edit the file and overwrite its contents with our own scraper (the full code is provided in the previous lesson as the [last code example](./12_framework.md#logging)).
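
The `apify run` output further below shows the project being executed as `python3 -m src`, which suggests the template's entry point imports and runs a `main()` coroutine from `src/main.py`. Under that assumption, the overwritten file keeps roughly this shape (the handler functions come from the previous lesson and are elided; the start URL is illustrative):

```py
# src/main.py (sketch only)
from crawlee.crawlers import BeautifulSoupCrawler


async def main():
    crawler = BeautifulSoupCrawler()

    # The @crawler.router.default_handler and @crawler.router.handler("DETAIL")
    # functions from the previous lesson's final example go here.

    await crawler.run(["https://warehouse-theme-metal.myshopify.com/collections/sales"])
```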

-To control the platform from our machine and send the code of our program there, we'll need the Apify CLI. On macOS, we can install the CLI using [Homebrew](https://brew.sh), otherwise we'll first need [Node.js](https://nodejs.org/en/download).
+## Deploying to the platform

-After following the [Apify CLI installation guide](https://docs.apify.com/cli/docs/installation), we'll verify that we installed the tool by printing its version:
+As a first step, we'll change directory in our terminal so that we're inside `warehouse-watchdog`. There, we'll verify that everything works on our machine before we deploy our project to the cloud:

```text
-$ apify --version
-apify-cli/0.0.0 system-arch00 node-v0.0.0
+$ apify run
+Run: /Users/course/Projects/warehouse-watchdog/.venv/bin/python3 -m src
+[BeautifulSoupCrawler] INFO Current request statistics:
+┌───────────────────────────────┬──────────┐
+│ requests_finished │ 0 │
+│ requests_failed │ 0 │
+│ retry_histogram │ [0] │
+│ request_avg_failed_duration │ None │
+│ request_avg_finished_duration │ None │
+│ requests_finished_per_minute │ 0 │
+│ requests_failed_per_minute │ 0 │
+│ request_total_duration │ 0.0 │
+│ requests_total │ 0 │
+│ crawler_runtime │ 0.016736 │
+└───────────────────────────────┴──────────┘
+[crawlee._autoscaling.autoscaled_pool] INFO current_concurrency = 0; desired_concurrency = 2; cpu = 0; mem = 0; event_loop = 0.0; client_info = 0.0
+[BeautifulSoupCrawler] INFO Looking for product detail pages
+[BeautifulSoupCrawler] INFO Product detail page: https://warehouse-theme-metal.myshopify.com/products/jbl-flip-4-waterproof-portable-bluetooth-speaker
+[BeautifulSoupCrawler] INFO Saving a product variant
+[BeautifulSoupCrawler] INFO Saving a product variant
+...
```

-Now let's connect the CLI with the platform using our account:
+If the scraper run ends without errors, we can proceed to deployment:

```text
-$ apify login
+$ apify push
+Info: Created Actor with name warehouse-watchdog on Apify.
+Info: Deploying Actor 'warehouse-watchdog' to Apify.
+Run: Updated version 0.0 for Actor warehouse-watchdog.
+Run: Building Actor warehouse-watchdog
...
-Success: You are logged in to Apify as user1234!
+Actor build detail https://console.apify.com/actors/a123bCDefghiJkLMN#/builds/0.0.1
+? Do you want to open the Actor detail in your browser? (Y/n)
```

+Let's agree to open the Actor detail in the browser. There we'll find a button to **Start Actor**.
+
<!--
+TODO we'll need to remove INPUT from the config
+
it seems apify init won't recognize the project only with requirements.txt
https://crawlee.dev/python/docs/introduction/deployment
https://packaging.python.org/en/latest/tutorials/installing-packages/
https://docs.apify.com/sdk/python/docs/overview/introduction
-->

-## Creating an Actor
-
-...
-
## What's next

---
