
Commit f5dbee8

TC-MO and vladfrangu authored
chore: docs for v1 (#891)
Co-authored-by: Vlad Frangu <[email protected]>
1 parent 22a24c1 commit f5dbee8

File tree: 10 files changed, +1706 −0 lines changed

Lines changed: 23 additions & 0 deletions
---
title: Overview
---

Apify command-line interface (Apify CLI) helps you create, develop, build and run
[Apify Actors](https://apify.com/actors),
and manage the Apify cloud platform from any computer.

Apify Actors are cloud programs that can perform arbitrary web scraping, automation, or data processing jobs.
They accept input, perform their job, and generate output.
While you can develop Actors in an online IDE directly in the [Apify web application](https://console.apify.com/),
for complex projects it is more convenient to develop Actors locally on your computer
using the [Apify SDK](https://github.com/apify/apify-sdk-js)
and only push the Actors to the Apify cloud during deployment.
This is where the Apify CLI comes in.

:::note Run Actors in Docker

Actors running on the Apify platform are executed in Docker containers, so with an appropriate `Dockerfile`
you can build your Actors in any programming language.
However, we recommend using JavaScript/Node.js or Python, for which we provide the most libraries and support.

:::
Lines changed: 75 additions & 0 deletions
---
title: Installation
description: Learn how to install Apify CLI using installation scripts, Homebrew, or NPM.
---

Learn how to install Apify CLI using installation scripts, Homebrew, or NPM.

---

The recommended way to install Apify CLI is by using our installation scripts. This means you don't need to install Node.js to use the CLI, which is useful for Python users or anyone who doesn't want to manage Node.js dependencies.

## Preferred methods

### macOS / Unix

```bash
curl -fsSL https://apify.com/install-cli.sh | bash
```

### Windows

```powershell
irm https://apify.com/install-cli.ps1 | iex
```

## Other methods

### Homebrew

```bash
brew install apify-cli
```

### NPM

First, make sure you have [Node.js](https://nodejs.org) version 22 or higher with NPM installed on your computer:

```bash showLineNumbers
node --version
npm --version
```

Install or upgrade Apify CLI by running:

```bash
npm install -g apify-cli
```

:::tip Troubleshooting

If you receive a permission error, read npm's [official guide](https://docs.npmjs.com/resolving-eacces-permissions-errors-when-installing-packages-globally) on installing packages globally.

:::

## Verify installation

You can verify the installation by running the following command:

```bash
apify --version
```

The output should resemble the following (exact details such as version or platform may vary):

```bash
apify-cli/1.0.1 (0dfcfd8) running on darwin-arm64 with bun-1.2.19 (emulating node 24.3.0), installed via bundle
```

## Upgrading

Upgrading Apify CLI is as simple as running the following command:

```bash
apify upgrade
```
Lines changed: 188 additions & 0 deletions
---
title: Integrating Scrapy projects
description: Learn how to run Scrapy projects as Apify Actors and deploy them on the Apify platform.
sidebar_label: Integrating Scrapy projects
---

[Scrapy](https://scrapy.org/) is a widely used open-source web scraping framework for Python. Scrapy projects can now be executed on the Apify platform using our dedicated wrapping tool. This tool allows users to transform their Scrapy projects into [Apify Actors](https://docs.apify.com/platform/actors) with just a few simple commands.

## Prerequisites

Before you begin, make sure you have the Apify CLI installed on your system. If you haven't installed it yet, follow the [installation guide](./installation.md).

## Actorization of your existing Scrapy spider

Assuming your Scrapy project is set up, navigate to the project root where the `scrapy.cfg` file is located.

```bash
cd your_scraper
```

Verify the directory contents to ensure you are in the correct location.

```bash showLineNumbers
$ ls -R
.:
your_scraper README.md requirements.txt scrapy.cfg

./your_scraper:
__init__.py items.py __main__.py main.py pipelines.py settings.py spiders

./your_scraper/spiders:
your_spider.py __init__.py
```

To convert your Scrapy project into an Apify Actor, initiate the wrapping process by executing the following command:

```bash
apify init
```

The script will prompt you with a series of questions. Upon completion, the output might resemble the following:

```bash showLineNumbers
Info: The current directory looks like a Scrapy project. Using automatic project wrapping.
? Enter the Scrapy BOT_NAME (see settings.py): books_scraper
? What folder are the Scrapy spider modules stored in? (see SPIDER_MODULES in settings.py): books_scraper.spiders
? Pick the Scrapy spider you want to wrap: BookSpider (/home/path/to/actor-scrapy-books-example/books_scraper/spiders/book.py)
Info: Downloading the latest Scrapy wrapper template...
Info: Wrapping the Scrapy project...
Success: The Scrapy project has been wrapped successfully.
```

For example, here is the [source code](https://github.com/apify/actor-scrapy-books-example) of an actorized Scrapy project, and [here](https://apify.com/vdusek/scrapy-books-example) is the corresponding Actor in Apify Store.

### Run the Actor locally

Create a Python virtual environment by running:

```bash
python -m virtualenv .venv
```

Activate the virtual environment:

```bash
source .venv/bin/activate
```

Install Python dependencies using the provided requirements file named `requirements_apify.txt`. Ensure these requirements are installed before executing your project as an Apify Actor locally. You can put your own dependencies there as well.

```bash
pip install -r requirements_apify.txt [-r requirements.txt]
```

Finally, execute the Apify Actor:

```bash
apify run [--purge]
```

If [ActorDatasetPushPipeline](https://github.com/apify/apify-sdk-python/blob/master/src/apify/scrapy/pipelines.py) is configured, the Actor's output will be stored in the `storage/datasets/default/` directory.

### Run the scraper as a Scrapy project

The project remains executable as a Scrapy project.

```bash
scrapy crawl your_spider -o books.json
```

## Deploy on Apify

### Log in to Apify

You will need to provide your [Apify API token](https://console.apify.com/settings/integrations) to complete this action.

```bash
apify login
```

### Deploy your Actor

This command will deploy and build the Actor on the Apify platform. You can find your newly created Actor under [Actors -> My Actors](https://console.apify.com/actors?tab=my).

```bash
apify push
```

## What the wrapping process does

The initialization command enhances your project by adding necessary files and updating some of them, while preserving its functionality as a typical Scrapy project. The additional requirements file, named `requirements_apify.txt`, includes the Apify Python SDK and other essential requirements. The `.actor/` directory contains the basic configuration of your Actor. We provide two new Python files, [main.py](https://github.com/apify/actor-templates/blob/master/templates/python-scrapy/src/main.py) and [\_\_main\_\_.py](https://github.com/apify/actor-templates/blob/master/templates/python-scrapy/src/__main__.py), in which we encapsulate the Scrapy project within an Actor. There we also import and use a few Scrapy components from our [Python SDK](https://github.com/apify/apify-sdk-python/tree/master/src/apify/scrapy). These components facilitate the integration of Scrapy projects with the Apify platform. Further details about these components are provided in the following subsections.

### Scheduler

The [scheduler](https://docs.scrapy.org/en/latest/topics/scheduler.html) is a core Scrapy component responsible for receiving and providing requests to be processed. To leverage the [Apify request queue](https://docs.apify.com/platform/storage/request-queue) for storing requests, a custom scheduler becomes necessary. Fortunately, Scrapy is a modular framework, allowing the creation of custom components. As a result, we have implemented the [ApifyScheduler](https://github.com/apify/apify-sdk-python/blob/master/src/apify/scrapy/scheduler.py). When using the Apify CLI wrapping tool, the scheduler is configured in the [src/main.py](https://github.com/apify/actor-templates/blob/master/templates/python-scrapy/src/main.py) file of your Actor.
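In plain Scrapy terms, swapping in a custom scheduler is a one-line settings change. The fragment below is only an illustrative sketch: the wrapper generates the real configuration for you in `src/main.py`, and the exact import path is an assumption based on the SDK's file layout.

```python
# Hypothetical Scrapy settings fragment. The wrapper applies the real
# configuration for you; the import path is assumed from the SDK repository.
SCHEDULER = "apify.scrapy.scheduler.ApifyScheduler"
```

With `SCHEDULER` set, Scrapy routes every enqueued request through the custom class instead of its default disk/memory queues.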

### Dataset push pipeline

[Item pipelines](https://docs.scrapy.org/en/latest/topics/item-pipeline.html) are used for processing the results produced by your spiders. To handle the transmission of result data to the [Apify dataset](https://docs.apify.com/platform/storage/dataset), we have implemented the [ActorDatasetPushPipeline](https://github.com/apify/apify-sdk-python/blob/master/src/apify/scrapy/pipelines.py). When using the Apify CLI wrapping tool, the pipeline is configured in the [src/main.py](https://github.com/apify/actor-templates/blob/master/templates/python-scrapy/src/main.py) file of your Actor. It is assigned the highest integer value (1000), ensuring its execution as the final step in the pipeline sequence.
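Expressed as a Scrapy `ITEM_PIPELINES` setting, the configuration might look like the sketch below. Again, this is illustrative only: the wrapper generates the real configuration, and the import path is assumed from the SDK's file layout.

```python
# Hypothetical fragment; the wrapper generates the real configuration in
# src/main.py. Priority 1000 (the highest allowed) makes the dataset push
# run after every other item pipeline has processed the item.
ITEM_PIPELINES = {
    "apify.scrapy.pipelines.ActorDatasetPushPipeline": 1000,
}
```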

### Retry middleware

[Downloader middlewares](https://docs.scrapy.org/en/latest/topics/downloader-middleware.html) are a way to hook into Scrapy's request/response processing. Scrapy comes with various default middlewares, including the [RetryMiddleware](https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#module-scrapy.downloadermiddlewares.retry), designed to handle retries for requests that may have failed due to temporary issues. When integrating with the [Apify request queue](https://docs.apify.com/platform/storage/request-queue), it becomes necessary to enhance this middleware to facilitate communication with the request queue, marking requests either as handled or as ready for a retry. When using the Apify CLI wrapping tool, the default `RetryMiddleware` is disabled, and [ApifyRetryMiddleware](https://github.com/apify/apify-sdk-python/blob/master/src/apify/scrapy/middlewares/apify_retry.py) takes its place. Configuration for the middlewares is established in the [src/main.py](https://github.com/apify/actor-templates/blob/master/templates/python-scrapy/src/main.py) file of your Actor.

### HTTP proxy middleware

Another default Scrapy [downloader middleware](https://docs.scrapy.org/en/latest/topics/downloader-middleware.html) that requires replacement is [HttpProxyMiddleware](https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#module-scrapy.downloadermiddlewares.httpproxy). To make use of proxies managed through the Apify [ProxyConfiguration](https://github.com/apify/apify-sdk-python/blob/master/src/apify/proxy_configuration.py), we provide [ApifyHttpProxyMiddleware](https://github.com/apify/apify-sdk-python/blob/master/src/apify/scrapy/middlewares/apify_proxy.py). When using the Apify CLI wrapping tool, the default `HttpProxyMiddleware` is disabled, and [ApifyHttpProxyMiddleware](https://github.com/apify/apify-sdk-python/blob/master/src/apify/scrapy/middlewares/apify_proxy.py) takes its place. Additionally, inspect the [.actor/input_schema.json](https://github.com/apify/actor-templates/blob/master/templates/python-scrapy/.actor/input_schema.json) file, where proxy configuration is specified as an input property for your Actor. The processing of this input is carried out together with the middleware configuration in [src/main.py](https://github.com/apify/actor-templates/blob/master/templates/python-scrapy/src/main.py).
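In Scrapy, disabling a default middleware means mapping it to `None` in `DOWNLOADER_MIDDLEWARES`, while the replacement is registered at a priority. The sketch below shows the general shape; it is not the wrapper's actual configuration, the import paths are assumed from the SDK's `middlewares/` layout, and the priorities shown are Scrapy's defaults for the replaced middlewares.

```python
# Hypothetical fragment; the wrapper generates the real configuration in
# src/main.py. Mapping a default middleware to None disables it; the Apify
# replacement is registered at the default's priority slot (550 for retry,
# 750 for HTTP proxy, per Scrapy's DOWNLOADER_MIDDLEWARES_BASE).
DOWNLOADER_MIDDLEWARES = {
    "scrapy.downloadermiddlewares.retry.RetryMiddleware": None,
    "apify.scrapy.middlewares.apify_retry.ApifyRetryMiddleware": 550,
    "scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware": None,
    "apify.scrapy.middlewares.apify_proxy.ApifyHttpProxyMiddleware": 750,
}
```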

## Known limitations

There are some known limitations of running Scrapy projects on the Apify platform that we are aware of.

### Asynchronous code in spiders and other components

Scrapy's asynchronous execution is based on the [Twisted](https://twisted.org/) library rather than
[AsyncIO](https://docs.python.org/3/library/asyncio.html), which brings some complications to the table.

Due to the asynchronous nature of Actors, all of their code is executed as a coroutine inside `asyncio.run`.
In order to execute Scrapy code inside an Actor, following the section
[Run Scrapy from a script](https://docs.scrapy.org/en/latest/topics/practices.html?highlight=CrawlerProcess#run-scrapy-from-a-script)
from the official Scrapy documentation, we need to invoke the
[`CrawlerProcess.start`](https://github.com/scrapy/scrapy/blob/2.11.0/scrapy/crawler.py#L393:L427)
method. This method triggers Twisted's event loop, also known as a reactor.
Consequently, Twisted's event loop is executed within AsyncIO's event loop.
On top of that, employing AsyncIO code in spiders or other components necessitates the creation of a new
AsyncIO event loop, within which the coroutines from these components are executed. This means there is
an execution of an AsyncIO event loop inside the Twisted event loop inside the AsyncIO event loop.

We have resolved this issue by leveraging the [nest-asyncio](https://pypi.org/project/nest-asyncio/) library,
enabling the execution of nested AsyncIO event loops. For executing a coroutine within a spider or other component,
it is recommended to use Apify's instance of the nested event loop. Refer to the code example below or derive
inspiration from Apify's Scrapy components, such as the
[ApifyScheduler](https://github.com/apify/apify-sdk-python/blob/v1.5.0/src/apify/scrapy/scheduler.py#L114).

```python
from apify.scrapy.utils import nested_event_loop

...

# Coroutine execution inside a spider
nested_event_loop.run_until_complete(my_coroutine())
```

### More spiders per Actor

It is recommended to execute only one Scrapy spider per Apify Actor.

Mapping more Scrapy spiders to a single Apify Actor does not make much sense. We would have to create a separate
instance of the [request queue](https://docs.apify.com/platform/storage/request-queue) for every spider.
Also, every spider can produce different output, resulting in a mess in the output
[dataset](https://docs.apify.com/platform/storage/dataset). A solution for this could be to store the output
of every spider in a different [key-value store](https://docs.apify.com/platform/storage/key-value-store). However,
a much simpler solution to this problem is to just have a single spider per Actor.

If you want to share common Scrapy components (middlewares, item pipelines, ...) among more spiders (Actors), you
can use a dedicated Python package containing your components and install it into your Actors' environment. The
other solution to this problem could be to have more spiders per Actor but run only one of them per Actor run.
Which spider is going to be executed in an Actor run can be specified in the
[input schema](https://docs.apify.com/academy/deploying-your-code/input-schema).
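If you do take the multiple-spiders route, the dispatch itself is just a name-to-class lookup on the Actor input. A minimal self-contained sketch follows; the spider classes and the `spider_name` input field are hypothetical, not part of the wrapper.

```python
# Hypothetical dispatch: pick one spider class based on an Actor input field.
# BookSpider, AuthorSpider, and the "spider_name" key are illustrative only.
class BookSpider: ...
class AuthorSpider: ...

SPIDERS = {"books": BookSpider, "authors": AuthorSpider}

def pick_spider(actor_input: dict) -> type:
    """Return the spider class named by the Actor input, defaulting to books."""
    return SPIDERS[actor_input.get("spider_name", "books")]
```

The chosen class would then be passed to `CrawlerProcess.crawl()` as usual, so only one spider runs per Actor run.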

## Additional links

- [Scrapy Books Example Actor](https://apify.com/vdusek/scrapy-books-example)
- [Python Actor Scrapy template](https://apify.com/templates/python-scrapy)
- [Apify SDK for Python](https://docs.apify.com/sdk/python)
- [Apify platform](https://docs.apify.com/platform)
- [Join our developer community on Discord](https://discord.com/invite/jyEM2PRvMU)

> We welcome any feedback! Please feel free to contact us at [[email protected]](mailto:[email protected]). Thank you for your valuable input.
Lines changed: 84 additions & 0 deletions
---
title: Quick Start
description: Learn how to create, run, and manage Actors using Apify CLI.
---

Learn how to create, run, and manage Actors using Apify CLI.

## Prerequisites

Before you begin, make sure you have the Apify CLI installed on your system. If you haven't installed it yet, follow the [installation guide](./installation.md).

## Step 1: Create your Actor

Run the following command in your terminal. It will guide you step by step through the creation process.

```bash
apify create
```

:::info Explore Actor templates

The Apify CLI will prompt you to choose a template. Browse the [full list of templates](https://apify.com/templates) to find the best fit for your Actor.

:::

## Step 2: Run your Actor

Once the Actor is initialized, you can run it:

```bash
apify run
```

You'll see output similar to this in your terminal:

```bash
INFO System info {"apifyVersion":"3.4.3","apifyClientVersion":"2.12.6","crawleeVersion":"3.13.10","osType":"Darwin","nodeVersion":"v22.17.0"}
Extracted heading { level: 'h1', text: 'Your full‑stack platform for web scraping' }
Extracted heading { level: 'h3', text: 'TikTok Scraper' }
Extracted heading { level: 'h3', text: 'Google Maps Scraper' }
Extracted heading { level: 'h3', text: 'Instagram Scraper' }
```

## Step 3: Push your Actor

Once you are ready, you can push your Actor to the Apify platform, where you can schedule runs or make the Actor public for other developers.

### Log in to Apify Console

```bash
apify login
```

:::note Create an Apify account

Before you can interact with the Apify Console, [create an Apify account](https://console.apify.com/).
When you run `apify login`, you can choose one of the following methods:

- Sign in via the Apify Console in your browser — recommended.
- Provide an [Apify API token](https://console.apify.com/settings/integrations) — alternative method.

The interactive prompt will guide you through either option.

:::

### Push to Apify Console

```bash
apify push
```

## Step 4: Call your Actor (optional)

You can also run your Actor remotely on the Apify platform. In the following example, the command runs `apify/hello-world` on the Apify platform.

```bash
apify call apify/hello-world
```

## Next steps

- Check the [command reference](./reference.md) for more information about individual commands.
- If you have a problem with the Apify CLI, check the [troubleshooting](./troubleshooting.md) guide.
- Learn more about [Actors](https://docs.apify.com/platform/actors).
