---
title: Integrating Scrapy projects
description: Learn how to run Scrapy projects as Apify Actors and deploy them on the Apify platform.
sidebar_label: Integrating Scrapy projects
---

[Scrapy](https://scrapy.org/) is a widely used open-source web scraping framework for Python. Scrapy projects can now be executed on the Apify platform using our dedicated wrapping tool. This tool allows users to transform their Scrapy projects into [Apify Actors](https://docs.apify.com/platform/actors) with just a few simple commands.

## Prerequisites

Before you begin, make sure you have the Apify CLI installed on your system. If you haven't installed it yet, follow the [installation guide](./installation.md).

## Actorization of your existing Scrapy spider

Assuming your Scrapy project is set up, navigate to the project root where the `scrapy.cfg` file is located.

```bash
cd your_scraper
```

Verify the directory contents to ensure you are in the correct location.

```bash showLineNumbers
$ ls -R
.:
your_scraper README.md requirements.txt scrapy.cfg

./your_scraper:
__init__.py items.py __main__.py main.py pipelines.py settings.py spiders

./your_scraper/spiders:
your_spider.py __init__.py
```

To convert your Scrapy project into an Apify Actor, initiate the wrapping process by executing the following command:

```bash
apify init
```

The script will prompt you with a series of questions. Upon completion, the output might resemble the following:

```bash showLineNumbers
Info: The current directory looks like a Scrapy project. Using automatic project wrapping.
? Enter the Scrapy BOT_NAME (see settings.py): books_scraper
? What folder are the Scrapy spider modules stored in? (see SPIDER_MODULES in settings.py): books_scraper.spiders
? Pick the Scrapy spider you want to wrap: BookSpider (/home/path/to/actor-scrapy-books-example/books_scraper/spiders/book.py)
Info: Downloading the latest Scrapy wrapper template...
Info: Wrapping the Scrapy project...
Success: The Scrapy project has been wrapped successfully.
```

For example, here is the [source code](https://github.com/apify/actor-scrapy-books-example) of an actorized Scrapy project, and [here](https://apify.com/vdusek/scrapy-books-example) is the corresponding Actor in Apify Store.

### Run the Actor locally

Create a Python virtual environment by running:

```bash
python -m virtualenv .venv
```

Activate the virtual environment:

```bash
source .venv/bin/activate
```

Install Python dependencies using the provided requirements file named `requirements_apify.txt`. Ensure these requirements are installed before executing your project as an Apify Actor locally. You can add your own dependencies there as well.

```bash
pip install -r requirements_apify.txt [-r requirements.txt]
```

Finally, execute the Apify Actor.

```bash
apify run [--purge]
```

If [ActorDatasetPushPipeline](https://github.com/apify/apify-sdk-python/blob/master/src/apify/scrapy/pipelines.py) is configured, the Actor's output will be stored in the `storage/datasets/default/` directory.

### Run the scraper as a Scrapy project

The project remains executable as a Scrapy project.

```bash
scrapy crawl your_spider -o books.json
```

## Deploy on Apify

### Log in to Apify

You will need to provide your [Apify API Token](https://console.apify.com/settings/integrations) to complete this action.

```bash
apify login
```

### Deploy your Actor

This command will deploy and build the Actor on the Apify platform. You can find your newly created Actor under [Actors -> My Actors](https://console.apify.com/actors?tab=my).

```bash
apify push
```

## What the wrapping process does

The initialization command enhances your project by adding necessary files and updating some of them, while preserving its functionality as a typical Scrapy project. The additional requirements file, named `requirements_apify.txt`, includes the Apify Python SDK and other essential requirements. The `.actor/` directory contains the basic configuration of your Actor. We provide two new Python files, [main.py](https://github.com/apify/actor-templates/blob/master/templates/python-scrapy/src/main.py) and [\_\_main\_\_.py](https://github.com/apify/actor-templates/blob/master/templates/python-scrapy/src/__main__.py), where we encapsulate the Scrapy project within an Actor. There, we also import and use a few Scrapy components from our [Python SDK](https://github.com/apify/apify-sdk-python/tree/master/src/apify/scrapy). These components facilitate the integration of Scrapy projects with the Apify platform. Further details about these components are provided in the following subsections.
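
To give a rough idea of what this encapsulation looks like, here is a heavily simplified sketch of such a `main.py`. It is not the template's literal code: the spider module and class names are placeholders, and the real file linked above is more elaborate (it also applies the Apify-specific settings described in the following subsections).

```python
from apify import Actor
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from .spiders.your_spider import YourSpider  # placeholder spider module and class


async def main() -> None:
    # Entering the Actor context initializes the Apify SDK (configuration, storages, events).
    async with Actor:
        settings = get_project_settings()

        # process.start() launches Twisted's reactor and blocks until the spider finishes.
        process = CrawlerProcess(settings, install_root_handler=False)
        process.crawl(YourSpider)
        process.start()
```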

### Scheduler

The [scheduler](https://docs.scrapy.org/en/latest/topics/scheduler.html) is a core component of Scrapy responsible for receiving and providing requests to be processed. To leverage the [Apify request queue](https://docs.apify.com/platform/storage/request-queue) for storing requests, a custom scheduler becomes necessary. Fortunately, Scrapy is a modular framework, allowing the creation of custom components. As a result, we have implemented the [ApifyScheduler](https://github.com/apify/apify-sdk-python/blob/master/src/apify/scrapy/scheduler.py). When using the Apify CLI wrapping tool, the scheduler is configured in the [src/main.py](https://github.com/apify/actor-templates/blob/master/templates/python-scrapy/src/main.py) file of your Actor.
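
In terms of Scrapy configuration, replacing the scheduler boils down to overriding the `SCHEDULER` setting. The helper below is only an illustration of the idea, not the template's actual code; check the generated `src/main.py` for the authoritative configuration.

```python
from scrapy.settings import Settings


def use_apify_scheduler(settings: Settings) -> None:
    # Requests are then stored in and fetched from the Apify request queue
    # instead of Scrapy's default memory and disk queues.
    settings['SCHEDULER'] = 'apify.scrapy.scheduler.ApifyScheduler'
```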

### Dataset push pipeline

[Item pipelines](https://docs.scrapy.org/en/latest/topics/item-pipeline.html) are used for processing the results produced by your spiders. To handle the transmission of result data to the [Apify dataset](https://docs.apify.com/platform/storage/dataset), we have implemented the [ActorDatasetPushPipeline](https://github.com/apify/apify-sdk-python/blob/master/src/apify/scrapy/pipelines.py). When using the Apify CLI wrapping tool, the pipeline is configured in the [src/main.py](https://github.com/apify/actor-templates/blob/master/templates/python-scrapy/src/main.py) file of your Actor. It is assigned the highest integer value (1000), ensuring its execution as the final step in the pipeline sequence.
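
For illustration, enabling the pipeline with the highest priority value could look like the snippet below; again, this is a sketch rather than the template's literal code.

```python
from scrapy.settings import Settings


def enable_dataset_push_pipeline(settings: Settings) -> None:
    # 1000 is the maximum pipeline priority, so items are pushed to the dataset
    # only after all other pipelines have processed them.
    settings['ITEM_PIPELINES'] = {
        **settings.getdict('ITEM_PIPELINES'),
        'apify.scrapy.pipelines.ActorDatasetPushPipeline': 1000,
    }
```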

### Retry middleware

[Downloader middlewares](https://docs.scrapy.org/en/latest/topics/downloader-middleware.html) are a way to hook into Scrapy's request/response processing. Scrapy comes with various default middlewares, including the [RetryMiddleware](https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#module-scrapy.downloadermiddlewares.retry), designed to handle retries for requests that may have failed due to temporary issues. When integrating with the [Apify request queue](https://docs.apify.com/platform/storage/request-queue), it becomes necessary to enhance this middleware to facilitate communication with the request queue, marking requests either as handled or as ready for a retry. When using the Apify CLI wrapping tool, the default `RetryMiddleware` is disabled, and [ApifyRetryMiddleware](https://github.com/apify/apify-sdk-python/blob/master/src/apify/scrapy/middlewares/apify_retry.py) takes its place. Configuration for the middlewares is established in the [src/main.py](https://github.com/apify/actor-templates/blob/master/templates/python-scrapy/src/main.py) file of your Actor.
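
Conceptually, this is a standard Scrapy middleware override: the built-in `RetryMiddleware` is disabled by setting it to `None`, and the Apify variant is registered in its place. The import path and priority below are illustrative assumptions; the generated `src/main.py` is the source of truth.

```python
from scrapy.settings import Settings


def swap_retry_middleware(settings: Settings) -> None:
    settings['DOWNLOADER_MIDDLEWARES'] = {
        **settings.getdict('DOWNLOADER_MIDDLEWARES'),
        # Disable Scrapy's built-in retry handling...
        'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
        # ...and let the Apify variant mark requests in the request queue
        # as handled or as ready for a retry (550 mirrors the default priority).
        'apify.scrapy.middlewares.ApifyRetryMiddleware': 550,
    }
```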

### HTTP proxy middleware

Another default Scrapy [downloader middleware](https://docs.scrapy.org/en/latest/topics/downloader-middleware.html) that requires replacement is [HttpProxyMiddleware](https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#module-scrapy.downloadermiddlewares.httpproxy). To utilize proxies managed through the Apify [ProxyConfiguration](https://github.com/apify/apify-sdk-python/blob/master/src/apify/proxy_configuration.py), we provide [ApifyHttpProxyMiddleware](https://github.com/apify/apify-sdk-python/blob/master/src/apify/scrapy/middlewares/apify_proxy.py). When using the Apify CLI wrapping tool, the default `HttpProxyMiddleware` is disabled, and [ApifyHttpProxyMiddleware](https://github.com/apify/apify-sdk-python/blob/master/src/apify/scrapy/middlewares/apify_proxy.py) takes its place. Additionally, inspect the [.actor/input_schema.json](https://github.com/apify/actor-templates/blob/master/templates/python-scrapy/.actor/input_schema.json) file, where proxy configuration is specified as an input property for your Actor. The processing of this input is carried out together with the middleware configuration in [src/main.py](https://github.com/apify/actor-templates/blob/master/templates/python-scrapy/src/main.py).
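
A sketch of the same idea for the proxy middleware follows. The `APIFY_PROXY_SETTINGS` setting name, the `proxyConfiguration` input property, and the priority are assumptions based on the linked files rather than a documented API; inspect the generated `src/main.py` and `.actor/input_schema.json` for the actual wiring.

```python
from typing import Optional

from scrapy.settings import Settings


def swap_proxy_middleware(settings: Settings, proxy_config: Optional[dict]) -> None:
    settings['DOWNLOADER_MIDDLEWARES'] = {
        **settings.getdict('DOWNLOADER_MIDDLEWARES'),
        # Disable Scrapy's default proxy handling...
        'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': None,
        # ...and use Apify's proxy middleware instead (750 mirrors the default priority).
        'apify.scrapy.middlewares.ApifyHttpProxyMiddleware': 750,
    }
    # Assumed setting name: the proxy configuration coming from the Actor input
    # (the "proxyConfiguration" property in .actor/input_schema.json).
    settings['APIFY_PROXY_SETTINGS'] = proxy_config
```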

## Known limitations

There are some known limitations of running Scrapy projects on the Apify platform that we are aware of.

### Asynchronous code in spiders and other components

Scrapy's asynchronous execution is based on the [Twisted](https://twisted.org/) library, not
[AsyncIO](https://docs.python.org/3/library/asyncio.html), which brings some complications to the table.

Due to the asynchronous nature of Actors, all of their code is executed as a coroutine inside `asyncio.run`.
In order to execute Scrapy code inside an Actor, following the section
[Run Scrapy from a script](https://docs.scrapy.org/en/latest/topics/practices.html?highlight=CrawlerProcess#run-scrapy-from-a-script)
from the official Scrapy documentation, we need to invoke the
[`CrawlerProcess.start`](https://github.com/scrapy/scrapy/blob/2.11.0/scrapy/crawler.py#L393:L427)
method. This method triggers Twisted's event loop, also known as a reactor.
Consequently, Twisted's event loop is executed within AsyncIO's event loop.
On top of that, employing AsyncIO code in spiders or other components necessitates the creation of a new
AsyncIO event loop, within which the coroutines from these components are executed. This means the AsyncIO
event loop ends up running inside the Twisted event loop, which itself runs inside the outer AsyncIO event loop.

We have resolved this issue by leveraging the [nest-asyncio](https://pypi.org/project/nest-asyncio/) library,
enabling the execution of nested AsyncIO event loops. For executing a coroutine within a spider or other component,
it is recommended to use Apify's instance of the nested event loop. Refer to the code example below or derive
inspiration from Apify's Scrapy components, such as the
[ApifyScheduler](https://github.com/apify/apify-sdk-python/blob/v1.5.0/src/apify/scrapy/scheduler.py#L114).

```python
from apify.scrapy.utils import nested_event_loop
...

# Coroutine execution inside a spider
nested_event_loop.run_until_complete(my_coroutine())
```

### Multiple spiders per Actor

It is recommended to execute only one Scrapy spider per Apify Actor.

Mapping multiple Scrapy spiders to a single Apify Actor does not make much sense. We would have to create a separate
instance of the [request queue](https://docs.apify.com/platform/storage/request-queue) for every spider.
Also, every spider can produce different output, resulting in a mess in the output
[dataset](https://docs.apify.com/platform/storage/dataset). A solution for this could be to store the output
of every spider in a different [key-value store](https://docs.apify.com/platform/storage/key-value-store). However,
a much simpler solution to this problem is to have a single spider per Actor.

If you want to share common Scrapy components (middlewares, item pipelines, ...) among multiple spiders (Actors), you
can put them in a dedicated Python package and install it in your Actors' environments. Another
solution could be to keep multiple spiders in one Actor, but run only a single spider per Actor run.
Which spider is executed in a given run can be specified in the
[input schema](https://docs.apify.com/academy/deploying-your-code/input-schema), as sketched below.
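
If you do go down that road, here is a hedged sketch of the idea: the spider to run is read from the Actor input (using a hypothetical `spider_name` property that you would add to the input schema yourself) and mapped to a spider class before the crawl starts.

```python
from apify import Actor
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from .spiders.author_spider import AuthorSpider  # hypothetical spider modules and classes
from .spiders.book_spider import BookSpider

SPIDERS = {
    'authors': AuthorSpider,
    'books': BookSpider,
}


async def main() -> None:
    async with Actor:
        actor_input = await Actor.get_input() or {}
        # 'spider_name' is a hypothetical input property, not part of the template.
        spider_cls = SPIDERS[actor_input.get('spider_name', 'books')]

        process = CrawlerProcess(get_project_settings(), install_root_handler=False)
        process.crawl(spider_cls)
        process.start()
```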

## Additional links

- [Scrapy Books Example Actor](https://apify.com/vdusek/scrapy-books-example)
- [Python Actor Scrapy template](https://apify.com/templates/python-scrapy)
- [Apify SDK for Python](https://docs.apify.com/sdk/python)
- [Apify platform](https://docs.apify.com/platform)
- [Join our developer community on Discord](https://discord.com/invite/jyEM2PRvMU)

> We welcome any feedback! Please feel free to contact us at [[email protected]](mailto:[email protected]). Thank you for your valuable input.