Commit 4fb9f87

Update the Scrapy guide
1 parent 1f93e0c commit 4fb9f87

1 file changed: +38 -28 lines changed

docs/02_guides/05_scrapy.mdx

Lines changed: 38 additions & 28 deletions
@@ -13,54 +13,56 @@ import ItemsExample from '!!raw-loader!./code/scrapy_project/src/items.py';
1313
import TitleSpiderExample from '!!raw-loader!./code/scrapy_project/src/spiders/title.py';
1414
import SettingsExample from '!!raw-loader!./code/scrapy_project/src/settings.py';
1515

16-
[Scrapy](https://scrapy.org/) is an open-source web scraping framework written in Python. It provides a complete set of tools for web scraping, including the ability to define how to extract data from websites, handle pagination and navigation.
16+
[Scrapy](https://scrapy.org/) is an open-source web scraping framework for Python. It provides tools for defining scrapers, extracting data from web pages, following links, and handling pagination. With the Apify SDK, Scrapy projects can be converted into Apify [Actors](https://docs.apify.com/platform/actors), integrated with Apify [storages](https://docs.apify.com/platform/storage), and executed on the Apify [platform](https://docs.apify.com/platform).
1717

18-
:::tip
18+
## Integrating Scrapy with the Apify platform
1919

20-
Our CLI now supports transforming Scrapy projects into Apify Actors with a single command! Check out the [Scrapy migration guide](https://docs.apify.com/cli/docs/integrating-scrapy) for more information.
20+
The Apify SDK provides an Apify-Scrapy integration. The main challenge is combining two asynchronous frameworks that use different event loop implementations. Scrapy uses [Twisted](https://twisted.org/) for asynchronous execution, while the Apify SDK is based on [asyncio](https://docs.python.org/3/library/asyncio.html). The key step is to install Twisted's `asyncioreactor`, which runs Twisted on top of an asyncio-compatible event loop. This allows both Twisted and asyncio to run on a single event loop, enabling a Scrapy spider to run as an Apify Actor with minimal modifications.
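
As a rough illustration of the reactor installation described above, the following sketch installs Twisted's asyncio-based reactor on a fresh asyncio event loop. The helper name `install_asyncio_reactor` is hypothetical and not part of the Apify SDK. The sketch assumes Twisted may be absent and returns `False` in that case. It is not the SDK's actual implementation.

```python
import asyncio
import sys


def install_asyncio_reactor() -> bool:
    """Install Twisted's asyncio-based reactor and report whether it is active.

    Hypothetical helper for illustration only, not part of the Apify SDK.
    Returns False if Twisted is not installed or another reactor is in use.
    """
    try:
        from twisted.internet import asyncioreactor
    except ImportError:
        return False

    if "twisted.internet.reactor" in sys.modules:
        # A reactor has already been installed; only check which kind it is.
        from twisted.internet import reactor

        return isinstance(reactor, asyncioreactor.AsyncioSelectorReactor)

    # This must happen before anything imports `twisted.internet.reactor`,
    # because that import installs Twisted's default reactor as a side effect.
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    asyncioreactor.install(loop)
    return True
```

Calling the helper once at program start, before any Scrapy import that touches the reactor, is the ordering this sketch relies on.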
2121

22-
:::
22+
<CodeBlock className="language-python" title="__main__.py: The Actor entry point">
23+
{UnderscoreMainExample}
24+
</CodeBlock>
2325

24-
Some of the key features of Scrapy for web scraping include:
26+
In this setup, `apify.scrapy.initialize_logging` configures an Apify log formatter and reconfigures loggers to ensure consistent logging across Scrapy, the Apify SDK, and other libraries. The `apify.scrapy.run_scrapy_actor` function bridges asyncio coroutines with Twisted's reactor, so that the Actor's main coroutine, which runs the Scrapy spider, can be executed.
2527

26-
- **Request and response handling** - Scrapy provides an easy-to-use interface for making HTTP requests and handling responses,
27-
allowing you to navigate through web pages and extract data.
28-
- **Robust Spider framework** - Scrapy has a spider framework that allows you to define how to scrape data from websites,
29-
including how to follow links, how to handle pagination, and how to parse the data.
30-
- **Built-in data extraction** - Scrapy includes built-in support for data extraction using XPath and CSS selectors,
31-
allowing you to easily extract data from HTML and XML documents.
32-
- **Integration with other tool** - Scrapy can be integrated with other Python tools like BeautifulSoup and Selenium for more advanced scraping tasks.
28+
<CodeBlock className="language-python" title="main.py: The Actor main coroutine">
29+
{MainExample}
30+
</CodeBlock>
3331

34-
## Using Scrapy template
32+
Within the Actor's main coroutine, the Actor's input is processed as usual. The function `apify.scrapy.apply_apify_settings` is then used to configure Scrapy settings with Apify-specific components before the spider is executed. The key components and other helper functions are described in the next section.
3533

36-
The fastest way to start using Scrapy in Apify Actors is by leveraging the [Scrapy Actor template](https://apify.com/templates/categories/python). This template provides a pre-configured structure and setup necessary to integrate Scrapy into your Actors seamlessly. It includes: setting up the Scrapy settings, `asyncio` reactor, Actor logger, and item pipeline as necessary to make Scrapy spiders run in Actors and save their outputs in Apify datasets.
34+
## Key integration components
3735

38-
## Manual setup
36+
The Apify SDK provides several custom components to support integration with the Apify platform:
3937

40-
If you prefer not to use the template, you will need to manually configure several components to integrate Scrapy with the Apify SDK.
38+
- [`apify.scrapy.ApifyScheduler`](https://docs.apify.com/sdk/python/reference/class/ApifyScheduler) - Replaces Scrapy's default [scheduler](https://docs.scrapy.org/en/latest/topics/scheduler.html) with one that uses Apify's [request queue](https://docs.apify.com/platform/storage/request-queue) for storing requests. It manages enqueuing, dequeuing, and maintaining the state and priority of requests.
39+
- [`apify.scrapy.ActorDatasetPushPipeline`](https://docs.apify.com/sdk/python/reference/class/ActorDatasetPushPipeline) - A Scrapy [item pipeline](https://docs.scrapy.org/en/latest/topics/item-pipeline.html) that pushes scraped items to Apify's [dataset](https://docs.apify.com/platform/storage/dataset). When enabled, every item produced by the spider is sent to the dataset.
40+
- [`apify.scrapy.ApifyHttpProxyMiddleware`](https://docs.apify.com/sdk/python/reference/class/ApifyHttpProxyMiddleware) - A Scrapy [middleware](https://docs.scrapy.org/en/latest/topics/downloader-middleware.html) that manages proxy configurations. This middleware replaces Scrapy's default `HttpProxyMiddleware` to facilitate the use of Apify's proxy service.
4141

42-
### Event loop & reactor
42+
Additional helper functions in the [`apify.scrapy`](https://github.com/apify/apify-sdk-python/tree/master/src/apify/scrapy) subpackage include:
4343

44-
The Apify SDK is built on Python's asynchronous [`asyncio`](https://docs.python.org/3/library/asyncio.html) library, whereas Scrapy uses [`twisted`](https://twisted.org/) for its asynchronous operations. To make these two frameworks work together, you need to:
44+
- `apply_apify_settings` - Applies Apify-specific components to Scrapy settings.
45+
- `to_apify_request` and `to_scrapy_request` - Convert between Apify and Scrapy request objects.
46+
- `initialize_logging` - Configures logging for the Actor environment.
47+
- `run_scrapy_actor` - Bridges asyncio and Twisted event loops.
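
To make the role of these components more concrete, the sketch below shows roughly what Apify-specific overrides in a Scrapy settings dictionary might look like. It uses a plain `dict` instead of Scrapy's `Settings` class so that it runs without Scrapy installed. The function name `sketch_apify_overrides`, the priority values, and the dotted paths of the Apify components are assumptions for illustration. The real values are applied by `apply_apify_settings` and may differ.

```python
from copy import deepcopy
from typing import Any


def sketch_apify_overrides(settings: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of `settings` with illustrative Apify overrides applied.

    Hypothetical stand-in for illustration; not the SDK's implementation.
    """
    new = deepcopy(settings)

    # Swap the default scheduler for the request-queue-backed one (assumed path).
    new["SCHEDULER"] = "apify.scrapy.ApifyScheduler"

    # Push every scraped item to the dataset; a high number runs it late
    # in the pipeline chain (priority value is an assumption).
    pipelines = new.setdefault("ITEM_PIPELINES", {})
    pipelines["apify.scrapy.ActorDatasetPushPipeline"] = 1000

    # In Scrapy, mapping a middleware to None disables it. Here the default
    # proxy middleware is replaced by the Apify one (priority is an assumption).
    middlewares = new.setdefault("DOWNLOADER_MIDDLEWARES", {})
    middlewares["scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware"] = None
    middlewares["apify.scrapy.ApifyHttpProxyMiddleware"] = 750

    return new


# `myproject.pipelines.TitlePipeline` is a made-up user pipeline.
base = {"ITEM_PIPELINES": {"myproject.pipelines.TitlePipeline": 300}}
updated = sketch_apify_overrides(base)
```

Working on a copy keeps the project's original settings untouched, which makes it easy to compare the two dictionaries when debugging.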
4548

46-
- Set the [`AsyncioSelectorReactor`](https://docs.scrapy.org/en/latest/topics/asyncio.html#installing-the-asyncio-reactor) in Scrapy's project settings: This reactor is `twisted`'s implementation of the `asyncio` event loop, enabling compatibility between the two libraries.
47-
- Install [`nest_asyncio`](https://pypi.org/project/nest-asyncio/): The `nest_asyncio` package allows the asyncio event loop to run within an already running loop, which is essential for integration with the Apify SDK.
49+
## Create a new Apify-Scrapy project
4850

49-
By making these adjustments, you can ensure collaboration between `twisted`-based Scrapy and the `asyncio`-based Apify SDK.
51+
The simplest way to start using Scrapy in Apify Actors is to use the [Scrapy Actor template](https://apify.com/templates/categories/python). The template provides a pre-configured project structure and setup that includes all necessary components to run Scrapy spiders as Actors and store their output in Apify datasets. If you prefer manual setup, refer to the example Actor section below for configuration details.
5052

51-
### Other components
53+
## Wrapping an existing Scrapy project
5254

53-
We also prepared other Scrapy components to work with Apify SDK, they are available in the [`apify/scrapy`](https://github.com/apify/apify-sdk-python/tree/master/src/apify/scrapy) sub-package. These components include:
55+
The Apify CLI supports converting an existing Scrapy project into an Apify Actor with a single command. The CLI expects the project to follow the standard Scrapy layout (including a `scrapy.cfg` file in the project root). During the wrapping process, the CLI:
5456

55-
- `ApifyScheduler`: A Scrapy scheduler that uses the Apify Request Queue to manage requests.
56-
- `ApifyHttpProxyMiddleware`: A Scrapy middleware for working with Apify proxies.
57-
- `ActorDatasetPushPipeline`: A Scrapy item pipeline that pushes scraped items into the Apify dataset.
57+
- Creates the necessary files and directories for an Apify Actor.
58+
- Installs the Apify SDK and required dependencies.
59+
- Updates Scrapy settings to include Apify-specific components.
5860

59-
The module contains other helper functions, like `apply_apify_settings` for applying these components to Scrapy settings, and `to_apify_request` and `to_scrapy_request` for converting between Apify and Scrapy request objects.
61+
For further details, see the [Scrapy migration guide](https://docs.apify.com/cli/docs/integrating-scrapy).
6062

6163
## Example Actor
6264

63-
Here is an example of a Scrapy Actor that scrapes the titles of web pages and enqueues all links found on each page. This example is identical to the one provided in the Apify Actor templates.
65+
The following example demonstrates a Scrapy Actor that scrapes page titles and enqueues links found on each page. This example aligns with the structure provided in the Apify Actor templates.
6466

6567
<Tabs>
6668
<TabItem value="__main__.py" label="__main__.py">
@@ -93,3 +95,11 @@ Here is an example of a Scrapy Actor that scrapes the titles of web pages and en
9395
## Conclusion
9496

9597
In this guide, you learned how to use Scrapy in Apify Actors. You can now start building your own web scraping projects using Scrapy and the Apify SDK, and host them on the Apify platform. See the [Actor templates](https://apify.com/templates/categories/python) to get started with your own scraping tasks. If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/apify-sdk-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping!
98+
99+
## Additional resources
100+
101+
- [Apify CLI: Integrating Scrapy projects](https://docs.apify.com/cli/docs/integrating-scrapy)
102+
- [Apify: Run Scrapy spiders on Apify](https://apify.com/run-scrapy-in-cloud)
103+
- [Apify templates: Python Actor Scrapy template](https://apify.com/templates/python-scrapy)
104+
- [Apify store: Scrapy Books Example Actor](https://apify.com/vdusek/scrapy-books-example)
105+
- [Scrapy: Official documentation](https://docs.scrapy.org/)
