Initial setup

Learn how to get scrapy-zyte-api installed and configured on an existing :doc:`Scrapy <scrapy:index>` project.

Tip

:ref:`Zyte’s web scraping tutorial <zyte:tutorial>` covers scrapy-zyte-api setup as well.

Requirements

You need at least:

:doc:`scrapy-poet <scrapy-poet:index>` integration requires Scrapy 2.6+.

Installation

For a basic installation:

pip install scrapy-zyte-api

For :ref:`scrapy-poet integration <scrapy-poet>`, install the provider extra:

pip install scrapy-zyte-api[provider]

For :ref:`x402 support <x402>`, install the x402 extra:

pip install scrapy-zyte-api[x402]

Note that you can install multiple extras:

pip install scrapy-zyte-api[provider,x402]

Configuration

To configure scrapy-zyte-api, :ref:`set up authentication <auth>` and either :ref:`enable the add-on <config-addon>` (Scrapy ≥ 2.10) or :ref:`configure all components separately <config-components>`.

Authentication

Sign up for a Zyte API account, copy your API key and do either of the following:

  • Define an environment variable named ZYTE_API_KEY with your API key:

    • On Windows’ CMD:

      > set ZYTE_API_KEY=YOUR_API_KEY
    • On macOS and Linux:

      $ export ZYTE_API_KEY=YOUR_API_KEY
  • Add your API key to your settings module:

    ZYTE_API_KEY = "YOUR_API_KEY"
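Either option above is enough on its own. If you would also like your project to fail early with a clear message when the environment variable is missing, a check along the following lines could run from a startup script. This is only a sketch: `require_api_key` is a hypothetical helper, not part of scrapy-zyte-api.

```python
import os


def require_api_key(environ=None):
    """Return the Zyte API key from the environment or raise a clear error.

    Hypothetical helper for illustration; not part of scrapy-zyte-api.
    """
    # Allow a plain dict to be passed in so the check is easy to test.
    environ = os.environ if environ is None else environ
    key = environ.get("ZYTE_API_KEY", "").strip()
    if not key:
        raise RuntimeError("ZYTE_API_KEY is not set or is empty")
    return key
```

Calling `require_api_key()` with no arguments reads the real process environment.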

To use x402 instead, see :ref:`x402`.

Enabling the add-on

If you are using Scrapy 2.10 or higher, you can set up scrapy-zyte-api integration using the following :ref:`add-on <topics-addons>` with any priority:

ADDONS = {
    "scrapy_zyte_api.Addon": 500,
}

Note

The add-on enables :ref:`transparent mode <transparent>` by default.
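If the same settings module must work across several Scrapy versions, you may want to decide at import time whether the add-on is available. The sketch below compares version numbers as integers, because a plain string comparison would wrongly rank "2.9" above "2.10". `supports_addon` is a hypothetical helper introduced here for illustration.

```python
import re


def supports_addon(scrapy_version):
    """Return True if the given Scrapy version string is 2.10 or higher.

    Hypothetical helper for illustration; not part of Scrapy or scrapy-zyte-api.
    """
    match = re.match(r"(\d+)\.(\d+)", scrapy_version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {scrapy_version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    # Tuple comparison treats 2.10 as newer than 2.9, unlike string comparison.
    return (major, minor) >= (2, 10)
```

In a real project you would typically pass `scrapy.__version__` to this function.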

Enabling all components separately

If :ref:`enabling the add-on <config-addon>` is not an option, you can set up scrapy-zyte-api integration as follows:

DOWNLOAD_HANDLERS = {
    "http": "scrapy_zyte_api.ScrapyZyteAPIDownloadHandler",
    "https": "scrapy_zyte_api.ScrapyZyteAPIDownloadHandler",
}
DOWNLOADER_MIDDLEWARES = {
    "scrapy_zyte_api.ScrapyZyteAPIDownloaderMiddleware": 633,
}
SPIDER_MIDDLEWARES = {
    "scrapy_zyte_api.ScrapyZyteAPISpiderMiddleware": 100,
    "scrapy_zyte_api.ScrapyZyteAPIRefererSpiderMiddleware": 1000,
}
REQUEST_FINGERPRINTER_CLASS = "scrapy_zyte_api.ScrapyZyteAPIRequestFingerprinter"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

With these settings alone, scrapy-zyte-api does not change how your spider behaves. To make your spider use Zyte API for all requests, also set:

ZYTE_API_TRANSPARENT_MODE = True

For :ref:`scrapy-poet integration <scrapy-poet>`, :ref:`configure scrapy-poet <scrapy-poet:setup>` first, and then add the following provider to the SCRAPY_POET_PROVIDERS setting:

SCRAPY_POET_PROVIDERS = {
    "scrapy_zyte_api.providers.ZyteApiProvider": 1100,
}

If you already had a custom value for :setting:`REQUEST_FINGERPRINTER_CLASS <scrapy:REQUEST_FINGERPRINTER_CLASS>`, set that value on :setting:`ZYTE_API_FALLBACK_REQUEST_FINGERPRINTER_CLASS` instead:

ZYTE_API_FALLBACK_REQUEST_FINGERPRINTER_CLASS = "myproject.CustomRequestFingerprinter"
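The class behind that import path is an ordinary Scrapy request fingerprinter, that is, a class with a `fingerprint()` method that takes a request and returns bytes. As a rough illustration, a minimal one might look like the sketch below. The class body is an assumption made up for this example (it identifies requests by method and URL only); your own fingerprinter will likely differ.

```python
import hashlib


class CustomRequestFingerprinter:
    """Hypothetical fingerprinter that identifies requests by method and URL."""

    def fingerprint(self, request):
        # Scrapy calls fingerprint() with a request and expects bytes back.
        data = f"{request.method} {request.url}".encode("utf-8")
        return hashlib.sha1(data).digest()
```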

For :ref:`session management support <session>`, add the following downloader middleware to the :setting:`DOWNLOADER_MIDDLEWARES <scrapy:DOWNLOADER_MIDDLEWARES>` setting:

DOWNLOADER_MIDDLEWARES = {
    # Add this entry alongside your existing ones; do not replace them.
    "scrapy_zyte_api.ScrapyZyteAPISessionDownloaderMiddleware": 667,
}
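If your settings module already defines DOWNLOADER_MIDDLEWARES, assigning the dictionary a second time would discard the earlier entries. One way to avoid that is to merge the session middleware into the existing dictionary, as in this sketch. `with_session_middleware` is a hypothetical helper, and the starting dictionary simply repeats the entry from the manual configuration above.

```python
SESSION_MIDDLEWARE = "scrapy_zyte_api.ScrapyZyteAPISessionDownloaderMiddleware"


def with_session_middleware(existing):
    """Return a copy of `existing` with the session middleware added.

    Hypothetical helper for illustration; an entry that is already
    present keeps its current priority.
    """
    merged = dict(existing)
    merged.setdefault(SESSION_MIDDLEWARE, 667)
    return merged


DOWNLOADER_MIDDLEWARES = with_session_middleware(
    {"scrapy_zyte_api.ScrapyZyteAPIDownloaderMiddleware": 633}
)
```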

Changing reactors may require code changes

If your :setting:`TWISTED_REACTOR <scrapy:TWISTED_REACTOR>` setting was not set to "twisted.internet.asyncioreactor.AsyncioSelectorReactor" before, you will be changing the Twisted reactor that your Scrapy project uses, and your existing code may need changes, such as:

  • :ref:`asyncio-preinstalled-reactor`.

    Some Twisted imports install the default, non-asyncio Twisted reactor as a side effect. Once a reactor is installed, it cannot be changed for the rest of the process's run time.

  • :ref:`asyncio-await-dfd`.

    Note that you might be using Deferreds without realizing it through some Scrapy functions and methods. For example, when you yield the return value of self.crawler.engine.download() from a spider callback, you are yielding a Deferred.
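To track down the first kind of problem, it can help to check, at a chosen point during startup, whether `twisted.internet.reactor` has already been imported, since that module only appears in `sys.modules` once a reactor has been installed. The sketch below uses only the standard library. `reactor_already_imported` is a hypothetical helper, and it reports only that some reactor is installed, not which one.

```python
import sys


def reactor_already_imported(modules=None):
    """Return True if twisted.internet.reactor is present in the module table.

    Hypothetical helper for illustration. It does not tell you which
    reactor was installed, only that one has been.
    """
    # Allow a plain dict to be passed in so the check is easy to test.
    modules = sys.modules if modules is None else modules
    return "twisted.internet.reactor" in modules
```

Calling it right after your own imports, before Scrapy starts, narrows down which import triggered the installation.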

x402

It is possible to use :ref:`Zyte API <zyte-api>` without a Zyte API account by using the x402 protocol to handle payments:

  1. Read the Zyte Terms of Service. By using Zyte API, you are accepting them.
  2. During :ref:`installation <install>`, make sure to install the x402 extra.
  3. :ref:`Configure <eth-key>` the private key of your Ethereum account to authorize payments.

Configuring your Ethereum private key

It is recommended to configure your Ethereum private key through an environment variable, so that it also works when you use :doc:`python-zyte-api <python-zyte-api:index>`:

  • On Windows’ CMD:

    > set ZYTE_API_ETH_KEY=YOUR_ETH_PRIVATE_KEY
  • On macOS and Linux:

    $ export ZYTE_API_ETH_KEY=YOUR_ETH_PRIVATE_KEY

Alternatively, you can add your Ethereum private key to the settings module:

ZYTE_API_ETH_KEY = "YOUR_ETH_PRIVATE_KEY"
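An Ethereum private key is 32 bytes long, usually written as 64 hexadecimal characters with an optional `0x` prefix. A quick shape check like the one sketched below can catch copy-and-paste mistakes before a crawl starts. `looks_like_eth_private_key` is a hypothetical helper; it checks only the textual form and says nothing about whether the key belongs to a funded account.

```python
import re

# 64 hexadecimal characters, optionally prefixed with "0x".
_ETH_KEY_RE = re.compile(r"(0x)?[0-9a-fA-F]{64}")


def looks_like_eth_private_key(value):
    """Return True if `value` has the textual shape of an Ethereum private key.

    Hypothetical helper for illustration; not part of scrapy-zyte-api.
    """
    return _ETH_KEY_RE.fullmatch(value.strip()) is not None
```

Avoid logging or printing the key itself when reporting a failed check.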