scrapy-zyte-api

Requirements

Python 3.7+
Scrapy

Installation

pip install scrapy-zyte-api

This package requires Python 3.7+.

Configuration

Replace the default http and https handlers in Scrapy's DOWNLOAD_HANDLERS setting in the settings.py of your Scrapy project.

You also need to set ZYTE_API_KEY.

Lastly, make sure to install the asyncio-based Twisted reactor in the settings.py file as well.

Here's an example of what a Scrapy project's settings.py file needs:

DOWNLOAD_HANDLERS = {
    "http": "scrapy_zyte_api.handler.ScrapyZyteAPIDownloadHandler",
    "https": "scrapy_zyte_api.handler.ScrapyZyteAPIDownloadHandler"
}

# Setting ZYTE_API_KEY as an environment variable instead would also work.
ZYTE_API_KEY = "<your API key>"

TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
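
If you prefer to keep the key out of settings.py, one option is to read it from the environment yourself and fail early when it is missing. The following is a minimal sketch, not part of scrapy-zyte-api: the helper name read_api_key is hypothetical, and it only assumes that the key lives in an environment variable named ZYTE_API_KEY, as mentioned above.

```python
import os
from typing import Mapping


# Hypothetical helper for illustration; not part of scrapy-zyte-api.
def read_api_key(environ: Mapping[str, str] = os.environ) -> str:
    """Return the API key from the environment, failing early if it is unset."""
    key = environ.get("ZYTE_API_KEY", "").strip()
    if not key:
        raise RuntimeError("ZYTE_API_KEY is not set in the environment")
    return key


# In settings.py you could then write:
# ZYTE_API_KEY = read_api_key()
```

Failing at settings load time surfaces a missing key before any request is scheduled.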

Usage

To send every request through Zyte API, set the following in the settings.py file or through any other Scrapy settings mechanism:

ZYTE_API_DEFAULT_PARAMS = {
    "browserHtml": True,
    "geolocation": "US",
}

You can see the full list of parameters in the Zyte API Specification.

Alternatively, you can control this on a per-request basis by setting the zyte_api key in Request.meta. Parameters set this way override those in the ZYTE_API_DEFAULT_PARAMS setting.

import scrapy


class SampleQuotesSpider(scrapy.Spider):
    name = "sample_quotes"

    def start_requests(self):
        yield scrapy.Request(
            url="https://quotes.toscrape.com/",
            callback=self.parse,
            meta={
                "zyte_api": {
                    "browserHtml": True,
                    "geolocation": "US",  # You can set any Geolocation region you want.
                    "javascript": True,
                    "echoData": {"some_value_I_could_track": 123},
                }
            },
        )

    def parse(self, response):
        yield {"URL": response.url, "status": response.status, "HTML": response.body}

        print(response.zyte_api)
        # {
        #     'url': 'https://quotes.toscrape.com/',
        #     'browserHtml': '<html> ... </html>',
        #     'echoData': {'some_value_I_could_track': 123},
        # }

        print(response.request.meta)
        # {
        #     'zyte_api': {
        #         'browserHtml': True,
        #         'geolocation': 'US',
        #         'javascript': True,
        #         'echoData': {'some_value_I_could_track': 123}
        #     },
        #     'download_timeout': 180.0,
        #     'download_slot': 'quotes.toscrape.com'
        # }
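
To make the relationship between ZYTE_API_DEFAULT_PARAMS and Request.meta concrete, here is a small sketch of one plausible reading of "override": per-request values are merged key by key over the defaults. This is an assumption for illustration only; the helper effective_params is hypothetical and is not the library's code, and the "DE" geolocation value is just sample data.

```python
from typing import Any, Dict, Optional


# Hypothetical helper for illustration; not part of scrapy-zyte-api.
def effective_params(
    default_params: Optional[Dict[str, Any]],
    request_meta: Dict[str, Any],
) -> Dict[str, Any]:
    """Merge per-request zyte_api parameters over the project-wide defaults."""
    merged = dict(default_params or {})  # copy so the defaults stay untouched
    merged.update(request_meta.get("zyte_api") or {})
    return merged


defaults = {"browserHtml": True, "geolocation": "US"}
meta = {"zyte_api": {"geolocation": "DE", "javascript": True}}

params = effective_params(defaults, meta)
print(params)
```

Under this assumption, a request keeps browserHtml from the defaults, replaces geolocation with its own value, and adds javascript.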

The raw Zyte Data API response is available through the zyte_api attribute of the response object. These responses are instances of ZyteAPIResponse or ZyteAPITextResponse, which subclass scrapy.http.Response and scrapy.http.TextResponse respectively, so that they can hold the raw Zyte Data API response.
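
As an illustration of working with that raw response, the sketch below assumes it is a dictionary shaped like the printed example above (url, browserHtml, echoData keys). The helper name summarize_raw_response is hypothetical and only uses plain dictionaries, so it runs without Scrapy installed.

```python
from typing import Any, Dict


# Hypothetical helper for illustration; not part of scrapy-zyte-api.
def summarize_raw_response(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Pick a few fields out of a raw response dictionary."""
    html = raw.get("browserHtml") or ""
    return {
        "url": raw.get("url"),
        "html_length": len(html),
        "echo_data": raw.get("echoData"),
    }


# Sample data mirroring the shape of the printed example.
sample = {
    "url": "https://quotes.toscrape.com/",
    "browserHtml": "<html></html>",
    "echoData": {"some_value_I_could_track": 123},
}

summary = summarize_raw_response(sample)
print(summary)
```

Inside a spider callback, the same helper could be called as summarize_raw_response(response.zyte_api).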
