Merged
Changes from all commits
48 commits
e0ac909
Remove Python 3.9.
wRAR Nov 9, 2025
c268d5b
Do not rely on Scrapy providing create_instance
AdrianAtZyte Jan 9, 2026
817a8a3
Silence mypy issue
AdrianAtZyte Jan 9, 2026
d1f6b74
WIP
AdrianAtZyte Jan 9, 2026
d334b27
WIP
AdrianAtZyte Jan 9, 2026
871615d
Update default fallback download handler expectations for Scrapy 2.14+
AdrianAtZyte Jan 12, 2026
5c94151
Remove unneeded @deferred_f_from_coro_f
AdrianAtZyte Jan 12, 2026
6861a2e
Update download_request() output type expectations based on the Scrap…
AdrianAtZyte Jan 12, 2026
dde5300
Address deprecation warnings
AdrianAtZyte Jan 12, 2026
1e174ff
Merge remote-tracking branch 'origin/remove-py39' into scrapy-2.14
AdrianAtZyte Jan 12, 2026
04cffcc
Make download_request() compatible with Scrapy 2.0-2.14
AdrianAtZyte Jan 12, 2026
7e20b4d
Add backward compatibility to download_request() overrides in tests
AdrianAtZyte Jan 12, 2026
e9846c8
Wrap crawler.crawl() with maybe_deferred_to_future()
AdrianAtZyte Jan 12, 2026
ef64eec
Fix download_request calls in tests to omit spiders in Scrapy 2.14+
AdrianAtZyte Jan 12, 2026
75a5ad3
Fix if typo
AdrianAtZyte Jan 12, 2026
6bb97e7
Add remaining maybe_deferred_to_future
AdrianAtZyte Jan 12, 2026
663defb
Use maybe_deferred_to_future() for crawler.stop()
AdrianAtZyte Jan 12, 2026
04c4b21
Move deferred_to_future to utils
AdrianAtZyte Jan 12, 2026
9126793
Support Python 3.14
AdrianAtZyte Jan 12, 2026
f409531
Fix tests not passing with Scrapy 2.0
AdrianAtZyte Jan 12, 2026
2d0a055
Revert "Support Python 3.14"
AdrianAtZyte Jan 12, 2026
d5cb518
WIP
AdrianAtZyte Jan 12, 2026
263dfbd
WIP
AdrianAtZyte Jan 12, 2026
6b6ff62
WIP
AdrianAtZyte Jan 13, 2026
a89af8f
WIP
AdrianAtZyte Jan 13, 2026
51de625
WIP
AdrianAtZyte Jan 13, 2026
45ac1df
Silence mypy issues
AdrianAtZyte Jan 13, 2026
a8999df
WIP
AdrianAtZyte Jan 13, 2026
16927bc
Address warnings
AdrianAtZyte Jan 13, 2026
15270cb
Address warnings
AdrianAtZyte Jan 13, 2026
4ad2a7d
Keep slot_request() backward-compatible
AdrianAtZyte Jan 13, 2026
462004e
Changes to maintain backward compatibility
AdrianAtZyte Jan 13, 2026
a88c5ec
Add missing decorator
AdrianAtZyte Jan 13, 2026
be5fb60
Add missing maybe_deferred_to_future
AdrianAtZyte Jan 13, 2026
36bae3e
Add missing parent class
AdrianAtZyte Jan 13, 2026
e1bf6d2
Simplify test code
AdrianAtZyte Jan 13, 2026
efb9b3b
Silence scrapy-poet warnings
AdrianAtZyte Jan 13, 2026
ea8474a
Address Twisted warning
AdrianAtZyte Jan 13, 2026
dec48b8
Remove undefined type
AdrianAtZyte Jan 13, 2026
694c1a1
Silence mypy issues
AdrianAtZyte Jan 13, 2026
eb8684c
Use caplog.clear() in test that gets leaked logs from other tests
AdrianAtZyte Jan 14, 2026
0ad0684
Use a Deferred-based close() for lower Scrapy versions
AdrianAtZyte Jan 14, 2026
698ecf1
Silence mypy
AdrianAtZyte Jan 14, 2026
cb2d418
Disable the telnet console during tests
AdrianAtZyte Jan 14, 2026
2c93743
Clean up
AdrianAtZyte Jan 14, 2026
e857eef
Address feedback and make test_higher_concurrency more resilient
AdrianAtZyte Jan 15, 2026
17f32c3
Fix issues
AdrianAtZyte Jan 15, 2026
f9b93ac
Address feedback
AdrianAtZyte Jan 19, 2026
18 changes: 9 additions & 9 deletions .github/workflows/test.yml
@@ -17,23 +17,23 @@ jobs:
       fail-fast: false
       matrix:
         include:
-          - python-version: '3.9'
+          - python-version: '3.10'
             toxenv: min-scrapy-2x0
-          - python-version: '3.9'
+          - python-version: '3.10'
             toxenv: min-scrapy-2x1
-          - python-version: '3.9'
+          - python-version: '3.10'
             toxenv: min-scrapy-2x3
-          - python-version: '3.9'
+          - python-version: '3.10'
             toxenv: min-scrapy-2x4
-          - python-version: '3.9'
+          - python-version: '3.10'
             toxenv: min-scrapy-2x5
-          - python-version: '3.9'
+          - python-version: '3.10'
             toxenv: min-scrapy-2x6
-          - python-version: '3.9'
+          - python-version: '3.10'
             toxenv: min-scrapy-2x7
-          - python-version: '3.9'
+          - python-version: '3.10'
             toxenv: min-extra
-          - python-version: '3.9'
+          - python-version: '3.10'
             toxenv: min-provider
           - python-version: '3.10'
             toxenv: min-x402
9 changes: 3 additions & 6 deletions docs/setup.rst
@@ -18,14 +18,12 @@ You need at least:
 - A :ref:`Zyte API <zyte-api>` subscription (there’s a :ref:`free trial
   <zapi-trial>`).
 
-- Python 3.9+
+- Python 3.10+
 
 - Scrapy 2.0.1+
 
 :doc:`scrapy-poet <scrapy-poet:index>` integration requires Scrapy 2.6+.
 
-:ref:`x402 support <x402>` requires Python 3.10+.
-
 
 .. _install:
 
@@ -38,14 +36,13 @@ For a basic installation:
 
     pip install scrapy-zyte-api
 
-For :ref:`scrapy-poet integration <scrapy-poet>`:
+For :ref:`scrapy-poet integration <scrapy-poet>`, install the ``provider`` extra:
 
 .. code-block:: shell
 
     pip install scrapy-zyte-api[provider]
 
-For :ref:`x402 support <x402>`, make sure you have Python 3.10+ and install
-the ``x402`` extra:
+For :ref:`x402 support <x402>`, install the ``x402`` extra:
 
 .. code-block:: shell
 
2 changes: 1 addition & 1 deletion docs/usage/automap.rst
@@ -19,7 +19,7 @@ For example:
     class SampleQuotesSpider(scrapy.Spider):
         name = "sample_quotes"
 
-        def start_requests(self):
+        async def start(self):
             yield scrapy.Request(
                 url="https://quotes.toscrape.com/",
                 meta={
4 changes: 2 additions & 2 deletions docs/usage/manual.rst
@@ -22,7 +22,7 @@ For example:
     class SampleQuotesSpider(scrapy.Spider):
         name = "sample_quotes"
 
-        def start_requests(self):
+        async def start(self):
             yield scrapy.Request(
                 url="https://quotes.toscrape.com/",
                 meta={
@@ -48,7 +48,7 @@ remember to also request :http:`request:httpResponseHeaders`:
     class SampleQuotesSpider(scrapy.Spider):
         name = "sample_quotes"
 
-        def start_requests(self):
+        async def start(self):
             yield scrapy.Request(
                 url="https://quotes.toscrape.com/",
                 meta={
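Note on the two documentation changes above: Scrapy 2.13 introduced an asynchronous start() method and deprecated start_requests(), which is why both examples now read "async def start(self)". As a rough sketch (not code from this PR), a spider that has to run on both older and newer Scrapy versions can keep both entry points, with the new one delegating to the old:

    import scrapy


    class SampleQuotesSpider(scrapy.Spider):
        name = "sample_quotes"

        # Entry point used by Scrapy 2.13+.
        async def start(self):
            for request in self.start_requests():
                yield request

        # Entry point used by Scrapy < 2.13; kept as the single source of truth.
        def start_requests(self):
            yield scrapy.Request(
                url="https://quotes.toscrape.com/",
                # zyte_api_automap mirrors the automap example in these docs.
                meta={"zyte_api_automap": True},
            )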
13 changes: 10 additions & 3 deletions pyproject.toml
@@ -16,13 +16,12 @@ classifiers = [
     "Operating System :: OS Independent",
     "Programming Language :: Python",
     "Programming Language :: Python :: 3",
-    "Programming Language :: Python :: 3.9",
     "Programming Language :: Python :: 3.10",
     "Programming Language :: Python :: 3.11",
     "Programming Language :: Python :: 3.12",
     "Programming Language :: Python :: 3.13",
 ]
-requires-python = ">=3.9"
+requires-python = ">=3.10"
 # Sync with [pinned] @ tox.ini
 dependencies = [
     "packaging>=20.0",
@@ -120,5 +119,13 @@ testpaths = [
 ]
 minversion = "6.0"
 filterwarnings = [
-    "ignore::DeprecationWarning:twisted.web.http",
+    "ignore::DeprecationWarning:twisted\\.web\\.http",
+    "ignore::DeprecationWarning:scrapy\\.core\\.downloader\\.contextfactory", # https://github.com/scrapy/scrapy/issues/3288
+
+    # scrapy-poet warnings for Scrapy 2.14:
+    "ignore:CollectorPipeline\\.:scrapy.exceptions.ScrapyDeprecationWarning",
+    "ignore:DownloaderStatsMiddleware\\.:scrapy.exceptions.ScrapyDeprecationWarning",
+    "ignore:.*?InjectionMiddleware\\.:scrapy.exceptions.ScrapyDeprecationWarning",
+    "ignore:RetryMiddleware\\.process_spider_exception\\(\\):scrapy.exceptions.ScrapyDeprecationWarning",
+    "ignore::scrapy.exceptions.ScrapyDeprecationWarning:scrapy_poet",
 ]
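For context on the filterwarnings entries above: each one follows Python's warning-filter syntax, action:message:category:module, where the message and module parts are regular expressions. That is why this change escapes the dots in twisted.web.http, since an unescaped "." would match any character. A minimal illustration of the same mechanism with the standard warnings module (not code from this PR):

    import warnings

    # Ignore DeprecationWarning coming from modules whose name starts with
    # "twisted.web.http"; the module argument is matched as a regex.
    warnings.filterwarnings(
        "ignore",
        category=DeprecationWarning,
        module=r"twisted\.web\.http",
    )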
121 changes: 75 additions & 46 deletions scrapy_zyte_api/_middlewares.py
@@ -1,12 +1,20 @@
 from logging import getLogger
 from typing import cast
 from warnings import warn
 
-from scrapy import Request
-from scrapy.exceptions import IgnoreRequest
+from scrapy import Request, Spider
+from scrapy.exceptions import IgnoreRequest, ScrapyDeprecationWarning
+from scrapy.utils.python import global_object_name
 from zyte_api import RequestError
 
 from ._params import _ParamParser
-from .utils import _AUTOTHROTTLE_DONT_ADJUST_DELAY_SUPPORT
+from .utils import (
+    _AUTOTHROTTLE_DONT_ADJUST_DELAY_SUPPORT,
+    _GET_SLOT_NEEDS_SPIDER,
+    _LOG_DEFERRED_IS_DEPRECATED,
+    _close_spider,
+    _schedule_coro,
+    maybe_deferred_to_future,
+)
 
 logger = getLogger(__name__)
 _start_requests_processed = object()
@@ -27,7 +35,19 @@ def __init__(self, crawler):
             not crawler.settings.getbool("AUTOTHROTTLE_ENABLED"),
         )
 
-    def slot_request(self, request, spider, force=False):
+    def slot_request(
+        self, request: Request, spider: Spider | None = None, force: bool = False
+    ):
+        if spider is not None:
+            warn(
+                f"Passing a 'spider' argument to "
+                f"{global_object_name(self.__class__)}.slot_request() is "
+                f"deprecated and the argument will be removed in a future "
+                f"scrapy-zyte-api version.",
+                category=ScrapyDeprecationWarning,
+                stacklevel=2,
+            )
+
         if not force and self._param_parser.parse(request) is None:
             return
 
@@ -38,12 +58,13 @@ def slot_request(self, request, spider, force=False):
         try:
             slot_id = downloader.get_slot_key(request)
         except AttributeError:  # Scrapy < 2.12
-            slot_id = downloader._get_slot_key(request, spider)
+            slot_id = downloader._get_slot_key(request, self._crawler.spider)
         if not isinstance(slot_id, str) or not slot_id.startswith(self._slot_prefix):
             slot_id = f"{self._slot_prefix}{slot_id}"
         request.meta["download_slot"] = slot_id
         if not self._preserve_delay:
-            _, slot = downloader._get_slot(request, spider)
+            args = (self._crawler.spider,) if _GET_SLOT_NEEDS_SPIDER else ()
+            _, slot = downloader._get_slot(request, *args)
             slot.delay = 0
 
 
@@ -65,6 +86,7 @@ def __init__(self, crawler) -> None:
         crawler.signals.connect(
             self._start_requests_processed, signal=_start_requests_processed
         )
+        self._crawler = crawler
 
     def _get_spm_mw(self):
         spm_mw_classes = []
@@ -89,15 +111,15 @@ def _get_spm_mw(self):
                 return middleware
         return None
 
-    def _check_spm_conflict(self, spider):
+    def _check_spm_conflict(self):
         checked = getattr(self, "_checked_spm_conflict", False)
         if checked:
             return
         self._checked_spm_conflict = True
         settings = self._crawler.settings
         in_transparent_mode = settings.getbool("ZYTE_API_TRANSPARENT_MODE", False)
         spm_mw = self._get_spm_mw()
-        spm_is_enabled = spm_mw and spm_mw.is_enabled(spider)
+        spm_is_enabled = spm_mw and spm_mw.is_enabled(self._crawler.spider)
         if not in_transparent_mode or not spm_is_enabled:
             return
         logger.error(
@@ -114,35 +136,31 @@ def _check_spm_conflict(self, spider):
             "request.meta to set dont_proxy to True and zyte_api_automap "
             "either to True or to a dictionary of extra request fields."
         )
-        from twisted.internet import reactor
-        from twisted.internet.interfaces import IReactorCore
-
-        reactor = cast(IReactorCore, reactor)
-        reactor.callLater(
-            0, self._crawler.engine.close_spider, spider, "plugin_conflict"
-        )
+        _close_spider(self._crawler, "plugin_conflict")
 
     def _start_requests_processed(self, count):
         self._total_start_request_count = count
         self._maybe_close()
 
-    def process_request(self, request, spider):
-        self._check_spm_conflict(spider)
+    def process_request(self, request: Request, spider: Spider | None = None):
+        self._check_spm_conflict()
 
         if self._param_parser.parse(request) is None:
             return
 
         self._request_count += 1
         if self._max_requests and self._request_count > self._max_requests:
-            self._crawler.engine.close_spider(spider, "closespider_max_zapi_requests")
+            _close_spider(self._crawler, "closespider_max_zapi_requests")
             raise IgnoreRequest(
                 f"The request {request} is skipped as {self._max_requests} max "
                 f"Zyte API requests have been reached."
             )
 
-        self.slot_request(request, spider, force=True)
+        self.slot_request(request, force=True)
 
-    def process_exception(self, request, exception, spider):
+    def process_exception(
+        self, request: Request, exception: Exception, spider: Spider | None = None
+    ):
         if (
             not request.meta.get("is_start_request")
             or not isinstance(exception, RequestError)
@@ -162,60 +180,69 @@ def _maybe_close(self):
                 "Stopping the spider, all start requests failed because they "
                 "were pointing to a domain forbidden by Zyte API."
             )
-            self._crawler.engine.close_spider(
-                self._crawler.spider, "failed_forbidden_domain"
-            )
+            _close_spider(self._crawler, "failed_forbidden_domain")
 
 
 class ScrapyZyteAPISpiderMiddleware(_BaseMiddleware):
     def __init__(self, crawler):
         super().__init__(crawler)
-        self._send_signal = crawler.signals.send_catch_log
+        if _LOG_DEFERRED_IS_DEPRECATED:
+            self._send_signal = crawler.signals.send_catch_log_async
+        else:
+
+            async def _send_signal(signal, **kwargs):
+                await maybe_deferred_to_future(
+                    crawler.signals.send_catch_log_deferred(signal, **kwargs)
+                )
+
+            self._send_signal = _send_signal
 
     @staticmethod
     def _get_header_set(request):
         return {header.strip().lower() for header in request.headers}
 
-    async def process_start(self, start):
+    async def process_start(self, start, spider: Spider | None = None):
         # Mark start requests and reports to the downloader middleware the
         # number of them once all have been processed.
         count = 0
         async for item_or_request in start:
             if isinstance(item_or_request, Request):
                 count += 1
                 item_or_request.meta["is_start_request"] = True
-                self._process_output_request(item_or_request, None)
+                self._process_output_request(item_or_request)
             yield item_or_request
-        self._send_signal(_start_requests_processed, count=count)
+        await self._send_signal(_start_requests_processed, count=count)
 
-    def process_start_requests(self, start_requests, spider):
+    def process_start_requests(self, start_requests, spider: Spider):
         count = 0
         for item_or_request in start_requests:
             if isinstance(item_or_request, Request):
                 count += 1
                 item_or_request.meta["is_start_request"] = True
-                self._process_output_request(item_or_request, spider)
+                self._process_output_request(item_or_request)
             yield item_or_request
-        self._send_signal(_start_requests_processed, count=count)
+        _schedule_coro(self._send_signal(_start_requests_processed, count=count))
 
-    def _process_output_request(self, request, spider):
+    def _process_output_request(self, request: Request):
         if "_pre_mw_headers" not in request.meta:
             request.meta["_pre_mw_headers"] = self._get_header_set(request)
-        self.slot_request(request, spider)
+        self.slot_request(request)
 
-    def _process_output_item_or_request(self, item_or_request, spider):
+    def _process_output_item_or_request(self, item_or_request):
         if not isinstance(item_or_request, Request):
             return
-        self._process_output_request(item_or_request, spider)
+        self._process_output_request(item_or_request)
 
-    def process_spider_output(self, response, result, spider):
+    def process_spider_output(self, response, result, spider: Spider | None = None):
         for item_or_request in result:
-            self._process_output_item_or_request(item_or_request, spider)
+            self._process_output_item_or_request(item_or_request)
             yield item_or_request
 
-    async def process_spider_output_async(self, response, result, spider):
+    async def process_spider_output_async(
+        self, response, result, spider: Spider | None = None
+    ):
         async for item_or_request in result:
-            self._process_output_item_or_request(item_or_request, spider)
+            self._process_output_item_or_request(item_or_request)
             yield item_or_request
 
 
@@ -230,22 +257,24 @@ def __init__(self, crawler):
         )
         self._param_parser = _ParamParser(crawler, cookies_enabled=False)
 
-    def process_spider_output(self, response, result, spider):
+    def process_spider_output(self, response, result, spider: Spider | None = None):
         for item_or_request in result:
-            self._process_output_item_or_request(item_or_request, spider)
+            self._process_output_item_or_request(item_or_request)
             yield item_or_request
 
-    async def process_spider_output_async(self, response, result, spider):
+    async def process_spider_output_async(
+        self, response, result, spider: Spider | None = None
+    ):
         async for item_or_request in result:
-            self._process_output_item_or_request(item_or_request, spider)
+            self._process_output_item_or_request(item_or_request)
             yield item_or_request
 
-    def _process_output_item_or_request(self, item_or_request, spider):
+    def _process_output_item_or_request(self, item_or_request):
         if not isinstance(item_or_request, Request):
             return
-        self._process_output_request(item_or_request, spider)
+        self._process_output_request(item_or_request)
 
-    def _process_output_request(self, request, spider):
+    def _process_output_request(self, request: Request):
         if self._is_zyte_api_request(request):
             request.meta.setdefault("referrer_policy", self._default_policy)
 
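The _GET_SLOT_NEEDS_SPIDER and _LOG_DEFERRED_IS_DEPRECATED flags imported above come from scrapy_zyte_api.utils, which this diff does not show. A plausible shape for such version-gated flags, given only as an illustration (the actual implementation and version thresholds may differ), is a one-time comparison against the installed Scrapy version:

    from packaging.version import Version
    from scrapy import __version__ as SCRAPY_VERSION

    _SCRAPY = Version(SCRAPY_VERSION)

    # Assumed boundary: newer Scrapy resolves the slot without a spider
    # argument, so only older versions need one passed to _get_slot().
    _GET_SLOT_NEEDS_SPIDER = _SCRAPY < Version("2.14")

    # Assumed boundary: Scrapy 2.14 deprecates send_catch_log_deferred()
    # in favour of send_catch_log_async().
    _LOG_DEFERRED_IS_DEPRECATED = _SCRAPY >= Version("2.14")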