Commit 5541784

Support the upcoming proxy API of Zyte API (#108)
1 parent 3761acb commit 5541784

File tree

8 files changed

+512
-205
lines changed

.github/workflows/main.yml

Lines changed: 16 additions & 10 deletions
@@ -6,21 +6,27 @@ jobs:
     strategy:
       matrix:
         include:
-        - python-version: 2.7
+        - python-version: 3.8
           env:
-            TOXENV: py27,stack-scrapy-1.4,stack-scrapy-1.5
-        - python-version: 3.5
+            TOXENV: min
+        - python-version: 3.8
           env:
-            TOXENV: py35,stack-scrapy-1.8-py3,stack-scrapy-2.0-py3,stack-scrapy-2.1-py3,stack-scrapy-2.2-py3,stack-scrapy-2.3-py3
-        - python-version: 3.6
+            TOXENV: py
+        - python-version: 3.9
           env:
-            TOXENV: py36,stack-scrapy-2.2-py3,stack-scrapy-2.3-py3,stack-scrapy-2.4-py3
-        - python-version: 3.7
+            TOXENV: py
+        - python-version: "3.10"
           env:
-            TOXENV: py37,stack-scrapy-2.2-py3,stack-scrapy-2.3-py3,stack-scrapy-2.4-py3
-        - python-version: 3.8
+            TOXENV: py
+        - python-version: "3.11"
+          env:
+            TOXENV: py
+        - python-version: "3.11"
+          env:
+            TOXENV: security
+        - python-version: "3.11"
           env:
-            TOXENV: py38,security,docs,stack-scrapy-2.2-py3,stack-scrapy-2.3-py3,stack-scrapy-2.4-py3
+            TOXENV: docs
     steps:
     - uses: actions/checkout@v2
     - name: Set up Python ${{ matrix.python-version }}

docs/headers.rst

Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
+Headers
+=======
+
+The Zyte proxy API services that you can use with this downloader middleware
+each support a different set of HTTP request and response headers that give
+you access to additional features. You can find more information about those
+headers in the documentation of each service, `Zyte API’s <zyte-api-headers>`_
+and `Zyte Smart Proxy Manager’s <spm-headers>`_.
+
+.. _zyte-api-headers: https://docs.zyte.com/zyte-api/usage/proxy-api.html
+.. _spm-headers: https://docs.zyte.com/smart-proxy-manager.html#request-headers
+
+If you try to use a header for one service while using the other service, this
+downloader middleware will try to translate your header into the right header
+for the target service and, regardless of whether or not translation was done,
+the original header will be dropped.
+
+Also, response headers that can be translated will be always translated,
+without dropping the original header, so code expecting a response header from
+one service can work even if a different service was used.
+
+Translation is supported for the following headers:
+
+========================= ===========================
+Zyte API                  Zyte Smart Proxy Manager
+========================= ===========================
+``Zyte-Client``           ``X-Crawlera-Client``
+``Zyte-Device``           ``X-Crawlera-Profile``
+``Zyte-Error``            ``X-Crawlera-Error``
+``Zyte-Geolocation``      ``X-Crawlera-Region``
+``Zyte-JobId``            ``X-Crawlera-JobId``
+``Zyte-Override-Headers`` ``X-Crawlera-Profile-Pass``
+========================= ===========================
+
+Also, if a request is not being proxied and includes a header for any of these
+services, it will be dropped, to prevent leaking data to external websites.
+This downloader middleware assumes that a header prefixed with ``Zyte-`` is a
+Zyte API header, and that a header prefixed with ``X-Crawlera-`` is a Zyte
+Smart Proxy Manager header, even if they are not known headers otherwise.
+
+When dropping a header, be it as part of header translation or to avoid leaking
+data, a warning message with details will be logged.
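
Reviewer note: the request-header rules documented in docs/headers.rst can be sketched in a few lines of plain Python. This is an illustration of the documented behavior only, not the middleware's actual code. The function names, the `"zyte-api"`/`"spm"` target labels, copying header values unchanged, and letting a native header win over a translated one are all assumptions made for the example; only the name pairs come from the table in the diff.

```python
# Illustrative sketch only; names and structure are hypothetical.
ZYTE_API_TO_SPM = {
    "Zyte-Client": "X-Crawlera-Client",
    "Zyte-Device": "X-Crawlera-Profile",
    "Zyte-Error": "X-Crawlera-Error",
    "Zyte-Geolocation": "X-Crawlera-Region",
    "Zyte-JobId": "X-Crawlera-JobId",
    "Zyte-Override-Headers": "X-Crawlera-Profile-Pass",
}
SPM_TO_ZYTE_API = {spm: zyte for zyte, spm in ZYTE_API_TO_SPM.items()}


def translate_request_headers(headers, target):
    """Return (new_headers, dropped_names) for a proxied request.

    target is "zyte-api" or "spm" (labels assumed for this sketch).
    Headers that belong to the other service are always dropped and,
    when a translation exists, re-added under the target service's name.
    """
    if target == "zyte-api":
        foreign_prefix, mapping = "x-crawlera-", SPM_TO_ZYTE_API
    else:
        foreign_prefix, mapping = "zyte-", ZYTE_API_TO_SPM
    lowered = {name.lower(): new_name for name, new_name in mapping.items()}
    result, dropped = {}, []
    for name, value in headers.items():
        if name.lower().startswith(foreign_prefix):
            dropped.append(name)  # the middleware would log a warning here
            translated = lowered.get(name.lower())
            if translated is not None:
                # Assumption: a header already set for the target service wins.
                result.setdefault(translated, value)
        else:
            result[name] = value
    return result, dropped


def strip_proxy_headers(headers):
    """Drop every Zyte-* and X-Crawlera-* header from an unproxied request."""
    prefixes = ("zyte-", "x-crawlera-")
    return {k: v for k, v in headers.items() if not k.lower().startswith(prefixes)}
```

For example, translating `{"X-Crawlera-Region": "FR", "X-Crawlera-Max-Retries": "1"}` for the Zyte API target keeps only `Zyte-Geolocation: FR`, because `X-Crawlera-Max-Retries` has no entry in the table and is dropped without translation.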

docs/index.rst

Lines changed: 97 additions & 63 deletions
@@ -2,104 +2,138 @@
 scrapy-zyte-smartproxy |version| documentation
 ==============================================
 
-scrapy-zyte-smartproxy is a `Scrapy downloader middleware`_ to interact with
-`Zyte Smart Proxy Manager`_ (formerly Crawlera) automatically.
+.. toctree::
+   :hidden:
+
+   headers
+   settings
+   news
+
+scrapy-zyte-smartproxy is a `Scrapy downloader middleware`_ to use one of
+Zyte’s proxy APIs: either the proxy API of `Zyte API`_ or `Zyte Smart Proxy
+Manager`_ (formerly Crawlera).
 
 .. _Scrapy downloader middleware: https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
+.. _Zyte API: https://docs.zyte.com/zyte-api/get-started.html
 .. _Zyte Smart Proxy Manager: https://www.zyte.com/smart-proxy-manager/
 
 Configuration
 =============
 
-.. toctree::
-   :caption: Configuration
+#. Add the downloader middleware to your ``DOWNLOADER_MIDDLEWARES`` Scrapy
+   setting:
 
+   .. code-block:: python
+      :caption: settings.py
 
-* Add the Zyte Smart Proxy Manager middleware including it into the ``DOWNLOADER_MIDDLEWARES`` in your ``settings.py`` file::
+      DOWNLOADER_MIDDLEWARES = {
+          ...
+          'scrapy_zyte_smartproxy.ZyteSmartProxyMiddleware': 610
+      }
 
-    DOWNLOADER_MIDDLEWARES = {
-        ...
-        'scrapy_zyte_smartproxy.ZyteSmartProxyMiddleware': 610
-    }
+#. Enable the middleware and configure your API key, either through Scrapy
+   settings:
 
-* Then there are two ways to enable it
+   .. code-block:: python
+      :caption: settings.py
 
-  * Through ``settings.py``::
+      ZYTE_SMARTPROXY_ENABLED = True
+      ZYTE_SMARTPROXY_APIKEY = 'apikey'
 
-      ZYTE_SMARTPROXY_ENABLED = True
-      ZYTE_SMARTPROXY_APIKEY = 'apikey'
+   Or through spider attributes:
 
-  * Through spider attributes::
+   .. code-block:: python
 
-      class MySpider:
-          zyte_smartproxy_enabled = True
-          zyte_smartproxy_apikey = 'apikey'
+      class MySpider(scrapy.Spider):
+          zyte_smartproxy_enabled = True
+          zyte_smartproxy_apikey = 'apikey'
 
+   .. _ZYTE_SMARTPROXY_URL:
 
-* (optional) If you are not using the default Zyte Smart Proxy Manager proxy (``http://proxy.zyte.com:8011``),
-  for example if you have a dedicated or private instance,
-  make sure to also set ``ZYTE_SMARTPROXY_URL`` in ``settings.py``, e.g.::
+#. Set the ``ZYTE_SMARTPROXY_URL`` Scrapy setting as needed:
 
-    ZYTE_SMARTPROXY_URL = 'http://myinstance.zyte.com:8011'
+   - To use the proxy API of Zyte API, set it to
+     ``http://api.zyte.com:8011``:
 
-How to use it
-=============
+     .. code-block:: python
+        :caption: settings.py
 
-.. toctree::
-   :caption: How to use it
-   :hidden:
+        ZYTE_SMARTPROXY_URL = "http://api.zyte.com:8011"
 
-   settings
+   - To use the default Zyte Smart Proxy Manager endpoint, leave it unset.
 
-:doc:`settings`
-    All configurable Scrapy Settings added by the Middleware.
+   - To use a custom Zyte Smart Proxy Manager endpoint, in case you have a
+     dedicated or private instance, set it to your custom endpoint. For
+     example:
 
+     .. code-block:: python
+        :caption: settings.py
 
-With the middleware, the usage of Zyte Smart Proxy Manager is automatic, every request will go through Zyte Smart Proxy Manager without nothing to worry about.
-If you want to *disable* Zyte Smart Proxy Manager on a specific Request, you can do so by updating `meta` with `dont_proxy=True`::
+        ZYTE_SMARTPROXY_URL = "http://myinstance.zyte.com:8011"
 
 
-    scrapy.Request(
-        'http://example.com',
-        meta={
-            'dont_proxy': True,
-            ...
-        },
-    )
+Usage
+=====
 
+Once the downloader middleware is properly configured, every request goes
+through the configured Zyte proxy API.
 
-Remember that you are now making requests to Zyte Smart Proxy Manager, and the Zyte Smart Proxy Manager service will be the one actually making the requests to the different sites.
+.. _override:
 
-If you need to specify special `Zyte Smart Proxy Manager headers <https://docs.zyte.com/smart-proxy-manager.html#request-headers>`_, just apply them as normal `Scrapy headers <https://doc.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Request.headers>`_.
+Although the plugin configuration only allows defining a single proxy API
+endpoint and API key, it is possible to override them for specific requests, so
+that you can use different combinations for different requests within the same
+spider.
 
-Here we have an example of specifying a Zyte Smart Proxy Manager header into a Scrapy request::
+To **override** which combination of endpoint and API key is used for a given
+request, set ``proxy`` in the request metadata to a URL indicating both the
+target endpoint and the API key to use. For example:
 
-    scrapy.Request(
-        'http://example.com',
-        headers={
-            'X-Crawlera-Max-Retries': 1,
-            ...
-        },
-    )
+.. code-block:: python
 
-Remember that you could also set which headers to use by default by all
-requests with `DEFAULT_REQUEST_HEADERS <http://doc.scrapy.org/en/1.0/topics/settings.html#default-request-headers>`_
+    scrapy.Request(
+        "https://topscrape.com",
+        meta={
+            "proxy": "http://[email protected]:8011",
+            ...
+        },
+    )
 
-.. note:: Zyte Smart Proxy Manager headers are removed from requests when the middleware is activated but Zyte Smart Proxy Manager
-          is disabled. For example, if you accidentally disable Zyte Smart Proxy Manager via ``zyte_smartproxy_enabled = False``
-          but keep sending ``X-Crawlera-*`` headers in your requests, those will be removed from the
-          request headers.
+.. TODO: Check that a colon after the API key is not needed in this case.
 
-This Middleware also adds some configurable Scrapy Settings, check :ref:`the complete list here <settings>`.
+To **disable** proxying altogether for a given request, set ``dont_proxy`` to
+``True`` on the request metadata:
 
-All the rest
-============
+.. code-block:: python
 
-.. toctree::
-   :caption: All the rest
-   :hidden:
+    scrapy.Request(
+        "https://topscrape.com",
+        meta={
+            "dont_proxy": True,
+            ...
+        },
+    )
 
-   news
+You can set `Zyte API proxy headers`_ or `Zyte Smart Proxy Manager headers`_ as
+regular `Scrapy headers`_, e.g. using the ``headers`` parameter of ``Request``
+or using the DEFAULT_REQUEST_HEADERS_ setting. For example:
+
+.. code-block:: python
+
+    scrapy.Request(
+        "https://topscrape.com",
+        headers={
+            "Zyte-Geolocation": "FR",
+            ...
+        },
+    )
+
+.. _Zyte API proxy headers: https://docs.zyte.com/zyte-api/usage/proxy-api.html
+.. _Zyte Smart Proxy Manager headers: https://docs.zyte.com/smart-proxy-manager.html#request-headers
+.. _Scrapy headers: https://doc.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Request.headers
+.. _DEFAULT_REQUEST_HEADERS: https://doc.scrapy.org/en/latest/topics/settings.html#default-request-headers
+
+For information about proxy-specific header processing, see :doc:`headers`.
 
-:doc:`news`
-    See what has changed in recent scrapy-zyte-smartproxy versions.
+See also :ref:`settings` for the complete list of settings that this downloader
+middleware supports.
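
Reviewer note: the per-request override in docs/index.rst expects a `proxy` URL that carries both the endpoint and the API key. A minimal sketch of building such a URL with the standard library follows. `proxy_url_with_key` is a hypothetical helper, not part of scrapy-zyte-smartproxy, and it puts the key in the user part of the URL with no trailing colon, which the TODO in the diff leaves unverified.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def proxy_url_with_key(endpoint, api_key):
    """Embed api_key as the user part of a proxy endpoint URL.

    Hypothetical helper for illustration. The key is percent-encoded so
    that characters such as ":" or "@" cannot break the URL structure.
    """
    parts = urlsplit(endpoint)
    if parts.hostname is None:
        raise ValueError(f"endpoint has no host: {endpoint!r}")
    host = parts.hostname
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    netloc = f"{quote(api_key, safe='')}@{host}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


# "apikey" is a placeholder, as in the settings examples of the diff.
print(proxy_url_with_key("http://api.zyte.com:8011", "apikey"))
```

The resulting string would then be passed as `meta={"proxy": ...}` on the request, as described in the Usage section of the diff.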

docs/settings.rst

Lines changed: 23 additions & 14 deletions
@@ -3,53 +3,62 @@ Settings
 ========
 
 This Scrapy downloader middleware adds some settings to configure how to work
-with Zyte Smart Proxy Manager.
+with your Zyte proxy API.
 
 ZYTE_SMARTPROXY_APIKEY
 ----------------------
 
 Default: ``None``
 
-Unique Zyte Smart Proxy Manager API key provided for authentication.
+Default API key for your Zyte proxy API service.
+
+Note that Zyte API and Zyte Smart Proxy Manager have different API keys.
+
+You can :ref:`override this value on specific requests <override>`.
+
 
 ZYTE_SMARTPROXY_URL
 -------------------
 
 Default: ``'http://proxy.zyte.com:8011'``
 
-Zyte Smart Proxy Manager instance URL, it varies depending on adquiring a private or dedicated instance. If Zyte Smart Proxy Manager didn't provide
-you with a private instance URL, you don't need to specify it.
+Default endpoint for your Zyte proxy API service.
+
+For guidelines on setting a value, see the :ref:`initial configuration
+instructions <ZYTE_SMARTPROXY_URL>`.
+
+You can :ref:`override this value on specific requests <override>`.
 
 ZYTE_SMARTPROXY_MAXBANS
 -----------------------
 
 Default: ``400``
 
-Number of consecutive bans from Zyte Smart Proxy Manager necessary to stop the spider.
+Number of consecutive bans necessary to stop the spider.
 
 ZYTE_SMARTPROXY_DOWNLOAD_TIMEOUT
 --------------------------------
 
 Default: ``190``
 
-Timeout for processing Zyte Smart Proxy Manager requests. It overrides Scrapy's ``DOWNLOAD_TIMEOUT``.
+Timeout for processing proxied requests. It overrides Scrapy's ``DOWNLOAD_TIMEOUT``.
 
 ZYTE_SMARTPROXY_PRESERVE_DELAY
 ------------------------------
 
 Default: ``False``
 
-If ``False`` Sets Scrapy's ``DOWNLOAD_DELAY`` to ``0``, making the spider to crawl faster. If set to ``True``, it will
+If ``False`` sets Scrapy's ``DOWNLOAD_DELAY`` to ``0``, making the spider to crawl faster. If set to ``True``, it will
 respect the provided ``DOWNLOAD_DELAY`` from Scrapy.
 
 ZYTE_SMARTPROXY_DEFAULT_HEADERS
 -------------------------------
 
 Default: ``{}``
 
-Default headers added only to Zyte Smart Proxy Manager requests. Headers defined on ``DEFAULT_REQUEST_HEADERS`` will take precedence as long as the ``ZyteSmartProxyMiddleware`` is placed after the ``DefaultHeadersMiddleware``. Headers set on the requests have precedence over the two settings.
+Default headers added only to proxied requests. Headers defined on ``DEFAULT_REQUEST_HEADERS`` will take precedence as long as the ``ZyteSmartProxyMiddleware`` is placed after the ``DefaultHeadersMiddleware``. Headers set on the requests have precedence over the two settings.
 
-* This is the default behavior, ``DefaultHeadersMiddleware`` default priority is ``400`` and we recommend ``ZyteSmartProxyMiddleware`` priority to be ``610``
+* This is the default behavior, ``DefaultHeadersMiddleware`` default priority is ``400`` and we recommend ``ZyteSmartProxyMiddleware`` priority to be ``610``.
 
 ZYTE_SMARTPROXY_BACKOFF_STEP
 ----------------------------
@@ -70,9 +79,9 @@ ZYTE_SMARTPROXY_FORCE_ENABLE_ON_HTTP_CODES
 
 Default: ``[]``
 
-List of HTTP response status codes that warrant enabling Zyte Smart Proxy Manager for the
-corresponding domain.
+List of HTTP response status codes that warrant enabling your Zyte proxy API
+service for the corresponding domain.
 
-When a response with one of these HTTP status codes is received after a request
-that did not go through Zyte Smart Proxy Manager, the request is retried with Zyte Smart Proxy Manager, and any
-new request to the same domain is also sent through Zyte Smart Proxy Manager.
+When a response with one of these HTTP status codes is received after an
+unproxied request, the request is retried with your Zyte proxy API service, and
+any new request to the same domain is also proxied.
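
Reviewer note: the behavior described for ``ZYTE_SMARTPROXY_FORCE_ENABLE_ON_HTTP_CODES`` amounts to remembering which domains have triggered one of the listed status codes. The sketch below illustrates that bookkeeping under stated assumptions: `ForceEnableTracker` and its methods are hypothetical names, and treating the URL host name as "the domain" is a guess, since the diff does not say how the middleware derives it.

```python
from urllib.parse import urlsplit


class ForceEnableTracker:
    """Hypothetical model of the force-enable rule, for illustration only."""

    def __init__(self, force_codes):
        self.force_codes = set(force_codes)
        self.enabled_domains = set()

    def should_proxy(self, url):
        """True once a listed status code has been seen for this URL's host."""
        return urlsplit(url).hostname in self.enabled_domains

    def on_response(self, url, status, was_proxied):
        """Record the response; return True if the request should be retried
        through the proxy."""
        if was_proxied or status not in self.force_codes:
            return False
        self.enabled_domains.add(urlsplit(url).hostname)
        return True


tracker = ForceEnableTracker([403, 503])
if tracker.on_response("https://example.com/a", 403, was_proxied=False):
    # From here on, every request to example.com would be proxied.
    print(tracker.should_proxy("https://example.com/b"))
```

With the default empty list, `on_response` never returns True, so no unproxied request is ever retried through the proxy.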
