|
2 | 2 | scrapy-zyte-smartproxy |version| documentation |
3 | 3 | ============================================== |
4 | 4 |
|
5 | | -scrapy-zyte-smartproxy is a `Scrapy downloader middleware`_ to interact with |
6 | | -`Zyte Smart Proxy Manager`_ (formerly Crawlera) automatically. |
| 5 | +.. toctree:: |
| 6 | + :hidden: |
| 7 | + |
| 8 | + headers |
| 9 | + settings |
| 10 | + news |
| 11 | + |
| 12 | +scrapy-zyte-smartproxy is a `Scrapy downloader middleware`_ to use one of |
| 13 | +Zyte’s proxy APIs: either the proxy API of `Zyte API`_ or `Zyte Smart Proxy |
| 14 | +Manager`_ (formerly Crawlera). |
7 | 15 |
|
8 | 16 | .. _Scrapy downloader middleware: https://doc.scrapy.org/en/latest/topics/downloader-middleware.html |
| 17 | +.. _Zyte API: https://docs.zyte.com/zyte-api/get-started.html |
9 | 18 | .. _Zyte Smart Proxy Manager: https://www.zyte.com/smart-proxy-manager/ |
10 | 19 |
|
11 | 20 | Configuration |
12 | 21 | ============= |
13 | 22 |
|
14 | | -.. toctree:: |
15 | | - :caption: Configuration |
| 23 | +#. Add the downloader middleware to your ``DOWNLOADER_MIDDLEWARES`` Scrapy |
| 24 | + setting: |
16 | 25 |
|
| 26 | + .. code-block:: python |
| 27 | + :caption: settings.py |
17 | 28 |
|
18 | | -* Add the Zyte Smart Proxy Manager middleware including it into the ``DOWNLOADER_MIDDLEWARES`` in your ``settings.py`` file:: |
| 29 | + DOWNLOADER_MIDDLEWARES = { |
| 30 | + ... |
| 31 | + 'scrapy_zyte_smartproxy.ZyteSmartProxyMiddleware': 610 |
| 32 | + } |
19 | 33 |
|
20 | | - DOWNLOADER_MIDDLEWARES = { |
21 | | - ... |
22 | | - 'scrapy_zyte_smartproxy.ZyteSmartProxyMiddleware': 610 |
23 | | - } |
| 34 | +#. Enable the middleware and configure your API key, either through Scrapy |
| 35 | + settings: |
24 | 36 |
|
25 | | -* Then there are two ways to enable it |
| 37 | + .. code-block:: python |
| 38 | + :caption: settings.py |
26 | 39 |
|
27 | | - * Through ``settings.py``:: |
| 40 | + ZYTE_SMARTPROXY_ENABLED = True |
| 41 | + ZYTE_SMARTPROXY_APIKEY = 'apikey' |
28 | 42 |
|
29 | | - ZYTE_SMARTPROXY_ENABLED = True |
30 | | - ZYTE_SMARTPROXY_APIKEY = 'apikey' |
| 43 | + Or through spider attributes: |
31 | 44 |
|
32 | | - * Through spider attributes:: |
| 45 | + .. code-block:: python |
33 | 46 |
|
34 | | - class MySpider: |
35 | | - zyte_smartproxy_enabled = True |
36 | | - zyte_smartproxy_apikey = 'apikey' |
| 47 | + class MySpider(scrapy.Spider): |
| 48 | + zyte_smartproxy_enabled = True |
| 49 | + zyte_smartproxy_apikey = 'apikey' |
37 | 50 |
|
| 51 | +.. _ZYTE_SMARTPROXY_URL: |
38 | 52 |
|
39 | | -* (optional) If you are not using the default Zyte Smart Proxy Manager proxy (``http://proxy.zyte.com:8011``), |
40 | | - for example if you have a dedicated or private instance, |
41 | | - make sure to also set ``ZYTE_SMARTPROXY_URL`` in ``settings.py``, e.g.:: |
| 53 | +#. Set the ``ZYTE_SMARTPROXY_URL`` Scrapy setting as needed: |
42 | 54 |
|
43 | | - ZYTE_SMARTPROXY_URL = 'http://myinstance.zyte.com:8011' |
| 55 | + - To use the proxy API of Zyte API, set it to |
| 56 | + ``http://api.zyte.com:8011``: |
44 | 57 |
|
45 | | -How to use it |
46 | | -============= |
| 58 | + .. code-block:: python |
| 59 | + :caption: settings.py |
47 | 60 |
|
48 | | -.. toctree:: |
49 | | - :caption: How to use it |
50 | | - :hidden: |
| 61 | + ZYTE_SMARTPROXY_URL = "http://api.zyte.com:8011" |
51 | 62 |
|
52 | | - settings |
| 63 | + - To use the default Zyte Smart Proxy Manager endpoint, leave it unset. |
53 | 64 |
|
54 | | -:doc:`settings` |
55 | | - All configurable Scrapy Settings added by the Middleware. |
| 65 | + - To use a custom Zyte Smart Proxy Manager endpoint, such as a
| 66 | + dedicated or private instance, set it to the URL of that endpoint. For
| 67 | + example:
56 | 68 |
|
| 69 | + .. code-block:: python |
| 70 | + :caption: settings.py |
57 | 71 |
|
58 | | -With the middleware, the usage of Zyte Smart Proxy Manager is automatic, every request will go through Zyte Smart Proxy Manager without nothing to worry about. |
59 | | -If you want to *disable* Zyte Smart Proxy Manager on a specific Request, you can do so by updating `meta` with `dont_proxy=True`:: |
| 72 | + ZYTE_SMARTPROXY_URL = "http://myinstance.zyte.com:8011" |
60 | 73 |
|
61 | 74 |
|
62 | | - scrapy.Request( |
63 | | - 'http://example.com', |
64 | | - meta={ |
65 | | - 'dont_proxy': True, |
66 | | - ... |
67 | | - }, |
68 | | - ) |
| 75 | +Usage |
| 76 | +===== |
69 | 77 |
|
| 78 | +Once the downloader middleware is properly configured, every request goes |
| 79 | +through the configured Zyte proxy API. |
70 | 80 |
|
71 | | -Remember that you are now making requests to Zyte Smart Proxy Manager, and the Zyte Smart Proxy Manager service will be the one actually making the requests to the different sites. |
| 81 | +.. _override: |
72 | 82 |
|
73 | | -If you need to specify special `Zyte Smart Proxy Manager headers <https://docs.zyte.com/smart-proxy-manager.html#request-headers>`_, just apply them as normal `Scrapy headers <https://doc.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Request.headers>`_. |
| 83 | +Although the plugin configuration only allows defining a single proxy API |
| 84 | +endpoint and API key, it is possible to override them for specific requests, so |
| 85 | +that you can use different combinations for different requests within the same |
| 86 | +spider. |
74 | 87 |
|
75 | | -Here we have an example of specifying a Zyte Smart Proxy Manager header into a Scrapy request:: |
| 88 | +To **override** which combination of endpoint and API key is used for a given |
| 89 | +request, set ``proxy`` in the request metadata to a URL indicating both the |
| 90 | +target endpoint and the API key to use. For example: |
76 | 91 |
|
77 | | - scrapy.Request( |
78 | | - 'http://example.com', |
79 | | - headers={ |
80 | | - 'X-Crawlera-Max-Retries': 1, |
81 | | - ... |
82 | | - }, |
83 | | - ) |
| 92 | + .. code-block:: python |
84 | 93 |
|
85 | | -Remember that you could also set which headers to use by default by all |
86 | | -requests with `DEFAULT_REQUEST_HEADERS <http://doc.scrapy.org/en/1.0/topics/settings.html#default-request-headers>`_ |
| 94 | + scrapy.Request( |
| 95 | + "https://topscrape.com", |
| 96 | + meta={ |
| 97 | + "proxy": "http://YOUR_API_KEY@api.zyte.com:8011",
| 98 | + ... |
| 99 | + }, |
| 100 | + ) |
87 | 101 |
|
88 | | -.. note:: Zyte Smart Proxy Manager headers are removed from requests when the middleware is activated but Zyte Smart Proxy Manager |
89 | | - is disabled. For example, if you accidentally disable Zyte Smart Proxy Manager via ``zyte_smartproxy_enabled = False`` |
90 | | - but keep sending ``X-Crawlera-*`` headers in your requests, those will be removed from the |
91 | | - request headers. |
| 102 | +.. TODO: Check that a colon after the API key is not needed in this case. |
92 | 103 |
|
93 | | -This Middleware also adds some configurable Scrapy Settings, check :ref:`the complete list here <settings>`. |
| 104 | +To **disable** proxying altogether for a given request, set ``dont_proxy`` to |
| 105 | +``True`` on the request metadata: |
94 | 106 |
|
95 | | -All the rest |
96 | | -============ |
| 107 | + .. code-block:: python |
97 | 108 |
|
98 | | -.. toctree:: |
99 | | - :caption: All the rest |
100 | | - :hidden: |
| 109 | + scrapy.Request( |
| 110 | + "https://topscrape.com", |
| 111 | + meta={ |
| 112 | + "dont_proxy": True, |
| 113 | + ... |
| 114 | + }, |
| 115 | + ) |
101 | 116 |
|
102 | | - news |
| 117 | +You can set `Zyte API proxy headers`_ or `Zyte Smart Proxy Manager headers`_ as |
| 118 | +regular `Scrapy headers`_, e.g. using the ``headers`` parameter of ``Request`` |
| 119 | +or using the DEFAULT_REQUEST_HEADERS_ setting. For example: |
| 120 | + |
| 121 | + .. code-block:: python |
| 122 | +
|
| 123 | + scrapy.Request( |
| 124 | + "https://topscrape.com", |
| 125 | + headers={ |
| 126 | + "Zyte-Geolocation": "FR", |
| 127 | + ... |
| 128 | + }, |
| 129 | + ) |
| 130 | +
|
| 131 | +.. _Zyte API proxy headers: https://docs.zyte.com/zyte-api/usage/proxy-api.html |
| 132 | +.. _Zyte Smart Proxy Manager headers: https://docs.zyte.com/smart-proxy-manager.html#request-headers |
| 133 | +.. _Scrapy headers: https://doc.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Request.headers |
| 134 | +.. _DEFAULT_REQUEST_HEADERS: https://doc.scrapy.org/en/latest/topics/settings.html#default-request-headers |
| 135 | + |
| 136 | +For information about proxy-specific header processing, see :doc:`headers`. |
103 | 137 |
|
104 | | -:doc:`news` |
105 | | - See what has changed in recent scrapy-zyte-smartproxy versions. |
| 138 | +See also :ref:`settings` for the complete list of settings that this downloader |
| 139 | +middleware supports. |
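
The per-request override described in the Usage section of this diff needs a proxy URL that carries both the endpoint and the API key. A minimal sketch of building such a URL follows. It assumes two hypothetical helpers, ``build_proxy_url`` and ``proxy_meta``, which are not part of scrapy-zyte-smartproxy, and a placeholder key. Whether a colon is required after the API key is left open by the TODO in the diff, so the sketch omits it.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def build_proxy_url(endpoint: str, api_key: str) -> str:
    """Return ``endpoint`` with ``api_key`` embedded as the URL username.

    Hypothetical helper for illustration; not part of scrapy-zyte-smartproxy.
    """
    parts = urlsplit(endpoint)
    if not parts.hostname:
        raise ValueError(f"endpoint has no host: {endpoint!r}")
    # Percent-encode the key so characters such as "@" or ":" keep the URL well-formed.
    netloc = f"{quote(api_key, safe='')}@{parts.hostname}"
    if parts.port is not None:
        netloc += f":{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


def proxy_meta(endpoint: str, api_key: str) -> dict:
    """Build a request ``meta`` dict that overrides the proxy for one request."""
    return {"proxy": build_proxy_url(endpoint, api_key)}
```

Under these assumptions, a request could then be created with ``scrapy.Request("https://topscrape.com", meta=proxy_meta("http://api.zyte.com:8011", "YOUR_API_KEY"))``, which keeps endpoint and key pairs in one place when a spider mixes several of them.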