Session management

Zyte API provides powerful session APIs:

When using scrapy-zyte-api, you can use these session APIs through the corresponding Zyte API fields (:http:`request:session`, :http:`request:sessionContext`).

However, scrapy-zyte-api also provides its own session management API, similar to that of :ref:`server-managed sessions <zapi-session-contexts>`, but built on top of :ref:`client-managed sessions <zapi-session-id>`.

scrapy-zyte-api session management offers some advantages over :ref:`server-managed sessions <zapi-session-contexts>`:

However, scrapy-zyte-api session management is not a replacement for :ref:`server-managed sessions <zapi-session-contexts>` or :ref:`client-managed sessions <zapi-session-id>`:

Enabling session management

To enable session management for all requests, set :setting:`ZYTE_API_SESSION_ENABLED` to True. You can also toggle session management on or off for specific requests using the :reqmeta:`zyte_api_session_enabled` request metadata key, or override the :meth:`~scrapy_zyte_api.SessionConfig.enabled` method of a :ref:`session config override <session-configs>`.
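
As a minimal sketch, the setting and the request metadata key described above could be combined as follows. The ``session_meta`` helper is a hypothetical name used only for illustration; in a spider you would pass the resulting dict as the ``meta`` argument of a request.

```python
# settings.py sketch: enable session management for all requests.
ZYTE_API_SESSION_ENABLED = True


def session_meta(enabled: bool) -> dict:
    """Build request metadata that toggles session management for one request.

    Hypothetical helper for illustration; only the metadata key is documented.
    """
    return {"zyte_api_session_enabled": enabled}


# Example: opt a single request out of session management.
meta = session_meta(False)
```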

By default, scrapy-zyte-api will maintain up to 8 sessions per domain, each initialized with a :ref:`browser request <zapi-browser>` targeting the URL of the first request that will use the session. Sessions are automatically rotated among requests, and refreshed as they expire or get banned. You can customize most of this logic through request metadata, settings and :ref:`session config overrides <session-configs>`.

For session management to work as expected, your :setting:`ZYTE_API_RETRY_POLICY` should not retry 520 and 521 responses:

Initializing sessions

To change the :ref:`default session initialization parameters <session-init-default>`, you have the following options:

Precedence, from higher to lower, is:

  1. :reqmeta:`zyte_api_session_params`
  2. :reqmeta:`zyte_api_session_location`
  3. :setting:`ZYTE_API_SESSION_PARAMS`
  4. :setting:`ZYTE_API_SESSION_LOCATION`
  5. :meth:`~scrapy_zyte_api.SessionConfig.location`
  6. :meth:`~scrapy_zyte_api.SessionConfig.params`
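
The precedence list above can be pictured as a first-match lookup. The following is a toy model, not the plugin's implementation: ``resolve_session_init`` and its arguments are hypothetical, and plain dicts stand in for request metadata, settings and the return values of session config methods.

```python
from typing import Any, Optional, Tuple


def resolve_session_init(
    meta: dict,
    settings: dict,
    config_location: Optional[dict] = None,
    config_params: Optional[dict] = None,
) -> Tuple[Optional[str], Any]:
    """Return (source, value) for the highest-precedence source that is set."""
    # Ordered from higher to lower precedence, mirroring the list above.
    candidates = [
        ("zyte_api_session_params", meta.get("zyte_api_session_params")),
        ("zyte_api_session_location", meta.get("zyte_api_session_location")),
        ("ZYTE_API_SESSION_PARAMS", settings.get("ZYTE_API_SESSION_PARAMS")),
        ("ZYTE_API_SESSION_LOCATION", settings.get("ZYTE_API_SESSION_LOCATION")),
        ("SessionConfig.location", config_location),
        ("SessionConfig.params", config_params),
    ]
    for source, value in candidates:
        if value is not None:
            return source, value
    return None, None


# A location set in request metadata wins over parameters set in settings.
source, value = resolve_session_init(
    meta={"zyte_api_session_location": {"postalCode": "10001"}},
    settings={"ZYTE_API_SESSION_PARAMS": {"browserHtml": True}},
)
```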

Checking sessions

Responses from a session can be checked for session validity. If a response does not pass a session validity check, the session is discarded, and the request is retried with a different session.

Session checking can be useful to work around scenarios where session initialization fails, e.g. due to rendering issues, IP-geolocation mismatches or A/B tests. It can also help in cases where website sessions expire before Zyte API sessions.

By default, if a location is defined through :reqmeta:`zyte_api_session_location`, :setting:`ZYTE_API_SESSION_LOCATION` or :meth:`~scrapy_zyte_api.SessionConfig.location`, the outcome of the first ``setLocation`` action used, if any, is checked. This happens even if the parameters used for session initialization actually come from :reqmeta:`zyte_api_session_params` or :setting:`ZYTE_API_SESSION_PARAMS`. If the action fails, the session is discarded. If the action is not even available for a given website, the spider is closed with ``unsupported_set_location`` as the close reason; in that case, you should define proper :ref:`session initialization logic <session-init>` for requests targeting that website.

For sessions initialized without a configured location, no session check is performed: sessions are assumed to be fine until they expire or are banned. This holds even if the session initialization parameters include a ``setLocation`` action.

To implement your own code to check session responses and determine whether their session should be kept or discarded, use the :setting:`ZYTE_API_SESSION_CHECKER` setting. If you need to check session validity for multiple websites, it is better to define a separate :ref:`session config override <session-configs>` for each website, each with its own implementation of :meth:`~scrapy_zyte_api.SessionConfig.check`.
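
As a rough sketch, a session checker could look like the class below. It assumes that the checker exposes a ``check(response, request)`` method returning ``True`` to keep the session and ``False`` to discard it; confirm the exact contract in the :setting:`ZYTE_API_SESSION_CHECKER` reference. The class name, the expected postal code and the import path in the comment are hypothetical, and ``SimpleNamespace`` objects stand in for Scrapy responses so that the logic can be run without Scrapy.

```python
from types import SimpleNamespace


class PostalCodeSessionChecker:
    """Hypothetical checker: keep a session only if the page shows the expected postal code."""

    expected_postal_code = "10001"

    def check(self, response, request) -> bool:
        # Assumed contract: True keeps the session, False discards it.
        return self.expected_postal_code in response.text


# settings.py (the import path is a placeholder):
# ZYTE_API_SESSION_CHECKER = "myproject.sessions.PostalCodeSessionChecker"

# Stand-in responses, used only to exercise the logic.
checker = PostalCodeSessionChecker()
good = SimpleNamespace(text="<span class='zip'>10001</span>")
bad = SimpleNamespace(text="<span class='zip'>94105</span>")
results = (checker.check(good, request=None), checker.check(bad, request=None))
```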

The :reqmeta:`zyte_api_session_location` and :reqmeta:`zyte_api_session_params` request metadata keys, if present in a request that :ref:`triggers a session initialization request <pool-size>`, will be copied into the session initialization request, so that they are available when :setting:`ZYTE_API_SESSION_CHECKER` or :meth:`~scrapy_zyte_api.SessionConfig.check` are called for a session initialization request.

If your session checking implementation relies on the response body (e.g. it uses CSS or XPath expressions), you should make sure that you are getting one, which might not be the case if you are mostly using :ref:`Zyte API automatic extraction <zapi-extract>`, e.g. when using :doc:`Zyte spider templates <zyte-spider-templates:index>`. For example, you can use :setting:`ZYTE_API_AUTOMAP_PARAMS` and :setting:`ZYTE_API_PROVIDER_PARAMS` to force :http:`request:browserHtml` or :http:`request:httpResponseBody` to be set on every Zyte API request:

.. code-block:: python

    ZYTE_API_AUTOMAP_PARAMS = {"browserHtml": True}
    ZYTE_API_PROVIDER_PARAMS = {"browserHtml": True}

Managing pools

scrapy-zyte-api can maintain multiple session pools.

By default, scrapy-zyte-api maintains a separate pool of sessions per domain.

If you use the :reqmeta:`zyte_api_session_params` or :reqmeta:`zyte_api_session_location` request metadata keys, scrapy-zyte-api will automatically use separate session pools within the target domain for those parameters or locations. See :meth:`~scrapy_zyte_api.SessionConfig.pool` for details.

If you want to customize further which pool is assigned to a given request, e.g. to have the same pool for multiple domains or use different pools within the same domain (e.g. for different URL patterns), you can either use the :reqmeta:`zyte_api_session_pool` request metadata key or use the :meth:`~scrapy_zyte_api.SessionConfig.pool` method of :ref:`session config overrides <session-configs>`.
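
For instance, assigning pools by URL pattern through request metadata might be sketched like this. The ``pool_meta`` helper and the pool IDs are hypothetical; only the :reqmeta:`zyte_api_session_pool` key comes from the text above.

```python
from urllib.parse import urlparse


def pool_meta(url: str) -> dict:
    """Pick a session pool ID based on the URL path (hypothetical pool names)."""
    path = urlparse(url).path
    if path.startswith("/search"):
        pool = "example.com/search"
    else:
        pool = "example.com/other"
    return {"zyte_api_session_pool": pool}


# Two URLs on the same domain end up in different pools.
search_meta = pool_meta("https://example.com/search?q=book")
product_meta = pool_meta("https://example.com/p/123")
```

In a spider, each dict would be passed as the ``meta`` argument of the corresponding request.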

The :setting:`ZYTE_API_SESSION_POOL_SIZE` setting determines the desired number of concurrent, active, working sessions per pool. The :setting:`ZYTE_API_SESSION_POOL_SIZES` setting allows defining different values for specific pools.
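
A settings sketch for pool sizes follows. It assumes that :setting:`ZYTE_API_SESSION_POOL_SIZES` maps pool IDs to sizes; the pool IDs and numbers are placeholders, and ``desired_pool_size`` is a hypothetical helper that only illustrates the fallback to the general setting.

```python
# settings.py sketch; pool IDs and sizes are placeholders.
ZYTE_API_SESSION_POOL_SIZE = 4  # desired size for pools not listed below
ZYTE_API_SESSION_POOL_SIZES = {
    "example.com": 16,
    "example.com/search": 2,
}


def desired_pool_size(pool: str) -> int:
    """Return the per-pool size if defined, else the general pool size."""
    return ZYTE_API_SESSION_POOL_SIZES.get(pool, ZYTE_API_SESSION_POOL_SIZE)
```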

The actual number of sessions created for a session pool depends on the number of requests that ask for a session from that pool, and the lifetime of those sessions:

The session pool assigned to a request affects the :ref:`fingerprint <fingerprint>` of the request. Two requests with different session pool IDs are considered different requests, i.e. not duplicates, even if they are otherwise identical.

Optimizing sessions

For faster crawls and lower costs, especially where session initialization requests are more expensive than session usage requests (e.g. scenarios where initialization relies on ``browserHtml`` while usage relies on ``httpResponseBody``), you should try to make your sessions live as long as possible before they are discarded.

Here are some things you can try:

If you do not need :ref:`session checking <session-check>` and your :ref:`initialization parameters <session-init>` are only :http:`request:browserHtml` and :http:`request:actions`, :ref:`server-managed sessions <zapi-session-contexts>` might be a more cost-effective choice, as they live much longer than :ref:`client-managed sessions <zapi-session-id>`.

Overriding session configs

For spiders that target a single website, using settings and request metadata keys for :ref:`session initialization <session-init>` and :ref:`session checking <session-check>` should do the job. However, you might want to define different session configs for different websites in broad-crawl spiders and :doc:`multi-website spiders <zyte-spider-templates:index>`, when you need to modify session-using requests based on session initialization responses, or for code reusability.

The default session config is implemented by the :class:`~scrapy_zyte_api.SessionConfig` class:

.. autoclass:: scrapy_zyte_api.SessionConfig
    :members:

To define a different session config for a given URL pattern, install :doc:`web-poet <web-poet:index>` and define a subclass of :class:`~scrapy_zyte_api.SessionConfig` decorated with :func:`~scrapy_zyte_api.session_config`:

.. autofunction:: scrapy_zyte_api.session_config

If you only need to override the :meth:`SessionConfig.check <scrapy_zyte_api.SessionConfig.check>` or :meth:`SessionConfig.params <scrapy_zyte_api.SessionConfig.params>` methods for scenarios involving a location, you may subclass :class:`~scrapy_zyte_api.LocationSessionConfig` instead:

.. autoclass:: scrapy_zyte_api.LocationSessionConfig
    :members: location_check, location_params

If in a session config implementation or in any other Scrapy component you need to tell whether a request is a :ref:`session initialization request <session-init>` or not, use :func:`~scrapy_zyte_api.is_session_init_request`:

.. autofunction:: scrapy_zyte_api.is_session_init_request

To get the session ID of a given request, use:

.. autofunction:: scrapy_zyte_api.get_request_session_id

Classes decorated with :func:`~scrapy_zyte_api.session_config` are registered into :data:`~scrapy_zyte_api.session_config_registry`:

.. autodata:: scrapy_zyte_api.session_config_registry
    :annotation:

Cookie handling

All requests involved in session management, both requests to initialize a session and requests that are assigned a session, have their :reqmeta:`dont_merge_cookies <scrapy:dont_merge_cookies>` request metadata key set to True if not already defined. Each Zyte API session handles its own cookies instead.

If you set :reqmeta:`dont_merge_cookies <scrapy:dont_merge_cookies>` to False in a request that uses a session, that request will include cookies managed by Scrapy. However, session initialization requests will still have :reqmeta:`dont_merge_cookies <scrapy:dont_merge_cookies>` set to True; you cannot override that.

To include cookies in session initialization requests, use :http:`request:requestCookies` in :ref:`session initialization parameters <session-init>`. Note, however, that those cookies are only set during that request; :ref:`they are not added to the session cookie jar <zapi-session-cookie-jar>`.
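
A settings sketch for sending a cookie during session initialization is shown below. The cookie name, value and domain are placeholders, and the sketch assumes that the target URL is filled in by the plugin; check the Zyte API reference for the full list of ``requestCookies`` fields.

```python
# settings.py sketch: session initialization parameters that send a cookie.
# Cookie name, value and domain are placeholders.
ZYTE_API_SESSION_PARAMS = {
    "browserHtml": True,
    "requestCookies": [
        {"name": "currency", "value": "USD", "domain": "example.com"},
    ],
}
```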

Session retry policies

The following retry policies are designed to work well with session management (see :ref:`enable-sessions`):

.. autodata:: scrapy_zyte_api.SESSION_DEFAULT_RETRY_POLICY
    :annotation:

.. autodata:: scrapy_zyte_api.SESSION_AGGRESSIVE_RETRY_POLICY
    :annotation:
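
As a settings sketch, and assuming that :setting:`ZYTE_API_RETRY_POLICY` accepts an import path string, one of the policies above could be selected like this:

```python
# settings.py sketch: select a session-aware retry policy by import path.
ZYTE_API_RETRY_POLICY = "scrapy_zyte_api.SESSION_DEFAULT_RETRY_POLICY"
```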


Spider closers

Session management can close your spider early in the following scenarios:

A custom :meth:`SessionConfig.check <scrapy_zyte_api.SessionConfig.check>` implementation may also close your spider with a custom reason by raising a :exc:`~scrapy.exceptions.CloseSpider` exception.

Session stats

Plugin-managed sessions record stats that help you understand how well sessions are working.

By default, stats are aggregated across session pools. Set :setting:`ZYTE_API_SESSION_STATS_PER_POOL` to True to enable per-pool stats.

Tracked stats are as follows (pools/{pool}/ is only present if per-pool stats are enabled):

scrapy-zyte-api/sessions/pools/{pool}/init/check-error
    Number of times that a session for pool {pool} triggered an unexpected exception during its session validation check right after initialization.

    It is most likely the result of a bad implementation of :meth:`SessionConfig.check <scrapy_zyte_api.SessionConfig.check>`; the logs should contain an error message with a traceback for such errors.

scrapy-zyte-api/sessions/pools/{pool}/init/check-failed
    Number of times that a session from pool {pool} failed its session validation check right after initialization.

scrapy-zyte-api/sessions/pools/{pool}/init/check-passed
    Number of times that a session from pool {pool} passed its session validation check right after initialization.

scrapy-zyte-api/sessions/pools/{pool}/init/failed
    Number of times that initializing a session for pool {pool} resulted in an :ref:`unsuccessful response <zapi-unsuccessful-responses>`.

scrapy-zyte-api/sessions/pools/{pool}/init/param-error
    Number of times that initializing a session for pool {pool} triggered an unexpected exception when obtaining the Zyte API parameters for session initialization.

    It is most likely the result of a bad implementation of :meth:`SessionConfig.params <scrapy_zyte_api.SessionConfig.params>`; the logs should contain an error message with a traceback for such errors.

scrapy-zyte-api/sessions/pools/{pool}/use/check-error
    Number of times that a response that used a session from pool {pool} triggered an unexpected exception during its session validation check.

    It is most likely the result of a bad implementation of :meth:`SessionConfig.check <scrapy_zyte_api.SessionConfig.check>`; the logs should contain an error message with a traceback for such errors.

scrapy-zyte-api/sessions/pools/{pool}/use/check-failed
    Number of times that a response that used a session from pool {pool} failed its session validation check.

scrapy-zyte-api/sessions/pools/{pool}/use/check-passed
    Number of times that a response that used a session from pool {pool} passed its session validation check.

scrapy-zyte-api/sessions/pools/{pool}/use/expired
    Number of times that a session from pool {pool} expired.

scrapy-zyte-api/sessions/pools/{pool}/use/failed
    Number of times that a request that used a session from pool {pool} got an :ref:`unsuccessful response <zapi-unsuccessful-responses>`.

scrapy-zyte-api/sessions/use/disabled
    Number of processed requests for which session management was disabled.
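
The key layout of the session stats listed above, with the ``pools/{pool}/`` segment present only when per-pool stats are enabled, can be sketched with a small helper. ``session_stat_key`` is a hypothetical function for illustration, e.g. to look up values in the stats collected at the end of a crawl.

```python
from typing import Optional

PREFIX = "scrapy-zyte-api/sessions"


def session_stat_key(stage: str, outcome: str, pool: Optional[str] = None) -> str:
    """Build a session stat key; pass ``pool`` only if per-pool stats are enabled."""
    if pool is None:
        return f"{PREFIX}/{stage}/{outcome}"
    return f"{PREFIX}/pools/{pool}/{stage}/{outcome}"


aggregated = session_stat_key("init", "check-passed")
per_pool = session_stat_key("use", "expired", pool="example.com")
```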