Configure temporary WARC storage location #36

@wvengen

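When running a spider using this extension from a read-only location (e.g. a Docker container, as in scrapyd-k8s-spider-example), writing the WARC file fails with a PermissionError. The file is opened by a relative name, so it ends up in the current working directory: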

[scrapy.utils.signal] ERROR: Error caught on signal handler: <bound method WaczExporter.response_downloaded of <scrapy_webarchive.extensions.WaczExporter object at 0x7fe9ebff4790>>
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/scrapy/utils/signal.py", line 43, in send_catch_log
    response = robustApply(
               ^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/pydispatch/robustapply.py", line 55, in robustApply
    return receiver(*arguments, **named)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/scrapy_webarchive/extensions.py", line 165, in response_downloaded
    response_record = self.writer.write_response(response, request)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/scrapy_webarchive/warc/writers.py", line 78, in write_response
    record = self.write_record(
             ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/scrapy_webarchive/warc/writers.py", line 39, in write_record
    with open(self.warc_fname, "ab") as fh:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^
PermissionError: [Errno 13] Permission denied: 'Spider-20250000000000-00000-scrapyd-project-12345-abcde.warc.gz'

I think it makes sense to introduce a configuration setting for this location, perhaps defaulting to the directory given by the TMP environment variable (or /tmp).
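
A minimal sketch of what this could look like, not the extension's actual code. The setting name `WARC_TMP_DIR` and the helper `resolve_warc_path` are hypothetical names chosen for illustration. When the setting is absent, the sketch falls back to `tempfile.gettempdir()`, which checks the TMPDIR, TEMP and TMP environment variables before trying platform defaults such as /tmp.

```python
import os
import tempfile
from typing import Any, Mapping


def resolve_warc_path(warc_fname: str, settings: Mapping[str, Any]) -> str:
    """Return a path for the WARC file inside a directory meant to be writable.

    WARC_TMP_DIR is a hypothetical setting name used for illustration only.
    """
    configured = settings.get("WARC_TMP_DIR")
    # Fall back to the system temporary directory when nothing is configured.
    base_dir = str(configured) if configured else tempfile.gettempdir()
    # Create the directory if it does not exist yet.
    os.makedirs(base_dir, exist_ok=True)
    # Keep only the file name, so the file never lands in the working directory.
    return os.path.join(base_dir, os.path.basename(warc_fname))
```

The writer would then open the returned path instead of the bare file name, for example `open(resolve_warc_path(self.warc_fname, settings), "ab")`, where `settings` stands for whatever settings object the extension has access to.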

Labels: bug (Something isn't working)