pywb cannot handle non-latin1 WARC-headers #961

@steph-nb

Describe the bug

pywb 2.9.1 no longer seems able to cope with non-latin1 WARC headers.

Given a WARC record like this:
WARC/1.0
WARC-Type: response
WARC-Record-ID: urn:uuid:1d39260d-fe6c-45a7-a8e1-b07625642fc8
WARC-Target-URI: https://rm.wikipedia.org/wiki/Datoteca:Otto_Carisch_(1789–1858).jpg
WARC-Date: 2025-10-15T08:05:04Z
WARC-Payload-Digest: sha1:XE4A7CUPWUOCWYKFIG2AWLZHULYEHTAP
WARC-Block-Digest: sha1:VHPNYPS7BV5F5BUQTZMSAHUTXIT6N5FU
Content-Type: application/http; msgtype=response
Content-Length: 61903

The replay of this page (https://rm.wikipedia.org/wiki/Datoteca:Otto_Carisch_%281789–1858%29.jpg) fails with:

Traceback (most recent call last):
File "C:\Tools\pywb\Lib\site-packages\gevent\pywsgi.py", line 969, in start_response
value.encode("latin-1")))
^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'latin-1' codec can't encode character '\u2013' in position 57: ordinal not in range(256)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "C:\Tools\pywb\Lib\site-packages\pywb\warcserver\basewarcserver.py", line 92, in __call__
start_response('200 OK', list(out_headers.items()))
File "C:\Tools\pywb\Lib\site-packages\gevent\pywsgi.py", line 972, in start_response
raise UnicodeError("Non-latin1 header", repr(header), repr(value))
UnicodeError: ('Non-latin1 header', "'WARC-Target-URI'", "'https://rm.wikipedia.org/wiki/Datoteca:Otto_Carisch_(1789–1858).jpg'")
Traceback (most recent call last):
File "C:\Tools\pywb\Lib\site-packages\gevent\pywsgi.py", line 969, in start_response
value.encode("latin-1")))
^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'latin-1' codec can't encode character '\u2013' in position 57: ordinal not in range(256)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "C:\Tools\pywb\Lib\site-packages\pywb\warcserver\basewarcserver.py", line 92, in __call__
start_response('200 OK', list(out_headers.items()))
File "C:\Tools\pywb\Lib\site-packages\gevent\pywsgi.py", line 972, in start_response
raise UnicodeError("Non-latin1 header", repr(header), repr(value))
UnicodeError: ('Non-latin1 header', "'WARC-Target-URI'", "'https://rm.wikipedia.org/wiki/Datoteca:Otto_Carisch_(1789–1858).jpg'")

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "C:\Tools\pywb\Lib\site-packages\gevent\pywsgi.py", line 1107, in handle_one_response
self.run_application()
File "C:\Tools\pywb\Lib\site-packages\gevent\pywsgi.py", line 1053, in run_application
self.result = self.application(self.environ, self.start_response)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Tools\pywb\Lib\site-packages\pywb\warcserver\basewarcserver.py", line 106, in __call__
return self.send_error({}, start_response,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Tools\pywb\Lib\site-packages\pywb\warcserver\basewarcserver.py", line 148, in send_error
start_response(message, list(out_headers.items()))
File "C:\Tools\pywb\Lib\site-packages\gevent\pywsgi.py", line 983, in start_response
self.status = status.encode("latin-1")
^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'latin-1' codec can't encode character '\u2013' in position 122: ordinal not in range(256)
2025-10-15T08:06:53Z {'REMOTE_ADDR': '127.0.0.1', 'REMOTE_PORT': '55238', 'HTTP_HOST': 'localhost:57817', (hidden keys: 23)} failed with UnicodeEncodeError

127.0.0.1 - - [2025-10-15 10:06:53] "POST /wiki/resource/postreq?url=https%3A%2F%2Frm.wikipedia.org%2Fwiki%2FDatoteca%253AOtto_Carisch_%25281789%25E2%2580%25931858%2529.jpg&closest=20251008171921&matchType=exact HTTP/1.1" 500 161 0.128404
2025-10-15 10:06:53,454: [DEBUG]: http://localhost:57817 "POST /wiki/resource/postreq?url=https%3A%2F%2Frm.wikipedia.org%2Fwiki%2FDatoteca%253AOtto_Carisch_%25281789%25E2%2580%25931858%2529.jpg&closest=20251008171921&matchType=exact HTTP/1.1" 500 21
127.0.0.1 - - [2025-10-15 10:06:53] "GET /wiki/20251008171921/https://rm.wikipedia.org/wiki/Datoteca%3AOtto_Carisch_%281789%E2%80%931858%29.jpg HTTP/1.1" 500 1587 0.208289
127.0.0.1 - - [2025-10-15 10:06:53] "GET /static/css/bootstrap.min.css HTTP/1.1" 200 153240 0.001996
127.0.0.1 - - [2025-10-15 10:06:53] "GET /static/css/font-awesome.min.css HTTP/1.1" 200 54739 0.002008
127.0.0.1 - - [2025-10-15 10:06:53] "GET /static/css/base.css HTTP/1.1" 200 962 0.001023
127.0.0.1 - - [2025-10-15 10:06:53] "GET /static/js/jquery-latest.min.js HTTP/1.1" 200 87036 0.007615
127.0.0.1 - - [2025-10-15 10:06:53] "GET /static/js/bootstrap.min.js HTTP/1.1" 200 76418 0.009702
127.0.0.1 - - [2025-10-15 10:06:53] "GET /static/js/bootstrap.bundle.min.js.map HTTP/1.1" 200 115 0.001821
127.0.0.1 - - [2025-10-15 10:06:55] "GET /static/sw.js?replayPrefix= HTTP/1.1" 200 1026803 0.010996
2025-10-15 10:20:23,028: [DEBUG]: Starting new HTTP connection (1): localhost:57817

Environment

  • OS: Windows 11
  • Python 3.12
  • pywb 2.9.1
  • Browser: Chrome

Additional context

WARC written with warcio 1.7.5.
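
For reference, the failure can be reproduced without pywb or gevent, using only the standard library. The sketch below encodes the WARC-Target-URI value from the record above as latin-1, which is what the traceback shows gevent doing, and reports where it fails. `make_header_safe` is a hypothetical helper, not part of pywb or gevent; it only illustrates one possible way to make the value encodable (percent-encoding non-ASCII characters as UTF-8) and is not a claim about how a fix should look.

```python
from urllib.parse import quote

# Value of WARC-Target-URI from the record above (contains U+2013, an en dash).
TARGET_URI = "https://rm.wikipedia.org/wiki/Datoteca:Otto_Carisch_(1789\u20131858).jpg"


def latin1_error_position(value):
    """Return the index of the first character latin-1 cannot encode, or None."""
    try:
        value.encode("latin-1")
    except UnicodeEncodeError as exc:
        return exc.start
    return None


def make_header_safe(value):
    """Hypothetical helper: percent-encode non-ASCII characters as UTF-8.

    Printable ASCII (including '%') is left untouched, so a value that is
    already percent-encoded is not encoded a second time.
    """
    ascii_printable = "".join(chr(c) for c in range(0x21, 0x7F))
    return quote(value, safe=ascii_printable)


if __name__ == "__main__":
    print(latin1_error_position(TARGET_URI))
    print(make_header_safe(TARGET_URI))
```

The reported index matches the "position 57" in the first UnicodeEncodeError above, and the percent-encoded value contains only ASCII, so it can be encoded as latin-1.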
