Description
Describe the bug
pywb 2.9.1 no longer seems able to handle WARC headers that contain non-latin-1 characters.
Given a WARC record such as:
WARC/1.0
WARC-Type: response
WARC-Record-ID: urn:uuid:1d39260d-fe6c-45a7-a8e1-b07625642fc8
WARC-Target-URI: https://rm.wikipedia.org/wiki/Datoteca:Otto_Carisch_(1789–1858).jpg
WARC-Date: 2025-10-15T08:05:04Z
WARC-Payload-Digest: sha1:XE4A7CUPWUOCWYKFIG2AWLZHULYEHTAP
WARC-Block-Digest: sha1:VHPNYPS7BV5F5BUQTZMSAHUTXIT6N5FU
Content-Type: application/http; msgtype=response
Content-Length: 61903
The replay of this page (https://rm.wikipedia.org/wiki/Datoteca:Otto_Carisch_%281789–1858%29.jpg) fails with:
Traceback (most recent call last):
File "C:\Tools\pywb\Lib\site-packages\gevent\pywsgi.py", line 969, in start_response
value.encode("latin-1")))
^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'latin-1' codec can't encode character '\u2013' in position 57: ordinal not in range(256)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Tools\pywb\Lib\site-packages\pywb\warcserver\basewarcserver.py", line 92, in __call__
start_response('200 OK', list(out_headers.items()))
File "C:\Tools\pywb\Lib\site-packages\gevent\pywsgi.py", line 972, in start_response
raise UnicodeError("Non-latin1 header", repr(header), repr(value))
UnicodeError: ('Non-latin1 header', "'WARC-Target-URI'", "'https://rm.wikipedia.org/wiki/Datoteca:Otto_Carisch_(1789–1858).jpg'")
Traceback (most recent call last):
File "C:\Tools\pywb\Lib\site-packages\gevent\pywsgi.py", line 969, in start_response
value.encode("latin-1")))
^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'latin-1' codec can't encode character '\u2013' in position 57: ordinal not in range(256)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Tools\pywb\Lib\site-packages\pywb\warcserver\basewarcserver.py", line 92, in __call__
start_response('200 OK', list(out_headers.items()))
File "C:\Tools\pywb\Lib\site-packages\gevent\pywsgi.py", line 972, in start_response
raise UnicodeError("Non-latin1 header", repr(header), repr(value))
UnicodeError: ('Non-latin1 header', "'WARC-Target-URI'", "'https://rm.wikipedia.org/wiki/Datoteca:Otto_Carisch_(1789–1858).jpg'")
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Tools\pywb\Lib\site-packages\gevent\pywsgi.py", line 1107, in handle_one_response
self.run_application()
File "C:\Tools\pywb\Lib\site-packages\gevent\pywsgi.py", line 1053, in run_application
self.result = self.application(self.environ, self.start_response)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Tools\pywb\Lib\site-packages\pywb\warcserver\basewarcserver.py", line 106, in __call__
return self.send_error({}, start_response,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Tools\pywb\Lib\site-packages\pywb\warcserver\basewarcserver.py", line 148, in send_error
start_response(message, list(out_headers.items()))
File "C:\Tools\pywb\Lib\site-packages\gevent\pywsgi.py", line 983, in start_response
self.status = status.encode("latin-1")
^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'latin-1' codec can't encode character '\u2013' in position 122: ordinal not in range(256)
2025-10-15T08:06:53Z {'REMOTE_ADDR': '127.0.0.1', 'REMOTE_PORT': '55238', 'HTTP_HOST': 'localhost:57817', (hidden keys: 23)} failed with UnicodeEncodeError
127.0.0.1 - - [2025-10-15 10:06:53] "POST /wiki/resource/postreq?url=https%3A%2F%2Frm.wikipedia.org%2Fwiki%2FDatoteca%253AOtto_Carisch_%25281789%25E2%2580%25931858%2529.jpg&closest=20251008171921&matchType=exact HTTP/1.1" 500 161 0.128404
2025-10-15 10:06:53,454: [DEBUG]: http://localhost:57817 "POST /wiki/resource/postreq?url=https%3A%2F%2Frm.wikipedia.org%2Fwiki%2FDatoteca%253AOtto_Carisch_%25281789%25E2%2580%25931858%2529.jpg&closest=20251008171921&matchType=exact HTTP/1.1" 500 21
127.0.0.1 - - [2025-10-15 10:06:53] "GET /wiki/20251008171921/https://rm.wikipedia.org/wiki/Datoteca%3AOtto_Carisch_%281789%E2%80%931858%29.jpg HTTP/1.1" 500 1587 0.208289
127.0.0.1 - - [2025-10-15 10:06:53] "GET /static/css/bootstrap.min.css HTTP/1.1" 200 153240 0.001996
127.0.0.1 - - [2025-10-15 10:06:53] "GET /static/css/font-awesome.min.css HTTP/1.1" 200 54739 0.002008
127.0.0.1 - - [2025-10-15 10:06:53] "GET /static/css/base.css HTTP/1.1" 200 962 0.001023
127.0.0.1 - - [2025-10-15 10:06:53] "GET /static/js/jquery-latest.min.js HTTP/1.1" 200 87036 0.007615
127.0.0.1 - - [2025-10-15 10:06:53] "GET /static/js/bootstrap.min.js HTTP/1.1" 200 76418 0.009702
127.0.0.1 - - [2025-10-15 10:06:53] "GET /static/js/bootstrap.bundle.min.js.map HTTP/1.1" 200 115 0.001821
127.0.0.1 - - [2025-10-15 10:06:55] "GET /static/sw.js?replayPrefix= HTTP/1.1" 200 1026803 0.010996
2025-10-15 10:20:23,028: [DEBUG]: Starting new HTTP connection (1): localhost:57817
Traceback (most recent call last):
File "C:\Tools\pywb\Lib\site-packages\gevent\pywsgi.py", line 969, in start_response
value.encode("latin-1")))
^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'latin-1' codec can't encode character '\u2013' in position 57: ordinal not in range(256)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Tools\pywb\Lib\site-packages\pywb\warcserver\basewarcserver.py", line 92, in __call__
start_response('200 OK', list(out_headers.items()))
File "C:\Tools\pywb\Lib\site-packages\gevent\pywsgi.py", line 972, in start_response
raise UnicodeError("Non-latin1 header", repr(header), repr(value))
UnicodeError: ('Non-latin1 header', "'WARC-Target-URI'", "'https://rm.wikipedia.org/wiki/Datoteca:Otto_Carisch_(1789–1858).jpg'")
Traceback (most recent call last):
File "C:\Tools\pywb\Lib\site-packages\gevent\pywsgi.py", line 969, in start_response
value.encode("latin-1")))
^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'latin-1' codec can't encode character '\u2013' in position 57: ordinal not in range(256)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Tools\pywb\Lib\site-packages\pywb\warcserver\basewarcserver.py", line 92, in __call__
start_response('200 OK', list(out_headers.items()))
File "C:\Tools\pywb\Lib\site-packages\gevent\pywsgi.py", line 972, in start_response
raise UnicodeError("Non-latin1 header", repr(header), repr(value))
UnicodeError: ('Non-latin1 header', "'WARC-Target-URI'", "'https://rm.wikipedia.org/wiki/Datoteca:Otto_Carisch_(1789–1858).jpg'")
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Tools\pywb\Lib\site-packages\gevent\pywsgi.py", line 1107, in handle_one_response
self.run_application()
File "C:\Tools\pywb\Lib\site-packages\gevent\pywsgi.py", line 1053, in run_application
self.result = self.application(self.environ, self.start_response)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Tools\pywb\Lib\site-packages\pywb\warcserver\basewarcserver.py", line 106, in __call__
return self.send_error({}, start_response,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Tools\pywb\Lib\site-packages\pywb\warcserver\basewarcserver.py", line 148, in send_error
start_response(message, list(out_headers.items()))
File "C:\Tools\pywb\Lib\site-packages\gevent\pywsgi.py", line 983, in start_response
self.status = status.encode("latin-1")
^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'latin-1' codec can't encode character '\u2013' in position 122: ordinal not in range(256)
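The failing step in the tracebacks above can be reproduced outside pywb, which may help narrow down the problem. The sketch below encodes the same WARC-Target-URI as latin-1 (the operation gevent's `start_response` performs on header values) and then shows one possible workaround: percent-encoding values that are not latin-1 encodable before they are passed on as headers. The helpers `to_latin1_safe` and `sanitize_headers` are hypothetical names used for illustration only; they are not part of pywb or gevent, and this is not necessarily how a fix in pywb would or should look.

```python
from urllib.parse import quote

uri = "https://rm.wikipedia.org/wiki/Datoteca:Otto_Carisch_(1789–1858).jpg"

# Reproduce the failing step: the header value is encoded as latin-1.
try:
    uri.encode("latin-1")
    first_bad_index = None
except UnicodeEncodeError as exc:
    # Index of the first character latin-1 cannot represent
    # (the en dash, matching "position 57" in the traceback).
    first_bad_index = exc.start


def to_latin1_safe(value):
    """Hypothetical helper: return value unchanged if it is latin-1
    encodable, otherwise percent-encode its non-ASCII characters as UTF-8."""
    try:
        value.encode("latin-1")
        return value
    except UnicodeEncodeError:
        # Keep URL delimiters and existing %-escapes untouched.
        return quote(value, safe="/:?#[]@!$&'()*+,;=%")


def sanitize_headers(headers):
    """Hypothetical helper: apply to_latin1_safe to every header value."""
    return [(name, to_latin1_safe(value)) for name, value in headers]


safe_headers = sanitize_headers(
    [("WARC-Target-URI", uri), ("WARC-Type", "response")]
)
print(first_bad_index)
print(safe_headers[0][1])
```

With this approach the en dash becomes `%E2%80%93`, the same form that already appears in the request URLs in the access log, so the header value passes the latin-1 check. Whether rewriting WARC-Target-URI this way is acceptable for replay is an open question for the maintainers.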
Environment
- OS: Windows 11
- Python 3.12
- pywb 2.9.1
- Browser: Chrome
Additional context
The WARC was written with warcio 1.7.5.