Open
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs), stdlib (Standard Library Python modules in the Lib/ directory), type-feature (A feature request or enhancement)
Description
Bug report
Bug description:
For `codecs.encode` with a `utf-*` encoding and a custom `errors` handler that returns a `str`, passing characters that cannot be encoded as UTF (e.g. surrogates) just raises `UnicodeEncodeError`, instead of the expected (and documented) behavior where the returned `str` is appended to the output.
import codecs

ERRORS_NAME = "returning non-ascii"

# something being not encod-able via `utf-*`
BAD_UTF = "\uD800"  # the first high surrogate character

def register_repl_error(repl: str):
    def error_handle(exc: UnicodeEncodeError) -> tuple[str, int]:
        return (repl, exc.end)
    codecs.register_error(ERRORS_NAME, error_handle)

def encode_surrogate(encoding: str, repl: str):
    register_repl_error(repl)
    max_enc_len = 9
    pre = f"codecs.encode({BAD_UTF!r}, {encoding=:{max_enc_len}}) "
    try:
        res = codecs.encode(BAD_UTF, encoding, ERRORS_NAME)
    except UnicodeEncodeError as err:
        reason = err.reason
        print(pre + f"raises with {reason=}")
    else:
        print(pre + f"returns {res}")

NON_ASCII = "龍"  # loong in Chinese

## utf-*
for i in ('8', '16', '32', '16-le', '16-be'):
    encode_surrogate("utf-" + i, NON_ASCII)

print('-'*3)

# The following is some non-utf* encoding, which works fine
## cjk
### zh
for enc in ("gbk", "big5"):
    encode_surrogate(enc, NON_ASCII)
### jp
for enc in ("Shift_JIS", "EUC-JP"):
    encode_surrogate(enc, NON_ASCII)
Output:
codecs.encode('\ud800', encoding=utf-8    ) raises with reason='surrogates not allowed'
codecs.encode('\ud800', encoding=utf-16   ) raises with reason='surrogates not allowed'
codecs.encode('\ud800', encoding=utf-32   ) raises with reason='surrogates not allowed'
codecs.encode('\ud800', encoding=utf-16-le) raises with reason='surrogates not allowed'
codecs.encode('\ud800', encoding=utf-16-be) raises with reason='surrogates not allowed'
---
codecs.encode('\ud800', encoding=gbk      ) returns b'\xfd\x88'
codecs.encode('\ud800', encoding=big5     ) returns b'\xc0s'
codecs.encode('\ud800', encoding=Shift_JIS) returns b'\x97\xb4'
codecs.encode('\ud800', encoding=EUC-JP   ) returns b'\xce\xb6'
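The handler name `ERRORS_NAME = "returning non-ascii"` suggests the failure depends on the replacement being non-ASCII. The sketch below narrows the report to that contrast for `utf-8` only. The handler names `demo-ascii-repl` and `demo-non-ascii-repl` and the helper `make_handler` are hypothetical, and the claim that an ASCII `str` replacement is accepted is an assumption based on CPython's current behavior, not something stated in the report above.

```python
import codecs

def make_handler(repl):
    # Build an error handler that substitutes `repl` for the offending
    # characters and resumes encoding right after them.
    def handler(exc):
        return (repl, exc.end)
    return handler

# Hypothetical handler names, used only for this illustration.
codecs.register_error("demo-ascii-repl", make_handler("?"))
codecs.register_error("demo-non-ascii-repl", make_handler("龍"))

# ASCII str replacement: assumed to be accepted by the utf-8 encoder.
ascii_result = codecs.encode("\ud800", "utf-8", "demo-ascii-repl")
print("ASCII replacement:", ascii_result)

# Non-ASCII str replacement: the case this report is about.
try:
    non_ascii_result = codecs.encode("\ud800", "utf-8", "demo-non-ascii-repl")
except UnicodeEncodeError as err:
    non_ascii_result = None
    print("non-ASCII replacement raises:", err.reason)
else:
    print("non-ASCII replacement:", non_ascii_result)
```

If the second call raises while the first succeeds, the problem is limited to how the `utf-*` encoders treat non-ASCII `str` replacements, rather than custom error handlers in general.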
CPython versions tested on:
3.9, 3.11, 3.12, 3.13, 3.14
Operating systems tested on:
Linux, Windows
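A possible workaround, sketched here for `utf-8` only and not tested across the versions listed above, is to have the handler return `bytes` instead of `str`, since `codecs.register_error` allows the replacement to be either. The handler name `demo-bytes-repl` and the constant `REPL` are hypothetical, and the encoding is hard-coded as an assumption of this sketch.

```python
import codecs

REPL = "龍"  # same replacement character as NON_ASCII in the report

def bytes_repl_handler(exc):
    # Assumption: this handler is only used with the utf-8 codec, so the
    # replacement is pre-encoded as utf-8 and handed over as bytes.
    return (REPL.encode("utf-8"), exc.end)

# Hypothetical handler name, used only for this illustration.
codecs.register_error("demo-bytes-repl", bytes_repl_handler)

# The lone surrogate is replaced; the surrounding text is encoded normally.
workaround_result = codecs.encode("a\ud800b", "utf-8", "demo-bytes-repl")
print(workaround_result)
```

This sidesteps the encoder's handling of `str` replacements, at the cost of tying the handler to one encoding.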