Skip to content

Conversation

serhiy-storchaka
Copy link
Member

@serhiy-storchaka serhiy-storchaka commented Jul 28, 2025

The codecs lookup function now performs only minimal normalization of the encoding name before passing it to the search functions: all ASCII letters are converted to lower case, spaces are replaced with hyphens.

Excessive normalization broke third-party codecs providers, like python-iconv.

Revert "bpo-37751: Fix codecs.lookup() normalization (GH-15092)"

This reverts commit 20f59fe.


📚 Documentation preview 📚: https://cpython-previews--137167.org.readthedocs.build/

Copy link
Member

@StanFromIreland StanFromIreland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to update normalize_encodings doc.

@malemburg Which order should the two PRs be merged in, switching it to the C implementation and simplifying the implementation?

@serhiy-storchaka
Copy link
Member Author

serhiy-storchaka commented Jul 28, 2025

This mainly restores the status quo prior to bpo-37751 and updates the documentation. But I am planning more changes.

I was not sure about replacing spaces with hyphens? Should we kept this here, on leave it to the search function? Should we convert spaces to underscores instead? Or should we remove any transformation?

I plan also to change _Py_normalize_encoding() -- only convert to lower case and replace spaces, hyphens and underscores, so encoding like "utf#$^(&8" will no longer be accepted. And change encodings.normalize_encoding() in a similar way. Names that are not valid module names after normalization should not be found.

The codecs lookup function now performs only minimal normalization of
the encoding name before passing it to the search functions:
all ASCII letters are converted to lower case, spaces are replaced
with hyphens.

Excessive normalization broke third-party codecs providers, like
python-iconv.

Revert "bpo-37751: Fix codecs.lookup() normalization (pythonGH-15092)"

This reverts commit 20f59fe.
Copy link
Member

@malemburg malemburg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM modulo the one change to the news entry.

Copy link
Member

@StanFromIreland StanFromIreland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

@vstinner vstinner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@malemburg
Copy link
Member

Is there anything left to be done for this PR ?

After merging this one, I'd like to merge #136643

This will then give us a better story overall regarding encoding normalization in CPython: a central function which deals with normalization, avoiding duplicate normalizations as best as possible.

@serhiy-storchaka
Copy link
Member Author

I do not think #136643 should be merged. Normalization is an internal affair of the search function. Different search functions can have different normalizations (including no normalization).

@malemburg
Copy link
Member

I do not think #136643 should be merged. Normalization is an internal affair of the search function. Different search functions can have different normalizations (including no normalization).

Sure, but the encoding package uses the same normalization as the C implementation that's used internally (after all, the C implementation was crafted after the encoding package's normalization function), so it makes sense to reuse it that way.

Of course, other codec packages can have their own search functions and normalizations.

@serhiy-storchaka
Copy link
Member Author

The C code only needs support for few variants of few builting encodings (like "ISO-8859-1", "ISO_8859-1" and "iso8859-1"), but it does not need to support normalization for all encodings or more exotic forms of builting encodings (like "ISO_8859-1:1987" or "iso88591"). It is needed because some encodings should be accessible by their standard names before importing the encodings package.

@malemburg
Copy link
Member

The C code only needs support for few variants of few builting encodings (like "ISO-8859-1", "ISO_8859-1" and "iso8859-1"), but it does not need to support normalization for all encodings or more exotic forms of builting encodings (like "ISO_8859-1:1987" or "iso88591"). It is needed because some encodings should be accessible by their standard names before importing the encodings package.

Yes. I am well aware that it is internally only used for a few encodings, but since it is a C reimplementation of the encoding package's normalization function, it's a good idea to have this implemented in only one place. Given that the C version is faster, merging the PR is what I'd like to do.

@serhiy-storchaka serhiy-storchaka merged commit af58a6f into python:main Sep 9, 2025
45 checks passed
@serhiy-storchaka serhiy-storchaka deleted the normalize_encoding branch September 9, 2025 18:07
lkollar pushed a commit to lkollar/cpython that referenced this pull request Sep 9, 2025
…H-137167)

The codecs lookup function now performs only minimal normalization of
the encoding name before passing it to the search functions:
all ASCII letters are converted to lower case, spaces are replaced
with hyphens.

Excessive normalization broke third-party codecs providers, like
python-iconv.

Revert "bpo-37751: Fix codecs.lookup() normalization (pythonGH-15092)"

This reverts commit 20f59fe.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants