gh-88886: Remove excessive encoding name normalization #137167

serhiy-storchaka · 2025-07-28T14:06:22Z

The codecs lookup function now performs only minimal normalization of the encoding name before passing it to the search functions: all ASCII letters are converted to lower case, spaces are replaced with hyphens.
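To make the rule above concrete, here is a rough pure-Python sketch of the minimal normalization as described. It is only an illustration: the real logic lives in C in Python/codecs.c, and the helper name `minimal_normalize` is made up for this example.

```python
import string

# Hypothetical helper: mirrors the rule stated in the description,
# not the actual C implementation.
_TABLE = str.maketrans(
    string.ascii_uppercase + " ",
    string.ascii_lowercase + "-",
)

def minimal_normalize(name: str) -> str:
    """Lower-case ASCII letters, turn spaces into hyphens, keep everything else."""
    return name.translate(_TABLE)

print(minimal_normalize("UTF 8"))       # utf-8
print(minimal_normalize("ISO_8859-1"))  # iso_8859-1
```

Underscores, hyphens and other punctuation pass through unchanged, so a search function sees them as the caller wrote them.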

Excessive normalization broke third-party codecs providers, like python-iconv.
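One way to see what a third-party provider actually receives is to register a search function that only records its argument. This is a sketch with made-up codec names; `recording_search` is hypothetical, and the exact strings recorded may differ between Python versions depending on how much normalization the lookup function applies.

```python
import codecs

seen = []  # names exactly as the codec machinery passes them on

def recording_search(name):
    # Hypothetical third-party search function: record the name and
    # return None so that codecs.lookup() moves on to the next one.
    seen.append(name)
    return None

codecs.register(recording_search)

for spelling in ("Demo-Codec X1", "DEMO_CODEC.X1"):
    try:
        codecs.lookup(spelling)
    except LookupError:
        pass  # expected: nothing provides these made-up codecs

# Inspect how the names were rewritten before reaching the search function.
print(seen)
```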

Revert "bpo-37751: Fix codecs.lookup() normalization (GH-15092)"

This reverts commit 20f59fe.

Issue: Codec name normalization breaks custom codecs #88886

📚 Documentation preview 📚: https://cpython-previews--137167.org.readthedocs.build/

StanFromIreland

You need to update the normalize_encoding() doc.

@malemburg Which order should the two PRs be merged in, switching it to the C implementation and simplifying the implementation?

serhiy-storchaka · 2025-07-28T14:19:12Z

This mainly restores the status quo prior to bpo-37751 and updates the documentation. But I am planning more changes.

I was not sure about replacing spaces with hyphens. Should we keep this here, or leave it to the search function? Should we convert spaces to underscores instead? Or should we remove any transformation?

I also plan to change _Py_normalize_encoding() so that it only converts to lower case and replaces spaces, hyphens and underscores; an encoding name like "utf#$^(&8" will then no longer be accepted. I plan to change encodings.normalize_encoding() in a similar way. Names that are not valid module names after normalization should not be found.
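A possible shape for such a stricter normalization, purely as a sketch of the plan above and not the code that was or will be committed. The function name `strict_normalize`, the choice of "_" as the replacement character, and the ASCII-only rule for "valid module name" are all assumptions made for illustration.

```python
import re
from typing import Optional

# Assumption: runs of spaces, hyphens and underscores collapse to one "_".
_SEPARATORS = re.compile(r"[ _-]+")
# Assumption: a usable module name consists of ASCII letters, digits and "_".
_MODULE_NAME = re.compile(r"[a-z0-9_]+")

def strict_normalize(name: str) -> Optional[str]:
    """Return the normalized name, or None if it cannot be a module name."""
    if not name.isascii():
        return None
    candidate = _SEPARATORS.sub("_", name.lower())
    return candidate if _MODULE_NAME.fullmatch(candidate) else None

print(strict_normalize("ISO 8859-1"))  # iso_8859_1
print(strict_normalize("utf#$^(&8"))   # None
```

Under these assumptions, punctuation other than the three separators is never rewritten, so a name such as "utf#$^(&8" is rejected instead of silently becoming "utf_8".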

malemburg

LGTM modulo the one change to the news entry.

Misc/NEWS.d/next/Core_and_Builtins/2025-07-28-17-01-05.gh-issue-88886.g4XFPb.rst

StanFromIreland

The docs still need updating: https://docs.python.org/3.15/library/codecs.html#encodings.normalize_encoding

Doc/library/codecs.rst

vstinner

LGTM

Python/codecs.c

malemburg · 2025-09-09T11:29:25Z

Is there anything left to be done for this PR?

After merging this one, I'd like to merge #136643

This will then give us a better story overall regarding encoding normalization in CPython: a central function which deals with normalization, avoiding duplicate normalizations as best as possible.

serhiy-storchaka · 2025-09-09T12:00:37Z

I do not think #136643 should be merged. Normalization is an internal affair of the search function. Different search functions can have different normalizations (including no normalization).
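As an illustration of a search function that applies its own normalization, the sketch below registers a provider that ignores every separator. The alias "myutf8" and the function `relaxed_search` are invented for this example; the only real APIs used are `codecs.register()` and `codecs.lookup()`.

```python
import codecs

def relaxed_search(name):
    # Hypothetical provider with its own normalization: drop every
    # non-alphanumeric character, so "my-utf-8", "my_utf_8" and
    # "myutf8" are all treated as the same name.
    key = "".join(ch for ch in name if ch.isalnum())
    if key == "myutf8":
        # Reuse the built-in UTF-8 codec under the made-up alias.
        return codecs.lookup("utf-8")
    return None

codecs.register(relaxed_search)

print("h\u00e9llo".encode("My_UTF-8"))
```

Because the provider normalizes for itself, it works the same whether the lookup function passes hyphens through or rewrites them.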

malemburg · 2025-09-09T12:16:51Z

> I do not think #136643 should be merged. Normalization is an internal affair of the search function. Different search functions can have different normalizations (including no normalization).

Sure, but the encodings package uses the same normalization as the C implementation that's used internally (after all, the C implementation was modeled on the encodings package's normalization function), so it makes sense to reuse it that way.

Of course, other codec packages can have their own search functions and normalizations.

serhiy-storchaka · 2025-09-09T12:41:54Z

The C code only needs to support a few variants of a few built-in encodings (like "ISO-8859-1", "ISO_8859-1" and "iso8859-1"), but it does not need to support normalization for all encodings or more exotic forms of built-in encodings (like "ISO_8859-1:1987" or "iso88591"). It is needed because some encodings should be accessible by their standard names before importing the encodings package.
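The common spellings mentioned above can be checked through the public API. This sketch does not exercise the C fast path itself (that is not reachable from Python code); it only shows that the three variants end up at the same codec.

```python
import codecs

spellings = ["ISO-8859-1", "ISO_8859-1", "iso8859-1"]

# Collect the canonical codec name each spelling resolves to.
names = {codecs.lookup(s).name for s in spellings}
print(names)

# Every spelling decodes the same way.
for s in spellings:
    print(s, b"\xe9".decode(s) == "\u00e9")
```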

malemburg · 2025-09-09T12:59:42Z

> The C code only needs to support a few variants of a few built-in encodings (like "ISO-8859-1", "ISO_8859-1" and "iso8859-1"), but it does not need to support normalization for all encodings or more exotic forms of built-in encodings (like "ISO_8859-1:1987" or "iso88591"). It is needed because some encodings should be accessible by their standard names before importing the encodings package.

Yes, I am well aware that it is only used internally for a few encodings, but since it is a C reimplementation of the encodings package's normalization function, it's a good idea to have this implemented in only one place. Given that the C version is faster, merging the PR is what I'd like to do.

serhiy-storchaka requested a review from malemburg July 28, 2025 14:06

bedevere-app bot added the awaiting core review label Jul 28, 2025

serhiy-storchaka requested a review from vstinner July 28, 2025 14:06

bedevere-app bot mentioned this pull request Jul 28, 2025

Codec name normalization breaks custom codecs #88886

Open

StanFromIreland reviewed Jul 28, 2025

serhiy-storchaka force-pushed the normalize_encoding branch from 2df6169 to ae1cae2 July 28, 2025 14:23

malemburg approved these changes Jul 29, 2025

bedevere-app bot added awaiting merge and removed awaiting core review labels Jul 29, 2025

StanFromIreland reviewed Jul 29, 2025

malemburg mentioned this pull request Jul 29, 2025

encoding package's normalize_encoding() function is too slow #55531

Open

Update a NEWS entry.

9c0f595

vstinner approved these changes Sep 2, 2025

malemburg mentioned this pull request Sep 9, 2025

gh-55531: Implement normalize_encoding in C #136643

Open

serhiy-storchaka and others added 2 commits September 9, 2025 14:56

Update Python/codecs.c

4546171

Co-authored-by: Victor Stinner <[email protected]>

Merge branch 'main' into normalize_encoding

68308d9

serhiy-storchaka merged commit af58a6f into python:main Sep 9, 2025
45 checks passed

bedevere-app bot removed the awaiting merge label Sep 9, 2025

serhiy-storchaka deleted the normalize_encoding branch September 9, 2025 18:07
