gh-99064: Added warnings for file objects with non-UTF8 content #123887

rruuaanng · 2024-09-10T03:24:27Z

#99064 reports that if the source contains non-ASCII characters, the parse method returns incorrect results. However, no warning is emitted when this happens.

Issue: xml.etree.ElementTree: file source must be binary for non-UTF-8 encodings #99064
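For context, here is a minimal sketch of the scenario the linked issue describes. The sample document, the file name, and the `parse_both_ways` helper are hypothetical and do not come from the PR. Only the binary-mode result is relied on here; the text-mode result is returned for comparison and may differ between Python versions.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical sample document whose declaration names ISO-8859-1.
XML_TEXT = '<?xml version="1.0" encoding="ISO-8859-1"?><root>caf\u00e9</root>'


def parse_both_ways():
    """Return (text parsed from a binary-mode file, text parsed from a text-mode file)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.xml")
        with open(path, "wb") as f:
            f.write(XML_TEXT.encode("iso-8859-1"))

        # Binary mode: the parser receives the raw bytes and can honour
        # the encoding named in the XML declaration.
        with open(path, "rb") as f:
            binary_result = ET.parse(f).getroot().text

        # Text mode: the data is decoded to str before the parser sees it.
        # According to gh-99064 the result can be wrong on affected versions.
        with open(path, "r", encoding="iso-8859-1") as f:
            text_result = ET.parse(f).getroot().text

    return binary_result, text_result


binary_result, text_result = parse_both_ways()
print(repr(binary_result), repr(text_result))
```

Comparing the two printed values on a given interpreter shows whether that build is affected.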

erlend-aasland · 2024-10-15T08:54:14Z

Please see the Making good PRs section of the devguide. I suggest taking a close look at it before opening new PRs; incomplete PRs only result in wasted reviewer time and CI churn.

This PR lacks documentation and tests.

rruuaanng · 2024-10-16T07:08:04Z

I have made changes. Please review! @erlend-aasland

Lib/xml/etree/ElementTree.py

Misc/NEWS.d/next/Library/2024-09-10-11-19-05.gh-issue-99064.iHNbcr.rst

rruuaanng · 2024-10-16T10:58:55Z

Thanks to @vstinner and @erlend-aasland for reviewing. Thank you for your hard work! Please review again :)

Lib/xml/etree/ElementTree.py

rruuaanng · 2024-10-16T11:18:02Z

There may not be any other problems. Thanks for the review.

Doc/library/xml.etree.elementtree.rst

vstinner · 2024-10-16T11:42:41Z

Lib/xml/etree/ElementTree.py

+        warnings.warn(
+            "For file objects containing XML data "
+            "with non-ASCII and non-UTF-8 encoding (for example. ISO 8859-1), "
+            "the file must have been opened in binary mode.", category=RuntimeWarning)


Honestly, I don't understand this warning message :-( I am not sure how to rephrase it.

The warning is emitted even if XML is encoded to ASCII. "Non-UTF-8" is unclear to me. I'm not sure why ISO 8859-1 is only given as an example, whereas the warning is only emitted if the encoding is ISO-8859-1.

Should we also emit the warning for ISO-8859-15? What about ShiftJIS? Or Big5?

Let me try to propose a better message.

"File using {source.encoding} encoding should open XML in binary mode."

And consider emitting the warning if the encoding is neither ASCII nor UTF-8.

I'm not an XML expert, so I'm not sure :-(
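To make the suggestion above concrete, here is a minimal sketch of such a check. It is not the code from the PR: the helper name `warn_if_text_mode_non_utf8` is hypothetical, and the choice to normalise encoding aliases with `codecs.lookup` is an assumption. Only the message wording and the "neither ASCII nor UTF-8" condition come from the review comment.

```python
import codecs
import io
import warnings


def warn_if_text_mode_non_utf8(source):
    """Hypothetical helper: warn for text-mode files not using ASCII or UTF-8.

    Returns True if a warning was emitted, False otherwise.
    """
    encoding = getattr(source, "encoding", None)
    # Binary files and objects without a known encoding are left alone.
    if not isinstance(source, io.TextIOBase) or not encoding:
        return False
    # Normalise aliases such as "utf8" or "latin-1" before comparing.
    try:
        name = codecs.lookup(encoding).name
    except LookupError:
        name = encoding.lower()
    if name in ("ascii", "utf-8"):
        return False
    warnings.warn(
        f"File using {source.encoding} encoding should open XML in binary mode.",
        category=RuntimeWarning,
        stacklevel=2,
    )
    return True


# Usage: one text-mode wrapper per encoding, plus a plain binary stream.
latin1_file = io.TextIOWrapper(io.BytesIO(b"<root/>"), encoding="iso-8859-1")
utf8_file = io.TextIOWrapper(io.BytesIO(b"<root/>"), encoding="utf-8")
binary_file = io.BytesIO(b"<root/>")

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    flagged = warn_if_text_mode_non_utf8(latin1_file)
    utf8_flagged = warn_if_text_mode_non_utf8(utf8_file)
    binary_flagged = warn_if_text_mode_non_utf8(binary_file)
```

Whether this condition is the right one is exactly the open question in this thread; the sketch only shows what the proposed rule would look like.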

By "non-UTF-8 and non-ASCII" I mean encodings like ISO-8859-1: its character set is not limited to ASCII, and it is not UTF-8 either. I agree with your point of view that the message should be shortened. I think it should be changed to "ISO 8859-1 encoding should be read using 'rb' mode".

Should we also emit the warning for ISO-8859-15? What about ShiftJIS? Or Big5?

IMO, that should be identified (on the issue) first. It seems weird to only add a warning for one particular non-UTF8 encoding. I would guess that the problem is not limited to ISO-8859-1.

You are right, this may affect all 8-bit encodings, such as ISO-8859-[1,15]. Please give me some time to test these encodings (it's not easy to find suitable test text).

Why only 8-bit encodings? What about UTF-16? Or any other multibyte encoding?

Because this occurs with an 8-bit encoding that has been read in non-rb mode. I can't guarantee that it will not happen with 16-bit encodings, although, in fact, a 16-bit encoding can be read with r instead of rb.

I suggest closing this PR and going back to the drawing board. Try to understand the issue fully before opening the next PR.

I will convert the PR into a draft and continue to track the problem.

rruuaanng · 2024-10-16T12:25:13Z

Maybe I should introduce ISO-8859-1. It adds 96 letters and symbols in the previously vacant range 0xA0-0xFF, for Latin-alphabet languages that need additional symbols. However, the newly added part is not valid UTF-8, and because of these 96 additions the encoding is not plain ASCII either. Since it is an 8-bit character set, it must be read byte by byte.
For more, see https://en.wikipedia.org/wiki/ISO/IEC_8859-1.
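The byte-level claim above can be checked with a short sketch. The `is_valid_utf8` helper and the variable names are hypothetical and used only for illustration.

```python
# The 96 byte values that ISO-8859-1 assigns in the range 0xA0-0xFF.
high_bytes = bytes(range(0xA0, 0x100))

# Every one of them decodes to exactly one character in ISO-8859-1.
latin1_text = high_bytes.decode("iso-8859-1")


def is_valid_utf8(data):
    """Return True if the given bytes decode as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# None of these bytes, taken alone, is valid UTF-8, and none is ASCII.
lone_byte_validity = [is_valid_utf8(bytes([b])) for b in high_bytes]
all_non_ascii = all(b > 0x7F for b in high_bytes)

# Mixing the two encodings produces mojibake: UTF-8 bytes read as ISO-8859-1.
mojibake = "\u00e9".encode("utf-8").decode("iso-8859-1")
print(len(latin1_text), any(lone_byte_validity), all_non_ascii, repr(mojibake))
```

The last line shows why decoding with the wrong one of the two encodings yields garbled text rather than an error.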

vstinner · 2024-10-16T13:03:47Z

I closed the PR. The scope of the change is not well defined. Someone has to properly test different encodings and decide which encodings emit a warning or not. The issue should be better analyzed before jumping into a concrete implementation.

pythongh-99064: Added warnings for file objects with non-UTF8 content

851b179

bedevere-app bot added the awaiting review label Sep 10, 2024

bedevere-app bot mentioned this pull request Sep 10, 2024

xml.etree.ElementTree: file source must be binary for non-UTF-8 encodings #99064

Open

Add double backticks

c2c9aa7

rruuaanng force-pushed the dev4 branch from 67774e3 to c2c9aa7 Compare September 10, 2024 03:27

rruuaanng added 4 commits September 10, 2024 11:31

Add tab

65984a4

Delete trailing-whitespace

cd9eaa8

Delete trailing-whitespace

b84075c

Change NEWS

bfdfc5b

rruuaanng force-pushed the dev4 branch from fdc6922 to bfdfc5b Compare September 10, 2024 03:40

Change func label

eac7791

rruuaanng force-pushed the dev4 branch from 057439f to eac7791 Compare September 10, 2024 04:22

rruuaanng added 2 commits September 10, 2024 12:36

Change NEWS

3cbd0ca

Merge branch 'main' into dev4

45b0e8f

erlend-aasland marked this pull request as draft October 15, 2024 08:50

bedevere-app bot removed the awaiting review label Oct 15, 2024

rruuaanng added 2 commits October 15, 2024 18:39

Add documentation

121a6b1

Add test

604dc15

rruuaanng marked this pull request as ready for review October 16, 2024 06:51

bedevere-app bot added the awaiting review label Oct 16, 2024

rruuaanng added 2 commits October 16, 2024 14:51

Remove extra spaces

a88a0b4

Change NEWS

32b35b5

vstinner reviewed Oct 16, 2024

Lib/xml/etree/ElementTree.py (outdated)

Misc/NEWS.d/next/Library/2024-09-10-11-19-05.gh-issue-99064.iHNbcr.rst (outdated)

rruuaanng added 2 commits October 16, 2024 18:35

Change NEWS

be27e53

Move import

5b1a048

rruuaanng requested a review from vstinner October 16, 2024 10:37

Change if cond

12d91b5

rruuaanng requested a review from erlend-aasland October 16, 2024 10:49

erlend-aasland reviewed Oct 16, 2024

Lib/xml/etree/ElementTree.py (outdated)

erlend-aasland reviewed Oct 16, 2024

Lib/xml/etree/ElementTree.py (outdated)

rruuaanng added 2 commits October 16, 2024 19:05

Add a space

cd9e7a7

e.g. change to for example

99aa9b5

rruuaanng requested a review from erlend-aasland October 16, 2024 11:10

erlend-aasland reviewed Oct 16, 2024

Lib/xml/etree/ElementTree.py (outdated)

Remove newline

78834a8

vstinner reviewed Oct 16, 2024

Doc/library/xml.etree.elementtree.rst (outdated)

rruuaanng added 2 commits October 16, 2024 19:20

Change docs

ca044ce

Add space

2d98fff

vstinner reviewed Oct 16, 2024

rruuaanng added 2 commits October 16, 2024 20:13

Change warn info

b753d31

Change warn info

cbd6dbb

erlend-aasland marked this pull request as draft October 16, 2024 12:51

bedevere-app bot removed the awaiting review label Oct 16, 2024

vstinner closed this Oct 16, 2024

rruuaanng deleted the dev4 branch October 16, 2024 13:06
