gh-99064: Added warnings for file objects with non-UTF8 content #123887
Please see the Making good PRs section of the devguide. I suggest taking a close look at it before opening new PRs; incomplete PRs only result in wasted reviewer time and CI churn. This PR lacks documentation and tests.
I have made changes. Please review! @erlend-aasland
Misc/NEWS.d/next/Library/2024-09-10-11-19-05.gh-issue-99064.iHNbcr.rst
Thanks to @vstinner and @erlend-aasland for reviewing. Thank you for your hard work! Please review again :)
There should not be any other problems now. Thanks for reviewing.
Lib/xml/etree/ElementTree.py
warnings.warn(
    "For file objects containing XML data "
    "with non-ASCII and non-UTF-8 encoding (for example. ISO 8859-1), "
    "the file must have been opened in binary mode.", category=RuntimeWarning)
Honestly, I don't understand this warning message :-( I am not sure how to rephrase it.
The warning is emitted even if XML is encoded to ASCII. "Non-UTF-8" is unclear to me. I'm not sure why ISO 8859-1 is only given as an example, whereas the warning is only emitted if the encoding is ISO-8859-1.
Should we also emit the warning for ISO-8859-15? What about ShiftJIS? Or Big5?
Let me try to propose a better message.
"File using {source.encoding} encoding should open XML in binary mode."
And consider emitting the warning if the encoding is neither ASCII nor UTF-8.
I'm not an XML expert, so I'm not sure :-(
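The check suggested in this comment could be sketched roughly as follows. This is only an illustration under stated assumptions: the helper name `warn_if_text_mode_non_utf8` is hypothetical and not part of the PR or of `xml.etree.ElementTree`, and whether "neither ASCII nor UTF-8" is the right condition is exactly what the thread leaves open.

```python
import codecs
import io
import warnings


def warn_if_text_mode_non_utf8(source):
    """Hypothetical helper (not from the PR): warn when `source` is a
    text-mode file object whose encoding is neither ASCII nor UTF-8.

    Returns True if a warning was emitted, False otherwise.
    """
    encoding = getattr(source, "encoding", None)
    # Binary files are not TextIOBase; io.StringIO reports encoding=None.
    if not isinstance(source, io.TextIOBase) or encoding is None:
        return False
    try:
        # codecs.lookup() normalizes aliases, e.g. "latin-1" -> "iso8859-1".
        name = codecs.lookup(encoding).name
    except LookupError:
        name = encoding.lower()
    if name in ("ascii", "utf-8"):
        return False
    warnings.warn(
        f"File using {encoding} encoding should open XML in binary mode.",
        RuntimeWarning,
        stacklevel=2,
    )
    return True
```

Normalizing through `codecs.lookup()` avoids comparing against every spelling of an encoding name; variants such as `utf-8-sig` would still trigger the warning in this sketch.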
In fact, "non-UTF8" and "non-ASCII" refer to something like ISO-8859-1: its character set is not the same as ASCII, and it isn't considered UTF-8. I agree with your point of view that the message should be shortened. I think it should be changed to "ISO 8859-1 encoding should be read using 'rb' mode".
Should we also emit the warning for ISO-8859-15? What about ShiftJIS? Or Big5?
IMO, that should be identified (on the issue) first. It seems weird to only add a warning for one particular non-UTF8 encoding. I would guess that the problem is not limited to ISO-8859-1.
You are right, this problem may exist in all 8-bit encodings, such as ISO-8859-[1,15]. Please give me some time to test these encodings (it's not easy to find suitable test text).
Why only 8-bit encodings? What about UTF-16? Or any other multibyte encoding?
Because this occurs with 8-bit encodings when the file has been read in a non-'rb' mode. I can't guarantee that it will not happen with 16-bit encodings, although in fact a 16-bit encoding can be read with 'r' instead of 'rb'.
I suggest closing this PR and going back to the drawing board. Try to understand the issue fully before opening the next PR.
I will convert the PR into a draft and continue to track the problem.
Maybe I should introduce ISO-8859-1. It adds
I closed the PR. The scope of the change is not well defined. Someone has to properly test different encodings and decide which encodings should emit a warning. The issue should be better analyzed before jumping into a concrete implementation.
#99064 mentioned that if the source contains non-ASCII characters, the parse method will return incorrect results. However, no warning is emitted in this case.
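For context, the binary-mode pattern the proposed warning points users toward can be sketched as below. This is a minimal illustration, not code from the PR: the sample document and the helper name `parse_in_binary_mode` are made up for the example. In binary mode the parser sees the raw bytes and can honor the encoding named in the XML declaration.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Sample document (an assumption for this sketch): declared and stored as
# ISO-8859-1, with one non-ASCII character (e with acute accent, U+00E9).
XML_TEXT = '<?xml version="1.0" encoding="ISO-8859-1"?><root>caf\u00e9</root>'


def parse_in_binary_mode(path):
    """Hypothetical helper: parse an XML file opened in 'rb' mode so the
    parser decodes the bytes according to the XML declaration."""
    with open(path, "rb") as f:
        return ET.parse(f).getroot()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "latin1.xml")
    with open(path, "wb") as f:
        f.write(XML_TEXT.encode("iso-8859-1"))
    binary_text = parse_in_binary_mode(path).text

print(binary_text)
```

Opening the same file in text mode hands the parser already-decoded strings instead, which is the situation #99064 describes; the exact outcome there is what the thread asks to be tested across encodings.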