Skip to content

Conversation

rruuaanng
Copy link
Contributor

@rruuaanng rruuaanng commented Sep 10, 2024

#99064 mentioned that if source contains non-ASCII characters, the parse method will return incorrect results. However, there is no warning for this process.

@erlend-aasland erlend-aasland marked this pull request as draft October 15, 2024 08:50
@erlend-aasland
Copy link
Contributor

Please see the Making good PRs section of the devguide. I suggest taking a close look at it before opening new PRs; incomplete PRs only result in wasted reviewer and CI churn.

This PR lacks documentation and tests.

@rruuaanng rruuaanng marked this pull request as ready for review October 16, 2024 06:51
@rruuaanng
Copy link
Contributor Author

I have made changes, Please reviewer! @erlend-aasland

@rruuaanng rruuaanng requested a review from vstinner October 16, 2024 10:37
@rruuaanng
Copy link
Contributor Author

Thank for @vstinner, @erlend-aasland reviewer. Thank you for your hard work! Please review again :)

@rruuaanng
Copy link
Contributor Author

They may not have any other problems. Thank for reviewer.

warnings.warn(
"For file objects containing XML data "
"with non-ASCII and non-UTF-8 encoding (for example. ISO 8859-1), "
"the file must have been opened in binary mode.", category=RuntimeWarning)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly, I don't understand this warning message :-( I am not sure how to rephase it.

The warning is emitted even if XML is encoded to ASCII. "Non-UTF-8" is unclear to me. I'm not sure why ISO 8859-1 is only given as an example, whereas the warning is only emitted if the encoding is ISO-8859-1.

Should we also emit the warning for ISO-8859-15? What about ShiftJIS? Or Big5?

Let me try to propose a better message.

"File using {source.encoding} encoding should open XML in binary mode."

And consider emitting the warning if the encoding is not ASCII nor UTF-8.

I'm not a XML expert, so I'm not sure :-(

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact, non-UTF8 and non-ASCII refer to something similar to ISO-8859-1, But in fact the character set it contains is not similar to ASCII, and it isn't considered UTF8. I agree with your point of view, it should be shortened, I think it should be changed to ISO 8859-1 encoding should be read using 'rb' mode.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also emit the warning for ISO-8859-15? What about ShiftJIS? Or Big5?

IMO, that should be identified (on the issue) first. It seems weird to only add a warning for one particular non-UTF8 encoding. I would guess that the problem is not limited to ISO-8859-1.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, this may exist in all 8-bit encodings, such as ISO-8859-[1,15]. Please give me some time to test these codes (because it's not easy to find a suitable test text).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why only 8-bit encodings? What about UTF-16? Or any other multibyte encoding?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because this occurs in 8-bit encoding and has been read in non-rb mode (although I can't guarantee that this will not happen in 16-bit, in fact, 16-bit encoding can be read with r instead of rb)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest closing this PR and going back to the drawing board. Try to understand the issue fully before opening the next PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I convert PR into a draft and continue to track the problem.

@rruuaanng
Copy link
Contributor Author

Maybe I should introduce ISO-8859-1. It adds 96 letters and symbols to the vacant range of 0xA0-0xFF for use by Latin alphabet languages ​​that use additional symbols. But in fact, the newly added part does not belong to UTF8, and since it adds 96 new contents, it is not actually ASCII. Since it is an 8-bit character set, it must be read byte by byte.
for more to see https://en.wikipedia.org/wiki/ISO/IEC_8859-1.

@erlend-aasland erlend-aasland marked this pull request as draft October 16, 2024 12:51
@vstinner vstinner closed this Oct 16, 2024
@vstinner
Copy link
Member

I closed the PR. The scope of the change is not well defined. Someone has to properly test different encodings and decide which encodings emit a warning or not. The issue should be better analyzed before jumping into a concrete implementation.

@rruuaanng rruuaanng deleted the dev4 branch October 16, 2024 13:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants