Closed
Changes from 16 commits
5 changes: 5 additions & 0 deletions Doc/library/xml.etree.elementtree.rst
@@ -665,6 +665,11 @@ Functions
not given, the standard :class:`XMLParser` parser is used. Returns an
:class:`ElementTree` instance.

.. warning::

When the *source* encoding is ``ISO-8859-1`` and the mode is ``r``, a
:exc:`RuntimeWarning` is emitted.


.. function:: ProcessingInstruction(target, text=None)

10 changes: 10 additions & 0 deletions Lib/test/test_xml_etree.py
@@ -464,6 +464,16 @@ def test_makeelement(self):
        elem[:] = tuple([subelem])
        self.serialize_check(elem, '<tag><subtag key="value" /></tag>')

    def test_parse_encoding_warn(self):
        with self.assertWarns(RuntimeWarning) as cm:
            with open(SIMPLE_XMLFILE, 'r', encoding='ISO-8859-1') as fp:
                ET.parse(fp)
        self.assertIn(
            "For file objects containing XML data"
            "with non-ASCII and non-UTF-8 encoding (e.g. ISO 8859-1), "
            "the file must have been opened in binary mode.",
            str(cm.warnings[0].message))

    def test_parsefile(self):
        # Test parsing from file.

7 changes: 7 additions & 0 deletions Lib/xml/etree/ElementTree.py
@@ -1200,6 +1200,13 @@ def parse(source, parser=None):
    Return an ElementTree instance.

    """
    if (getattr(source, 'encoding', None) == 'ISO-8859-1' and
            source.mode == 'r'):
        import warnings
        warnings.warn(
            "For file objects containing XML data"
            "with non-ASCII and non-UTF-8 encoding (e.g. ISO 8859-1), "
            "the file must have been opened in binary mode.", category=RuntimeWarning)
Member

Honestly, I don't understand this warning message :-( I am not sure how to rephrase it.

The warning is emitted even if XML is encoded to ASCII. "Non-UTF-8" is unclear to me. I'm not sure why ISO 8859-1 is only given as an example, whereas the warning is only emitted if the encoding is ISO-8859-1.

Should we also emit the warning for ISO-8859-15? What about ShiftJIS? Or Big5?

Let me try to propose a better message.

"File using {source.encoding} encoding should open XML in binary mode."

And consider emitting the warning if the encoding is neither ASCII nor UTF-8.

I'm not an XML expert, so I'm not sure :-(
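
For illustration, the broader check suggested in this comment could look roughly like the sketch below. It is not part of the PR: `warn_if_text_mode` is a hypothetical helper name, and using `codecs.lookup` to normalize encoding aliases is an assumption about how the comparison might be done. Only the message wording comes from the comment above.

```python
import codecs
import io
import warnings

# Canonical codec names that would NOT trigger the warning.
_SAFE_CODECS = {codecs.lookup("ascii").name, codecs.lookup("utf-8").name}


def warn_if_text_mode(source):
    """Hypothetical helper (not in the PR): warn when *source* is a text
    stream decoded with an encoding other than ASCII or UTF-8.

    Returns True if a warning was emitted, False otherwise.
    """
    encoding = getattr(source, "encoding", None)
    # Binary streams and plain paths have no usable text encoding.
    if not isinstance(source, io.TextIOBase) or not encoding:
        return False
    try:
        # Normalize aliases such as "latin-1" or "UTF8".
        codec_name = codecs.lookup(encoding).name
    except LookupError:
        codec_name = encoding.lower()
    if codec_name in _SAFE_CODECS:
        return False
    warnings.warn(
        f"File using {encoding} encoding should open XML in binary mode.",
        RuntimeWarning,
        stacklevel=2,
    )
    return True
```

Whether such a check is the right fix is exactly what the rest of this thread questions; the sketch only makes the proposal concrete.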

Contributor Author

By "non-UTF-8 and non-ASCII" I meant encodings such as ISO-8859-1, whose character set goes beyond ASCII and which is not UTF-8. I agree with you that the message should be shortened. I think it should be changed to "ISO 8859-1 encoding should be read using 'rb' mode."

Contributor

> Should we also emit the warning for ISO-8859-15? What about ShiftJIS? Or Big5?

IMO, that should be identified (on the issue) first. It seems weird to only add a warning for one particular non-UTF8 encoding. I would guess that the problem is not limited to ISO-8859-1.

Contributor Author

You are right, this problem may exist in all 8-bit encodings, such as ISO-8859-[1,15]. Please give me some time to test these encodings (it's not easy to find suitable test text).

Contributor

Why only 8-bit encodings? What about UTF-16? Or any other multibyte encoding?

Contributor Author

Because this was observed with an 8-bit encoding read in a mode other than 'rb'. I can't guarantee that it will not happen with 16-bit encodings, but in fact a 16-bit encoding can be read with 'r' instead of 'rb'.
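
As background for the 'r' versus 'rb' distinction discussed here, the following is a minimal sketch of the binary-mode case. The file name and sample document are made up for illustration. When the file is opened with 'rb', the parser receives raw bytes and applies the encoding named in the XML declaration itself.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Sample document (illustrative only), encoded as ISO-8859-1 on disk.
xml_bytes = (
    '<?xml version="1.0" encoding="ISO-8859-1"?><root>café</root>'
).encode("iso-8859-1")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "latin1.xml")
    with open(path, "wb") as fp:
        fp.write(xml_bytes)

    # Binary mode: the parser decodes the bytes using the declared encoding.
    with open(path, "rb") as fp:
        root = ET.parse(fp).getroot()

print(root.text)  # prints "café"
```

The sketch deliberately shows only the binary-mode path; which text-mode encodings actually misbehave is the open question in this thread.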

Contributor

I suggest closing this PR and going back to the drawing board. Try to understand the issue fully before opening the next PR.

Contributor Author

I will convert this PR to a draft and continue to track the problem.

    tree = ElementTree()
    tree.parse(source, parser)
    return tree
@@ -0,0 +1,2 @@
Added a warning in :func:`xml.etree.ElementTree.parse` for
files opened in ``r`` mode with ``'ISO-8859-1'`` encoding. Patch by RUANG.