Merged
Changes from 8 commits
33 changes: 20 additions & 13 deletions Doc/library/html.parser.rst
@@ -15,14 +15,18 @@
This module defines a class :class:`HTMLParser` which serves as the basis for
parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML.

.. class:: HTMLParser(*, convert_charrefs=True)
.. class:: HTMLParser(*, convert_charrefs=True, scripting=False)

Create a parser instance able to parse invalid markup.

If *convert_charrefs* is ``True`` (the default), all character
references (except the ones in ``script``/``style`` elements) are
If *convert_charrefs* is true (the default), all character
references (except the ones in elements like ``script`` and ``style``) are
automatically converted to the corresponding Unicode characters.

If *scripting* is false (the default), the content of the ``noscript``
element is parsed normally; if it's true, it's parsed in RAWTEXT mode,
like ``script``.

An :class:`.HTMLParser` instance is fed HTML data and calls handler methods
when start tags, end tags, text, comments, and other markup elements are
encountered. The user should subclass :class:`.HTMLParser` and override its
@@ -37,6 +41,9 @@ parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML.
.. versionchanged:: 3.5
The default value for argument *convert_charrefs* is now ``True``.

.. versionchanged:: 3.14.1
Added the *scripting* parameter.


Example HTML Parser Application
-------------------------------
@@ -161,24 +168,24 @@ implementations do nothing (except for :meth:`~HTMLParser.handle_startendtag`):
.. method:: HTMLParser.handle_data(data)

This method is called to process arbitrary data (e.g. text nodes and the
content of ``<script>...</script>`` and ``<style>...</style>``).
content of elements like ``script`` and ``style``).


.. method:: HTMLParser.handle_entityref(name)

This method is called to process a named character reference of the form
``&name;`` (e.g. ``&gt;``), where *name* is a general entity reference
(e.g. ``'gt'``). This method is never called if *convert_charrefs* is
``True``.
(e.g. ``'gt'``).
This method is only called if *convert_charrefs* is false.


.. method:: HTMLParser.handle_charref(name)

This method is called to process decimal and hexadecimal numeric character
references of the form :samp:`&#{NNN};` and :samp:`&#x{NNN};`. For example, the decimal
equivalent for ``&gt;`` is ``&#62;``, whereas the hexadecimal is ``&#x3E;``;
in this case the method will receive ``'62'`` or ``'x3E'``. This method
is never called if *convert_charrefs* is ``True``.
in this case the method will receive ``'62'`` or ``'x3E'``.
This method is only called if *convert_charrefs* is false.


.. method:: HTMLParser.handle_comment(data)
@@ -292,8 +299,8 @@ Parsing an element with a few attributes and a title:
Data : Python
End tag : h1

The content of ``script`` and ``style`` elements is returned as is, without
further parsing:
The content of elements like ``script`` and ``style`` is returned as is,
without further parsing:

.. doctest::

@@ -304,10 +311,10 @@ further parsing:
End tag : style

>>> parser.feed('<script type="text/javascript">'
... 'alert("<strong>hello!</strong>");</script>')
... 'alert("<strong>hello! &#9786;</strong>");</script>')
Start tag: script
attr: ('type', 'text/javascript')
Data : alert("<strong>hello!</strong>");
Data : alert("<strong>hello! &#9786;</strong>");
End tag : script

Parsing comments:
@@ -336,7 +343,7 @@ correct char (note: these 3 references are all equivalent to ``'>'``):

Feeding incomplete chunks to :meth:`~HTMLParser.feed` works, but
:meth:`~HTMLParser.handle_data` might be called more than once
(unless *convert_charrefs* is set to ``True``):
if *convert_charrefs* is false:

.. doctest::

22 changes: 18 additions & 4 deletions Lib/html/parser.py
@@ -127,17 +127,25 @@ class HTMLParser(_markupbase.ParserBase):
argument.
"""

CDATA_CONTENT_ELEMENTS = ("script", "style")
# See the HTML5 specs section "13.4 Parsing HTML fragments".
# https://html.spec.whatwg.org/multipage/parsing.html#parsing-html-fragments
# CDATA_CONTENT_ELEMENTS are parsed in RAWTEXT mode
CDATA_CONTENT_ELEMENTS = ("script", "style", "xmp", "iframe", "noembed", "noframes")
RCDATA_CONTENT_ELEMENTS = ("textarea", "title")

def __init__(self, *, convert_charrefs=True):
def __init__(self, *, convert_charrefs=True, scripting=False):
"""Initialize and reset this instance.

If convert_charrefs is True (the default), all character references
If convert_charrefs is true (the default), all character references
are automatically converted to the corresponding Unicode characters.

If *scripting* is false (the default), the content of the
``noscript`` element is parsed normally; if it's true,
it's parsed in RAWTEXT mode.
"""
super().__init__()
self.convert_charrefs = convert_charrefs
self.scripting = scripting
self.reset()

def reset(self):
@@ -172,7 +180,9 @@ def get_starttag_text(self):
def set_cdata_mode(self, elem, *, escapable=False):
self.cdata_elem = elem.lower()
self._escapable = escapable
if escapable and not self.convert_charrefs:
if escapable is None: # PLAINTEXT mode
self.interesting = re.compile(r'\z')
elif escapable and not self.convert_charrefs:
self.interesting = re.compile(r'&|</%s(?=[\t\n\r\f />])' % self.cdata_elem,
re.IGNORECASE|re.ASCII)
else:
@@ -448,6 +458,10 @@ def parse_starttag(self, i):
self.set_cdata_mode(tag)
elif tag in self.RCDATA_CONTENT_ELEMENTS:
self.set_cdata_mode(tag, escapable=True)
elif self.scripting and tag == "noscript":
self.set_cdata_mode(tag)
elif tag == "plaintext":
self.set_cdata_mode(tag, escapable=None)
Review comment (Member):

I don't much like (ab)using escapable=None for PLAINTEXT mode.

Currently the set_cdata_mode function does two things:

  1. determines where the closing tag/end is, which depends on the value of tag passed;
  2. determines whether charrefs are converted, which depends on the value passed to escapable;

Even though there is some duplication, I would prefer something like this:

            if (tag in self.CDATA_CONTENT_ELEMENTS or
                (self.scripting and tag == "noscript") or
                tag == "plaintext"):
                self.set_cdata_mode(tag, escapable=False)
            elif tag in self.RCDATA_CONTENT_ELEMENTS:
                self.set_cdata_mode(tag, escapable=True)

This makes clear that all these cases are handled by set_cdata_mode, with the former ignoring charrefs and the latter converting them.

Then in set_cdata_mode we can set self.interesting based on the values of the args passed. This will also make it clearer what is considered interesting for each tag.
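
A minimal sketch of that idea, written as a standalone function rather than as the real method. The name `interesting_pattern` is hypothetical and not part of `html.parser`. The pattern for the non-escapable case is an assumption, since that branch is not shown in this diff. The sketch uses `\Z` in place of the PR's `\z` so that it also compiles on older Python versions.

```python
import re


def interesting_pattern(elem, *, escapable=False, convert_charrefs=True):
    """Hypothetical helper: pick the "interesting" regex from explicit args.

    The tag decides where the raw content ends; `escapable` decides
    whether character references matter.
    """
    elem = elem.lower()
    if elem == "plaintext":
        # PLAINTEXT has no closing tag: match only at the end of the input.
        return re.compile(r'\Z')
    # Closing tag of the current element, followed by whitespace, '/' or '>'.
    close_tag = r'</%s(?=[\t\n\r\f />])' % re.escape(elem)
    if escapable and not convert_charrefs:
        # RCDATA with handler-based charrefs: also stop at '&'.
        return re.compile('&|' + close_tag, re.IGNORECASE | re.ASCII)
    # Assumed default: only the closing tag is interesting.
    return re.compile(close_tag, re.IGNORECASE | re.ASCII)


# Usage: find where the parser would next need to stop and look.
match = interesting_pattern("script").search('x & </SCRIPT>')
print(match.start())
```

With this shape, the caller in `parse_starttag` would only choose between `escapable=False` and `escapable=True`, and the PLAINTEXT special case would live next to the other pattern choices.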

return endpos

# Internal -- check to see if we have a complete starttag; return end