Skip to content
Merged
Show file tree
Hide file tree
Changes from 5 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 5 additions & 1 deletion Lib/html/parser.py
Original file line number Diff line number Diff line change
Expand Up @@ -325,7 +325,11 @@ def parse_html_declaration(self, i):
# this case is actually already handled in goahead()
return self.parse_comment(i)
elif rawdata[i:i+9] == '<![CDATA[':
return self.parse_marked_section(i)
j = rawdata.find(']]>')
if j < 0:
return -1
self.unknown_decl(rawdata[i+3: j])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
self.unknown_decl(rawdata[i+3: j])
self.unknown_decl(rawdata[i+3:j])

return j + 3
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to the HTML5 standard (https://html.spec.whatwg.org/multipage/parsing.html#markup-declaration-open-state), it should be either data or bogus comment (which ends with >, not ]]>), but this depends on the context. It may be that I incorrectly understand the HTML5 standard, because this part is difficult to implement.

Copy link
Member

@ezio-melotti ezio-melotti Jul 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried copying the content of the tests in the following file:

<!DOCTYPE html>
<html>
<body>
<![CDATA[just some plain text]]><hr>
<![CDATA[<!-- not a comment -->]]><hr>
<![CDATA[&not-an-entity-ref;]]><hr>
<![CDATA[<not a='start tag'>]]><hr>
<![CDATA[]]><hr>
<![CDATA[[[I have many brackets]]]]><hr>
<![CDATA[I have a > in the middle]]><hr>
<![CDATA[I have a ]] in the middle]]><hr>
<![CDATA[] ]>]]><hr>
<![CDATA[]] >]]><hr>
<![CDATA[
    if (a < b && a > b) {
        printf("[<marquee>How?</marquee>]");
    }
]]><hr>

</body>
</html>

and this was the result on Firefox:

<html><head></head><body>
<!--[CDATA[just some plain text]]--><hr>
<!--[CDATA[<!-- not a comment ---->]]&gt;<hr>
<!--[CDATA[&not-an-entity-ref;]]--><hr>
<!--[CDATA[<not a='start tag'-->]]&gt;<hr>
<!--[CDATA[]]--><hr>
<!--[CDATA[[[I have many brackets]]]]--><hr>
<!--[CDATA[I have a --> in the middle]]&gt;<hr>
<!--[CDATA[I have a ]] in the middle]]--><hr>
<!--[CDATA[] ]-->]]&gt;<hr>
<!--[CDATA[]] -->]]&gt;<hr>
<!--[CDATA[
    if (a < b && a --> b) {
        printf("[<marquee>How?</marquee>]");
    }
]]&gt;<hr>



</body></html>
Image

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, and if you try <svg><text y="100"><![CDATA[foo<br>bar]]></text></svg>, you will see that content between <![CDATA[ and ]]> is interpreted as a raw data.

This is context dependent.

HTMLParser is actually just a tokenizer. To determine the context automatically, it needs to support the stack of open elements and to know what elements are in the HTML namespace. This is all in the specification, and we will implement this in future. But this is a different level of complexity. So I solved the issue by letting the user to determine the context. New method support_cdata() sets how HTMLParser will parse CDATA. This is not good, but perhaps better than the current state.

elif rawdata[i:i+9].lower() == '<!doctype':
# find the closing >
gtpos = rawdata.find('>', i+9)
Expand Down
42 changes: 21 additions & 21 deletions Lib/test/test_htmlparser.py
Original file line number Diff line number Diff line change
Expand Up @@ -721,27 +721,27 @@ def test_broken_condcoms(self):
]
self._run_check(html, expected)

def test_cdata_declarations(self):
# More tests should be added. See also "8.2.4.42. Markup
# declaration open state", "8.2.4.69. CDATA section state",
# and issue 32876
html = ('<![CDATA[just some plain text]]>')
expected = [('unknown decl', 'CDATA[just some plain text')]
self._run_check(html, expected)

def test_cdata_declarations_multiline(self):
html = ('<code><![CDATA['
' if (a < b && a > b) {'
' printf("[<marquee>How?</marquee>]");'
' }'
']]></code>')
expected = [
('starttag', 'code', []),
('unknown decl',
'CDATA[ if (a < b && a > b) { '
'printf("[<marquee>How?</marquee>]"); }'),
('endtag', 'code')
]
@support.subTests('content', [
'just some plain text',
'<!-- not a comment -->',
'&not-an-entity-ref;',
"<not a='start tag'>",
'',
'[[I have many brackets]]',
'I have a > in the middle',
'I have a ]] in the middle',
'] ]>',
']] >',
('\n'
' if (a < b && a > b) {\n'
' printf("[<marquee>How?</marquee>]");\n'
' }\n'),
])
def test_cdata_section(self, content):
# See "13.2.5.42 Markup declaration open state",
# "13.2.5.69 CDATA section state", and issue bpo-32876.
html = f'<![CDATA[{content}]]>'
expected = [('unknown decl', 'CDATA[' + content)]
self._run_check(html, expected)

def test_convert_charrefs_dropped_text(self):
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
Fix CDATA section parsing in :class:`html.parser.HTMLParser` according to
the HTML5 standard: ``] ]>`` and ``]] >`` no longer end the CDATA section.
Comment on lines +1 to +2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Fix CDATA section parsing in :class:`html.parser.HTMLParser` according to
the HTML5 standard: ``] ]>`` and ``]] >`` no longer end the CDATA section.
Fix CDATA section parsing in :class:`html.parser.HTMLParser` according to
the HTML5 standard: ``] ]>`` and ``]] >`` no longer end the CDATA section.
By default, :meth:`~HTMLParser.handle_comment` is called for CDATA.
The old behavior (calling :meth:`~HTMLParser.unknown_decl`) can be restored
using a new method, :meth:`~HTMLParser.support_cdata`.

Loading