gh-69426: only unescape properly terminated character entities in attribute values #95215
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
|
|
@@ -12,6 +12,7 @@ | |||||
| import _markupbase | ||||||
|
|
||||||
| from html import unescape | ||||||
| from html.entities import html5 as html5_entities | ||||||
|
|
||||||
|
|
||||||
| __all__ = ['HTMLParser'] | ||||||
|
|
@@ -57,6 +58,26 @@ | |||||
| # </ and the tag name, so maybe this should be fixed | ||||||
| endtagfind = re.compile(r'</\s*([a-zA-Z][-.a-zA-Z0-9:_]*)\s*>') | ||||||
|
|
||||||
| # Character reference processing logic specific to attribute values | ||||||
| # See: https://html.spec.whatwg.org/multipage/parsing.html#named-character-reference-state | ||||||
| attr_charref = re.compile(r'&(#[0-9]+|#[xX][0-9a-fA-F]+|[a-zA-Z][a-zA-Z0-9]*)[;=]?') | ||||||
|
|
||||||
| def replace_attr_charref(match): | ||||||
| ref = match.group(0) | ||||||
| # Numeric / hex char refs must always be unescaped | ||||||
| if ref[1] == '#': | ||||||
|
||||||
| if ref[1] == '#': | |
| if ref.startswith('&#'): |
I think this is clearer.
Done
| terminates_with_equals = ref[-1:] == '=' | |
| terminates_with_equals = ref.endswith('=') |
Done
Both functions should be private, and their name prefixed by an _.
Done
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -347,17 +347,17 @@ def test_convert_charrefs(self): | |
| self.assertTrue(collector().convert_charrefs) | ||
| charrefs = ['&quot;', '&#34;', '&#x22;', '&quot', '&#34', '&#x22'] | ||
| # check charrefs in the middle of the text/attributes | ||
| expected = [('starttag', 'a', [('href', 'foo"zar')]), | ||
| expected = [('starttag', 'a', [('href', 'foo " zar')]), | ||
| ('data', 'a"z'), ('endtag', 'a')] | ||
| for charref in charrefs: | ||
| self._run_check('<a href="foo{0}zar">a{0}z</a>'.format(charref), | ||
| self._run_check('<a href="foo {0} zar">a{0}z</a>'.format(charref), | ||
| expected, collector=collector()) | ||
| # check charrefs at the beginning/end of the text/attributes | ||
| # check charrefs at the beginning/end of the text | ||
| expected = [('data', '"'), | ||
| ('starttag', 'a', [('x', '"'), ('y', '"X'), ('z', 'X"')]), | ||
| ('starttag', 'a', []), | ||
| ('data', '"'), ('endtag', 'a'), ('data', '"')] | ||
| for charref in charrefs: | ||
| self._run_check('{0}<a x="{0}" y="{0}X" z="X{0}">' | ||
| self._run_check('{0}<a>' | ||
|
Changed the existing tests to remove flawed assumptions about how the unescaping in attribute values should work.
It might be better to remove all attribute-related checks from this test, and move them in the next.
Done
||
| '{0}</a>{0}'.format(charref), | ||
| expected, collector=collector()) | ||
| # check charrefs in <script>/<style> elements | ||
|
|
@@ -380,6 +380,48 @@ def test_convert_charrefs(self): | |
| self._run_check('no charrefs here', [('data', 'no charrefs here')], | ||
| collector=collector()) | ||
|
|
||
| def test_convert_charrefs_in_attribute_values(self): | ||
| # default value for convert_charrefs is now True | ||
| collector = lambda: EventCollectorCharrefs() | ||
| self.assertTrue(collector().convert_charrefs) | ||
|
|
||
| # do unescape numeric and hex char refs | ||
| expected = [('starttag', 'a', | ||
| [('href', 'https://example.com?foo¢=bar¢&baz¢=bla¢')]), | ||
| ('endtag', 'a')] | ||
| self._run_check('<a href="https://example.com?foo¢=bar¢&baz¢=bla¢"></a>', expected, collector=collector()) | ||
|
|
||
| # do unescape entity matches not followed by ASCII alphanumeric | ||
| expected = [('starttag', 'a', | ||
| [('href', 'https://example.com?foo¢¢ ¢+¢')]), | ||
| ('endtag', 'a')] | ||
| self._run_check('<a href="https://example.com?foo¢¢ ¢+¢"></a>', expected, collector=collector()) | ||
|
|
||
| # do not unescape entity matches followed by ASCII alphanumeric | ||
| expected = [('starttag', 'a', | ||
| [('href', 'https://example.com?foo&center&cent123')]), | ||
| ('endtag', 'a')] | ||
| self._run_check('<a href="https://example.com?foo&center&cent123"></a>', expected, collector=collector()) | ||
|
|
||
| # do not unescape entity matches followed by equals | ||
| expected = [('starttag', 'a', | ||
| [('href', 'https://example.com?foo&cent=123')]), | ||
| ('endtag', 'a')] | ||
| self._run_check('<a href="https://example.com?foo&cent=123"></a>', expected, collector=collector()) | ||
|
|
||
| # do unescape terminated entity matches followed by equals | ||
| expected = [('starttag', 'a', | ||
| [('href', 'https://example.com?foo¢=123')]), | ||
| ('endtag', 'a')] | ||
| self._run_check('<a href="https://example.com?foo&cent;=123"></a>', expected, collector=collector()) | ||
|
||
|
|
||
| # do unescape char refs at beginning and end of text attributes | ||
| charrefs = ['&quot;', '&#34;', '&#x22;', '&quot', '&#34', '&#x22'] | ||
| expected = [('starttag', 'a', [('x', '"'), ('y', '"-X'), ('z', 'X-"')]), ('endtag', 'a')] | ||
| for charref in charrefs: | ||
| self._run_check('<a x="{0}" y="{0}-X" z="X-{0}"></a>'.format(charref), | ||
| expected, collector=collector()) | ||
|
||
|
|
||
| # the remaining tests were for the "tolerant" parser (which is now | ||
| # the default), and check various kind of broken markup | ||
| def test_tolerant_parsing(self): | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,2 @@ | ||
| Fix :class:`HTMLParser` to not unescape character entities in attribute | ||
| values if they are followed by an ASCII alphanumeric or an equals sign. |
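To illustrate why this matters, the snippet below shows what plain `html.unescape` does to a URL whose query string happens to contain parameter names that start with a legacy entity name. The URL and variable names are made up for the example; the point is only that an unconditional `unescape` call on an attribute value rewrites such parameters.

```python
from html import unescape

# A URL as it might appear, with unescaped ampersands, inside an href attribute.
raw = 'https://example.com?foo=1&cent=2&center=3'

# html.unescape() also decodes legacy named references that lack a ';',
# so both '&cent=' and the '&cent' prefix of '&center' are replaced.
converted = unescape(raw)
print(converted)
```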
This partially duplicates an existing Regex, but I was not able to reuse the existing one for this purpose.
Since the issue only seems to affect named character references, is there a reason to include numeric charrefs too in this regex?
Yes, the new `_unescape_attrvalue` is effectively a wrapper for `html.unescape` that only delegates to `html.unescape` if the attribute-specific conditions are met. Since we still want to unescape numeric and hex char refs in attributes, we need to include them in the regex.
It would be better to move this immediately after the definition of `entityref` and `charref`. If we change one regexp, we will not forget to change the other.
Done