Commit ea436a7

add php advisory
1 parent 0314bbe commit ea436a7

1 file changed

+72
-0
lines changed

@@ -0,0 +1,72 @@
1+
---
2+
title: "PHP HTML parser differential due to libxml2 lack of HTML5 support"
3+
date: 2023-11-29
4+
tags:
5+
- "parser differential"
6+
- "xss"
7+
- "mxss"
8+
- "bypass"
9+
advisory: true
10+
origin:
11+
cves:
12+
ghsas:
13+
---
14+
### Summary
15+
The default HTML parser of PHP relies on the underlying libxml2 library ([for example here](https://github.com/php/php-src/blob/master/ext/dom/document.c#L1920)). Libxml2 doesn’t [currently support](https://gitlab.gnome.org/GNOME/libxml2/-/issues/211) HTML5 parsing. Support is in progress, but when we contacted the maintainers about this matter they said it will take a while before the feature is implemented. This means that the built-in HTML parser of PHP behind [loadHTML](https://www.php.net/manual/en/domdocument.loadhtml.php), [DOMImplementation](https://www.php.net/manual/en/class.domimplementation.php), etc. does not follow the same parsing rules as modern web browsers.
16+
This behaviour becomes security-relevant when HTML sanitizers use the built-in HTML parser.
17+
We have come across multiple PHP sanitizers that are vulnerable to bypasses due to using the built-in parser, and we think that the root cause can't be addressed without significant changes by libxml2.
18+
19+
### PoC
20+
Here are some examples of how attackers can leverage these parsing differentials in order to bypass sanitizers.
21+
22+
#### 1. Comments:
23+
According to the [XML specification](https://www.w3.org/TR/xml/#sec-comments) (XHTML), comments must end with the characters `-->`. On the other hand, the [HTML specification](https://html.spec.whatwg.org/multipage/syntax.html#comments) states that a comment's text “must not start with the string `>`, nor start with the string `->`”. Browsers handle such input by treating `<!-->` and `<!--->` as complete, empty comments.
24+
When parsing the following string in a browser, the comment ends before the `p` tag. When parsing with PHP, however, the `p` tag is considered part of a comment:
25+
```
26+
Input: <!--><p>
27+
Browser (HTML specification) output: <!----><p></p>
28+
PHP parser (XHTML specification) output: <!--><p>-->
29+
```
30+
This can be done with either `<!-->` or `<!--->`.
31+
An attacker can submit a payload such as `<!--><xss>-->`. The PHP parser treats the `xss` tag as part of a comment, while the browser ends the comment right before it and renders the `xss` tag as a real element.
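
The two comment rules can be compared with a small model. The following is a minimal sketch in Python, not libxml2's or any browser's actual code: `html5_comment_end` and `legacy_comment_end` are hypothetical helper names, and the legacy scan is only an assumption that approximates the "first `-->` after the opener" behaviour described above.

```python
def html5_comment_end(s: str) -> int:
    """Index just past the comment that opens at s[0], following a
    simplified reading of the HTML tokenizer's comment states."""
    if not s.startswith("<!--"):
        raise ValueError("expected a comment opener")
    # "<!-->" and "<!--->" are complete, empty comments in HTML.
    if s.startswith("<!-->"):
        return 5
    if s.startswith("<!--->"):
        return 6
    # Otherwise the comment ends at the first "-->" or "--!>".
    ends = []
    for closer in ("-->", "--!>"):
        i = s.find(closer, 4)
        if i != -1:
            ends.append(i + len(closer))
    # An unterminated comment runs to the end of the input.
    return min(ends) if ends else len(s)


def legacy_comment_end(s: str) -> int:
    """Assumed legacy behaviour: scan for the first "-->" after the opener."""
    if not s.startswith("<!--"):
        raise ValueError("expected a comment opener")
    i = s.find("-->", 4)
    return len(s) if i == -1 else i + 3


payload = "<!--><xss>-->"
# What each model leaves outside the comment, i.e. what gets parsed as markup.
print("HTML5 model :", repr(payload[html5_comment_end(payload):]))
print("legacy model:", repr(payload[legacy_comment_end(payload):]))
```

Under the HTML5 model the `<xss>` tag falls outside the comment, while the legacy model swallows the whole payload, which is the mismatch a sanitizer built on the legacy view would miss.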
32+
33+
#### 2. RCDATA/RAWTEXT elements
34+
In [HTML5](https://html.spec.whatwg.org/#parsing-html-fragments), other element parsing types were introduced:
35+
* RCDATA
36+
  * textarea
37+
  * title
38+
* RAWTEXT
39+
  * noframes
40+
  * noembed
41+
  * iframe
42+
  * xmp
43+
  * style
44+
* OTHERS
45+
  * noscript - depends if [scripting](https://html.spec.whatwg.org/#the-noscript-element) is enabled (enabled by default in browsers).
46+
  * plaintext
47+
  * script
48+
49+
PHP’s parser is oblivious to these distinctions. As a result, there are multiple ways an attacker can bypass a sanitizer that relies on it, such as:
50+
* `<iframe><!--</iframe><xss>--></iframe>`
51+
* `<noframes><style></noframes><xss></style></noframes>`
52+
* ...
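
To see why these payloads work, it helps to model how a RAWTEXT element is closed. The sketch below is a simplified illustration in Python, not a full tokenizer: `split_rawtext` is a hypothetical helper that assumes the payload starts with an attribute-less opening tag and that the element's content is plain text up to the first matching end tag.

```python
import re

# RAWTEXT elements as listed above.
RAWTEXT = {"noframes", "noembed", "iframe", "xmp", "style"}


def split_rawtext(payload: str, tag: str) -> tuple[str, str]:
    """Return (raw text inside the element, markup after its end tag)."""
    opener = f"<{tag}>"
    if tag not in RAWTEXT or not payload.lower().startswith(opener):
        raise ValueError("payload must start with a RAWTEXT opening tag")
    body = payload[len(opener):]
    # Only a matching end tag closes the element; "<!--" is just text here.
    m = re.search(rf"</{tag}[\t\n\f />]", body, re.IGNORECASE)
    if m is None:
        return body, ""
    gt = body.find(">", m.start())
    rest = "" if gt == -1 else body[gt + 1:]
    return body[:m.start()], rest


for payload, tag in [
    ("<iframe><!--</iframe><xss>--></iframe>", "iframe"),
    ("<noframes><style></noframes><xss></style></noframes>", "noframes"),
]:
    text, rest = split_rawtext(payload, tag)
    print(f"{tag}: text={text!r} then markup={rest!r}")
```

In both cases the `<xss>` tag ends up in the markup that follows the element, whereas a parser unaware of RAWTEXT would keep reading the comment or the nested `style` element and hide the tag from the sanitizer.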
53+
#### 3. Foreign content elements
54+
HTML5 introduced two foreign elements ([math](https://html.spec.whatwg.org/#mathml) and [svg](https://html.spec.whatwg.org/#svg-0)) which follow different parsing rules than HTML. PHP’s parser doesn’t take this into account either, which causes further parsing differentials and sanitizer bypasses such as:
55+
* `<svg><p><style><!--</style><xss>--></style>`
56+
* ...
57+
58+
#### 4. DOCTYPE element
59+
The `!DOCTYPE` [element in XML/XHTML](https://www.w3.org/TR/xml/#NT-doctypedecl) is more complex, allowing more characters and element nesting than in [HTML5](https://html.spec.whatwg.org/#the-doctype). In contrast, the HTML doctype ends with the [first occurrence](https://html.spec.whatwg.org/#doctype-state) of the “greater than” sign `>`, even inside a quoted identifier.
60+
Parsing either of the following strings renders an `xss` tag in the browser but not in PHP:
61+
* `<!DOCTYPE HTML PUBLIC "-//W3C//DTDHTML4.01//EN" "><xss>">`
62+
* `<!DOCTYPE HTML SYSTEM "><xss>">`
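
The difference comes down to whether a `>` inside a quoted identifier ends the declaration. The following Python sketch contrasts the two readings; both function names are hypothetical, and the quote-aware scanner is only a simplified stand-in for XML-style parsing (it ignores internal subsets, for example), not libxml2's implementation.

```python
def html5_doctype_end(s: str) -> int:
    """HTML rule: the doctype ends at the first ">", quoted or not."""
    i = s.find(">")
    return len(s) if i == -1 else i + 1


def quote_aware_doctype_end(s: str) -> int:
    """Assumed XML-style rule: a ">" inside a quoted literal does not count."""
    quote = None
    for i, ch in enumerate(s):
        if quote is not None:
            if ch == quote:
                quote = None  # closing quote of the current literal
        elif ch in "\"'":
            quote = ch  # opening quote of a literal
        elif ch == ">":
            return i + 1
    return len(s)


payload = '<!DOCTYPE HTML SYSTEM "><xss>">'
# What each model leaves after the doctype, i.e. what gets parsed as markup.
print("HTML5 model      :", repr(payload[html5_doctype_end(payload):]))
print("quote-aware model:", repr(payload[quote_aware_doctype_end(payload):]))
```

The HTML5 model leaves `<xss>` outside the doctype, while the quote-aware model consumes the entire payload as a single declaration.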
63+
64+
### Impact
65+
Sanitizers that rely on the built-in PHP parser are inherently vulnerable to bypasses, because the parser and the browser can disagree about the structure of the same markup.
66+
67+
### Recommendation
68+
This issue is [known](https://wiki.php.net/rfc/domdocument_html5_parser), but it isn't obvious to PHP users. After this report, the PHP team added a red warning to the documentation:
69+
70+
* [loadhtml](https://www.php.net/manual/en/domdocument.loadhtml.php)
71+
* [loadhtmlfile](https://www.php.net/manual/en/domdocument.loadhtmlfile.php)
72+
* [Commit](https://github.com/php/doc-en/commit/4ef716f8aa753e1189b2e57c91da378b16d970b0)
