| 1 | +--- |
| 2 | +title: "PHP HTML parser differential due to libxml2 lack of HTML5 support" |
| 3 | +date: 2023-11-29 |
| 4 | +tags: |
| 5 | + - "parser differential" |
| 6 | + - "xss" |
| 7 | + - "mxss" |
| 8 | + - "bypass" |
| 9 | +advisory: true |
| 10 | +origin: |
| 11 | +cves: |
| 12 | +ghsas: |
| 13 | +--- |
| 14 | +### Summary |
| 15 | +The default HTML parser of PHP relies on the underlying library libxml2 ([for example here](https://github.com/php/php-src/blob/master/ext/dom/document.c#L1920)). Libxml2 doesn’t [currently support](https://gitlab.gnome.org/GNOME/libxml2/-/issues/211) HTML5 parsing. Work on this is in progress, but when we contacted the maintainers about this matter they said it will take a while before the feature is implemented. This means that the built-in HTML parser of PHP behind [loadHTML](https://www.php.net/manual/en/domdocument.loadhtml.php), [DOMImplementation](https://www.php.net/manual/en/class.domimplementation.php), etc. does not follow the same parsing rules as modern web browsers. |
| 16 | +This behaviour becomes security-relevant when HTML sanitizers use the built-in HTML parser. |
| 17 | +We have come across multiple PHP sanitizers that are vulnerable to bypasses due to using the built-in parser, and we think that the root cause can't be addressed without significant changes by libxml2. |
| 18 | + |
| 19 | +### PoC |
| 20 | +Here are some examples of how attackers can leverage these parsing differentials in order to bypass sanitizers. |
| 21 | + |
| 22 | +#### 1. Comments: |
| 23 | +According to the [XML specification](https://www.w3.org/TR/xml/#sec-comments) (XHTML), comments must end with the characters `-->`. On the other hand, the [HTML specification](https://html.spec.whatwg.org/multipage/syntax.html#comments) states that a comment's text “must not start with the string `>`, nor start with the string `->`”. |
| 24 | +When parsing the following string in a browser, the comment will end before the `p` tag. But when parsing with PHP, the `p` tag will be considered part of a comment: |
| 25 | +``` |
| 26 | +Input: <!--><p> |
| 27 | +Browser (HTML specification) output: <!----><p></p> |
| 28 | +PHP parser (XHTML specification) output: <!--><p>--> |
| 29 | +``` |
| 30 | +This can be done with either `<!-->` or `<!--->`. |
| 31 | +An attacker can input the following payload `<!--><xss>-->`. While the parser considers the xss tag as a comment, the browser will end the comment right before and render the xss tag as expected. |
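The two comment rules can be compared with a minimal sketch. This is not libxml2 or browser code: the function names `html_comment_end` and `xml_comment_end` are hypothetical, and the HTML variant only models the two abrupt-close cases discussed here (it ignores other HTML comment endings such as `--!>`).

```python
def xml_comment_end(markup: str) -> int:
    """Index just past a comment that starts at markup[0], XML-style:
    the comment only ends at the first '-->' found after the opening '<!--'."""
    assert markup.startswith("<!--")
    end = markup.find("-->", 4)
    return len(markup) if end == -1 else end + 3


def html_comment_end(markup: str) -> int:
    """Same question, HTML-style (simplified): '<!-->' and '<!--->' are
    treated as abruptly closed, empty comments."""
    assert markup.startswith("<!--")
    for abrupt in ("<!-->", "<!--->"):
        if markup.startswith(abrupt):
            return len(abrupt)
    return xml_comment_end(markup)


payload = "<!--><xss>-->"
# What is left over after the comment is what gets parsed as markup.
print(repr(payload[html_comment_end(payload):]))  # '<xss>-->'
print(repr(payload[xml_comment_end(payload):]))   # ''
```

Under the XML-style rule nothing is left after the comment, so a sanitizer sees no element to remove, while the HTML-style rule leaves `<xss>` as live markup.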
| 32 | + |
| 33 | +#### 2. RCDATA/RAWTEXT elements |
| 34 | +In [HTML5](https://html.spec.whatwg.org/#parsing-html-fragments), other element parsing types were introduced: |
| 35 | +* RCDATA |
| 36 | + * textarea |
| 37 | + * title |
| 38 | +* RAWTEXT |
| 39 | + * noframes |
| 40 | + * noembed |
| 41 | + * iframe |
| 42 | + * xmp |
| 43 | + * style |
| 44 | +* OTHERS |
| 45 | + * noscript - depends on whether [scripting](https://html.spec.whatwg.org/#the-noscript-element) is enabled (enabled by default in browsers). |
| 46 | + * plaintext |
| 47 | + * script |
| 48 | + |
| 49 | +PHP’s parser, however, is oblivious to these rules, so there are multiple ways an attacker can bypass a sanitizer through this incorrect parsing, such as: |
| 50 | +* `<iframe><!--</iframe><xss>--></iframe>` |
| 51 | +* `<noframes><style></noframes><xss></style></noframes>` |
| 52 | +* ... |
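The effect of the two payloads above can be illustrated with a toy tokenizer that only reports which start tags it would emit. It is a sketch under stated assumptions, not a model of any real parser: `start_tags`, `HTML5_RAWTEXT` and `LEGACY_RAW` are hypothetical names, and the assumption that the legacy parser treats only `script` and `style` content as raw text is made for illustration.

```python
import re

# Elements whose content is raw text (no tags, no comments) in each model.
HTML5_RAWTEXT = {"iframe", "noembed", "noframes", "style", "xmp"}
LEGACY_RAW = {"script", "style"}  # assumption for this sketch

START_TAG = re.compile(r"<([a-zA-Z][a-zA-Z0-9]*)[^>]*>")


def start_tags(markup: str, raw_elements: set[str]) -> list[str]:
    """Return the names of the start tags a simplified tokenizer would emit."""
    names, i = [], 0
    lowered = markup.lower()
    while i < len(markup):
        if markup.startswith("<!--", i):
            # Skip the whole comment; anything inside it is not a tag.
            end = markup.find("-->", i + 4)
            i = len(markup) if end == -1 else end + 3
            continue
        match = START_TAG.match(markup, i)
        if not match:
            i += 1
            continue
        name = match.group(1).lower()
        names.append(name)
        i = match.end()
        if name in raw_elements:
            # Raw text: jump straight to the matching end tag.
            close = lowered.find("</" + name, i)
            i = len(markup) if close == -1 else close
    return names


for payload in ("<iframe><!--</iframe><xss>--></iframe>",
                "<noframes><style></noframes><xss></style></noframes>"):
    print(start_tags(payload, HTML5_RAWTEXT), start_tags(payload, LEGACY_RAW))
# ['iframe', 'xss'] ['iframe']
# ['noframes', 'xss'] ['noframes', 'style']
```

In both cases only the HTML5-style model reports the `xss` element, which is the element a browser would go on to render.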
| 53 | +#### 3. Foreign content elements |
| 54 | +HTML5 introduced two foreign elements ([math](https://html.spec.whatwg.org/#mathml) and [svg](https://html.spec.whatwg.org/#svg-0)) which follow different parsing specifications than HTML. Again, PHP’s parser doesn’t take this into account, which causes further parsing differentials and sanitizer bypasses such as: |
| 55 | +* `<svg><p><style><!--</style><xss>--></style>` |
| 56 | +* ... |
| 57 | + |
| 58 | +#### 4. DOCTYPE element |
| 59 | +The `!DOCTYPE` [element in XML/XHTML](https://www.w3.org/TR/xml/#NT-doctypedecl) is more complex, allowing more characters and element nesting than in [HTML5](https://html.spec.whatwg.org/#the-doctype). In contrast, the HTML doctype ends with the [first occurrence](https://html.spec.whatwg.org/#doctype-state) of the “greater than” sign `>`. |
| 60 | +Parsing either of the following strings will render an xss tag in the browser but not in PHP: |
| 61 | +* `<!DOCTYPE HTML PUBLIC "-//W3C//DTDHTML4.01//EN" "><xss>">` |
| 62 | +* `<!DOCTYPE HTML SYSTEM "><xss>">` |
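A minimal sketch of the two doctype rules, assuming a simplified XML-style scanner that lets quoted literals contain `>` and ignores internal subsets. The function names are hypothetical and neither function is taken from a real parser.

```python
def doctype_end_html(markup: str) -> int:
    """HTML-style: the doctype ends at the first '>' no matter what."""
    end = markup.find(">")
    return len(markup) if end == -1 else end + 1


def doctype_end_quoted(markup: str) -> int:
    """XML-style (simplified): a '>' inside a quoted literal does not
    terminate the declaration."""
    quote = None
    for i, ch in enumerate(markup):
        if quote is not None:
            if ch == quote:
                quote = None  # closing quote of the current literal
        elif ch in "\"'":
            quote = ch        # opening quote of a literal
        elif ch == ">":
            return i + 1
    return len(markup)


payload = '<!DOCTYPE HTML SYSTEM "><xss>">'
# Whatever follows the doctype is parsed as ordinary markup.
print(repr(payload[doctype_end_html(payload):]))    # '<xss>">'
print(repr(payload[doctype_end_quoted(payload):]))  # ''
```

With the quote-aware rule the whole payload is consumed as a doctype, whereas the HTML-style rule leaves `<xss>` behind as an element.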
| 63 | + |
| 64 | +### Impact |
| 65 | +Sanitizers that rely on the built-in PHP parser are inherently vulnerable to bypasses, because the parser interprets markup differently from browsers. |
| 66 | + |
| 67 | +### Recommendation |
| 68 | +This issue is [known](https://wiki.php.net/rfc/domdocument_html5_parser), but it isn't obvious to users of PHP. After this report, the PHP team added a red warning to the documentation: |
| 69 | + |
| 70 | +* [loadhtml](https://www.php.net/manual/en/domdocument.loadhtml.php) |
| 71 | +* [loadhtmlfile](https://www.php.net/manual/en/domdocument.loadhtmlfile.php) |
| 72 | +* [Commit](https://github.com/php/doc-en/commit/4ef716f8aa753e1189b2e57c91da378b16d970b0) |