Commit 296711b

Support character references for &, <, >, ' and " (#5)
Polyglot HTML 5 markup (i.e. HTML 5 written so that it is also valid XML) uses only very few named entity references:

> Polyglot markup uses only the following named entity references:
> amp lt gt apos quot

https://www.w3.org/TR/html-polyglot/#named-entity-references

To support working with content that was created before HTML5 (that is, XHTML1), we substitute all named and character references with their plain values, which should not pose a problem in UTF-8 content. Only `&amp;`, `&lt;`, `&gt;`, `&quot;` and `&apos;` shall be kept.

We missed, however, that e.g. `&amp;` can also be written as `&#38;`; the same holds for the other characters. This PR adds support for these cases as well.
1 parent b7308a3 commit 296711b

1 file changed: +6 -3 lines changed
src/Webfactory/Dom/PolyglotHTML5ParsingHelper.php

Lines changed: 6 additions & 3 deletions
@@ -20,11 +20,14 @@ protected function sanitize($xml)
         $xml = str_replace('xmlns="http://www.w3.org/2000/svg"', '_xmlns="http://www.w3.org/2000/svg"', $xml);

         $escaped = str_replace(
-            array('&amp;', '&lt;', '&gt;', '&quot;', '&apos;'),
-            array('&amp;amp;', '&amp;lt;', '&amp;gt;', '&amp;quot;', '&amp;apos;'),
+            ['&amp;', '&#38;', '&lt;', '&#60;', '&gt;', '&#62;', '&quot;', '&#34;', '&apos;', '&#39;'],
+            ['&amp;amp;', '&amp;amp;', '&amp;lt;', '&amp;lt;', '&amp;gt;', '&amp;gt;', '&amp;quot;', '&amp;quot;', '&amp;apos;', '&amp;apos;'],
             $xml
         );
-        return html_entity_decode($escaped, ENT_QUOTES, 'UTF-8');
+
+        $decoded = html_entity_decode($escaped, ENT_QUOTES, 'UTF-8');
+
+        return $decoded;
     }

     protected function fixDump($dump)
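
The escape-then-decode step in this hunk can be tried in isolation. The following is only a minimal sketch: the function name `decodeAllButXmlEntities` is hypothetical and not part of the library, and the SVG namespace handling from `sanitize()` is left out. It first double-escapes the five references that must survive (in named and decimal form) and then lets `html_entity_decode()` resolve everything else.

```php
<?php

// Hypothetical standalone helper, assumed for illustration only;
// it mirrors the two steps shown in the diff above.
function decodeAllButXmlEntities(string $xml): string
{
    // Step 1: double-escape the references that must be kept,
    // so that the decode step below turns them back into single-escaped form.
    $escaped = str_replace(
        ['&amp;', '&#38;', '&lt;', '&#60;', '&gt;', '&#62;', '&quot;', '&#34;', '&apos;', '&#39;'],
        ['&amp;amp;', '&amp;amp;', '&amp;lt;', '&amp;lt;', '&amp;gt;', '&amp;gt;', '&amp;quot;', '&amp;quot;', '&amp;apos;', '&amp;apos;'],
        $xml
    );

    // Step 2: decode all remaining named and numeric references to UTF-8.
    return html_entity_decode($escaped, ENT_QUOTES, 'UTF-8');
}

echo decodeAllButXmlEntities('Caf&eacute; &#38; Bar &#60;b&#62;'), "\n";
// prints: Café &amp; Bar &lt;b&gt;
```

Like the search list in the diff, this sketch covers only the named and decimal forms; hexadecimal forms such as `&#x26;` are not in the list.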
