-
Notifications
You must be signed in to change notification settings - Fork 31
feat: refactor dom manipulation #355
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
CodeAnt AI is reviewing your PR. |
|
PR Compliance Guide 🔍Below is a summary of compliance checks for this PR:
Compliance status legend🟢 - Fully Compliant🟡 - Partial Compliant 🔴 - Not Compliant ⚪ - Requires Further Human Verification 🏷️ - Compliance label |
|||||||||||||||||||||||||
PR Code Suggestions ✨Explore these optional code suggestions:
|
|||||||||
|
PR Summary: Refactor script tag manipulation to use DOMDocument instead of regex; add DOM-based validation and tests.
|
| <?php | ||
| $buffer = '<div></div><script src="other.js"></script><script src="tracking.js"></script>'; | ||
| $wrapped_buffer = '<dummy>' . $buffer . '</dummy>'; | ||
|
|
||
| $dom = new DOMDocument(); | ||
| $encoded_buffer = mb_encode_numericentity( $wrapped_buffer, array( 0x80, 0x10FFFF, 0, 0x1FFFFF ), 'UTF-8' ); | ||
| $dom->loadHTML( $encoded_buffer, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD ); | ||
|
|
||
| $dummy = $dom->getElementsByTagName('dummy')->item(0); | ||
|
|
||
| echo "Child Nodes: " . $dummy->childNodes->length . "\n"; | ||
| foreach ($dummy->childNodes as $node) { | ||
| echo "Node: " . $node->nodeName . "\n"; | ||
| echo "Content: " . $dom->saveHTML($node) . "\n"; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[NITPICK] This file appears to be a local debugging script left in the commit. Remove debug_dom.php from the PR (or move it to a non-shipped dev-only location). Leaving debug utilities in the repository can accidentally expose information and increases maintenance noise.
| $dom = new DOMDocument(); | ||
|
|
||
| // Suppress errors for partial HTML |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[CRITICAL_BUG] The new logic depends on the DOMDocument extension (instantiating new DOMDocument()). If ext-dom is not available in the target environment this will fatal error. Add a graceful fallback (either keep the original regex-based implementation as a fallback or check class_exists('DOMDocument')/extension_loaded('dom') and return $buffer or run the regex approach). Ensure the code never triggers a fatal error in environments without DOM.
if (!class_exists('DOMDocument') || !extension_loaded('dom')) {
// Fallback to regex-based implementation
// (Insert the old regex logic here, or simply return $buffer)
return $buffer;
}| $libxml_previous_state = libxml_use_internal_errors( true ); | ||
|
|
||
| // Wrap buffer in a custom tag to ensure correct parsing of fragments (e.g. multiple siblings at root) | ||
| // This prevents DOMDocument from trying to fix structure by nesting siblings | ||
| $wrapped_buffer = '<cookiebot-wrapper>' . $buffer . '</cookiebot-wrapper>'; | ||
|
|
||
| $normalized_buffer = preg_replace( '/(<script(.*?)\/>)/is', '<script$2></script>', $buffer ); | ||
| // Load HTML with UTF-8 encoding hack | ||
| // The mb_convert_encoding is to ensure we don't have encoding issues | ||
| // We use LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD to avoid adding <html><body> wrappers automatically | ||
| // Replacement for deprecated mb_convert_encoding(..., 'HTML-ENTITIES', 'UTF-8') | ||
| $encoded_buffer = mb_encode_numericentity( $wrapped_buffer, array( 0x80, 0x10FFFF, 0, 0x1FFFFF ), 'UTF-8' ); | ||
| $dom->loadHTML( $encoded_buffer, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD ); | ||
|
|
||
| if ( $normalized_buffer !== null ) { | ||
| $buffer = $normalized_buffer; | ||
| libxml_use_internal_errors( $libxml_previous_state ); | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[PERFORMANCE_OPTIMIZATION] You call libxml_use_internal_errors(true) and then restore the previous state after loadHTML but don't clear libxml errors. Call libxml_clear_errors() after loadHTML (before restoring the previous state) to avoid accumulating libxml errors in long-running processes or tests.
$dom->loadHTML( $encoded_buffer, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD );
libxml_clear_errors();
libxml_use_internal_errors( $libxml_previous_state );| } | ||
| } | ||
| } | ||
|
|
||
| if ( $modified ) { | ||
| // Save HTML | ||
| // We extract children of our wrapper | ||
| $wrapper = $dom->getElementsByTagName( 'cookiebot-wrapper' )->item( 0 ); | ||
| if ( $wrapper ) { | ||
| $output = ''; | ||
| foreach ( $wrapper->childNodes as $node ) { | ||
| $output .= $dom->saveHTML( $node ); | ||
| } | ||
| return $output; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[VALIDATION] Behavior change: the old implementation updated option 'cookiebot_regex_stacklimit' on a regex fallback. The new DOM-based path removed that fallback and no longer writes that option. If other code relies on that option being set to detect regex failures, adapt or preserve that behavior in a DOM fallback branch to maintain compatibility.
// In the fallback branch (regex failure), preserve the option update
if ($updated_scripts === null) {
$updated_scripts = $buffer;
if (get_option('cookiebot_regex_stacklimit') === false) {
update_option('cookiebot_regex_stacklimit', 1);
}
}| // Use DOMDocument to safely parse and modify the script tag | ||
| $dom = new DOMDocument(); | ||
|
|
||
| // Suppress errors for partial HTML | ||
| $libxml_previous_state = libxml_use_internal_errors( true ); | ||
|
|
||
| // Load HTML with UTF-8 encoding hack | ||
| // The mb_convert_encoding is to ensure we don't have encoding issues | ||
| // Replacement for deprecated mb_convert_encoding(..., 'HTML-ENTITIES', 'UTF-8') | ||
| $encoded_tag = mb_encode_numericentity( $tag, array( 0x80, 0x10FFFF, 0, 0x1FFFFF ), 'UTF-8' ); | ||
| $dom->loadHTML( $encoded_tag, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD ); | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[CRITICAL_BUG] This method now uses DOMDocument (new DOMDocument() at lines 96-100). As with helper.php, you must guard against missing ext-dom (class DOMDocument not found) or add a fallback. Without the DOM extension this will fatal error. Either check class_exists('DOMDocument') and return original $tag or fall back to the previous preg_replace_callback approach.
if ( ! class_exists( 'DOMDocument' ) ) {
// Fallback to previous regex approach or return original $tag
return preg_replace_callback(
'/<script\s*(?<atts>[^>]*)>/',
function ( $tag ) use ( $handle ) {
if ( ! self::validate_attributes_for_consent_ignore( $handle, $tag['atts'] ) ) {
return $tag[0];
}
return str_replace( '<script ', '<script data-cookieconsent="ignore" ', $tag[0] );
},
$tag
);
}
// ... continue with DOMDocument logic| $html = preg_replace_callback('/<script([^>]*)>/', function($matches) { | ||
| $attrs = $matches[1]; | ||
| preg_match_all('/(\w+)(?:="([^"]*)")?/', $attrs, $attr_matches, PREG_SET_ORDER); | ||
| $sorted_attrs = []; | ||
| foreach ($attr_matches as $m) { | ||
| $name = $m[1]; | ||
| $value = isset($m[2]) ? $m[2] : ''; | ||
| $sorted_attrs[$name] = $value; | ||
| } | ||
| ksort($sorted_attrs); | ||
| $new_attrs = ''; | ||
| foreach ($sorted_attrs as $k => $v) { | ||
| if ($v === '') { | ||
| $new_attrs .= " $k"; | ||
| } else { | ||
| $new_attrs .= " $k=\"$v\""; | ||
| } | ||
| } | ||
| return "<script$new_attrs>"; | ||
| }, $html); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[REFACTORING] The attribute sorting regex in normalize_html uses /(\w+)(?:="([^"]*)")?/ which doesn't match attribute names containing hyphens (e.g. data-cookieconsent, data-cfasync). Update the attribute name pattern to allow hyphens (e.g. [A-Za-z0-9_-]+) so attribute sorting is robust and tests don't produce false negatives because of attribute name parsing issues.
// In normalize_html, update the attribute regex to allow hyphens in attribute names
$html = preg_replace_callback('/<script([^>]*)>/', function($matches) {
$attrs = $matches[1];
// Updated regex: attribute names can include hyphens and underscores
preg_match_all('/([A-Za-z0-9_\-]+)(?:="([^"]*)")?/', $attrs, $attr_matches, PREG_SET_ORDER);
$sorted_attrs = [];
foreach ($attr_matches as $m) {
$name = $m[1];
$value = isset($m[2]) ? $m[2] : '';
$sorted_attrs[$name] = $value;
}
ksort($sorted_attrs);
$new_attrs = '';
foreach ($sorted_attrs as $k => $v) {
if ($v === '') {
$new_attrs .= " $k";
} else {
$new_attrs .= " $k=\"$v\"";
}
}
return "<script$new_attrs>";
}, $html);| if (!function_exists('esc_url')) { | ||
| function esc_url($url) { return $url; } | ||
| } | ||
| if (!function_exists('esc_attr')) { | ||
| function esc_attr($text) { return htmlspecialchars($text, ENT_QUOTES); } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[NITPICK] Test helper esc_attr returns htmlspecialchars(..., ENT_QUOTES) while esc_url returns original value. This difference is OK for tests but can hide encoding differences when new code uses esc_url/esc_attr. Consider making esc_url a lightweight normalizer (e.g. return filter_var($url, FILTER_SANITIZE_URL)) to better reflect production behavior in tests.
function esc_url($url) {
return filter_var($url, FILTER_SANITIZE_URL);
}|
Reviewed up to commit:0dcba638733eb2176a7a3eb572eb3f500c76cbbc Additional Suggestionsrc/lib/script_loader_tag/Script_Loader_Tag.php, line:218-256validate_attributes_for_consent_ignore_dom reimplements ID matching logic using string checks and a small list of suffixes. This can miss edge cases compared to the original regex (for example different suffix patterns or subtle ID forms). Reuse the original preg_match pattern (converted to operate on the DOMElement's id attribute) or centralize the matching logic so both extract_base_id_from_inline_id() and this validator agree on allowed suffixes. That avoids regressions for inline script IDs created by WP or build tools.private function validate_attributes_for_consent_ignore_dom( $script_handle, $id, $script ) {
// Use the same regex as extract_base_id_from_inline_id for suffixes
$base_id = preg_replace( '/-js-(extra|after|before)$/', '', $id );
if ( $base_id !== $script_handle ) {
return false;
}
if ( $script->hasAttribute( 'data-cookieconsent' ) ) {
return false;
}
return true;
}
// Or, move the regex pattern to a shared method and use it in both places. |
| $script->setAttribute( 'data-cookieconsent', 'ignore' ); | ||
|
|
||
| // Save HTML | ||
| return $dom->saveHTML( $script ); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion: After the DOM validation, inject the data-cookieconsent attribute into the original $tag string instead of returning $dom->saveHTML() so the markup is not mutated by DOMDocument serialization. [possible issue]
|
CodeAnt AI finished reviewing your PR. |



User description
PR Type
Enhancement
Description
Replace regex-based DOM manipulation with DOMDocument for safer HTML parsing
Add UTF-8 encoding support using mb_encode_numericentity to handle special characters
Implement wrapper tag strategy to preserve HTML fragments during parsing
Add security escaping (esc_url, esc_attr) for output sanitization
Convert attribute validation from regex to DOM element methods
Add comprehensive test suite comparing old and new implementations
Diagram Walkthrough
File Walkthrough
helper.php
Convert script manipulation from regex to DOMDocumentsrc/lib/helper.php
cookiebot_addons_manipulate_script()from regex-based toDOMDocument-based implementation
mb_encode_numericentity()to handle specialcharacters safely
prevent unwanted nesting
attribute manipulation
libxml_use_internal_errors()DOMDocumentandDOMElementclassesScript_Loader_Tag.php
Replace regex parsing with DOMDocument in script loadersrc/lib/script_loader_tag/Script_Loader_Tag.php
cookiebot_add_consent_attribute_to_tag()to use DOMDocumentinstead of regex for script tag parsing
validate_attributes_for_consent_ignore_dom()method using DOM elementAPI
mb_encode_numericentity()forconsistent HTML parsing
esc_url()andesc_attr()for outputsanitization
DOMDocumentandDOMElementclassescompare_output.php
Add test suite for DOM refactoring validationtests/compare_output.php
implementation against old regex-based logic
normalize_html()function to handle whitespace, quotenormalization, and attribute sorting
existing types, and fragments
cookiebot_addons_manipulate_script()andScript_Loader_Tagimplementations
test cases
old_logic.php
Store original regex implementations for testingtests/old_logic.php
testing
cookiebot_addons_manipulate_script_old()function withoriginal regex patterns
Script_Loader_Tag_Oldclass with original regex-basedattribute validation
equivalence
mock_wp_functions.php
Add WordPress function mocks for testingtests/mock_wp_functions.php
testing
esc_url(),esc_attr(),esc_html()get_option(),update_option()add_action(),add_filter(),apply_filters()debug_dom.php
Add DOMDocument parsing debug scriptdebug_dom.php
mb_encode_numericentity()usageCodeAnt-AI Description
Switch script consent tagging to DOM parsing with regression tests
What Changed
type="text/plain"anddata-cookieconsentattributes instead of relying on fragile regex.Impact
✅ Stable consent tagging for keyword-matched scripts✅ Accurate ignore flags for identified third-party scripts✅ Confidence from regression tests💡 Usage Guide
Checking Your Pull Request
Every time you make a pull request, our system automatically looks through it. We check for security issues, mistakes in how you're setting up your infrastructure, and common code problems. We do this to make sure your changes are solid and won't cause any trouble later.
Talking to CodeAnt AI
Got a question or need a hand with something in your pull request? You can easily get in touch with CodeAnt AI right here. Just type the following in a comment on your pull request, and replace "Your question here" with whatever you want to ask:
This lets you have a chat with CodeAnt AI about your pull request, making it easier to understand and improve your code.
Example
Preserve Org Learnings with CodeAnt
You can record team preferences so CodeAnt AI applies them in future reviews. Reply directly to the specific CodeAnt AI suggestion (in the same thread) and replace "Your feedback here" with your input:
This helps CodeAnt AI learn and adapt to your team's coding style and standards.
Example
Retrigger review
Ask CodeAnt AI to review the PR again, by typing:
Check Your Repository Health
To analyze the health of your code repository, visit our dashboard at https://app.codeant.ai. This tool helps you identify potential issues and areas for improvement in your codebase, ensuring your repository maintains high standards of code health.