Summary
When printFormatPreserving reprints a Scalar\String_ node whose value has changed, it falls back to pSingleQuotedString. That method's regex contains a third alternative (?<=\\)\\ that escapes a backslash simply because it follows another backslash, even when no escaping is required for correctness.
This turns valid non-canonical source like '\\\|' (3 backslashes) into '\\\\|' (4 backslashes), producing a different but semantically equivalent string.
See
|
protected function pSingleQuotedString(string $string): string { |
|
// It is idiomatic to only escape backslashes when necessary, i.e. when followed by ', \ or |
|
// the end of the string ('Foo\Bar' instead of 'Foo\\Bar'). However, we also don't want to |
|
// produce an odd number of backslashes, so '\\\\a' should not get rendered as '\\\a', even |
|
// though that would be legal. |
|
$regex = '/\'|\\\\(?=[\'\\\\]|$)|(?<=\\\\)\\\\/'; |
|
return '\'' . preg_replace($regex, '\\\\$0', $string) . '\''; |
|
} |
Expected behaviour
The reprinted string preserves the original three-backslash encoding '\\\|' when format-preserving pretty printer is used.
Actual behaviour
The reprinted string uses four backslashes '\\\\|', adding one backslash that was not
present in the original source.
Root cause
Standard::pSingleQuotedString uses this regex:
$regex = '/\'|\\\\(?=[\'\\\\]|$)|(?<=\\\\)\\\\/';
// ^^^^^^^^^^^^^^ third alternative
Possible fix
For format-preserving pretty printing, we can probably remove the part where extra \ is added:
// before
$regex = '/\'|\\\\(?=[\'\\\\]|$)|(?<=\\\\)\\\\/';
// after
$regex = '/\'|\\\\(?=[\'\\\\]|$)/';
How to reproduce? I will add the full script that demonstrates a bug, but the crucial thing is that we modify the string (regex) before pretty-printing
Reproducer
<?php
require_once __DIR__ . '/vendor/autoload.php';
use PhpParser\NodeFinder;
use PhpParser\NodeTraverser;
use PhpParser\NodeVisitor\CloningVisitor;
use PhpParser\ParserFactory;
use PhpParser\PrettyPrinter\Standard;
// ---------------------------------------------------------------------------
// Original PHP source containing the non-canonical regex.
// '\\\|' is 3 backslashes in source: '\\' (→ \) + '\|' (→ \| literally),
// so the string VALUE has 2 backslashes + pipe — valid but non-canonical.
// ---------------------------------------------------------------------------
$originalCode = <<<'PHP'
<?php
preg_match('/(?:.*controller\\\|.*controllers\\\)([\w\\\]+)controller$/iU', '', $m);
PHP;
$parser = (new ParserFactory())->createForHostVersion();
$originalStatements = $parser->parse($originalCode);
$originalTokens = $parser->getTokens();
// Clone the AST (same as Infection's CloningVisitor step in the mutation pipeline).
$clonedStatements = (new NodeTraverser(new CloningVisitor()))->traverse($originalStatements);
// Find the String_ node in the clone and change its value to simulate
// PregMatchRemoveFlags removing the 'U' flag: '/…/iU' → '/…/i'.
$stringNode = (new NodeFinder())->findFirstInstanceOf(
$clonedStatements,
PhpParser\Node\Scalar\String_::class,
);
$originalValue = $stringNode->value; // current decoded value (ends with /iU)
$stringNode->value = rtrim($originalValue, 'U'); // mutated decoded value (ends with /i)
echo "Original decoded value : " . json_encode($originalValue) . "\n";
echo "Mutated decoded value : " . json_encode($stringNode->value) . "\n\n";
// ---------------------------------------------------------------------------
// Print with the DEFAULT Standard printer (exhibits the bug).
// ---------------------------------------------------------------------------
$standardPrinter = new Standard();
$standardOutput = $standardPrinter->printFormatPreserving(
$clonedStatements,
$originalStatements,
$originalTokens,
);
// ---------------------------------------------------------------------------
// Print with the fixed printer (drops the (?<=\\)\\ alternative that causes
// the extra backslash).
// ---------------------------------------------------------------------------
$infectionPrinter = new class extends Standard {
protected function pSingleQuotedString(string $string): string
{
return '\'' . preg_replace('/\'|\\\\(?=[\'\\\\]|$)/', '\\\\$0', $string) . '\'';
}
};
$infectionOutput = $infectionPrinter->printFormatPreserving(
$clonedStatements,
$originalStatements,
$originalTokens,
);
echo "=== Standard printer (BUG: \\\\\\\\| instead of \\\\\\|) ===\n";
echo $standardOutput . "\n\n";
echo "=== InfectionPrettyPrinter (FIXED: \\\\\\| preserved) ===\n";
echo $infectionOutput . "\n\n";
// ---------------------------------------------------------------------------
// Explicit assertion so the script exits non-zero on failure.
// ---------------------------------------------------------------------------
$threeBackslashes = str_repeat('\\', 3);
$fourBackslashes = str_repeat('\\', 4);
assert(
str_contains($standardOutput, $fourBackslashes . '|'),
'Expected Standard printer to produce 4 backslashes before |',
);
assert(
str_contains($infectionOutput, $threeBackslashes . '|'),
'Expected InfectionPrettyPrinter to produce 3 backslashes before |',
);
assert(
!str_contains($infectionOutput, $fourBackslashes . '|'),
'Expected InfectionPrettyPrinter NOT to produce 4 backslashes before |',
);
echo "All assertions passed.\n";
Summary
When
printFormatPreservingreprints aScalar\String_node whose value has changed, it falls back topSingleQuotedString. That method's regex contains a third alternative(?<=\\)\\that escapes a backslash simply because it follows another backslash, even when no escaping is required for correctness.This turns valid non-canonical source like
'\\\|'(3 backslashes) into'\\\\|'(4 backslashes), producing a different but semantically equivalent string.See
PHP-Parser/lib/PhpParser/PrettyPrinter/Standard.php
Lines 1083 to 1090 in 50f0d9c
Expected behaviour
The reprinted string preserves the original three-backslash encoding
'\\\|'when format-preserving pretty printer is used.Actual behaviour
The reprinted string uses four backslashes
'\\\\|', adding one backslash that was notpresent in the original source.
Root cause
Standard::pSingleQuotedStringuses this regex:Possible fix
For format-preserving pretty printing, we can probably remove the part where extra
\is added:How to reproduce? I will add the full script that demonstrates a bug, but the crucial thing is that we modify the string (regex) before pretty-printing
\\\infection/infection#3022Reproducer