Skip to content

Conversation

@sbulen
Copy link
Contributor

@sbulen sbulen commented Oct 23, 2025

Fixes #8980

pg & mysql escape literals a bit differently. The rules that work for one won't work for the other. pg will escape single quotes by doubling them, '', mysql will escape them with a backslash, \'. Also, mysql escapes backslashes, \\, where pg doesn't.

So you cannot identify & clean the literals with the same logic.

The 'query check' logic makes a fully fleshed out copy of the current query, inserting values & smf substitutions, in order to do some syntax checks for behavior that SMF forbids. It wants to focus on the sql syntax itself, not anything that might only look like syntax, so it strips all string literals. Then it can easily check for forbidden syntax - internal comments, stacked sql commands, sleep() & benchmark() functions. The cleaned copy is discarded after this check.

Here are two properly escaped strings in pg:
'Hello world \'
'subject: ''SMF 2.0.17 has been released'','

And here are the same strings in mysql:
'Hello world \\'
'subject: \'SMF 2.0.17 has been released\','

So... The solution here was to remove the respective escapes first, making removing the string literals more straightforward. For mysql, this was a two step process - had to get rid of escaped backslashes first, in order to make the distinction between an escaped backslash with a valid end-of-string single quote (\\') and an escaped mid-string single quote (\').

An aside: during A/B testing, I ran across one of those elusive 'mysql result expected false given' errors. This may have been a (the?) source of them.

Another benefit of these fixes: Performance. On very large inserts, with lots of literals, the strpos() calls in the old method were insanely inefficient. (I have a mod that loads large files, and disabling the query check improves load time ~10x... Viewing the execution profile in WinCacheGrind, virtually all the time was spent executing strpos()... MUCH faster now.)

@live627
Copy link
Contributor

live627 commented Oct 24, 2025

 === Generating very large INSERT with 50000 rows ===
Generated SQL length: 4,016,651 bytes in 0.035s

=== Running SMF 2.1.6 cleaner on the huge INSERT ===
Cleaner result: ALLOWED
Cleaner runtime: 4.602 seconds
Peak memory: 2.79 MB

=== Running SMF 2.1.7 cleaner on the huge INSERT ===
Cleaner result: ALLOWED
Cleaner runtime: 0.021 seconds
Peak memory: 2.62 MB

=== Running sbulen cleaner on the huge INSERT ===
Cleaner result: ALLOWED
Cleaner runtime: 0.019 seconds
Peak memory: 2.62 MB

=== Running SMF 3 cleaner on the huge INSERT ===
Cleaner result: ALLOWED
Cleaner runtime: 0.021 seconds
Peak memory: 15.30 MB

PHP 8.2 lets me reset peak memory between runs.

@sbulen
Copy link
Contributor Author

sbulen commented Oct 24, 2025

Yep. The benefits of getting rid of the old strpos() method are significant.

@live627
Copy link
Contributor

live627 commented Oct 24, 2025

I'll share my testing script, which also includes a version of this query checking code that I was preparing for inclusion in SMF 3 that is even faster and has a smaller memory footprint. Requires at least PHP 8.2.

<?php

/**
 * SMF Query Cleaning Benchmark (Old vs Optimized)
 * Includes per-query average time (µs/query)
 */

// ------------------------------------------------------------
// OLD cleaner (slow baseline)
// ------------------------------------------------------------
function smf2_cleaner($db_string)
{
	// Comments that are allowed in a query are preg_removed.
	static $allowed_comments_from = array(
		'~\s+~s',
		'~/\*!40001 SQL_NO_CACHE \*/~',
		'~/\*!40000 USE INDEX \([A-Za-z\_]+?\) \*/~',
		'~/\*!40100 ON DUPLICATE KEY UPDATE id_msg = \d+ \*/~',
	);
	static $allowed_comments_to = array(
		' ',
		'',
		'',
		'',
	);

	$clean = '';
	$old_pos = 0;
	$pos = -1;
	// Remove the string escape for better runtime
	$db_string_1 = str_replace('\\\'', '', $db_string);
	while (true)
	{
		$pos = strpos($db_string_1, '\'', $pos + 1);
		if ($pos === false)
			break;
		$clean .= substr($db_string_1, $old_pos, $pos - $old_pos);

		while (true)
		{
			$pos1 = strpos($db_string_1, '\'', $pos + 1);
			$pos2 = strpos($db_string_1, '\\', $pos + 1);
			if ($pos1 === false)
				break;
			elseif ($pos2 === false || $pos2 > $pos1)
			{
				$pos = $pos1;
				break;
			}

			$pos = $pos2 + 1;
		}
		$clean .= ' %s ';

		$old_pos = $pos + 1;
	}
	$clean .= substr($db_string_1, $old_pos);
	$clean = trim(strtolower(preg_replace($allowed_comments_from, $allowed_comments_to, $clean)));

	// Comments?  We don't use comments in our queries, we leave 'em outside!
	if (strpos($clean, '/*') > 2 || strpos($clean, '--') !== false || strpos($clean, ';') !== false)
		$fail = true;
	// Trying to change passwords, slow us down, or something?
	elseif (strpos($clean, 'sleep') !== false && preg_match('~(^|[^a-z])sleep($|[^[_a-z])~s', $clean) != 0)
		$fail = true;
	elseif (strpos($clean, 'benchmark') !== false && preg_match('~(^|[^a-z])benchmark($|[^[a-z])~s', $clean) != 0)
		$fail = true;

	return empty($fail);
}

// ------------------------------------------------------------
// OLD cleaner (slow baseline)
// ------------------------------------------------------------
function smf217_cleaner($db_string)
{
	// Comments that are allowed in a query are preg_removed.
	static $allowed_comments_from = array(
		'~(?<![\'\\\\])\'\X*?(?<![\'\\\\])\'~',
		'~\s+~s',
		'~/\*!40001 SQL_NO_CACHE \*/~',
		'~/\*!40000 USE INDEX \([A-Za-z\_]+?\) \*/~',
		'~/\*!40100 ON DUPLICATE KEY UPDATE id_msg = \d+ \*/~',
	);

	static $allowed_comments_to = array(
		' %s ',
		' ',
		'',
		'',
		'',
	);

	$clean = trim(strtolower(preg_replace($allowed_comments_from, $allowed_comments_to, $db_string)));

	// Comments?  We don't use comments in our queries, we leave 'em outside!
	if (strpos($clean, '/*') > 2 || strpos($clean, '--') !== false || strpos($clean, ';') !== false)
		$fail = true;
	// Trying to change passwords, slow us down, or something?
	elseif (strpos($clean, 'sleep') !== false && preg_match('~(^|[^a-z])sleep($|[^[_a-z])~s', $clean) != 0)
		$fail = true;
	elseif (strpos($clean, 'benchmark') !== false && preg_match('~(^|[^a-z])benchmark($|[^[a-z])~s', $clean) != 0)
		$fail = true;

	return empty($fail);
}

// ------------------------------------------------------------
// OLD cleaner (slow baseline)
// ------------------------------------------------------------
function sbulen_cleaner($db_string)
{
	// Comments that are allowed in a query are preg_removed.
	static $allowed_comments_from = array(
		'~\'\X*?\'~s',
		'~\s+~s',
		'~/\*!40001 SQL_NO_CACHE \*/~',
		'~/\*!40000 USE INDEX \([A-Za-z\_]+?\) \*/~',
		'~/\*!40100 ON DUPLICATE KEY UPDATE id_msg = \d+ \*/~',
	);

	static $allowed_comments_to = array(
		' %s ',
		' ',
		'',
		'',
		'',
	);

	// Clear out escaped backslashes & single quotes first, to make it simpler to ID & remove string literals
	$clean = str_replace(array('\\\\', '\\\''), array('', ''), $db_string);
	$clean = trim(strtolower(preg_replace($allowed_comments_from, $allowed_comments_to, $clean)));

	// Comments?  We don't use comments in our queries, we leave 'em outside!
	if (strpos($clean, '/*') > 2 || strpos($clean, '--') !== false || strpos($clean, ';') !== false)
		$fail = true;
	// Trying to change passwords, slow us down, or something?
	elseif (strpos($clean, 'sleep') !== false && preg_match('~(^|[^a-z])sleep($|[^[_a-z])~s', $clean) != 0)
		$fail = true;
	elseif (strpos($clean, 'benchmark') !== false && preg_match('~(^|[^a-z])benchmark($|[^[a-z])~s', $clean) != 0)
		$fail = true;

	return empty($fail);
}

// ------------------------------------------------------------
// OLD cleaner (slow baseline)
// ------------------------------------------------------------
function smf3_cleaner($db_string)
{
	// Comments that are allowed in a query are preg_removed.
	static $allowed_comments_from = array(
		'~\s+~s',
		'~/\*!40001 SQL_NO_CACHE \*/~',
		'~/\*!40000 USE INDEX \([A-Za-z\_]+?\) \*/~',
		'~/\*!40100 ON DUPLICATE KEY UPDATE id_msg = \d+ \*/~',
	);
	static $allowed_comments_to = array(
		' ',
		'',
		'',
		'',
	);

	$clean = preg_split('/(?<![\'\\\\])\'(?![\'])/', $db_string);

	for ($i = 0; $i < \count($clean); $i++) {
		if ($i % 2 === 1) {
			$clean[$i] = ' %s ';
		}
	}

	$clean = trim(strtolower(preg_replace(
		$allowed_comments_from,
		$allowed_comments_to,
		implode('', $clean),
	)));

	return !(
		// Empty string?
		$clean === ''
		// Comments?  We don't use comments in our queries, we leave 'em outside!
		|| strpos($clean, '/*') > 2
		|| str_contains($clean, '--')
		|| str_contains($clean, ';')
		// Trying to change passwords, slow us down, or something?
		|| preg_match('~(^|[^a-z])sleep($|[^[_a-z])~s', $clean)
		|| preg_match('~(^|[^a-z])benchmark($|[^[a-z])~s', $clean));
}

// ------------------------------------------------------------
// NEW optimized cleaner
// ------------------------------------------------------------
function new_cleaner(string $sql): bool
{
	// Comments that are allowed in a query are preg_removed.
	static $allowed_comments =  '~/\*!(?:40001\s+SQL_NO_CACHE|40000\s+USE\s+INDEX\s+\([A-Za-z_]+?\)|40100\s+ON\s+DUPLICATE\s+KEY\s+UPDATE\s+id_msg\s+=\s+\d+)\s+\*/~';

	// Remove strings to prevent false positives (simpler & faster)
	$clean = preg_replace(
		[
			"/'[^'\\\\]*(?:\\\\.[^'\\\\]*)*'/s",
			// Strip allowed MySQL optimizer hints
			$allowed_comments,
		],
		["''", ''],
		$sql
	);

	// Trying to change passwords, slow us down, or add comments? We leave 'em outside!
	return !preg_match(
		'/\/\*|--|;|\b(?:sleep|benchmark)\s*\(/i',
		$clean
	);
}

// ------------------------------------------------------------
// Benchmark helper
// ------------------------------------------------------------
function bench(string $label, callable $fn, array $queries, int $iterations)
{
	$totalQueries = $iterations * count($queries);
	memory_reset_peak_usage();
	$start = hrtime(true);

	for ($i = 0; $i < $iterations; $i++) {
		foreach ($queries as $sql) {
			$fn($sql);
		}
	}

	$elapsed = (hrtime(true) - $start) / 1e6; // total ms
	$mem = memory_get_peak_usage();
	$avgUs = ($elapsed * 1000) / $totalQueries; // µs per query
	return [$label, $elapsed, $avgUs, $mem, $totalQueries];
}

// ------------------------------------------------------------
// Test dataset
// ------------------------------------------------------------
$queries = [
	// Normal / safe
	"SELECT * FROM users WHERE id = 1",
	"SELECT * FROM users WHERE name='O\\'Connor'",

	// Unsafe / classic
	"SELECT * FROM users; DROP TABLE users;",
	"SELECT * FROM log WHERE msg LIKE '%/*comment*/%';",
	"SELECT SLEEP(2)",
	"SELECT * FROM messages WHERE text='normal' -- trick",
	"SELECT /*!40001 SQL_NO_CACHE */ * FROM posts",
	"INSERT INTO tbl VALUES(1, 'safe')",

	// Edge cases
	"SELECT 'This string has a ; semicolon' AS test",
	"SELECT 'This string has a -- dash' AS test",
	"SELECT 'SLEEP(10)' AS test",
	"SELECT 'benchmark(1000, MD5(1))' AS test",
	"SELECT * FROM /*!40100 ON DUPLICATE KEY UPDATE id_msg = 42 */ tbl",
	"SELECT 'Escaped \\' quote inside\\'' AS test", // mysql
	"SELECT 'Escaped '' quote inside''\\''' AS test", // pgsql
	"SELECT * FROM users WHERE name='Alice' AND comment='/* tricky */' -- end",
	"SELECT /*!40000 USE INDEX (idx_test) */ * FROM posts WHERE id=1",
];

// ------------------------------------------------------------
// Scales to test
// ------------------------------------------------------------
$scales = [
	20000,
	200000,
	1000000,
];

$tests = [
	['SMF 2.1.6 cleaner', 'smf2_cleaner'],
	['SMF 2.1.7 cleaner', 'smf217_cleaner'],
	['sbulen cleaner', 'sbulen_cleaner'],
	['SMF 3 cleaner', 'smf3_cleaner'],
	['New cleaner', 'new_cleaner'],
];


// ------------------------------------------------------------
// Run full benchmark
// ------------------------------------------------------------
foreach ($scales as $total) {
	echo "=== Benchmark for {$total} queries ===\n";
	printf("%-25s : %10s | %12s | %10s\n", '', 'total (ms)', 'avg/query (µs)', 'memory');
	echo str_repeat('-', 68) . "\n";

	foreach ($tests as [$label, $fn]) {
		[$label, $elapsed, $avgUs, $mem] = bench($label, $fn, $queries, (int)($total / count($queries)));
		printf("%-25s : %10.3f | %12.3f | %8.2f MB\n", $label, $elapsed, $avgUs, $mem / 1048576);
	}

	echo "\n";
}

// ------------------------------------------------------------
// Functional difference check
// ------------------------------------------------------------
echo "=== Detection differences ===\n";
foreach ($queries as $sql) {
	$old = smf2_cleaner($sql);
	$new = new_cleaner($sql);
	if ($old !== $new) {
		echo sprintf(
			"- %-60s | old=%s new=%s\n",
			$sql,
			$old ? 'allow' : 'block',
			$new ? 'allow' : 'block'
		);
	}
}

echo "=== Detection differences: smf217 ===\n";
foreach ($queries as $sql) {
	$old = smf2_cleaner($sql);
	$smf217 = smf217_cleaner($sql);
	if ($old !== $smf217) {
		echo sprintf(
			"- %-60s | old=%s smf217=%s\n",
			$sql,
			$old ? 'allow' : 'block',
			$smf217 ? 'allow' : 'block'
		);
	}
}

echo "=== Detection differences: sbulen ===\n";
foreach ($queries as $sql) {
	$old = smf2_cleaner($sql);
	$sbulen = sbulen_cleaner($sql);
	if ($old !== $sbulen) {
		echo sprintf(
			"- %-60s | old=%s sbulen=%s\n",
			$sql,
			$old ? 'allow' : 'block',
			$sbulen ? 'allow' : 'block'
		);
	}
}


// -------------------- Generate large INSERT --------------------
$table = 'bulk_test';
$columns = ['id', 'payload', 'meta'];
$rows = 50000; // adjust as needed

echo "\n === Generating very large INSERT with {$rows} rows ===\n";
$startGen = microtime(true);

$values = [];
for ($i = 1; $i <= $rows; $i++) {
	$row = [
		$i, // id
		"'" . addslashes("payload_{$i}_" . str_repeat('x', $i % 50)) . "'",
		"'" . addslashes("meta_{$i}_" . str_repeat('y', $i % 30)) . "'"
	];
	$values[] = '(' . implode(', ', $row) . ')';
}

$sql = "INSERT INTO `{$table}` (" . implode(', ', $columns) . ") VALUES\n" .
		implode(",\n", $values);

$genTime = microtime(true) - $startGen;
echo "Generated SQL length: " . number_format(strlen($sql)) . " bytes in " . round($genTime,3) . "s\n";

// -------------------- Run cleaner --------------------

foreach ($tests as [$label, $fn]) {
	echo "\n=== Running $label on the huge INSERT ===\n";

	memory_reset_peak_usage();
	$mem = -memory_get_peak_usage();
	$startClean = microtime(true);
	$blocked = !$fn($sql);
	$cleanTime = microtime(true) - $startClean;
	$mem += memory_get_peak_usage();

	echo "Cleaner result: " . ($blocked ? "BLOCKED" : "ALLOWED") . "\n";
	echo "Cleaner runtime: " . round($cleanTime, 3) . " seconds\n";
	printf("Peak memory: %.2f MB\n", $mem / 1048576);
}

return;

echo "=== Detection Differences ===";

foreach ($queries as $sql) {
	echo "\n\nQuery:\n";
	echo $sql . "\n\n";

	// Print table header
	printf("%-20s | %-10s | %-10s | %-10s\n", "Cleaner", "Result", "Runtime(s)", "Peak MB");
	echo str_repeat("-", 60) . "\n";

	foreach ($tests as [$label, $fn]) {
		$startClean = hrtime(true);
		memory_reset_peak_usage();
		$blocked = !$fn($sql);
		$peakMB = memory_get_peak_usage() / 1048576;
		$cleanTime = hrtime(true) - $startClean;

		printf(
			"%-20s | %-10s | %-10.3f | %-10.2f\n",
			$label,
			$blocked ? "BLOCKED" : "ALLOWED",
			$cleanTime / 1000,
			$peakMB
		);
	}
}

@Sesquipedalian
Copy link
Member

Sorry, @sbulen, but trying to do this job using the ~\'\X*?\'~s regex has significant problems. I actually tried one that in the process of trying to fix this very same issue in 3.0. I no longer have the set of test strings that I used at the time, but there were some that it failed on.

Moreover, MySQL always recognizes both \' and '' as ways to escape an apostrophe character, and PostgreSQL recognizes both forms if standard_conforming_strings is set to off, so the logic for both databases does in fact need to handle both forms of escaping.

What needs to happen here is that the fixes from 3.0 (specifically, #8870) need to be backported to 2.1. Apparently I forgot to do that when I submitted #8870. I will submit a PR to do that shortly.

@Sesquipedalian
Copy link
Member

(clicked the wrong button)

@Sesquipedalian
Copy link
Member

Closed by #8992

@live627
Copy link
Contributor

live627 commented Oct 26, 2025

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants