diff --git a/website/blog/2026-03-07-regex-hidden-traps.mdx b/website/blog/2026-03-07-regex-hidden-traps.mdx new file mode 100644 index 000000000000..bcdd076019f7 --- /dev/null +++ b/website/blog/2026-03-07-regex-hidden-traps.mdx @@ -0,0 +1,183 @@ +--- +slug: regex-hidden-traps +title: "The hidden traps of regex in LIKE and split" +authors: [mbasmanova] +tags: [tech-blog,functions] +--- + +SQL functions sometimes use regular expressions under the hood in ways that +surprise users. Two common examples are the +LIKE +operator and Spark's +split +function. + +In Presto, +split +takes a literal string delimiter and +regexp_split +is a separate function for regex-based splitting. Spark's `split`, however, +always treats the delimiter as a regular expression. + +Both LIKE and Spark's split can silently produce wrong results and waste CPU +when used with column values instead of constants. Understanding why this +happens helps write faster, more correct queries — and helps engine developers +make better design choices. + +## LIKE is not contains + +A very common query pattern is to check whether one string contains another: + +```sql +SELECT * FROM t WHERE name LIKE '%' || search_term || '%' +``` + +This looks intuitive: wrap `search_term` in `%` wildcards and you get a +"contains" check. But LIKE is **not** the same as substring matching. +LIKE treats `_` as a single-character wildcard and `%` as a multi-character +wildcard. If `search_term` comes from a column and contains these characters, +the results are silently wrong: + +``` +SELECT url, + url LIKE '%' || search_term || '%' AS like_result, + strpos(url, search_term) > 0 AS contains_result +FROM (VALUES + ('https://site.com/home'), + ('https://site.com/user_profile'), + ('https://site.com/username') +) AS t(url) +CROSS JOIN (VALUES ('user_')) AS s(search_term); + + url | like_result | contains_result +-------------------------------+-------------+---------------- +https://site.com/home | false | false +https://site.com/user_profile | true | true +https://site.com/username | true | false +``` + +`LIKE '%user_%'` matches `'https://site.com/username'` because `_` is a +wildcard that matches any single character — in this case, `n`. But +`strpos(url, 'user_') > 0` treats `_` as a literal underscore and correctly +reports that `'https://site.com/username'` does not contain the substring +`'user_'`. + +When the pattern is a constant, this distinction is visible and intentional. +But when users write `x LIKE '%' || y || '%'` where `y` is a column, the +values of `y` may contain `_` or `%` characters — and they will be silently +interpreted as wildcards, producing wrong results. + +## Spark's split treats delimiters as regex + +In Presto, the split +function takes a literal string delimiter, while +regexp_split +is a separate function for regex-based splitting. This distinction makes the intent clear. + +Spark's split +function, however, always treats the delimiter as a regular expression. +Users rarely realize this, and a common pattern is to split a string using a +value from another column: + +```sql +select split(dir_path, location_path)[1] as partition_name from t +``` + +Here, a table stores Hive partition metadata: `dir_path` is the full partition +path (e.g., `/data/warehouse/db.name/table/ds=2024-01-01`) and `location_path` +is the table path (e.g., `/data/warehouse/db.name/table`). The user wants to +strip the table path prefix to get the partition name. + +This works for simple paths. But `location_path` is interpreted as a regular +expression, not a literal string. If it contains `.` — as in `db.name` — the +`.` matches **any character**, not a literal dot. Characters like `(`, `)`, +`[`, `+`, `*`, `?`, and `$` would also cause wrong results or errors. + +A correct alternative that also executes faster uses simple string operations: + +```sql +IF(starts_with(dir_path, location_path), + substr(dir_path, length(location_path) + 2)) as partition_name +``` + +This is a bit more verbose than `split(dir_path, location_path)[1]`, but it is +correct for all inputs and avoids regex compilation entirely. + +## Performance trap + +Beyond correctness, there is a performance problem. Both LIKE and Spark's +split use RE2 as the regex +engine. RE2 is fast and safe, but compiling a regular expression can take up +to 200x more CPU time than evaluating it. + +When the pattern or delimiter is a constant, the regex is compiled once and +reused for every row. The cost is negligible. But when the pattern comes from +a column, a new regex may need to be compiled for every distinct value. A table +with thousands of distinct `location_path` values means thousands of regex +compilations — each one expensive and none of them necessary. + +Velox limits the number of compiled regular expressions per function instance +per thread of execution via the +expression.max_compiled_regexes +configuration property (default: 100). When this limit is reached, the query fails with an +error. + +## Tempting but wrong fix + +When users hit this limit, the natural reaction is to ask the engine developers +to raise or eliminate the cap. A recent +pull request +proposed replacing the fixed-size cache with an evicting cache: when the limit +is reached, the oldest compiled regex is evicted to make room for the new one. + +This sounds reasonable, and the motivation is understandable — users migrating +from Spark don't want to rewrite working queries. But it makes things worse: + +- **It hides the correctness bug.** The query no longer fails, so users never + discover that their LIKE pattern or split delimiter is being interpreted as + a regex and producing wrong results for inputs with special characters. +- **It makes the performance problem worse.** With thousands of distinct + patterns, the cache churns constantly — evicting one compiled regex only to + compile another. The query runs, but dramatically slower than necessary, and + the user has no indication why. In shared multi-tenant clusters, a single + slow query like this can consume excessive CPU and affect other users' + workloads. + +The error is a feature, not a bug. It is an early warning that catches misuse +before it leads to silently wrong results in production and prevents a single +query from wasting shared cluster resources. + +## Right fix + +**For users:** replace LIKE with literal string operations when checking for +substrings. Use `strpos(x, y) > 0` or `contains(x, y)` instead of +`x LIKE '%' || y || '%'`. For Spark's split with literal delimiters, use +`substr` or other string functions that don't involve regex. + +**For engine developers:** optimize the functions to avoid regex when it isn't +needed. Velox's LIKE implementation already does this. As described in +James Xu's +earlier blog post, the engine analyzes each pattern +and uses fast paths — prefix match, suffix match, substring search — whenever +the pattern contains only regular characters and `_` wildcards. For simple patterns, this gives up to 750x speedup over regex. +Regex is compiled only for patterns that truly require it, and these optimized +patterns are not counted toward the compiled regex limit. + +The same approach should be applied to Spark's split function. The engine can +check whether the delimiter contains any regex metacharacters. If it doesn't, +a simple string search can be used instead of compiling a regex. This would +make queries like `split(dir_path, location_path)` both fast and correct — +without users needing to change anything and without removing the safety net +for cases that genuinely require regex. + +## Takeaways + +- `LIKE` is not `contains`. The `_` and `%` wildcards can silently corrupt + results when the pattern comes from a column. +- Spark's `split` treats delimiters as regex. Characters like `.` in column + values are interpreted as regex metacharacters, not literal characters. + Presto avoids this by separating `split` (literal) and `regexp_split` (regex). +- When a query hits the compiled regex limit, the right response is to fix the + query, not to raise the limit. +- Engine developers should optimize functions to avoid regex when the input + is a plain string, rather than making it easier to misuse regex at scale.