---
slug: regex-hidden-traps
title: "The hidden traps of regex in LIKE and split"
authors: [mbasmanova]
tags: [tech-blog,functions]
---

SQL functions sometimes use regular expressions under the hood in ways that
surprise users. Two common examples are the
<a href="https://facebookincubator.github.io/velox/functions/presto/regexp.html#like">LIKE</a>
operator and Spark's
<a href="https://facebookincubator.github.io/velox/functions/spark/string.html#split">split</a>
function.

In Presto,
<a href="https://facebookincubator.github.io/velox/functions/presto/string.html#split">split</a>
takes a literal string delimiter and
<a href="https://facebookincubator.github.io/velox/functions/presto/regexp.html#regexp_split">regexp_split</a>
is a separate function for regex-based splitting. Spark's `split`, however,
always treats the delimiter as a regular expression.

Both LIKE and Spark's split can silently produce wrong results and waste CPU
when used with column values instead of constants. Understanding why this
happens helps users write faster, more correct queries — and helps engine
developers make better design choices.

## LIKE is not contains

A very common query pattern is to check whether one string contains another:

```sql
SELECT * FROM t WHERE name LIKE '%' || search_term || '%'
```

This looks intuitive: wrap `search_term` in `%` wildcards and you get a
"contains" check. But LIKE is **not** the same as substring matching.
LIKE treats `_` as a wildcard matching exactly one character and `%` as a
wildcard matching any sequence of characters, including the empty one. If
`search_term` comes from a column and contains these characters, the results
are silently wrong:

```
SELECT url,
       url LIKE '%' || search_term || '%' AS like_result,
       strpos(url, search_term) > 0 AS contains_result
FROM (VALUES
    ('https://site.com/home'),
    ('https://site.com/user_profile'),
    ('https://site.com/username')
) AS t(url)
CROSS JOIN (VALUES ('user_')) AS s(search_term);

              url              | like_result | contains_result
-------------------------------+-------------+-----------------
 https://site.com/home         | false       | false
 https://site.com/user_profile | true        | true
 https://site.com/username     | true        | false
```

`LIKE '%user_%'` matches `'https://site.com/username'` because `_` is a
wildcard that matches any single character — in this case, `n`. But
`strpos(url, 'user_') > 0` treats `_` as a literal underscore and correctly
reports that `'https://site.com/username'` does not contain the substring
`'user_'`.

When the pattern is a constant, this distinction is visible and intentional.
But when users write `x LIKE '%' || y || '%'` where `y` is a column, the
values of `y` may contain `_` or `%` characters — and they will be silently
interpreted as wildcards, producing wrong results.

## Spark's split treats delimiters as regex

As noted above, Presto keeps literal and regex-based splitting apart:
<a href="https://facebookincubator.github.io/velox/functions/presto/string.html#split">split</a>
takes a literal delimiter and
<a href="https://facebookincubator.github.io/velox/functions/presto/regexp.html#regexp_split">regexp_split</a>
handles regex, so the intent of each call is clear. Spark's
<a href="https://facebookincubator.github.io/velox/functions/spark/string.html#split">split</a>,
however, always treats the delimiter as a regular expression.
Users rarely realize this, and a common pattern is to split a string using a
value from another column:

| 81 | + |
| 82 | +```sql |
| 83 | +select split(dir_path, location_path)[1] as partition_name from t |
| 84 | +``` |
| 85 | + |
| 86 | +Here, a table stores Hive partition metadata: `dir_path` is the full partition |
| 87 | +path (e.g., `/data/warehouse/db.name/table/ds=2024-01-01`) and `location_path` |
| 88 | +is the table path (e.g., `/data/warehouse/db.name/table`). The user wants to |
| 89 | +strip the table path prefix to get the partition name. |
| 90 | + |
| 91 | +This works for simple paths. But `location_path` is interpreted as a regular |
| 92 | +expression, not a literal string. If it contains `.` — as in `db.name` — the |
| 93 | +`.` matches **any character**, not a literal dot. Characters like `(`, `)`, |
| 94 | +`[`, `+`, `*`, `?`, and `$` would also cause wrong results or errors. |
| 95 | + |
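Python's `re` module makes these failure modes easy to reproduce (an illustrative sketch; Python's regex engine is not RE2, but the metacharacter behavior shown here is the same):

```python
import re

dir_path = "/data/warehouse/db.name/table/ds=2024-01-01"
location_path = "/data/warehouse/db.name/table"

# '.' matches ANY character, so the delimiter also matches paths that
# differ from the literal table path.
other = "/data/warehouse/db0name/table/ds=2024-01-01"
wrong_split = re.split(location_path, other)
print(wrong_split)      # ['', '/ds=2024-01-01'] -- split happened anyway

# An unbalanced '(' makes the delimiter an invalid regex entirely.
try:
    re.split("/data/warehouse/db(name/table", dir_path)
    regex_error = False
except re.error:
    regex_error = True
print("invalid delimiter raises:", regex_error)   # True

# Escaping the delimiter restores literal-string semantics.
literal_split = re.split(re.escape(location_path), dir_path)
print(literal_split)    # ['', '/ds=2024-01-01']
```

The first call shows the silent-wrong-result case; the second shows the hard error; the third shows what the user actually intended.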
A correct alternative that also executes faster uses simple string operations:

```sql
IF(starts_with(dir_path, location_path),
   substr(dir_path, length(location_path) + 2),
   NULL) as partition_name
```

This is a bit more verbose than `split(dir_path, location_path)[1]`, but it is
correct for all inputs and avoids regex compilation entirely. The `+ 2` skips
the `/` separator between the table path and the partition name.

## Performance trap

Beyond correctness, there is a performance problem. Both LIKE and Spark's
split use <a href="https://github.com/google/re2">RE2</a> as the regex
engine. RE2 is fast and safe, but compiling a regular expression can take up
to 200x more CPU time than evaluating it.

When the pattern or delimiter is a constant, the regex is compiled once and
reused for every row. The cost is negligible. But when the pattern comes from
a column, a new regex may need to be compiled for every distinct value. A table
with thousands of distinct `location_path` values means thousands of regex
compilations — each one expensive and none of them necessary.
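The asymmetry is easy to observe directly. A rough Python sketch (Python's `re`, not RE2; the exact ratio varies by pattern and machine) compares compiling per distinct value against compiling once and reusing:

```python
import re
import time

# Hypothetical workload: 1,000 rows, each carrying a distinct delimiter.
patterns = [f"/data/warehouse/db{i}.name/table" for i in range(1000)]
row = "/data/warehouse/db7.name/table/ds=2024-01-01"

# Compile per distinct value. re.purge() clears re's internal cache so
# each iteration pays the full compilation cost.
start = time.perf_counter()
for p in patterns:
    re.purge()
    re.compile(p).match(row)
per_value = time.perf_counter() - start

# Compile once, evaluate for every row.
compiled = re.compile(patterns[7])
start = time.perf_counter()
for _ in patterns:
    compiled.match(row)
reused = time.perf_counter() - start

print(f"per-value compilation: {per_value / reused:.0f}x slower")
```

The per-value loop is what effectively happens when the pattern comes from a column; the reuse loop is what happens with a constant pattern.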

Velox limits the number of compiled regular expressions per function instance
per thread of execution via the
<a href="https://facebookincubator.github.io/velox/configs.html#expression-evaluation-configuration">expression.max_compiled_regexes</a>
configuration property (default: 100). When this limit is reached, the query
fails with an error.

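A minimal sketch of this guard (illustrative Python, not Velox's actual code; the parameter name mirrors the config property):

```python
import re

class RegexCache:
    """Per-function-instance cache of compiled regexes with a hard cap,
    loosely modeled on expression.max_compiled_regexes."""

    def __init__(self, max_compiled_regexes: int = 100):
        self.max = max_compiled_regexes
        self.cache: dict[str, re.Pattern] = {}

    def get(self, pattern: str) -> re.Pattern:
        if pattern not in self.cache:
            if len(self.cache) >= self.max:
                # Fail fast instead of compiling without bound.
                raise RuntimeError(
                    f"Number of distinct regexes exceeds {self.max}")
            self.cache[pattern] = re.compile(pattern)
        return self.cache[pattern]

cache = RegexCache(max_compiled_regexes=2)
cache.get("a.*")
cache.get("b.*")
cache.get("a.*")          # cache hit, no new compile
try:
    cache.get("c.*")      # third distinct pattern: fails fast
    limit_hit = False
except RuntimeError as e:
    limit_hit = True
    print(e)
```

A query with a constant pattern never exceeds a cache of one entry; only column-driven patterns push against the cap.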
## Tempting but wrong fix

When users hit this limit, the natural reaction is to ask the engine developers
to raise or eliminate the cap. A recent
<a href="https://github.com/facebookincubator/velox/pull/15953">pull request</a>
proposed replacing the fixed-size cache with an evicting cache: when the limit
is reached, the oldest compiled regex is evicted to make room for the new one.

This sounds reasonable, and the motivation is understandable — users migrating
from Spark don't want to rewrite working queries. But it makes things worse:

- **It hides the correctness bug.** The query no longer fails, so users never
  discover that their LIKE pattern or split delimiter is being interpreted as
  a regex and producing wrong results for inputs with special characters.
- **It makes the performance problem worse.** With thousands of distinct
  patterns, the cache churns constantly — evicting one compiled regex only to
  compile another. The query runs, but dramatically slower than necessary, and
  the user has no indication why. In shared multi-tenant clusters, a single
  slow query like this can consume excessive CPU and affect other users'
  workloads.

The error is a feature, not a bug. It is an early warning that catches misuse
before it leads to silently wrong results in production and prevents a single
query from wasting shared cluster resources.

## Right fix

**For users:** replace LIKE with literal string operations when checking for
substrings. Use `strpos(x, y) > 0` or `contains(x, y)` instead of
`x LIKE '%' || y || '%'`. For Spark's split with literal delimiters, use
`substr` or other <a href="https://facebookincubator.github.io/velox/functions/spark/string.html">string functions</a> that don't involve regex.

**For engine developers:** optimize the functions to avoid regex when it isn't
needed. Velox's LIKE implementation already does this. As described in
<a href="https://github.com/xumingming">James Xu</a>'s
<a href="/blog/like">earlier blog post</a>, the engine analyzes each pattern
and uses fast paths — prefix match, suffix match, substring search — whenever
the pattern contains only regular characters and `_` wildcards. For simple
patterns, this gives up to 750x speedup over regex. Regex is compiled only for
patterns that truly require it, and these optimized patterns are not counted
toward the compiled regex limit.

The same approach should be applied to Spark's split function. The engine can
check whether the delimiter contains any regex metacharacters. If it doesn't,
a simple string search can be used instead of compiling a regex. This would
make queries like `split(dir_path, location_path)` both fast and correct —
without users needing to change anything and without removing the safety net
for cases that genuinely require regex.

## Takeaways

- `LIKE` is not `contains`. The `_` and `%` wildcards can silently corrupt
  results when the pattern comes from a column.
- Spark's `split` treats delimiters as regex. Characters like `.` in column
  values are interpreted as regex metacharacters, not literal characters.
  Presto avoids this by separating `split` (literal) and `regexp_split` (regex).
- When a query hits the compiled regex limit, the right response is to fix the
  query, not to raise the limit.
- Engine developers should optimize functions to avoid regex when the input
  is a plain string, rather than making it easier to misuse regex at scale.