Commit acbb96c — authored by mbasmanova, committed by meta-codesync[bot]

docs: Add blog post about hidden traps of regex in LIKE and split (#16673)

Summary: Pull Request resolved: #16673. Blog post explaining how LIKE and Spark's split use regex under the hood, leading to correctness bugs and performance traps when patterns or delimiters come from columns. Discusses why raising the compiled regex limit is the wrong fix and what users and engine developers should do instead.

Differential Revision: D95671491

1 parent d085d37; 1 file changed: +183 −0
---
slug: regex-hidden-traps
title: "The hidden traps of regex in LIKE and split"
authors: [mbasmanova]
tags: [tech-blog,functions]
---

SQL functions sometimes use regular expressions under the hood in ways that
surprise users. Two common examples are the
<a href="https://facebookincubator.github.io/velox/functions/presto/regexp.html#like">LIKE</a>
operator and Spark's
<a href="https://facebookincubator.github.io/velox/functions/spark/string.html#split">split</a>
function.

In Presto,
<a href="https://facebookincubator.github.io/velox/functions/presto/string.html#split">split</a>
takes a literal string delimiter and
<a href="https://facebookincubator.github.io/velox/functions/presto/regexp.html#regexp_split">regexp_split</a>
is a separate function for regex-based splitting. Spark's `split`, however,
always treats the delimiter as a regular expression.

Both LIKE and Spark's split can silently produce wrong results and waste CPU
when used with column values instead of constants. Understanding why this
happens helps users write faster, more correct queries — and helps engine
developers make better design choices.

## LIKE is not contains

A very common query pattern is to check whether one string contains another:

```sql
SELECT * FROM t WHERE name LIKE '%' || search_term || '%'
```

This looks intuitive: wrap `search_term` in `%` wildcards and you get a
"contains" check. But LIKE is **not** the same as substring matching.
LIKE treats `_` as a single-character wildcard and `%` as a multi-character
wildcard. If `search_term` comes from a column and contains these characters,
the results are silently wrong:

```
SELECT url,
       url LIKE '%' || search_term || '%' AS like_result,
       strpos(url, search_term) > 0 AS contains_result
FROM (VALUES
    ('https://site.com/home'),
    ('https://site.com/user_profile'),
    ('https://site.com/username')
) AS t(url)
CROSS JOIN (VALUES ('user_')) AS s(search_term);

              url              | like_result | contains_result
-------------------------------+-------------+----------------
 https://site.com/home         | false       | false
 https://site.com/user_profile | true        | true
 https://site.com/username     | true        | false
```

`LIKE '%user_%'` matches `'https://site.com/username'` because `_` is a
wildcard that matches any single character — in this case, `n`. But
`strpos(url, 'user_') > 0` treats `_` as a literal underscore and correctly
reports that `'https://site.com/username'` does not contain the substring
`'user_'`.

When the pattern is a constant, this distinction is visible and intentional.
But when users write `x LIKE '%' || y || '%'` where `y` is a column, the
values of `y` may contain `_` or `%` characters — and they will be silently
interpreted as wildcards, producing wrong results.
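If the LIKE form must be kept, the `_` and `%` in the column value can be
neutralized with the standard `ESCAPE` clause. A hedged sketch in Presto SQL,
using `#` as an arbitrary escape character (it must itself be escaped first so
literal `#` characters in the data survive):

```sql
-- Escape '#' first, then '%' and '_', so search_term is matched literally:
SELECT *
FROM t
WHERE name LIKE '%'
    || replace(replace(replace(search_term, '#', '##'),
                       '%', '#%'),
               '_', '#_')
    || '%' ESCAPE '#';
```

With this rewrite, `search_term = 'user_'` no longer matches `'username'`,
because `#_` matches only a literal underscore.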

## Spark's split treats delimiters as regex

In Presto, the <a href="https://facebookincubator.github.io/velox/functions/presto/string.html#split">split</a>
function takes a literal string delimiter, while
<a href="https://facebookincubator.github.io/velox/functions/presto/regexp.html#regexp_split">regexp_split</a>
is a separate function for regex-based splitting. This distinction makes the intent clear.

Spark's <a href="https://facebookincubator.github.io/velox/functions/spark/string.html#split">split</a>
function, however, always treats the delimiter as a regular expression.
Users rarely realize this, and a common pattern is to split a string using a
value from another column:

```sql
select split(dir_path, location_path)[1] as partition_name from t
```

Here, a table stores Hive partition metadata: `dir_path` is the full partition
path (e.g., `/data/warehouse/db.name/table/ds=2024-01-01`) and `location_path`
is the table path (e.g., `/data/warehouse/db.name/table`). The user wants to
strip the table path prefix to get the partition name.

This works for simple paths. But `location_path` is interpreted as a regular
expression, not a literal string. If it contains `.` — as in `db.name` — the
`.` matches **any character**, not a literal dot. Characters like `(`, `)`,
`[`, `+`, `*`, `?`, and `$` would also cause wrong results or errors.
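To make the failure concrete, here is a hedged sketch in Spark SQL with
illustrative literal values (not taken from the table above):

```sql
-- '.' matches ANY character, so the delimiter matches even though
-- 'db_name' differs from 'db.name'; result: ["", "/ds=2024-01-01"]
SELECT split('/data/warehouse/db_name/table/ds=2024-01-01',
             '/data/warehouse/db.name/table');

-- An unbalanced '(' makes the delimiter an invalid regex, and the
-- query fails with a pattern-syntax error instead of splitting:
SELECT split('/data/t(old/ds=2024-01-01', '/data/t(old');
```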

A correct alternative that also executes faster uses simple string operations:

```sql
IF(starts_with(dir_path, location_path),
   substr(dir_path, length(location_path) + 2)) as partition_name
```

This is a bit more verbose than `split(dir_path, location_path)[1]`, but it is
correct for all inputs and avoids regex compilation entirely.
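As a quick sanity check, the rewrite can be run against an inline sample row.
The values below are hypothetical; note that Spark SQL's `if` requires an
explicit else branch, so `NULL` is spelled out here:

```sql
SELECT IF(starts_with(dir_path, location_path),
          substr(dir_path, length(location_path) + 2),
          NULL) AS partition_name
FROM (VALUES ('/data/warehouse/db.name/table/ds=2024-01-01',
              '/data/warehouse/db.name/table')) AS t(dir_path, location_path);
-- partition_name: 'ds=2024-01-01'
```

The `+ 2` skips both the prefix and the `/` that follows it, since `substr`
is 1-based.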
## Performance trap

Beyond correctness, there is a performance problem. Both LIKE and Spark's
split use <a href="https://github.com/google/re2">RE2</a> as the regex
engine. RE2 is fast and safe, but compiling a regular expression can take up
to 200x more CPU time than evaluating it.

When the pattern or delimiter is a constant, the regex is compiled once and
reused for every row. The cost is negligible. But when the pattern comes from
a column, a new regex may need to be compiled for every distinct value. A table
with thousands of distinct `location_path` values means thousands of regex
compilations — each one expensive and none of them necessary.

Velox limits the number of compiled regular expressions per function instance
per thread of execution via the
<a href="https://facebookincubator.github.io/velox/configs.html#expression-evaluation-configuration">expression.max_compiled_regexes</a>
configuration property (default: 100). When this limit is reached, the query
fails with an error.
## Tempting but wrong fix

When users hit this limit, the natural reaction is to ask the engine developers
to raise or eliminate the cap. A recent
<a href="https://github.com/facebookincubator/velox/pull/15953">pull request</a>
proposed replacing the fixed-size cache with an evicting cache: when the limit
is reached, the oldest compiled regex is evicted to make room for the new one.

This sounds reasonable, and the motivation is understandable — users migrating
from Spark don't want to rewrite working queries. But it makes things worse:
- **It hides the correctness bug.** The query no longer fails, so users never
  discover that their LIKE pattern or split delimiter is being interpreted as
  a regex and producing wrong results for inputs with special characters.
- **It makes the performance problem worse.** With thousands of distinct
  patterns, the cache churns constantly — evicting one compiled regex only to
  compile another. The query runs, but dramatically slower than necessary, and
  the user has no indication why. In shared multi-tenant clusters, a single
  slow query like this can consume excessive CPU and affect other users'
  workloads.

The error is a feature, not a bug. It is an early warning that catches misuse
before it leads to silently wrong results in production and prevents a single
query from wasting shared cluster resources.

## Right fix

**For users:** replace LIKE with literal string operations when checking for
substrings. Use `strpos(x, y) > 0` or `contains(x, y)` instead of
`x LIKE '%' || y || '%'`. For Spark's split with literal delimiters, use
`substr` or other <a href="https://facebookincubator.github.io/velox/functions/spark/string.html">string functions</a> that don't involve regex.
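For example, the "contains" query from the beginning of this post becomes
(hedged: `strpos` is the Presto spelling, while `contains` on strings is the
Spark spelling):

```sql
-- search_term is treated literally; no wildcards, no regex:
SELECT * FROM t WHERE strpos(name, search_term) > 0;  -- Presto
SELECT * FROM t WHERE contains(name, search_term);    -- Spark
```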

**For engine developers:** optimize the functions to avoid regex when it isn't
needed. Velox's LIKE implementation already does this. As described in
<a href="https://github.com/xumingming">James Xu</a>'s
<a href="/blog/like">earlier blog post</a>, the engine analyzes each pattern
and uses fast paths — prefix match, suffix match, substring search — whenever
the pattern contains only regular characters and `_` wildcards. For simple
patterns, this gives up to 750x speedup over regex. Regex is compiled only for
patterns that truly require it, and these optimized patterns are not counted
toward the compiled regex limit.
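For instance, constant patterns of the following shapes can be evaluated with
plain string comparisons instead of regex (a sketch of the fast-path
categories listed above; the column and value names are illustrative):

```sql
SELECT name LIKE 'velox%'   AS has_prefix,    -- prefix match
       name LIKE '%velox'   AS has_suffix,    -- suffix match
       name LIKE '%velox%'  AS has_substring  -- substring search
FROM t;
```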

The same approach should be applied to Spark's split function. The engine can
check whether the delimiter contains any regex metacharacters. If it doesn't,
a simple string search can be used instead of compiling a regex. This would
make queries like `split(dir_path, location_path)` both fast and correct —
without users needing to change anything and without removing the safety net
for cases that genuinely require regex.

## Takeaways

- `LIKE` is not `contains`. The `_` and `%` wildcards can silently corrupt
  results when the pattern comes from a column.
- Spark's `split` treats delimiters as regex. Characters like `.` in column
  values are interpreted as regex metacharacters, not literal characters.
  Presto avoids this by separating `split` (literal) and `regexp_split` (regex).
- When a query hits the compiled regex limit, the right response is to fix the
  query, not to raise the limit.
- Engine developers should optimize functions to avoid regex when the input
  is a plain string, rather than making it easier to misuse regex at scale.
