Commit bcbe28b

minor fixes
1 parent 4928b9c commit bcbe28b

1 file changed

+73
-56
lines changed

15-regexp-catastrophic-backtracking/article.md

Lines changed: 73 additions & 56 deletions
@@ -1,20 +1,20 @@
11
# Catastrophic backtracking
22

3-
Some regular expressions are looking simple, but can execute veeeeeery long time, and even "hang" the JavaScript engine.
3+
Some regular expressions are looking simple, but can execute a veeeeeery long time, and even "hang" the JavaScript engine.
44

5-
Sooner or later most developers occasionally face such behavior, because it's quite easy to create such a regexp.
6-
7-
The typical symptom -- a regular expression works fine sometimes, but for certain strings it "hangs", consuming 100% of CPU.
5+
Sooner or later most developers occasionally face such behavior. The typical symptom -- a regular expression works fine sometimes, but for certain strings it "hangs", consuming 100% of CPU.
86

97
In such case a web-browser suggests to kill the script and reload the page. Not a good thing for sure.
108

11-
For server-side JavaScript it may become a vulnerability if regular expressions process user data.
9+
For server-side JavaScript such a regexp may hang the server process, that's even worse. So we definitely should take a look at it.
1210

1311
## Example
1412

15-
Let's say we have a string, and we'd like to check if it consists of words `pattern:\w+` with an optional space `pattern:\s?` after each.
13+
Let's say we have a string, and we'd like to check if it consists of words `pattern:\w+` with an optional space `pattern:\s?` after each.
14+
15+
An obvious way to construct a regexp would be to take a word followed by an optional space `pattern:\w+\s?` and then repeat it with `*`.
1616

17-
We'll use a regexp `pattern:^(\w+\s?)*$`, it specifies 0 or more such words.
17+
That leads us to the regexp `pattern:^(\w+\s?)*$`, it specifies zero or more such words, that start at the beginning `pattern:^` and finish at the end `pattern:$` of the line.
1818

1919
In action:
2020

@@ -25,9 +25,9 @@ alert( regexp.test("A good string") ); // true
2525
alert( regexp.test("Bad characters: $@#") ); // false
2626
```
2727

28-
It seems to work. The result is correct. Although, on certain strings it takes a lot of time. So long that JavaScript engine "hangs" with 100% CPU consumption.
28+
The regexp seems to work. The result is correct. Although, on certain strings it takes a lot of time. So long that JavaScript engine "hangs" with 100% CPU consumption.
2929

30-
If you run the example below, you probably won't see anything, as JavaScript will just "hang". A web-browser will stop reacting on events, the UI will stop working. After some time it will suggest to reloaad the page. So be careful with this:
30+
If you run the example below, you probably won't see anything, as JavaScript will just "hang". A web-browser will stop reacting on events, the UI will stop working (most browsers allow only scrolling). After some time it will suggest to reload the page. So be careful with this:
3131

3232
```js run
3333
let regexp = /^(\w+\s?)*$/;
@@ -37,70 +37,72 @@ let str = "An input string that takes a long time or even makes this regexp to h
3737
alert( regexp.test(str) );
3838
```
3939

40-
Some regular expression engines can handle such search, but most of them can't.
40+
To be fair, let's note that some regular expression engines can handle such a search effectively. But most of them can't. Browser engines usually hang.
4141

4242
## Simplified example
4343

44-
What's the matter? Why the regular expression "hangs"?
44+
What's the matter? Why does the regular expression hang?
4545

4646
To understand that, let's simplify the example: remove spaces `pattern:\s?`. Then it becomes `pattern:^(\w+)*$`.
4747

4848
And, to make things more obvious, let's replace `pattern:\w` with `pattern:\d`. The resulting regular expression still hangs, for instance:
4949

50-
<!-- let str = `AnInputStringThatMakesItHang!`; -->
51-
5250
```js run
5351
let regexp = /^(\d+)*$/;
5452

55-
let str = "012345678901234567890123456789!";
53+
let str = "012345678901234567890123456789z";
5654

57-
// will take a very long time
55+
// will take a very long time (careful!)
5856
alert( regexp.test(str) );
5957
```
6058

6159
So what's wrong with the regexp?
6260

6361
First, one may notice that the regexp `pattern:(\d+)*` is a little bit strange. The quantifier `pattern:*` looks extraneous. If we want a number, we can use `pattern:\d+`.
6462

65-
Indeed, the regexp is artificial. But the reason why it is slow is the same as those we saw above. So let's understand it, and then the previous example will become obvious.
63+
Indeed, the regexp is artificial, we got it by simplifying the previous example. But the reason why it is slow is the same. So let's understand it, and then the previous example will become obvious.
6664

67-
What happens during the search of `pattern:^(\d+)*$` in the line `subject:123456789!` (shortened a bit for clarity), why does it take so long?
65+
What happens during the search of `pattern:^(\d+)*$` in the line `subject:123456789z` (shortened a bit for clarity, please note a non-digit character `subject:z` at the end, it's important), why does it take so long?
6866

69-
1. First, the regexp engine tries to find a number `pattern:\d+`. The plus `pattern:+` is greedy by default, so it consumes all digits:
67+
Here's what the regexp engine does:
68+
69+
1. First, the regexp engine tries to find the content of the parentheses: the number `pattern:\d+`. The plus `pattern:+` is greedy by default, so it consumes all digits:
7070

7171
```
7272
\d+.......
7373
(123456789)z
7474
```
7575
76-
Then it tries to apply the star quantifier, but there are no more digits, so it the star doesn't give anything.
76+
After all digits are consumed, `pattern:\d+` is considered found (as `match:123456789`).
77+
78+
Then the star quantifier `pattern:(\d+)*` applies. But there are no more digits in the text, so the star doesn't give anything.
7779
78-
The next in the pattern is the string end `pattern:$`, but in the text we have `subject:!`, so there's no match:
80+
The next character in the pattern is the string end `pattern:$`. But in the text we have `subject:z` instead, so there's no match:
7981
8082
```
8183
X
8284
\d+........$
83-
(123456789)!
85+
(123456789)z
8486
```
8587
8688
2. As there's no match, the greedy quantifier `pattern:+` decreases the count of repetitions, backtracks one character back.
8789
88-
Now `pattern:\d+` takes all digits except the last one:
90+
Now `pattern:\d+` takes all digits except the last one (`match:12345678`):
8991
```
9092
\d+.......
91-
(12345678)9!
93+
(12345678)9z
9294
```
93-
3. Then the engine tries to continue the search from the new position (`9`).
95+
3. Then the engine tries to continue the search from the next position (right after `match:12345678`).
9496
95-
The star `pattern:(\d+)*` can be applied -- it gives the number `match:9`:
97+
The star `pattern:(\d+)*` can be applied -- it gives one more match of `pattern:\d+`, the number `match:9`:
9698
9799
```
98100
99101
\d+.......\d+
100-
(12345678)(9)!
102+
(12345678)(9)z
101103
```
102104
103-
The engine tries to match `pattern:$` again, but fails, because meets `subject:!`:
105+
The engine tries to match `pattern:$` again, but fails, because it meets `subject:z` instead:
104106
105107
```
106108
X
@@ -118,47 +120,43 @@ What happens during the search of `pattern:^(\d+)*$` in the line `subject:123456
118120
```
119121
X
120122
\d+......\d+
121-
(1234567)(89)!
123+
(1234567)(89)z
122124
```
123125
124126
The first number has 7 digits, and then two numbers of 1 digit each:
125127
126128
```
127129
X
128130
\d+......\d+\d+
129-
(1234567)(8)(9)!
131+
(1234567)(8)(9)z
130132
```
131133
132134
The first number has 6 digits, and then a number of 3 digits:
133135
134136
```
135137
X
136138
\d+.......\d+
137-
(123456)(789)!
139+
(123456)(789)z
138140
```
139141
140142
The first number has 6 digits, and then 2 numbers:
141143
142144
```
143145
X
144146
\d+.....\d+ \d+
145-
(123456)(78)(9)!
147+
(123456)(78)(9)z
146148
```
147149
148150
...And so on.
149151
150152
151-
There are many ways to split a set of digits `123456789` into numbers. To be precise, there are <code>2<sup>n</sup>-1</code>, where `n` is the length of the set.
153+
There are many ways to split a sequence of digits `123456789` into numbers. To be precise, there are <code>2<sup>n</sup>-1</code>, where `n` is the length of the sequence.
152154
153-
For `n=20` there are about 1 million combinations, for `n=30` - a thousand times more. Trying each of them is exactly the reason why the search takes so long.
155+
- For `123456789` we have `n=9`, that gives 511 combinations.
156+
- For a longer sequence with `n=20` there are about one million (1048575) combinations.
157+
- For `n=30` - a thousand times more (1073741823 combinations).
154158
155-
What to do?
156-
157-
Should we turn on the lazy mode?
158-
159-
Unfortunately, that won't help: if we replace `pattern:\d+` with `pattern:\d+?`, the regexp will still hang. The order of combinations will change, but not their total count.
160-
161-
Some regular expression engines have tricky tests and finite automations that allow to avoid going through all combinations or make it much faster, but not all engines, and not in all cases.
159+
Trying each of them is exactly the reason why the search takes so long.
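
To make the count concrete, here is a minimal sketch that enumerates those combinations for a short run of digits. It is a simplified model of the walk described above, not the code of any real engine; the names `consume` and `show` are made up for this illustration.

```js
// Simplified model (an assumption, not real engine code) of how a backtracking
// engine lets (\d+)* consume a run of digits: each \d+ greedily takes the
// longest piece first, then gives back one digit at a time.
function* consume(digits, start = 0, parts = []) {
  for (let end = digits.length; end > start; end--) {
    const next = [...parts, digits.slice(start, end)];
    // The star first tries one more repetition of \d+ on the remaining digits...
    yield* consume(digits, end, next);
    // ...and only then settles for what it has; this is where `$` gets tested.
    yield next;
  }
}

// Render a combination the way the diagrams above do, e.g. (12)(3).
const show = parts => parts.map(p => `(${p})`).join("");

// For "123" the model yields 7 attempts, in this order:
// (123), (12)(3), (12), (1)(23), (1)(2)(3), (1)(2), (1)
console.log([...consume("123")].map(show).join(", "));

// For "123456789" it yields 2 ** 9 - 1 = 511 attempts.
console.log([...consume("123456789")].length);
```

In this model every attempt ends with a failed check for `$`, so none of the work is ever rewarded with a match, and each extra digit roughly doubles it.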
162160
163161
## Back to words and strings
164162
@@ -176,15 +174,23 @@ The reason is that a word can be represented as one `pattern:\w+` or many:
176174
177175
For a human, it's obvious that there may be no match, because the string ends with an exclamation sign `!`, but the regular expression expects a word character `pattern:\w` or a space `pattern:\s` at the end. But the engine doesn't know that.
178176
179-
It tries all combinations of how the regexp `pattern:(\w+\s?)*` can "consume" the string, including variants with spaces `pattern:(\w+\s)*` and without them `pattern:(\w+)*` (because spaces `pattern:\s?` are optional). As there are many such combinations, the search takes a lot of time.
177+
It tries all combinations of how the regexp `pattern:(\w+\s?)*` can "consume" the string, including variants with spaces `pattern:(\w+\s)*` and without them `pattern:(\w+)*` (because spaces `pattern:\s?` are optional). As there are many such combinations (we've seen it with digits), the search takes a lot of time.
178+
179+
What to do?
180+
181+
Should we turn on the lazy mode?
182+
183+
Unfortunately, that won't help: if we replace `pattern:\w+` with `pattern:\w+?`, the regexp will still hang. The order of combinations will change, but not their total count.
184+
185+
Some regular expression engines have tricky tests and finite automata that allow them to avoid going through all combinations or make the search much faster, but most engines don't, and it doesn't always help.
180186
181187
## How to fix?
182188
183189
There are two main approaches to fixing the problem.
184190
185191
The first is to lower the number of possible combinations.
186192
187-
Let's rewrite the regular expression as `pattern:^(\w+\s)*\w*` - we'll look for any number of words followed by a space `pattern:(\w+\s)*`, and then (optionally) a word `pattern:\w*`.
193+
Let's make the space non-optional by rewriting the regular expression as `pattern:^(\w+\s)*\w*$` - we'll look for any number of words followed by a space `pattern:(\w+\s)*`, and then (optionally) a final word `pattern:\w*`.
188194
189195
This regexp is equivalent to the previous one (matches the same) and works well:
190196
@@ -197,26 +203,30 @@ alert( regexp.test(str) ); // false
197203

198204
Why did the problem disappear?
199205

200-
Now the star `pattern:*` goes after `pattern:\w+\s` instead of `pattern:\w+\s?`. It became impossible to represent one word of the string with multiple successive `pattern:\w+`. The time needed to try such combinations is now saved.
206+
That's because now the space is mandatory.
201207

202-
For example, the previous pattern `pattern:(\w+\s?)*` could match the word `subject:string` as two `pattern:\w+`:
208+
The previous regexp, if we omit the space, becomes `pattern:(\w+)*`, leading to many combinations of `pattern:\w+` within a single word.
209+
210+
So `subject:input` could be matched as two repetitions of `pattern:\w+`, like this:
203211

204-
```js run
205-
\w+\w+
206-
string
212+
```
213+
\w+ \w+
214+
(inp)(ut)
207215
```
208216

209-
The previous pattern, due to the optional `pattern:\s` allowed variants `pattern:\w+`, `pattern:\w+\s`, `pattern:\w+\w+` and so on.
217+
The new pattern is different: `pattern:(\w+\s)*` specifies repetitions of words followed by a space! The `subject:input` string can't be matched as two repetitions of `pattern:\w+\s`, because the space is mandatory.
210218

211-
With the rewritten pattern `pattern:(\w+\s)*`, that's impossible: there may be `pattern:\w+\s` or `pattern:\w+\s\w+\s`, but not `pattern:\w+\w+`. So the overall combinations count is greatly decreased.
219+
The time needed to try a lot of (actually most of) combinations is now saved.
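
As a quick sanity check, the sketch below compares the two patterns on a few inputs. The sample strings are chosen only for illustration and are kept short so that the original pattern still finishes quickly.

```js
// The original pattern and the rewritten one with a mandatory space.
const original = /^(\w+\s?)*$/;
const fixed = /^(\w+\s)*\w*$/;

// Hypothetical sample inputs, short enough for the original pattern.
const samples = ["", "A good string", "trailing space ", "two  spaces", "Bad characters: $@#"];

for (const s of samples) {
  // Both patterns are expected to agree on every sample.
  console.log(JSON.stringify(s), original.test(s), fixed.test(s));
}

// The rewritten pattern copes with a long non-matching string.
const longInput = "An input string that takes a long time or even makes this regexp hang!";
console.log(fixed.test(longInput)); // false
```

Agreement on a handful of strings is not a proof of equivalence, but it is a cheap way to catch a mistake when rewriting a regexp.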
212220

213221
## Preventing backtracking
214222

215-
It's not always convenient to rewrite a regexp. And it's not always obvious how to do it.
223+
It's not always convenient to rewrite a regexp though. In the example above it was easy, but it's not always obvious how to do it.
224+
225+
Besides, a rewritten regexp is usually more complex, and that's not good. Regexps are complex enough without extra efforts.
216226

217-
The alternative approach is to forbid backtracking for the quantifier.
227+
Luckily, there's an alternative approach. We can forbid backtracking for the quantifier.
218228

219-
The regular expressions engine tries many combinations that are obviously wrong for a human.
229+
The root of the problem is that the regexp engine tries many combinations that are obviously wrong for a human.
220230

221231
E.g. in the regexp `pattern:(\d+)*$` it's obvious for a human, that `pattern:+` shouldn't backtrack. If we replace one `pattern:\d+` with two separate `pattern:\d+\d+`, nothing changes:
222232

@@ -230,19 +240,26 @@ E.g. in the regexp `pattern:(\d+)*$` it's obvious for a human, that `pattern:+`
230240

231241
And in the original example `pattern:^(\w+\s?)*$` we may want to forbid backtracking in `pattern:\w+`. That is: `pattern:\w+` should match a whole word, with the maximal possible length. There's no need to lower the repetitions count in `pattern:\w+`, try to split it into two words `pattern:\w+\w+` and so on.
232242

233-
Modern regular expression engines support possessive quantifiers for that. They are like greedy ones, but don't backtrack (so they are actually simpler than regular quantifiers).
243+
Modern regular expression engines support possessive quantifiers for that. Regular quantifiers become possessive if we add `pattern:+` after them. That is, we use `pattern:\d++` instead of `pattern:\d+` to stop `pattern:+` from backtracking.
244+
245+
Possessive quantifiers are in fact simpler than "regular" ones. They just match as many as they can, without any backtracking. The search process without backtracking is simpler.
234246

235247
There are also so-called "atomic capturing groups" - a way to disable backtracking inside parentheses.
236248

237-
Unfortunately, in JavaScript they are not supported. But there's another way.
249+
...But the bad news is that, unfortunately, in JavaScript they are not supported.
250+
251+
We can emulate them though using a "lookahead transform".
238252

239253
### Lookahead to the rescue!
240254

241-
We can prevent backtracking using lookahead.
255+
So we've come to real advanced topics. We'd like a quantifier, such as `pattern:+` not to backtrack, because sometimes backtracking makes no sense.
256+
257+
The pattern to take as many repetitions of `pattern:\w` as possible without backtracking is: `pattern:(?=(\w+))\1`. Of course, we could take another pattern instead of `pattern:\w`.
242258

243-
The pattern to take as much repetitions of `pattern:\w` as possible without backtracking is: `pattern:(?=(\w+))\1`.
259+
That may seem odd, but it's actually a very simple transform.
244260

245261
Let's decipher it:
262+
246263
- Lookahead `pattern:?=` looks forward for the longest word `pattern:\w+` starting at the current position.
247264
- The contents of parentheses with `pattern:?=...` isn't memorized by the engine, so wrap `pattern:\w+` into parentheses. Then the engine will memorize their contents
248265
- ...And allow us to reference it in the pattern as `pattern:\1`.
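
The steps just deciphered can be tried out with a short sketch. The first part contrasts a plain greedy `pattern:\w+` with the lookahead form. The second part applies the same transform to the words regexp from the example; that combined pattern is one possible way to write it, offered here as an illustration.

```js
// Greedy \w+ first takes the whole word, then backtracks to leave "Script".
const greedy = "JavaScript".match(/\w+Script/);
console.log(greedy[0]); // JavaScript

// With the lookahead form, \1 always holds the whole word and cannot shrink,
// so nothing is left for "Script" and there is no match.
const noBacktrack = "JavaScript".match(/(?=(\w+))\1Script/);
console.log(noBacktrack); // null

// The same transform applied to \w+ inside the original pattern.
// The lookahead group is the second pair of parentheses, hence \2.
const words = /^((?=(\w+))\2\s?)*$/;
console.log(words.test("A good string")); // true
console.log(words.test("An input string that takes a long time or even makes this regexp hang!")); // false
```

Because each word is consumed as a whole, the engine no longer tries to split it into several `pattern:\w+` pieces, and the second test returns `false` without hanging.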
