Commit acabd20

carenas authored and gitster committed
grep: correctly identify utf-8 characters with \{b,w} in -P

When UTF is enabled for a PCRE match, the corresponding flags are
added to the pcre2_compile() call, but PCRE2_UCP wasn't included.

This prevents extending the meaning of the character classes to
include those new valid characters and therefore results in failed
matches for expressions that rely on that extension, for example:

  $ git grep -P '\bÆvar'

Add PCRE2_UCP so that \w will include Æ and therefore \b can
correctly match the beginning of that word.

This has an impact on performance that has been estimated to be
between 20% and 40%, as shown by the added performance test.

Signed-off-by: Carlo Marcelo Arenas Belón <[email protected]>
Acked-by: Ævar Arnfjörð Bjarmason <[email protected]>
Signed-off-by: Junio C Hamano <[email protected]>

1 parent c48035d commit acabd20

2 files changed, 43 insertions(+), 1 deletion(-)

grep.c

Lines changed: 1 addition & 1 deletion
@@ -293,7 +293,7 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
 		options |= PCRE2_CASELESS;
 	}
 	if (!opt->ignore_locale && is_utf8_locale() && !literal)
-		options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
+		options |= (PCRE2_UTF | PCRE2_UCP | PCRE2_MATCH_INVALID_UTF);
 
 #ifndef GIT_PCRE2_VERSION_10_36_OR_HIGHER
 	/* Work around https://bugs.exim.org/show_bug.cgi?id=2642 fixed in 10.36 */

t/perf/p7822-grep-perl-character.sh

Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
+#!/bin/sh
+
+test_description="git-grep's perl regex
+
+If GIT_PERF_GREP_THREADS is set to a list of threads (e.g. '1 4 8'
+etc.) we will test the patterns under those numbers of threads.
+"
+
+. ./perf-lib.sh
+
+test_perf_large_repo
+test_checkout_worktree
+
+if test -n "$GIT_PERF_GREP_THREADS"
+then
+	test_set_prereq PERF_GREP_ENGINES_THREADS
+fi
+
+for pattern in \
+	'\\bhow' \
+	'\\bÆvar' \
+	'\\d+ \\bÆvar' \
+	'\\bBelón\\b' \
+	'\\w{12}\\b'
+do
+	echo '$pattern' >pat
+	if ! test_have_prereq PERF_GREP_ENGINES_THREADS
+	then
+		test_perf "grep -P '$pattern'" --prereq PCRE "
+			git -P grep -f pat || :
+		"
+	else
+		for threads in $GIT_PERF_GREP_THREADS
+		do
+			test_perf "grep -P '$pattern' with $threads threads" --prereq PTHREADS,PCRE "
+				git -c grep.threads=$threads -P grep -f pat || :
+			"
+		done
+	fi
+done
+
+test_done
