Regex doesn't handle surrogate pairs properly #15
Hi, thank you for providing a great regular expression library!
I have noticed that brics handles the input regex string as a sequence of `java.lang.Character`s rather than code points, which can cause somewhat unintuitive behavior.
For example, 𠀋 < 𠮟 < 𡵅 as Unicode scalar values (0x2000b, 0x20b9f, and 0x21d45 respectively, all of which are encoded as surrogate pairs in UTF-16), but the automaton created from `[𠀋-𡵅]` doesn't accept 𠮟:
```java
import java.util.regex.Pattern;

import dk.brics.automaton.RegExp;

public class SurrogatePairTest {
    // Guard: each test input must consist of exactly one Unicode code point.
    private static void checkOneCodePoint(final String s) {
        if (s.codePointCount(0, s.length()) != 1) throw new IllegalArgumentException();
    }

    // Builds the character class [a-c] with brics and runs the automaton on b.
    public static boolean testBrics(final String a, final String b, final String c) {
        checkOneCodePoint(a); checkOneCodePoint(b); checkOneCodePoint(c);
        final RegExp regex = new RegExp("[" + a + "-" + c + "]");
        return regex.toAutomaton().run(b);
    }

    // Same character class, matched with java.util.regex for comparison.
    public static boolean testJava(final String a, final String b, final String c) {
        checkOneCodePoint(a); checkOneCodePoint(b); checkOneCodePoint(c);
        final Pattern pattern = Pattern.compile("[" + a + "-" + c + "]");
        return pattern.matcher(b).matches();
    }

    public static void main(final String[] args) {
        final String a = new String(new int[]{0x2000b}, 0, 1); // 𠀋
        final String b = new String(new int[]{0x20b9f}, 0, 1); // 𠮟
        final String c = new String(new int[]{0x21d45}, 0, 1); // 𡵅
        System.out.println(testBrics(a, b, c)); // false
        System.out.println(testJava(a, b, c)); // true
    }
}
```

Fixing this would require us to:
- Read the regex string as a code point stream (not `java.lang.Character` by `java.lang.Character`). This also includes fixes for operator precedence, as in `𠀋+`.
- Convert the code points to `java.lang.Character`s, and if they involve surrogate pairs, do something similar to what we do for the numerical interval `<n-m>`.
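
To make the two steps above a bit more concrete, here is a rough, self-contained sketch. It is not brics code: the class `SurrogateRangeSketch` and the methods `codePoints`, `splitSupplementaryRange`, and `accepts` are hypothetical names made up for illustration, and it only covers ranges that lie entirely in the supplementary planes. The idea is to read the regex by code point, then split a code point range into one piece per high surrogate, each with its own range of low surrogates.

```java
import java.util.ArrayList;
import java.util.List;

public class SurrogateRangeSketch {

    // Step 1: view the regex string as code points instead of UTF-16 chars.
    static int[] codePoints(String regex) {
        return regex.codePoints().toArray();
    }

    // Step 2: split a supplementary code point range [from, to] into pieces.
    // Each piece is {high surrogate, lowest low surrogate, highest low surrogate}.
    // Assumption: both bounds are supplementary code points (>= 0x10000).
    static List<char[]> splitSupplementaryRange(int from, int to) {
        if (from < Character.MIN_SUPPLEMENTARY_CODE_POINT
                || to > Character.MAX_CODE_POINT || from > to) {
            throw new IllegalArgumentException();
        }
        List<char[]> pieces = new ArrayList<>();
        char hiFrom = Character.highSurrogate(from);
        char hiTo = Character.highSurrogate(to);
        for (int hi = hiFrom; hi <= hiTo; hi++) {
            // Only the first and last pieces have a restricted low surrogate range.
            char loMin = (hi == hiFrom) ? Character.lowSurrogate(from) : Character.MIN_LOW_SURROGATE;
            char loMax = (hi == hiTo) ? Character.lowSurrogate(to) : Character.MAX_LOW_SURROGATE;
            pieces.add(new char[]{(char) hi, loMin, loMax});
        }
        return pieces;
    }

    // Checks whether s is a single surrogate pair covered by one of the pieces.
    static boolean accepts(List<char[]> pieces, String s) {
        if (s.length() != 2) return false;
        for (char[] p : pieces) {
            if (s.charAt(0) == p[0] && p[1] <= s.charAt(1) && s.charAt(1) <= p[2]) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String a = new String(Character.toChars(0x2000B));
        String c = new String(Character.toChars(0x21D45));
        String regex = "[" + a + "-" + c + "]";
        System.out.println(regex.length());           // 7 UTF-16 chars
        System.out.println(codePoints(regex).length); // 5 code points

        List<char[]> pieces = splitSupplementaryRange(0x2000B, 0x21D45);
        for (char[] p : pieces) {
            System.out.printf("%04X [%04X-%04X]%n", (int) p[0], (int) p[1], (int) p[2]);
        }
        System.out.println(accepts(pieces, new String(Character.toChars(0x20B9F)))); // true
    }
}
```

For the range in this report, the sketch yields eight pieces, from `D840 [DC0B-DFFF]` through `D847 [DC00-DD45]`. In brics, each piece could presumably be expressed as a single-char automaton for the high surrogate concatenated with a char-range automaton for the low surrogates, with the pieces then unioned together, though I have not tried this against the actual parser.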
Although a won't-fix would totally make sense, it'd be great if this behavior were mentioned in the documentation.
Thanks,