Regex doesn't handle surrogate pairs properly #15

@MiSawa

Hi, thank you for providing a great regular expression library!

I noticed that brics treats the input regex string as a sequence of java.lang.Character values (UTF-16 code units), which can lead to somewhat unintuitive behavior.

For example, 𠀋 < 𠮟 < 𡵅 when compared as Unicode scalar values (0x2000b, 0x20b9f, and 0x21d45 respectively, each of which is encoded as a surrogate pair), but the automaton created from [𠀋-𡵅] doesn't accept 𠮟.

import java.util.regex.Pattern;

import dk.brics.automaton.RegExp;

public class SurrogatePairRepro {
    // Rejects any string that is not exactly one Unicode code point.
    private static void checkOneCodePoint(final String s) {
        if (s.codePointCount(0, s.length()) != 1) throw new IllegalArgumentException();
    }

    // Does the brics automaton built from [a-c] accept b?
    public static boolean testBrics(final String a, final String b, final String c) {
        checkOneCodePoint(a); checkOneCodePoint(b); checkOneCodePoint(c);
        final RegExp regex = new RegExp("[" + a + "-" + c + "]");
        return regex.toAutomaton().run(b);
    }

    // Does java.util.regex match b against the same character class?
    public static boolean testJava(final String a, final String b, final String c) {
        checkOneCodePoint(a); checkOneCodePoint(b); checkOneCodePoint(c);
        final Pattern pattern = Pattern.compile("[" + a + "-" + c + "]");
        return pattern.matcher(b).matches();
    }

    public static void main(final String[] args) {
        final String a = new String(new int[]{0x2000b}, 0, 1); // 𠀋
        final String b = new String(new int[]{0x20b9f}, 0, 1); // 𠮟
        final String c = new String(new int[]{0x21d45}, 0, 1); // 𡵅
        System.out.println(testBrics(a, b, c)); // false
        System.out.println(testJava(a, b, c)); // true
    }
}
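
To make the cause a bit more visible, here is a small sketch that dumps the UTF-16 code units of the pattern string. It uses only the JDK and is not taken from the brics source; the class name `CodeUnitDump` and the helper `codeUnits` are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class CodeUnitDump {
    /** Returns the UTF-16 code units of s as four-digit hex strings (hypothetical helper). */
    static List<String> codeUnits(final String s) {
        final List<String> out = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            // Cast to int: %X does not accept a char argument.
            out.add(String.format("%04X", (int) s.charAt(i)));
        }
        return out;
    }

    public static void main(final String[] args) {
        final String a = new String(Character.toChars(0x2000b)); // 𠀋
        final String c = new String(Character.toChars(0x21d45)); // 𡵅
        // prints [005B, D840, DC0B, 002D, D847, DD45, 005D]
        System.out.println(codeUnits("[" + a + "-" + c + "]"));
    }
}
```

If the parser consumes this one java.lang.Character at a time, the class presumably contains the single unit D840, the range DC0B-D847 (whose lower bound is larger than its upper bound), and the single unit DD45, rather than one range over code points. 𠮟 is the two-unit string D842 DF9F, so a class that matches exactly one unit cannot accept it.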

Fixing this would require us to:

  • Read the regex string as a stream of code points rather than one java.lang.Character at a time. This also affects operator precedence: in 𠀋+, the + should apply to the whole code point rather than only to its trailing surrogate.
  • Convert the code points back to java.lang.Character sequences and, where a range involves surrogate pairs, do something similar to what is done for numerical intervals <n-m>.
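
The two steps above could be sketched roughly as follows. This is only an illustration using the JDK, assuming both ends of the range are supplementary code points; `SurrogateRangeSketch`, `Piece`, `toCodePoints`, `split`, and `accepts` are hypothetical names and not part of brics.

```java
import java.util.ArrayList;
import java.util.List;

class SurrogateRangeSketch {
    /** Lead units in [leadMin, leadMax] followed by trail units in [trailMin, trailMax]. */
    static final class Piece {
        final char leadMin, leadMax, trailMin, trailMax;

        Piece(final char leadMin, final char leadMax, final char trailMin, final char trailMax) {
            this.leadMin = leadMin;
            this.leadMax = leadMax;
            this.trailMin = trailMin;
            this.trailMax = trailMax;
        }

        boolean matches(final String s) {
            return s.length() == 2
                && leadMin <= s.charAt(0) && s.charAt(0) <= leadMax
                && trailMin <= s.charAt(1) && s.charAt(1) <= trailMax;
        }

        @Override
        public String toString() {
            return String.format("[%04X-%04X][%04X-%04X]",
                (int) leadMin, (int) leadMax, (int) trailMin, (int) trailMax);
        }
    }

    /** Step 1: read the regex by code point, so a surrogate pair stays one symbol. */
    static int[] toCodePoints(final String regex) {
        return regex.codePoints().toArray();
    }

    /** Step 2: split a range of supplementary code points into at most three pieces. */
    static List<Piece> split(final int min, final int max) {
        if (min > max || !Character.isSupplementaryCodePoint(min)
                || !Character.isSupplementaryCodePoint(max)) {
            throw new IllegalArgumentException();
        }
        final char leadMin = Character.highSurrogate(min);
        final char trailMin = Character.lowSurrogate(min);
        final char leadMax = Character.highSurrogate(max);
        final char trailMax = Character.lowSurrogate(max);
        final List<Piece> pieces = new ArrayList<>();
        if (leadMin == leadMax) {
            // Both ends share a lead unit: one piece is enough.
            pieces.add(new Piece(leadMin, leadMax, trailMin, trailMax));
            return pieces;
        }
        // First lead unit: from trailMin up to the largest trail unit.
        pieces.add(new Piece(leadMin, leadMin, trailMin, Character.MAX_LOW_SURROGATE));
        // Lead units strictly in between: every trail unit is allowed.
        if (leadMax - leadMin > 1) {
            pieces.add(new Piece((char) (leadMin + 1), (char) (leadMax - 1),
                Character.MIN_LOW_SURROGATE, Character.MAX_LOW_SURROGATE));
        }
        // Last lead unit: from the smallest trail unit up to trailMax.
        pieces.add(new Piece(leadMax, leadMax, Character.MIN_LOW_SURROGATE, trailMax));
        return pieces;
    }

    /** True if s is a two-unit string matched by any piece. */
    static boolean accepts(final List<Piece> pieces, final String s) {
        for (final Piece p : pieces) {
            if (p.matches(s)) return true;
        }
        return false;
    }

    public static void main(final String[] args) {
        final List<Piece> pieces = split(0x2000b, 0x21d45);
        // prints [[D840-D840][DC0B-DFFF], [D841-D846][DC00-DFFF], [D847-D847][DC00-DD45]]
        System.out.println(pieces);
        System.out.println(accepts(pieces, new String(Character.toChars(0x20b9f)))); // true
    }
}
```

With this split, `toCodePoints("𠀋+")` yields two symbols (0x2000b and '+'), so the operator can bind to the whole character. Inside brics, each piece could presumably be expressed as a concatenation of two char ranges and the pieces joined by union, but I have not checked how well that fits the existing parser.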

Although a won't-fix would totally make sense, it'd be great if this behavior were mentioned in the documentation.

Thanks,
