Skip to content

[Java] Improve cucumber expression creation performance #202

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 15 commits into from
Jan 15, 2023
Merged
Show file tree
Hide file tree
Changes from 4 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 3 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,9 @@ and this project adheres to [Semantic Versioning](http://semver.org/).

## [Unreleased]

### Fixed
- [Java] Improve cucumber expression creation performance ([#202](https://github.com/cucumber/cucumber-expressions/pull/202))

## [16.1.1] - 2022-12-08
### Fixed
- [Java] Improve expression creation performance ([#187](https://github.com/cucumber/cucumber-expressions/pull/187), [#189](https://github.com/cucumber/cucumber-expressions/pull/189))
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -24,7 +24,27 @@

@API(status = API.Status.STABLE)
public final class CucumberExpression implements Expression {
private static final Pattern ESCAPE_PATTERN = Pattern.compile("[\\\\^\\[({$.|?*+})\\]]");
/**
* List of characters to be escaped.
* The last char is '}' with index 125, so we need only 126 characters.
*/
private static final boolean[] CHAR_TO_ESCAPE = new boolean[126];
static {
CHAR_TO_ESCAPE['^'] = true;
CHAR_TO_ESCAPE['$'] = true;
CHAR_TO_ESCAPE['['] = true;
CHAR_TO_ESCAPE[']'] = true;
CHAR_TO_ESCAPE['('] = true;
CHAR_TO_ESCAPE[')'] = true;
CHAR_TO_ESCAPE['{'] = true;
CHAR_TO_ESCAPE['}'] = true;
CHAR_TO_ESCAPE['.'] = true;
CHAR_TO_ESCAPE['|'] = true;
CHAR_TO_ESCAPE['?'] = true;
CHAR_TO_ESCAPE['*'] = true;
CHAR_TO_ESCAPE['+'] = true;
CHAR_TO_ESCAPE['\\'] = true;
}
private final List<ParameterType<?>> parameterTypes = new ArrayList<>();
private final String source;
private final TreeRegexp treeRegexp;
Expand Down Expand Up @@ -61,9 +81,18 @@ private String rewriteToRegex(Node node) {
}

private static String escapeRegex(String text) {
return ESCAPE_PATTERN.matcher(text).replaceAll("\\\\$0");
int length = text.length();
StringBuilder sb = new StringBuilder(length * 2); // prevent resizes
int maxChar = CHAR_TO_ESCAPE.length;
for (int i = 0; i < length; i++) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that in practice expressions with characters that need to be escaped are rare. Would it make sense to check the entire string for characters that need to be replaced first? It would save an array copy.

Copy link
Contributor Author

@jkronegg jkronegg Jan 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we rarely escape in practice, the situation is a bit different. We need to benchmark the variants based on the escaping frequency, not only on pure escaping performance.

I've implemented two other variants, so to summarize:

Variant Description
escapeRegex0 Pattern (cucumber 7.11.0)
escapeRegex1 append(char) with 1 array allocation and 1 array copy
escapeRegex2 append(String) with 1 array allocation and 1 array copy
escapeRegex3 append(char) with 1 array allocation and 0..1 array copy
escapeRegex4 append(String) with 0..1 array allocation and 0..1 array copy

I benchmarked these variants with JMH for varying escaping frequencies:

Cucumber_escapeRegex_performance

The variant escapeRegex4 outperforms the other variants for all escaping frequencies. Depending on escaping frequency, this variant is 1.2 to 3 times faster the cucumber 7.11.0 implementation. The code is a bit more complex, but still understandable, so I think we can use this variant:

public static String escapeRegex4(String text) {
    int length = text.length();
    StringBuilder sb = null;
    int blocStart=0;
    int maxChar = CHAR_TO_ESCAPE.length;
    for (int i = 0; i < length; i++) {
        char currentChar = text.charAt(i);
        if (currentChar < maxChar && CHAR_TO_ESCAPE[currentChar]) {
            if (sb == null) {
                sb = new StringBuilder(length * 2);
            }
            if (i > blocStart) {
                // flush previous block
                sb.append(text, blocStart, i);
            }
            sb.append('\\');
            sb.append(currentChar);
            blocStart=i+1;
        }
    }
    if (sb != null) {
        if (length > blocStart) {
            // flush remaining characters
            sb.append(text, blocStart, length);
        }
        return sb.toString();
    }
    return text;
}

In real conditions, on my project with about ~400 test scenarios runs in about 8.5 seconds. The InteliJ profiler says escapeRegex0 takes 3.37% of the total CPU time (=280 ms), while escapeRegex4 takes 0.97% of the total (=80 ms). That's a 200 ms improvement (2.3%): not a lot on one build (I was hoping for more), but worth the effort when considering the number of builds on all cucumber users.

For completeness, the escapeRegex3 variant is:

public static String escapeRegex3(String text) {
    int length = text.length();
    StringBuilder sb = new StringBuilder(length * 2); // prevent resizes
    int maxChar = CHAR_TO_ESCAPE.length;
    boolean escaped = false;
    for (int i = 0; i < length; i++) {
        char currentChar = text.charAt(i);
        if (currentChar < maxChar && CHAR_TO_ESCAPE[currentChar]) {
            sb.append('\\');
            escaped = true;
        }
        sb.append(currentChar);
    }
    return escaped ? sb.toString() : text;
}

char currentChar = text.charAt(i);
if (currentChar < maxChar && CHAR_TO_ESCAPE[currentChar]) {
sb.append('\\');
}
sb.append(currentChar);
}
return sb.toString();
}


private String rewriteOptional(Node node) {
assertNoParameters(node, astNode -> createParameterIsNotAllowedInOptional(astNode, source));
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -12,6 +12,8 @@

import java.io.IOException;
import java.io.InputStream;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.lang.reflect.Type;
import java.math.BigDecimal;
import java.math.BigInteger;
Expand Down Expand Up @@ -192,4 +194,34 @@ private List<?> match(String expr, String text, ParameterTypeRegistry parameterT
return list;
}
}

@Test
void escape_all_regexp_characters() throws InvocationTargetException, NoSuchMethodException, IllegalAccessException {
assertEquals("\\^\\$\\[\\]\\(\\)\\{\\}\\.\\|\\?\\*\\+\\\\", delegateEscapeRegex("^$[](){}.|?*+\\"));
}

@Test
void escape_escaped_regexp_characters() throws InvocationTargetException, NoSuchMethodException, IllegalAccessException {
assertEquals("\\^\\$\\[\\]\\\\\\(\\\\\\)\\{\\}\\\\\\\\\\.\\|\\?\\*\\+", delegateEscapeRegex("^$[]\\(\\){}\\\\.|?*+"));
}


@Test
void do_not_escape_when_there_is_nothing_to_escape() throws InvocationTargetException, NoSuchMethodException, IllegalAccessException {
assertEquals("dummy", delegateEscapeRegex("dummy"));
}

@Test
void escapeRegex_gives_no_error_for_unicode_characters() throws InvocationTargetException, NoSuchMethodException, IllegalAccessException {
assertEquals("🥒", delegateEscapeRegex("🥒"));
}

private String delegateEscapeRegex(String expression) throws NoSuchMethodException, InvocationTargetException, IllegalAccessException {
// this delegate is used to test the private method `escapeRegex(String)`
ParameterTypeRegistry context = new ParameterTypeRegistry(Locale.ENGLISH);
Method escapeRegex = CucumberExpression.class.getDeclaredMethod("escapeRegex", String.class);
escapeRegex.setAccessible(true);
return (String)escapeRegex.invoke(new CucumberExpression("", context), expression);
}

}