Skip to content
Merged
Show file tree
Hide file tree
Changes from 4 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 5 additions & 0 deletions docs/changelog/127924.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
pr: 127924
summary: Limit Replace function memory usage
area: ES|QL
type: enhancement
issues: []

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

Original file line number Diff line number Diff line change
Expand Up @@ -28,9 +28,11 @@
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

import static org.elasticsearch.common.unit.ByteSizeUnit.MB;
import static org.elasticsearch.xpack.esql.core.expression.TypeResolutions.ParamOrdinal.FIRST;
import static org.elasticsearch.xpack.esql.core.expression.TypeResolutions.ParamOrdinal.SECOND;
import static org.elasticsearch.xpack.esql.core.expression.TypeResolutions.ParamOrdinal.THIRD;
Expand All @@ -39,6 +41,8 @@
public class Replace extends EsqlScalarFunction {
public static final NamedWriteableRegistry.Entry ENTRY = new NamedWriteableRegistry.Entry(Expression.class, "Replace", Replace::new);

static final long MAX_RESULT_LENGTH = MB.toBytes(1);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is based on the same Repeat logic:

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might want to shove this in some common spot. I have no idea what a good one is. And probably javadoc it too.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: We could make both use the same constant, then future readers can easily see that it's being used in 2 cases which tackle a common problem.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved it to ScalarFunction. I don't expect aggs to need this, but we could move it later


private final Expression str;
private final Expression regex;
private final Expression newStr;
Expand Down Expand Up @@ -121,15 +125,15 @@ public boolean foldable() {
return str.foldable() && regex.foldable() && newStr.foldable();
}

@Evaluator(extraName = "Constant", warnExceptions = PatternSyntaxException.class)
@Evaluator(extraName = "Constant", warnExceptions = IllegalArgumentException.class)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PatternSyntaxException is an IllegalArgumentException. If we choose another exception for this PR, we would have to restore PatternSyntaxException

static BytesRef process(BytesRef str, @Fixed Pattern regex, BytesRef newStr) {
if (str == null || regex == null || newStr == null) {
return null;
}
return new BytesRef(regex.matcher(str.utf8ToString()).replaceAll(newStr.utf8ToString()));
return safeReplace(str, regex, newStr);
}

@Evaluator(warnExceptions = PatternSyntaxException.class)
@Evaluator(warnExceptions = IllegalArgumentException.class)
static BytesRef process(BytesRef str, BytesRef regex, BytesRef newStr) {
if (str == null) {
return null;
Expand All @@ -138,7 +142,30 @@ static BytesRef process(BytesRef str, BytesRef regex, BytesRef newStr) {
if (regex == null || newStr == null) {
return str;
}
return new BytesRef(str.utf8ToString().replaceAll(regex.utf8ToString(), newStr.utf8ToString()));
return safeReplace(str, Pattern.compile(regex.utf8ToString()), newStr);
}

/**
* Executes a Replace without surpassing the memory limit.
*/
private static BytesRef safeReplace(BytesRef strBytesRef, Pattern regex, BytesRef newStrBytesRef) {
String str = strBytesRef.utf8ToString();
Matcher m = regex.matcher(str);
if (false == m.find()) {
return strBytesRef;
}
String newStr = newStrBytesRef.utf8ToString();
// Initialize the buffer with an approximate size for the first replacement
StringBuilder result = new StringBuilder(str.length() + newStr.length() + 8);
do {
m.appendReplacement(result, newStr);

if (result.length() > MAX_RESULT_LENGTH) {
Copy link
Contributor Author

@ivancea ivancea May 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're checking after the replacement. This has a downside: if the replacement uses a group ($1), it could, theoretically, still generate a big string (E.g. Replace("<100 chars>", ".+", "$0<...x 100>") == 10.000 chars.
We could take this into account, and use something like:

int matchSize = m.end() - m.start();
int potentialReplacementSize = matchSize * matchSize / 2; // "$0" == 1 repetition
int remainingStr = str.length() - m.end();
if (result.length + potentialReplacementSize + remainingStr > MAX_RESULT_LENGTH) {
    throw ...:
}

Opinions? How predictive should we be?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, the estimate matchSize * newStr.length() could be too conservative. If I have a 2kb input string, the regex matches the whole thing, and I want to replace it by another 1kb string (no group refs!), this would trigger as a false positive. Not sure how restrictive this is in practice.

Tt looks like appendReplacement only allocates memory through the string builder even in case of group references, although I could've missed something - only skimmed the implementation. But, if we want to be precise, we could maybe inherit from StringBuilder and make it perform our more strict checks every time we append? I don't know if the effort is worth it.

Maybe a simpler approach is to count how many group references newStr actually has, but that could tank performance :/

@nik9000, what do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure either. I was chatting with @ivancea over slack. I kind of think we should take our gloves off and build this directly appending to a BreakingBytesRefBuilder. That way we're reusing the bytes and don't have to copy to encode to utf-8. We have the JVM as a testing oracle so we can clean room implement it. That's probably a few days of work though. All because we don't have proper monomorphization.

Copy link
Contributor Author

@ivancea ivancea May 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just pushed this algorithm, taking into account the number of groups. So it's quite more specific. Still quite conservative, but not that much I think? And if you don't use groups, it's nearly pixel-perfect

throw new IllegalArgumentException("Creating strings with more than [" + MAX_RESULT_LENGTH + "] bytes is not supported");
}
} while (m.find());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comparing this with the implementation of replaceAll, this seems to be equivalent, so I think we don't accidentally change the semantics. Nice.

m.appendTail(result);
return new BytesRef(result.toString());
}

@Override
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -42,6 +42,7 @@
import java.util.stream.Collectors;
import java.util.stream.IntStream;

import static org.elasticsearch.common.unit.ByteSizeUnit.GB;
import static org.elasticsearch.xpack.esql.EsqlTestUtils.unboundLogicalOptimizerContext;
import static org.hamcrest.Matchers.either;
import static org.hamcrest.Matchers.equalTo;
Expand Down Expand Up @@ -304,11 +305,17 @@ public final void testEvaluateInManyThreads() throws ExecutionException, Interru
if (testCase.getExpectedBuildEvaluatorWarnings() != null) {
assertWarnings(testCase.getExpectedBuildEvaluatorWarnings());
}

List<Object> simpleData = testCase.getDataValues();
// Ensure we don't run this test with too much data that could take too long to process.
// The calculation "ramUsed * count" is just a hint of how much data will the function process,
// and the limit is arbitrary
assumeTrue("Input data too big", row(simpleData).ramBytesUsedByBlocks() * count < GB.toBytes(1));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why's this needed? Did we make REPLACE super slow?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I added the test with a lot of data to the ReplaceTests file, this specific case was very slow (3s -> 40s).

Anyway, I just moved the big cases to another file, so they're executed just once. Same as RepeatStaticTests does. So this isn't needed anymore


ExecutorService exec = Executors.newFixedThreadPool(threads);
try {
List<Future<?>> futures = new ArrayList<>();
for (int i = 0; i < threads; i++) {
List<Object> simpleData = testCase.getDataValues();
Page page = row(simpleData);

futures.add(exec.submit(() -> {
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -22,6 +22,7 @@
import java.util.function.Supplier;
import java.util.regex.PatternSyntaxException;

import static org.elasticsearch.xpack.esql.expression.function.scalar.string.Replace.MAX_RESULT_LENGTH;
import static org.hamcrest.Matchers.equalTo;

public class ReplaceTests extends AbstractScalarFunctionTestCase {
Expand Down Expand Up @@ -85,7 +86,7 @@ public static Iterable<Object[]> parameters() {
return new TestCaseSupplier.TestCase(
List.of(
new TestCaseSupplier.TypedData(new BytesRef(text), DataType.KEYWORD, "str"),
new TestCaseSupplier.TypedData(new BytesRef(invalidRegex), DataType.KEYWORD, "oldStr"),
new TestCaseSupplier.TypedData(new BytesRef(invalidRegex), DataType.KEYWORD, "regex"),
new TestCaseSupplier.TypedData(new BytesRef(newStr), DataType.KEYWORD, "newStr")
),
"ReplaceEvaluator[str=Attribute[channel=0], regex=Attribute[channel=1], newStr=Attribute[channel=2]]",
Expand All @@ -103,6 +104,27 @@ public static Iterable<Object[]> parameters() {
"Unclosed character class near index 0\n[\n^".replaceAll("\n", System.lineSeparator())
);
}));

suppliers.add(new TestCaseSupplier("result too big", List.of(DataType.KEYWORD, DataType.KEYWORD, DataType.KEYWORD), () -> {
String textAndNewStr = randomAlphaOfLength((int) (MAX_RESULT_LENGTH / 10));
String regex = ".";
return new TestCaseSupplier.TestCase(
List.of(
new TestCaseSupplier.TypedData(new BytesRef(textAndNewStr), DataType.KEYWORD, "str"),
new TestCaseSupplier.TypedData(new BytesRef(regex), DataType.KEYWORD, "regex"),
new TestCaseSupplier.TypedData(new BytesRef(textAndNewStr), DataType.KEYWORD, "newStr")
),
"ReplaceEvaluator[str=Attribute[channel=0], regex=Attribute[channel=1], newStr=Attribute[channel=2]]",
DataType.KEYWORD,
equalTo(null)
).withWarning("Line 1:1: evaluation of [source] failed, treating result as null. Only first 20 failures recorded.")
.withWarning(
"Line 1:1: java.lang.IllegalArgumentException: "
+ "Creating strings with more than ["
+ MAX_RESULT_LENGTH
+ "] bytes is not supported"
);
}));
return parameterSuppliersFromTypedDataWithDefaultChecksNoErrors(false, suppliers);
}

Expand Down
Loading