ES|QL: Fix BytesRef2BlockHash #130705
New changelog entry (`@@ -0,0 +1,5 @@`):

```yaml
pr: 130705
summary: Fix `BytesRef2BlockHash`
area: ES|QL
type: bug
issues: []
```
Test file changes, first import hunk:

```java
@@ -11,6 +11,7 @@
import com.carrotsearch.randomizedtesting.annotations.ParametersFactory;

import org.apache.lucene.util.BytesRef;
import org.elasticsearch.common.lucene.BytesRefs;
import org.elasticsearch.common.unit.ByteSizeValue;
import org.elasticsearch.compute.aggregation.GroupingAggregatorFunction;
import org.elasticsearch.compute.data.Block;
```
Second import hunk:

```java
@@ -35,8 +36,10 @@
import org.elasticsearch.xpack.esql.core.util.Holder;

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Consumer;
import java.util.stream.IntStream;
import java.util.stream.LongStream;
```
New tests, added after `testLongNull`:

```java
@@ -1232,6 +1235,190 @@ public void testLongNull() {
        }, blockFactory.newLongArrayVector(values, values.length).asBlock(), blockFactory.newConstantNullBlock(values.length));
    }
```
```java
    public void test2BytesRefsKeys() {
        final Page page;
        final int positions1 = 100_000;
        final int positions2 = 10;
        final int totalPositions = positions1 * positions2;
        try (
            BytesRefBlock.Builder builder1 = blockFactory.newBytesRefBlockBuilder(totalPositions);
            BytesRefBlock.Builder builder2 = blockFactory.newBytesRefBlockBuilder(totalPositions)
        ) {
            for (int i = 0; i < positions1; i++) {
                for (int p = 0; p < positions2; p++) {
                    builder1.appendBytesRef(new BytesRef("abcdef" + i));
                    builder2.appendBytesRef(new BytesRef("abcdef" + p));
                }
            }
            page = new Page(builder1.build(), builder2.build());
        }
```
> **Review comment:** Do we want this all in one page? I know the […]
>
> **Reply:** Meh. I guess it's only like 9 MB. Probably not worth changing it unless you need to go bigger.
>
> **Reply:** Yeah, it's small enough, and it reproduced the problem pretty consistently, so I guess it's not worth the effort and complication.
```java
        record Output(int offset, IntBlock block, IntVector vector) implements Releasable {
            @Override
            public void close() {
                Releasables.close(block, vector);
            }
        }
        List<Output> output = new ArrayList<>();

        try (BlockHash hash1 = new BytesRef2BlockHash(blockFactory, 0, 1, totalPositions)) {
```
> **Review comment:** Can we randomize which key has high cardinality? It looks like this test only stuffs the high-cardinality key into the first position, which repros the bug, but in principle the other position might have the same problem (if we implemented things differently).
>
> **Reply:** 👍
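The randomization the reviewer asks for could look roughly like the sketch below. It models the two key channels as plain string lists instead of the real `BytesRefBlock` builders so that it runs standalone; all names here are illustrative, not the actual test code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class RandomKeyCardinalitySketch {
    public static void main(String[] args) {
        int highCardinality = 1_000; // scaled down from the test's 100_000
        int lowCardinality = 10;
        Random random = new Random();

        // Randomly decide which channel carries the high-cardinality key, so
        // both orderings are exercised rather than only the first position.
        boolean highCardFirst = random.nextBoolean();

        List<String> channel1 = new ArrayList<>();
        List<String> channel2 = new ArrayList<>();
        for (int i = 0; i < highCardinality; i++) {
            for (int p = 0; p < lowCardinality; p++) {
                String high = "abcdef" + i;
                String low = "abcdef" + p;
                channel1.add(highCardFirst ? high : low);
                channel2.add(highCardFirst ? low : high);
            }
        }

        // Either way, the channels pair up into
        // highCardinality * lowCardinality distinct key combinations.
        System.out.println(channel1.size()); // 10000
    }
}
```

In the real test this would amount to swapping which builder receives `"abcdef" + i` and which receives `"abcdef" + p`, based on a `randomBoolean()`.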
```java
            hash1.add(page, new GroupingAggregatorFunction.AddInput() {
                @Override
                public void add(int positionOffset, IntArrayBlock groupIds) {
                    groupIds.incRef();
                    output.add(new Output(positionOffset, groupIds, null));
                }

                @Override
                public void add(int positionOffset, IntBigArrayBlock groupIds) {
                    groupIds.incRef();
                    output.add(new Output(positionOffset, groupIds, null));
                }

                @Override
                public void add(int positionOffset, IntVector groupIds) {
                    groupIds.incRef();
                    output.add(new Output(positionOffset, null, groupIds));
                }

                @Override
                public void close() {
                    fail("hashes should not close AddInput");
                }
            });

            Block[] keys = hash1.getKeys();
            try {
                Set<String> distinctKeys = new HashSet<>();
                BytesRefBlock block0 = (BytesRefBlock) keys[0];
                BytesRefBlock block1 = (BytesRefBlock) keys[1];
                BytesRef scratch = new BytesRef();
                StringBuilder builder = new StringBuilder();
                for (int i = 0; i < totalPositions; i++) {
                    builder.setLength(0);
                    builder.append(BytesRefs.toString(block0.getBytesRef(i, scratch)));
                    builder.append("#");
                    builder.append(BytesRefs.toString(block1.getBytesRef(i, scratch)));
                    distinctKeys.add(builder.toString());
                }
                assertThat(distinctKeys.size(), equalTo(totalPositions));
```
> **Review comment:** Revert the fix and you'll see a failure here.
```java
            } finally {
                Releasables.close(keys);
            }
        } finally {
            Releasables.close(output);
            page.releaseBlocks();
        }
    }
```
```java
    public void test2BytesRefs() {
```

> **Review comment:** This is the same test we had for BytesRef3BlockHash.
```java
        final Page page;
        final int positions = randomIntBetween(1, 1000);
        final boolean generateVector = randomBoolean();
        try (
            BytesRefBlock.Builder builder1 = blockFactory.newBytesRefBlockBuilder(positions);
            BytesRefBlock.Builder builder2 = blockFactory.newBytesRefBlockBuilder(positions)
        ) {
            List<BytesRefBlock.Builder> builders = List.of(builder1, builder2);
            for (int p = 0; p < positions; p++) {
                for (BytesRefBlock.Builder builder : builders) {
                    int valueCount = generateVector ? 1 : between(0, 3);
                    switch (valueCount) {
                        case 0 -> builder.appendNull();
                        case 1 -> builder.appendBytesRef(new BytesRef(Integer.toString(between(1, 100))));
                        default -> {
                            builder.beginPositionEntry();
                            for (int v = 0; v < valueCount; v++) {
                                builder.appendBytesRef(new BytesRef(Integer.toString(between(1, 100))));
                            }
                            builder.endPositionEntry();
                        }
                    }
                }
            }
            page = new Page(builder1.build(), builder2.build());
        }
        final int emitBatchSize = between(positions, 10 * 1024);
        var groupSpecs = List.of(new BlockHash.GroupSpec(0, ElementType.BYTES_REF), new BlockHash.GroupSpec(1, ElementType.BYTES_REF));
        record Output(int offset, IntBlock block, IntVector vector) implements Releasable {
            @Override
            public void close() {
                Releasables.close(block, vector);
            }
        }
        List<Output> output1 = new ArrayList<>();
        List<Output> output2 = new ArrayList<>();
        try (
            BlockHash hash1 = new BytesRef2BlockHash(blockFactory, 0, 1, emitBatchSize);
            BlockHash hash2 = new PackedValuesBlockHash(groupSpecs, blockFactory, emitBatchSize)
        ) {
            hash1.add(page, new GroupingAggregatorFunction.AddInput() {
                @Override
                public void add(int positionOffset, IntArrayBlock groupIds) {
                    groupIds.incRef();
                    output1.add(new Output(positionOffset, groupIds, null));
                }

                @Override
                public void add(int positionOffset, IntBigArrayBlock groupIds) {
                    groupIds.incRef();
                    output1.add(new Output(positionOffset, groupIds, null));
                }

                @Override
                public void add(int positionOffset, IntVector groupIds) {
                    groupIds.incRef();
                    output1.add(new Output(positionOffset, null, groupIds));
                }

                @Override
                public void close() {
                    fail("hashes should not close AddInput");
                }
            });
            hash2.add(page, new GroupingAggregatorFunction.AddInput() {
                @Override
                public void add(int positionOffset, IntArrayBlock groupIds) {
                    groupIds.incRef();
                    output2.add(new Output(positionOffset, groupIds, null));
                }

                @Override
                public void add(int positionOffset, IntBigArrayBlock groupIds) {
                    groupIds.incRef();
                    output2.add(new Output(positionOffset, groupIds, null));
                }

                @Override
                public void add(int positionOffset, IntVector groupIds) {
                    groupIds.incRef();
                    output2.add(new Output(positionOffset, null, groupIds));
                }

                @Override
                public void close() {
                    fail("hashes should not close AddInput");
                }
            });
            assertThat(output1.size(), equalTo(output2.size()));
            for (int i = 0; i < output1.size(); i++) {
                Output o1 = output1.get(i);
                Output o2 = output2.get(i);
                assertThat(o1.offset, equalTo(o2.offset));
                if (o1.vector != null) {
                    assertNull(o1.block);
                    assertThat(o1.vector, equalTo(o2.vector != null ? o2.vector : o2.block.asVector()));
                } else {
                    assertNull(o2.vector);
                    assertThat(o1.block, equalTo(o2.block));
                }
            }
        } finally {
            Releasables.close(output1);
            Releasables.close(output2);
            page.releaseBlocks();
        }
    }
```
```java
    public void test3BytesRefs() {
        final Page page;
        final int positions = randomIntBetween(1, 1000);
```
A one-line fix in the existing `test3BytesRefs`:

```diff
@@ -1326,7 +1513,7 @@ public void close() {
                 fail("hashes should not close AddInput");
             }
         });
-        assertThat(output1.size(), equalTo(output1.size()));
+        assertThat(output1.size(), equalTo(output2.size()));
         for (int i = 0; i < output1.size(); i++) {
             Output o1 = output1.get(i);
             Output o2 = output2.get(i);
```

> **Review comment:** A typo probably.
> **Review comment:** Shouldn't this (and below) be `if (k <= 0) {`?
>
> **Reply:** It can't be `0` because we grab the low bits. Right? I haven't had enough coffee for bit math.
>
> **Reply:** Could the `finalHash.get(i)` not return something like `0xe0000000L` (or some other value with the 32nd bit set)? The `hash1.hash.get()` method seems to expect a non-negative argument. It might all be safe, it just stood out to me since we're truncating a long.
>
> **Reply:** These are always positive; they are generated by BytesRefBlockHash, see `hashOrdToGroupNullReserved`.
>
> **Reply:** I guess one of the implicit assumptions is that a block hash will never have more than 2B positions. If it does, we'll need a larger data structure anyway.
>
> **Reply:** To make sure: `0xe0000000L` is a positive long. But `(int) (0xe0000000L & 0xffffffffL)` is a negative int. Yeah, this is why I was thinking that this might still be safe. 👍 (At least if these are positions.)
>
> **Reply:** I think @nik9000 should probably comment here as well - and whatever we end up with should probably become a Java comment here; that's too many assumptions for my head to infer, at least when skimming the file :D
>
> **Reply:** @nik9000 any additional observations on this?
>
> **Reply:** Y'all are right - we need positive integers from the hash. We're using the one from core which allows `long` - but we do intend to use only `int`. It's for sure worth a comment.
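The bit math debated above can be checked in isolation. The standalone sketch below demonstrates the point the reviewers settle on: truncating a long to its low 32 bits produces a negative int whenever bit 31 is set, which is why the ordinals coming out of the hash must stay below 2^31. The constant `0xe0000000L` comes straight from the thread; none of this is the actual `BytesRef2BlockHash` code:

```java
public class TruncationSketch {
    public static void main(String[] args) {
        // 0xe0000000L is a positive long: only bits 29-31 are set, bit 63 is clear.
        long k = 0xe0000000L;
        System.out.println(k);                 // 3758096384 (positive)

        // Truncating to an int keeps the low 32 bits; bit 31 becomes the
        // int's sign bit, so the result flips negative.
        int truncated = (int) (k & 0xffffffffL);
        System.out.println(truncated);         // -536870912

        // If the long is an ordinal below 2^31 (as the thread says
        // BytesRefBlockHash guarantees), bit 31 is clear and the
        // truncated int stays non-negative.
        long ordinal = 1_000_000L;
        System.out.println((int) (ordinal & 0xffffffffL)); // 1000000
    }
}
```

This is the reason the `k < 0` / `k <= 0` question is moot in practice: the values are always positive ordinals, so neither branch should ever fire - but, as agreed above, that assumption deserves a Java comment at the call site.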