ES|QL block type for exponential histograms #133393
Pinging @elastic/es-storage-engine (Team:StorageEngine)
I'll leave it to the esql experts to approve.
I looked at the implementation of ExponentialHistogramFieldMapper. I think the Block/Builder should be similar to AggregateMetricDoubleArrayBlock. The confusing part is that the input/output of this block/builder is a decoded Histogram, but it should be an encoded Histogram in BytesRef. I've left some comments, and I think we can iterate on this quickly. Thanks Jonas!
```java
@Override
public ExponentialHistogramBlockBuilder endPositionEntry() {
    encodedHistogramsBuilder.endPositionEntry();
```
Can we throw UnsupportedOperationException here - we should not have multi-valued with ExponentialHistogram.
The `exponential_histogram` field type does not support multi-values for technical reasons: the exponential histograms are stored split across multiple doc values and we can't guarantee the order in case of multi-values. There isn't really a use case justifying the effort required to support this.
For ES|QL, my understanding is that blocks are not only used when loading data, but also for intermediate and end results. So while loading a field of type `exponential_histogram` can never produce multi-values, why wouldn't we allow users to construct multi-values, e.g. via the VALUES aggregation?
I don't really see much of a use case for constructing multi-valued exponential histogram columns this way. But is there that much effort behind supporting it? To me it felt like much more effort not to support it, because a lot of tests construct multi-values, which would then need to be adapted. But my impression might be wrong here.
```java
@Override
public ExponentialHistogramBlockBuilder beginPositionEntry() {
    encodedHistogramsBuilder.beginPositionEntry();
```
Can we throw UnsupportedOperationException here - we should not have multi-valued with ExponentialHistogram.
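For reference, a minimal sketch of that suggestion, reusing the overridden signatures from the diff above (the error message text is illustrative, not taken from this PR):

```java
@Override
public ExponentialHistogramBlockBuilder beginPositionEntry() {
    // exponential_histogram is single-valued, so reject multi-valued positions outright
    throw new UnsupportedOperationException("exponential histogram blocks do not support multi-valued positions");
}

@Override
public ExponentialHistogramBlockBuilder endPositionEntry() {
    throw new UnsupportedOperationException("exponential histogram blocks do not support multi-valued positions");
}
```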
```java
    this.tempScratch = new BytesRef(new byte[INITIAL_SCRATCH_SIZE], 0, INITIAL_SCRATCH_SIZE);
}

public ExponentialHistogramBlockBuilder append(@Nullable ExponentialHistogram value) {
```
Can we pass the encoded ExponentialHistogram as a BytesRef here instead? I think this part caused confusion for me. In the compute lifecycle, we load BytesRef values from binary doc_values and store them as BytesRef in this block. When the actual ExponentialHistogram is needed, we retrieve the BytesRef and decode it.
In both the ES|QL block and for storage in doc values we use byte arrays, yes. The problem is that we will use them differently, because different aspects matter most in each case.
For the Elasticsearch field, to my understanding the most important aspect is disk size, followed by decoding speed (for queries), and then, least important, encoding speed (ingestion), as that work is done only once. Of course we make reasonable tradeoffs here, e.g. we don't do an optimization for 2% disk space savings while taking double the time per document for ingestion.
For the ES|QL blocks, to my understanding the most important things are decoding performance (the block is used as input of a query) and encoding performance (the block is used as output of a query or as an intermediate result). We care a lot less about size, which in this case is heap consumption.
For this reason, we will be using different encodings of exponential histograms when storing them vs. when handling them with ES|QL. That's why I think it is correct to use the `ExponentialHistogram` interface as the abstraction for moving the data into a block, as it isolates these two similar but different implementations from each other.
Another example where this difference is very visible is the zero threshold: that value is only useful in combination with the histogram buckets, so for the ES|QL block we would just store it in the `byte[]`s as well. For storage, however, we've extracted it into a separate doc value to allow for good compression.
I do agree that it is wasteful for the blockloader to decode the doc values and then encode them again. However, for me this is an optimization to be added later, and as a "special case" which is allowed to look through the abstraction for performance improvements (see the sketch after this list):
- The `append(ExponentialHistogram value)` method still works on any `ExponentialHistogram` implementation.
- We add a special case in it: we detect if the passed histogram is a `CompressedExponentialHistogram`. In that case we gain direct access to the `byte[]` storing the buckets and copy it as-is, saving the decode + encode work.
- Per row in the block, we remember the encoding and still return `ExponentialHistogram`s on access, decoding using whatever encoding was used for that row. This way, this optimization remains an implementation detail, not leaking into consumers.
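A rough sketch of what that special case could look like in `append` (illustrative only: `CompressedExponentialHistogram#encodedBuckets()` and `encodeInto(...)` are hypothetical helpers, and a real version would also record per row which encoding was used):

```java
public ExponentialHistogramBlockBuilder append(@Nullable ExponentialHistogram value) {
    if (value == null) {
        appendNull();
        return this;
    }
    if (value instanceof CompressedExponentialHistogram compressed) {
        // Fast path: the histogram already carries its storage encoding, so copy the
        // encoded buckets as-is and skip the decode + re-encode round trip.
        encodedHistogramsBuilder.appendBytesRef(compressed.encodedBuckets()); // hypothetical accessor
    } else {
        // Slow path: iterate the buckets and re-encode them into the block's own encoding.
        BytesRef encoded = encodeInto(value, tempScratch); // hypothetical encoder reusing the scratch buffer
        encodedHistogramsBuilder.appendBytesRef(encoded);
    }
    return this;
}
```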
```java
    return new ExponentialHistogramBlockBuilder(estimatedSize, this);
}

public final ExponentialHistogramBlock newConstantExponentialHistogramBlock(ExponentialHistogram value, int positionCount) {
```
let's add the constant later - not sure if we need this?
This is needed for testing. While we don't allow the construction of exponential_histogram literals via a query (at least for now), tests do construct literals. And to turn those literals into blocks, constant blocks are used.
So if we remove the constant block, we get a lot of test failures like this one:
```
REPRODUCE WITH: ./gradlew ":x-pack:plugin:esql:test" --tests "org.elasticsearch.xpack.esql.expression.function.scalar.nulls.IsNullTests.testEvaluate {TestCase=non-null exponential_histogram}" -Dtests.seed=E0138310B809341F -Dtests.locale=hi-Deva-IN -Dtests.timezone=Etc/GMT+2 -Druntime.java=25

unsupported element type [EXPONENTIAL_HISTOGRAM]
java.lang.UnsupportedOperationException: unsupported element type [EXPONENTIAL_HISTOGRAM]
    at __randomizedtesting.SeedInfo.seed([E0138310B809341F:CABEDCCC64462444]:0)
    at org.elasticsearch.compute.data.BlockUtils.constantBlock(BlockUtils.java:259)
    at org.elasticsearch.compute.data.BlockUtils.constantBlock(BlockUtils.java:245)
    at org.elasticsearch.compute.data.BlockUtils.fromListRow(BlockUtils.java:107)
    at org.elasticsearch.compute.data.BlockUtils.fromListRow(BlockUtils.java:77)
    at org.elasticsearch.xpack.esql.expression.function.AbstractFunctionTestCase.row(AbstractFunctionTestCase.java:577)
    at org.elasticsearch.xpack.esql.expression.function.AbstractScalarFunctionTestCase.testEvaluate(AbstractScalarFunctionTestCase.java:117)
    at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
```
And I don't think we'd want to exclude exponential histogram blocks from `IsNullTests`, for example.
I tried to "properly" implement the constant block. What I did was create constant blocks for the sub-blocks (e.g. a constant block for min, a constant block for zero_threshold, etc.).
Unfortunately it seems like something is wrong with the serialization of those blocks, as several tests then fail when trying to deserialize the block.
For that reason I reverted to the current, non-optimal solution, as it is only used in tests anyway, AFAIK.
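For context, one possible shape of such a non-optimal fallback is to build a regular block that simply repeats the value (a sketch, not the code in this PR; the builder factory method name is assumed from the snippet below):

```java
public final ExponentialHistogramBlock newConstantExponentialHistogramBlock(ExponentialHistogram value, int positionCount) {
    // Not a true constant block: append the same histogram positionCount times.
    // Acceptable while this path is only exercised by tests; a proper constant
    // implementation would wrap constant sub-blocks instead.
    try (ExponentialHistogramBlockBuilder builder = newExponentialHistogramBlockBuilder(positionCount)) {
        for (int i = 0; i < positionCount; i++) {
            builder.append(value);
        }
        return builder.build();
    }
}
```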
```java
}

@Override
public ExponentialHistogram getExponentialHistogram(int valueIndex) {
```
Can we return the encoded histogram and let callers decode and use it - we also need to pass the scratch parameter. We can add a helper method in this class for decoding.
That's kind of what I intended to do - but abstracted away.
The key here is that the `ExponentialHistogram` interface intentionally does not allow random access to the buckets; you can only iterate over them. This allows us to do the decoding lazily in the `ExponentialHistogram` implementation we return here, without having to repeat it in every consumer of this block. So for that reason I don't see the benefit of your suggestion here.
However, what I'm going to implement here is a suggestion already made by @nik9000: to minimize allocations (not trusting inlining + scalar replacement), we should provide an "Accessor", similar to e.g. Lucene doc value iterators. This accessor can then reuse the returned `ExponentialHistogram` instance across calls to minimize allocations.
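A possible shape for such an accessor (all names here are hypothetical; the point is only that one mutable histogram view is reused across positions, in the spirit of Lucene's doc value iterators):

```java
/** Hypothetical accessor that hands out a reused histogram view per position. */
public final class ExponentialHistogramBlockAccessor {
    private final ExponentialHistogramArrayBlock block;
    // Mutable, reusable view; resetTo(...) re-points it at the encoded bytes of a position.
    private final ReusableExponentialHistogram scratch = new ReusableExponentialHistogram();

    public ExponentialHistogramBlockAccessor(ExponentialHistogramArrayBlock block) {
        this.block = block;
    }

    /** Returns a histogram view for the given position; the same instance is returned on every call. */
    public ExponentialHistogram get(int position) {
        scratch.resetTo(block, position); // decodes lazily when the buckets are iterated
        return scratch;
    }
}
```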
@dnhatn: To summarize my understanding of what we discussed in the meeting: IIUC the problem with the current approach in this PR with the
With my current approach, we'd have to always iterate over each document and do decoding: not necessarily decoding the histogram buckets, but decoding the doc values from how they are stored in Lucene. Is that correct?
What confuses me a little is that IIUC this is not what's happening right now for
So I'd propose to do the following in the follow-up PR, which will also contain the blockloader for exponential histograms:
So essentially, for the non-RowStrideReader path the
We'll see if this works out or if there are other things I stumble across. Is this the correct approach in your opinion?
I understand - we should reuse the existing DoublesBlockLoader/IntsBlockLoader to read sub-blocks. Similarly, for the histogram block, there will be at least two blocks: one for zeroThreshold and one for the buckets. The buckets block will be loaded using BytesRefsFromCustomBinary, and zeroThreshold can be loaded with LongsBlockLoader. This approach avoids combining these values during reading. When accessing the histogram value, we can retrieve zeroThreshold and decode the encoded histogram BytesRef from the buckets block.
Just to confirm - you prefer to experiment with our discussion in follow-up PRs. If so, I'll need to take another look to make sure we don't break existing things.
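To make that sub-block layout concrete, a sketch of what reading one position could look like under this scheme (illustrative only; which numeric block type holds zeroThreshold is an assumption, and `decodeExponentialHistogram(...)` is a hypothetical decoder):

```java
// Hypothetical read path: zeroThreshold lives in its own numeric sub-block,
// the encoded buckets live in a BytesRef sub-block loaded via the custom binary loader.
ExponentialHistogram readHistogram(DoubleBlock zeroThresholds, BytesRefBlock buckets, int position, BytesRef scratch) {
    double zeroThreshold = zeroThresholds.getDouble(zeroThresholds.getFirstValueIndex(position));
    BytesRef encoded = buckets.getBytesRef(buckets.getFirstValueIndex(position), scratch);
    // Combine the two into a decoded histogram view only when it is actually needed.
    return decodeExponentialHistogram(zeroThreshold, encoded);
}
```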
Adds a barebones ES|QL type and block type for exponential histograms.
To keep this PR as small as possible I've reduced the type to just storing the `scale` of a histogram and nothing else. Everything else will be added in follow-up PRs.
In this PR I've marked the shortcuts / disabled tests that definitely need work before we can eventually put this into tech preview with `TODO(b/133393)`.
Note that I've also excluded the type from some tests (e.g. TopN) without a `TODO(b/133393)`: these are tests which cover functionality that I think won't be needed, at least for a tech preview. Please carefully review them and let me know if they cover functionality which should work, so that I can add the `TODO(b/133393)` there as well.
Initially this PR also included some CSV tests, but I decided to remove them for now, as they would require implementing a blockloader, increasing the size of the PR unnecessarily. I'll add them back together with the blockloader in a follow-up PR.