-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Esql skip null metrics #133087
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Esql skip null metrics #133087
Conversation
…l-metrics' into esql-skip-null-metrics
…l-metrics' into esql-skip-null-metrics
After discussing this with Fang, we agreed that the correct place for this is in the substitutions step of the logical optimizer phase. Other rules that should only be run once and add nodes happen in that phase, so it seems like a good fit. |
@dnhatn fyi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took a look at IgnoreNullMetrics
. I added some thoughts on weird situations that may arise, but I don't see a risk for IgnoreNullMetrics
affecting existing queries that are non-TS queries, due to requiring a timeseries EsRelation
which I think only ever occurs when the TS
command is explicitly used.
I can't comment on whether or not it's correct/desired to filter out null metrics like this - my review is limited to "does this maybe affect other queries", which would require deeper thought and review before merging. Don't see this being the case here.
...sql-core/src/main/java/org/elasticsearch/xpack/esql/core/expression/UnresolvedAttribute.java
Outdated
Show resolved
Hide resolved
x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/index/EsIndex.java
Show resolved
Hide resolved
...ql/src/main/java/org/elasticsearch/xpack/esql/optimizer/rules/logical/IgnoreNullMetrics.java
Outdated
Show resolved
Hide resolved
|
||
private Set<Attribute> collectMetrics(LogicalPlan logicalPlan) { | ||
Set<Attribute> metrics = new HashSet<>(); | ||
logicalPlan.forEachDown(p -> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't it suffice to specifically look for EsRelation's, in a single pass, and just iterate over its met
I guess you need to first confirm that the metric field is used in a STATS; maybe it's better to look for STATS first, and then traverse only this specific STATs' children.
I'd be a little afraid of multiple STATS in a row; should this only apply to the first STATS?
Also, there can be multiple EsRelation
s in a single plan. This happens in case of joins, and in case of forks. Collecting metrics from all the plans and then placing a filter involving all the metrics fields ahead of every relation sounds brittle; I'd go and look at some LOOKUP JOIN and FORK queries with the debugger on to double check this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be a little afraid of multiple STATS in a row; should this only apply to the first STATS?
Subsequent STATS would have to be operating on the output of the first STATS. Since the first STATS output will be ReferenceAttribute
s, which always return false
for isMetric()
, none of those values should be caught in this filter. I will add some tests for this to be sure, and to guard against future changes to ReferenceAttribute
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay, I've added a test that includes a second stats command. Please let me know if there's anything else you'd like to see there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the first STATS output will be ReferenceAttributes
Caution: I think we pass through the FieldAttribute
if the field is used in the BY
clause (maybe only in the case without renaming). I see that all tests don't have a BY
clause; is that intentional, i.e., do we not expect to group by any dimensions? Or non-dimensions, actually?
Otherwise, I'd add a couple more tests using the BY
clause for good measure, and I'd go and double check that something like STATS ... BY field | ... | STATS ... BY field
or STATS ... BY FIELD | ... | STATS min(field)
still works as expected.
private Set<Attribute> collectMetrics(LogicalPlan logicalPlan) { | ||
Set<Attribute> metrics = new HashSet<>(); | ||
logicalPlan.forEachDown(p -> { | ||
if (p instanceof Aggregate) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we know how this will affect INLINE STATS?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not know how this would affect INLINE STATS
. Do we have any plan for how INLINE STATS
will interact with the TS
command in general?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
INLINE_STATS
shouldn't be placed between TS
and the first STATS
, which is polymorphic. @dnhatn to correct me as needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in the future we might consider making INLINE STATS work with TS, or forbidding it. You can check the TimeSeriesAggregate instead of Aggregate instance here to narrow the scope of these changes.
…l-metrics' into esql-skip-null-metrics
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've left some comments, but looks good overall. Thank you, Mark!
* metrics involved in the query. In the case that there are multiple metrics, the not null checks are OR'd together, so we accept rows | ||
* where any of the metrics have values. | ||
*/ | ||
public final class IgnoreNullMetrics extends Rule<LogicalPlan, LogicalPlan> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we move this to a logical local rule? I'm asking because we might switch to using HashJoin instead of this for semantic and performance reasons. With local logical rules, we don't need to worry much about BWC when deciding to change the execution method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think placing this in the local logical optimizer's Local rewrite
batch would be consistent. We'd also have search stats available if that helps.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we move this to a logical local rule?
Yes, I will work on this today.
With local logical rules, we don't need to worry much about BWC when deciding to change the execution method.
Can you elaborate on this? I don't see why one or the other place would impact backwards compatibility.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With global rules, a problem is that if data nodes later rely on a specific optimization to already have taken place, removing or changing this optimization in the coordinator is hard, because the coordinator would have to send a different plan to old nodes. We haven't solved this problem, yet, even though we'll have to sometime, soon.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, there may be a small bwc issue here: if we move this to a local rule, then old nodes will not know about it and will continue sending unfiltered data; which is generally fine except for the edge case where there are groups with no metrics - those will be completely removed from new nodes, but will still be sent from old nodes.
I guess it's fine because we don't really care about these groups, but wanted to highlight it in case this has some consequences for anything you folks are building.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are no "old nodes" yet. TS
is unreleased, so the first released version will include this rule. Any nodes that do not support this rule, will also not support the TS
command, and the query will fail anyway.
...ql/src/main/java/org/elasticsearch/xpack/esql/optimizer/rules/logical/IgnoreNullMetrics.java
Outdated
Show resolved
Hide resolved
conditional = new IsNotNull(logicalPlan.source(), metric); | ||
} else { | ||
// Join the is not null checks with OR nodes | ||
conditional = new Or(logicalPlan.source(), conditional, new IsNotNull(Source.EMPTY, metric)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should either use Source.EMPTY and logicalPlan.source() for all clauses.
public final class IgnoreNullMetrics extends Rule<LogicalPlan, LogicalPlan> { | ||
@Override | ||
public LogicalPlan apply(LogicalPlan logicalPlan) { | ||
return logicalPlan.transformUp(TimeSeriesAggregate.class, agg -> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can type check here instead of in transformUp, which executes another loop.
if (logicalPlan instanceof TimeSeriesAggregate) {
//...
} else {
return logicalPlan;
}
|
||
private Set<Attribute> collectMetrics(LogicalPlan logicalPlan) { | ||
Set<Attribute> metrics = new HashSet<>(); | ||
logicalPlan.forEachDown(p -> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the first STATS output will be ReferenceAttributes
Caution: I think we pass through the FieldAttribute
if the field is used in the BY
clause (maybe only in the case without renaming). I see that all tests don't have a BY
clause; is that intentional, i.e., do we not expect to group by any dimensions? Or non-dimensions, actually?
Otherwise, I'd add a couple more tests using the BY
clause for good measure, and I'd go and double check that something like STATS ... BY field | ... | STATS ... BY field
or STATS ... BY FIELD | ... | STATS min(field)
still works as expected.
...sql-core/src/main/java/org/elasticsearch/xpack/esql/core/expression/UnresolvedAttribute.java
Show resolved
Hide resolved
private LogicalPlan analyze(String query) { | ||
EsqlParser parser = new EsqlParser(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: in the future, if you want to save some time when adding new tests, you can inherit from AbstractLogicalPlanOptimizerTests
. You'll have all the test helpers that the logical optimizer tests already have.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I'm using that in another PR. I still need to load a schema that fits the tests. I can rework this test to inherit from that as well though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, looking at it, LocalLogicalPlanOptimizerTests
extends directly from ESTestCase
. If I'm switching this to be a local optimization (as per https://github.com/elastic/elasticsearch/pull/133087/files#r2320394569), should I follow that pattern instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gah, of course the local optimizer tests are structured inconsistently. Sorry about that! Use whatever base class works best for you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey, I noticed that the local logical optimizer rules have InferNonNullAggConstraint
, which prepends a WHERE field IS NOT NULL OR other_field IS NOT NULL
to a STATS min(field), max(other_field)
.
This is overlapping a bit with what we're implementing here. The limitation is that InferNonNullAggConstraint
is not used with a BY
clause. It also places the filter directly ahead of the STATS
, relying on filter pushdown to make the filter Lucene-pushable.
I'm not trying to imply that we should make InferNonNullAggConstraint
handle TS
cases, too. I'm fine with having a more specific rule in place. I'm saying this because the two rules may interfere with one another, and you may want to add local logical plan optimizer tests and/or local physical plan optimizer tests to ensure that you get the filter properly pushed down to Lucene. That's not really visible from the tests added in IgnoreNullMetricsTests
. In fact, it may be that the added test cases are already covered by InferNonNullAggConstraint
but we don't see it because this requires local tests to run.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another thing: InferNonNullAggConstraint
doesn't apply when there is a BY
clause. That's because it would filter out some groups that have only null values.
I think the same can happen here. If we filter out documents where all the metrics are missing but group by a non-metric field, we might remove a whole group from the output. This would be inconsistent with how STATS
normally works. I don't know if semantically correct for TS
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this is intended, but @kkrik-es can confirm.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removing groups containing only null values makes sense for time-series, indeed, as grouping attributes (dimensions) are included in documents along with metric values.
This is very interesting, @alex-spies. I wonder if we should be piggy-backing on InferNonNullAggConstraint
and apply it under TS
even in the presence of grouping attributes, instead of introducing a new rule. You def know better which one is cleaner.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're going to have two rules, I think it makes sense to modify th eInferNonNullAggConstraint
rule to not apply to TimeSeriesAggregation
nodes. We probably still want it to apply to any later aggregations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a known issue with this option. Groups without values will be omitted, and having two stats may return different groups than having one stat, even a single stat can return different groups than another. However, unlike FROM, which is document-centric, TS is metric-centric, and we are okay with this semantic. We should document this behavior in TS.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is very interesting, @alex-spies. I wonder if we should be piggy-backing on InferNonNullAggConstraint and apply it under TS even in the presence of grouping attributes, instead of introducing a new rule. You def know better which one is cleaner.
I think piggy-backing and having an own rule both works. Even when piggy-backing, the separate logic between TS and non-TS queries can be made very clear in the code, so I have no issues with either approach.
If we go with two rules, I agree with @not-napoleon that we better adjust InferNonNullAggConstraint
to not apply to TS queries, otherwise we'll have 2 rules doing similar work at the same time - which cannot be good when evolving and/or debugging TS.
* metrics involved in the query. In the case that there are multiple metrics, the not null checks are OR'd together, so we accept rows | ||
* where any of the metrics have values. | ||
*/ | ||
public final class IgnoreNullMetrics extends Rule<LogicalPlan, LogicalPlan> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think placing this in the local logical optimizer's Local rewrite
batch would be consistent. We'd also have search stats available if that helps.
...ql/src/main/java/org/elasticsearch/xpack/esql/optimizer/rules/logical/IgnoreNullMetrics.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, thanks Mark! Sorry for the delay - we needed to discuss on the semantic issues.
/** | ||
* Scans the given {@link LogicalPlan} to see if it is a "metrics mode" query | ||
*/ | ||
private static boolean isMetricsQuery(LogicalPlan logicalPlan) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I think it's unused.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good!
@not-napoleon , as you suggested, I think it's better to adjust InferNonNullAggConstraint
in a follow-up PR so that you don't end up having both rules doing similar things and maybe messing up your planning later down the line.
I still recommend adding physical plan optimizer tests, too, or some other means to see the filter pushdown actually happening that I assume this PR is going for. Executing the added filter at runtime will likely not do much for performance, I think.
…l-metrics' into esql-skip-null-metrics
I've added the constraint as discussed. In point of fact, I think it is redundant, because the time series aggregation is always grouped, and the I did not add an explicit test for the pushdown. I think that's worth doing, but I want to do it in a follow up (I did look at doing it here, and unfortunately writing such a test is non-trivial.) There are a lot of other people working on this code, and I would like to get this functionality to them ASAP, and it is reasonably trustworthy that the Thank you @alex-spies and @dnhatn for your help getting this important optimization merged. |
Relates to #133087 @dnhatn reported that we weren't seeing the performance improvement we expected from the null filters. I added tests and investigated, and it turns out I had forgotten to load the TS metadata from a single filed caps response. This PR includes the tests that help find the problem and the fix.
Relates to elastic#133087 @dnhatn reported that we weren't seeing the performance improvement we expected from the null filters. I added tests and investigated, and it turns out I had forgotten to load the TS metadata from a single filed caps response. This PR includes the tests that help find the problem and the fix.
Relates to elastic#133087 @dnhatn reported that we weren't seeing the performance improvement we expected from the null filters. I added tests and investigated, and it turns out I had forgotten to load the TS metadata from a single filed caps response. This PR includes the tests that help find the problem and the fix.
Resolves #129524
This is meant to add a rewrite rule to filter out null metrics. I'm opening this PR early to collect feedback from the Analytics Engine team on the approach.
This rule scans the query plan to collect all of the metric attributes, creates an isNotNull expression for each, and then combines them into a single filter. For the initial version of this, we want to process any document that has any of the metrics in question, so we OR the filters together.
Feature Design Questions:
COALESCE
for that field already in the query?STATS
commands. So if a field is coalesced and then we compute a statistic on that result, no metric will be collected for it. This is a little fragile as written here, but it is working and it has tests.Implementation Questions: