Conversation

not-napoleon
Member

@not-napoleon not-napoleon commented Aug 18, 2025

Resolves #129524

This is meant to add a rewrite rule to filter out null metrics. I'm opening this PR early to collect feedback from the Analytics Engine team on the approach.

This rule scans the query plan to collect all of the metric attributes, creates an IsNotNull expression for each, and then combines them into a single filter. For this initial version, we want to process any document that has a value for any of the metrics in question, so we OR the filters together.
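To make the combining step concrete, here is a hypothetical, self-contained sketch of the fold: one IS NOT NULL check per metric, OR'd together. This is not the actual IgnoreNullMetrics code; the class and method names are invented, and plain strings stand in for real plan expression nodes.

```java
import java.util.List;

// Hypothetical sketch of the combining step: one IS NOT NULL check per metric,
// folded together with OR. Strings stand in for real plan expression nodes.
public class NotNullFilterSketch {
    public static String buildNotNullFilter(List<String> metrics) {
        String conditional = null;
        for (String metric : metrics) {
            String check = metric + " IS NOT NULL";
            // First metric starts the predicate; later ones are OR'd on.
            conditional = (conditional == null) ? check : conditional + " OR " + check;
        }
        return conditional;
    }

    public static void main(String[] args) {
        // Prints: cpu.usage IS NOT NULL OR memory.used IS NOT NULL
        System.out.println(buildNotNullFilter(List.of("cpu.usage", "memory.used")));
    }
}
```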

Feature Design Questions:

  • Should we apply this filter even if the user has other filters or logic dealing with the given metric field? e.g. if they have a COALESCE for that field already in the query?
    • At this point, the rule only collects metrics from STATS commands. So if a field is coalesced and then we compute a statistic on that result, no metric will be collected for it. This is a little fragile as written here, but it is working and it has tests.
  • Is it correct to be OR'ing the filters together?
    • This seems correct. We want all documents that have a metric value for any of the metrics involved in the query.

Implementation Questions:

  • Where is the correct place in the query planning process to apply this rule? My instinct is that it should run in the "Finish Analysis" phase of the "Analyzer" step. As written it should only run once, and it seems like it should run after references and union types have been resolved.
    • Resolution: I discussed this with Fang, and we decided it was best placed in the substitutions phase of the logical plan optimizer.

@not-napoleon not-napoleon added >enhancement WIP :Analytics/ES|QL AKA ESQL v9.2.0 :StorageEngine/ES|QL Timeseries / metrics / logsdb capabilities in ES|QL labels Aug 18, 2025
@not-napoleon
Member Author

Where is the correct place in the query planning process to apply this rule? My instinct is that it should run in the "Finish Analysis" phase of the "Analyzer" step. As written it should only run once, and it seems like it should run after references and union types have been resolved.

After discussing this with Fang, we agreed that the correct place for this is in the substitutions step of the logical optimizer phase. Other rules that should only be run once and add nodes happen in that phase, so it seems like a good fit.

@kkrik-es
Contributor

@dnhatn fyi

@not-napoleon not-napoleon marked this pull request as ready for review August 25, 2025 13:02
Contributor

@alex-spies alex-spies left a comment

I took a look at IgnoreNullMetrics. I added some thoughts on weird situations that may arise, but I don't see a risk of IgnoreNullMetrics affecting existing non-TS queries, because it requires a time-series EsRelation, which I think only ever occurs when the TS command is explicitly used.

I can't comment on whether it's correct/desired to filter out null metrics like this - that would require deeper thought and review before merging. My review is limited to "does this maybe affect other queries?", and I don't see that being the case here.


private Set<Attribute> collectMetrics(LogicalPlan logicalPlan) {
Set<Attribute> metrics = new HashSet<>();
logicalPlan.forEachDown(p -> {
Contributor

Wouldn't it suffice to look specifically for EsRelations, in a single pass, and just iterate over their metric fields?

I guess you need to first confirm that the metric field is used in a STATS; maybe it's better to look for STATS first, and then traverse only that specific STATS's children.

I'd be a little afraid of multiple STATS in a row; should this only apply to the first STATS?

Also, there can be multiple EsRelations in a single plan. This happens in the case of joins, and in the case of forks. Collecting metrics from all the plans and then placing a filter involving all the metric fields ahead of every relation sounds brittle; I'd go and look at some LOOKUP JOIN and FORK queries with the debugger on to double-check this.

Member Author

I'd be a little afraid of multiple STATS in a row; should this only apply to the first STATS?

Subsequent STATS would have to be operating on the output of the first STATS. Since the first STATS's output will be ReferenceAttributes, which always return false for isMetric(), none of those values should be caught in this filter. I will add some tests for this to be sure, and to guard against future changes to ReferenceAttribute.
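To illustrate that point, here is a hypothetical toy model (invented names, not the real Elasticsearch attribute classes) in which only concrete field attributes can report isMetric() == true, so outputs of an earlier STATS are never re-collected:

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Toy model only: "field" attributes stand in for FieldAttribute, and
// non-field attributes stand in for ReferenceAttributes produced by a STATS.
public class CollectMetricsSketch {
    public record Attr(String name, boolean isField, boolean markedMetric) {
        // References (isField == false) always report false, mirroring the
        // claim that ReferenceAttribute.isMetric() always returns false.
        public boolean isMetric() {
            return isField && markedMetric;
        }
    }

    public static Set<Attr> collectMetrics(List<Attr> aggregateInputs) {
        Set<Attr> metrics = new LinkedHashSet<>();
        for (Attr a : aggregateInputs) {
            if (a.isMetric()) {
                metrics.add(a);
            }
        }
        return metrics;
    }

    public static void main(String[] args) {
        Attr cpu = new Attr("cpu", true, true);          // metric field from the relation
        Attr avgCpu = new Attr("avg_cpu", false, true);  // output of a previous STATS
        // Only the cpu attribute survives collection.
        System.out.println(collectMetrics(List.of(cpu, avgCpu)).size());
    }
}
```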

Member Author

Okay, I've added a test that includes a second STATS command. Please let me know if there's anything else you'd like to see there.

Contributor

Since the first STATS output will be ReferenceAttributes

Caution: I think we pass through the FieldAttribute if the field is used in the BY clause (maybe only in the case without renaming). I see that none of the tests have a BY clause; is that intentional, i.e., do we not expect to group by any dimensions? Or non-dimensions, actually?

Otherwise, I'd add a couple more tests using the BY clause for good measure, and I'd double-check that something like STATS ... BY field | ... | STATS ... BY field or STATS ... BY field | ... | STATS min(field) still works as expected.

private Set<Attribute> collectMetrics(LogicalPlan logicalPlan) {
Set<Attribute> metrics = new HashSet<>();
logicalPlan.forEachDown(p -> {
if (p instanceof Aggregate) {
Contributor

Do we know how this will affect INLINE STATS?

Member Author

I do not know how this would affect INLINE STATS. Do we have any plan for how INLINE STATS will interact with the TS command in general?

Contributor

INLINE_STATS shouldn't be placed between TS and the first STATS, which is polymorphic. @dnhatn to correct me as needed.

Member

@dnhatn dnhatn Sep 3, 2025

I think in the future we might consider making INLINE STATS work with TS, or forbidding it. You can check for TimeSeriesAggregate instead of Aggregate here to narrow the scope of these changes.

Member

@dnhatn dnhatn left a comment

I've left some comments, but looks good overall. Thank you, Mark!

* metrics involved in the query. In the case that there are multiple metrics, the not null checks are OR'd together, so we accept rows
* where any of the metrics have values.
*/
public final class IgnoreNullMetrics extends Rule<LogicalPlan, LogicalPlan> {
Member

Can we move this to a local logical rule? I'm asking because we might switch to using HashJoin instead of this, for semantic and performance reasons. With local logical rules, we don't need to worry much about BWC when deciding to change the execution method.

Contributor

I think placing this in the local logical optimizer's Local rewrite batch would be consistent. We'd also have search stats available if that helps.

Member Author

Can we move this to a logical local rule?

Yes, I will work on this today.

With local logical rules, we don't need to worry much about BWC when deciding to change the execution method.

Can you elaborate on this? I don't see why one or the other place would impact backwards compatibility.

Contributor

With global rules, the problem is that if data nodes later rely on a specific optimization having already taken place, removing or changing that optimization on the coordinator is hard, because the coordinator would have to send a different plan to old nodes. We haven't solved this problem yet, even though we'll have to sometime soon.

Contributor

Actually, there may be a small BWC issue here: if we move this to a local rule, then old nodes will not know about it and will continue sending unfiltered data. That is generally fine, except for the edge case where there are groups with no metrics - those will be completely removed by new nodes, but will still be sent by old nodes.

I guess it's fine because we don't really care about these groups, but wanted to highlight it in case this has some consequences for anything you folks are building.

Member Author

There are no "old nodes" yet. TS is unreleased, so the first released version will include this rule. Any nodes that do not support this rule will also not support the TS command, and the query will fail anyway.

conditional = new IsNotNull(logicalPlan.source(), metric);
} else {
// Join the is not null checks with OR nodes
conditional = new Or(logicalPlan.source(), conditional, new IsNotNull(Source.EMPTY, metric));
Member

I think we should use either Source.EMPTY or logicalPlan.source() consistently for all clauses.

public final class IgnoreNullMetrics extends Rule<LogicalPlan, LogicalPlan> {
@Override
public LogicalPlan apply(LogicalPlan logicalPlan) {
return logicalPlan.transformUp(TimeSeriesAggregate.class, agg -> {
Member

We can type check here instead of in transformUp, which executes another loop.

if (logicalPlan instanceof TimeSeriesAggregate) {
    // ...
} else {
    return logicalPlan;
}



Comment on lines 50 to 51
private LogicalPlan analyze(String query) {
EsqlParser parser = new EsqlParser();
Contributor

Note: in the future, if you want to save some time when adding new tests, you can inherit from AbstractLogicalPlanOptimizerTests. You'll have all the test helpers that the logical optimizer tests already have.

Member Author

Yeah, I'm using that in another PR. I still need to load a schema that fits the tests. I can rework this test to inherit from that as well though.

Member Author

Actually, looking at it, LocalLogicalPlanOptimizerTests extends directly from ESTestCase. If I'm switching this to be a local optimization (as per https://github.com/elastic/elasticsearch/pull/133087/files#r2320394569), should I follow that pattern instead?

Contributor

Gah, of course the local optimizer tests are structured inconsistently. Sorry about that! Use whatever base class works best for you.

Contributor

Hey, I noticed that the local logical optimizer rules include InferNonNullAggConstraint, which prepends a WHERE field IS NOT NULL OR other_field IS NOT NULL to a STATS min(field), max(other_field).

This overlaps a bit with what we're implementing here. The limitation is that InferNonNullAggConstraint is not used with a BY clause. It also places the filter directly ahead of the STATS, relying on filter pushdown to make the filter Lucene-pushable.

I'm not trying to imply that we should make InferNonNullAggConstraint handle TS cases, too. I'm fine with having a more specific rule in place. I'm saying this because the two rules may interfere with one another, and you may want to add local logical plan optimizer tests and/or local physical plan optimizer tests to ensure that the filter is properly pushed down to Lucene. That's not really visible from the tests added in IgnoreNullMetricsTests. In fact, the added test cases may already be covered by InferNonNullAggConstraint, but we don't see it because that requires local tests to run.
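For reference, the rewrite InferNonNullAggConstraint performs looks roughly like this (an illustrative query pair with made-up field names, not actual optimizer output):

```esql
FROM idx | STATS min(field), max(other_field)

// becomes, approximately:
FROM idx
| WHERE field IS NOT NULL OR other_field IS NOT NULL
| STATS min(field), max(other_field)
```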

Contributor

Another thing: InferNonNullAggConstraint doesn't apply when there is a BY clause. That's because it would filter out some groups that have only null values.

I think the same can happen here. If we filter out documents where all the metrics are missing but group by a non-metric field, we might remove a whole group from the output. This would be inconsistent with how STATS normally works. I don't know if that's semantically correct for TS.
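To make the group-removal concern concrete, here is a small hypothetical model (invented names, plain Java collections rather than real plan execution) showing how pre-filtering rows whose only metric is null removes an entire group, not just the null values inside groups:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch: pre-filtering rows where every metric is null can drop
// a whole group from a grouped aggregation, not just the null values.
public class GroupDropSketch {
    public record Row(String host, Double cpu) {}

    // Count rows per host, optionally applying "cpu IS NOT NULL" first.
    public static Map<String, Long> countByHost(List<Row> rows, boolean filterNullMetrics) {
        Map<String, Long> counts = new TreeMap<>();
        for (Row r : rows) {
            if (filterNullMetrics && r.cpu() == null) {
                continue; // the filter removes this row before grouping
            }
            counts.merge(r.host(), 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Row> rows = List.of(new Row("a", 0.5), new Row("b", null));
        System.out.println(countByHost(rows, false)); // {a=1, b=1}
        System.out.println(countByHost(rows, true));  // {a=1} - group "b" vanished
    }
}
```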

Member Author

I believe this is intended, but @kkrik-es can confirm.

Contributor

@kkrik-es kkrik-es Sep 4, 2025

Removing groups containing only null values makes sense for time-series, indeed, as grouping attributes (dimensions) are included in documents along with metric values.

This is very interesting, @alex-spies. I wonder if we should be piggy-backing on InferNonNullAggConstraint and apply it under TS even in the presence of grouping attributes, instead of introducing a new rule. You definitely know better which one is cleaner.

Member Author

If we're going to have two rules, I think it makes sense to modify the InferNonNullAggConstraint rule to not apply to TimeSeriesAggregate nodes. We probably still want it to apply to any later aggregations.

Member

This is a known issue with this option. Groups without values will be omitted, having two STATS may return different groups than having one, and even a single stat can return different groups than another. However, unlike FROM, which is document-centric, TS is metric-centric, and we are okay with this semantic. We should document this behavior in TS.

Contributor

This is very interesting, @alex-spies. I wonder if we should be piggy-backing on InferNonNullAggConstraint and apply it under TS even in the presence of grouping attributes, instead of introducing a new rule. You definitely know better which one is cleaner.

I think piggy-backing and having our own rule both work. Even when piggy-backing, the separate logic between TS and non-TS queries can be made very clear in the code, so I have no issues with either approach.

If we go with two rules, I agree with @not-napoleon that we'd better adjust InferNonNullAggConstraint to not apply to TS queries; otherwise we'll have two rules doing similar work at the same time, which cannot be good when evolving and/or debugging TS.


Member

@dnhatn dnhatn left a comment

Looks great, thanks Mark! Sorry for the delay - we needed to discuss the semantic issues.

/**
* Scans the given {@link LogicalPlan} to see if it is a "metrics mode" query
*/
private static boolean isMetricsQuery(LogicalPlan logicalPlan) {
Member

nit: I think it's unused.

Contributor

@alex-spies alex-spies left a comment

Looks good!

@not-napoleon, as you suggested, I think it's better to adjust InferNonNullAggConstraint in a follow-up PR, so that you don't end up with both rules doing similar things and maybe messing up your planning later down the line.

I still recommend adding physical plan optimizer tests too, or some other way to see the filter pushdown that I assume this PR is going for actually happen. Executing the added filter at runtime will likely not do much for performance, I think.

@not-napoleon
Member Author

I've added the constraint as discussed. In point of fact, I think it is redundant, because the time series aggregation is always grouped, and InferNonNullAggConstraint does not apply to grouped aggs (which is why we didn't see any problems in the tests). That said, explicit is better than implicit, and the check is trivial.

I did not add an explicit test for the pushdown. I think that's worth doing, but I want to do it in a follow-up (I did look at doing it here, and unfortunately writing such a test is non-trivial). There are a lot of other people working on this code, and I would like to get this functionality to them ASAP; we can reasonably trust that the IsNotNull filters this adds will be pushed down as they normally are. Prior manual testing indicated that adding these filters dramatically sped up TS queries, which gives a lot of confidence that the pushdown is working correctly. I will open a follow-up ticket to add the tests, but I do not think it's high priority.

Thank you @alex-spies and @dnhatn for your help getting this important optimization merged.

@not-napoleon not-napoleon enabled auto-merge (squash) September 5, 2025 14:12
@not-napoleon not-napoleon merged commit 2e8fbe9 into elastic:main Sep 8, 2025
33 checks passed
not-napoleon added a commit that referenced this pull request Sep 16, 2025
Relates to #133087

@dnhatn reported that we weren't seeing the performance improvement we expected from the null filters. I added tests and investigated, and it turns out I had forgotten to load the TS metadata from a single field caps response. This PR includes the tests that helped find the problem, and the fix.
mridula-s109 pushed a commit to mridula-s109/elasticsearch that referenced this pull request Sep 17, 2025
gmjehovich pushed a commit to gmjehovich/elasticsearch that referenced this pull request Sep 18, 2025
Labels

>enhancement :StorageEngine/ES|QL Timeseries / metrics / logsdb capabilities in ES|QL Team:StorageEngine v9.2.0

Development

Successfully merging this pull request may close these issues.

Filter out null metric values

5 participants