
Conversation

@fang-xing-esql (Member) commented May 29, 2025

This is a subtask of the filter-by-filter aggregation in ES|QL. This PR includes the following changes:

  • Consolidated per-shard min/max into SearchStats, based on the statistics retrieved from Lucene (for DateFieldType only); added SearchContextStatsTests and tested min/max.
  • On data nodes, if min/max are available in SearchStats, provide them to DateTrunc.createRounding and call prepare(min, max) instead of prepareForUnknown().
  • On data nodes, when pre-calculated rounding points are returned by fixedRoundingPoints(), substitute date_trunc and bucket with the round_to function. A new rule, LocalSubstituteSurrogateExpressions, is added to the local rewrite batch in LocalLogicalPlanOptimizer to perform the substitution (a conceptual sketch follows the perf numbers below).
  • The rewrite from date_trunc/bucket to round_to does not apply to date_nanos yet, as the underlying APIs support millis only and nanos has not been tested.
  • PreparedRounding.maybeUseArray may hit an assert if the calendar interval is a multiple quantity, like 10 month or 3 quarter. Date histogram aggregation has this limitation documented here, which is likely why the assert hasn't been hit before; the same limitation applies to this rewrite.
  • The performance improvement looks promising when the rewrite from date_trunc to round_to applies; however, it is unlikely to show up in the existing perf regression tests. The nyc_taxis track used by the date_histogram/date_trunc-related ES|QL tests has a wide date range, from year 1900 to 2253, and the fixed rounding points of the date field dropoff_datetime cannot be calculated for such a wide range, so the date_trunc functions in the existing perf regression tests are not rewritten to round_to. With nyc_taxis modified to keep data from year 2015 only, the query below shows about a 12% performance improvement: ~840ms on main vs ~740ms on the branch where date_trunc is rewritten to round_to.
curl -k -u esbench:super-secret-password 'https://elasticsearch-0:9200/_query?format=txt' \
  -H 'Content-Type: application/json' \
  -d '{
    "query": "from nyc_taxis | stats c=count(*) by m=bucket(dropoff_datetime, 1 month) | sort m"
}'

**Main**: the measurements are collected from 5 runs of the query above; they are fairly consistent
  "profile" : {
    "query" : {
      "start_millis" : 1750975801544,
      "stop_millis" : 1750975802359,
      "took_millis" : 815,
      "took_nanos" : 814654568
    },
  "profile" : {
    "query" : {
      "start_millis" : 1750975797680,
      "stop_millis" : 1750975798512,
      "took_millis" : 832,
      "took_nanos" : 832115025
    },
  "profile" : {
    "query" : {
      "start_millis" : 1750975804516,
      "stop_millis" : 1750975805364,
      "took_millis" : 848,
      "took_nanos" : 848035085
    },
  "profile" : {
    "query" : {
      "start_millis" : 1750975808248,
      "stop_millis" : 1750975809093,
      "took_millis" : 845,
      "took_nanos" : 845664731
    },
  "profile" : {
    "query" : {
      "start_millis" : 1750975811275,
      "stop_millis" : 1750975812132,
      "took_millis" : 857,
      "took_nanos" : 857257746
    },

**Branch**: the measurements are collected from 5 runs of the query above; they are fairly consistent
[2025-06-26T22:11:32,743][TRACE][o.e.x.e.o.LocalLogicalPlanOptimizer] [elasticsearch-0] About to apply rule local.SubstituteSurrogateExpressionsWithSearchStats
[2025-06-26T22:11:32,744][INFO ][stdout                   ] [elasticsearch-0] field: FieldName[string=dropoff_datetime], min: 1420070400000
[2025-06-26T22:11:32,744][INFO ][stdout                   ] [elasticsearch-0] field: FieldName[string=dropoff_datetime], max: 1451606399000
[2025-06-26T22:11:32,744][INFO ][stdout                   ] [elasticsearch-0] field: FieldName[string=dropoff_datetime], min string: 2015-01-01T00:00:00.000Z
[2025-06-26T22:11:32,744][INFO ][stdout                   ] [elasticsearch-0] field: FieldName[string=dropoff_datetime], max string: 2015-12-31T23:59:59.000Z
[2025-06-26T22:11:32,744][INFO ][stdout                   ] [elasticsearch-0] field name = FieldName[string=dropoff_datetime], min = 1420070400000, max = 1451606399000, roundingPoints = [1420070400000, 1422748800000, 1425168000000, 1427846400000, 1430438400000, 1433116800000, 1435708800000, 1438387200000, 1441065600000, 1443657600000, 1446336000000, 1448928000000]
[2025-06-26T22:11:32,745][TRACE][o.e.x.e.o.LocalLogicalPlanOptimizer] [elasticsearch-0] Rule local.SubstituteSurrogateExpressionsWithSearchStats applied
Before:
Aggregate[[m{r}#306],[COUNT(*[KEYWORD],true[BOOLEAN]) AS c#308, m{r}#306]]
\_Eval[[DATETRUNC(P1M[DATE_PERIOD],dropoff_datetime{f}#310) AS m#306]]
  \_EsRelation[nyc_taxis][dropoff_datetime{f}#310]

After:
Aggregate[[m{r}#306],[COUNT(*[KEYWORD],true[BOOLEAN]) AS c#308, m{r}#306]]
\_Eval[[ROUNDTO(dropoff_datetime{f}#310,1420070400000[DATETIME],1422748800000[DATETIME],1425168000000[DATETIME],1427846400000[DATETIME],1430438400000[DATETIME],1433116800000[DATETIME],1435708800000[DATETIME],1438387200000[DATETIME],1441065600000[DATETIME],1443657600000[DATETIME],1446336000000[DATETIME],1448928000000[DATETIME]) AS m#306]]
  \_EsRelation[nyc_taxis][dropoff_datetime{f}#310]

  "profile" : {
    "query" : {
      "start_millis" : 1750975835802,
      "stop_millis" : 1750975836517,
      "took_millis" : 715,
      "took_nanos" : 714436407
    },
  "profile" : {
    "query" : {
      "start_millis" : 1750975838797,
      "stop_millis" : 1750975839547,
      "took_millis" : 750,
      "took_nanos" : 749372816
    },
  "profile" : {
    "query" : {
      "start_millis" : 1750975832887,
      "stop_millis" : 1750975833632,
      "took_millis" : 745,
      "took_nanos" : 744764974
    },
  "profile" : {
    "query" : {
      "start_millis" : 1750975829550,
      "stop_millis" : 1750975830284,
      "took_millis" : 734,
      "took_nanos" : 733643219
    },
  "profile" : {
    "query" : {
      "start_millis" : 1750975826368,
      "stop_millis" : 1750975827132,
      "took_millis" : 764,
      "took_nanos" : 763297524
    },
  • The wide-date-range situation found during the perf tests can be improved by also considering filters on the date field in the query: use the overlap of the date range from the filter and from the search stats to calculate the fixed rounding points. This is a follow-up task and is not implemented in this PR.
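
To make the substitution concrete, here is a minimal, self-contained sketch of the idea. The class and method names are illustrative only, not the actual ES|QL implementation:

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.temporal.ChronoUnit;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RoundToSketch {
    // Bucket-start timestamps (epoch millis) for a 1 month interval between
    // min and max -- roughly what fixedRoundingPoints() yields once SearchStats
    // knows the field's min/max.
    static long[] monthlyRoundingPoints(long minMillis, long maxMillis) {
        List<Long> points = new ArrayList<>();
        ZonedDateTime cursor = Instant.ofEpochMilli(minMillis)
            .atZone(ZoneOffset.UTC)
            .withDayOfMonth(1)
            .truncatedTo(ChronoUnit.DAYS);
        while (cursor.toInstant().toEpochMilli() <= maxMillis) {
            points.add(cursor.toInstant().toEpochMilli());
            cursor = cursor.plusMonths(1);
        }
        return points.stream().mapToLong(Long::longValue).toArray();
    }

    // round_to is "the greatest point <= value": a binary search over a small
    // sorted array per row, instead of calendar arithmetic per row.
    // Assumes points[0] <= value, which holds when points start at the field min.
    static long roundTo(long value, long[] points) {
        int i = Arrays.binarySearch(points, value);
        return i >= 0 ? points[i] : points[-i - 2];
    }
}
```

The twelve rounding points in the optimizer trace above (one per month of 2015) are exactly the kind of array such a computation produces for the trimmed nyc_taxis data.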

@elasticsearchmachine (Collaborator)

Hi @fang-xing-esql, I've created a changelog YAML for you.

@tylerperk tylerperk changed the title [ES|QL] Substitue date_trunc with round_to when the pre-calculated rounding points are available [ES|QL] Substitute date_trunc with round_to when the pre-calculated rounding points are available Jun 24, 2025
@fang-xing-esql fang-xing-esql marked this pull request as ready for review June 27, 2025 21:35
@elasticsearchmachine elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Jun 27, 2025
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-analytical-engine (Team:Analytics)

@nik9000 (Member) commented Jul 1, 2025

To call it out: if your data is clean, which most log indices are, then you'll see the ~12% performance improvement from this if you are CPU bound. There are a few follow-ups coming:

  • Clamp the index min and max to the min and max from WHERE and filter (sketched after this list). This should let even the weird nyc_taxis data benefit from this. It will also bring the rewrite into play when you do things like | STATS COUNT(*) BY DATE_TRUNC(1 minute, @timestamp) on the last 15 minutes of data.
  • Further improve the performance by flipping to a filter-by-filter execution model. This is what lets aggs run STATS COUNT(*) BY DATE_TRUNC(1 minute, @timestamp) instantly.
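
The clamping in the first follow-up amounts to intersecting two ranges. A rough sketch, where the helper name and shape are hypothetical rather than the shipped implementation:

```java
// Hypothetical helper: intersect the index-level min/max from SearchStats with
// the bounds implied by WHERE filters on the same field, so the rounding-point
// array stays small even when the index itself spans centuries.
static long[] clampRange(long indexMin, long indexMax, Long filterMin, Long filterMax) {
    long min = filterMin == null ? indexMin : Math.max(indexMin, filterMin);
    long max = filterMax == null ? indexMax : Math.min(indexMax, filterMax);
    return new long[] { min, max }; // callers should check min <= max before using it
}
```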

@nik9000 (Member) left a comment

Looks good to me but probably wants someone with more QL background too.

@astefan (Contributor) left a comment

In general it looks good; I think the surrogate expression approach can be used for this kind of replacement.
I have objections regarding code abstraction and readability: the code in Bucket and DateTrunc is almost identical, with minor exceptions, and can be refactored and simplified.
Also, the SurrogateExpression we use today is specific to the LogicalPlanOptimizer and is used only there. I would create a separate interface only for things applicable on data nodes, even more so since this new rule will use SearchStats (which is a data-node thing). A possible shape is sketched below.
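
One possible shape for such a data-node-only interface; the name and signature below are assumptions for illustration, not the PR's final API:

```java
// Hypothetical: a surrogate interface scoped to data nodes, where SearchStats
// (per-shard min/max) is available to drive the rewrite.
public interface LocalSurrogateExpression {
    // Returns a cheaper equivalent expression given the shard statistics,
    // or null when no rewrite applies (e.g. min/max unknown).
    Expression surrogate(SearchStats searchStats);
}
```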

var min = searchStats.min(fieldName);
var max = searchStats.max(fieldName);
// If min/max is available create rounding with them
if (min instanceof Long minValue && max instanceof Long maxValue && interval().foldable()) {
Contributor:

Can min and max here be null? Can they be something other than Long? I assume yes, and if so, why does it only matter here whether it's a Long?

@fang-xing-esql (Member Author):

Yes, min and max can be null if we don't find statistics for the field in Lucene. And in this PR they can only be Long, because:

  • SearchStats only consolidates min and max for date fields, and they are Long in Lucene.
  • We only do the substitution for date fields in DateTrunc and Bucket here.


private final ZoneId zoneId;

protected BinaryDateTimeFunction(Source source, Expression argument, Expression timestamp) {
@fang-xing-esql (Member Author):

Clean-up: this class is not used.

public class DateDiff extends EsqlScalarFunction {
public static final NamedWriteableRegistry.Entry ENTRY = new NamedWriteableRegistry.Entry(Expression.class, "DateDiff", DateDiff::new);

public static final ZoneId UTC = ZoneId.of("Z");
@fang-xing-esql (Member Author):

ExpressionBuilder, Add and Sub all refer to DateUtils.UTC, which is the same as ZoneId.of("Z"); refactor this to point to DateUtils.UTC as well. This seems to be the only place that directly references ZoneId.of("Z").
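
The change itself is a one-liner along these lines (sketch):

```java
// Before: the only direct reference to ZoneId.of("Z")
public static final ZoneId UTC = ZoneId.of("Z");
// After: consistent with ExpressionBuilder, Add and Sub
public static final ZoneId UTC = DateUtils.UTC;
```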

Comment on lines 819 to 831
```diff
 private LogicalPlan plan(String query, Analyzer analyzer) {
     var analyzed = analyzer.analyze(parser.createStatement(query, EsqlTestUtils.TEST_CFG));
-    // System.out.println(analyzed);
-    var optimized = logicalOptimizer.optimize(analyzed);
-    // System.out.println(optimized);
-    return optimized;
+    return logicalOptimizer.optimize(analyzed);
 }

-private LogicalPlan plan(String query) {
+protected LogicalPlan plan(String query) {
     return plan(query, analyzer);
 }

-private LogicalPlan localPlan(LogicalPlan plan, SearchStats searchStats) {
+protected LogicalPlan localPlan(LogicalPlan plan, SearchStats searchStats) {
     var localContext = new LocalLogicalOptimizerContext(EsqlTestUtils.TEST_CFG, FoldContext.small(), searchStats);
-    // System.out.println(plan);
-    var localPlan = new LocalLogicalPlanOptimizer(localContext).localOptimize(plan);
-    // System.out.println(localPlan);
-    return localPlan;
+    return new LocalLogicalPlanOptimizer(localContext).localOptimize(plan);
 }
```
@fang-xing-esql (Member Author):

Small refactors.

@fang-xing-esql (Member Author)

> In general it looks good; I think the surrogate expression approach can be used for this kind of replacement. I have objections regarding code abstraction and readability: the code in Bucket and DateTrunc is almost identical, with minor exceptions, and can be refactored and simplified. Also, the SurrogateExpression we use today is specific to the LogicalPlanOptimizer and is used only there. I would create a separate interface only for things applicable on data nodes, even more so since this new rule will use SearchStats (which is a data-node thing).

Thanks so much for reviewing and for the suggestions on the code refactor, @astefan! All of the comments have been addressed.

@astefan (Contributor) left a comment

LGTM

Is there any existing IT test (csv-spec or yaml) that covers queries impacted by this change?
It feels right to also see this in a more realistic test, something closer to what users will actually run and see. If there is no such IT test, please add one that makes sense and actually exercises what this PR adds.

@nik9000 (Member) commented Jul 15, 2025

For checking that things are actually rewritten in a test, you can add a new test case in RestEsqlIT: index a bunch of docs and run with a profile. Inside the EvalOperator you'll see it running RoundTo. You could also check the rewritten plan.

Once we push this stuff to the search index we can use the infrastructure in PushQueriesIT.

@nik9000 (Member) commented Jul 15, 2025

> Once we push this stuff to the search index we can use the infrastructure in PushQueriesIT.

But that's a follow-up.

@fang-xing-esql (Member Author)

> LGTM
>
> Is there any existing IT test (csv-spec or yaml) that covers queries impacted by this change? It feels right to also see this in a more realistic test, something closer to what users will actually run and see. If there is no such IT test, please add one that makes sense and actually exercises what this PR adds.

We have quite a few existing CsvTests with date_trunc and bucket that are impacted by this change; they did uncover some bugs, which are fixed in this PR. I'll add an IT test in this PR as well.

@nik9000 (Member) commented Jul 15, 2025

> I'll add an IT test in this PR as well.

For me this kind of test is mostly about making sure we're really turning on the optimization when we expect to. It's not about today so much as about a year from now, when you make another change and it unexpectedly turns off the optimization.

@astefan (Contributor) commented Jul 15, 2025

> We have quite a few existing CsvTests with date_trunc and bucket that are impacted by this change; they did uncover some bugs, which are fixed in this PR

It's good that we have them. I'm ok with the PR as is now, no need for another IT test. Let's merge it 👍

@fang-xing-esql (Member Author)

Thank you for reviewing, @nik9000 and @astefan!

@fang-xing-esql fang-xing-esql merged commit 04ae527 into elastic:main Jul 15, 2025
33 checks passed
@fang-xing-esql (Member Author)

> To call it out: if your data is clean, which most log indices are, then you'll see the ~12% performance improvement from this if you are CPU bound. There are a few follow-ups coming:

>   • Clamp the index min and max to the min and max from WHERE and filter. This should let even the weird nyc_taxis data benefit from this. It will also bring the rewrite into play when you do things like | STATS COUNT(*) BY DATE_TRUNC(1 minute, @timestamp) on the last 15 minutes of data.

This is tracked in #131341

>   • Further improve the performance by flipping to a filter-by-filter execution model. This is what lets aggs run STATS COUNT(*) BY DATE_TRUNC(1 minute, @timestamp) instantly.


Labels

:Analytics/ES|QL AKA ESQL >enhancement Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v9.2.0
