Skip to content

Conversation

@dnhatn
Copy link
Member

@dnhatn dnhatn commented May 29, 2025

This change extracts the field extraction logic previously run inside TimeSeriesSourceOperator into a separate operator, executed in a separate driver. We should consider consolidating this operator with ValuesSourceReaderOperator. I tried to extend ValuesSourceReaderOperator to support this, but it may take some time to complete. Our current goal is to move quickly with experimental support for querying time-series data in ES|QL, so I am proceeding with a separate operator. I will spend more time on combining these two operators later.

With #128419 and this PR, the query time for the example below decreased from 41ms to 27ms.

POST /_query
{
    "profile": true,
    "query": "TS metrics-hostmetricsreceiver.otel-default 
        | WHERE @timestamp >= \"2025-05-08T18:00:08.001Z\"
        | STATS cpu = avg(rate(`metrics.process.cpu.time`)) BY host.name, BUCKET(@timestamp, 5 minute)"
}

@dnhatn dnhatn added >non-issue :StorageEngine/TSDB You know, for Metrics labels May 29, 2025
@dnhatn dnhatn requested review from kkrik-es, martijnvg and nik9000 May 29, 2025 21:42
@dnhatn dnhatn added the :Analytics/ES|QL AKA ESQL label May 29, 2025
@dnhatn dnhatn marked this pull request as ready for review May 29, 2025 21:43
@dnhatn dnhatn marked this pull request as draft May 29, 2025 21:43
@dnhatn dnhatn force-pushed the ts-parallel-extract branch from ff7256a to fdbf5c9 Compare May 29, 2025 21:51
@dnhatn dnhatn marked this pull request as ready for review May 29, 2025 23:09
@elasticsearchmachine elasticsearchmachine added Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) Team:StorageEngine labels May 29, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-storage-engine (Team:StorageEngine)

@dnhatn dnhatn mentioned this pull request May 29, 2025
28 tasks
@@ -94,22 +89,14 @@ private PhysicalPlan addFieldExtract(
attributesToExtract.addAll(extract.attributesToExtract());
}
List<Attribute> attrs = query.attrs();
if (keepDocAttribute == false) {
attrs = attrs.stream().filter(a -> EsQueryExec.isSourceAttribute(a) == false).toList();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not needed any more?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's one of the overheads of running them separately. Previously, if all fields were loaded directly in the source operator, we could skip emitting the doc vector. Now, however, we need to emit it for ValuesSourceReaderOperator.

Copy link
Contributor

@kkrik-es kkrik-es left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice improvement! And there's still more room for parallelism, maybe across backing indices?

@dnhatn
Copy link
Member Author

dnhatn commented May 30, 2025

Thanks Kostas!

@dnhatn dnhatn merged commit c14e602 into elastic:main May 30, 2025
18 checks passed
@dnhatn dnhatn deleted the ts-parallel-extract branch May 30, 2025 14:15
mridula-s109 pushed a commit that referenced this pull request Jun 2, 2025
This change extracts the field extraction logic previously run inside 
`TimeSeriesSourceOperator` into a separate operator, executed in a
separate driver. We should consider consolidating this operator with
`ValuesSourceReaderOperator`. I tried to extend
`ValuesSourceReaderOperator` to support this, but it may take some time
to complete. Our current goal is to move quickly with experimental
support for querying time-series data in ES|QL, so I am proceeding with
a separate operator. I will spend more time on combining these two
operators later.

With #128419 and this PR, the query time for the example below decreased 
from 41ms to 27ms.

```
POST /_query
{
    "profile": true,
    "query": "TS metrics-hostmetricsreceiver.otel-default
        | WHERE @timestamp >= \"2025-05-08T18:00:08.001Z\"
        | STATS cpu = avg(rate(`metrics.process.cpu.time`)) BY host.name, BUCKET(@timestamp, 5 minute)"
}
```
mridula-s109 pushed a commit to mridula-s109/elasticsearch that referenced this pull request Jun 3, 2025
This change extracts the field extraction logic previously run inside 
`TimeSeriesSourceOperator` into a separate operator, executed in a
separate driver. We should consider consolidating this operator with
`ValuesSourceReaderOperator`. I tried to extend
`ValuesSourceReaderOperator` to support this, but it may take some time
to complete. Our current goal is to move quickly with experimental
support for querying time-series data in ES|QL, so I am proceeding with
a separate operator. I will spend more time on combining these two
operators later.

With elastic#128419 and this PR, the query time for the example below decreased 
from 41ms to 27ms.

```
POST /_query
{
    "profile": true,
    "query": "TS metrics-hostmetricsreceiver.otel-default
        | WHERE @timestamp >= \"2025-05-08T18:00:08.001Z\"
        | STATS cpu = avg(rate(`metrics.process.cpu.time`)) BY host.name, BUCKET(@timestamp, 5 minute)"
}
```
joshua-adams-1 pushed a commit to joshua-adams-1/elasticsearch that referenced this pull request Jun 3, 2025
This change extracts the field extraction logic previously run inside 
`TimeSeriesSourceOperator` into a separate operator, executed in a
separate driver. We should consider consolidating this operator with
`ValuesSourceReaderOperator`. I tried to extend
`ValuesSourceReaderOperator` to support this, but it may take some time
to complete. Our current goal is to move quickly with experimental
support for querying time-series data in ES|QL, so I am proceeding with
a separate operator. I will spend more time on combining these two
operators later.

With elastic#128419 and this PR, the query time for the example below decreased 
from 41ms to 27ms.

```
POST /_query
{
    "profile": true,
    "query": "TS metrics-hostmetricsreceiver.otel-default
        | WHERE @timestamp >= \"2025-05-08T18:00:08.001Z\"
        | STATS cpu = avg(rate(`metrics.process.cpu.time`)) BY host.name, BUCKET(@timestamp, 5 minute)"
}
```
Samiul-TheSoccerFan pushed a commit to Samiul-TheSoccerFan/elasticsearch that referenced this pull request Jun 5, 2025
This change extracts the field extraction logic previously run inside 
`TimeSeriesSourceOperator` into a separate operator, executed in a
separate driver. We should consider consolidating this operator with
`ValuesSourceReaderOperator`. I tried to extend
`ValuesSourceReaderOperator` to support this, but it may take some time
to complete. Our current goal is to move quickly with experimental
support for querying time-series data in ES|QL, so I am proceeding with
a separate operator. I will spend more time on combining these two
operators later.

With elastic#128419 and this PR, the query time for the example below decreased 
from 41ms to 27ms.

```
POST /_query
{
    "profile": true,
    "query": "TS metrics-hostmetricsreceiver.otel-default
        | WHERE @timestamp >= \"2025-05-08T18:00:08.001Z\"
        | STATS cpu = avg(rate(`metrics.process.cpu.time`)) BY host.name, BUCKET(@timestamp, 5 minute)"
}
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Analytics/ES|QL AKA ESQL >non-issue :StorageEngine/TSDB You know, for Metrics Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) Team:StorageEngine v9.1.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants