Conversation

@nik9000 (Member) commented Sep 30, 2025

Skip filling in the topn values unless the row is competitive. This cuts the runtime of topn pretty significantly. That's important when topn is dominating the runtime, like we see when querying many, many indices at once.
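As a rough illustration of the idea (a toy sketch using a plain java.util.PriorityQueue of sort keys, not the actual TopNOperator classes; all names here are made up), we only pay to materialize a row's values when it beats the current worst entry in the heap:

```java
import java.util.PriorityQueue;

public class CompetitiveTopN {
    // Counts how often we had to "fill values" while keeping the top n keys.
    // In the real operator, filling values is the expensive part that this
    // change skips for rows that can't make it into the heap.
    static int fillCount(int[] sortKeys, int n, PriorityQueue<Integer> heap) {
        int fills = 0;
        for (int key : sortKeys) {
            if (heap.size() < n) {
                heap.add(key);  // heap not full yet: every row is competitive
                fills++;
            } else if (heap.peek() < key) {
                heap.poll();    // evict the current worst row
                heap.add(key);
                fills++;        // competitive: fill in the values now
            }
            // otherwise the row loses to everything in the heap: skip the fill
        }
        return fills;
    }

    public static void main(String[] args) {
        PriorityQueue<Integer> heap = new PriorityQueue<>();
        int fills = fillCount(new int[] {5, 1, 9, 2, 8, 3, 7}, 3, heap);
        System.out.println(fills + " fills for 7 rows");
    }
}
```

With 7 rows and a top 3, only 6 rows ever get their values filled here; with many indices feeding mostly non-competitive rows, the skipped fills add up.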

We can emulate that a little locally with something like:

```
rm -f /tmp/fields
for field in {1..500}; do
    echo -n ',"f'$field'": "foo"' >> /tmp/fields
done

for idx in {1..100}; do
    curl -uelastic:password -XDELETE localhost:9200/test$idx

    echo '{
        "settings": {
            "index.mapping.total_fields.limit": 10000
        },
        "mappings": {
            "properties": {
                "@timestamp": { "type": "date" }
    ' > /tmp/idx
    for field in {1..500}; do
        echo ',"f'$field'": { "type": "keyword" }' >> /tmp/idx
    done
    echo '
                }
        }
    }' >> /tmp/idx
    curl -uelastic:password -XPUT -HContent-Type:application/json localhost:9200/test$idx --data @/tmp/idx

    rm -f /tmp/bulk
    for doc in {1..1000}; do
        echo '{"index":{}}' >> /tmp/bulk
        echo -n '{"@timestamp": '$(($idx * 10000 + $doc)) >> /tmp/bulk
        cat /tmp/fields >> /tmp/bulk
        echo '}' >> /tmp/bulk
    done
    echo
    curl -s -uelastic:password -XPOST -HContent-Type:application/json "localhost:9200/test$idx/_bulk?refresh&pretty" --data-binary @/tmp/bulk | tee /tmp/bulk_result | grep error
    echo
done

while true; do
    curl -s -uelastic:password -XPOST -HContent-Type:application/json 'localhost:9200/_query?pretty' -d'{
        "query": "FROM *",
        "pragma": {
            "max_concurrent_shards_per_node": 100
        }
    }' | jq .took

    curl -s -uelastic:password -XPOST -HContent-Type:application/json 'localhost:9200/_query?pretty' -d'{
        "query": "FROM * | SORT @timestamp DESC",
        "pragma": {
            "max_concurrent_shards_per_node": 100
        }
    }' | jq .took
done
```

Locally this only spends about 12.6% of its time on topn and takes 2.7 seconds. With this fix we spend 3.6% of our time on topn and take 2.5 seconds. That's not a huge improvement overall. 7% is nothing to sneeze at, but it's not great. The topn itself, though, drops from 340 millis to 90 millis.

But in some summary clusters I'm seeing 65% of time spent on topn for queries taking 3 seconds. My kind-of-bad math says this improvement should drop such a query to about 1.6 seconds. Let's hope!
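The back-of-envelope math checks out. A quick sanity check, using only the numbers from the text above (3 seconds total, 65% in topn, and the local 340 ms → 90 ms topn improvement as the scaling factor):

```java
public class EstimateCheck {
    // Time left after scaling the topn portion of the query by the observed
    // local speedup ratio, keeping the non-topn portion unchanged.
    static double estimate(double totalSeconds, double topnShare, double topnRatio) {
        double topn = totalSeconds * topnShare;          // ~1.95 s in topn today
        return (totalSeconds - topn) + topn * topnRatio; // rest + faster topn
    }

    public static void main(String[] args) {
        double after = estimate(3.0, 0.65, 90.0 / 340.0);
        System.out.println(Math.round(after * 100) / 100.0); // rounds to ~1.57
    }
}
```

That lands at roughly 1.57 seconds, which matches the ~1.6 second estimate.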

Hopefully our nightlies will see this and prove my math right.


@elasticsearchmachine (Collaborator)

Pinging @elastic/es-analytical-engine (Team:Analytics)

@elasticsearchmachine added the Team:Analytics label (meta label for the analytical engine team: ESQL/Aggs/Geo) on Sep 30, 2025
@elasticsearchmachine (Collaborator)

Hi @nik9000, I've created a changelog YAML for you.

```
spareValuesPreAllocSize = Math.max(spare.values.length(), spareValuesPreAllocSize / 2);
inputQueue.updateTop(spare);
spare = nextSpare;
}
```
Member Author
I could have used the slightly shorter:

```
Row inserted = spare;
spare = inputQueue.insertWithOverflow(spare);
if (inserted != spare) {
  rowFiller.writeValues(i, spare);
  spareValuesPreAllocSize = Math.max(spare.values.length(), spareValuesPreAllocSize / 2);
}
```

but this feels less magic. And this is the hot path, so I prefer seeing the guts a little bit.

Also, it cries out for a further optimization where we bail from the loop as soon as inputQueue.size() < inputQueue.topCount and then make another loop with inputQueue.lessThan(inputQueue.top(), spare).
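That two-loop shape could look something like this toy version (again a plain PriorityQueue of sort keys standing in for the real inputQueue; every name here is made up, and it assumes n > 0):

```java
import java.util.PriorityQueue;

public class TwoPhaseTopN {
    // Phase 1: the heap isn't full yet, so every row is inserted with no
    // competitiveness check at all. Phase 2: the heap is full, so compare the
    // cheap sort key against the current worst entry before doing any work.
    static int fillCount(int[] sortKeys, int n, PriorityQueue<Integer> heap) {
        int fills = 0;
        int i = 0;
        // First loop: fill phase, no check needed.
        for (; i < sortKeys.length && heap.size() < n; i++) {
            heap.add(sortKeys[i]);
            fills++;
        }
        // Second loop: steady state, only competitive rows pay for the fill.
        for (; i < sortKeys.length; i++) {
            if (heap.peek() < sortKeys[i]) {
                heap.poll();
                heap.add(sortKeys[i]);
                fills++;
            }
        }
        return fills;
    }

    public static void main(String[] args) {
        PriorityQueue<Integer> heap = new PriorityQueue<>();
        System.out.println(fillCount(new int[] {5, 1, 9, 2, 8, 3, 7}, 3, heap));
    }
}
```

Splitting the phases removes the size check from the hot second loop; whether that's worth the extra code is exactly the trade-off the comment raises.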

Contributor
I think the shorter one has the definite advantage of making the common parts (the code inside the if) more obvious. Perhaps just extract Math.max(spare.values.length(), spareValuesPreAllocSize / 2) to a helper function?

@ivancea (Contributor) left a comment

LGTM!

```
// 1 is for the min-heap itself
assertThat(breaker.getMemoryRequestCount(), is(106L));
// could be less than because we don't always insert
assertThat(breaker.getMemoryRequestCount(), lessThanOrEqualTo(106L));
```
Contributor

We're making a performance improvement and, at the same time, making the tests more lax, with no extra cases (no functional change + less/laxer testing == ⚠️).
Should we add a more specific case for the expected usage? Maybe something less randomized (or not randomized at all), or try to calculate the usage in this test (which feels a bit too intricate).
Just a gut feeling; consider it a nitpick.

Member Author

Let me double check. I was getting 105 sometimes and 106; maybe I should just assert it's either 105 or 106 in that case. I'd assumed it was because we don't insert every time, but the input isn't randomized so I'm not entirely sure why. Checking.

```
spareValuesPreAllocSize = Math.max(spare.values.length(), spareValuesPreAllocSize / 2);

spare = inputQueue.insertWithOverflow(spare);
if (inputQueue.size() < inputQueue.topCount) {
```
Contributor

Nit: maybe worth commenting here that this is an insertWithOverflow() that skips some work if the value is not competitive.

Member Author

👍

@dnhatn (Member) left a comment

Great find!

@nik9000 merged commit ea64bf4 into elastic:main on Oct 3, 2025 (34 checks passed)

nik9000 commented Oct 3, 2025

> Great find!

It was @GalLalouche's find actually. I'd thought we'd do it but, alas, no. Now it's in though!


Labels

:Analytics/ES|QL (AKA ESQL), >enhancement, Team:Analytics (meta label for the analytical engine team: ESQL/Aggs/Geo), v9.3.0


5 participants