@pabloem
Contributor

@pabloem pabloem commented Jun 24, 2025

Fixes #129567

This is a redesign of a couple of test cases in TimeSeriesIT.

Originally, the test did the following:

  1. Generate fully random data
  2. Make Elasticsearch calculate rate aggregate statistics on the input data
  3. Calculate rate aggregate statistics on the same random data in the test (i.e. replicate the ES algorithm)
  4. Check that the results from 2 and 3 match

Replicating the same algorithm in both the test and the database seemed off, so I redesigned the test to do the following:

  1. Generate random rates
  2. Generate data based on these random rates
  3. Make Elasticsearch calculate rate aggregate statistics from the input data
  4. Make sure the results from 3 match the parameters from 1

Some assumptions the test makes:

  • In testing, the actual result was about 10% lower than the originally generated rate. The assumption is that this comes from whether or not the ES rate calculation applies extrapolation. The test accounts for this by reducing the expected result by 10%.
  • In testing, the actual results varied by up to 15% from that 10%-lower estimate. The test allows for that difference as well.
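
A minimal sketch of the redesigned check, using hypothetical helper names (`counterValues`, `simpleRate`, `withinTolerance`) that are not taken from the PR. It builds a counter from a known rate, measures the rate back with a naive calculation that does no extrapolation, and applies the 10% reduction and 15% margin described above. Taking the margin as 15% of the generated rate is an assumption.

```java
public class KnownRateSketch {
    // Cumulative counter values for a constant per-second rate,
    // sampled at the given (strictly increasing) timestamps in seconds.
    static long[] counterValues(double ratePerSecond, long[] timestampsSec) {
        long[] values = new long[timestampsSec.length];
        for (int i = 0; i < timestampsSec.length; i++) {
            values[i] = Math.round(ratePerSecond * (timestampsSec[i] - timestampsSec[0]));
        }
        return values;
    }

    // Naive rate: total increase divided by elapsed time (no resets, no extrapolation).
    static double simpleRate(long[] timestampsSec, long[] values) {
        int last = values.length - 1;
        return (double) (values[last] - values[0]) / (timestampsSec[last] - timestampsSec[0]);
    }

    // Assumption: expected = 90% of the generated rate, margin = 15% of the generated rate.
    static boolean withinTolerance(double actual, double generatedRate) {
        double expected = generatedRate * 0.90;
        double delta = generatedRate * 0.15;
        return Math.abs(actual - expected) <= delta;
    }

    public static void main(String[] args) {
        long[] ts = {0, 7, 19, 30, 60};
        long[] values = counterValues(20.0, ts);
        double measured = simpleRate(ts, values);
        System.out.println("measured=" + measured + " ok=" + withinTolerance(measured, 20.0));
    }
}
```

Under these assumptions the accepted band runs from 75% to 105% of the generated rate.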

Happy to discuss if the margin of error is just too wide.

PTAL @dnhatn

@pabloem pabloem marked this pull request as ready for review July 1, 2025 00:05
@elasticsearchmachine elasticsearchmachine added the needs:triage label Jul 1, 2025
@dnhatn dnhatn self-requested a review July 1, 2025 19:35
@dnhatn dnhatn added the :StorageEngine/TSDB and >test labels and removed the needs:triage label Jul 2, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-storage-engine (Team:StorageEngine)

@dnhatn
Member

dnhatn commented Jul 2, 2025

@pabloem The approach looks good to me. I ran a hundred iterations and got some failures. Could you take a look? Thank you!

REPRODUCE WITH: ./gradlew ":x-pack:plugin:esql:internalClusterTest" --tests "org.elasticsearch.xpack.esql.action.TimeSeriesRateIT.testRateWithTimeBucketAndClusterMultipleMetricsByMin {seed=[271EF4A9835B6701:65D1E5F9A1347FC9]}" -Dtests.seed=271EF4A9835B6701

Values:
[80.17859426288307, 56.0, 2024-04-15T00:00:00.000Z, prod]
[20.49118102778968, 92.0, 2024-04-15T00:00:00.000Z, qa]
[83.52570348596274, 56.0, 2024-04-15T00:01:00.000Z, prod]
[19.896018843547324, 92.0, 2024-04-15T00:01:00.000Z, qa]
[68.79083801637314, 56.0, 2024-04-15T00:02:00.000Z, prod]

 Hosts:
p0 -> qa, rate=16, cpu=36, numDocs=169
p1 -> prod, rate=19, cpu=19, numDocs=164
p2 -> prod, rate=41, cpu=0, numDocs=162
p3 -> qa, rate=7, cpu=92, numDocs=168
p4 -> prod, rate=36, cpu=56, numDocs=167
Total rate: 119
Average rate: 23.8
Total CPU: 203
Average CPU: 40.6
Count of docs: 830
Docs in each minute:
Minute 320: 150 docs
Minute 321: 140 docs
Minute 322: 140 docs
Minute 323: 135 docs
Minute 324: 121 docs
Minute 325: 144 docs

java.lang.AssertionError:
Expected: a numeric value within <14.40000057220459> of <86.39999771118164>
     but: <68.79083801637314> differed by <3.209159122603907> more than delta <14.40000057220459>
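
The expected value and delta in this failure can be reproduced from the host list. This is a sketch under two assumptions not confirmed by the thread: that the prod bucket sums the rates of p1, p2 and p4, and that the test multiplies by the float constants `0.9f` and `0.15f` (inferred from the trailing digits in the message). The class and method names are hypothetical.

```java
public class FailureArithmetic {
    // Assumption: expected value is the generated rate scaled by the float constant 0.9f.
    static double expected(double generatedRate) {
        return generatedRate * 0.9f;
    }

    // Assumption: allowed delta is the generated rate scaled by the float constant 0.15f.
    static double delta(double generatedRate) {
        return generatedRate * 0.15f;
    }

    // How far the actual value falls outside the allowed band (<= 0 means inside).
    static double excess(double actual, double generatedRate) {
        return Math.abs(actual - expected(generatedRate)) - delta(generatedRate);
    }

    public static void main(String[] args) {
        double prodRate = 19 + 41 + 36; // p1 + p2 + p4 = 96
        double actual = 68.79083801637314; // prod bucket at 00:02
        System.out.println("expected ~ " + expected(prodRate)); // about 86.4
        System.out.println("delta    ~ " + delta(prodRate));    // about 14.4
        System.out.println("excess   ~ " + excess(actual, prodRate)); // about 3.21
    }
}
```

The prod bucket at 00:02 is about 17.6 below the expected value, which is about 3.2 more than the allowed 14.4.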

var requestCount = requestCounts.compute(host, (k, curr) -> {
    // 15% chance of reset
    if (randomInt(100) <= 15) {
        return Math.toIntExact(Math.round(hostToRate.get(k) * tsChange));
Contributor

Don't we want to add some randomization here too? Otherwise, it's just linear?

Contributor Author

The test keeps a randomly generated, constant linear rate per host, with random resets. This is how we avoid re-calculating rates in the test.

Contributor

Sounds good, let's document this. I think we can revisit this separately, e.g. store samples in an array per time-series and get the expected rate per interval.
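
A rough sketch of that follow-up idea, with hypothetical names and a reset convention that is an assumption rather than the ES behavior: keep the samples of one time series in arrays and derive the expected rate for an interval from them.

```java
public class ExpectedRateSketch {
    // Expected per-second rate for one time series over [startSec, endSec).
    // Assumes samples are sorted by strictly increasing timestamp.
    // Assumption: a drop in value is a counter reset, and the post-reset
    // value is counted as the increase since the reset.
    static double expectedRate(long[] timestampsSec, long[] values, long startSec, long endSec) {
        long increase = 0;
        int first = -1;
        int prev = -1;
        for (int i = 0; i < timestampsSec.length; i++) {
            if (timestampsSec[i] < startSec || timestampsSec[i] >= endSec) {
                continue; // sample is outside the interval
            }
            if (first < 0) {
                first = i;
            }
            if (prev >= 0) {
                long diff = values[i] - values[prev];
                increase += diff >= 0 ? diff : values[i];
            }
            prev = i;
        }
        if (first < 0 || first == prev) {
            return Double.NaN; // fewer than two samples in the interval
        }
        return (double) increase / (timestampsSec[prev] - timestampsSec[first]);
    }

    public static void main(String[] args) {
        long[] ts = {0, 10, 20, 30, 70};
        long[] values = {0, 200, 400, 600, 1400};
        System.out.println(expectedRate(ts, values, 0, 60));
    }
}
```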

        return Math.toIntExact(Math.round((curr == null ? 0 : curr) + hostToRate.get(k) * tsChange));
    }
});
if (hosts.contains(host)) {
Contributor

Nit: should we skip first, before initializing requestCount?

Contributor Author

No, because requestCount follows a linear rate: we need to advance it at every point in time, even for hosts we skip, to keep the rate constant. Let me know if that makes sense; I can try to rephrase.
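
To illustrate, here is a simplified sketch (hypothetical names, not the test code) that compares advancing the counter on every tick with advancing it only when a document is emitted. `tsChange` is the time since the previous tick, as in the snippet above.

```java
public class AdvanceBeforeSkipSketch {
    // Effective rate seen across the emitted samples of one host.
    // Assumes at least two ticks are emitted.
    static double effectiveRate(double ratePerSec, long[] tickSec, boolean[] emit, boolean advanceWhenSkipped) {
        long counter = 0;
        long firstTs = -1, lastTs = -1, firstVal = 0, lastVal = 0;
        for (int i = 0; i < tickSec.length; i++) {
            long tsChange = i == 0 ? 0 : tickSec[i] - tickSec[i - 1];
            if (emit[i] || advanceWhenSkipped) {
                counter += Math.round(ratePerSec * tsChange);
            }
            if (emit[i]) {
                if (firstTs < 0) {
                    firstTs = tickSec[i];
                    firstVal = counter;
                }
                lastTs = tickSec[i];
                lastVal = counter;
            }
        }
        return (double) (lastVal - firstVal) / (lastTs - firstTs);
    }

    public static void main(String[] args) {
        long[] ticks = {0, 10, 20, 30, 40};
        boolean[] emit = {true, false, true, false, true};
        System.out.println(effectiveRate(10.0, ticks, emit, true));  // counter advanced on every tick
        System.out.println(effectiveRate(10.0, ticks, emit, false)); // counter advanced only when emitting
    }
}
```

With every other tick skipped, advancing on every tick keeps the effective rate at the configured 10 per second, while skipping first halves it to 5.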

Member

@dnhatn dnhatn left a comment

I like the idea. LGTM, thanks @pabloem

@dnhatn
Member

dnhatn commented Jul 7, 2025

REPRODUCE WITH: ./gradlew ":x-pack:plugin:esql:internalClusterTest" --tests "org.elasticsearch.xpack.esql.action.TimeSeriesRateIT.testRateWithTimeBucketAndClusterMultipleStatsByMin {seed=[C03F3760789CCADD:E7E9047924F5E816]}" -Dtests.seed=C03F3760789CCADD -Dtests.locale=ru-KZ -Dtests.timezone=Africa/Cairo -Druntime.java=24

Values:
[27.19455030545576, 27.19455030545576, 27.19455030545576, 2024-04-15T00:00:00.000Z, prod]
[26.238585183367313, 36.124824330654754, 26.238585183367313, 2024-04-15T00:00:00.000Z, qa]
[21.26328096539162, 21.26328096539162, 21.26328096539162, 2024-04-15T00:01:00.000Z, prod]
[30.34564093259548, 49.98639208926789, 30.34564093259548, 2024-04-15T00:01:00.000Z, qa]
[24.22888264353169, 24.22888264353169, 24.22888264353169, 2024-04-15T00:02:00.000Z, prod]

 Hosts:
p0 -> qa, rate=39, cpu=42, numDocs=77
p1 -> qa, rate=26, cpu=11, numDocs=78
p2 -> prod, rate=29, cpu=82, numDocs=77
p3 -> qa, rate=14, cpu=11, numDocs=76
p4 -> qa, rate=50, cpu=91, numDocs=73
Total rate: 158
Average rate: 31.6
Total CPU: 237
Average CPU: 47.4
Count of docs: 381
Docs in each minute:
Minute 320: 59 docs
Minute 321: 63 docs
Minute 322: 64 docs
Minute 323: 62 docs
Minute 324: 57 docs
Minute 325: 65 docs
Minute 326: 11 docs

java.lang.AssertionError:

Caused by: java.lang.AssertionError: 
Expected: a numeric value within <7.500000298023224> of <43.99999976158142>
     but: <36.124824330654754> differed by <0.3751751329034434> more than delta <7.500000298023224>

@pabloem
Contributor Author

pabloem commented Jul 8, 2025

@dnhatn the failure rate is now around 5% - still pretty high. It comes from outliers in the sampling rate. WDYT?

A couple of ideas:

  • I could make the tests retryable, so a failed test runs again (it has to call populateIndex again). This would reduce the failure rate significantly.
  • I could widen the margin of error, but I don't really like this.
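
A back-of-the-envelope estimate for the retry option, assuming attempts are independent and each one fails about 5% of the time. `passesWithinAttempts` is a hypothetical helper, not an existing test-framework API.

```java
import java.util.function.BooleanSupplier;

public class RetrySketch {
    // Runs the attempt up to maxAttempts times and stops at the first success.
    static boolean passesWithinAttempts(BooleanSupplier attempt, int maxAttempts) {
        for (int i = 0; i < maxAttempts; i++) {
            if (attempt.getAsBoolean()) {
                return true;
            }
        }
        return false;
    }

    // Probability that all attempts fail, assuming independent attempts.
    static double residualFailureRate(double singleFailureRate, int attempts) {
        return Math.pow(singleFailureRate, attempts);
    }

    public static void main(String[] args) {
        // One retry (two attempts) at a 5% single-run failure rate.
        System.out.println(residualFailureRate(0.05, 2)); // roughly 0.0025, i.e. 0.25%
    }
}
```

Under that independence assumption, a single retry would bring the failure rate from about 5% to about 0.25%.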

@dnhatn
Member

dnhatn commented Jul 8, 2025

Either way works for me.

@pabloem pabloem merged commit 1669e8d into elastic:main Jul 15, 2025
33 checks passed
Labels

:StorageEngine/TSDB, Team:StorageEngine, >test, v9.2.0

Development

Successfully merging this pull request may close these issues.

Re-enable TimeSeriesIT.testRateWithTimeBucket

4 participants