Skip to content

Conversation

felixbarny
Copy link
Member

@felixbarny felixbarny commented Aug 8, 2025

Instead of hashing dimensions during routing and then again during document parsing, this combines the two steps. The tsid is created during routing and then used to create a routing hash. The tsid is then sent to the data nodes which acts as a signal that creating the tsid during document parsing isn't required anymore.

Instead of populating the index.routing_path setting that can differ from the document dimensions, this now populates a new index.dimensions index setting containing all dimensions. This setting isn't user-configurable (todo). In case users manually set index.routing_path, the new optimization doesn't kick in so that routing and tsid creation is working as before. Additionally, if the dimension fields can't be expressed as a simple set of path matches (for example when using a dynamic template with a match_mapping_type that sets time_series_dimension: true), it falls back to populating index.routing_path.

As an additional benefit, the new _tsids are shorter, which may have benefits at query time. While they're shorter, they still retain the main properties: clustering similar time series together (which helps in compression) and making collisions very unlikely. More details in the JavaDoc of TsidBuilder. In fact, based on my testing, the compression is even a bit better after this change.

Remaining issues to work out:

  • Make index.dimensions a private setting
  • Update index.dimensions when adding a new dimension field to the mappings.
  • Ensure that all dynamic templates are incorporated into index.dimensions so that the coordinating node always knows which paths will be considered dimensions.

Sub-PRs

@felixbarny felixbarny requested review from a team as code owners August 8, 2025 09:09
@felixbarny felixbarny added the :StorageEngine/TSDB You know, for Metrics label Aug 8, 2025
@felixbarny felixbarny marked this pull request as draft August 8, 2025 09:09
@elasticsearchmachine elasticsearchmachine added external-contributor Pull request authored by a developer outside the Elasticsearch team v9.2.0 labels Aug 8, 2025
@felixbarny felixbarny marked this pull request as ready for review August 11, 2025 10:13
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-storage-engine (Team:StorageEngine)

@felixbarny felixbarny requested a review from kkrik-es August 11, 2025 10:13
Copy link
Contributor

@henningandersen henningandersen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did a shallow initial pass. I'll need to reeducate myself on a few concepts to fully grasp everything though. But seems relatively isolated.

@felixbarny felixbarny disabled auto-merge September 17, 2025 08:56
@felixbarny felixbarny merged commit a3f5ea5 into elastic:main Sep 21, 2025
34 checks passed
@felixbarny felixbarny deleted the tsdb-hash-once branch September 21, 2025 09:03
Copy link
Contributor

@henningandersen henningandersen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few more comments, sorry for just missing the merge time.

szybia added a commit to szybia/elasticsearch that referenced this pull request Sep 22, 2025
* upstream/main: (50 commits)
  Disable utf-8 parsing optimization (elastic#135172)
  rest-api-spec: fix master_timeout typo (elastic#135167)
  Fixes countDistinctWithConditions in csv-spec tests (elastic#135097)
  Fix test failure by checking for feature flag (elastic#135174)
  Fix deadlock in ThreadPoolMergeScheduler when a failing merge closes the IndexWriter (elastic#134656)
  Make SecureString comparisons constant time (elastic#135053)
  Mute org.elasticsearch.test.rest.yaml.CcsCommonYamlTestSuiteIT test {p0=search/160_exists_query/Test exists query on mapped geo_point field with no doc values} elastic#135164
  ESQL: Replace function count tests (elastic#134951)
  Mute org.elasticsearch.compute.aggregation.SampleBooleanAggregatorFunctionTests testSimpleWithCranky elastic#135163
  Mute org.elasticsearch.xpack.test.rest.XPackRestIT test {p0=analytics/nested_top_metrics_sort/terms order by top metrics numeric not null integer values} elastic#135162
  Mute org.elasticsearch.xpack.test.rest.XPackRestIT test {p0=analytics/nested_top_metrics_sort/terms order by top metrics numeric not null double values} elastic#135159
  TSDB ingest performance: combine routing and tsdb hashing (elastic#132566)
  Mute org.elasticsearch.compute.aggregation.SampleBytesRefAggregatorFunctionTests testSimpleWithCranky elastic#135157
  Mute org.elasticsearch.xpack.logsdb.qa.BulkStoredSourceChallengeRestIT testHistogramAggregation elastic#135156
  Mute org.elasticsearch.xpack.logsdb.qa.StandardVersusStandardReindexedIntoLogsDbChallengeRestIT testHistogramAggregation elastic#135155
  Mute org.elasticsearch.xpack.logsdb.qa.LogsDbVersusLogsDbReindexedIntoStandardModeChallengeRestIT testHistogramAggregation elastic#135154
  Mute org.elasticsearch.xpack.logsdb.qa.BulkChallengeRestIT testHistogramAggregation elastic#135153
  Mute org.elasticsearch.discovery.ClusterDisruptionIT testAckedIndexing elastic#117024
  Mute org.elasticsearch.lucene.RollingUpgradeSearchableSnapshotIndexCompatibilityIT testMountSearchableSnapshot {p0=[9.2.0, 9.2.0, 9.2.0]} elastic#135151
  Mute org.elasticsearch.lucene.RollingUpgradeSearchableSnapshotIndexCompatibilityIT testSearchableSnapshotUpgrade {p0=[9.2.0, 9.2.0, 9.2.0]} elastic#135150
  ...
felixbarny added a commit to felixbarny/elasticsearch that referenced this pull request Sep 22, 2025
With implementations IndexRouting.ExtractFromSource.ForRoutingPath and IndexRouting.ExtractFromSource.ForIndexDimensions.
This addresses review comments from elastic#132566.
gmjehovich pushed a commit to gmjehovich/elasticsearch that referenced this pull request Sep 22, 2025
…2566)

Instead of hashing dimensions during routing and then again during document parsing, this combines the two steps. The tsid is created during routing and then used to create a routing hash. The tsid is then sent to the data nodes which acts as a signal that creating the tsid during document parsing isn't required anymore.
DonalEvans pushed a commit to DonalEvans/elasticsearch that referenced this pull request Sep 22, 2025
…2566)

Instead of hashing dimensions during routing and then again during document parsing, this combines the two steps. The tsid is created during routing and then used to create a routing hash. The tsid is then sent to the data nodes which acts as a signal that creating the tsid during document parsing isn't required anymore.
felixbarny added a commit to felixbarny/elasticsearch that referenced this pull request Sep 23, 2025
In elastic#133232, we've added the ability to provide index metadata with an IndexSettingProvider.
It turned out that we don't need that functionality as we ended up using a private index setting in elastic#132566.

This also adds the `IndexVersion` as another parameter. This is in preparation for [this](elastic#132566 (comment)) suggestion to conditionally set one or another setting, depending on the index version.
felixbarny added a commit that referenced this pull request Sep 24, 2025
In #133232, we've added the ability to provide index metadata with an IndexSettingProvider.
It turned out that we don't need that functionality as we ended up using a private index setting in #132566.

This also adds the `IndexVersion` as another parameter. This is in preparation for [this](#132566 (comment)) suggestion to conditionally set one or another setting, depending on the index version.

`IndexSettingProvider`s are now disallowed from providing the `index.version.created` setting. Otherwise, they can't rely on the `IndexVersion` they receive to be the one that will be actually used for the created index as another provider may change it.
felixbarny added a commit that referenced this pull request Sep 25, 2025
)

With implementations IndexRouting.ExtractFromSource.ForRoutingPath and IndexRouting.ExtractFromSource.ForIndexDimensions.
This addresses review comments from #132566.

Also fixes cases where the tsid is not provided by the coordinating node, such as for translog operations.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

external-contributor Pull request authored by a developer outside the Elasticsearch team >non-issue serverless-linked Added by automation, don't add manually :StorageEngine/TSDB You know, for Metrics Team:StorageEngine v9.2.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants