TSDB ingest performance: combine routing and tsdb hashing #132566
Fall back to index.routing_path if the dimensions can't be identified by a simple path match
Pinging @elastic/es-storage-engine (Team:StorageEngine)
I think @martijnvg should also take a look, in case I missed anything wrt routing and tsids.
I spent some time with this PR but wasn't able to do a full review yet. My main concern is around dimensions that get dynamically introduced and that are not part of passthrough object fields.
Before, if a new dimension field was introduced, we only routed based on routing fields and then computed a tsid at index time. That tsid would differ from the one for the time series without the new dimension field, but at least the new tsid would consistently land in the same shard. More critically, subsequent documents for the new tsid would also land in the same shard.
With this change, if a new dimension is dynamically introduced outside passthrough object fields, then the first document for the updated time series is routed to a shard, and then at index time a new dimension is dynamically mapped. I think the problem we can now run into is that the next document for the updated time series can be routed to a different shard than the one the previous document was indexed into.
The fact that the tsid changes and that a time series becomes a different time series is not what I'm worried about. This is also what happens today. I'm worried about documents from the same tsid ending up in different shards. Please let me know if that can indeed happen or if my understanding is incorrect.
server/src/main/java/org/elasticsearch/action/index/IndexRequestBuilder.java
server/src/main/java/org/elasticsearch/cluster/routing/IndexRouting.java
 * <p>
 * The _tsid can not be directly set by a user, it is set by the coordinating node.
 */
public IndexRequest tsid(BytesRef tsid) {
I think this isn't too bad and is perhaps preferable to routing. It can't be set via write APIs, and it is clear what the purpose of this property is.
At a high-level, the idea is that we only enter the new
I'm not sure if I fully understand the scenario you've described. Here's my take based on my understanding of your question. Let me know if I'm off. With this new change, the _tsid is created in the coordinating node and the routing happens based on the tsid. So for a given _tsid, all documents are still consistently routed to the same shard. There can be instances where users manually mark an existing field as a dimension. After the next rollover, this will cause the _tsid and the shard routing to be different. But that should be fine because it's a new time series for all intents and purposes. There can't be a situation where an existing field is dynamically mapped to a dimension. That's because dynamic mappings only apply to fields that aren't in the mapping yet. If a new field is mapped based on a dynamic mapping, it's the first time we've encountered it, so it shouldn't affect existing time series. Also, we know that the new field has to be in the
There's an interesting test failure related to reindexing. The issue is that the custom Lines 115 to 122 in 44549d0
There's no way to provide the custom metadata from the source index to the target index as the In other places, like in downsampling this is solved by the action running on the master node and submitting a task that uses Not sure how to best tackle this. A few options:
That wouldn't work for reindex?
I think we would need more use cases to justify adding this capability to create index api.
I prefer this approach. I do wonder whether we can set
Why wouldn't this work for reindex?
@dakrone would you be ok with that? You proposed using custom index metadata rather than a private index setting.
Not today, but this is doable. Probably needs reviews/buy-in from core infra, though. In fact, this is what I implemented initially. See also d449b79c and a409a95c.
@rjernst I'd like to get your perspective on this as well. You can start reading from this comment. The TL;DR is that we want to store the time series dimension somewhere in index metadata. Either as a private index setting or as custom metadata. If we go with the private index settings based approach, we'd need functionality that allows us to set private settings if they're system-provided. This affects settings added via an @gmarouli noted another potential issue with the approach of using custom index metadata. This would probably not work well with CCR. At least, we would need to add functionality to copy custom index metadata into the follower index.
Without having given this a ton of thought, this seems like the most straightforward solution (we've done such things in other places before).
FWIW I think making the transport request allow private index settings would also work, but it would mean needing to move validation around to the edge (where the rest request is parsed), or have a separate part of the request specifically for private index settings that can't be set by the rest request.
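To make the second option concrete, here is a minimal sketch of a request that keeps user-supplied settings and system-provided settings in separate parts, so that validation only rejects private keys coming from the user-facing part. The class name, method name, and the choice of index.dimensions as the private key are assumptions for illustration, not Elasticsearch APIs.

```java
import java.util.Map;
import java.util.Set;

public final class PrivateSettingsSketch {

    // Hypothetical list of private setting keys; the real list lives elsewhere.
    private static final Set<String> PRIVATE_KEYS = Set.of("index.dimensions");

    /**
     * Rejects private keys in the user-supplied part of the request.
     * The system-provided part is trusted and therefore not checked.
     */
    public static void validate(Map<String, String> userSettings, Map<String, String> systemSettings) {
        for (String key : userSettings.keySet()) {
            if (PRIVATE_KEYS.contains(key)) {
                throw new IllegalArgumentException("private index setting [" + key + "] can not be set explicitly");
            }
        }
    }
}
```

With this split, the REST layer would only ever populate the user part, while internal callers such as a settings provider could populate the system part.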
Thanks for the timely feedback, Ryan!
What I did in a409a95c was to add a flag to The change in d449b79c is about making it so that during index creation/validation, we only disallow private settings coming from index templates and allow When we all agree that this is the best path forward, I'll revert some of the changes made in #133232, which added the ability to provide custom index metadata via Sounds like @martijnvg is on board and @rjernst seems to be generally favorable after having a first glance. Curious to hear from @dakrone.
I think that sounds reasonable to me as well, as long as Ryan and Martijn are happy.
I've created a PR that allows system-provided private settings: #133789
Instead of hashing dimensions during routing and then again during document parsing, this combines the two steps. The tsid is created during routing and then used to create a routing hash. The tsid is then sent to the data nodes, where its presence acts as a signal that creating the tsid during document parsing isn't required anymore.
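The idea of hashing the dimensions once and deriving the shard from the result can be sketched as follows. This is only an illustration: the class and method names are hypothetical, and a plain FNV-1a hash stands in for the actual tsid construction done by TsidBuilder, which works differently.

```java
import java.util.Map;
import java.util.TreeMap;

public final class TsidRoutingSketch {

    private static final long FNV_OFFSET = 0xcbf29ce484222325L;
    private static final long FNV_PRIME = 0x100000001b3L;

    /** Hashes all dimensions once. A stand-in for the real tsid, not its actual format. */
    public static long buildTsid(Map<String, String> dimensions) {
        long hash = FNV_OFFSET;
        // TreeMap sorts by field name, so the field order in the document does not matter.
        for (Map.Entry<String, String> e : new TreeMap<>(dimensions).entrySet()) {
            hash = mix(hash, e.getKey());
            hash = mix(hash, e.getValue());
        }
        return hash;
    }

    private static long mix(long hash, String s) {
        for (int i = 0; i < s.length(); i++) {
            hash ^= s.charAt(i);
            hash *= FNV_PRIME;
        }
        // Separator so that ("ab", "c") and ("a", "bc") do not hash the same way.
        hash ^= 0xFFFF;
        hash *= FNV_PRIME;
        return hash;
    }

    /** Derives the shard from the already computed tsid instead of hashing the dimensions again. */
    public static int shardFor(long tsid, int numShards) {
        return Math.floorMod(Long.hashCode(tsid), numShards);
    }
}
```

Because the shard is a pure function of the tsid, two documents with the same dimension values always map to the same shard in this sketch, regardless of field order.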
Instead of populating the index.routing_path setting that can differ from the document dimensions, this now populates a new index.dimensions index setting containing all dimensions. This setting isn't user-configurable (todo). In case users manually set index.routing_path, the new optimization doesn't kick in so that routing and tsid creation is working as before. Additionally, if the dimension fields can't be expressed as a simple set of path matches (for example when using a dynamic template with a match_mapping_type that sets time_series_dimension: true), it falls back to populating index.routing_path.
As an additional benefit, the new _tsids are shorter, which may have benefits at query time. While they're shorter, they still retain the main properties: clustering similar time series together (which helps in compression) and making collisions very unlikely. More details in the JavaDoc of TsidBuilder. In fact, based on my testing, the compression is even a bit better after this change.
I've added a dependency on hash4j which provides an efficient way to hash strings, without having to create a temporary utf-8 byte array, as well as a nice API.
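The choice between index.dimensions and index.routing_path described above can be summarized in a small decision function. This is a sketch under assumptions: the enum, the method, and the boolean input are hypothetical and only mirror the three cases from the description, not the actual code in this PR.

```java
import java.util.List;

public final class DimensionSettingSketch {

    public enum Mode {
        USER_ROUTING_PATH,      // user set index.routing_path: route and build tsids as before
        INDEX_DIMENSIONS,       // populate index.dimensions and use the combined hashing
        GENERATED_ROUTING_PATH  // dimensions are not simple path matches: fall back
    }

    /**
     * @param userRoutingPath        value of index.routing_path if the user set it, else null or empty
     * @param simplePathMatchesOnly  whether all dimensions can be expressed as simple path matches
     */
    public static Mode choose(List<String> userRoutingPath, boolean simplePathMatchesOnly) {
        if (userRoutingPath != null && !userRoutingPath.isEmpty()) {
            return Mode.USER_ROUTING_PATH;
        }
        if (simplePathMatchesOnly) {
            return Mode.INDEX_DIMENSIONS;
        }
        return Mode.GENERATED_ROUTING_PATH;
    }
}
```

A dynamic template with a match_mapping_type that sets time_series_dimension: true would correspond to simplePathMatchesOnly being false here.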
Remaining issues to work out:
index.dimensions a private setting
index.dimensions when adding a new dimension field to the mappings.
index.dimensions so that the coordinating node always knows which paths will be considered dimensions.
Sub-PRs