Add Temporal Merge Policy for time-series data#15620
Conversation
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

@msfroh the github actions bot is labeling this as stale, do you guys have any interest in this PR? If not, no worries we can just fork and keep this in-house and I'll close this one out. Thank you.
lucene/CHANGES.txt (Outdated)

New Features
* GITHUB#15620: Add Temporal Merge Policy for time-series data
I think this could be part of the next minor version release? We need not wait for 11? I will wait for anyone more versed with this to chime in.

Yeah, I don't know what the protocol with Lucene is for these types of changes; I don't believe they need to be in a major version bump.

I think 10.5 is perfectly fine. As long as we mark it as "experimental" and it doesn't change any default behaviors. Folks can "opt-in" knowing it's experimental and might evolve in the next minor.

@churromorales -- I'm going to go ahead and move this to the 10.5 block. Then I can backport it to 10.x

of course, whatever you feel is best.
lucene/core/src/java/org/apache/lucene/index/TemporalMergePolicy.java (Outdated; resolved review threads)
…cy.java Co-authored-by: Shubham Sharma <shubham234727@gmail.com>
@churromorales I am wondering, since data is so rarely out of order, how does this actually impact performance vs. the current LogMergePolicy? If data is sent in order, it's logical to consider adjacent segments to be within a common time window, and then simply merging adjacent segments makes the most sense.

@benwtrent good question, LogByteSizeMergePolicy merges adjacent segments, but we have no control over which adjacent segments are merged, or more importantly when to stop merging. The primary benefit of this feature is at query time: which segments can be skipped based on your query pattern. For example, we have a set of queries that have a default 90-day look-back window, and another set that have a default 3-year look-back window. I can't think of how you could optimize segment pruning with LogMergePolicy to handle this case; LogByteSizeMergePolicy doesn't care about time boundaries. If you know your query pattern up front, then you can set time windows so you never merge across these windows, keeping these time ranges disjoint.
@churromorales AH, I better understand now. LogByteSizeMergePolicy doesn't respect time boundaries.
This is indeed interesting. I haven't reviewed, but I did read your summary. I noticed a lack of information on handling "deletes and updates". I realize that this might be...adverse to this type of policy, but it should be mentioned/handled in a sane way.
@benwtrent of course, you make a great point. This was something I was hoping would be brought up during a PR review actually :). It is one of those things where I made a decision and I'm not necessarily sure it was the right one. Let me update the description to reflect what I did, and what is possible in terms of trade-offs. TL;DR: I don't handle deletes during regular merges (although I could; for us it wasn't worth the extra I/O, but for upstream I wasn't totally sure).
@churromorales please update the description on how deletes & updates are considered and the resulting behavior. Deletes aren't that large of a concern; I would expect a merge that only expunges deletes should work as normal, and deletes will be removed with "natural" merges. Updates generally mean the document will be "reindexed" into a brand new segment. I could see this fitting in nicely if the updates occur to documents that have timestamps that are already within the same range. However, if not, then you end up with new segments with a hodgepodge of various timestamps -- basically, a bunch of unordered data. Which you do call out as being handled, but it should be specified that this also applies to document updates.
Oh... that's a very interesting idea! For a use-case that's almost entirely append-only (like most time series), I can see how that would be the right choice. I wonder if it makes sense to add a toggle to choose this behavior? Or would that be adding too many knobs? (Even if it is configurable, I'm kind of inclined to make "ignore deletes during regular merges" the default, since that's probably what the target audience would want.)
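A sketch of what such a toggle might look like (purely hypothetical class and method names, not the actual TemporalMergePolicy API), with "ignore deletes during regular merges" as the default discussed above:

```java
// Hypothetical configuration sketch: a single knob for whether regular merges
// also reclaim deleted documents, defaulting to "ignore" (skip the extra I/O).
public class TemporalMergeTuning {
    private boolean reclaimDeletesOnRegularMerges = false; // proposed default

    public void setReclaimDeletesOnRegularMerges(boolean reclaim) {
        this.reclaimDeletesOnRegularMerges = reclaim;
    }

    public boolean reclaimDeletesOnRegularMerges() {
        return reclaimDeletesOnRegularMerges;
    }

    public static void main(String[] args) {
        TemporalMergeTuning tuning = new TemporalMergeTuning();
        System.out.println(tuning.reclaimDeletesOnRegularMerges()); // prints false
        tuning.setReclaimDeletesOnRegularMerges(true); // opt in to the extra I/O
        System.out.println(tuning.reclaimDeletesOnRegularMerges()); // prints true
    }
}
```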
lucene/core/src/java/org/apache/lucene/index/TemporalMergePolicy.java (Outdated; resolved review thread)
@msfroh honestly I felt if you had less than 5% deletes then this works just fine, but we could do what
I made all the requested updates, any other concerns for you guys? I am leaving town for a bit - so if you guys have any feedback, I can respond today / tomorrow.
@benwtrent made the updates you requested, anything more required from my end? Thanks for the feedback.
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

Hey -- sorry for the delay, @churromorales. I'll give this one last check this afternoon, then I'll go ahead and merge it if no other objections.

no worries @msfroh thanks for taking a look, appreciate it.
Introduces TemporalMergePolicy, a new merge policy designed for time-series workloads where documents contain a timestamp field. The policy groups segments into time windows and merges segments within the same window, but never merges segments across different time windows. This preserves temporal locality and improves query performance for time-range queries. --------- Co-authored-by: Shubham Sharma <shubham234727@gmail.com> Co-authored-by: Michael Froh <msfroh@apache.org>
@churromorales @msfroh milestone says 11, I see comments about 10.5. Is this to be backported, etc.? Is just the assigned milestone wrong?

Sorry, I'm unfamiliar with how the milestone is set here, but #15620 (comment) states it goes into 10.5. If you need a backport, I can make it against whatever branch you want; I assume that would be the way. First time contributing here, so if you want me to do it, let me know what I can do to help.

It looks like @msfroh already backported and adjusted the CHANGES entry. I just corrected the milestone on the PR.
Description
This PR introduces TemporalMergePolicy, a new merge policy designed for time-series workloads where documents contain a timestamp field. The policy groups segments into time windows and merges segments within the same window, but never merges segments across different time windows. This preserves temporal locality and improves query performance for time-range queries. Relates to #15412.
How it works
Time Bucketing
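A minimal sketch of the bucketing arithmetic, assuming a fixed window size in milliseconds and floor-to-boundary assignment (the class and method names are illustrative, not the actual Lucene API):

```java
// Illustrative bucketing sketch: every timestamp maps to the start of its
// fixed-size window, so two documents share a bucket exactly when they fall
// in the same time window.
public class TimeBucketSketch {
    static long bucketStart(long timestampMillis, long windowMillis) {
        // floorDiv keeps the arithmetic correct for pre-epoch (negative) timestamps.
        return Math.floorDiv(timestampMillis, windowMillis) * windowMillis;
    }

    public static void main(String[] args) {
        long week = 7L * 86_400_000L;
        long t1 = 1_000_000_000_000L;
        long t2 = t1 + 3_600_000L; // one hour later: same weekly window
        System.out.println(bucketStart(t1, week) == bucketStart(t2, week)); // prints true
    }
}
```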
Merge Triggers
Merges are triggered when a time window meets two conditions:
Key Constraints
Handling Late-Arriving and Out-of-Order Data
Time-series data rarely arrives perfectly in order. TemporalMergePolicy handles various timing scenarios:
Late-Arriving Data
When data with older timestamps arrives after newer data has been indexed:
Example:
Result: Segments A and C can merge together (same bucket), but never with B
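As a hedged illustration of the result above (the segment timestamps and a weekly window are assumed, not taken from the PR): A and C land in the same bucket, while B does not.

```java
// Illustrative only: late-arriving segment C carries old timestamps, so it
// falls into the same bucket as segment A, never segment B.
public class LateArrivalSketch {
    static long bucket(long ts, long windowMillis) {
        return Math.floorDiv(ts, windowMillis);
    }

    public static void main(String[] args) {
        long day = 86_400_000L;
        long window = 7 * day;
        long a = 0;        // segment A: old data, indexed first
        long b = 10 * day; // segment B: newer data, indexed next
        long c = 2 * day;  // segment C: late-arriving data with old timestamps
        System.out.println(bucket(a, window) == bucket(c, window)); // prints true
        System.out.println(bucket(a, window) == bucket(b, window)); // prints false
    }
}
```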
Future Data
Data with timestamps in the future (beyond current time):
Out-of-Order Writes Within a Segment
If a single segment contains documents spanning multiple time windows:
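One way such a segment could be classified, sketched in plain Java (illustrative only; the policy's actual handling may differ): a segment spans multiple windows when its min and max document timestamps land in different buckets.

```java
// Illustrative check: a segment "spans" multiple time windows when the
// buckets of its minimum and maximum document timestamps differ.
public class SegmentWindowCheck {
    static boolean spansMultipleWindows(long[] timestamps, long windowMillis) {
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        for (long t : timestamps) {
            min = Math.min(min, t);
            max = Math.max(max, t);
        }
        return Math.floorDiv(min, windowMillis) != Math.floorDiv(max, windowMillis);
    }

    public static void main(String[] args) {
        long window = 100L;
        System.out.println(spansMultipleWindows(new long[] {10, 20, 90}, window));  // prints false
        System.out.println(spansMultipleWindows(new long[] {10, 20, 150}, window)); // prints true
    }
}
```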
Deletes
forceMergeDeletesPctAllowed (default 10%) and merges those within the same time window.

Updates