Add Temporal Merge Policy for time-series data#15620

Merged
msfroh merged 11 commits into apache:main from churromorales:temporal_compaction
Mar 17, 2026
Conversation

@churromorales
Contributor

@churromorales churromorales commented Jan 27, 2026

Description

This PR introduces TemporalMergePolicy, a new merge policy designed for time-series workloads where documents contain a timestamp field. The policy groups segments into time windows and merges segments within the same window, but never merges segments across different time windows. This preserves temporal locality and improves query performance for time-range queries. Relates to #15412.

How it works

Time Bucketing

Segments are assigned to time windows based on their maximum timestamp:

  • Exponential bucketing (default): Recent data uses small windows (e.g., 1 hour), older data uses progressively larger windows (4 hours, 16 hours, etc.)
  • Fixed bucketing: All time windows have the same size
  • Old data bucket: Segments older than maxAgeSeconds are placed in a special bucket and not merged
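The bucketing rules above can be sketched as a small standalone function. This is an illustrative sketch only, not the actual TemporalMergePolicy code: the method name, the growth factor of 4, and the parameter names (baseWindowSeconds, growthFactor, maxAgeSeconds) are assumptions based on the description.

```java
// Hypothetical sketch of exponential time bucketing. Names and the
// growth factor are assumptions, not the PR's actual API.
public class ExponentialBucketing {

    static final int OLD_DATA_BUCKET = -1;

    /**
     * Maps a segment's age (now minus its max timestamp, in seconds) to a
     * bucket index. Bucket 0 covers [0, base); each successive window is
     * growthFactor times wider than the previous one (1h, 4h, 16h, ...).
     */
    static int bucketFor(long ageSeconds, long baseWindowSeconds,
                         long growthFactor, long maxAgeSeconds) {
        if (ageSeconds < 0) {
            ageSeconds = 0;  // future timestamps are treated as most recent
        }
        if (ageSeconds >= maxAgeSeconds) {
            return OLD_DATA_BUCKET;  // too old: excluded from merging
        }
        long windowSize = baseWindowSeconds;
        long windowStart = 0;
        int bucket = 0;
        while (ageSeconds >= windowStart + windowSize) {
            windowStart += windowSize;
            windowSize *= growthFactor;
            bucket++;
        }
        return bucket;
    }
}
```

With a 1-hour base window and growth factor 4, an age of 0-59 minutes lands in bucket 0, 1-5 hours in bucket 1, and so on; a negative age (future data) is clamped to bucket 0, matching the "Future Data" behavior described below.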

Merge Triggers

Merges are triggered when a time window meets two conditions:

  1. Contains at least minThreshold segments (default: 4)
  2. Total document count in the window exceeds largestSegment * compactionRatio (default: 1.2)
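The two trigger conditions can be expressed as a simple predicate. This is a minimal sketch under the assumptions stated in the list above; the method and parameter names are illustrative, not the PR's actual API.

```java
// Illustrative sketch of the two merge-trigger conditions; names are
// assumptions, not the actual TemporalMergePolicy fields.
import java.util.Collections;
import java.util.List;

public class MergeTrigger {

    /**
     * A time window qualifies for merging when it holds at least
     * minThreshold segments AND its total doc count exceeds the largest
     * segment's doc count times compactionRatio -- i.e., merging would
     * meaningfully consolidate rather than mostly rewrite one big segment.
     */
    static boolean shouldMerge(List<Integer> docCounts, int minThreshold,
                               double compactionRatio) {
        if (docCounts.size() < minThreshold) {
            return false;
        }
        long total = 0;
        for (int count : docCounts) {
            total += count;
        }
        int largest = Collections.max(docCounts);
        return total > largest * compactionRatio;
    }
}
```

For example, four 100-doc segments qualify (400 > 100 * 1.2), while one 1000-doc segment plus three 10-doc segments do not (1030 < 1200), which avoids repeatedly rewriting a dominant segment to absorb tiny ones.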

Key Constraints

  • Never merge across time windows: Even forceMerge(1) respects bucket boundaries
  • Old data protection: Very old segments (configurable via maxAgeSeconds) are excluded from merging
  • Concurrency safety: Properly checks MergeContext.getMergingSegments() to avoid "segment already merging" errors

Handling Late-Arriving and Out-of-Order Data

Time-series data rarely arrives perfectly in order. TemporalMergePolicy handles various timing scenarios:

Late-Arriving Data

When data with older timestamps arrives after newer data has been indexed:

  • Each segment is assigned to a time window based on its maximum timestamp
  • A segment containing mostly recent data with a few old records will be placed in the recent bucket
  • A segment containing only old data will be placed in the appropriate older bucket
  • Segments with mixed timestamps (spanning multiple windows) are assigned based on their max timestamp

Example:

  Segment A: timestamps [2024-01-01 to 2024-01-02] → Jan 2024 bucket
  Segment B: timestamps [2024-02-01 to 2024-02-02] → Feb 2024 bucket
  Segment C: timestamps [2024-01-15 to 2024-01-16] → Jan 2024 bucket (late arrival)

Result: Segments A and C can merge together (same bucket), but never with B
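The grouping outcome above can be reproduced with a toy sketch. For clarity this uses fixed monthly windows rather than the policy's configurable windows, and the helper name is an assumption; the key point is that each segment is keyed off its max timestamp only.

```java
// Toy illustration of max-timestamp bucketing with fixed monthly windows.
// Not the actual policy code; names are assumptions.
import java.time.LocalDate;
import java.time.YearMonth;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class LateArrival {

    /** Groups segment names by the month of their max timestamp. */
    static Map<YearMonth, List<String>> groupByWindow(
            Map<String, LocalDate> segmentMaxDates) {
        Map<YearMonth, List<String>> buckets = new TreeMap<>();
        for (var entry : segmentMaxDates.entrySet()) {
            buckets.computeIfAbsent(YearMonth.from(entry.getValue()),
                                    k -> new ArrayList<>())
                   .add(entry.getKey());
        }
        return buckets;
    }
}
```

Feeding in the three segments from the example yields a Jan 2024 bucket containing A and C (merge candidates) and a Feb 2024 bucket containing only B, regardless of the order the segments were flushed in.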

Future Data

Data with timestamps in the future (beyond current time):

  • Treated as age = 0 (most recent)
  • Placed in the smallest (most recent) time window
  • Prevents errors from clock skew or timestamp bugs

Out-of-Order Writes Within a Segment

If a single segment contains documents spanning multiple time windows:

  • The segment is bucketed by its max timestamp only
  • This prevents pathological cases where a single document with a far-future timestamp would prevent merging
  • Trade-off: Some temporal mixing can occur within individual segments before merging

Deletes

  • This policy has no delete-awareness in its normal merge path (findMerges). It groups segments purely by time bucket and uses doc count + compaction ratio to decide merges, ignoring how many documents are deleted. The findForcedDeletesMerges method only kicks in on an explicit forceMergeDeletes() call, where it filters to segments exceeding forceMergeDeletesPctAllowed (default 10%) and merges those within the same time window.
  • To handle deletes better, you could incorporate a delete ratio into the normal findMerges scoring — e.g., boost the merge priority of segments with high delete percentages (via mergeContext.numDeletesToMerge) so that heavily-deleted segments get merged sooner without waiting for an explicit forceMergeDeletes() call. This would reclaim space more proactively while still respecting time-bucket boundaries.
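The delete-ratio idea in the second bullet could look something like the following. None of this is in the PR; the scoring formula and names are hypothetical, shown only to make the suggestion concrete.

```java
// Hypothetical delete-aware merge scoring, as suggested above. Not part
// of the PR; the boost formula and names are assumptions.
public class DeleteAwareScore {

    /**
     * Returns a priority multiplier for a segment: 1.0 for a segment with
     * no deletes, scaling up with the deleted fraction so heavily-deleted
     * segments are merged sooner during normal findMerges, reclaiming
     * space without waiting for an explicit forceMergeDeletes() call.
     */
    static double score(int maxDoc, int numDeletedDocs) {
        double deleteRatio =
                maxDoc == 0 ? 0.0 : (double) numDeletedDocs / maxDoc;
        return 1.0 + deleteRatio;
    }
}
```

In a real policy the deleted-doc count would come from mergeContext.numDeletesToMerge, and the multiplier would feed into whatever ordering findMerges uses within a time window, so bucket boundaries are still respected.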

Updates

  • Updates in Lucene are delete+reinsert, so an updated document lands in the newest segment regardless of its original timestamp. If the document's timestamp falls within the current time window, it naturally merges back into the correct bucket; if not, the new segment contains a mix of timestamps from different eras, which the policy handles by bucketing on the segment's max timestamp. A segment with scattered timestamps is therefore assigned to whichever window its max falls in, and a flush consisting only of old-timestamp updates lands in an older bucket. This is worth calling out: the behavior isn't just a product of out-of-order ingestion, it's the expected outcome of updating documents whose timestamps don't match the current window.

@github-actions github-actions bot added this to the 11.0.0 milestone Jan 27, 2026
@github-actions
Contributor

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Feb 11, 2026
@churromorales
Contributor Author

@msfroh the GitHub Actions bot is labeling this as stale. Do you have any interest in this PR? If not, no worries; we can just fork and keep this in-house, and I'll close this one out. Thank you.


New Features
---------------------
* GITHUB#15620: Add Temporal Merge Policy for time-series data
Contributor

I think this could be part of the next minor version release? We need not wait for 11? I will wait for anyone more versed with this to chime in.

Contributor Author

Yeah, I don't know what the protocol is with Lucene for these types of changes; I don't believe they need to be in a major version bump.

Member

I think 10.5 is perfectly fine. As long as we mark it as "experimental" and it doesn't change any default behaviors. Folks can opt in knowing it's experimental and might evolve in the next minor.

Contributor

@churromorales -- I'm going to go ahead and move this to the 10.5 block. Then I can backport it to 10.x

Contributor Author

of course, whatever you feel is best.

@benwtrent
Member

@churromorales I am wondering: since data is so rarely out of order, how does this actually impact performance vs. the current LogMergePolicy? If data is sent in order, it's logical to assume adjacent segments fall within a common time window, and then simply merging adjacent segments makes the most sense.

@churromorales
Contributor Author

@benwtrent good question. LogByteSizeMergePolicy merges adjacent segments, but we have no control over which adjacent segments are merged or, more importantly, when to stop merging. The primary benefit of this feature is at query time: which segments can be skipped based on your query pattern. For example, we have a set of queries with a default 90-day look-back window, and another set with a default 3-year look-back window. I can't think of how you could optimize segment pruning with LogMergePolicy to handle this case; LogByteSizeMergePolicy doesn't care about time boundaries. If you know your query pattern up front, you can set time windows so that you never merge across them, keeping those time ranges disjoint.

@benwtrent
Member

@churromorales AH, I better understand now. LogByteSizeMergePolicy doesn't:

  • Handle the granular time patterns you care about, since size doesn't directly reflect temporal boundaries (e.g., merge segments into 1-hour chunks instead of merging X-sized segments).
  • Guarantee that segments flushed next to each other stay within your temporal thresholds (e.g., 12:00 UTC might span multiple segments when, optimally, it would be a single segment).

This is indeed interesting.

I haven't reviewed, but I did read your summary. I noticed a lack of information on handling "deletes and updates". I realize that this might be...adverse to this type of policy, but it should be mentioned/handled in a sane way.

@churromorales
Contributor Author

@benwtrent of course, you make a great point. This was something I was hoping would be brought up during PR review, actually :). It is one of those things where I made a decision and I'm not necessarily sure it was the right one. Let me update the description to reflect what I did and what is possible in terms of trade-offs. TLDR: I don't handle deletes during regular merges (although I could; for us it wasn't worth the extra I/O, but for upstream I wasn't totally sure). IndexWriter.forceMergeDeletes() works as expected, but there are a couple of caveats. Want me to update the PR description, or should we just have this discussion in the thread?

@benwtrent
Member

@churromorales please, update the description on how deletes & updates are considered and the resulting behavior.

Deletes aren't that large of a concern; I would expect a merge that only expunges deletes to work as normal, and deletes to be removed by "natural" merges.

Updates generally then mean the document will be "reindexed" into a brand new segment. I could see this fitting in nicely if the updates occur to documents that have timestamps that are already within the same range. However, if not, then you end up with new segments with a hodgepodge of various timestamps. Basically, a bunch of unordered data. Which you do call out as being handled, but it should be specified that this also applies to document updates.

@msfroh
Contributor

msfroh commented Feb 11, 2026

TLDR i don't handle deletes during regular merges (although I could, for us it wasn't worth the extra I/O, but for upstream I wasn't totally sure.

Oh... that's a very interesting idea! For a use-case that's almost entirely append-only (like most time series), I can see how that would be the right choice.

I wonder if it makes sense to add a toggle to choose this behavior? Or would that be adding too many knobs? (Even if it is configurable, I'm kind of inclined to make "ignore deletes during regular merges" the default, since that's probably what the target audience would want.)

@github-actions github-actions bot removed the Stale label Feb 12, 2026
@churromorales
Contributor Author

@msfroh honestly I felt that if you had less than 5% deletes then this works just fine, but we could do what TieredMergePolicy does and add a deletesPctAllowed and follow that model. I am totally fine with either approach; I will defer to you guys here.

@churromorales
Contributor Author

I made all the requested updates; any other concerns? I am leaving town for a bit, so if you have any feedback, I can respond today or tomorrow.

@churromorales
Contributor Author

@benwtrent made the updates you requested, anything more required from my end? Thanks for the feedback.

@github-actions
Contributor

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Mar 13, 2026
@msfroh
Contributor

msfroh commented Mar 13, 2026

Hey -- sorry for the delay, @churromorales. I'll give this one last check this afternoon, then I'll go ahead and merge it if no other objections.

@churromorales
Contributor Author

no worries @msfroh thanks for taking a look, appreciate it.

@github-actions github-actions bot removed the Stale label Mar 14, 2026
@msfroh msfroh merged commit 1596b76 into apache:main Mar 17, 2026
13 checks passed
msfroh added a commit that referenced this pull request Mar 18, 2026
Introduces TemporalMergePolicy, a new merge policy designed for time-series workloads where documents
contain a timestamp field. The policy groups segments into time windows and merges segments within the same
window, but never merges segments across different time windows. This preserves temporal locality and
improves query performance for time-range queries.

---------

Co-authored-by: Shubham Sharma <shubham234727@gmail.com>
Co-authored-by: Michael Froh <msfroh@apache.org>
@benwtrent
Member

@churromorales @msfroh the milestone says 11, but I see comments about 10.5. Is this to be backported, etc.? Is the assigned milestone just wrong?

@churromorales
Contributor Author

Sorry, I'm unfamiliar with how the milestone is set here, but #15620 (comment) states it goes into 10.5. If you need a backport, I can make it against whatever branch you want; I assume that would be the way. First time contributing here, so if you want me to do it, let me know what I can do to help.

@benwtrent benwtrent modified the milestones: 11.0.0, 10.5.0 Mar 18, 2026
@benwtrent
Member

It looks like @msfroh already backported and adjusted the CHANGES entry. I just corrected the milestone on the PR.
