Skip to content

feat(orch): add observability for build cache#2279

Open
jakubno wants to merge 1 commit intomainfrom
observability-measure-how-long-are-data-kept-in-cache-eng-3727
Open

feat(orch): add observability for build cache#2279
jakubno wants to merge 1 commit intomainfrom
observability-measure-how-long-are-data-kept-in-cache-eng-3727

Conversation

@jakubno
Copy link
Copy Markdown
Member

@jakubno jakubno commented Apr 1, 2026

Adds a histogram for duration how long the builds stays in cache

@linear
Copy link
Copy Markdown

linear bot commented Apr 1, 2026

@cursor
Copy link
Copy Markdown

cursor bot commented Apr 1, 2026

PR Summary

Low Risk
Low risk observability-only change, but relies on correctly tracking insertion timestamps and eviction callbacks (missed/duplicated timestamps could skew metrics or slowly grow the timestamp map).

Overview
This instruments the build diff TTL cache with an OpenTelemetry histogram (orchestrator.build.cache.residence_duration) by tracking per-entry insertion times and recording the residence duration whenever an item is evicted.

Written by Cursor Bugbot for commit fc07293. This will update automatically on new commits. Configure here.


func (s *DiffStore) Add(d Diff) {
s.resetDelete(d.CacheKey())
s.insertionTimes.Store(d.CacheKey(), time.Now())
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add unconditionally overwrites the insertion time for keys already present in the cache. If Add is called for an existing key (e.g. a re-upload of the same build), the timer resets and the eviction metric will undercount the actual total residence duration.

Consider using LoadOrStore to preserve the original insertion time:

ds.insertionTimes.LoadOrStore(d.CacheKey(), time.Now())

Comment on lines 141 to 143

func (s *DiffStore) Add(d Diff) {
s.resetDelete(d.CacheKey())
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 In Add(), insertionTimes.Store is called before cache.Set, so when replacing an existing key the eviction callback for the old item deletes the just-stored insertion time for the new item. This causes the old item to record a near-zero residence duration, and the new item has no insertion time tracked so its eventual eviction silently records no metric at all. Fix: move insertionTimes.Store(d.CacheKey(), time.Now()) to after s.cache.Set(...) returns.

Extended reasoning...

Root Cause

In Add() (cache.go:142-143), the insertion time is stored before the item is placed in the cache:

func (s *DiffStore) Add(d Diff) {
    s.resetDelete(d.CacheKey())
    s.insertionTimes.Store(d.CacheKey(), time.Now())  // stored first
    s.cache.Set(d.CacheKey(), d, ttlcache.DefaultTTL) // eviction fires here
}

When cache.Set is called on an existing key, ttlcache fires the OnEviction callback synchronously for the old item before returning. That callback (cache.go:79-82) calls ds.insertionTimes.LoadAndDelete(item.Key()), which finds and removes the just-stored T2 (the NEW item's insertion time) rather than T1 (the old item's time), because they share the same key.

Step-by-Step Proof

  1. Key K is already in the cache with insertion time T1 stored in insertionTimes.
  2. Add() is called again with the same key K.
  3. insertionTimes.Store(K, T2) runs — overwrites T1 with T2 (now).
  4. cache.Set(K, newValue) is called; ttlcache evicts the old item synchronously.
  5. The eviction callback calls insertionTimes.LoadAndDelete(K), which deletes T2 (not T1).
  6. Result: the old item records time.Since(T2) ~= 0 ns as its residence duration — wildly incorrect.
  7. When the new item eventually evicts, LoadAndDelete(K) returns ok=false — no metric recorded at all.

Why Existing Code Doesn't Prevent This

The resetDelete(d.CacheKey()) call at the top of Add() explicitly handles the case where the key is already in the cache (scheduled for deletion), confirming the author anticipated replacements. However, the ordering of insertionTimes.Store relative to cache.Set was not considered.

Impact

This is a correctness bug in the exact observability feature this PR introduces. Every time Add() is called with a key that already exists in the cache:

  • The old item reports ~0 seconds residence (severely underreported).
  • The new item is never tracked, so the orchestrator.build.cache.residence_duration histogram silently drops a data point.

Fix

Move insertionTimes.Store to after cache.Set returns:

func (s *DiffStore) Add(d Diff) {
    s.resetDelete(d.CacheKey())
    s.cache.Set(d.CacheKey(), d, ttlcache.DefaultTTL) // eviction callback for old item runs, deletes T1 correctly
    s.insertionTimes.Store(d.CacheKey(), time.Now())  // store AFTER Set, so new insertion time is not clobbered
}

With this ordering, when cache.Set evicts the old item, LoadAndDelete(K) finds and removes T1 (the old item's correct insertion time) and records an accurate duration. After Set returns, Store(K, T2) records the new item's insertion time uncontested.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants