fix: PDX-2459 avoid parallel cache-misses on the same token #17

arnecls · 2026-02-10T11:17:28Z

When many requests for the same token arrive at the same time and hit a cold cache, each request triggers its own cache miss.

In severe cases, this can lead to rate limiting.

This change introduces a FIFO, lock-based queue keyed by token ID to tackle this problem. Lock acquisition is bounded by the request context; if it times out, the server returns a 429 (overloaded, come back later).

Copilot

Pull request overview

Introduces a per-token FIFO ticket lock to prevent parallel cache misses for the same token, reducing thundering-herd behavior and downstream rate limiting.

Changes:

Added a TicketLock (FIFO lock with context-aware cancellation) and a uint64 min-heap for tracking canceled tickets.
Updated access token handler to serialize same-token cache-miss fetches and return 429 on lock acquisition timeout.
Extended token cache with a per-token inflight lock map.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 8 comments.


File	Description
internal/shared/ticketlock.go	Adds FIFO ticket lock with canceled-ticket tracking to avoid parallel cache misses.
internal/shared/intheap.go	Adds `HeapUint64` to support min-heap operations for canceled tickets.
internal/shared/ticketlock_test.go	Adds a basic unit test for ticket allocation, cancellation, and unlock behavior.
cmd/metadata-server/tokencache.go	Adds `inflight` lock map and accessor to retrieve per-token locks.
cmd/metadata-server/tokenhandlers.go	Uses per-token lock on cache miss and returns `429` when lock can’t be acquired.


Co-authored-by: Copilot <[email protected]> Signed-off-by: Arne Claus <[email protected]>

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.


Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.


Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.


Co-authored-by: Copilot <[email protected]> Signed-off-by: Arne Claus <[email protected]>

github-actions · 2026-02-11T16:09:50Z

Merging this branch will not change overall coverage

Impacted Packages	Coverage Δ	🤖
github.com/trivago/identity-metadata-server/cmd/metadata-server	0.00% (ø)
github.com/trivago/identity-metadata-server/internal/shared	0.00% (ø)

Coverage by file

Changed files (no unit tests)

Changed File	Coverage Δ	Total	Covered	Missed	🤖
github.com/trivago/identity-metadata-server/cmd/metadata-server/tokencache.go	0.00% (ø)	0	0	0
github.com/trivago/identity-metadata-server/cmd/metadata-server/tokenhandlers.go	0.00% (ø)	0	0	0
github.com/trivago/identity-metadata-server/internal/shared/intheap.go	0.00% (ø)	0	0	0
github.com/trivago/identity-metadata-server/internal/shared/ticketlock.go	0.00% (ø)	0	0	0

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

github.com/trivago/identity-metadata-server/cmd/metadata-server/tokencache_test.go
github.com/trivago/identity-metadata-server/internal/shared/ticketlock_test.go


Copilot

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.


Copilot

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.


Copilot · 2026-02-12T10:55:18Z

internal/shared/ticketlock.go

+			defer l.ticketGuard.Unlock()
+			heap.Push(l.canceledTickets, ticket)


Cancellation can deadlock the lock: there’s a race where activeTicket can become equal to ticket right after the activeTicket == ticket check but before this ctx.Done() branch executes. In that case, returning 0 without advancing activeTicket leaves the ticket ‘active’ with no owner to call Unlock(). Consider handling ctx.Done() by re-checking activeTicket while holding ticketGuard and either (a) treating the lock as acquired if activeTicket == ticket, or (b) if aborting while activeTicket == ticket, advancing activeTicket (and skipping any consecutive canceled tickets) so other waiters can make progress.

Suggested change

-			defer l.ticketGuard.Unlock()
-			heap.Push(l.canceledTickets, ticket)
+			// Re-check if this ticket has become active while we were waiting
+			// for the context to be canceled. If so, treat the lock as acquired
+			// to avoid leaving an active ticket with no owner to call Unlock().
+			if atomic.LoadUint64(&l.activeTicket) == ticket {
+				l.ticketGuard.Unlock()
+				return ticket
+			}
+			heap.Push(l.canceledTickets, ticket)
+			l.ticketGuard.Unlock()

Copilot · 2026-02-12T10:55:19Z

internal/shared/ticketlock.go

+	for {
+		ticket := atomic.AddUint64(&l.activeTicket, 1)
+		nextCanceledTicket, hasCanceledTickets := l.canceledTickets.Peek()
+
+		switch {
+		// No canceled tickets, we can return
+		case !hasCanceledTickets:
+			return
+
+		// The last canceled ticket is the same as the current ticket.
+		// We need to try again with the next ticket (which might also be
+		// canceled).
+		case nextCanceledTicket == ticket:
+			heap.Pop(l.canceledTickets)
+
+		// There are canceled tickets, but the current ticket is smaller than
+		// the first canceled ticket.
+		default:
+			return


If canceledTickets ever contains an entry less than the current ticket (e.g., due to the cancellation race described above), the default branch returns and the stale smallest canceled ticket will remain at the heap root forever. That prevents future cancellations from being skipped when their ticket number is reached. A robust fix is to pop and discard any canceled tickets < ticket (stale) before comparing for equality, then continue skipping while nextCanceledTicket == ticket.

Suggested change

-	for {
-		ticket := atomic.AddUint64(&l.activeTicket, 1)
-		nextCanceledTicket, hasCanceledTickets := l.canceledTickets.Peek()
-		switch {
-		// No canceled tickets, we can return
-		case !hasCanceledTickets:
-			return
-		// The last canceled ticket is the same as the current ticket.
-		// We need to try again with the next ticket (which might also be
-		// canceled).
-		case nextCanceledTicket == ticket:
-			heap.Pop(l.canceledTickets)
-		// There are canceled tickets, but the current ticket is smaller than
-		// the first canceled ticket.
-		default:
-			return
+outer:
+	for {
+		ticket := atomic.AddUint64(&l.activeTicket, 1)
+		for {
+			nextCanceledTicket, hasCanceledTickets := l.canceledTickets.Peek()
+			switch {
+			// No canceled tickets, we can return
+			case !hasCanceledTickets:
+				return
+			// Discard any stale canceled tickets that are less than the current ticket.
+			case nextCanceledTicket < ticket:
+				heap.Pop(l.canceledTickets)
+				continue
+			// The next canceled ticket is the same as the current ticket.
+			// Skip this ticket and try again with the next one (which might also be canceled).
+			case nextCanceledTicket == ticket:
+				heap.Pop(l.canceledTickets)
+				continue outer
+			// There are canceled tickets, but the current ticket is smaller than
+			// the first (smallest) canceled ticket.
+			default:
+				return
+			}

Copilot · 2026-02-12T10:55:19Z

internal/shared/ticketlock.go

+func NewTicketLock(granularity time.Duration) *TicketLock {
+	return &TicketLock{
+		nextTicket:      1,
+		activeTicket:    1,
+		pauseDuration:   granularity,
+		canceledTickets: &HeapUint64{},
+		ticketGuard:     &sync.Mutex{},
+	}
+}


LockWithContext uses time.NewTicker(l.pauseDuration), which panics if pauseDuration <= 0. Since NewTicketLock is a general-purpose constructor, it should defensively validate granularity (e.g., clamp to a minimum positive duration or return a well-defined behavior) to avoid runtime panics.

Copilot · 2026-02-12T10:55:19Z

internal/shared/ticketlock_test.go

+				for {
+					ctx, cancel := context.WithTimeout(context.Background(), 10*time.Millisecond)
+					defer cancel()
+
+					ticket := lock.LockWithContext(ctx)
+					time.Sleep(5 * time.Millisecond)
+					if ticket != 0 {
+						order = append(order, ticket)
+						lock.Unlock()
+						return
+					}
+				}


defer cancel() is inside a potentially long-running loop; defers will accumulate until the goroutine returns, causing unnecessary resource retention. Call cancel() explicitly at the end of each iteration (after LockWithContext returns) instead of deferring it.

arnecls added 3 commits February 9, 2026 14:32

add a ticketlock

f889fda

fix: wait for parallel requests

3f68477

refetch after lock

4efa6f3

arnecls requested review from a team and mzjulian as code owners February 10, 2026 11:17

arnecls self-assigned this Feb 10, 2026

github-actions bot added 🐞 bug and 🔀 metadata-server and removed 🐞 bug labels Feb 10, 2026

arnecls added 🐞 bug and removed 🔀 metadata-server labels Feb 10, 2026

add retry-after header when timing out

ba30e1d

github-actions bot added the 🔀 metadata-server label Feb 10, 2026

explain granularity

0000f96

qgoulart requested a review from Copilot February 10, 2026 11:24

Copilot AI reviewed Feb 10, 2026


qgoulart reviewed Feb 10, 2026


cmd/metadata-server/tokencache.go (outdated, resolved)

arnecls and others added 2 commits February 10, 2026 15:53

Apply suggestions from code review

27d7dcb

Co-authored-by: Copilot <[email protected]> Signed-off-by: Arne Claus <[email protected]>

improve lock handling

4f6f962

qgoulart requested review from Copilot and qgoulart February 11, 2026 10:40

Copilot AI reviewed Feb 11, 2026


arnecls requested a review from Copilot February 11, 2026 13:40

Copilot AI reviewed Feb 11, 2026


arnecls added 3 commits February 11, 2026 14:45

make sure lock returns 0 after Close() has been called

eda3a2a

fix pause ticker contention

915174d

fix several issues on the use of inflight locks

8595e69

arnecls requested a review from Copilot February 11, 2026 13:53

Copilot AI reviewed Feb 11, 2026


fix/improve some outdated documentation

50497a1

arnecls and others added 2 commits February 11, 2026 15:10

improve variable naming for clarity

abc83fb

Update internal/shared/ticketlock_test.go

ad763d4

Co-authored-by: Copilot <[email protected]> Signed-off-by: Arne Claus <[email protected]>

qgoulart previously approved these changes Feb 11, 2026


add concurrency test for ticket locks

cd88bcb

arnecls dismissed qgoulart’s stale review via cd88bcb February 11, 2026 15:50

arnecls added 2 commits February 11, 2026 16:59

add concurrent cancel test

27cf679

add lock behavior test

4211718

arnecls requested review from Copilot and qgoulart February 11, 2026 16:15

Copilot AI reviewed Feb 11, 2026


internal/shared/ticketlock_test.go (resolved)

internal/shared/ticketlock_test.go (resolved)

arnecls requested a review from Copilot February 12, 2026 10:52

Copilot AI reviewed Feb 12, 2026

