-
Notifications
You must be signed in to change notification settings - Fork 0
fix: PDX-2459 avoid parallel cache-misses on the same token #17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Introduces a per-token FIFO ticket lock to prevent parallel cache misses for the same token, reducing thundering-herd behavior and downstream rate limiting.
Changes:
- Added a
TicketLock(FIFO lock with context-aware cancellation) and auint64min-heap for tracking canceled tickets. - Updated access token handler to serialize same-token cache-miss fetches and return
429on lock acquisition timeout. - Extended token cache with a per-token inflight lock map.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| internal/shared/ticketlock.go | Adds FIFO ticket lock with canceled-ticket tracking to avoid parallel cache misses. |
| internal/shared/intheap.go | Adds HeapUint64 to support min-heap operations for canceled tickets. |
| internal/shared/ticketlock_test.go | Adds a basic unit test for ticket allocation, cancellation, and unlock behavior. |
| cmd/metadata-server/tokencache.go | Adds inflight lock map and accessor to retrieve per-token locks. |
| cmd/metadata-server/tokenhandlers.go | Uses per-token lock on cache miss and returns 429 when lock can’t be acquired. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot <[email protected]> Signed-off-by: Arne Claus <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot <[email protected]> Signed-off-by: Arne Claus <[email protected]>
Merging this branch will not change overall coverage
Coverage by fileChanged files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code. Changed unit test files
|
2 similar comments
Merging this branch will not change overall coverage
Coverage by fileChanged files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code. Changed unit test files
|
Merging this branch will not change overall coverage
Coverage by fileChanged files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code. Changed unit test files
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| defer l.ticketGuard.Unlock() | ||
| heap.Push(l.canceledTickets, ticket) |
Copilot
AI
Feb 12, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cancellation can deadlock the lock: there’s a race where activeTicket can become ticket right after the activeTicket == ticket check but before this ctx.Done() branch executes. In that case, returning 0 without advancing activeTicket leaves the ticket ‘active’ with no owner to call Unlock(). Consider handling ctx.Done() by re-checking activeTicket while holding ticketGuard and either (a) treating it as acquired if activeTicket == ticket, or (b) if aborting while activeTicket == ticket, advancing activeTicket (and skipping any consecutive canceled tickets) so other waiters can make progress.
| defer l.ticketGuard.Unlock() | |
| heap.Push(l.canceledTickets, ticket) | |
| // Re-check if this ticket has become active while we were waiting | |
| // for the context to be canceled. If so, treat the lock as acquired | |
| // to avoid leaving an active ticket with no owner to call Unlock(). | |
| if atomic.LoadUint64(&l.activeTicket) == ticket { | |
| l.ticketGuard.Unlock() | |
| return ticket | |
| } | |
| heap.Push(l.canceledTickets, ticket) | |
| l.ticketGuard.Unlock() |
| for { | ||
| ticket := atomic.AddUint64(&l.activeTicket, 1) | ||
| nextCanceledTicket, hasCanceledTickets := l.canceledTickets.Peek() | ||
|
|
||
| switch { | ||
| // No canceled tickets, we can return | ||
| case !hasCanceledTickets: | ||
| return | ||
|
|
||
| // The last canceled ticket is the same as the current ticket. | ||
| // We need to try again with the next ticket (which might also be | ||
| // canceled). | ||
| case nextCanceledTicket == ticket: | ||
| heap.Pop(l.canceledTickets) | ||
|
|
||
| // There are canceled tickets, but the current ticket is smaller than | ||
| // the first canceled ticket. | ||
| default: | ||
| return |
Copilot
AI
Feb 12, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If canceledTickets ever contains an entry less than the current ticket (e.g., due to the cancellation race described above), the default branch returns and the stale smallest canceled ticket will remain at the heap root forever. That prevents future cancellations from being skipped when their ticket number is reached. A robust fix is to pop and discard any canceled tickets < ticket (stale) before comparing for equality, then continue skipping while nextCanceledTicket == ticket.
| for { | |
| ticket := atomic.AddUint64(&l.activeTicket, 1) | |
| nextCanceledTicket, hasCanceledTickets := l.canceledTickets.Peek() | |
| switch { | |
| // No canceled tickets, we can return | |
| case !hasCanceledTickets: | |
| return | |
| // The last canceled ticket is the same as the current ticket. | |
| // We need to try again with the next ticket (which might also be | |
| // canceled). | |
| case nextCanceledTicket == ticket: | |
| heap.Pop(l.canceledTickets) | |
| // There are canceled tickets, but the current ticket is smaller than | |
| // the first canceled ticket. | |
| default: | |
| return | |
| outer: | |
| for { | |
| ticket := atomic.AddUint64(&l.activeTicket, 1) | |
| for { | |
| nextCanceledTicket, hasCanceledTickets := l.canceledTickets.Peek() | |
| switch { | |
| // No canceled tickets, we can return | |
| case !hasCanceledTickets: | |
| return | |
| // Discard any stale canceled tickets that are less than the current ticket. | |
| case nextCanceledTicket < ticket: | |
| heap.Pop(l.canceledTickets) | |
| continue | |
| // The next canceled ticket is the same as the current ticket. | |
| // Skip this ticket and try again with the next one (which might also be canceled). | |
| case nextCanceledTicket == ticket: | |
| heap.Pop(l.canceledTickets) | |
| continue outer | |
| // There are canceled tickets, but the current ticket is smaller than | |
| // the first (smallest) canceled ticket. | |
| default: | |
| return | |
| } |
| func NewTicketLock(granularity time.Duration) *TicketLock { | ||
| return &TicketLock{ | ||
| nextTicket: 1, | ||
| activeTicket: 1, | ||
| pauseDuration: granularity, | ||
| canceledTickets: &HeapUint64{}, | ||
| ticketGuard: &sync.Mutex{}, | ||
| } | ||
| } |
Copilot
AI
Feb 12, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LockWithContext uses time.NewTicker(l.pauseDuration), which panics if pauseDuration <= 0. Since NewTicketLock is a general-purpose constructor, it should defensively validate granularity (e.g., clamp to a minimum positive duration or return a well-defined behavior) to avoid runtime panics.
| for { | ||
| ctx, cancel := context.WithTimeout(context.Background(), 10*time.Millisecond) | ||
| defer cancel() | ||
|
|
||
| ticket := lock.LockWithContext(ctx) | ||
| time.Sleep(5 * time.Millisecond) | ||
| if ticket != 0 { | ||
| order = append(order, ticket) | ||
| lock.Unlock() | ||
| return | ||
| } | ||
| } |
Copilot
AI
Feb 12, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
defer cancel() is inside a potentially long-running loop; defers will accumulate until the goroutine returns, causing unnecessary resource retention. Call cancel() explicitly at the end of each iteration (after LockWithContext returns) instead of deferring it.
When a lot of requests for the same token come in at the same time, hitting a cold cache, each request will trigget a cache miss.
In severe cases, this can lead to rate limiting.
This change introduces a FIFO, lock based queue based on the token ID to tackle this problem. The locks may time out based on request context and return a 429 (overloaded, come back later).