[auth] Introduce token renewals system#714
[auth] Introduce token renewals system#714Fricounet wants to merge 11 commits intocontainerd:mainfrom
Conversation
Add a RenewableProvider interface with a CanRenew() bool method so that callers can determine at runtime whether a provider's credentials can be refreshed without user interaction. Implement it on DockerProvider, KubeletProvider and KubeSecretProvider (all return true); CRI and Labels providers are excluded. Add String() to all providers for structured log fields. Refactor GetRegistryKeyChain to iterate over a provider list built by a replaceable buildProviders var, enabling unit-test injection without global state manipulation.
Add CredentialRenewals (counter, by image_ref + result) and CredentialStoreEntries (TTL gauge, by image_ref) to track renewal health and store occupancy. Register both in the global Prometheus registry.
Add credentialStore: an in-memory map of image ref -> PassKeyChain with per-entry TTL-based expiration, protected by a sync.RWMutex. Add InitCredentialRenewal, which creates the global store, seeds it from a list of existing refs (used on snapshotter restart), and starts a background goroutine that renews all live entries at a configurable interval. GetRegistryKeyChain now serves from the store on a cache hit and writes back credentials obtained from a RenewableProvider when the store is active.
Add CredentialRenewalInterval (time.Duration, default 0 = disabled) to AuthConfig. When set to a positive value the snapshotter activates the credential renewal subsystem.
Call auth.InitCredentialRenewal after service recovery when CredentialRenewalInterval > 0. Seed the store from unique ImageIDs present in the recovered RAFS cache so credentials for running containers are renewed immediately after a snapshotter restart. Call auth.RemoveCredentials in Filesystem.Umount so the renewal goroutine stops refreshing tokens for images that are no longer mounted, preventing unbounded store growth.
Move the authentication section out of configure_nydus.md into a new docs/registry_authentication.md. Add documentation for the new credential_renewal_interval option and update the cross-reference in configure_nydus.md.
6b58922 to
6e1889d
Compare
Signed-off-by: Baptiste Girard-Carrabin <baptiste.girardcarrabin@datadoghq.com>
6e1889d to
50e895f
Compare
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #714 +/- ##
==========================================
+ Coverage 22.02% 22.57% +0.55%
==========================================
Files 130 131 +1
Lines 11931 12016 +85
==========================================
+ Hits 2628 2713 +85
+ Misses 8960 8957 -3
- Partials 343 346 +3
🚀 New features to boost your workflow:
|
|
Putting this back in draft because I'm considering a slight architectural change for this:
However, this approach leads to a few edge cases:
The alternative would be that on each renewal run, we get the up to date list of rafs and derive the images to renew from there. This ensure we renew all the creds currently in use. We still keep an internal list of refs in the renewal store. But now:
My main point of concern is how to know which provider to call GetCredentials on in that case:
|
Add renewableProviders alongside buildProviders. The renewal goroutine must not use CRI (caches credentials globally after the pull) or Labels (only available at pull time via snapshot labels). Restricting renewal to Docker, Kubelet, and KubeSecret ensures the correct providers are reached at renewal time regardless of what served credentials at pull time. Refactor buildProviders to compose renewableProviders, eliminating duplication. Extract fetchFromProviders from getRegistryKeyChainFromProviders so the renewal goroutine can call providers directly without going through the store cache check. getRegistryKeyChainFromProviders checks the store first, then delegates to fetchFromProviders; the renewal path calls fetchFromProviders directly to obtain fresh credentials.
Replace the store-driven eviction model (explicit RemoveCredentials on umount) with a reconciliation loop that derives the active set from the live RAFS instance list on every tick. This fixes two edge cases: - Multiple RAFS instances sharing the same image ref: the first umount no longer evicts the entry while other instances are still running. - Non-renewable providers (CRI, Labels) blocking renewable ones at renewal time: the renewal goroutine now uses renewableProviders() exclusively, so CRI's cached global state never interferes. On each tick, reconcile compares store entries against the live RAFS set. Entries in RAFS are renewed; entries absent from RAFS are evicted once their renewedAt age exceeds interval/2. The interval/2 grace period prevents eviction of entries added during an in-progress pull (the RAFS entry is created after the mount completes, so a tick arriving mid-pull would otherwise evict a valid entry). renewEntry is now a plain function that delegates to fetchFromProviders. The store write on success happens inside fetchFromProviders, so renewEntry only needs to record metrics. InitCredentialRenewal drops the existingRefs parameter; the initial reconcile pass seeds the store from the live RAFS cache directly.
Remove auth.RemoveCredentials from Filesystem.Umount; eviction is now handled by the reconciliation loop in the renewal goroutine. Remove the auth import from fs.go and the rafs import from snapshotter.go which are no longer needed. Simplify InitCredentialRenewal call site: no existingRefs slice to build, just pass the interval. Update the credential renewal section in docs/registry_authentication.md to reflect that renewal re-queries providers in priority order each tick rather than re-using the provider that originally issued the credentials.
|
I've made the changes mentioned there. I think this approach is more robust. PR is ready for reivew |
Overview
Please briefly describe the changes your pull request makes.
Introduce a background credential renewal subsystem for nydus-snapshotter. When enabled, a goroutine periodically reconciles the set of active RAFS instances against an in-memory credential store, renewing credentials for images currently in use.
Renewal is scoped to providers that can re-fetch credentials autonomously: Docker config, kubelet credential provider plugins, and Kubernetes docker config secrets. CRI-based and label-based credentials are excluded. CRI caches its state globally and would shadow renewable providers at renewal time; labels are only present during pull.
The feature is disabled by default. No behavior change unless credential_renewal_interval is set.
This change is not yet wired to push refreshed credentials into nydusd. That is the next step, pending dragonflyoss/nydus#1864.
Related Issues
Please link to the relevant issue. For example:
Fix #123orRelated #456.This is the next implementation step for #690.
Next steps are gonna be about wiring this renewal store to update nydusd creds as well once dragonflyoss/nydus#1864 is merged.
Change Details
Please describe your changes in detail:
Credential store and reconciliation loop (pkg/auth/renewal.go):
On each tick the goroutine:
There is a grace period of interval/2 grace period to a renewal tick from evicting entries that were just added during an in-progress pull (the RAFS entry is created after mount completes, not at first credential fetch).
Provider separation (pkg/auth/keychain.go):
2 prometheus metrics were added to follow the state of the renewal store:
snapshotter_credential_renewals_total{image_ref, result}: counter of renewal attempts by outcomesnapshotter_credential_store_entries{image_ref}: gauge of tracked entriesTest Results
If you have any relevant screenshots or videos that can help illustrate your changes, please add them here.
I've tested the feature in a live AWS environment with kubelet creds providers configured using the datadog-agent image:
Change Type
Please select the type of change your pull request relates to:
Self-Checklist
Before submitting a pull request, please ensure you have completed the following: