Document caching strategy for Managed Identity v2 #5526
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,66 @@ | ||
| # Managed Identity v2 (Attested TB) — Resilience & Caching Plan | ||
|
|
||
| ## TL;DR | ||
| We reduce cold-start latency and dependency risk for MSI v2 by caching safe, long-lived artifacts, coordinating renewal across processes, and keeping the hot path in memory. **MAA is used only to (re)issue the binding certificate**; acquiring a bound access token (AT) relies on that certificate. Result: fewer failures, less churn, and a smoother customer experience (CX). | ||
|
|
||
| --- | ||
|
|
||
| ## Problem | ||
| - Cold starts/reboots trigger extra external calls (MAA → IMDS → eSTS). | ||
| - OS certificate store I/O can contend under load. | ||
| - Multiple processes may race to re-issue binding certificates. | ||
| - We want resilience to **MAA** issues and predictable **cert renewal**. | ||
|
|
||
| --- | ||
|
|
||
| ## Solution (What’s Changing) | ||
| 1. **Probe once** (link-local) to detect **MSI v2** → cache result **in-proc**. | ||
|
Review comment: what is "link-local" ?
||
| 2. Treat the **binding certificate** (from IMDS `/issuecredential`) as the **primary anchor** (~7-day validity); use it to get ATs. | ||
|
Review comment: What does it mean "treat as primary anchor". Pls use more precise wording.
Review comment: We should not hardcode any expirations. We rely on services returning expirations.
||
| 3. **Proactive renewal at half-life (+ small jitter)** to rotate well before expiry. | ||
|
Review comment: Pls be precise. Specify:
Review comment: How is jitter calculated? Is it randomized per host/process or globally coordinated? Could jitter introduce any unintended renewal delays?
||
| 4. **Single-writer coordination** so only one process issues/renews; others reuse the same cert. | ||
|
Review comment: If you want cross-process coordination, please specify the IPC that is going to be used. This needs to exist on Windows and Linux and it needs to be available in sanctioned libraries across all supported MSAL languages.
Review comment: How is the ‘single-writer’ selected? Is this a file lock, named mutex, or other mechanism? What happens if the single-writer crashes mid-renewal?
||
| 5. **MAA token** is used **only** for issuance/renewal; a short-lived cache avoids repeated attestation calls. | ||
|
Review comment: Not enough precision. What does it mean "short-lived" cache?
Review comment: What’s the cache invalidation logic if a policy or key rotation occurs on the MAA side? Is there a way to force re-attestation?
||
|
|
||
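The "half-life + small jitter" rule in step 3 can be sketched as follows. This is a minimal illustration, not the plan's specified algorithm: the function name `renewal_time`, the 5% jitter cap, and per-process randomization are all assumptions. The validity window is read from the certificate itself rather than hardcoded, in line with the review feedback above.

```python
import random
from datetime import datetime, timedelta, timezone


def renewal_time(not_before, not_after, max_jitter_fraction=0.05, rng=None):
    """Return when renewal should start: half-life plus a bounded random delay.

    not_before / not_after come from the issued certificate (assumption:
    the caller has parsed them into timezone-aware datetimes).
    max_jitter_fraction is a hypothetical knob; the plan does not define it.
    """
    if not_after <= not_before:
        raise ValueError("not_after must be later than not_before")
    rng = rng or random.Random()  # per-process randomness (assumption)
    lifetime = not_after - not_before
    half_life = not_before + lifetime / 2
    # Jitter only ever delays renewal, and by at most
    # max_jitter_fraction of the lifetime, so it stays well before expiry.
    jitter = lifetime * max_jitter_fraction * rng.random()
    return half_life + jitter


# Example: a certificate that happens to be valid for 7 days.
issued = datetime(2025, 1, 1, tzinfo=timezone.utc)
expires = issued + timedelta(days=7)
renew_at = renewal_time(issued, expires)
```

With these assumed values, renewal starts between 3.5 days and roughly 3.85 days after issuance, which bounds the delay jitter can introduce.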
| --- | ||
|
|
||
| ## Call Sequence (cold start) | ||
| ``` | ||
| Call 0 (local): Probe IMDS v2 → cache MSI source (V2/V1) | ||
| 1 (local): Create KeyGuard key (per reboot) | ||
| 2 (external): Get MAA token // only for (re)issuing cert | ||
|
Review comment: what is local and what is external ?
Review comment: Are there retries or backoff strategies if the MAA call fails? Is exponential backoff used or is it a fixed retry policy?
||
| 3 (local): IMDS /issuecredential → binding cert + metadata | ||
| 4 (external): eSTS-R → bound AT (mtls_pop/bearer) using client mTLS | ||
|
Review comment: If eSTS rejects the AT request due to cert issues (e.g., early expiry, clock skew), does the system automatically trigger re-issuance?
||
| 5 (external): Call resource with bound AT + client mTLS | ||
| ``` | ||
| --- | ||
|
|
||
| ## Cache & Renewal Matrix | ||
|
|
||
| | Item | Scope | Where | TTL | Notes | | ||
| |---|---|---|---|---| | ||
| | **MSI v2 probe result** | Per process | In-proc static | Process lifetime | No changes needed here | | ||
| | **MAA token** | Per **keyHandle** | Small file cache | ≤ JWT `exp` (~8h) | Used only for cert issuance; evict on reboot, policy change, or attestation failure; refresh at half-life + jitter | | ||
|
Review comment: How are you going to deal with atomicity, multiple file writers, and a process that gets killed in the middle of a write?
Review comment: How are policy changes detected? Is it polled, pushed, or inferred from failures?
||
| | **Binding cert + `/issuecredential` metadata** | Per **Managed Identity per user context** | Persisted (Win: `CurrentUser\My`; Linux: protected file/PEM) | ~7 days | Renew at **half-life + jitter**; serialize issuance | | ||
|
Review comment: What protects against file corruption or unauthorized access on Linux? Is there a fallback if the file is deleted outside of MSAL?
||
| | **Access tokens (`bearer` or `mtls_pop`)** | Per audience | In memory | Service-configured | Reacquire after reboot (new key) | | ||
|
Review comment: Are there scenarios where token invalidation lags behind key rotation? How does the system ensure that stale tokens aren’t accidentally reused?
||
|
|
||
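For the file-backed rows in the matrix, one common way to survive a process being killed mid-write is to write to a temporary file and atomically rename it over the target. The sketch below illustrates that pattern only; the plan does not say this is how the MAA token cache is written, and the names `write_cache_atomically` and `read_cache` and the JSON layout are hypothetical.

```python
import json
import os
import tempfile


def write_cache_atomically(path, payload):
    """Persist payload so readers see either the old or the new file, never a partial one."""
    directory = os.path.dirname(os.path.abspath(path))
    # mkstemp creates the file with owner-only permissions on POSIX.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-", suffix=".json")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(payload, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # push bytes to disk before the rename
        # os.replace is an atomic rename on both Windows and POSIX
        # when source and destination are on the same volume.
        os.replace(tmp_path, path)
    except BaseException:
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise


def read_cache(path):
    """Return the cached dict, or None if the file is missing or unreadable."""
    try:
        with open(path, "r", encoding="utf-8") as handle:
            data = json.load(handle)
    except (FileNotFoundError, ValueError):
        # Missing or corrupt cache is treated as a miss; the caller re-acquires.
        return None
    return data if isinstance(data, dict) else None
```

Treating a corrupt or deleted file as a cache miss means the worst case is one extra attestation or issuance call rather than a hard failure.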
| --- | ||
|
|
||
| ## Invalidation Rules | ||
| - **Reboot** → use the **persisted binding cert** to fetch new ATs; re-attest only on demand, the first time a service call fails. | ||
| - **Cert expiry** → re-issue. | ||
| - **MAA token expired** → re-attest and re-issue. | ||
|
Review comment: Are there built-in safeguards to prevent a thundering herd if all processes notice expiry at the same time?
||
|
|
||
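The plan does not name the mechanism behind single-writer re-issuance. As one possible option only, the sketch below uses an OS advisory file lock (`fcntl.flock` on Linux, `msvcrt.locking` on Windows), which the OS releases if the holder crashes. The helper names `single_writer` and `renew_if_writer`, and the callbacks `load_cert`, `issue_cert`, and `needs_renewal`, are hypothetical.

```python
import os
from contextlib import contextmanager

try:
    import fcntl  # POSIX

    def _try_lock(fd):
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return True
        except OSError:
            return False

    def _unlock(fd):
        fcntl.flock(fd, fcntl.LOCK_UN)

except ImportError:
    import msvcrt  # Windows

    def _try_lock(fd):
        try:
            msvcrt.locking(fd, msvcrt.LK_NBLCK, 1)
            return True
        except OSError:
            return False

    def _unlock(fd):
        os.lseek(fd, 0, os.SEEK_SET)
        msvcrt.locking(fd, msvcrt.LK_UNLCK, 1)


@contextmanager
def single_writer(lock_path):
    """Yield True if this caller holds the renewal lock, False otherwise (non-blocking)."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o600)
    acquired = False
    try:
        acquired = _try_lock(fd)
        yield acquired
    finally:
        if acquired:
            _unlock(fd)
        os.close(fd)  # closing also drops the lock if the holder exits abnormally


def renew_if_writer(lock_path, load_cert, issue_cert, needs_renewal):
    """Re-issue only when holding the lock; otherwise reuse the persisted cert."""
    with single_writer(lock_path) as is_writer:
        if not is_writer:
            return load_cert()  # another process is renewing; reuse what exists
        cert = load_cert()
        # Re-check under the lock: a peer may have renewed just before us.
        if cert is not None and not needs_renewal(cert):
            return cert
        return issue_cert()
```

Re-checking the persisted certificate after acquiring the lock is what keeps processes that notice expiry at the same moment from each issuing their own certificate.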
| --- | ||
|
|
||
| ## Security | ||
| - Keys are **non-exportable** in **KeyGuard**; MSAL stores **handles**, not private keys. | ||
| - Persisted items are **user-scoped** and protected (DPAPI on Windows; restricted file perms/OS keyring on Linux). | ||
|
|
||
| --- | ||
|
|
||
| ## Why This Improves CX | ||
| - **MAA is out of the hot path**—steady-state calls rely on a **multi-day binding cert**. | ||
| - Different identities on the same VM reuse the **cached MAA token**. | ||
|
Review comment: Is the cache not keyed per identity?
||
| - **No thundering herd**—single process renews certificate; others reuse. | ||
| - **Predictable renewals**—half-life + jitter prevents synchronized spikes. | ||
|
|
||
| --- | ||
Review comment: What’s the fallback if the binding cert is lost or corrupted? Is there any emergency recovery path?