66 changes: 66 additions & 0 deletions docs/msi_v2/caching_strategy.md
@@ -0,0 +1,66 @@
# Managed Identity v2 (Attested TB) — Resilience & Caching Plan

## TL;DR
We reduce cold-start latency and dependency risk for MSI v2 by caching safe, long-lived artifacts, coordinating renewal across processes, and keeping the hot path in memory. **MAA is used only to (re)issue the binding certificate**; bound AT acquisition relies on that cert. Result: fewer failures, less churn, smoother CX.

> **Contributor:** What’s the fallback if the binding cert is lost or corrupted? Is there any emergency recovery path?


---

## Problem
- Cold starts/reboots trigger extra external calls (MAA → IMDS → eSTS).
- OS certificate store I/O can contend under load.
- Multiple processes may race to re-issue binding certificates.
- We want resilience to **MAA** issues and predictable **cert renewal**.

---

## Solution (What’s Changing)
1. **Probe once** (link-local) to detect **MSI v2** → cache result **in-proc**.
   > **Member:** What is "link-local"?

2. Treat the **binding certificate** (from IMDS `/issuecredential`) as the **primary anchor** (~7-day validity); use it to get ATs.
   > **Member:** What does "treat as primary anchor" mean? Please use more precise wording.
   >
   > **Member:** We should not hardcode any expirations. We rely on services returning expirations.

3. **Proactive renewal at half-life (+ small jitter)** to rotate well before expiry.
   > **Member:** Please be precise. Specify:
   >
   > - the jitter (e.g. 5 min)
   > - whether renewal should happen on a front-end or back-end thread. I think front-end.
   >
   > **Contributor:** How is jitter calculated? Is it randomized per host/process or globally coordinated? Could jitter introduce any unintended renewal delays?

4. **Single-writer coordination** so only one process issues/renews; others reuse the same cert.
   > **Member:** If you want cross-process coordination, please specify the IPC that is going to be used. This needs to exist on Windows and Linux and it needs to be available in sanctioned libraries across all supported MSAL languages.
   >
   > **Contributor:** How is the ‘single-writer’ selected? Is this a file lock, named mutex, or other mechanism? What happens if the single-writer crashes mid-renewal?

5. The **MAA token** is used **only** for issuance/renewal; a short-lived cache prevents repeated attestation calls.
   > **Member:** Not enough precision. What does "short-lived" cache mean?
   >
   > **Contributor:** What’s the cache invalidation logic if a policy or key rotation occurs on the MAA side? Is there a way to force re-attestation?
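
The half-life renewal in step 3 can be sketched as below. This is a minimal illustration, not the MSAL implementation: the function names are hypothetical, the validity window is assumed to come from the `notBefore`/`notAfter` values returned by the service (nothing is hardcoded), and the 5-minute jitter cap and the per-process random draw are assumptions taken from the review discussion.

```python
import random
from datetime import datetime, timedelta, timezone


def renewal_time(not_before, not_after, max_jitter=timedelta(minutes=5), rng=random):
    """Return when renewal should start: validity midpoint plus random jitter.

    not_before / not_after are taken from the issued certificate, so no
    lifetime is hardcoded. max_jitter is an assumed, configurable cap.
    """
    if not_after <= not_before:
        raise ValueError("not_after must be later than not_before")
    half_life = not_before + (not_after - not_before) / 2
    # Jitter is drawn independently per process (assumption), so processes
    # that share a certificate do not all wake up at the same instant.
    jitter = timedelta(seconds=rng.uniform(0, max_jitter.total_seconds()))
    # Jitter only ever delays renewal; clamp so it never passes expiry.
    return min(half_life + jitter, not_after)


def needs_renewal(now, renew_at):
    """True once the scheduled renewal instant has been reached."""
    return now >= renew_at
```

For a certificate valid for 7 days, `renewal_time` returns an instant between 3.5 days and 3.5 days plus 5 minutes after `not_before`.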


---

## Call Sequence (cold start)
```
Call 0 (local):    Probe IMDS v2 → cache MSI source (V2/V1)
Call 1 (local):    Create KeyGuard key (per reboot)
Call 2 (external): Get MAA token // only for (re)issuing cert
Call 3 (local):    IMDS /issuecredential → binding cert + metadata
Call 4 (external): eSTS-R → bound AT (mtls_pop/bearer) using client mTLS
Call 5 (external): Call resource with bound AT + client mTLS
```

> **Member (on call 2):** What is local and what is external?
>
> **Contributor (on call 2):** Are there retries or backoff strategies if the MAA call fails? Is exponential backoff used or is it a fixed retry policy?
>
> **Contributor (on call 4):** If eSTS rejects the AT request due to cert issues (e.g., early expiry, clock skew), does the system automatically trigger re-issuance?
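
The ordering of calls 0 to 4 can be sketched as follows, with each call injected as a callable so that no real endpoint is contacted. Everything here is hypothetical: the `Steps` container, the function names, and the string stand-ins for keys, certs, and tokens are illustrative only. The sketch also assumes that `create_key` returns the existing handle when one already exists for the current boot, and that a still-valid cached binding cert lets calls 2 and 3 be skipped.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Steps:
    """Hypothetical callables standing in for calls 0 to 4."""
    probe_imds_v2: Callable[[], bool]                 # call 0
    create_key: Callable[[], str]                     # call 1
    get_maa_token: Callable[[str], str]               # call 2
    issue_credential: Callable[[str, str], str]       # call 3
    get_bound_token: Callable[[str, str, str], str]   # call 4


def cold_start(steps: Steps, audience: str,
               cached_cert: Optional[str] = None) -> Tuple[str, str]:
    """Return (binding_cert, access_token) for one audience."""
    if not steps.probe_imds_v2():
        raise RuntimeError("MSI v2 not detected")
    key = steps.create_key()
    cert = cached_cert
    if cert is None:
        # MAA is contacted only when a binding cert must be (re)issued.
        maa_token = steps.get_maa_token(key)
        cert = steps.issue_credential(key, maa_token)
    token = steps.get_bound_token(key, cert, audience)
    return cert, token
```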
---

## Cache & Renewal Matrix

| Item | Scope | Where | TTL | Notes |
|---|---|---|---|---|
| **MSI v2 probe result** | Per process | In-proc static | Process lifetime | No changes needed here |
| **MAA token** | Per **keyHandle** | Small file cache | ≤ JWT `exp` (~8h) | Only for cert issuance; evict on reboot/policy change/attest fail; refresh at half-life + jitter |
| **Binding cert + `/issuecredential` metadata** | Per **Managed Identity per user context** | Persisted (Win: `CurrentUser\My`; Linux: protected file/PEM) | ~7 days | Renew at **half-life + jitter**; serialize issuance |
| **Access tokens (`bearer` or `mtls_pop`)** | Per audience | In memory | Service-configured | Reacquire after reboot (new key) |

> **Member (on the MAA token row):** How are you going to deal with atomicity, multiple file writers, and a process that gets killed in the middle of a write?
>
> **Contributor (on the MAA token row):** How are policy changes detected? Is it polled, pushed, or inferred from failures?
>
> **Contributor (on the binding cert row):** What protects against file corruption or unauthorized access on Linux? Is there a fallback if the file is deleted outside of MSAL?
>
> **Contributor (on the access tokens row):** Are there scenarios where token invalidation lags behind key rotation? How does the system ensure that stale tokens aren’t accidentally reused?
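
One common way to keep a small file cache consistent when a process dies mid-write is to write to a temporary file in the same directory and then rename it over the target. The sketch below illustrates that technique only; the plan does not state that MSAL uses it, and the JSON payload, file naming, and function names are assumptions.

```python
import json
import os
import tempfile


def write_cache_atomically(path, payload):
    """Write payload as JSON so readers see either the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the target directory so the final rename stays
    # on one filesystem. mkstemp creates it readable by the owner only.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-", suffix=".json")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(payload, f)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before the rename
        os.replace(tmp_path, path)  # replaces the target in a single step
    except BaseException:
        # A failed write leaves the previous cache file untouched.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise


def read_cache(path):
    """Return the cached payload, or None if it is missing or unreadable."""
    try:
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return None
```

Treating a missing or corrupt file as a cache miss means the caller falls back to re-acquiring the item rather than failing.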


---

## Invalidation Rules
- **Reboot** → Use **persisted binding cert** to fetch new ATs; re-attest on first demand on service failure.
- **Cert expiry** → re-issue.
- **MAA token expired** → re-attest and re-issue.

> **Contributor:** Are there built-in safeguards to prevent a thundering herd if all processes notice expiry at the same time?


---

## Security
- Keys are **non-exportable** in **KeyGuard**; MSAL stores **handles**, not private keys.
- Persisted items are **user-scoped** and protected (DPAPI on Windows; restricted file perms/OS keyring on Linux).

---

## Why This Improves CX
- **MAA is out of the hot path**—steady-state calls rely on a **multi-day binding cert**.
- **Different identities on the same VM** reuse the **cached MAA token**.
  > **Contributor:** Is the cache not keyed per identity?

- **No thundering herd**—single process renews certificate; others reuse.
- **Predictable renewals**—half-life + jitter prevents synchronized spikes.
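
The plan does not name the coordination mechanism behind "single process renews". As one possible illustration, the sketch below uses an exclusively created lock file, which is available through the same call on both Windows and Linux. The function name and lock path are hypothetical, and stale-lock recovery after a crash mid-renewal is deliberately left out: a real design would need a lease, timeout, or owner check.

```python
import os
from contextlib import contextmanager


@contextmanager
def try_single_writer(lock_path):
    """Yield True if this process may renew, False if another one already is."""
    try:
        # O_CREAT | O_EXCL fails if the file already exists, so at most one
        # process can create the lock file.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        yield False  # another process is renewing; reuse its cert afterwards
        return
    try:
        os.close(fd)
        yield True
    finally:
        # Release the lock. If the process is killed before this runs, the
        # lock file remains and must be cleaned up by some other means.
        os.unlink(lock_path)
```

A caller would wrap issuance in `with try_single_writer(path) as is_writer:` and, when `is_writer` is false, wait and re-read the persisted certificate instead of calling `/issuecredential` itself.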

---