fix(api): update InternalRBACRules SPIFFE identifiers to nico-* prefix #1907
- Production spiffe_service_base_paths is ["/nico-system/sa/", "/default/sa/"] (see helm/charts/nico-api/files/carbide-api-config.toml:64-66, dev/deployment/devspace/values.base.yaml:69, helm/examples/values-full.yaml:324-326). There's no /carbide-system/sa/ in the list.
That means a fully-stale carbide-era cert (spiffe://carbide.local/carbide-system/sa/carbide-dns) would fail at the trust-domain check or base-path strip in extract_service_identifier (crates/authn/src/lib.rs:144-183), long before reaching this RBAC matcher.
- Stale fixtures in crates/api/src/auth/test_certs.rs and crates/api/src/auth.rs (nit, follow-up). Test fixtures still hardcode URI:spiffe://example.test/carbide-system/sa/carbide-dhcp (test_certs.rs:60) and service_base_paths: ["/carbide-system/sa/", ...] (auth.rs:286).
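The two checks mentioned above (trust domain, then base-path strip) can be sketched as follows. This is a minimal illustration, not the actual `extract_service_identifier` from crates/authn/src/lib.rs: the signature, the `Option<String>` return type, and the rejection of nested paths are all assumptions. The trust domain `nico.local` and the base paths are taken from the discussion in this thread.

```rust
/// Hypothetical sketch of SPIFFE URI handling; the real function may differ.
/// Returns the service identifier only if the trust domain matches and the
/// path starts with one of the configured base paths.
fn extract_service_identifier(
    uri: &str,
    trust_domain: &str,
    service_base_paths: &[&str],
) -> Option<String> {
    // A SPIFFE ID has the shape spiffe://<trust-domain>/<path>.
    let rest = uri.strip_prefix("spiffe://")?;
    let (domain, path) = rest.split_once('/')?;

    // Trust-domain check: a carbide.local cert stops here.
    if domain != trust_domain {
        return None;
    }

    // Base-path strip: the first configured prefix that matches wins.
    let path = format!("/{path}");
    let name = service_base_paths
        .iter()
        .find_map(|base| path.strip_prefix(*base))?;

    // Assumption: the identifier is a single, non-empty path segment.
    if name.is_empty() || name.contains('/') {
        return None;
    }
    Some(name.to_string())
}

fn main() {
    let bases = ["/nico-system/sa/", "/default/sa/"];
    let fresh = "spiffe://nico.local/nico-system/sa/nico-dns";
    let stale = "spiffe://carbide.local/carbide-system/sa/carbide-dns";
    println!("{:?}", extract_service_identifier(fresh, "nico.local", &bases));
    println!("{:?}", extract_service_identifier(stale, "nico.local", &bases));
}
```

Under these assumptions a fully stale carbide-era URI is rejected before any RBAC rule is consulted, which is why the matcher only has to deal with the service-name portion.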
After the carbide → nico platform rename, all newly deployed services
present SPIFFE identifiers with the nico-* prefix, but InternalRBACRules
in crates/api/src/auth/internal_rbac_rules.rs still matched against
hardcoded carbide-* strings. Every internal service-to-api gRPC call
failed mTLS authorization with HTTP 403, silently breaking all
service-to-service communication.
Switch RuleInfo::new from `map(|x| -> Principal)` to
`flat_map(|x| -> Vec<Principal>)` so each rule can accept multiple
SPIFFE identifiers, then have each renamed variant emit both the new
nico-* and the legacy carbide-* identifier via an svc_compat helper:
  Dns                 -> nico-dns, carbide-dns
  Dhcp                -> nico-dhcp, carbide-dhcp
  Ssh                 -> nico-ssh-console, carbide-ssh-console
  SshRs               -> nico-ssh-console-rs, carbide-ssh-console-rs
  Pxe                 -> nico-pxe, carbide-pxe
  BmcProxy            -> nico-bmc-proxy, carbide-bmc-proxy
  Health              -> nico-hardware-health, carbide-hardware-health
  Flow                -> nico-flow, carbide-flow
  MaintenanceJobs     -> nico-maintenance-jobs, carbide-maintenance-jobs
  DsxExchangeConsumer -> nico-dsx-exchange-consumer, carbide-dsx-exchange-consumer
The `allowed()` matcher already walks this Vec with `.any(...)`, so the
change is transparent to callers: deployed sites still presenting a
carbide-* cert continue to authorize, and freshly deployed sites with
nico-* certs work too. The carbide-* aliases should be dropped once
every site has rotated to a nico-* cert.
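The map-to-flat_map change described above can be sketched roughly as below. The types here are simplified stand-ins, not the real definitions in internal_rbac_rules.rs: `RulePrincipal`, the shape of `RuleInfo`, and the exact `svc_compat` signature are assumptions made for illustration. Only the names `RuleInfo::new`, `svc_compat`, `allowed()`, and `SpiffeServiceIdentifier` come from the commit message.

```rust
// Simplified, hypothetical stand-ins for the real RBAC types.
#[derive(Debug, Clone, PartialEq, Eq)]
enum Principal {
    SpiffeServiceIdentifier(String),
}

#[derive(Debug, Clone, Copy)]
enum RulePrincipal {
    Dns,
    Dhcp,
}

/// Emits both the new nico-* and the legacy carbide-* identifier
/// for one service-name suffix (assumed signature).
fn svc_compat(suffix: &str) -> Vec<Principal> {
    vec![
        Principal::SpiffeServiceIdentifier(format!("nico-{suffix}")),
        Principal::SpiffeServiceIdentifier(format!("carbide-{suffix}")),
    ]
}

struct RuleInfo {
    principals: Vec<Principal>,
}

impl RuleInfo {
    fn new(allowed: &[RulePrincipal]) -> Self {
        // flat_map lets one rule variant expand to several principals;
        // with plain map each variant could yield exactly one.
        let principals = allowed
            .iter()
            .flat_map(|p| match p {
                RulePrincipal::Dns => svc_compat("dns"),
                RulePrincipal::Dhcp => svc_compat("dhcp"),
            })
            .collect();
        RuleInfo { principals }
    }

    /// Unchanged matcher: any listed principal authorizes the call.
    fn allowed(&self, presented: &Principal) -> bool {
        self.principals.iter().any(|p| p == presented)
    }
}

fn main() {
    let dns_rule = RuleInfo::new(&[RulePrincipal::Dns]);
    let dhcp_rule = RuleInfo::new(&[RulePrincipal::Dhcp]);
    let nico = Principal::SpiffeServiceIdentifier("nico-dns".to_string());
    let legacy = Principal::SpiffeServiceIdentifier("carbide-dns".to_string());
    assert!(dns_rule.allowed(&nico));
    assert!(dns_rule.allowed(&legacy));
    assert!(!dhcp_rule.allowed(&nico));
}
```

Because the expansion happens once when the rule is built, dropping the carbide-* aliases later would only touch the helper, not the matcher or its callers.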
Failure mode before this fix: inbound gRPC from e.g. nico-dns to nico-api
surfaced as
WARN auth::internal_rbac_rules — principal SpiffeServiceIdentifier("nico-dns")
not authorized for method LookupRecordLegacy — no matching rule
with no TLS-level error, masking the root cause. Impact spanned DNS
resolution, DHCP lease lookups, PXE GetCloudInitInstructions, SSH console
access, hardware health reporting, and maintenance job scheduling — every
internal principal that authenticates via SpiffeServiceIdentifier.
Follow-up (not in this PR): these identifiers are stringly-typed with no
compile-time link to the actual deployed service names. Worth deriving
them from a shared constant or asserting consistency in an integration
test that round-trips each principal through cert subject + RBAC lookup.
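One possible shape for that follow-up, sketched under stated assumptions: a single shared list of service-name suffixes feeds both the RBAC identifiers and the cert URI builder, and a check round-trips every deployed name through both. All function and constant names below are hypothetical. The suffixes are the ones listed in this commit message, and the trust domain and base path are assumed from the review discussion.

```rust
// Hypothetical shared source of truth for service-name suffixes.
const SERVICE_SUFFIXES: [&str; 10] = [
    "dns", "dhcp", "ssh-console", "ssh-console-rs", "pxe", "bmc-proxy",
    "hardware-health", "flow", "maintenance-jobs", "dsx-exchange-consumer",
];

// Assumed deployment values, for illustration only.
const TRUST_DOMAIN: &str = "nico.local";
const BASE_PATH: &str = "/nico-system/sa/";

/// Identifiers the RBAC side accepts, derived from the shared list.
fn rbac_identifiers() -> Vec<String> {
    SERVICE_SUFFIXES
        .iter()
        .flat_map(|s| [format!("nico-{s}"), format!("carbide-{s}")])
        .collect()
}

/// SPIFFE URI a deployed service would present in its cert.
fn cert_uri(service_name: &str) -> String {
    format!("spiffe://{}{}{}", TRUST_DOMAIN, BASE_PATH, service_name)
}

/// Recovers the service identifier from a URI, or None if it does not
/// match the assumed trust domain and base path.
fn identifier_from_uri(uri: &str) -> Option<&str> {
    uri.strip_prefix("spiffe://")?
        .strip_prefix(TRUST_DOMAIN)?
        .strip_prefix(BASE_PATH)
}

/// Round-trips every deployed name and returns those RBAC would reject.
fn unauthorized_services() -> Vec<String> {
    let accepted = rbac_identifiers();
    SERVICE_SUFFIXES
        .iter()
        .map(|s| format!("nico-{s}"))
        .filter(|name| {
            let uri = cert_uri(name);
            match identifier_from_uri(&uri) {
                Some(id) => !accepted.iter().any(|a| a.as_str() == id),
                None => true,
            }
        })
        .collect()
}

fn main() {
    // An empty list means every deployed name is accepted by RBAC.
    assert!(unauthorized_services().is_empty());
}
```

In this self-contained form the check passes by construction, since both sides read the same constant. The value of such a test in the real codebase would come from reading the deployed names from the helm values rather than from the same list.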
Fixes NVIDIA#1891
@kunzhao-nv you're right, this fails at the trust-domain check (carbide.local ≠ nico.local) before reaching the matcher. This PR doesn't claim to support that case. A site running fully-stale certs would also be running the fully-stale carbide-api binary against them. Trust domain and base paths are deployment config (TOML, settable per deployment); only the RBAC matcher principal strings are hardcoded in the Rust source, so only those need backward compatibility in code. Adding /forge-system/sa/ or forge.local to the helm config would just be dead config in helm deployments (helm-deployed services never present those).
Description
After the carbide → nico platform rename, all newly deployed services present SPIFFE identifiers with the nico-* prefix, but InternalRBACRules in crates/api/src/auth/internal_rbac_rules.rs still matched against hardcoded carbide-* strings. Every internal service-to-api gRPC call failed mTLS authorization with HTTP 403, silently breaking all service-to-service communication.
Switch RuleInfo::new from `map(|x| -> Principal)` to `flat_map(|x| -> Vec<Principal>)` so each rule can accept multiple SPIFFE identifiers, then have each renamed variant emit both the new nico-* and the legacy carbide-* identifier via an svc_compat helper.
Failure mode before this fix: inbound gRPC from e.g. nico-dns to nico-api surfaced as a `principal SpiffeServiceIdentifier("nico-dns") not authorized for method LookupRecordLegacy` warning with no TLS-level error, masking the root cause.
Fixes #1891
Type of Change
Related Issues (Optional)
Fixes #1891
Breaking Changes
Testing
Verified deployed serviceNames in helm/charts/nico-*/values.yaml match the updated rule strings (nico-dns, nico-dhcp, nico-pxe, nico-bmc-proxy, nico-hardware-health, nico-ssh-console-rs, nico-dsx-exchange-consumer, nico-flow).
Additional Notes