Skip to content

fix(api): update InternalRBACRules SPIFFE identifiers to nico-* prefix#1907

Merged
shayan1995 merged 1 commit into
NVIDIA:mainfrom
shayan1995:fix/internal-rbac-spiffe-identifiers
May 27, 2026
Merged

fix(api): update InternalRBACRules SPIFFE identifiers to nico-* prefix#1907
shayan1995 merged 1 commit into
NVIDIA:mainfrom
shayan1995:fix/internal-rbac-spiffe-identifiers

Conversation

@shayan1995
Copy link
Copy Markdown
Contributor

@shayan1995 shayan1995 commented May 22, 2026

Description

After the carbide → nico platform rename, all newly deployed services present SPIFFE identifiers with the nico-* prefix, but InternalRBACRules in crates/api/src/auth/internal_rbac_rules.rs still matched against hardcoded carbide-* strings. Every internal service-to-api gRPC call failed mTLS authorization with HTTP 403, silently breaking all service-to-service communication.

Switch RuleInfo::new from map(|x| -> Principal) to flat_map(|x| -> Vec<Principal>) so each rule can accept multiple acceptable SPIFFE identifiers, then have each renamed variant emit BOTH the new nico-* and the legacy carbide-* identifier via ansvc_compat helper

Failure mode before this fix: inbound gRPC from e.g. nico-dns to nico-apisurfaced as

     WARN auth::internal_rbac_rules — principal SpiffeServiceIdentifier("nico-dns")
          not authorized for method LookupRecordLegacy — no matching rule

Fixes #1891

Type of Change

  • Add
  • Change
  • Fix
  • Remove
  • Internal

Related Issues (Optional)

Fixes #1891

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required

Verified deployed serviceNames in helm/charts/nico-*/values.yaml match the updated rule strings (nico-dns, nico-dhcp, nico-pxe, nico-bmc-proxy, nico-hardware-health, nico-ssh-console-rs, nico-dsx-exchange-consumer, nico-flow).

Additional Notes

@shayan1995 shayan1995 requested a review from a team as a code owner May 22, 2026 21:49
@shayan1995 shayan1995 force-pushed the fix/internal-rbac-spiffe-identifiers branch 2 times, most recently from be0388e to 7a4dc89 Compare May 27, 2026 16:26
@shayan1995 shayan1995 enabled auto-merge (squash) May 27, 2026 16:41
Copy link
Copy Markdown
Contributor

@kunzhao-nv kunzhao-nv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Production spiffe_service_base_paths is ["/nico-system/sa/", "/default/sa/"] (see helm/charts/nico-api/files/carbide-api-config.toml:64-66, dev/deployment/devspace/values.base.yaml:69, helm/examples/values-full.yaml:324-326). There's no /carbide-system/sa/ in the list.

That means a fully-stale carbide-era cert (spiffe://carbide.local/carbide-system/sa/carbide-dns) would fail at the trust-domain check or base-path strip in extract_service_identifier (crates/authn/src/lib.rs:144-183), long before reaching this RBAC matcher.

  1. Stale fixtures in crates/api/src/auth/test_certs.rs and crates/api/src/auth.rs (nit, follow-up). Test fixtures still hardcode URI:spiffe://example.test/carbide-system/sa/carbide-dhcp (test_certs.rs:60) and service_base_paths: ["/carbide-system/sa/", ...] (auth.rs:286).

@shayan1995 shayan1995 force-pushed the fix/internal-rbac-spiffe-identifiers branch from 7a4dc89 to 8c73b08 Compare May 27, 2026 17:19
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 27, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@shayan1995 shayan1995 force-pushed the fix/internal-rbac-spiffe-identifiers branch from 8c73b08 to 5bdb1f8 Compare May 27, 2026 17:52
After the carbide → nico platform rename, all newly deployed services
present SPIFFE identifiers with the nico-* prefix, but InternalRBACRules
in crates/api/src/auth/internal_rbac_rules.rs still matched against
hardcoded carbide-* strings. Every internal service-to-api gRPC call
failed mTLS authorization with HTTP 403, silently breaking all
service-to-service communication.

Switch RuleInfo::new from `map(|x| -> Principal)` to
`flat_map(|x| -> Vec<Principal>)` so each rule can accept multiple
acceptable SPIFFE identifiers, then have each renamed variant emit
BOTH the new nico-* and the legacy carbide-* identifier via an
svc_compat helper:

  Dns                    -> nico-dns, carbide-dns
  Dhcp                   -> nico-dhcp, carbide-dhcp
  Ssh                    -> nico-ssh-console, carbide-ssh-console
  SshRs                  -> nico-ssh-console-rs, carbide-ssh-console-rs
  Pxe                    -> nico-pxe, carbide-pxe
  BmcProxy               -> nico-bmc-proxy, carbide-bmc-proxy
  Health                 -> nico-hardware-health, carbide-hardware-health
  Flow                   -> nico-flow, carbide-flow
  MaintenanceJobs        -> nico-maintenance-jobs, carbide-maintenance-jobs
  DsxExchangeConsumer    -> nico-dsx-exchange-consumer,
                            carbide-dsx-exchange-consumer

The `allowed()` matcher already walks this Vec with `.any(...)`, so the
change is transparent to callers: deployed sites still presenting a
carbide-* cert continue to authorize, and freshly deployed sites with
nico-* certs work too. The carbide-* aliases should be dropped once
every site has rotated to a nico-* cert.

Failure mode before this fix: inbound gRPC from e.g. nico-dns to nico-api
surfaced as

  WARN auth::internal_rbac_rules — principal SpiffeServiceIdentifier("nico-dns")
       not authorized for method LookupRecordLegacy — no matching rule

with no TLS-level error, masking the root cause. Impact spanned DNS
resolution, DHCP lease lookups, PXE GetCloudInitInstructions, SSH console
access, hardware health reporting, and maintenance job scheduling — every
internal principal that authenticates via SpiffeServiceIdentifier.

Follow-up (not in this PR): these identifiers are stringly-typed with no
compile-time link to the actual deployed service names. Worth deriving
them from a shared constant or asserting consistency in an integration
test that round-trips each principal through cert subject + RBAC lookup.

Fixes NVIDIA#1891
@shayan1995 shayan1995 force-pushed the fix/internal-rbac-spiffe-identifiers branch from 5bdb1f8 to a0c0a42 Compare May 27, 2026 17:54
@shayan1995
Copy link
Copy Markdown
Contributor Author

@kunzhao-nv you're right, this fails at the trust-domain check (carbide.local ≠ nico.local) before reaching the matcher. This PR doesn't claim to support that case. A site running fully-stale certs would also be running the fully-stale carbide-api binary against its forged-deployed services; they should upgrade together as a unit.

rust domain + base paths are deployment-config (TOML, settable per-deployment); only the RBAC matcher principal strings are hardcoded in the Rust source, so only those need backward-compat in code. Adding /forge-system/sa/ or forge.local to the helm config would just be dead config in helm deployments (helm-deployed services never present those).

@shayan1995 shayan1995 merged commit 5a4a07e into NVIDIA:main May 27, 2026
51 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug: SPIFFE service identifiers in InternalRBACRules not updated after carbide→NICo rename — all internal gRPC calls return 403

3 participants