Refactor SPIRE Integration and Optimize Spire Memory Footprint#1837

Open
gngram wants to merge 7 commits into tiiuae:main from gngram:pull/spire_module_update

Conversation

@gngram
Contributor

@gngram gngram commented Mar 18, 2026

Description of Changes

This PR refactors the SPIRE (formerly referred to as the SPIFFE module) server and agent implementation and reduces the memory footprint of the SPIRE binaries.

  • Adds an option to use a custom spire package.
  • Provides a custom spire package (spire-min) as an overlay that strips unused cloud-specific modules to reduce the memory footprint.
  • Standardizes naming conventions by renaming the remaining spiffe references to spire.
  • Publishes a server health report; server health is checked before starting the agent to avoid failures when the server is not ready.
  • Implements a generic agent registration flow based on the node attestation method used by each VM.
  • Updates common configurations to allow the server to dynamically retrieve VM lists, per-VM workloads, and attestation methods.
  • Adds per-VM workload registration; previously workloads were not separated.
  • Adds a single global config to enable or disable SPIFFE identity issuance using SPIRE.
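
The "check server health before starting the agent" item can be illustrated with a small retry helper. This is only a sketch, not the code in this PR: the function name `retry_until` and the `RETRY_DELAY` variable are hypothetical, and the probe shown in the usage comment assumes the `spire-server healthcheck` subcommand is what gets polled.

```shell
#!/bin/sh
# Sketch only: retry a probe command until it succeeds or attempts run out.
# retry_until and RETRY_DELAY are illustrative names, not part of the PR.
#
# Usage: retry_until MAX_ATTEMPTS COMMAND [ARGS...]
retry_until() {
    max="$1"
    shift
    attempt=1
    # Run the probe quietly; loop while it keeps failing.
    while ! "$@" >/dev/null 2>&1; do
        if [ "$attempt" -ge "$max" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "${RETRY_DELAY:-1}"
    done
}

# Hypothetical use before starting an agent (probe command is an assumption):
#   retry_until 30 spire-server healthcheck \
#       -socketPath /tmp/spire-server/private/api.sock
```

Passing the probe as arguments keeps the helper independent of how health is actually reported, so the same loop works whether the agent side polls a CLI command or a published health file.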

Type of Change

  • New Feature
  • Bug Fix
  • Improvement / Refactor

Related Issues / Tickets

Checklist

  • Clear summary in PR description
  • Detailed and meaningful commit message(s)
  • Commits are logically organized and squashed if appropriate
  • Contribution guidelines followed
  • Ghaf documentation updated with the commit - https://tiiuae.github.io/ghaf/
  • Author has run make-checks and it passes
  • All automatic GitHub Action checks pass - see actions
  • Author has added reviewers and removed PR draft status

Testing Instructions

Applicable Targets

  • Orin AGX aarch64
  • Orin NX aarch64
  • Lenovo X1 x86_64
  • Dell Latitude x86_64
  • System 76 x86_64

Installation Method

  • Requires full re-installation
  • Can be updated with nixos-rebuild ... switch
  • Other:

Test Steps To Verify:

  1. Check the status of the spire-agent service in each VM (except admin); it should be running without errors.
  2. Check the status of the spire-server service in the admin VM; it should be running without errors.
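
The two steps above could be scripted roughly as follows. This is a sketch: the helper name `check_unit` is hypothetical, the `.service` unit names are assumed from the service names in the steps, and `SYSTEMCTL` is overridable only so the function can be exercised without systemd.

```shell
#!/bin/sh
# Sketch only: report whether a systemd unit is active.
# check_unit is an illustrative helper, not part of the PR.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

check_unit() {
    unit="$1"
    # "is-active --quiet" exits 0 only when the unit is active.
    if "$SYSTEMCTL" is-active --quiet "$unit"; then
        echo "OK: $unit is active"
    else
        echo "FAIL: $unit is not active"
        return 1
    fi
}

# On the admin VM (assumed unit name):    check_unit spire-server.service
# On every other VM (assumed unit name):  check_unit spire-agent.service
```

For a failing unit, `journalctl -u <unit>` shows the corresponding error messages.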

@milva-unikie

spire-publish-bundle.service and x509pop-key-setup.service fail in admin-vm on all Orins that we test (native Orin-AGX, cross-compiled Orin-AGX, native Orin-NX, cross-compiled Orin-NX)

spire-publish-bundle.service:

Mar 18 15:37:53 admin-vm spire-publish-bundle[703]: ERROR: SPIRE server socket not found at /tmp/spire-server/private/api.sock
Mar 18 15:37:53 admin-vm systemd[1]: spire-publish-bundle.service: Main process exited, code=exited, status=1/FAILURE
Mar 18 15:37:53 admin-vm systemd[1]: spire-publish-bundle.service: Failed with result 'exit-code'.
Mar 18 15:37:53 admin-vm systemd[1]: Failed to start Publish SPIRE trust bundle (PoC).
Mar 18 15:37:53 admin-vm systemd[1]: spire-publish-bundle.service: Consumed 1.975s CPU time, 5.7M memory peak.

x509pop-key-setup.service has no logs.

These issues should be fixed and the code reviewed before we start manual testing.
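
The log above only shows that the socket was missing when the unit ran; it does not establish why. If the cause turns out to be ordering (the publish step running before the server has created its socket), one possible mitigation is to wait for the socket with a timeout instead of failing immediately. A minimal sketch, with a hypothetical helper name and an assumed default timeout:

```shell
#!/bin/sh
# Sketch only: wait until a Unix socket appears, or fail after a timeout.
# wait_for_socket and the 30 s default are assumptions, not the PR's code.
#
# Usage: wait_for_socket PATH [TIMEOUT_SECONDS]
wait_for_socket() {
    sock="$1"
    timeout="${2:-30}"
    elapsed=0
    # -S is true only if the path exists and is a socket.
    while [ ! -S "$sock" ]; do
        if [ "$elapsed" -ge "$timeout" ]; then
            echo "ERROR: socket not found at $sock after ${timeout}s" >&2
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
}

# Example with the path from the log above:
#   wait_for_socket /tmp/spire-server/private/api.sock 60
```

A wait loop like this only papers over ordering; declaring the dependency in the systemd unit would be the more direct fix if ordering is indeed the problem.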

@milva-unikie added the labels "bug on Orin AGX" (Issues found on NVIDIA Jetson AGX Orin while checking this PR) and "bug on Orin NX" (Issues found on NVIDIA Jetson NX Orin while checking this PR), and removed the label "Needs Testing" (CI Team to pre-verify), Mar 19, 2026
@kajusnau
Collaborator

As a general note, is there some overlapping functionality with what NixOS/nixpkgs#481447 introduced recently? Maybe we could reduce the footprint of the module(s) on our side if we can re-use some of the upstream services.spire namespace.

Collaborator

@kajusnau kajusnau left a comment

Some basic comments. I also suggest using getExe consistently throughout the patch; it's already used in some places but not all.

@kajusnau
Collaborator

Maybe we should put a hold on this until after #1799 and #1822 are in

From c00cf4f1dd5a3fedd7dc701828d476efe1f09e4a Mon Sep 17 00:00:00 2001
From: Ganga Ram <Ganga.Ram@tii.ae>
Date: Tue, 17 Feb 2026 12:46:42 +0400
Subject: [PATCH] removed cloud infra to reduce memory footprint
Collaborator

Can we try to push this patch upstream?

Contributor Author

It's basically a hack to remove some built-in plugins. I considered creating an upstream patch, but it would require major restructuring. They use common files across different modules to manage plugins, and since Go doesn't provide conditional compilation of code segments within a file, I would have to split those files per plugin in order to use Go build tags. They are unlikely to accept that, and there would be many tags to manage.

Collaborator

Have you compared the memory footprint difference after applying this patch? If it's a hacky patch, then at least an upstream issue could be created to gather ideas and suggestions for improvement.

Could we also consider creating minimal profile files that exclude the required plugins?

@gngram
Contributor Author

gngram commented Mar 19, 2026

> As a general note, is there some overlapping functionality with what NixOS/nixpkgs#481447 introduced recently? Maybe we could reduce the footprint of the module(s) on our side if we can re-use some of the upstream services.spire namespace.

@kajusnau Thanks for pointing this out; we can use the server and agent configurations from it.
I will use their agent and server config as is and handle token generation, bundle publishing, credential sharing, workload registration, etc. separately. When we rebase onto nixpkgs, we'll simply remove the file and rename the config.

@gngram
Contributor Author

gngram commented Mar 21, 2026

Rebased, and now using spire services from nixpkgs

everton-dematos and others added 7 commits March 24, 2026 18:06
…s Publish bundle/join-tokens and allow non-root Workload API access via /run/spire tmpfiles.

Co-authored-by: shamma-alblooshi1 <shamma.alblooshi@tii.ae>
Signed-off-by: Everton de Matos <everton.dematos@tii.ae>
- Binary size reduced by 40%
- Also reduces memory footprint

Signed-off-by: Ganga Ram <Ganga.Ram@tii.ae>
Signed-off-by: Everton de Matos <everton.dematos@tii.ae>
- spire server gets the following information from the VMs using the common config
1) list of VMs running spire agents
2) workloads per VM
3) node attestation method per VM

- the server address and port are also shared through the common config.

Signed-off-by: Ganga Ram <Ganga.Ram@tii.ae>
- all cloud related and other unused modules stripped

Signed-off-by: Ganga Ram <Ganga.Ram@tii.ae>
- using spire agent and server services from nixpkgs
- per-VM workload registration
- fine-grained spire server and agent configuration
- generic agent registration based on node attestation method
- agent and server synchronization
- spiffe module renamed to spire

Signed-off-by: Ganga Ram <Ganga.Ram@tii.ae>
Signed-off-by: Ganga Ram <Ganga.Ram@tii.ae>
@gngram gngram force-pushed the pull/spire_module_update branch from 13bf900 to c824c63 Compare March 25, 2026 06:25
@gngram gngram changed the title from "Refactor SPIRE Integration, Enable x509pop Attestation, and Optimize Footprint" to "Refactor SPIRE Integration and Optimize Spire Memory Footprint" Mar 25, 2026
@gngram gngram mentioned this pull request Mar 25, 2026
19 tasks
security = {
fail2ban.enable = globalConfig.development.ssh.daemon.enable or false;
audit.enable = lib.mkDefault (globalConfig.security.audit.enable or false);

Contributor

@everton-dematos everton-dematos Mar 25, 2026

If the admin-vm will also have workloads, shouldn't it have a SPIRE agent as well?

Labels

bug on Orin AGX: Issues found on NVIDIA Jetson AGX Orin while checking this PR
bug on Orin NX: Issues found on NVIDIA Jetson NX Orin while checking this PR

5 participants