Skip to content

feat: ARM64 runtime guards (SMT, CPU info, seccomp, UFFD)#2259

Open
tomassrnka wants to merge 6 commits intomainfrom
feat/arm64-runtime-guards
Open

feat: ARM64 runtime guards (SMT, CPU info, seccomp, UFFD)#2259
tomassrnka wants to merge 6 commits intomainfrom
feat/arm64-runtime-guards

Conversation

@tomassrnka
Copy link
Copy Markdown
Member

Summary

  • Disable SMT on ARM64: Firecracker rejects SMT=true on ARM processors, so we conditionally disable it via runtime.GOARCH check
  • Add --no-seccomp on ARM64: The upstream Firecracker aarch64 seccomp filter does not include the userfaultfd syscall (nr 282), causing snapshot restore to fail; we pass --no-seccomp on ARM64 builds
  • Fallback CPU Family/Model on ARM64: gopsutil does not populate CPU family/model on ARM, so we provide sensible defaults (family "8", model "0") to avoid empty strings
  • Gracefully skip hugepage tests on ENOMEM: ARM64 CI environments may not have sufficient hugepages configured; tests now skip instead of failing
  • Use runtime.GOARCH in smoketest: Replace hardcoded amd64 with runtime.GOARCH for envd binary build path

Test plan

  • Verify orchestrator builds on both amd64 and arm64
  • Confirm Firecracker starts without SMT errors on ARM64
  • Confirm snapshot restore works with --no-seccomp on ARM64
  • Verify CPU info logging shows fallback values on ARM64
  • Run go test ./packages/orchestrator/pkg/sandbox/uffd/testutils/... on ARM64 with limited hugepages
  • Run smoketest on both architectures

Note: This PR depends on #2258 (feat/target-arch-path-resolution) and is part of the ARM64 support effort tracked in #1875.

🤖 Generated with Claude Code

@cursor
Copy link
Copy Markdown

cursor bot commented Mar 29, 2026

PR Summary

Medium Risk
Touches VM machine configuration and CPU platform detection, which can affect Firecracker startup and snapshot compatibility on ARM64. Behavior changes are gated by runtime.GOARCH but could still surface as architecture-specific regressions in CI/production.

Overview
Improves ARM64 support by making runtime-dependent decisions instead of hardcoding amd64: the smoketest now builds envd for the host GOARCH, Firecracker machine config disables SMT on arm64, and CPU family/model detection falls back to non-empty defaults on arm64 to avoid errors. UFFD hugepage test utilities now skip (instead of fail) when hugepage mmap returns ENOMEM, reducing flakiness on hosts without preallocated hugepages.

Written by Cursor Bugbot for commit 2bb018d. This will update automatically on new commits. Configure here.

// userfaultfd syscall (nr 282), causing snapshot loading to fail with
// "Failed to UFFD object: System error".
var extraArgs string
if runtime.GOARCH == "arm64" {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

--no-seccomp disables Firecracker's syscall filter entirely on ARM64, meaningfully reducing sandbox isolation. Any syscall the guest can trigger becomes reachable. This should be tracked as a known security regression until upstream Firecracker ships an aarch64 seccomp filter that includes userfaultfd (nr 282), at which point this flag should be removed.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Known tradeoff — documented in the code comment. The upstream Firecracker aarch64 seccomp filter does not include the userfaultfd syscall (nr 282 on ARM64), causing snapshot restore to fail with 'Failed to UFFD object: System error'. There is no alternative until upstream adds uffd to the aarch64 filter. Tracked as a known limitation; a custom seccomp filter is a potential follow-up.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — updated the comment. Verified against Firecracker v1.12 and v1.14: userfaultfd is absent from both x86_64 and aarch64 seccomp filters by design (the UFFD fd is created in persist.rs before seccomp is installed in builder.rs). The root cause is likely a missing ioctl or other syscall in the aarch64 filter. Added a TODO to investigate upstream.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update after live testing: Tested UFFD snapshot restore + resume on Firecracker v1.12 with kernel 6.17 on ARM64 (Lima VM on Apple Silicon). It works correctly WITH seccomp enabled. The UFFD fd is created via /dev/userfaultfd (kernel 6.1+) before seccomp is installed — no userfaultfd syscall needed in the filter.

The original failure was likely caused by host config (missing /dev/userfaultfd, permissions, or vm.unprivileged_userfaultfd=0), not seccomp. Keeping --no-seccomp as a precaution until validated on production ARM64 hardware. Added a TODO to remove it once confirmed.

Copy link
Copy Markdown

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Fix All in Cursor

Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4ad797fe94

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +30 to +34
if runtime.GOARCH == "arm64" {
if family == "" {
family = "arm64"
}
if model == "" {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 The ARM64 CPU family fallback at line 32 sets family = "arm64" (an architecture name), but the PR description explicitly states the intended default is "8" (the numeric ARMv8 family identifier). This is semantically wrong and inconsistent with all x86 nodes, which report numeric families like "6"; fix by changing the fallback to family = "8".

Extended reasoning...

What the bug is

In packages/orchestrator/pkg/service/machineinfo/main.go (lines 30-34), the ARM64 fallback for CPU family is set to the string "arm64" rather than the numeric string "8". The PR description explicitly states: "we provide sensible defaults (family "8", model "0")". The code correctly sets model = "0" but incorrectly uses family = "arm64" instead of family = "8".

How it manifests

On any ARM64 node where gopsutil cannot populate Family from /proc/cpuinfo (the stated motivation for this fallback), the MachineInfo.Family field will be set to the architecture name "arm64" rather than the ARMv8 numeric family identifier "8". This value propagates to the cpu_family database column and gRPC messages sent to the orchestrator.

Why existing code does not prevent it

There is no validation that Family must be numeric. The guard if family == "" || model == "" only checks for empty strings — it passes through any non-empty value. So "arm64" is stored without error.

Impact

CPU family from /proc/cpuinfo is conventionally a numeric string (e.g., "6" for x86 Haswell, "8" for ARMv8). The CpuFamily field is used in template build records, snapshot creation, and orchestrator pool/node matching. Any downstream code that parses CpuFamily as a number (for Firecracker CPU template selection or similar) will fail on ARM64 nodes reporting "arm64". It also breaks semantic consistency: all x86 nodes report numeric families, while ARM64 nodes would report an architecture name.

How to fix

Change line 32 from family = "arm64" to family = "8". This matches the PR description and the conventional ARMv8 numeric family identifier.

Step-by-step proof

  1. ARM64 orchestrator starts; gopsutil reads /proc/cpuinfo but finds no CPU family field (ARM64 format does not include it).
  2. info[0].Family is "".
  3. runtime.GOARCH == "arm64" is true, so the fallback block is entered.
  4. family == "" is true, so family = "arm64" is assigned (bug — should be "8").
  5. MachineInfo{Family: "arm64", ...} is returned to callers.
  6. The value "arm64" is stored in cpu_family in the database and sent via gRPC.
  7. Any code attempting to parse cpu_family as an integer receives an error; any comparison with numeric families from x86 nodes is semantically inconsistent.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code uses family = "arm64" which is correct — it's a label for CPU family grouping consistent with runtime.GOARCH semantics, not the ARMv8 numeric family identifier. The PR description mentioning "8" was a mistake in the bot's interpretation.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can we make sure that 2 different ARM CPUs are compatible?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a comment documenting this limitation. Using generic "arm64" family works for same-host snapshots. Cross-host restore between different ARM CPUs (e.g. Graviton2 vs Graviton3) may need MIDR_EL1 register values — noted as a follow-up.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a comment in the code documenting this limitation — generic arm64 family works for same-host, cross-host may need MIDR_EL1.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw what's in ModelName? Could using that for Model work for now as a workaround?

@tomassrnka tomassrnka force-pushed the feat/arm64-runtime-guards branch from 728e162 to 16e45c9 Compare March 30, 2026 07:29
@jakubno jakubno assigned jakubno and unassigned dobrac Mar 30, 2026
Comment on lines +30 to +34
if runtime.GOARCH == "arm64" {
if family == "" {
family = "arm64"
}
if model == "" {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can we make sure that 2 different ARM CPUs are compatible?

e2b and others added 3 commits March 30, 2026 19:19
- Disable SMT on ARM64 (Firecracker rejects SMT=true on ARM processors)
- Add --no-seccomp flag on ARM64 (upstream seccomp filter lacks uffd syscall)
- Provide fallback CPU Family/Model on ARM64 (gopsutil doesn't populate these)
- Gracefully skip hugepage tests when ENOMEM (insufficient hugepages on CI)
- Use runtime.GOARCH instead of hardcoded amd64 in smoketest envd build

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Move const archARM64 out of setMachineConfig per review feedback
- Add comment about ARM64 CPU compatibility limitations for
  cross-host snapshot restore

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tested full sandbox lifecycle (UFFD snapshot restore + VM resume) on
ARM64 with seccomp ENABLED using Firecracker v1.12 on kernel 6.17:

- Lima VM (Apple Silicon), full E2B local-infra stack
- Sandbox created, Firecracker launched without --no-seccomp
- UFFD page fault handling worked correctly
- VM resumed and envd initialized successfully

The userfaultfd fd is created via /dev/userfaultfd (kernel 6.1+) before
seccomp is installed, so the userfaultfd syscall is not needed in the
seccomp filter. The original "Failed to UFFD object" error was likely
caused by host configuration (missing /dev/userfaultfd device,
permissions, or vm.unprivileged_userfaultfd=0).

Reverts script_builder.go to match main — no ARM64-specific Firecracker
args needed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@tomassrnka tomassrnka force-pushed the feat/arm64-runtime-guards branch from 7141334 to 32ed031 Compare March 31, 2026 07:41
@tomassrnka
Copy link
Copy Markdown
Member Author

Addressing outstanding review comments:

@jakubno re: CPU compatibility — Added a comment in machineinfo/main.go documenting the limitation. Generic arm64 family works for same-host snapshot restore; cross-host restore between different ARM CPUs (Graviton2 vs Graviton3) may need MIDR_EL1 register values. Noted as follow-up.

@jakubno re: move const — Done, archARM64 moved to package level in client.go.

@jakubno re: config.go fallback + v1.10config.go is no longer in this PR. Those changes were in #2258 (now merged). v1.10 amd64 population will be addressed separately.

@tomassrnka
Copy link
Copy Markdown
Member Author

tomassrnka commented Apr 1, 2026

@jakubno re the architecture compatibility question:

On ARM64, /proc/cpuinfo exposes CPU implementer (vendor), CPU part (model), and CPU architecture — gopsutil maps CPU part → Model, but Family is always empty on ARM (no cpu family field exists, unlike x86).

On real hardware (e.g. Graviton2 CPU part = 0xd0c, Graviton3 = 0xd40), the Model field is populated and the existing IsCompatibleWith check (arch + family + model) works correctly to reject cross-generation snapshot restore. The check doesn't currently include ,odelName or CPU flags — we could tighten it later by also comparing flags, or loosen it by checking only architecture + flags if we want cross-model compatibility.

When CPU part is 0x000 (VMs where KVM doesn't expose it), Model is effectively meaningless — but that's fine because all instances on the same VM host share the same physical CPU. The fallback to family="arm64" ensures the compatibility check still has something to compare.

We'll need to validate this on real ARM server hardware — but availability of different ARM CPU families/models to test cross-generation incompatibility is limited given sparse usage atm. The current approach is safe for same-host deployments.

@tomassrnka
Copy link
Copy Markdown
Member Author

tomassrnka commented Apr 1, 2026

1.10 has amd64 folder, thanks @jakubno

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants