feat(metrics): add per-VM resource utilization metrics #67
base: main
Add real-time VM resource utilization metrics using /proc/<pid>/stat and
/proc/<pid>/statm for accurate per-process measurements (instead of cgroups
which aggregate at the session level).
New metrics exported via OpenTelemetry:
- hypeman_vm_cpu_seconds_total: CPU time consumed by VM hypervisor
- hypeman_vm_allocated_vcpus: Number of vCPUs allocated
- hypeman_vm_memory_rss_bytes: Resident Set Size (actual physical memory)
- hypeman_vm_memory_vms_bytes: Virtual Memory Size
- hypeman_vm_allocated_memory_bytes: Total allocated memory
- hypeman_vm_network_rx_bytes_total: Network bytes received (from TAP)
- hypeman_vm_network_tx_bytes_total: Network bytes transmitted (from TAP)
- hypeman_vm_memory_utilization_ratio: RSS / allocated memory
Also adds REST endpoint GET /instances/{id}/stats for per-instance stats.
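As a rough illustration of the /proc/<pid>/stat approach described above, the following sketch extracts the user and system CPU tick counters (fields 14 and 15 in proc(5)). The function name `parseStatTicks` is hypothetical and not taken from this PR; the actual implementation may parse the file differently.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseStatTicks (hypothetical name) extracts utime and stime, fields 14 and
// 15 of /proc/<pid>/stat, from the raw file contents. Both are clock ticks.
func parseStatTicks(stat string) (utime, stime uint64, err error) {
	// Field 2 (comm) is wrapped in parentheses and may itself contain spaces
	// or ')', so only the text after the last ')' is split on whitespace.
	end := strings.LastIndexByte(stat, ')')
	if end < 0 {
		return 0, 0, fmt.Errorf("malformed stat line: no ')' found")
	}
	fields := strings.Fields(stat[end+1:])
	// fields[0] is field 3 (state), so field 14 is fields[11] and field 15 is fields[12].
	if len(fields) < 13 {
		return 0, 0, fmt.Errorf("malformed stat line: %d fields after comm", len(fields))
	}
	if utime, err = strconv.ParseUint(fields[11], 10, 64); err != nil {
		return 0, 0, fmt.Errorf("parse utime: %w", err)
	}
	if stime, err = strconv.ParseUint(fields[12], 10, 64); err != nil {
		return 0, 0, fmt.Errorf("parse stime: %w", err)
	}
	return utime, stime, nil
}

func main() {
	// Reads this process's own stat file; this only works on Linux.
	data, err := os.ReadFile("/proc/self/stat")
	if err != nil {
		fmt.Println("could not read /proc/self/stat:", err)
		return
	}
	utime, stime, err := parseStatTicks(string(data))
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("utime=%d ticks, stime=%d ticks\n", utime, stime)
}
```

Splitting after the last ')' matters because a process name can contain spaces, which would otherwise shift every field index.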
- Add InstanceStats schema and /instances/{id}/stats endpoint to openapi.yaml
- Regenerate oapi code with make oapi-generate
- Move stats implementation to instances.go following existing patterns
- Remove custom stats.go and route from main.go
✱ Stainless preview builds
This PR will update the Edit this comment to update it. It will appear in the SDK's changelogs.
⚠️ hypeman-go studio · code · diff
There was a regression in your SDK.
generate ⚠️ (prev: generate ✅) → lint ✅ → test ✅
go get github.com/stainless-sdks/hypeman-go@3c59ef636f7ab71682c6362247501185e880ea5d
New diagnostics (1 warning)
⚠️ Endpoint/NotConfigured: `get /instances/{id}/stats` exists in the OpenAPI spec, but isn't specified in the Stainless config, so code will not be generated for it.
❗ hypeman-cli studio
Unknown conclusion: fatal
New diagnostics (1 warning)
⚠️ Endpoint/NotConfigured: `get /instances/{id}/stats` exists in the OpenAPI spec, but isn't specified in the Stainless config, so code will not be generated for it.
This comment is auto-generated by GitHub Actions and is automatically kept up to date as you push.
If you push custom code to the preview branch, re-run this workflow to update the comment.
Last updated: 2026-01-23 18:05:47 UTC
Replace hardcoded 4096 page size with os.Getpagesize() to support ARM systems (AWS Graviton, Apple Silicon) which may use 16KB or 64KB pages. Without this fix, memory metrics would be underreported by 4x-16x on non-x86 systems.
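To make the page-size issue concrete, here is a minimal sketch of converting /proc/<pid>/statm page counts to bytes. The helper name `parseStatm` is an assumption for illustration, not the PR's actual function; `os.Getpagesize()` is the standard library call named in the commit message.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseStatm (hypothetical name) converts the first two fields of
// /proc/<pid>/statm (total program size and resident set size, both in
// pages) into bytes using the supplied page size.
func parseStatm(statm string, pageSize int) (vmsBytes, rssBytes uint64, err error) {
	fields := strings.Fields(statm)
	if len(fields) < 2 {
		return 0, 0, fmt.Errorf("malformed statm: %q", statm)
	}
	sizePages, err := strconv.ParseUint(fields[0], 10, 64)
	if err != nil {
		return 0, 0, fmt.Errorf("parse size: %w", err)
	}
	rssPages, err := strconv.ParseUint(fields[1], 10, 64)
	if err != nil {
		return 0, 0, fmt.Errorf("parse resident: %w", err)
	}
	ps := uint64(pageSize)
	return sizePages * ps, rssPages * ps, nil
}

func main() {
	// Reads this process's own statm file; this only works on Linux.
	data, err := os.ReadFile("/proc/self/statm")
	if err != nil {
		fmt.Println("could not read /proc/self/statm:", err)
		return
	}
	// Ask the OS for the page size instead of assuming 4096.
	vms, rss, err := parseStatm(string(data), os.Getpagesize())
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("page size=%d vms=%d bytes rss=%d bytes\n", os.Getpagesize(), vms, rss)
}
```

With the same page counts, a 16 KiB page size yields values four times larger than a hardcoded 4096 would, which is the underreporting the commit describes.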
cmd/api/api/instances.go
Outdated
}

// generateTAPName generates TAP device name from instance ID
func generateTAPName(instanceID string) string {
we should reuse the existing code that determines the TAP name from the instance name
lib/instances/manager.go
Outdated
// generateTAPName generates TAP device name from instance ID.
// This matches the logic in network/allocate.go.
func generateTAPName(instanceID string) string {
here too
lib/resources/utilization.go
Outdated
// VMUtilization holds actual resource utilization metrics for a VM.
// These are real-time values read from /proc/<pid>/stat, /proc/<pid>/statm, and TAP interfaces.
type VMUtilization struct {
I think it might be nice to move the new code for this feature into lib/vm_metrics or lib/utilization (or whatever name is best), since it's a separate feature from the current lib/resources feature, which is about the host's resources. Since this is all net-new, almost all of this change could live in a new feature directory, mostly isolated from the rest.
cmd/api/api/instances.go
Outdated
if err != nil {
	log.DebugContext(ctx, "failed to read proc stat", "pid", pid, "error", err)
} else {
	stats.CpuSeconds = float64(cpuUsec) / 1_000_000.0
It seems like too much logic is happening in the API handler. The API handler ought to just translate from domain types (e.g. lib/utilization/types.go) into API types and deal with other handler-level concerns like error mapping.
Also, once this is moved to a new lib/ directory, it would be nice to add a README explaining the feature, similar to other features in the repo.
This addresses PR review feedback:
- Create new lib/vm_metrics package for better separation of concerns
- Export GenerateTAPName from lib/network (fixes TAP name mismatch bug)
- Simplify API handler to use vm_metrics.Manager
- Add comprehensive README.md for the feature
- Remove utilization code from lib/resources package

The TAP name bug was causing network stats to always be zero because the API handler used different truncation logic (10 chars, no lowercase) than the canonical implementation (8 chars, lowercase).
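The mismatch described in this commit can be sketched as follows. The `tap-` prefix and both function names are hypothetical, chosen only for illustration; the real naming scheme lives in lib/network and is not shown in this conversation. Only the truncation lengths and the lowercasing difference come from the commit message.

```go
package main

import (
	"fmt"
	"strings"
)

// tapPrefix is a hypothetical prefix used only for this sketch.
const tapPrefix = "tap-"

// truncate returns at most n leading bytes of s (instance IDs are assumed ASCII).
func truncate(s string, n int) string {
	if len(s) > n {
		return s[:n]
	}
	return s
}

// canonicalTAPName mirrors the rule the commit attributes to the canonical
// implementation: lowercase the ID and keep the first 8 characters.
func canonicalTAPName(instanceID string) string {
	return tapPrefix + truncate(strings.ToLower(instanceID), 8)
}

// divergentTAPName mirrors the rule the commit attributes to the old API
// handler: keep the first 10 characters without lowercasing.
func divergentTAPName(instanceID string) string {
	return tapPrefix + truncate(instanceID, 10)
}

func main() {
	id := "qilviffnqzck2jrim1x6s2b1"
	fmt.Println("canonical:", canonicalTAPName(id))
	fmt.Println("divergent:", divergentTAPName(id))
}
```

Because the two rules produce different interface names for the same instance, a stats lookup using the second name would never find the device that was created with the first, which is consistent with network counters always reading zero.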
sjmiller609
left a comment
Looks great! Thank you
TAP interface statistics are from the host's perspective:
- TAP rx_bytes = bytes host receives = bytes VM transmits
- TAP tx_bytes = bytes host transmits = bytes VM receives

The API documents these as "bytes received/transmitted by the VM", so we need to swap them to match the VM's perspective.
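A minimal sketch of the swap, assuming the counters are read from the Linux sysfs files /sys/class/net/<iface>/statistics/rx_bytes and tx_bytes. The type and function names here are hypothetical, and the PR may obtain TAP statistics by another route.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// tapCounters holds byte counters as the host sees them on the TAP device.
type tapCounters struct {
	RxBytes uint64 // bytes the host received, i.e. sent by the VM
	TxBytes uint64 // bytes the host transmitted, i.e. received by the VM
}

// vmNetStats holds byte counters from the VM's point of view.
type vmNetStats struct {
	RxBytes uint64
	TxBytes uint64
}

// toVMPerspective swaps rx and tx so the values describe the VM, not the host.
func toVMPerspective(tap tapCounters) vmNetStats {
	return vmNetStats{RxBytes: tap.TxBytes, TxBytes: tap.RxBytes}
}

// readCounter reads one sysfs statistics file for the named interface.
// Reading sysfs is an assumption of this sketch.
func readCounter(iface, name string) (uint64, error) {
	path := filepath.Join("/sys/class/net", iface, "statistics", name)
	data, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return strconv.ParseUint(strings.TrimSpace(string(data)), 10, 64)
}

func main() {
	// Fixed example values; on a real host these would come from readCounter.
	host := tapCounters{RxBytes: 1500, TxBytes: 9000}
	vm := toVMPerspective(host)
	fmt.Printf("VM received %d bytes, VM transmitted %d bytes\n", vm.RxBytes, vm.TxBytes)
}
```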
Cursor Bugbot has reviewed your changes and found 1 potential issue.
The PR comment raised a valid concern about hardcoded ticksPerSecond=100. However, /proc always reports CPU times in USER_HZ (not kernel CONFIG_HZ):
- CONFIG_HZ: kernel's internal tick rate (100, 250, 300, or 1000)
- USER_HZ: userspace ABI constant, always 100 on Linux since 2.4

The kernel converts from CONFIG_HZ to USER_HZ when writing to /proc to maintain a stable userspace ABI. Added documentation explaining this.
See: https://man7.org/linux/man-pages/man5/proc.5.html
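The conversion this commit documents can be sketched as below. The constant and function names are illustrative rather than the PR's own, and the value 100 follows the commit's reasoning; proc(5) describes dividing by sysconf(_SC_CLK_TCK), which is the portable way to obtain the same number.

```go
package main

import "fmt"

// userHZ is the tick rate that /proc CPU times are expressed in, per the
// commit message above. It is independent of the kernel's CONFIG_HZ.
const userHZ = 100

// ticksToSeconds converts a clock-tick count (for example utime + stime
// from /proc/<pid>/stat) into seconds of CPU time.
func ticksToSeconds(ticks uint64) float64 {
	return float64(ticks) / userHZ
}

func main() {
	// Example: 2000 user ticks plus 994 system ticks.
	var utime, stime uint64 = 2000, 994
	fmt.Printf("cpu_seconds=%.2f\n", ticksToSeconds(utime+stime))
}
```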
Summary
- Add real-time VM resource utilization metrics using /proc/<pid>/stat and /proc/<pid>/statm for accurate per-process measurements
- Use /proc instead of cgroups to avoid session-level aggregation issues
- Add REST endpoint GET /instances/{id}/stats for per-instance utilization data

New OTel Metrics
- hypeman_vm_cpu_seconds_total
- hypeman_vm_allocated_vcpus
- hypeman_vm_memory_rss_bytes
- hypeman_vm_memory_vms_bytes
- hypeman_vm_allocated_memory_bytes
- hypeman_vm_network_rx_bytes_total
- hypeman_vm_network_tx_bytes_total
- hypeman_vm_memory_utilization_ratio

Stats Endpoint
curl -H "Authorization: Bearer <token>" http://localhost:8083/instances/{id}/stats

Response:
{
  "instance_id": "qilviffnqzck2jrim1x6s2b1",
  "instance_name": "test-vm",
  "cpu_seconds": 29.94,
  "memory_rss_bytes": 443338752,
  "memory_vms_bytes": 4330745856,
  "network_rx_bytes": 0,
  "network_tx_bytes": 0,
  "allocated_vcpus": 2,
  "allocated_memory_bytes": 4294967296,
  "memory_utilization_ratio": 0.103
}

Prometheus Queries
Test plan
- go test ./lib/resources/...
- /instances/{id}/stats returns correct data for running VM

Note
Adds end-to-end VM utilization telemetry from host perspective and exposes it via API and OTel.
- lib/vm_metrics/ reads /proc/<pid>/stat|statm and TAP stats; registers OTel instruments (hypeman_vm_*) and provides Manager with adapters
- GET /instances/{id}/stats route (OpenAPI, client, server) returning InstanceStats; ApiService wired with VMMetricsManager
- ListRunningInstancesInfo added to instances.Manager to supply PID/TAP and allocations for metrics
- network.GenerateTAPName (renamed from generateTAPName) for metrics derivation
- ProvideVMMetricsManager; DI and wire gen updated

Minor: tests for metrics collectors/managers; OpenAPI schema and swagger regenerated.
Written by Cursor Bugbot for commit 01f8240. This will update automatically on new commits.