
feat: cache syscall results in health check endpoints #135

Merged
nao1215 merged 6 commits into main from feat/health-cache-syscalls
Mar 13, 2026

Conversation

@nao1215 (Owner) commented Mar 13, 2026

Summary

  • Add lock-free HealthCache that caches disk_free_bytes() and process_rss_bytes() results with a configurable TTL (default 5s), eliminating redundant kernel context switches under high-frequency health polling
  • Add TRUSS_HEALTH_CACHE_TTL_SECS environment variable (0–300, default 5) to control cache duration; set to 0 to disable caching
  • Extract collect_resource_checks() helper to deduplicate ~70 lines of identical check logic between handle_health() and handle_health_ready()
  • Remove unused TransformImageRequestPayload import
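The lock-free cache described above can be sketched roughly as follows. This is a minimal illustration under assumed names (MetricCache, get_or_refresh), not the PR's actual HealthCache API: it caches a single u64 metric, where the real type holds both the disk-free and RSS values.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::{SystemTime, UNIX_EPOCH};

// Illustrative lock-free TTL cache for one u64 metric.
pub struct MetricCache {
    value: AtomicU64,
    stored_at_nanos: AtomicU64, // 0 means "never populated"
    ttl_nanos: u64,
}

impl MetricCache {
    pub fn new(ttl_secs: u64) -> Self {
        Self {
            value: AtomicU64::new(0),
            stored_at_nanos: AtomicU64::new(0),
            ttl_nanos: ttl_secs.saturating_mul(1_000_000_000),
        }
    }

    fn now_nanos() -> u64 {
        SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .map(|d| d.as_nanos() as u64)
            .unwrap_or(0)
    }

    /// Returns the cached value if it is still fresh; otherwise calls
    /// `probe` (the syscall) and caches the result. A TTL of 0 disables
    /// caching, so every call probes.
    pub fn get_or_refresh(&self, probe: impl FnOnce() -> u64) -> u64 {
        let now = Self::now_nanos();
        // Acquire pairs with the Release store below, so the value write
        // is visible before the timestamp that publishes it.
        let stored_at = self.stored_at_nanos.load(Ordering::Acquire);
        if stored_at != 0
            && self.ttl_nanos > 0
            && now.saturating_sub(stored_at) < self.ttl_nanos
        {
            return self.value.load(Ordering::Relaxed);
        }
        let fresh = probe();
        self.value.store(fresh, Ordering::Relaxed);
        self.stored_at_nanos.store(now, Ordering::Release);
        fresh
    }
}
```

With a nonzero TTL, repeated calls within the window return the cached value without touching the kernel; with TTL 0, every call falls through to the probe.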

Closes #74

Test plan

  • Unit tests for HealthCache: cached values within TTL, refresh after TTL, TTL=0 bypass, default TTL in ServerConfig
  • Unit tests for TRUSS_HEALTH_CACHE_TTL_SECS env var parsing (valid value, zero)
  • All 913 existing tests pass
  • cargo clippy --tests reports zero warnings

Summary by CodeRabbit

  • New Features

    • Performance-optimized health check caching with configurable TTL (default 5s) via TRUSS_HEALTH_CACHE_TTL_SECS; setting TTL to 0 disables caching
    • Health endpoints now return unified, richer payloads including service, version, uptime, max input pixels, and per-check details
    • Health cache available as a public component for integrations
  • Tests

    • Added tests covering TTL parsing, cache behavior, and default TTL handling

@coderabbitai bot commented Mar 13, 2026

Warning

Rate limit exceeded

@nao1215 has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 0 minutes and 4 seconds before requesting another review.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3a0e767b-f5b6-4b99-a5d4-53273c93b9a9

📥 Commits

Reviewing files that changed from the base of the PR and between 6388746 and 0614fa2.

📒 Files selected for processing (3)
  • src/adapters/server/config.rs
  • src/adapters/server/handler.rs
  • src/adapters/server/mod.rs
📝 Walkthrough

Walkthrough

Adds a lock-free HealthCache with configurable TTL, reworks /health and /health/ready to use the cache via a collect_resource_checks helper, and wires HealthCache into ServerConfig and public exports; includes TTL parsing and cache-behavior tests.

Changes

Cohort / File(s) | Summary

  • ServerConfig & env wiring (src/adapters/server/config.rs): Added pub health_cache: Arc<HealthCache> to ServerConfig; initialized in new() and from_env() using TRUSS_HEALTH_CACHE_TTL_SECS; updated Clone, Debug, and PartialEq; added env-parse and TTL tests.
  • HealthCache & handler refactor (src/adapters/server/handler.rs): Introduced HealthCache (an atomic, lock-free TTL cache for disk_free and RSS), DEFAULT_HEALTH_CACHE_TTL_SECS, disk_free and rss accessors, and collect_resource_checks(); refactored handle_health/handle_health_ready to use the cache and emit unified JSON checks and metadata.
  • Public exports & tests (src/adapters/server/mod.rs): Exported HealthCache via pub use handler::{HealthCache, ...}; added tests validating cached RSS/disk_free behavior, TTL refresh semantics, and ServerConfig default TTL wiring.
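The env wiring summarized above (TRUSS_HEALTH_CACHE_TTL_SECS, range 0–300, default 5) could be parsed along these lines. parse_ttl_secs is an illustrative stand-in for the crate's parse_env_u64_ranged helper, whose actual signature is not shown in this thread.

```rust
const DEFAULT_HEALTH_CACHE_TTL_SECS: u64 = 5;

// Hypothetical parser for the TTL env var: missing means default,
// values above 300 seconds are rejected, and 0 is accepted (disables
// caching). `raw` would come from std::env::var(...).ok().
fn parse_ttl_secs(raw: Option<&str>) -> Result<u64, String> {
    match raw {
        None => Ok(DEFAULT_HEALTH_CACHE_TTL_SECS),
        Some(s) => {
            let v: u64 = s
                .trim()
                .parse()
                .map_err(|_| format!("invalid TRUSS_HEALTH_CACHE_TTL_SECS: {s:?}"))?;
            if v > 300 {
                return Err(format!(
                    "TRUSS_HEALTH_CACHE_TTL_SECS out of range 0-300: {v}"
                ));
            }
            Ok(v)
        }
    }
}
```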

Sequence Diagram(s)

sequenceDiagram
    participant Client as Client
    participant Server as Server (actix)
    participant HealthCache as HealthCache
    participant Storage as Filesystem
    participant Proc as /proc

    Client->>Server: GET /health or /health/ready
    Server->>HealthCache: disk_free(path)?
    alt cached and fresh
        HealthCache-->>Server: cached disk_free
    else expired or missing
        HealthCache->>Storage: statvfs()/disk_free_bytes()
        Storage-->>HealthCache: disk_free_bytes
        HealthCache-->>Server: disk_free (cached)
    end
    Server->>HealthCache: rss()?
    alt cached and fresh
        HealthCache-->>Server: cached rss
    else expired or missing
        HealthCache->>Proc: read /proc/self/status (rss)
        Proc-->>HealthCache: process_rss_bytes
        HealthCache-->>Server: rss (cached)
    end
    Server->>Server: collect_resource_checks(config)
    Server-->>Client: JSON health payload (status, checks, service, version, uptime, ...)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Poem

🐰
I cached the hops of disk and RAM,
Five seconds calm — no syscall jam.
Checks line up, then quietly rest,
A rabbit's cache keeps probes impressed. 🥕

🚥 Pre-merge checks: 5 passed

  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title accurately describes the main change, caching syscall results in the health check endpoints, which directly aligns with the PR's primary objective.
  • Linked Issues Check: ✅ Passed. The implementation meets all acceptance criteria from #74: disk_free_bytes() and process_rss_bytes() are cached with a TTL, the TTL is configurable via an environment variable (0–300 s, default 5 s), and tests verify cache refresh behavior.
  • Out of Scope Changes Check: ✅ Passed. All changes are within scope: the HealthCache implementation, environment variable parsing, health endpoint refactoring with collect_resource_checks(), and removal of an unused import all align with issue #74's objectives.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which exceeds the required 80.00% threshold.


@coderabbitai bot left a comment
Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/adapters/server/handler.rs`:
- Around line 557-565: The cached timestamp/value pair must use acquire/release
ordering: in the reader path replace the Relaxed load of disk_free_at with an
Acquire load and keep the disk_free load Relaxed; in the writer keep storing the
disk_free value before publishing the timestamp but change the timestamp store
to Release (use disk_free.store(..., Relaxed) then disk_free_at.store(now,
Release)). Apply the same pattern to the rss/rss_at pair so readers load rss_at
with Acquire and writers publish rss value then rss_at with Release.
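The publish pattern the comment asks for can be shown in isolation. Pair below is a hypothetical two-field cache slot, not code from the PR: the writer stores the value first (Relaxed), then publishes the timestamp with Release; the reader loads the timestamp with Acquire before reading the value, so it never observes a fresh timestamp paired with a stale value.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Illustrative value/timestamp pair matching the review's suggested ordering.
struct Pair {
    value: AtomicU64,
    published_at: AtomicU64,
}

impl Pair {
    fn new() -> Self {
        Self {
            value: AtomicU64::new(0),
            published_at: AtomicU64::new(0),
        }
    }

    fn publish(&self, value: u64, now: u64) {
        // Store the payload first; the Release store of the timestamp
        // then acts as the publication point.
        self.value.store(value, Ordering::Relaxed);
        self.published_at.store(now, Ordering::Release);
    }

    fn read(&self) -> (u64, u64) {
        // Acquire pairs with the Release above: once we see the new
        // timestamp, the matching value store is guaranteed visible.
        let at = self.published_at.load(Ordering::Acquire);
        let value = self.value.load(Ordering::Relaxed);
        (value, at)
    }
}
```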

In `@src/adapters/server/mod.rs`:
- Around line 123-124: The import list for the server module is missing
TransformImageRequestPayload which causes unresolved type errors when the s3
feature is enabled; restore TransformImageRequestPayload in the same use block
that currently imports HealthCache, TransformSlot, TransformSourcePayload,
WatermarkSource, etc., so the s3-only tests referencing
TransformImageRequestPayload (used in the test functions guarded by
#[cfg(feature = "s3")]) can compile successfully.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 420cbf7a-9845-44ad-a021-557f94187337

📥 Commits

Reviewing files that changed from the base of the PR and between 2d86a9f and 7a0e06b.

📒 Files selected for processing (3)
  • src/adapters/server/config.rs
  • src/adapters/server/handler.rs
  • src/adapters/server/mod.rs

@coderabbitai bot left a comment
Actionable comments posted: 1

🧹 Nitpick comments (1)
src/adapters/server/handler.rs (1)

663-678: Inconsistent response format for 503 status codes.

The draining case (line 660) uses HttpResponse::problem("503 Service Unavailable", body) while resource check failures (line 676) use HttpResponse::json("503 Service Unavailable", body). Both return the same HTTP status but with potentially different Content-Type headers (application/problem+json vs application/json).

If this is intentional (draining is an RFC 7807 "problem" while resource failures are health-check format), consider adding a brief comment. Otherwise, consider using consistent response types for 503 responses.
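For contrast, here is a rough sketch of the two 503 payload shapes the nitpick distinguishes. The JSON bodies and helper names below are illustrative, since the internals of the crate's HttpResponse::problem and HttpResponse::json are not shown in this thread.

```rust
// Hypothetical body builders: an RFC 7807 "problem detail" object served
// as application/problem+json, versus a plain health-check object served
// as application/json. Each returns (body, content_type).
fn problem_body(detail: &str) -> (String, &'static str) {
    (
        format!(
            r#"{{"type":"about:blank","title":"Service Unavailable","status":503,"detail":"{detail}"}}"#
        ),
        "application/problem+json",
    )
}

fn health_body(all_ok: bool) -> (String, &'static str) {
    let status = if all_ok { "ok" } else { "unhealthy" };
    (
        format!(r#"{{"status":"{status}","checks":[]}}"#),
        "application/json",
    )
}
```

Both shapes carry the same HTTP status; the review's point is only that a client parsing 503 responses sees two different media types depending on which branch fired.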

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/adapters/server/handler.rs` around lines 663 - 678, The 503 responses use
inconsistent content types: the draining branch calls HttpResponse::problem("503
Service Unavailable", body) while the resource-check failure branch calls
HttpResponse::json("503 Service Unavailable", body); decide on a consistent
response shape and update the resource-check branch (the code using
collect_resource_checks and the all_ok variable) to match the draining branch or
vice versa, e.g., change the resource failure return to
HttpResponse::problem(...) or change the draining return to
HttpResponse::json(...), and if the difference is intentional, add a brief
comment next to the HttpResponse::problem and HttpResponse::json calls
explaining why RFC7807 problem+json is used for draining but plain
application/json is used for resource failures.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/adapters/server/config.rs`:
- Around line 1322-1324: Update the from_env rustdoc to document the new
TRUSS_HEALTH_CACHE_TTL_SECS environment variable: add an entry describing
TRUSS_HEALTH_CACHE_TTL_SECS (accepted range 0–300, default value taken from
super::handler::DEFAULT_HEALTH_CACHE_TTL_SECS) so operators know it controls the
health_cache_ttl_secs used when constructing the HealthCache via
super::handler::HealthCache::new; ensure the doc entry matches the
parse_env_u64_ranged call and notes units (seconds) and default behavior when
unset.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ebcff73a-01af-4a69-8541-e12dda39aaeb

📥 Commits

Reviewing files that changed from the base of the PR and between 7a0e06b and c1abe1e.

📒 Files selected for processing (3)
  • src/adapters/server/config.rs
  • src/adapters/server/handler.rs
  • src/adapters/server/mod.rs

@coderabbitai bot left a comment
Actionable comments posted: 1

🧹 Nitpick comments (2)
src/adapters/server/config.rs (2)

1329-1331: Add one integration test for the actual wiring.

These tests stop at parse_env_u64_ranged(). A regression in ServerConfig::from_env() or ServerConfig::new() would still pass, so please assert health_cache.ttl_nanos on the constructed config as well.

Example test
#[test]
#[serial]
fn from_env_wires_health_cache_ttl_secs() {
    let _env = ScopedEnv::set("TRUSS_HEALTH_CACHE_TTL_SECS", "10");
    let config = ServerConfig::from_env().unwrap();
    assert_eq!(config.health_cache.ttl_nanos, 10 * 1_000_000_000);
}

Also applies to: 2341-2355

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/adapters/server/config.rs` around lines 1329 - 1331, Add an integration
test that verifies the health cache TTL is actually wired into ServerConfig
instead of stopping at parse_env_u64_ranged; call ServerConfig::from_env() (or
ServerConfig::new() if used in that codepath) with a ScopedEnv setting like
"TRUSS_HEALTH_CACHE_TTL_SECS" and assert the resulting
config.health_cache.ttl_nanos equals the expected seconds * 1_000_000_000,
referencing the HealthCache constructed via super::handler::HealthCache to
ensure the field is propagated.

429-433: Expose TTL, not the cache instance.

ServerConfig::new() hardcodes the default cache TTL, but HealthCache::new is pub(super), so embedders using the public ServerConfig API cannot override this programmatically. A with_health_cache_ttl_secs() builder would keep the knob available without leaking HealthCache as public runtime state.

Possible API shape
pub fn with_health_cache_ttl_secs(mut self, ttl_secs: u64) -> Self {
    self.health_cache = Arc::new(super::handler::HealthCache::new(ttl_secs));
    self
}

Also applies to: 807-809

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/adapters/server/config.rs` around lines 429 - 433, The ServerConfig
currently exposes the HealthCache instance publicly and hardcodes its TTL in
ServerConfig::new(), preventing callers from setting TTL; make the health_cache
field non-public (remove pub) and add a builder method pub fn
with_health_cache_ttl_secs(mut self, ttl_secs: u64) -> Self that replaces
self.health_cache = Arc::new(super::handler::HealthCache::new(ttl_secs)) so
callers can set the TTL without leaking HealthCache; also ensure
ServerConfig::new initializes health_cache with the default TTL and update other
sites that relied on the public field (e.g., the usages around the other
referenced lines) to use the new API or internal accessors.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/adapters/server/handler.rs`:
- Around line 770-771: The docstring claims a "writability" probe but the
implementation only calls config.storage_root.is_dir(); either update the
comment to remove "writability" or implement an actual write probe: in the same
handler where the storage health check runs (the function currently using
config.storage_root.is_dir()), attempt a safe write-and-remove (e.g., create a
temporary file under config.storage_root with OpenOptions create+write, flush,
then delete) and return a failure if that write fails; ensure you still check
is_dir() first and handle permission errors cleanly.
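The write probe the comment suggests could look roughly like this. The helper name and probe file name are illustrative, and a production version would want a unique per-process file name to avoid collisions between concurrent health checks.

```rust
use std::fs::{self, OpenOptions};
use std::io::Write;
use std::path::Path;

// Hypothetical writability probe: verify the directory exists, then
// create, write, flush, and remove a temporary file under it. Any
// failure (permissions, read-only mount, full disk) reports unhealthy.
fn storage_writable(root: &Path) -> bool {
    if !root.is_dir() {
        return false;
    }
    let probe = root.join(".health_write_probe");
    let ok = OpenOptions::new()
        .create(true)
        .write(true)
        .truncate(true)
        .open(&probe)
        .and_then(|mut f| {
            f.write_all(b"probe")?;
            f.flush()
        })
        .is_ok();
    // Best-effort cleanup; a leftover probe file is harmless.
    let _ = fs::remove_file(&probe);
    ok
}
```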


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 22527b32-6306-4ca5-9385-a09c59b38729

📥 Commits

Reviewing files that changed from the base of the PR and between c1abe1e and 6388746.

📒 Files selected for processing (2)
  • src/adapters/server/config.rs
  • src/adapters/server/handler.rs

@nao1215 nao1215 merged commit 127617d into main Mar 13, 2026
17 checks passed
@nao1215 nao1215 deleted the feat/health-cache-syscalls branch March 13, 2026 06:53

Development

Successfully merging this pull request may close these issues:

  • Cache syscall results in health check endpoints