feat: cache syscall results in health check endpoints #135
Conversation
📝 Walkthrough

Adds a lock-free `HealthCache` with configurable TTL, reworks `/health` and `/health/ready` to use the cache via a `collect_resource_checks` helper, and wires `HealthCache` into `ServerConfig` and the public exports; includes TTL parsing and cache-behavior tests.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client as Client
    participant Server as Server (actix)
    participant HealthCache as HealthCache
    participant Storage as Filesystem
    participant Proc as /proc
    Client->>Server: GET /health or /health/ready
    Server->>HealthCache: disk_free(path)?
    alt cached and fresh
        HealthCache-->>Server: cached disk_free
    else expired or missing
        HealthCache->>Storage: statvfs()/disk_free_bytes()
        Storage-->>HealthCache: disk_free_bytes
        HealthCache-->>Server: disk_free (cached)
    end
    Server->>HealthCache: rss()?
    alt cached and fresh
        HealthCache-->>Server: cached rss
    else expired or missing
        HealthCache->>Proc: read /proc/self/status (rss)
        Proc-->>HealthCache: process_rss_bytes
        HealthCache-->>Server: rss (cached)
    end
    Server->>Server: collect_resource_checks(config)
    Server-->>Client: JSON health payload (status, checks, service, version, uptime, ...)
```
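As a rough illustration of the shared code path in the diagram, a `collect_resource_checks`-style helper can gather each probe into a uniform (name, ok) list that both endpoints consume. This is a sketch under assumed thresholds and a simplified return shape; the PR's actual helper reads its limits from `ServerConfig` and returns richer check details.

```rust
// Hypothetical signature: the real helper takes the server config and
// the HealthCache; here the already-fetched metrics are passed in
// directly to keep the example self-contained.
fn collect_resource_checks(
    disk_free_bytes: u64,
    min_disk_free_bytes: u64,
    rss_bytes: u64,
    max_rss_bytes: u64,
) -> Vec<(&'static str, bool)> {
    vec![
        // Each entry is (check name, passed?); the endpoint reports
        // 503 when any check fails.
        ("disk_free", disk_free_bytes >= min_disk_free_bytes),
        ("rss", rss_bytes <= max_rss_bytes),
    ]
}

fn main() {
    let healthy = collect_resource_checks(10_000, 1_000, 500, 2_000);
    assert!(healthy.iter().all(|&(_, ok)| ok));

    // Disk below the minimum trips the disk_free check.
    let failing = collect_resource_checks(100, 1_000, 500, 2_000);
    assert!(!failing.iter().all(|&(_, ok)| ok));
}
```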
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 5 checks passed
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/adapters/server/handler.rs`:
- Around line 557-565: The cached timestamp/value pair must use acquire/release
ordering: in the reader path replace the Relaxed load of disk_free_at with an
Acquire load and keep the disk_free load Relaxed; in the writer keep storing the
disk_free value before publishing the timestamp but change the timestamp store
to Release (use disk_free.store(..., Relaxed) then disk_free_at.store(now,
Release)). Apply the same pattern to the rss/rss_at pair so readers load rss_at
with Acquire and writers publish rss value then rss_at with Release.
In `@src/adapters/server/mod.rs`:
- Around line 123-124: The import list for the server module is missing
TransformImageRequestPayload which causes unresolved type errors when the s3
feature is enabled; restore TransformImageRequestPayload in the same use block
that currently imports HealthCache, TransformSlot, TransformSourcePayload,
WatermarkSource, etc., so the s3-only tests referencing
TransformImageRequestPayload (used in the test functions guarded by
#[cfg(feature = "s3")]) can compile successfully.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 420cbf7a-9845-44ad-a021-557f94187337
📒 Files selected for processing (3)
src/adapters/server/config.rs, src/adapters/server/handler.rs, src/adapters/server/mod.rs
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/adapters/server/handler.rs (1)
663-678: Inconsistent response format for 503 status codes. The draining case (line 660) uses `HttpResponse::problem("503 Service Unavailable", body)` while resource check failures (line 676) use `HttpResponse::json("503 Service Unavailable", body)`. Both return the same HTTP status but with potentially different `Content-Type` headers (`application/problem+json` vs `application/json`). If this is intentional (draining is an RFC 7807 "problem" while resource failures are health-check format), consider adding a brief comment. Otherwise, consider using consistent response types for 503 responses.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/adapters/server/handler.rs` around lines 663 - 678, The 503 responses use inconsistent content types: the draining branch calls HttpResponse::problem("503 Service Unavailable", body) while the resource-check failure branch calls HttpResponse::json("503 Service Unavailable", body); decide on a consistent response shape and update the resource-check branch (the code using collect_resource_checks and the all_ok variable) to match the draining branch or vice versa, e.g., change the resource failure return to HttpResponse::problem(...) or change the draining return to HttpResponse::json(...), and if the difference is intentional, add a brief comment next to the HttpResponse::problem and HttpResponse::json calls explaining why RFC7807 problem+json is used for draining but plain application/json is used for resource failures.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/adapters/server/config.rs`:
- Around line 1322-1324: Update the from_env rustdoc to document the new
TRUSS_HEALTH_CACHE_TTL_SECS environment variable: add an entry describing
TRUSS_HEALTH_CACHE_TTL_SECS (accepted range 0–300, default value taken from
super::handler::DEFAULT_HEALTH_CACHE_TTL_SECS) so operators know it controls the
health_cache_ttl_secs used when constructing the HealthCache via
super::handler::HealthCache::new; ensure the doc entry matches the
parse_env_u64_ranged call and notes units (seconds) and default behavior when
unset.
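The ranged parsing described above can be sketched as follows. The real `parse_env_u64_ranged` presumably reads the environment variable itself; taking the raw value as a parameter here keeps the example deterministic, and the function name `parse_ttl_secs` is illustrative.

```rust
// Sketch of ranged TTL parsing: unset falls back to the default,
// out-of-range values are rejected. Bounds and default (0-300, 5)
// are taken from the PR description.
fn parse_ttl_secs(raw: Option<&str>, min: u64, max: u64, default: u64) -> Result<u64, String> {
    let raw = match raw {
        None => return Ok(default), // variable unset: use default
        Some(r) => r,
    };
    let v: u64 = raw
        .trim()
        .parse()
        .map_err(|e| format!("invalid TTL: {e}"))?;
    if (min..=max).contains(&v) {
        Ok(v)
    } else {
        Err(format!("TTL must be in {min}..={max}, got {v}"))
    }
}

fn main() {
    assert_eq!(parse_ttl_secs(None, 0, 300, 5), Ok(5)); // unset -> default
    assert_eq!(parse_ttl_secs(Some("0"), 0, 300, 5), Ok(0)); // 0 disables caching
    assert_eq!(parse_ttl_secs(Some("10"), 0, 300, 5), Ok(10));
    assert!(parse_ttl_secs(Some("400"), 0, 300, 5).is_err()); // out of range
}
```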
---
Nitpick comments:
In `@src/adapters/server/handler.rs`:
- Around line 663-678: The 503 responses use inconsistent content types: the
draining branch calls HttpResponse::problem("503 Service Unavailable", body)
while the resource-check failure branch calls HttpResponse::json("503 Service
Unavailable", body); decide on a consistent response shape and update the
resource-check branch (the code using collect_resource_checks and the all_ok
variable) to match the draining branch or vice versa, e.g., change the resource
failure return to HttpResponse::problem(...) or change the draining return to
HttpResponse::json(...), and if the difference is intentional, add a brief
comment next to the HttpResponse::problem and HttpResponse::json calls
explaining why RFC7807 problem+json is used for draining but plain
application/json is used for resource failures.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: ebcff73a-01af-4a69-8541-e12dda39aaeb
📒 Files selected for processing (3)
src/adapters/server/config.rs, src/adapters/server/handler.rs, src/adapters/server/mod.rs
Actionable comments posted: 1
🧹 Nitpick comments (2)
src/adapters/server/config.rs (2)
1329-1331: Add one integration test for the actual wiring. These tests stop at `parse_env_u64_ranged()`. A regression in `ServerConfig::from_env()` or `ServerConfig::new()` would still pass, so please assert `health_cache.ttl_nanos` on the constructed config as well.

Example test:

```rust
#[test]
#[serial]
fn from_env_wires_health_cache_ttl_secs() {
    let _env = ScopedEnv::set("TRUSS_HEALTH_CACHE_TTL_SECS", "10");
    let config = ServerConfig::from_env().unwrap();
    assert_eq!(config.health_cache.ttl_nanos, 10 * 1_000_000_000);
}
```

Also applies to: 2341-2355
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/adapters/server/config.rs` around lines 1329 - 1331, Add an integration test that verifies the health cache TTL is actually wired into ServerConfig instead of stopping at parse_env_u64_ranged; call ServerConfig::from_env() (or ServerConfig::new() if used in that codepath) with a ScopedEnv setting like "TRUSS_HEALTH_CACHE_TTL_SECS" and assert the resulting config.health_cache.ttl_nanos equals the expected seconds * 1_000_000_000, referencing the HealthCache constructed via super::handler::HealthCache to ensure the field is propagated.
429-433: Expose TTL, not the cache instance. `ServerConfig::new()` hardcodes the default cache TTL, but `HealthCache::new` is `pub(super)`, so embedders using the public `ServerConfig` API cannot override this programmatically. A `with_health_cache_ttl_secs()` builder would keep the knob available without leaking `HealthCache` as public runtime state.

Possible API shape:

```rust
pub fn with_health_cache_ttl_secs(mut self, ttl_secs: u64) -> Self {
    self.health_cache = Arc::new(super::handler::HealthCache::new(ttl_secs));
    self
}
```

Also applies to: 807-809
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/adapters/server/config.rs` around lines 429 - 433, The ServerConfig currently exposes the HealthCache instance publicly and hardcodes its TTL in ServerConfig::new(), preventing callers from setting TTL; make the health_cache field non-public (remove pub) and add a builder method pub fn with_health_cache_ttl_secs(mut self, ttl_secs: u64) -> Self that replaces self.health_cache = Arc::new(super::handler::HealthCache::new(ttl_secs)) so callers can set the TTL without leaking HealthCache; also ensure ServerConfig::new initializes health_cache with the default TTL and update other sites that relied on the public field (e.g., the usages around the other referenced lines) to use the new API or internal accessors.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/adapters/server/handler.rs`:
- Around line 770-771: The docstring claims a "writability" probe but the
implementation only calls config.storage_root.is_dir(); either update the
comment to remove "writability" or implement an actual write probe: in the same
handler where the storage health check runs (the function currently using
config.storage_root.is_dir()), attempt a safe write-and-remove (e.g., create a
temporary file under config.storage_root with OpenOptions create+write, flush,
then delete) and return a failure if that write fails; ensure you still check
is_dir() first and handle permission errors cleanly.
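If the second option above is taken, a write probe could look roughly like this. The probe filename and error handling are illustrative, not from the PR:

```rust
use std::fs::{self, OpenOptions};
use std::io::Write;
use std::path::Path;

// Hypothetical writability probe: check is_dir() first, then do a
// create/write/flush round-trip and remove the probe file afterwards.
fn storage_writable(root: &Path) -> bool {
    if !root.is_dir() {
        return false;
    }
    let probe = root.join(".health_write_probe");
    let ok = OpenOptions::new()
        .create(true)
        .write(true)
        .open(&probe)
        .and_then(|mut f| {
            f.write_all(b"ok")?;
            f.flush()
        })
        .is_ok();
    let _ = fs::remove_file(&probe); // best-effort cleanup
    ok
}

fn main() {
    // The system temp dir should be writable; a missing path is not.
    assert!(storage_writable(&std::env::temp_dir()));
    assert!(!storage_writable(Path::new("/nonexistent-health-probe-dir")));
}
```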
---
Nitpick comments:
In `@src/adapters/server/config.rs`:
- Around line 1329-1331: Add an integration test that verifies the health cache
TTL is actually wired into ServerConfig instead of stopping at
parse_env_u64_ranged; call ServerConfig::from_env() (or ServerConfig::new() if
used in that codepath) with a ScopedEnv setting like
"TRUSS_HEALTH_CACHE_TTL_SECS" and assert the resulting
config.health_cache.ttl_nanos equals the expected seconds * 1_000_000_000,
referencing the HealthCache constructed via super::handler::HealthCache to
ensure the field is propagated.
- Around line 429-433: The ServerConfig currently exposes the HealthCache
instance publicly and hardcodes its TTL in ServerConfig::new(), preventing
callers from setting TTL; make the health_cache field non-public (remove pub)
and add a builder method pub fn with_health_cache_ttl_secs(mut self, ttl_secs:
u64) -> Self that replaces self.health_cache =
Arc::new(super::handler::HealthCache::new(ttl_secs)) so callers can set the TTL
without leaking HealthCache; also ensure ServerConfig::new initializes
health_cache with the default TTL and update other sites that relied on the
public field (e.g., the usages around the other referenced lines) to use the new
API or internal accessors.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 22527b32-6306-4ca5-9385-a09c59b38729
📒 Files selected for processing (2)
src/adapters/server/config.rs, src/adapters/server/handler.rs
Summary

- `HealthCache` that caches `disk_free_bytes()` and `process_rss_bytes()` results with a configurable TTL (default 5s), eliminating redundant kernel context switches under high-frequency health polling
- `TRUSS_HEALTH_CACHE_TTL_SECS` environment variable (0–300, default 5) to control cache duration; set to 0 to disable caching
- `collect_resource_checks()` helper to deduplicate ~70 lines of identical check logic between `handle_health()` and `handle_health_ready()`
- Restores the missing `TransformImageRequestPayload` import

Closes #74
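The TTL-caching behavior summarized above can be sketched with two atomics per metric. Names, the `get_or_refresh` method, and the raw nanosecond clock parameter are illustrative, not the PR's actual API:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Minimal sketch of a lock-free TTL cache for one health metric.
struct HealthCache {
    ttl_nanos: u64,
    value: AtomicU64,
    value_at: AtomicU64,
}

impl HealthCache {
    fn new(ttl_secs: u64) -> Self {
        Self {
            ttl_nanos: ttl_secs * 1_000_000_000,
            value: AtomicU64::new(0),
            value_at: AtomicU64::new(0),
        }
    }

    // Return the cached value if it is fresh; otherwise run the probe
    // (e.g. a statvfs or /proc read) and publish the result.
    fn get_or_refresh(&self, now_nanos: u64, probe: impl Fn() -> u64) -> u64 {
        let at = self.value_at.load(Ordering::Acquire);
        // ttl_nanos == 0 disables caching entirely.
        if self.ttl_nanos > 0 && at != 0 && now_nanos.saturating_sub(at) < self.ttl_nanos {
            return self.value.load(Ordering::Relaxed);
        }
        let fresh = probe();
        self.value.store(fresh, Ordering::Relaxed);
        self.value_at.store(now_nanos, Ordering::Release);
        fresh
    }
}

fn main() {
    let cache = HealthCache::new(5);
    let v1 = cache.get_or_refresh(1_000, || 100);
    let v2 = cache.get_or_refresh(2_000, || 200); // within TTL: cached
    assert_eq!((v1, v2), (100, 100));
    let v3 = cache.get_or_refresh(6_000_000_000, || 300); // expired: refresh
    assert_eq!(v3, 300);

    // TTL of 0 bypasses the cache on every call.
    let nocache = HealthCache::new(0);
    assert_eq!(nocache.get_or_refresh(1_000, || 1), 1);
    assert_eq!(nocache.get_or_refresh(1_001, || 2), 2);
}
```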
Test plan

- `HealthCache`: cached values within TTL, refresh after TTL, TTL=0 bypass, default TTL in `ServerConfig`
- `TRUSS_HEALTH_CACHE_TTL_SECS` env var parsing (valid value, zero)
- `cargo clippy --tests` reports zero warnings