Skip to content

fix: prevent unbounded memory leak during concurrent validation under…#239

Open
KyleOps wants to merge 1 commit intohapifhir:masterfrom
KyleOps:fix/validation-service-memory-leak
Open

fix: prevent unbounded memory leak during concurrent validation under…#239
KyleOps wants to merge 1 commit intohapifhir:masterfrom
KyleOps:fix/validation-service-memory-leak

Conversation

@KyleOps
Copy link

@KyleOps KyleOps commented Feb 26, 2026

This PR fixes a memory and thread exhaustion vulnerability in ValidationServiceFactoryImpl.kt.

Under high concurrency, if the JVM's free memory dips below the engineReloadThreshold, the getValidationService() method re-initializes ValidationService. Because this check was not synchronized, a burst of incoming validation requests during a low-memory state would all evaluate the condition as true simultaneously.

This resulted in an unbounded loop where the application spawned dozens of massive ValidationEngine instances and hundreds of detached background downloader threads to reload presets. The JVM quickly spiraled into maxed CPU usage, gigabytes of leaked memory, and eventual crashes.

Impact & Context

We discovered this in while utilizing the validator API in conjunction with au-fhir-inferno. Under heavy testing load, the validator-wrapper would hit the memory threshold and trigger this un-synchronized cache reset. The true impact was three-fold:

  1. Severe API Latency: The heavy I/O of re-building the cache blocks threads for several seconds. When 5 requests initialize 5 engines simultaneously, the CPU maxes out and all incoming requests hang, causing massive latency spikes (5-20s+) and upstream application timeouts.
  2. Caching "Thrashing": Under load, if the threshold is reached frequently, the server spins up a death spiral of actively deleting and downloading large FHIR packages, completely negating the benefit of an in-memory cache.
  3. Thread, File & Network Exhaustion: Even if the environment's memory can survive the spike, the background threads consume vast network bandwidth pulling .tgz files from packages.fhir.org and leak file descriptors on the host container.

Changes

  • Applied @Volatile to validationService.
  • Added a ReentrantLock (reloadLock) to ensure only one thread can lock and initialize the cache.
  • Added a 5-minute cooldown (RELOAD_COOLDOWN_MS) so the engine isn't perpetually thrashed under sustained heavy load.
  • Concurrent threads that hit the low memory state while the lock is acquired will now safely log and skip initialization, returning the existing engine and preventing the death spiral.

… load

Introduced a ReentrantLock and rate-limiting to `getValidationService()` in ValidationServiceFactoryImpl.kt. When the JVM hits the low memory threshold during high concurrency, the application previously allowed all concurrent requests to spin up parallel initialization threads, spiraling memory usage and crashing the application.

This fix ensures only a single thread can trigger the engine reload within a 5-minute cooldown period, allowing concurrent requests to queue or utilize the existing engine rather than flooding the JVM with unbounded heavy cache initialization threads.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant