@pyranota pyranota commented Jan 6, 2026

(Screenshot: Edit Script view for u_admin_sumptuous_script in Windmill)

claude bot commented Jan 6, 2026

Claude finished @pyranota's task in 2m 6s


Review: OOM Restart Feature for Python Installation

  • Review PR changes for OOM restart feature
  • Check code quality and best practices
  • Identify potential bugs or issues
  • Evaluate performance and security implications
  • Provide detailed feedback with inline comments

Summary

This PR adds the ability to detect Out-of-Memory (OOM) kills during Python package installation (uv pip install) and automatically retry with reduced concurrency. When installations fail due to OOM, the worker halves the parallel download limit and recursively retries.
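The retry strategy described here (halve the parallel download limit and recurse until the install succeeds or the limit bottoms out at 1) can be sketched in Rust. Everything below is a stand-in for illustration: `try_install`, the toy memory model, and the function names are assumptions, not Windmill's actual implementation, and the real code is async, using `Box::pin` for the recursive call.

```rust
// Hypothetical sketch of the retry-on-OOM strategy; names and the
// memory model are illustrative, not Windmill's real API.

#[derive(Debug, PartialEq)]
enum InstallError {
    Oom, // killed by signal 9 / exit code 137
    #[allow(dead_code)]
    Other(String),
}

// Pretend installer: fails with OOM whenever too many parallel
// downloads are requested for the (simulated) memory budget.
fn try_install(parallel_limit: usize, memory_budget: usize) -> Result<(), InstallError> {
    if parallel_limit * 100 > memory_budget {
        Err(InstallError::Oom)
    } else {
        Ok(())
    }
}

// Retry with halved concurrency on OOM; give up once the limit is 1.
// In async Rust this recursion would need Box::pin, because an async
// fn cannot directly await itself.
fn install_with_oom_retry(
    parallel_limit: usize,
    memory_budget: usize,
) -> Result<usize, InstallError> {
    match try_install(parallel_limit, memory_budget) {
        Ok(()) => Ok(parallel_limit),
        Err(InstallError::Oom) if parallel_limit > 1 => {
            // A real implementation would log a warning to the job
            // output here before retrying.
            install_with_oom_retry(parallel_limit / 2, memory_budget)
        }
        Err(e) => Err(e),
    }
}
```

With a budget that admits four parallel downloads, a call starting at 16 settles at 4 after two OOM retries; starting at 1, there is nothing left to halve and the error propagates.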

Overall Assessment

The feature is a good resilience improvement. The implementation is mostly correct, but it has two critical issues (both cross-platform compatibility) and one minor gap that should be addressed before merging.


Issues Found

| Severity | Issue | Location |
| --- | --- | --- |
| 🔴 Critical | Unix-only import will break Windows builds | python_executor.rs:10 |
| 🔴 Critical | `status.signal()` is Unix-only | python_executor.rs:1981 |
| 🟡 Minor | Missing exit code 137 check for containerized OOM | python_executor.rs:2085 |

See inline comments for detailed explanations and suggested fixes.
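The signal-versus-exit-code distinction behind these findings can be sketched as follows. This is a hypothetical helper with an assumed name, not Windmill's actual code: on Unix a SIGKILLed child reports signal 9 via `ExitStatusExt::signal()`, while container runtimes surface the same kill as exit code 137 (128 + 9).

```rust
use std::process::ExitStatus;
#[cfg(unix)]
use std::os::unix::process::ExitStatusExt;

// Hypothetical helper (assumed name): treat a child process as
// OOM-killed if it died from signal 9 (SIGKILL, e.g. the kernel OOM
// killer; Unix only) or exited with code 137 (128 + 9, the usual
// convention in Docker/Kubernetes).
fn looks_like_oom(status: ExitStatus) -> bool {
    #[cfg(unix)]
    {
        if status.signal() == Some(9) {
            return true;
        }
    }
    status.code() == Some(137)
}
```

Guarding the `signal()` call behind `#[cfg(unix)]` while keeping the `code()` check unconditional is what keeps such a helper compiling on Windows.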


Code Quality Observations

Positive aspects:

  • ✅ Good use of recursive retry with Box::pin() for the async function
  • ✅ Clear user-facing log message explaining the OOM situation
  • ✅ Proper cleanup of the monitor task before retry (drop(done_tx))
  • ✅ Halving the concurrency is a reasonable backoff strategy
  • ✅ Consistent error type conversion from anyhow to Error::ExitStatus

Other changes:

  • The ruby_executor.rs change is just an import formatting cleanup ✅
  • The ansible_executor.rs and worker_lockfiles.rs changes correctly pass None for the new optional parameter ✅

Test Instructions for QA

To test this feature, navigate to the Windmill app and create a new Python script with memory-intensive dependencies. Configure a worker with limited memory (e.g., 512MB) and trigger the script execution. The worker will attempt parallel package installations via uv pip install. If memory constraints cause OOM kills (exit signal 9), the installation should automatically restart with halved concurrency, displaying a prominent "IMPORTANT!" warning in the job logs. You can simulate this by setting a very high PY_CONCURRENT_DOWNLOADS value on a constrained worker, or by using packages with heavy compilation requirements. The feature is transparent to users and only activates when OOM is detected during the package installation phase.

cloudflare-workers-and-pages bot commented Jan 6, 2026

Deploying windmill with Cloudflare Pages

Latest commit: 006334c
Status: ✅  Deploy successful!
Preview URL: https://0035e5ae.windmill.pages.dev
Branch Preview URL: https://pyra-win-1604-it-should-be-p.windmill.pages.dev


@pyranota pyranota marked this pull request as draft January 6, 2026 17:02
@pyranota pyranota marked this pull request as ready for review January 7, 2026 12:42
claude bot commented Jan 7, 2026

Claude finished @pyranota's task in 1m 40s


Review: OOM Restart Feature for Python Installation

  • Read changed files and understand the implementation
  • Review code quality and best practices
  • Check for potential bugs or issues
  • Evaluate performance and security implications
  • Verify previous review comments have been addressed
  • Provide comprehensive feedback

Summary

This PR adds the ability to detect Out-of-Memory (OOM) kills during Python package installation (uv pip install) and automatically retry with reduced concurrency. When installations fail due to OOM (signal 9 or exit code 137), the worker halves the parallel download limit and recursively retries.

Previous Review Comments - Status

| Issue | Status | Notes |
| --- | --- | --- |
| 🔴 Unix-only import | Fixed | Now uses `#[cfg(unix)]` conditional compilation on lines 10-11 |
| 🔴 Unix-only `status.signal()` | Fixed | Now uses platform-specific code with `#[cfg(unix)]` and `#[cfg(not(unix))]` on lines 1997-2000 |
| 🟡 Exit code 137 check | Fixed | Line 2105 now matches on both codes, signal 9 and exit code 137 |

All previous review issues have been addressed correctly!


Code Quality Analysis

Positive aspects:

  • Proper cross-platform handling: The Unix-only imports and methods are now correctly guarded with #[cfg(unix)] and #[cfg(not(unix))]
  • Good use of recursive retry: The Box::pin() usage for the async function recursion is correct
  • Clear user-facing log message: The "IMPORTANT!" warning (lines 2155-2164) clearly explains the OOM situation to users
  • Proper cleanup: Monitor task is stopped before retry via drop(done_tx) on line 2147
  • Reasonable backoff strategy: Halving the concurrency (parallel_limit / 2) is a sensible approach
  • Consistent error handling: Both signal 9 (SIGKILL) and exit code 137 (128 + SIGKILL) are handled, covering both native and containerized OOM scenarios
  • OOM score adjustment: Setting oom_score_adj to 1000 (lines 2963-2976) makes the installation processes more likely to be killed by OOM killer rather than other system processes, which is a good practice
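The oom_score_adj adjustment noted above can be sketched roughly as follows. This is a minimal Linux-only illustration with an assumed function name, not Windmill's actual code: writing 1000 to `/proc/self/oom_score_adj` makes the kernel's OOM killer prefer this process (and children spawned afterwards, which inherit the value) over other processes on the worker.

```rust
// Hypothetical sketch (assumed name; Linux only): raise the current
// process's OOM score adjustment to the maximum (1000) so that, under
// memory pressure, the OOM killer targets the installer rather than
// other processes. Children spawned after this call inherit the value.
fn prefer_oom_kill() -> std::io::Result<()> {
    std::fs::write("/proc/self/oom_score_adj", "1000")
}
```

Raising the score requires no privileges (only lowering it does), so a call like this is safe to make before spawning the installer.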

Other changes verified:

  • ruby_executor.rs (lines 1-2): Import formatting cleanup only
  • ansible_executor.rs (line 413): Correctly passes None for the new reduced_concurrent_downloads parameter
  • worker_lockfiles.rs (line 2450): Correctly passes None for the new parameter
  • wmill_integration_test_utils.py (line 59): Unrelated timeout increase (10s → 60s) for Go/Rust compilation

Minor Observations

  1. OOM retry is Unix-only (line 2144): The cfg!(unix) check means Windows won't get automatic OOM recovery. This is reasonable since Windows handles OOM differently, but worth documenting.

  2. Minimum concurrency of 1: The retry logic only triggers when parallel_limit > 1 (line 2144), which correctly prevents infinite loops when concurrency can't be reduced further.

  3. Log message formatting: The warning message at lines 2155-2164 uses a raw string with embedded newlines and box characters, which will render nicely in the job logs.


Overall Assessment

✅ Approved for merge. All previous review issues have been properly addressed. The implementation is solid, handles cross-platform concerns correctly, and provides a useful resilience feature for memory-constrained workers.


Test Instructions for QA

To test this feature, navigate to the Windmill app and create a new Python script with memory-intensive dependencies (e.g., packages with heavy compilation like numpy, pandas, or scipy). Configure a worker with limited memory (e.g., 512MB) or set a high PY_CONCURRENT_DOWNLOADS environment variable (e.g., 20-30) to increase parallel installation load. Trigger the script execution and observe the job logs. If memory constraints cause OOM kills during package installation, you should see a prominent "IMPORTANT!" warning indicating that installation was killed by OOM and is restarting with reduced concurrency (the parallel limit should be halved). The installation should then succeed with the lower concurrency setting. This can also be tested in containerized environments (Docker/Kubernetes) where OOM kills result in exit code 137.

@rubenfiszel rubenfiszel merged commit d30ef89 into main Jan 8, 2026
10 checks passed
@rubenfiszel rubenfiszel deleted the pyra/win-1604-it-should-be-possible-to-catch-oom-for-uv-install-and-retry branch January 8, 2026 06:30
@github-actions github-actions bot locked and limited conversation to collaborators Jan 8, 2026