Implement flock-based bitmap GPU allocator for CI parallelization #370
Conversation
Pull request overview
Introduces a lightweight, flock-based GPU allocation mechanism for CI so matrix jobs can dynamically acquire the number of GPUs they need, rather than relying on hardcoded GPU assignments in the workflow.
Changes:
- Added a shared-state GPU allocator script that blocks/retries until enough GPUs are available and exports `GPU_DEVICES` for `HIP_VISIBLE_DEVICES`.
- Updated the test runner script to automatically acquire/release GPUs when not explicitly provided.
- Simplified the CI workflow matrix by removing static `gpu_devices` entries and always letting the runner auto-allocate.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| `.github/workflows/iris-tests.yml` | Removes hardcoded GPU device mappings from the test matrices and relies on dynamic allocation. |
| `.github/scripts/run_tests.sh` | Sources the allocator and acquires/releases GPUs automatically when devices aren't provided. |
| `.github/scripts/gpu_allocator.sh` | Implements the flock-protected shared-state allocator and cleanup logic. |
@copilot apply changes based on the comments in this thread
… state validation

- Remove top-level `set -e` to avoid side effects when sourced
- Add state file validation in `init_gpu_state` (handles invalid/out-of-range values)
- Fix race condition in `acquire_gpus` by capturing start index while holding lock
- Change out-of-order release to preserve state instead of resetting to 0 (prevents GPU overlap)
- Make cleanup trap optional and composable via `enable_gpu_cleanup_trap()`
- Fix `GPU_DEVICES` env variable handling in `run_tests.sh` to honor environment
- Add shellcheck disable for intentional trap expansion

Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>
Applied all review feedback; the changes are in commit eb6bae9. All changes tested and verified.
Changed the error message from "Need GPUs 8-8 but only 0-7 available" to "Not enough GPUs: need 1, only 0 available (next free GPU: 8)", which is much clearer about how many GPUs are needed versus how many are available.

Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
… references

**Changes:**
1. Removed unused `enable_gpu_cleanup_trap()` function from gpu_allocator.sh (user request)
2. Fixed release_gpus.sh to check only variables actually set by acquire_gpus.sh:
   - Changed from checking `ALLOCATED_GPU_START` and `ALLOCATED_GPU_COUNT`
   - To checking `ALLOCATED_GPU_BITMAP` (the actual variable exported)
   - Removed logging of undefined variables

These changes clean up unused code and fix bugs in release_gpus.sh where it was checking/logging variables that were never set by the allocator.

Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
**Changes:**
1. Removed `--worker-mode` dummy argument from run_tests_distributed.py
   - The argument was not handled by pytest and could cause warnings
   - Detection via environment variables (RANK/WORLD_SIZE) is sufficient
2. Set timeout-minutes to 180 (3 hours) for all workflow jobs
   - iris-tests.yml: Added timeout to build-container-image, test-git, test-editable
   - iris-performance-regression-test.yml: Updated from 20/30 to 180 minutes
   - iris-external-validation-test.yml: Added timeout to build-container-image, external-validation-test
   - Aligns with the GPU allocator's 3-hour wait time for consistency

These changes ensure workflows have sufficient time to wait for GPUs and remove unnecessary dummy arguments from the test runner.

Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>
Refactored run_tests_distributed.py to remove the recursive launcher pattern:
- Removed launcher logic that parsed arguments and spawned torchrun
- Script now runs directly as a torchrun worker
- Takes pytest arguments directly from the command line (no environment variable)

Updated run_tests.sh to invoke torchrun directly:
- Changed from: `python run_tests_distributed.py --num_ranks N <test> <args>`
- Changed to: `torchrun --nproc_per_node=N --standalone run_tests_distributed.py <test> <args>`

Benefits:
- Simpler, more direct execution path
- No recursive script invocation
- Easier to understand and debug
- Eliminates dummy arguments and environment variable passing

Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>
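For reference, a hypothetical invocation under the new scheme (the test path and pytest flags below are illustrative, not taken from this PR):

```bash
# Launch 4 torchrun workers directly; everything after the script name is
# passed through to pytest unchanged.
torchrun --nproc_per_node=4 --standalone \
  tests/run_tests_distributed.py tests/unittests -x -q
```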
```python
# torchrun sets these environment variables automatically
rank = int(os.environ.get("RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))
local_rank = int(os.environ.get("LOCAL_RANK", 0))

# Set the correct GPU for this specific process
if torch.cuda.is_available():
    torch.cuda.set_device(local_rank)

# Initialize distributed - torchrun already set up the environment
dist.init_process_group(
    backend="nccl",
    rank=rank,
    world_size=world_size,
    device_id=torch.device(f"cuda:{local_rank}") if torch.cuda.is_available() else None,
)
```
run_tests_distributed.py now assumes it is always launched via torchrun (env:// rendezvous). Running it directly (e.g., python tests/run_tests_distributed.py ...) will fail because MASTER_ADDR/MASTER_PORT are not set and there's no fallback. Consider adding a guard that either (a) errors with a clear message telling users to run via torchrun, or (b) retains the previous mp.spawn fallback when RANK/WORLD_SIZE/LOCAL_RANK are missing.
```python
# Initialize distributed - torchrun already set up the environment
dist.init_process_group(
    backend="nccl",
    rank=rank,
    world_size=world_size,
    device_id=torch.device(f"cuda:{local_rank}") if torch.cuda.is_available() else None,
)
```
The process-group backend is hard-coded to nccl. If this script is ever executed in an environment without CUDA/HIP available, dist.init_process_group(backend="nccl") will fail. Consider selecting backend="nccl" if torch.cuda.is_available() else "gloo" (matching the pattern used elsewhere in the repo).
```diff
 # Run Iris tests in a container with automatic GPU allocation
 # Usage: run_tests.sh <test_dir> <num_ranks> [gpu_devices] [install_method]
 # test_dir: subdirectory under tests/ (e.g., examples, unittests, ccl)
 # num_ranks: number of GPU ranks (1, 2, 4, or 8)
-# gpu_devices: comma-separated GPU device IDs (optional)
+# gpu_devices: comma-separated GPU device IDs (optional, if not provided will use allocator)
 # install_method: pip install method - "git", "editable", or "install" (optional, default: "editable")
```
The header comment says this script provides "automatic GPU allocation" / "will use allocator" when gpu_devices isn't provided, but the implementation only warns when GPU_DEVICES is empty and never calls the allocator. Suggest updating the header/usage text to reflect the actual behavior (workflow-level acquire/release), or implement the documented fallback.
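If the documented fallback is actually wanted, a minimal sketch of it is below; it assumes the sourced allocator exposes `acquire_gpus`/`release_gpus` and that `acquire_gpus` exports `GPU_DEVICES` on success (the function names appear in this PR, but their exact signatures are not shown here, and `NUM_RANKS` stands in for however the script names its rank count):

```bash
# Sketch only: auto-allocate when the caller did not pass gpu_devices.
if [ -z "$GPU_DEVICES" ]; then
  # shellcheck source=/dev/null
  source "$(dirname "$0")/gpu_allocator.sh"
  acquire_gpus "$NUM_RANKS"        # assumed to block until NUM_RANKS GPUs are free
  trap release_gpus EXIT           # hand the GPUs back when this script exits
fi
```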
```diff
@@ -40,10 +44,9 @@ SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
   cd /iris_workspace
   pip install -e .
-  python examples/${EXAMPLE_PATH}/benchmark.py \
+  torchrun --nproc_per_node=8 examples/${EXAMPLE_PATH}/benchmark.py \
     --benchmark \
     --validate \
-    -r 8 \
     ${BENCHMARK_ARGS} \
     --output_file perf_result.json
```
GPU_DEVICES can be overridden via the environment, but the benchmark always runs torchrun --nproc_per_node=8. If GPU_DEVICES is set to fewer than 8 devices, this will oversubscribe and fail. Consider deriving --nproc_per_node from the requested/available GPU count (e.g., count of comma-separated IDs), or enforce that GPU_DEVICES must contain 8 GPUs when using this script.
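One way to implement this, sketched under the assumption that `GPU_DEVICES` is a comma-separated list such as `0,1,2,3` (`EXAMPLE_PATH` and `BENCHMARK_ARGS` are the existing variables from the snippet above):

```bash
# Derive the rank count from the number of allocated devices instead of
# hard-coding 8, so an override with fewer GPUs does not oversubscribe.
GPU_DEVICES="${GPU_DEVICES:-0,1,2,3,4,5,6,7}"
NPROC=$(awk -F',' '{print NF}' <<< "$GPU_DEVICES")
torchrun --nproc_per_node="$NPROC" examples/${EXAMPLE_PATH}/benchmark.py \
  --benchmark --validate ${BENCHMARK_ARGS} --output_file perf_result.json
```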
```python
world_size = int(os.environ.get("WORLD_SIZE", 1))
init_url = os.environ.get("MASTER_ADDR", "127.0.0.1") + ":" + os.environ.get("MASTER_PORT", "29500")
_worker(rank, world_size, f"tcp://{init_url}", args)
```
In torchrun mode, this passes rank=int(os.environ["RANK"]) into _worker, but _worker uses that parameter as local_rank for rank= and device_id=torch.device(f"cuda:{local_rank}"). That’s only correct when global rank == local rank. Please use LOCAL_RANK for the first _worker argument / device selection, and keep RANK for the global rank.
Suggested change:

```diff
-world_size = int(os.environ.get("WORLD_SIZE", 1))
-init_url = os.environ.get("MASTER_ADDR", "127.0.0.1") + ":" + os.environ.get("MASTER_PORT", "29500")
-_worker(rank, world_size, f"tcp://{init_url}", args)
+local_rank = int(os.environ.get("LOCAL_RANK", 0))
+world_size = int(os.environ.get("WORLD_SIZE", 1))
+init_url = os.environ.get("MASTER_ADDR", "127.0.0.1") + ":" + os.environ.get("MASTER_PORT", "29500")
+_worker(local_rank, world_size, f"tcp://{init_url}", args)
```
```bash
# Ensure the bitmap is within valid range (0-255 for 8 GPUs)
elif [ "$current_state" -lt 0 ] || [ "$current_state" -gt 255 ]; then
  echo "0" > "$GPU_STATE_FILE"
  echo "[GPU-ALLOC] Detected out-of-range GPU bitmap ($current_state); reset to 0" >&2
```
State validation hard-codes the max bitmap value to 255, but the allocator also supports overriding MAX_GPUS. If MAX_GPUS is changed, this range check becomes incorrect. Consider computing the max as (1 << MAX_GPUS) - 1 to keep validation consistent with configuration.
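A sketch of that adjustment, assuming `MAX_GPUS` is already defined before the check runs:

```bash
# Derive the maximum valid bitmap from MAX_GPUS instead of hard-coding 255.
max_state=$(( (1 << MAX_GPUS) - 1 ))
if [ "$current_state" -lt 0 ] || [ "$current_state" -gt "$max_state" ]; then
  echo "0" > "$GPU_STATE_FILE"
  echo "[GPU-ALLOC] Detected out-of-range GPU bitmap ($current_state); reset to 0" >&2
fi
```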
```yaml
- name: Acquire GPUs
  run: |
    bash .github/scripts/cleanup_ports.sh
    bash .github/scripts/acquire_gpus.sh 2
```
This job acquires GPUs (sets GPU_DEVICES), but the subsequent container invocation in this workflow does not pass --gpus "$GPU_DEVICES", so the container will still see all GPUs and can overlap with other jobs. Update the container step to consume the allocated GPU_DEVICES (same pattern as external-gluon-validation-test).
```python
rank = int(os.environ.get("RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))
init_url = os.environ.get("MASTER_ADDR", "127.0.0.1") + ":" + os.environ.get("MASTER_PORT", "29500")
_worker(rank, world_size, f"tcp://{init_url}", args)
```
In torchrun mode, this uses RANK as the first argument to _worker, but _worker treats that parameter as the per-process GPU index (local_rank) and uses it for device_id. For multi-node torchrun (or any case where global rank != local rank), this will select the wrong device. Use LOCAL_RANK for device selection / _worker's first argument, and use RANK only for the global rank passed to the process group.
Suggested change:

```diff
-rank = int(os.environ.get("RANK", 0))
-world_size = int(os.environ.get("WORLD_SIZE", 1))
-init_url = os.environ.get("MASTER_ADDR", "127.0.0.1") + ":" + os.environ.get("MASTER_PORT", "29500")
-_worker(rank, world_size, f"tcp://{init_url}", args)
+local_rank = int(os.environ.get("LOCAL_RANK", 0))
+world_size = int(os.environ.get("WORLD_SIZE", 1))
+init_url = os.environ.get("MASTER_ADDR", "127.0.0.1") + ":" + os.environ.get("MASTER_PORT", "29500")
+_worker(local_rank, world_size, f"tcp://{init_url}", args)
```
```bash
if [ ! -f "$GPU_STATE_FILE" ]; then
  # Initialize with all GPUs free (bitmap = 0)
  echo "0" > "$GPU_STATE_FILE"
  echo "[GPU-ALLOC] Initialized GPU bitmap: 0 (all GPUs free)" >&2
```
The GPU state file is created in /tmp via echo "0" > "$GPU_STATE_FILE" without setting permissions. On multi-user runners this can allow unintended modification/corruption of allocator state. Consider creating it with restrictive permissions (e.g., umask 077 for the write, or chmod 600 "$GPU_STATE_FILE" after creation).
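A minimal sketch of that hardening, reusing the names from the quoted snippet:

```bash
if [ ! -f "$GPU_STATE_FILE" ]; then
  # Create the state file readable/writable only by the current user (mode 600),
  # then initialize it with all GPUs free (bitmap = 0).
  ( umask 077; echo "0" > "$GPU_STATE_FILE" )
  echo "[GPU-ALLOC] Initialized GPU bitmap: 0 (all GPUs free)" >&2
fi
```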
| echo "[RELEASE-GPUS] No GPU allocation found, nothing to release" | ||
| exit 0 | ||
| fi | ||
|
|
release_gpus.sh proceeds even if ALLOCATED_GPU_BITMAP is missing (as long as GPU_DEVICES is set), but release_gpus uses only the bitmap mask. In that scenario, the script will log details and then effectively release nothing, leaking allocations in the shared bitmap. Consider either requiring ALLOCATED_GPU_BITMAP to be set (fail fast), or reconstructing the mask from GPU_DEVICES as a fallback.
Suggested change:

```bash
# Fail fast if GPU_DEVICES is set but the bitmap mask is missing
if [ -n "$GPU_DEVICES" ] && [ -z "$ALLOCATED_GPU_BITMAP" ]; then
  echo "[RELEASE-GPUS] ERROR: GPU_DEVICES is set but ALLOCATED_GPU_BITMAP is missing; cannot safely release GPUs"
  exit 1
fi
```
Summary
Implements a lightweight, flock-based GPU allocator for CI parallelization on our 8-GPU runner. Jobs can request a variable number of GPUs (1, 2, 4, or 8), eliminating hardcoded assignments and enabling true parallel execution.
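At the workflow level, the intended pattern looks roughly like the sketch below (these run as separate workflow steps in practice, and the `2` is just an example request size):

```bash
# Step 1: block until 2 GPUs are free; the script exports GPU_DEVICES
# (and ALLOCATED_GPU_BITMAP) via GITHUB_ENV for the later steps.
bash .github/scripts/acquire_gpus.sh 2

# Step 2: run the tests pinned to the allocated devices.
bash .github/scripts/run_tests.sh unittests 2 "$GPU_DEVICES"

# Step 3: always release, even on failure (the workflow uses `if: always()` for this step).
bash .github/scripts/release_gpus.sh
```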
Key Features
**GPU Allocator Core** (`.github/scripts/gpu_allocator.sh`)
- Shared state protected by `flock` on `/tmp/iris_gpu_state`
- Safe under `set -e`: works correctly in scripts with errexit enabled

**Workflow-Level GPU Management**
- Exports `GPU_DEVICES` to `GITHUB_ENV`
- Release is covered by an `if: always()` guarantee

**Distributed Test Runner** (`tests/run_tests_distributed.py`)
- Avoids `EADDRINUSE` port conflicts through torchrun's automatic port management

**Performance Benchmarks** (`examples/`)
- Support both `torchrun` and `mp.spawn` modes:
  - `examples/07_gemm_all_scatter/benchmark.py`
  - `examples/11_gemm_all_scatter_producer_consumer/benchmark.py`
  - `examples/12_gemm_all_scatter_bulk_synchronous/benchmark.py`

**External Validation Tests**
- Run via `torchrun --nproc_per_node=2 <test_file>`

**Test Runner Integration**
- Reads `GPU_DEVICES` from the environment (set by the workflow)

**Workflow Simplification**
- Removed `gpu_devices` from all workflow matrices

Implementation Details
**Bitmap Allocator**
- Each bit in the shared bitmap represents one GPU; the `ALLOCATED_GPU_BITMAP` env var stores the allocation mask for cleanup.

Example bitmap states:
- `0b00000000` (0): All GPUs free
- `0b00000011` (3): GPUs 0 and 1 allocated
- `0b00011111` (31): GPUs 0-4 allocated
- `0b11111111` (255): All 8 GPUs allocated

Advantages over index-based allocation: jobs can release in any order, since each one clears only its own mask (see the sketch below).
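To make the encoding concrete, here is a minimal bash sketch of the bitmap arithmetic for an 8-GPU box; the real allocator performs this read-modify-write under `flock`, which is omitted here for brevity:

```bash
count=2                                    # GPUs requested by this job
state=$(cat /tmp/iris_gpu_state)           # shared bitmap, e.g. 3 (0b00000011)

# Find the first contiguous run of free bits large enough for the request.
for ((start = 0; start + count <= 8; start++)); do
  mask=$(( ((1 << count) - 1) << start ))  # e.g. start=2, count=2 -> 0b00001100
  if (( (state & mask) == 0 )); then
    echo $(( state | mask )) > /tmp/iris_gpu_state            # mark the GPUs as taken
    export ALLOCATED_GPU_BITMAP=$mask
    GPU_DEVICES=$(seq -s, "$start" $(( start + count - 1 )))  # e.g. "2,3"
    export GPU_DEVICES
    break
  fi
done

# Release clears only this job's mask, so jobs can finish in any order:
#   new_state=$(( state & ~ALLOCATED_GPU_BITMAP ))
```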
Impact
Before: Sequential job execution, static GPU assignment, port conflicts, potential deadlocks from out-of-order completion.
After: Parallel execution constrained only by GPU availability. A 1-GPU job and 4-GPU job can run concurrently with proper isolation. Jobs complete in any order without blocking. All distributed components use torchrun for automatic port management.