
Implement flock-based bitmap GPU allocator for CI parallelization #370

Merged
mawad-amd merged 32 commits into main from copilot/implement-lightweight-gpu-allocator on Feb 15, 2026

Conversation

Contributor

Copilot AI commented Feb 14, 2026

Summary

Implements a lightweight, flock-based GPU allocator for CI parallelization on our 8-GPU runner. Jobs request a variable number of GPUs (1, 2, 4, or 8) and receive a dynamic allocation, eliminating hardcoded assignments and enabling true parallel execution.

Key Features

GPU Allocator Core (.github/scripts/gpu_allocator.sh)

  • Bitmap-based allocation: 8-bit bitmap (0-255) tracks GPU state atomically via flock on /tmp/iris_gpu_state
  • Non-contiguous allocation: Any N free GPUs can be used (no contiguous requirement)
  • Out-of-order release safe: Jobs can finish in any order without deadlocks or GPU waste
  • Retry loop: 1-minute backoff, up to 3 hours total wait time (180 retries × 60s)
  • State validation: Detects and resets corrupted or out-of-range bitmap values
  • Handles set -e: Works correctly in scripts with errexit enabled
  • Diagnostic logging: Configuration and allocation status for troubleshooting
  • Throughput-oriented: First-fit scheduling (non-FIFO acceptable)

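To make the locking and retry behavior above concrete, here is a minimal sketch of the pattern; names such as try_allocate are illustrative placeholders, not the exact contents of gpu_allocator.sh:

  GPU_STATE_FILE=/tmp/iris_gpu_state
  MAX_RETRIES=180      # 180 retries x 60 s = 3 hours total wait
  SLEEP_SECONDS=60

  acquire_gpus() {     # $1 = number of GPUs requested
      local need=$1 try
      for (( try = 0; try < MAX_RETRIES; try++ )); do
          if (
              flock -x 200                    # serialize all access to the bitmap
              state=$(cat "$GPU_STATE_FILE" 2>/dev/null || echo 0)
              try_allocate "$state" "$need"   # hypothetical helper: sets the chosen bits,
                                              # writes the new bitmap, prints the GPU list;
                                              # returns non-zero if not enough GPUs are free
          ) 200>>"$GPU_STATE_FILE"; then
              return 0
          fi
          sleep "$SLEEP_SECONDS"
      done
      echo "[GPU-ALLOC] Timed out waiting for $need GPUs" >&2
      return 1
  }
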
Workflow-Level GPU Management

  • acquire_gpus.sh: Acquires GPUs and exports GPU_DEVICES to GITHUB_ENV
  • release_gpus.sh: Releases GPUs via bitmap clearing with if: always() guarantee
  • All workflows have dedicated acquire/release GPU steps with 3-hour timeouts
  • Release always runs regardless of test outcome

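A sketch of what the workflow-level acquire step looks like, assuming the sourced allocator leaves GPU_DEVICES and ALLOCATED_GPU_BITMAP set in the environment (messages and exact wiring are illustrative, not the literal acquire_gpus.sh):

  #!/bin/bash
  # acquire_gpus.sh <num_gpus> -- sketch, not the exact script
  set -euo pipefail

  source .github/scripts/gpu_allocator.sh
  acquire_gpus "${1:?usage: acquire_gpus.sh <num_gpus>}"   # blocks/retries until enough GPUs are free

  # Persist the allocation for later steps, including the always-run release step.
  {
      echo "GPU_DEVICES=${GPU_DEVICES}"
      echo "ALLOCATED_GPU_BITMAP=${ALLOCATED_GPU_BITMAP}"
  } >> "$GITHUB_ENV"

  echo "[ACQUIRE-GPUS] Allocated GPUs ${GPU_DEVICES} (mask ${ALLOCATED_GPU_BITMAP})"
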
Distributed Test Runner (tests/run_tests_distributed.py)

  • Simple worker script that runs pytest under torchrun
  • Invoked directly by torchrun (no recursive launcher pattern)
  • Eliminates EADDRINUSE port conflicts through torchrun's automatic port management

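For example, the runner launches the worker roughly like this (the invocation form appears in the commit messages further down; the test path and pytest arguments here are illustrative):

  export HIP_VISIBLE_DEVICES="$GPU_DEVICES"        # GPUs acquired at the workflow level
  torchrun --nproc_per_node=2 --standalone \
      tests/run_tests_distributed.py tests/unittests -v
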
Performance Benchmarks (examples/)

  • 3 benchmarks support both torchrun and mp.spawn modes:
    • examples/07_gemm_all_scatter/benchmark.py
    • examples/11_gemm_all_scatter_producer_consumer/benchmark.py
    • examples/12_gemm_all_scatter_bulk_synchronous/benchmark.py
  • Auto-detects execution mode via RANK and LOCAL_RANK environment variables
  • Maintains backward compatibility for standalone execution

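For illustration, the same benchmark can be launched either way; the -r rank-count flag on the standalone path is taken from the pre-torchrun invocation shown later on this page:

  # torchrun mode: RANK/LOCAL_RANK are set by torchrun
  torchrun --nproc_per_node=8 examples/07_gemm_all_scatter/benchmark.py --benchmark --validate

  # standalone mode: no torchrun env vars, so the script falls back to mp.spawn
  python examples/07_gemm_all_scatter/benchmark.py --benchmark --validate -r 8
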
External Validation Tests

  • Invoke torchrun directly (avoids nested process issues)
  • Both tests use: torchrun --nproc_per_node=2 <test_file>

Test Runner Integration

  • Reads GPU_DEVICES from environment (set by workflow)
  • Invokes torchrun directly for distributed tests
  • No internal GPU allocation logic

Workflow Simplification

  • Removed hardcoded gpu_devices from all workflow matrices
  • Removed port cleanup steps (obsolete with torchrun)
  • Jobs acquire GPUs dynamically for true parallel execution
  • All jobs have 3-hour timeouts matching GPU allocator wait time

Implementation Details

Bitmap Allocator

  • State file: 8-bit integer (0-255) representing GPU allocation bitmap
  • Bit position N = GPU N (bit 0 = GPU 0, bit 1 = GPU 1, etc.)
  • Bit value: 1 = allocated, 0 = free
  • Allocation: finds N free GPUs (bits=0), sets bits to 1
  • Release: clears bits using bitwise AND with inverse mask
  • ALLOCATED_GPU_BITMAP env var stores allocation mask for cleanup

Example bitmap states:

  • 0b00000000 (0): All GPUs free
  • 0b00000011 (3): GPUs 0,1 allocated
  • 0b00011111 (31): GPUs 0-4 allocated
  • 0b11111111 (255): All 8 GPUs allocated

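The bit arithmetic itself is straightforward; a standalone sketch of the allocate/release steps described above (the real script performs these while holding the flock):

  state=3                        # GPUs 0 and 1 already allocated (0b00000011)

  # Allocate 2 GPUs: pick the first free bits and set them.
  need=2; mask=0; devices=()
  for i in $(seq 0 7); do
      if (( (state >> i & 1) == 0 && ${#devices[@]} < need )); then
          devices+=("$i")
          mask=$(( mask | (1 << i) ))
      fi
  done
  state=$(( state | mask ))      # now 0b00001111 (15): GPUs 0-3 allocated
  echo "GPU_DEVICES=$(IFS=,; echo "${devices[*]}") ALLOCATED_GPU_BITMAP=$mask"

  # Release: clear only our bits with bitwise AND of the inverted mask.
  state=$(( state & ~mask ))     # back to 0b00000011 (3)
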
Advantages over index-based allocation:

  • Out-of-order job completion safe (no deadlocks)
  • No GPU waste from out-of-order releases
  • Immediate GPU reuse when freed
  • Better utilization with variable GPU requests

Impact

Before: Sequential job execution, static GPU assignment, port conflicts, potential deadlocks from out-of-order completion

After: Parallel execution constrained only by GPU availability. A 1-GPU job and 4-GPU job can run concurrently with proper isolation. Jobs complete in any order without blocking. All distributed components use torchrun for automatic port management.

Original prompt

This section describes the original issue you should resolve

<issue_title>Implement Lightweight GPU Allocator for CI (Supports Variable GPU Requests)</issue_title>
<issue_description>We need a lightweight GPU isolation mechanism for CI jobs running on our single-node 8-GPU machine.

Requirements:

  • Support variable GPU requests (e.g., 1, 2, 4, 8 GPUs)
  • Ensure no overlapping GPU usage
  • Maximize machine utilization
  • Avoid heavy infrastructure (e.g., Slurm)
  • Throughput-oriented scheduling (non-FIFO is acceptable)

Design:

  • Use a single global flock to protect allocation state
  • Maintain a shared state file tracking free GPUs
  • When a job requests N GPUs:
    • If enough GPUs are free → allocate immediately
    • Otherwise → release lock, sleep, and retry
  • On job exit, GPUs are released back into the pool

This approach provides deterministic isolation and high utilization without introducing a full scheduler.</issue_description>

Comments on the Issue (you are @copilot in this section)



Copilot AI and others added 3 commits February 14, 2026 02:15
Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>
Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>
…idation

Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Implement lightweight GPU allocator for CI with variable requests" to "Implement flock-based GPU allocator for CI parallelization" on Feb 14, 2026
Copilot AI requested a review from mawad-amd February 14, 2026 02:21
…ic GPUs

Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>
Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>
Contributor

Copilot AI left a comment

Pull request overview

Introduces a lightweight, flock-based GPU allocation mechanism for CI so matrix jobs can dynamically acquire the number of GPUs they need, rather than relying on hardcoded GPU assignments in the workflow.

Changes:

  • Added a shared-state GPU allocator script that blocks/retries until enough GPUs are available and exports GPU_DEVICES for HIP_VISIBLE_DEVICES.
  • Updated the test runner script to automatically acquire/release GPUs when not explicitly provided.
  • Simplified the CI workflow matrix by removing static gpu_devices entries and always letting the runner auto-allocate.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.

  • .github/workflows/iris-tests.yml: Removes hardcoded GPU device mappings from the test matrices and relies on dynamic allocation.
  • .github/scripts/run_tests.sh: Sources the allocator and acquires/releases GPUs automatically when devices aren’t provided.
  • .github/scripts/gpu_allocator.sh: Implements the flock-protected shared-state allocator and cleanup logic.

@mawad-amd
Collaborator

@copilot apply changes based on the comments in this thread

… state validation

- Remove top-level set -e to avoid side effects when sourced
- Add state file validation in init_gpu_state (handles invalid/out-of-range values)
- Fix race condition in acquire_gpus by capturing start index while holding lock
- Change out-of-order release to preserve state instead of resetting to 0 (prevents GPU overlap)
- Make cleanup trap optional and composable via enable_gpu_cleanup_trap()
- Fix GPU_DEVICES env variable handling in run_tests.sh to honor environment
- Add shellcheck disable for intentional trap expansion

Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>
Contributor Author

Copilot AI commented Feb 14, 2026

@copilot apply changes based on the comments in this thread

Applied all review feedback. Changes in commit eb6bae9:

  1. Removed set -e from sourced file (no side effects)
  2. Added state validation in init_gpu_state (detects/resets invalid/out-of-range values)
  3. Fixed race condition in acquire_gpus (captures start index atomically via temp file)
  4. Changed out-of-order release to preserve state instead of resetting to 0 (prevents GPU overlap)
  5. Made cleanup trap optional via enable_gpu_cleanup_trap() (composes with existing traps)
  6. Fixed GPU_DEVICES env variable handling in run_tests.sh (honors environment, arg overrides)

All changes tested and verified.

Changed error message from "Need GPUs 8-8 but only 0-7 available" to
"Not enough GPUs: need 1, only 0 available (next free GPU: 8)" which
is much clearer about how many GPUs are needed vs available.

Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>
mawad-amd and others added 3 commits February 14, 2026 20:42
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
… references

**Changes:**
1. Removed unused `enable_gpu_cleanup_trap()` function from gpu_allocator.sh (user request)
2. Fixed release_gpus.sh to check only variables actually set by acquire_gpus.sh:
   - Changed from checking ALLOCATED_GPU_START and ALLOCATED_GPU_COUNT
   - To checking ALLOCATED_GPU_BITMAP (the actual variable exported)
   - Removed logging of undefined variables

These changes clean up unused code and fix bugs in release_gpus.sh where
it was checking/logging variables that were never set by the allocator.

Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>
Copilot AI requested a review from mawad-amd February 15, 2026 04:45
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
**Changes:**
1. Removed `--worker-mode` dummy argument from run_tests_distributed.py
   - The argument was not handled by pytest and could cause warnings
   - Detection via environment variables (RANK/WORLD_SIZE) is sufficient

2. Set timeout-minutes to 180 (3 hours) for all workflow jobs
   - iris-tests.yml: Added timeout to build-container-image, test-git, test-editable
   - iris-performance-regression-test.yml: Updated from 20/30 to 180 minutes
   - iris-external-validation-test.yml: Added timeout to build-container-image, external-validation-test
   - Aligns with GPU allocator's 3-hour wait time for consistency

These changes ensure workflows have sufficient time to wait for GPUs
and remove unnecessary dummy arguments from the test runner.

Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>
Refactored run_tests_distributed.py to remove recursive launcher pattern:
- Removed launcher logic that parsed arguments and spawned torchrun
- Script now runs directly as a torchrun worker
- Takes pytest arguments directly from command line (no environment variable)

Updated run_tests.sh to invoke torchrun directly:
- Changed from: python run_tests_distributed.py --num_ranks N <test> <args>
- Changed to: torchrun --nproc_per_node=N --standalone run_tests_distributed.py <test> <args>

Benefits:
- Simpler, more direct execution path
- No recursive script invocation
- Easier to understand and debug
- Eliminates dummy arguments and environment variable passing

Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 12 comments.

Comment on lines +19 to +34
# torchrun sets these environment variables automatically
rank = int(os.environ.get("RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))
local_rank = int(os.environ.get("LOCAL_RANK", 0))

# Set the correct GPU for this specific process
if torch.cuda.is_available():
    torch.cuda.set_device(local_rank)

# Initialize distributed - torchrun already set up the environment
dist.init_process_group(
    backend="nccl",
    rank=rank,
    world_size=world_size,
    device_id=torch.device(f"cuda:{local_rank}") if torch.cuda.is_available() else None,
)

Copilot AI Feb 15, 2026

run_tests_distributed.py now assumes it is always launched via torchrun (env:// rendezvous). Running it directly (e.g., python tests/run_tests_distributed.py ...) will fail because MASTER_ADDR/MASTER_PORT are not set and there's no fallback. Consider adding a guard that either (a) errors with a clear message telling users to run via torchrun, or (b) retains the previous mp.spawn fallback when RANK/WORLD_SIZE/LOCAL_RANK are missing.

Comment on lines +28 to +34
# Initialize distributed - torchrun already set up the environment
dist.init_process_group(
    backend="nccl",
    rank=rank,
    world_size=world_size,
    device_id=torch.device(f"cuda:{local_rank}") if torch.cuda.is_available() else None,
)

Copilot AI Feb 15, 2026

The process-group backend is hard-coded to nccl. If this script is ever executed in an environment without CUDA/HIP available, dist.init_process_group(backend="nccl") will fail. Consider selecting backend="nccl" if torch.cuda.is_available() else "gloo" (matching the pattern used elsewhere in the repo).

Comment on lines +5 to 10
# Run Iris tests in a container with automatic GPU allocation
# Usage: run_tests.sh <test_dir> <num_ranks> [gpu_devices] [install_method]
# test_dir: subdirectory under tests/ (e.g., examples, unittests, ccl)
# num_ranks: number of GPU ranks (1, 2, 4, or 8)
# gpu_devices: comma-separated GPU device IDs (optional)
# gpu_devices: comma-separated GPU device IDs (optional, if not provided will use allocator)
# install_method: pip install method - "git", "editable", or "install" (optional, default: "editable")

Copilot AI Feb 15, 2026

The header comment says this script provides "automatic GPU allocation" / "will use allocator" when gpu_devices isn't provided, but the implementation only warns when GPU_DEVICES is empty and never calls the allocator. Suggest updating the header/usage text to reflect the actual behavior (workflow-level acquire/release), or implement the documented fallback.

Comment on lines 23 to 51
@@ -40,10 +44,9 @@ SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 cd /iris_workspace
 pip install -e .
-python examples/${EXAMPLE_PATH}/benchmark.py \
+torchrun --nproc_per_node=8 examples/${EXAMPLE_PATH}/benchmark.py \
     --benchmark \
     --validate \
-    -r 8 \
     ${BENCHMARK_ARGS} \
     --output_file perf_result.json

Copilot AI Feb 15, 2026

GPU_DEVICES can be overridden via the environment, but the benchmark always runs torchrun --nproc_per_node=8. If GPU_DEVICES is set to fewer than 8 devices, this will oversubscribe and fail. Consider deriving --nproc_per_node from the requested/available GPU count (e.g., count of comma-separated IDs), or enforce that GPU_DEVICES must contain 8 GPUs when using this script.

Comment on lines +317 to +319
world_size = int(os.environ.get("WORLD_SIZE", 1))
init_url = os.environ.get("MASTER_ADDR", "127.0.0.1") + ":" + os.environ.get("MASTER_PORT", "29500")
_worker(rank, world_size, f"tcp://{init_url}", args)

Copilot AI Feb 15, 2026

In torchrun mode, this passes rank=int(os.environ["RANK"]) into _worker, but _worker uses that parameter as local_rank for rank= and device_id=torch.device(f"cuda:{local_rank}"). That’s only correct when global rank == local rank. Please use LOCAL_RANK for the first _worker argument / device selection, and keep RANK for the global rank.

Suggested change
-world_size = int(os.environ.get("WORLD_SIZE", 1))
-init_url = os.environ.get("MASTER_ADDR", "127.0.0.1") + ":" + os.environ.get("MASTER_PORT", "29500")
-_worker(rank, world_size, f"tcp://{init_url}", args)
+local_rank = int(os.environ.get("LOCAL_RANK", 0))
+world_size = int(os.environ.get("WORLD_SIZE", 1))
+init_url = os.environ.get("MASTER_ADDR", "127.0.0.1") + ":" + os.environ.get("MASTER_PORT", "29500")
+_worker(local_rank, world_size, f"tcp://{init_url}", args)

Comment on lines +53 to +56
# Ensure the bitmap is within valid range (0-255 for 8 GPUs)
elif [ "$current_state" -lt 0 ] || [ "$current_state" -gt 255 ]; then
echo "0" > "$GPU_STATE_FILE"
echo "[GPU-ALLOC] Detected out-of-range GPU bitmap ($current_state); reset to 0" >&2

Copilot AI Feb 15, 2026

State validation hard-codes the max bitmap value to 255, but the allocator also supports overriding MAX_GPUS. If MAX_GPUS is changed, this range check becomes incorrect. Consider computing the max as (1 << MAX_GPUS) - 1 to keep validation consistent with configuration.
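
For reference, the computation the comment suggests could look like this (sketch):

  max_bitmap=$(( (1 << MAX_GPUS) - 1 ))
  if [ "$current_state" -lt 0 ] || [ "$current_state" -gt "$max_bitmap" ]; then
      echo "0" > "$GPU_STATE_FILE"
      echo "[GPU-ALLOC] Detected out-of-range GPU bitmap ($current_state); reset to 0" >&2
  fi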

Comment on lines +52 to 55
- name: Acquire GPUs
  run: |
    bash .github/scripts/cleanup_ports.sh
    bash .github/scripts/acquire_gpus.sh 2


Copilot AI Feb 15, 2026

This job acquires GPUs (sets GPU_DEVICES), but the subsequent container invocation in this workflow does not pass --gpus "$GPU_DEVICES", so the container will still see all GPUs and can overlap with other jobs. Update the container step to consume the allocated GPU_DEVICES (same pattern as external-gluon-validation-test).

Comment on lines +272 to +275
rank = int(os.environ.get("RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))
init_url = os.environ.get("MASTER_ADDR", "127.0.0.1") + ":" + os.environ.get("MASTER_PORT", "29500")
_worker(rank, world_size, f"tcp://{init_url}", args)

Copilot AI Feb 15, 2026

In torchrun mode, this uses RANK as the first argument to _worker, but _worker treats that parameter as the per-process GPU index (local_rank) and uses it for device_id. For multi-node torchrun (or any case where global rank != local rank), this will select the wrong device. Use LOCAL_RANK for device selection / _worker's first argument, and use RANK only for the global rank passed to the process group.

Suggested change
-rank = int(os.environ.get("RANK", 0))
-world_size = int(os.environ.get("WORLD_SIZE", 1))
-init_url = os.environ.get("MASTER_ADDR", "127.0.0.1") + ":" + os.environ.get("MASTER_PORT", "29500")
-_worker(rank, world_size, f"tcp://{init_url}", args)
+local_rank = int(os.environ.get("LOCAL_RANK", 0))
+world_size = int(os.environ.get("WORLD_SIZE", 1))
+init_url = os.environ.get("MASTER_ADDR", "127.0.0.1") + ":" + os.environ.get("MASTER_PORT", "29500")
+_worker(local_rank, world_size, f"tcp://{init_url}", args)

Comment on lines +40 to +43
if [ ! -f "$GPU_STATE_FILE" ]; then
    # Initialize with all GPUs free (bitmap = 0)
    echo "0" > "$GPU_STATE_FILE"
    echo "[GPU-ALLOC] Initialized GPU bitmap: 0 (all GPUs free)" >&2

Copilot AI Feb 15, 2026

The GPU state file is created in /tmp via echo "0" > "$GPU_STATE_FILE" without setting permissions. On multi-user runners this can allow unintended modification/corruption of allocator state. Consider creating it with restrictive permissions (e.g., umask 077 for the write, or chmod 600 "$GPU_STATE_FILE" after creation).
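
For reference, either option from the comment could look like this (sketch):

  ( umask 077; echo "0" > "$GPU_STATE_FILE" )   # create the state file owner-only
  chmod 600 "$GPU_STATE_FILE"                   # or tighten an existing file afterwards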

echo "[RELEASE-GPUS] No GPU allocation found, nothing to release"
exit 0
fi


Copilot AI Feb 15, 2026

release_gpus.sh proceeds even if ALLOCATED_GPU_BITMAP is missing (as long as GPU_DEVICES is set), but release_gpus uses only the bitmap mask. In that scenario, the script will log details and then effectively release nothing, leaking allocations in the shared bitmap. Consider either requiring ALLOCATED_GPU_BITMAP to be set (fail fast), or reconstructing the mask from GPU_DEVICES as a fallback.

Suggested change
# Fail fast if GPU_DEVICES is set but the bitmap mask is missing
if [ -n "$GPU_DEVICES" ] && [ -z "$ALLOCATED_GPU_BITMAP" ]; then
echo "[RELEASE-GPUS] ERROR: GPU_DEVICES is set but ALLOCATED_GPU_BITMAP is missing; cannot safely release GPUs"
exit 1
fi

mawad-amd marked this pull request as ready for review February 15, 2026 07:55
mawad-amd merged commit 75508f6 into main Feb 15, 2026
80 checks passed
mawad-amd deleted the copilot/implement-lightweight-gpu-allocator branch February 15, 2026 07:55
Copilot AI added a commit that referenced this pull request Feb 15, 2026
Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>