Set device according to the local rank #1047
yangulei wants to merge 2 commits into vllm-project:main
Conversation
### Motivation

For a typical node with 8x Gaudi2E HPUs, the devices are split into two groups, each with 4 HPUs connected to a top board. The current random mapping between `local_rank` and `module_id` causes HCCL failures for `world_size > 4`.

### Changes

- Set the device according to the local rank.
- Use `pyhlml` to set `HABANA_VISIBLE_MODULES` to the available modules. This is necessary when multiple runs with `world_size=1/2/4` want to share the same node simultaneously, or when the available `module_ids` do not start with 0.

---------

Signed-off-by: Youlei Yang <youlei.yang@intel.com>
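The selection rule described in the changes can be sketched as follows. This is a minimal, hedged illustration, not the PR's actual code: `configure_visible_modules` is a hypothetical helper, and `module_utilization` stands in for the pyhlml utilization queries the real change performs.

```python
import os


def configure_visible_modules(module_utilization, world_size):
    """Expose only idle modules via HABANA_VISIBLE_MODULES.

    `module_utilization` maps module_id -> (aip_util, mem_util); it is a
    hypothetical stand-in for the pyhlml queries used by the actual change.
    """
    # A module is considered available when both its compute (aip) and
    # memory utilization are zero.
    idle = sorted(mid for mid, (aip, mem) in module_utilization.items()
                  if aip == 0 and mem == 0)
    if len(idle) < world_size:
        raise RuntimeError(
            f"Not enough available modules for world_size={world_size}.")
    os.environ["HABANA_VISIBLE_MODULES"] = ",".join(map(str, idle))
    return os.environ["HABANA_VISIBLE_MODULES"]
```

With the variable set this way, each worker's `set_device(local_rank)` maps rank `i` to the `i`-th idle module, regardless of whether the idle module IDs start at 0.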
@iboiko-habana
Pull request overview
This pull request reimplements the device selection logic for Habana Gaudi HPUs to properly map local ranks to device modules, fixing a runtime error from PR #946 that occurred when reverting PR #788. The changes ensure that multiple vLLM worker processes on the same node correctly select different HPU devices based on their local rank.
Changes:
- Added `_configure_habana_visible_modules()` method to dynamically configure HABANA_VISIBLE_MODULES based on available devices
- Modified `init_device()` to call `torch.hpu.set_device(self.local_rank)` for explicit device selection
- Updated shell scripts to support hl-smi for device enumeration and set HABANA_VISIBLE_MODULES with device distribution
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| vllm_gaudi/v1/worker/hpu_worker.py | Adds device configuration logic to query available Habana modules via pyhlml and set HABANA_VISIBLE_MODULES before device initialization; adds explicit device selection via torch.hpu.set_device(local_rank) |
| tests/unit_tests/run_accuracy_test.sh | Adds hl-smi support, enables dynamic GPU_ID calculation, and sets HABANA_VISIBLE_MODULES to segregate prefill (modules 0-3) and decode (modules 4-7) instances |
| examples/nixl/run_hpu_disagg_accuracy_test.sh | Adds hl-smi support, enables dynamic GPU_ID calculation, and sets HABANA_VISIBLE_MODULES for disaggregated prefill/decode setup |
| examples/nixl/run_benchmark_test.sh | Adds hl-smi support, enables dynamic GPU_ID calculation, and sets HABANA_VISIBLE_MODULES for disaggregated prefill/decode setup |
| examples/nixl/run_benchmark_profile.sh | Adds hl-smi support, enables dynamic GPU_ID calculation, and sets HABANA_VISIBLE_MODULES for disaggregated prefill/decode setup |
| examples/nixl/run_accuracy_test.sh | Adds hl-smi support, enables dynamic GPU_ID calculation, and sets HABANA_VISIBLE_MODULES for disaggregated prefill/decode setup |
Comments suppressed due to low confidence (19)
examples/nixl/run_accuracy_test.sh:106
- The RANK environment variable is set to GPU_ID, which is calculated as i % get_num_gpus(). However, RANK is typically used for distributed training to identify the global rank of a process, and setting it to GPU_ID may not be appropriate. In the Python code, the worker's init receives both rank and local_rank as separate parameters, and at line 202 of hpu_worker.py, torch.hpu.set_device(self.local_rank) is called. Setting RANK to GPU_ID in the shell script may cause confusion between the global rank and local rank concepts. Consider whether RANK should be set differently or if the intended value is actually local_rank.
BASE_CMD="HABANA_VISIBLE_MODULES='0,1,2,3' RANK=$GPU_ID UCX_TLS=tcp VLLM_NIXL_SIDE_CHANNEL_PORT=$SIDE_CHANNEL_PORT vllm serve $model_name \
vllm_gaudi/v1/worker/hpu_worker.py:114
- The method imports pyhlml and calls pyhlml.hlmlInit() but doesn't handle the case where pyhlml might not be installed or fails to initialize. If pyhlml is not available or hlmlInit() fails, it would raise an exception that would propagate up and prevent the worker from initializing. Consider adding appropriate error handling with a clear error message indicating that pyhlml is required for device configuration.
import pyhlml
pyhlml.hlmlInit()
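One way to address this (a sketch, not the PR's actual code) is to wrap the import and initialization and re-raise with an actionable message; `init_hlml` is a hypothetical helper name:

```python
def init_hlml():
    """Import and initialize pyhlml, failing with a clear message.

    Hypothetical helper: wraps the bare `import pyhlml` / `hlmlInit()`
    calls so a missing or broken installation surfaces an actionable
    error instead of an opaque traceback during worker init.
    """
    try:
        import pyhlml
    except ImportError as e:
        raise RuntimeError(
            "pyhlml is required to auto-configure HABANA_VISIBLE_MODULES; "
            "install it or set HABANA_VISIBLE_MODULES manually.") from e
    try:
        pyhlml.hlmlInit()
    except Exception as e:
        raise RuntimeError(f"Failed to initialize HLML: {e}") from e
    return pyhlml
```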
vllm_gaudi/v1/worker/hpu_worker.py:101
- Spelling error in the docstring. The word "avalible" should be spelled "available".
The first avalible HPU (with minimum module ID) is assigned to the current worker if `set_device()` is not
vllm_gaudi/v1/worker/hpu_worker.py:140
- The validation logic for HABANA_VISIBLE_MODULES assumes that comma-separated values will always have valid digits, but it doesn't handle whitespace around commas. For example, "0, 1, 2, 3" would fail validation because "c.isdigit()" would return False for " 1". Consider stripping whitespace from each part before validation.
if not all(c.isdigit() for c in env_visible_modules.split(",")):
examples/nixl/run_hpu_disagg_accuracy_test.sh:130
- The RANK environment variable is set to GPU_ID, which is calculated as (i + NUM_PREFILL_INSTANCES) % get_num_gpus(). However, RANK is typically used for distributed training to identify the global rank of a process, and setting it to GPU_ID may not be appropriate. In the Python code, the worker's init receives both rank and local_rank as separate parameters, and at line 202 of hpu_worker.py, torch.hpu.set_device(self.local_rank) is called. Setting RANK to GPU_ID in the shell script may cause confusion between the global rank and local rank concepts. Consider whether RANK should be set differently or if the intended value is actually local_rank.
BASE_CMD="HABANA_VISIBLE_MODULES='4,5,6,7' RANK=$GPU_ID VLLM_NIXL_SIDE_CHANNEL_PORT=$SIDE_CHANNEL_PORT vllm serve $model_name \
examples/nixl/run_benchmark_test.sh:140
- The RANK environment variable is set to GPU_ID, which is calculated as (i + NUM_PREFILL_INSTANCES) % get_num_gpus(). However, RANK is typically used for distributed training to identify the global rank of a process, and setting it to GPU_ID may not be appropriate. In the Python code, the worker's init receives both rank and local_rank as separate parameters, and at line 202 of hpu_worker.py, torch.hpu.set_device(self.local_rank) is called. Setting RANK to GPU_ID in the shell script may cause confusion between the global rank and local rank concepts. Consider whether RANK should be set differently or if the intended value is actually local_rank.
BASE_CMD="HABANA_VISIBLE_MODULES='4,5,6,7' RANK=$GPU_ID UCX_TLS=rc,ud,ib VLLM_NIXL_SIDE_CHANNEL_PORT=$SIDE_CHANNEL_PORT vllm serve $model_name \
vllm_gaudi/v1/worker/hpu_worker.py:138
- The method queries torch.hpu.device_count() before setting HABANA_VISIBLE_MODULES. This is actually correct behavior since we need to enumerate all devices to determine which ones are available. However, there's a potential issue: after setting HABANA_VISIBLE_MODULES to a subset of modules, subsequent calls to torch.hpu.device_count() will return the count of visible modules, not all modules. Later at line 202, torch.hpu.set_device(self.local_rank) is called. Since HABANA_VISIBLE_MODULES remaps device indices, if local_rank is within the count of visible modules, this should work. However, the logic assumes the environment variable takes effect immediately and that the HPU runtime respects it. According to PR #946, the original implementation had issues with "Module IDs should be between 0 and 0" errors, which the PR description says is fixed by removing range checking. Ensure that setting HABANA_VISIBLE_MODULES before any device initialization actually takes effect for subsequent torch.hpu operations.
device_count = torch.hpu.device_count()
if device_count < 1:
    raise RuntimeError("No Habana devices found.")
for i in range(device_count):
    try:
        device = pyhlml.hlmlDeviceGetHandleByIndex(i)
        utility = pyhlml.hlmlDeviceGetUtilizationRates(device)
        if utility.aip == 0 and utility.memory == 0:
            module_id = pyhlml.hlmlDeviceGetModuleID(device)
            available_module_ids.append(module_id)
    except Exception:
        continue
if len(available_module_ids) < 1:
    raise RuntimeError("No available Habana modules found. All modules are currently in use.")
env_visible_modules = os.getenv("HABANA_VISIBLE_MODULES")
if env_visible_modules is None:
    if len(available_module_ids) < self.parallel_config.world_size:
        raise RuntimeError(
            f"Not enough available modules for world_size={self.parallel_config.world_size}.")
    available_modules_str = ",".join(map(str, sorted(available_module_ids)))
    logger.info("HABANA_VISIBLE_MODULES is not set, using all available modules: %s", available_modules_str)
    os.environ["HABANA_VISIBLE_MODULES"] = available_modules_str
vllm_gaudi/v1/worker/hpu_worker.py:105
- Grammar issue in the docstring. The phrase "the HABANA_VISIBLE_MODULES environment variable need to be set" should use "needs" instead of "need" to match the singular subject "variable".
`HABANA_VISIBLE_MODULES` environment variable need to be set to include only the available modules explicitly
vllm_gaudi/v1/worker/hpu_worker.py:143
- The validation logic doesn't handle empty strings within the comma-separated list. For example, "0,1,,2" would pass the digit check but would fail when trying to convert an empty string to an integer on line 143. Consider adding validation to filter out empty strings after splitting.
if not all(c.isdigit() for c in env_visible_modules.split(",")):
    raise RuntimeError(f"Invalid HABANA_VISIBLE_MODULES={env_visible_modules}. "
                       "It should be a comma-separated list of integers.")
env_module_ids = list(map(int, env_visible_modules.split(",")))
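Both parsing issues raised above (whitespace around commas and empty entries) can be handled by stripping each part before validation. A minimal sketch, with `parse_visible_modules` as a hypothetical helper name:

```python
def parse_visible_modules(value):
    """Parse a HABANA_VISIBLE_MODULES-style string.

    Tolerates spaces around commas ("0, 1, 2") and rejects empty
    entries ("0,1,,2").
    """
    parts = [p.strip() for p in value.split(",")]
    # "".isdigit() is False, so empty entries fail this check too.
    if not all(p.isdigit() for p in parts):
        raise RuntimeError(f"Invalid HABANA_VISIBLE_MODULES={value!r}. "
                           "It should be a comma-separated list of integers.")
    return [int(p) for p in parts]
```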
examples/nixl/run_benchmark_test.sh:106
- The RANK environment variable is set to GPU_ID, which is calculated as i % get_num_gpus(). However, RANK is typically used for distributed training to identify the global rank of a process, and setting it to GPU_ID may not be appropriate. In the Python code, the worker's init receives both rank and local_rank as separate parameters, and at line 202 of hpu_worker.py, torch.hpu.set_device(self.local_rank) is called. Setting RANK to GPU_ID in the shell script may cause confusion between the global rank and local rank concepts. Consider whether RANK should be set differently or if the intended value is actually local_rank.
BASE_CMD="HABANA_VISIBLE_MODULES='0,1,2,3' RANK=$GPU_ID UCX_TLS=rc,ud,ib VLLM_NIXL_SIDE_CHANNEL_PORT=$SIDE_CHANNEL_PORT vllm serve $model_name \
vllm_gaudi/v1/worker/hpu_worker.py:128
- The method catches all exceptions when querying device information and silently continues. While this provides robustness, it may hide important errors that should be reported. Consider logging the exception at a debug or warning level so that issues with specific devices are visible during troubleshooting, rather than silently skipping them.
except Exception:
    continue
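A sketch of the enumeration loop with the failure logged rather than swallowed. Here `probe(i)` is a hypothetical stand-in for the pyhlml handle/utilization/module-ID queries, which keeps the pattern illustrable without hardware:

```python
import logging

logger = logging.getLogger("hpu_worker")


def collect_idle_modules(device_count, probe):
    """Enumerate devices, logging and skipping per-device failures.

    `probe(i)` returns the module ID of device i if idle, None if busy,
    or raises on error (a stand-in for the pyhlml calls).
    """
    available = []
    for i in range(device_count):
        try:
            module_id = probe(i)
        except Exception as e:
            # Visible during troubleshooting instead of a silent `continue`.
            logger.warning("Skipping device index %d: %s", i, e)
            continue
        if module_id is not None:
            available.append(module_id)
    return available
```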
tests/unit_tests/run_accuracy_test.sh:124
- The RANK environment variable is set to GPU_ID, which is calculated as i % get_num_gpus(). However, RANK is typically used for distributed training to identify the global rank of a process, and setting it to GPU_ID may not be appropriate. In the Python code, the worker's init receives both rank and local_rank as separate parameters, and at line 202, torch.hpu.set_device(self.local_rank) is called. Setting RANK to GPU_ID in the shell script may cause confusion between the global rank and local rank concepts. Consider whether RANK should be set differently or if the intended value is actually local_rank.
BASE_CMD="HABANA_VISIBLE_MODULES='0,1,2,3' RANK=$GPU_ID UCX_TLS=$UCX_TLS VLLM_NIXL_SIDE_CHANNEL_PORT=$SIDE_CHANNEL_PORT vllm serve $model_name \
tests/unit_tests/run_accuracy_test.sh:157
- The RANK environment variable is set to GPU_ID, which is calculated as i % get_num_gpus(). However, RANK is typically used for distributed training to identify the global rank of a process, and setting it to GPU_ID may not be appropriate. In the Python code, the worker's init receives both rank and local_rank as separate parameters, and at line 202 of hpu_worker.py, torch.hpu.set_device(self.local_rank) is called. Setting RANK to GPU_ID in the shell script may cause confusion between the global rank and local rank concepts. Consider whether RANK should be set differently or if the intended value is actually local_rank.
BASE_CMD="HABANA_VISIBLE_MODULES='4,5,6,7' RANK=$GPU_ID UCX_TLS=$UCX_TLS VLLM_NIXL_SIDE_CHANNEL_PORT=$SIDE_CHANNEL_PORT vllm serve $model_name \
examples/nixl/run_accuracy_test.sh:139
- The RANK environment variable is set to GPU_ID, which is calculated as (i + NUM_PREFILL_INSTANCES) % get_num_gpus(). However, RANK is typically used for distributed training to identify the global rank of a process, and setting it to GPU_ID may not be appropriate. In the Python code, the worker's init receives both rank and local_rank as separate parameters, and at line 202 of hpu_worker.py, torch.hpu.set_device(self.local_rank) is called. Setting RANK to GPU_ID in the shell script may cause confusion between the global rank and local rank concepts. Consider whether RANK should be set differently or if the intended value is actually local_rank.
BASE_CMD="HABANA_VISIBLE_MODULES='4,5,6,7' RANK=$GPU_ID UCX_TLS=tcp VLLM_NIXL_SIDE_CHANNEL_PORT=$SIDE_CHANNEL_PORT vllm serve $model_name \
vllm_gaudi/v1/worker/hpu_worker.py:106
- The docstring mentions that the method needs to be called "before initializing the device," but it doesn't clearly specify what "initializing the device" means in this context. Since the method itself calls torch.hpu.device_count(), which is a device operation, it would be helpful to clarify that this method should be called before any device selection or allocation operations (like torch.hpu.set_device or model loading), but may need to query device information as part of its operation.
'''
The first avalible HPU (with minimum module ID) is assigned to the current worker if `set_device()` is not
called or `set_device("hpu")` is called. This allows the auto device selection for multiple processes on the
same node. While vLLM spawns multiple worker processes on the same node, each worker needs to select a
different HPU device based on its local rank by calling `set_device(local_rank)`. To achieve this, the
`HABANA_VISIBLE_MODULES` environment variable need to be set to include only the available modules explicitly
before initializing the device.
vllm_gaudi/v1/worker/hpu_worker.py:145
- The error message uses "device" where it should probably say "devices". The message "Some device for HABANA_VISIBLE_MODULES=%s are not available" should be "Some devices for HABANA_VISIBLE_MODULES=%s are not available" to match the plural "are".
logger.warning("Some device for HABANA_VISIBLE_MODULES=%s are not available.", env_visible_modules)
examples/nixl/run_hpu_disagg_accuracy_test.sh:97
- The RANK environment variable is set to GPU_ID, which is calculated as i % get_num_gpus(). However, RANK is typically used for distributed training to identify the global rank of a process, and setting it to GPU_ID may not be appropriate. In the Python code, the worker's init receives both rank and local_rank as separate parameters, and at line 202 of hpu_worker.py, torch.hpu.set_device(self.local_rank) is called. Setting RANK to GPU_ID in the shell script may cause confusion between the global rank and local rank concepts. Consider whether RANK should be set differently or if the intended value is actually local_rank.
BASE_CMD="HABANA_VISIBLE_MODULES='0,1,2,3' RANK=$GPU_ID VLLM_NIXL_SIDE_CHANNEL_PORT=$SIDE_CHANNEL_PORT vllm serve $model_name \
examples/nixl/run_benchmark_profile.sh:109
- The RANK environment variable is set to GPU_ID, which is calculated as i % get_num_gpus(). However, RANK is typically used for distributed training to identify the global rank of a process, and setting it to GPU_ID may not be appropriate. In the Python code, the worker's init receives both rank and local_rank as separate parameters, and at line 202 of hpu_worker.py, torch.hpu.set_device(self.local_rank) is called. Setting RANK to GPU_ID in the shell script may cause confusion between the global rank and local rank concepts. Consider whether RANK should be set differently or if the intended value is actually local_rank.
BASE_CMD="HABANA_VISIBLE_MODULES='0,1,2,3' RANK=$GPU_ID UCX_TLS=rc,ud,ib VLLM_NIXL_SIDE_CHANNEL_PORT=$SIDE_CHANNEL_PORT vllm serve $model_name \
examples/nixl/run_benchmark_profile.sh:143
- The RANK environment variable is set to GPU_ID, which is calculated as (i + NUM_PREFILL_INSTANCES) % get_num_gpus(). However, RANK is typically used for distributed training to identify the global rank of a process, and setting it to GPU_ID may not be appropriate. In the Python code, the worker's init receives both rank and local_rank as separate parameters, and at line 202 of hpu_worker.py, torch.hpu.set_device(self.local_rank) is called. Setting RANK to GPU_ID in the shell script may cause confusion between the global rank and local rank concepts. Consider whether RANK should be set differently or if the intended value is actually local_rank.
BASE_CMD="HABANA_VISIBLE_MODULES='4,5,6,7' RANK=$GPU_ID UCX_TLS=rc,ud,ib VLLM_NIXL_SIDE_CHANNEL_PORT=$SIDE_CHANNEL_PORT vllm serve $model_name \
# Select available Habana modules before initializing the device.
self._configure_habana_visible_modules()

def _configure_habana_visible_modules(self):
Please add UTs for this, for instance a test that verifies what was failing in the original PR (1, 2, 4, 8 devices).
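A sketch of such a unit test, parameterized over 1/2/4/8 devices. The `visible_modules` helper below is a simplified, hypothetical stand-in for `_configure_habana_visible_modules` followed by `set_device(local_rank)`, under the assumption that rank `i` selects index `i` of the visible-module list:

```python
def visible_modules(available, world_size):
    # Simplified stand-in: expose the sorted available modules.
    if len(available) < world_size:
        raise RuntimeError("Not enough available modules.")
    return ",".join(map(str, sorted(available)))


def test_each_rank_gets_a_distinct_module():
    available = [3, 1, 7, 5, 0, 6, 2, 4]  # arbitrary discovery order
    for world_size in (1, 2, 4, 8):
        modules = visible_modules(available, world_size).split(",")
        # set_device(local_rank) picks index local_rank within the list.
        selected = [modules[rank] for rank in range(world_size)]
        assert len(set(selected)) == world_size  # one distinct device per rank


test_each_rank_gets_a_distinct_module()
```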
Reimplement #788 and fix the runtime error reported in #946 by removing the range checking for module IDs.