
Set device according to the local rank#1047

Open
yangulei wants to merge 2 commits into vllm-project:main from yangulei:set_device_main

Conversation

@yangulei
Collaborator

Reimplement #788 and fix the runtime error reported in #946 by removing the range check for module IDs.

### Motivation
For a typical node with 8x Gaudi2E HPUs, the devices are split into two
groups of 4 HPUs, each group connected by a top board. The current random
mapping between `local_rank` and `module_id` causes HCCL failures for
`world_size > 4` cases.

### Changes
- Set device according to local rank.
- Use `pyhlml` to set `HABANA_VISIBLE_MODULES` to available modules.
This is necessary when multiple instances with `world_size=1/2/4` run
on the same node simultaneously, or when the available `module_ids` do
not start at 0.
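
The mapping described above can be sketched in plain Python. This is illustrative only: the real implementation lives in `hpu_worker.py` and queries devices via `pyhlml`/`torch.hpu`; the function names here are hypothetical.

```python
# Sketch of the rank-to-module mapping this PR introduces (illustrative only).

def visible_modules_for(available_module_ids, world_size):
    """Build the HABANA_VISIBLE_MODULES value from the free module IDs."""
    if len(available_module_ids) < world_size:
        raise RuntimeError(
            f"Not enough available modules for world_size={world_size}.")
    # Sort so that local_rank 0 gets the minimum module ID, rank 1 the next, etc.
    return ",".join(map(str, sorted(available_module_ids)))

def device_index_for(local_rank, world_size):
    # Once HABANA_VISIBLE_MODULES is set, the runtime remaps visible devices
    # to indices 0..N-1, so each worker simply selects its local rank.
    assert 0 <= local_rank < world_size
    return local_rank
```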

---------

Signed-off-by: Youlei Yang <youlei.yang@intel.com>
@yangulei
Collaborator Author

@iboiko-habana
Please help review and check whether this PR fixes the issue mentioned in #946.

Contributor

Copilot AI left a comment


Pull request overview

This pull request reimplements the device selection logic for Habana Gaudi HPUs to properly map local ranks to device modules, fixing a runtime error from PR #946 that occurred when reverting PR #788. The changes ensure that multiple vLLM worker processes on the same node correctly select different HPU devices based on their local rank.

Changes:

  • Added _configure_habana_visible_modules() method to dynamically configure HABANA_VISIBLE_MODULES based on available devices
  • Modified init_device() to call torch.hpu.set_device(self.local_rank) for explicit device selection
  • Updated shell scripts to support hl-smi for device enumeration and set HABANA_VISIBLE_MODULES with device distribution

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

Summary per file:

  • vllm_gaudi/v1/worker/hpu_worker.py: Adds device configuration logic to query available Habana modules via pyhlml and set HABANA_VISIBLE_MODULES before device initialization; adds explicit device selection via torch.hpu.set_device(local_rank)
  • tests/unit_tests/run_accuracy_test.sh: Adds hl-smi support, enables dynamic GPU_ID calculation, and sets HABANA_VISIBLE_MODULES to segregate prefill (modules 0-3) and decode (modules 4-7) instances
  • examples/nixl/run_hpu_disagg_accuracy_test.sh: Adds hl-smi support, enables dynamic GPU_ID calculation, and sets HABANA_VISIBLE_MODULES for disaggregated prefill/decode setup
  • examples/nixl/run_benchmark_test.sh: Adds hl-smi support, enables dynamic GPU_ID calculation, and sets HABANA_VISIBLE_MODULES for disaggregated prefill/decode setup
  • examples/nixl/run_benchmark_profile.sh: Adds hl-smi support, enables dynamic GPU_ID calculation, and sets HABANA_VISIBLE_MODULES for disaggregated prefill/decode setup
  • examples/nixl/run_accuracy_test.sh: Adds hl-smi support, enables dynamic GPU_ID calculation, and sets HABANA_VISIBLE_MODULES for disaggregated prefill/decode setup
Comments suppressed due to low confidence (19)

examples/nixl/run_accuracy_test.sh:106

  • The RANK environment variable is set to GPU_ID, which is calculated as i % get_num_gpus(). However, RANK is typically used for distributed training to identify the global rank of a process, and setting it to GPU_ID may not be appropriate. In the Python code, the worker's init receives both rank and local_rank as separate parameters, and at line 202 of hpu_worker.py, torch.hpu.set_device(self.local_rank) is called. Setting RANK to GPU_ID in the shell script may cause confusion between the global rank and local rank concepts. Consider whether RANK should be set differently or if the intended value is actually local_rank.
    BASE_CMD="HABANA_VISIBLE_MODULES='0,1,2,3' RANK=$GPU_ID UCX_TLS=tcp VLLM_NIXL_SIDE_CHANNEL_PORT=$SIDE_CHANNEL_PORT vllm serve $model_name \

vllm_gaudi/v1/worker/hpu_worker.py:114

  • The method imports pyhlml and calls pyhlml.hlmlInit() but doesn't handle the case where pyhlml might not be installed or fails to initialize. If pyhlml is not available or hlmlInit() fails, it would raise an exception that would propagate up and prevent the worker from initializing. Consider adding appropriate error handling with a clear error message indicating that pyhlml is required for device configuration.
        import pyhlml
        pyhlml.hlmlInit()
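
One way to address this review point is a defensive wrapper around the import and initialization. This is a hedged sketch, not the PR's code; the helper name `init_hlml_or_raise` and the install hint are assumptions.

```python
def init_hlml_or_raise():
    """Import and initialize pyhlml, converting failures into clear errors.

    Hypothetical helper illustrating the reviewer's suggestion; the install
    package name below is an assumption.
    """
    try:
        import pyhlml
    except ImportError as e:
        raise RuntimeError(
            "pyhlml is required to configure HABANA_VISIBLE_MODULES; "
            "please install the Habana pyhlml package.") from e
    try:
        pyhlml.hlmlInit()
    except Exception as e:
        raise RuntimeError(f"pyhlml initialization failed: {e}") from e
    return pyhlml
```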

vllm_gaudi/v1/worker/hpu_worker.py:101

  • Spelling error in the docstring. The word "avalible" should be spelled "available".
        The first avalible HPU (with minimum module ID) is assigned to the current worker if `set_device()` is not 

vllm_gaudi/v1/worker/hpu_worker.py:140

  • The validation logic for HABANA_VISIBLE_MODULES assumes that comma-separated values will always have valid digits, but it doesn't handle whitespace around commas. For example, "0, 1, 2, 3" would fail validation because "c.isdigit()" would return False for " 1". Consider stripping whitespace from each part before validation.
                if not all(c.isdigit() for c in env_visible_modules.split(",")):

examples/nixl/run_hpu_disagg_accuracy_test.sh:130

  • The RANK environment variable is set to GPU_ID, which is calculated as (i + NUM_PREFILL_INSTANCES) % get_num_gpus(). However, RANK is typically used for distributed training to identify the global rank of a process, and setting it to GPU_ID may not be appropriate. In the Python code, the worker's init receives both rank and local_rank as separate parameters, and at line 202 of hpu_worker.py, torch.hpu.set_device(self.local_rank) is called. Setting RANK to GPU_ID in the shell script may cause confusion between the global rank and local rank concepts. Consider whether RANK should be set differently or if the intended value is actually local_rank.
    BASE_CMD="HABANA_VISIBLE_MODULES='4,5,6,7' RANK=$GPU_ID VLLM_NIXL_SIDE_CHANNEL_PORT=$SIDE_CHANNEL_PORT vllm serve $model_name \

examples/nixl/run_benchmark_test.sh:140

  • The RANK environment variable is set to GPU_ID, which is calculated as (i + NUM_PREFILL_INSTANCES) % get_num_gpus(). However, RANK is typically used for distributed training to identify the global rank of a process, and setting it to GPU_ID may not be appropriate. In the Python code, the worker's init receives both rank and local_rank as separate parameters, and at line 202 of hpu_worker.py, torch.hpu.set_device(self.local_rank) is called. Setting RANK to GPU_ID in the shell script may cause confusion between the global rank and local rank concepts. Consider whether RANK should be set differently or if the intended value is actually local_rank.
    BASE_CMD="HABANA_VISIBLE_MODULES='4,5,6,7' RANK=$GPU_ID UCX_TLS=rc,ud,ib VLLM_NIXL_SIDE_CHANNEL_PORT=$SIDE_CHANNEL_PORT vllm serve $model_name \

vllm_gaudi/v1/worker/hpu_worker.py:138

  • The method queries torch.hpu.device_count() before setting HABANA_VISIBLE_MODULES. This is actually correct behavior since we need to enumerate all devices to determine which ones are available. However, there's a potential issue: after setting HABANA_VISIBLE_MODULES to a subset of modules, subsequent calls to torch.hpu.device_count() will return the count of visible modules, not all modules. Later at line 202, torch.hpu.set_device(self.local_rank) is called. Since HABANA_VISIBLE_MODULES remaps device indices, if local_rank is within the count of visible modules, this should work. However, the logic assumes the environment variable takes effect immediately and that the HPU runtime respects it. According to PR #946, the original implementation had issues with "Module IDs should be between 0 and 0" errors, which the PR description says is fixed by removing range checking. Ensure that setting HABANA_VISIBLE_MODULES before any device initialization actually takes effect for subsequent torch.hpu operations.
            device_count = torch.hpu.device_count()
            if device_count < 1:
                raise RuntimeError("No Habana devices found.")
            for i in range(device_count):
                try:
                    device = pyhlml.hlmlDeviceGetHandleByIndex(i)
                    utility = pyhlml.hlmlDeviceGetUtilizationRates(device)
                    if utility.aip == 0 and utility.memory == 0:
                        module_id = pyhlml.hlmlDeviceGetModuleID(device)
                        available_module_ids.append(module_id)
                except Exception:
                    continue
            if len(available_module_ids) < 1:
                raise RuntimeError("No available Habana modules found. All modules are currently in use.")
            env_visible_modules = os.getenv("HABANA_VISIBLE_MODULES")
            if env_visible_modules is None:
                if len(available_module_ids) < self.parallel_config.world_size:
                    raise RuntimeError(
                        f"Not enough available modules for world_size={self.parallel_config.world_size}.")
                available_modules_str = ",".join(map(str, sorted(available_module_ids)))
                logger.info("HABANA_VISIBLE_MODULES is not set, using all available modules: %s", available_modules_str)
                os.environ["HABANA_VISIBLE_MODULES"] = available_modules_str

vllm_gaudi/v1/worker/hpu_worker.py:105

  • Grammar issue in the docstring. The phrase "the HABANA_VISIBLE_MODULES environment variable need to be set" should use "needs" instead of "need" to match the singular subject "variable".
        `HABANA_VISIBLE_MODULES` environment variable need to be set to include only the available modules explicitly 

vllm_gaudi/v1/worker/hpu_worker.py:143

  • The validation logic doesn't handle empty strings within the comma-separated list. For example, "0,1,,2" would pass the digit check but would fail when trying to convert an empty string to an integer on line 143. Consider adding validation to filter out empty strings after splitting.
                if not all(c.isdigit() for c in env_visible_modules.split(",")):
                    raise RuntimeError(f"Invalid HABANA_VISIBLE_MODULES={env_visible_modules}. "
                                       "It should be a comma-separated list of integers.")
                env_module_ids = list(map(int, env_visible_modules.split(",")))
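
Both validation issues flagged above (whitespace around commas and empty entries) can be handled by normalizing before the digit check. A minimal sketch, not the PR's code; the function name is hypothetical:

```python
def parse_visible_modules(value):
    """Parse a HABANA_VISIBLE_MODULES-style string into a list of ints.

    Tolerates whitespace around commas ("0, 1, 2") and skips empty
    entries ("0,1,,2"); rejects anything non-numeric.
    """
    parts = [p.strip() for p in value.split(",")]
    parts = [p for p in parts if p]  # drop empty entries
    if not parts or not all(p.isdigit() for p in parts):
        raise RuntimeError(
            f"Invalid HABANA_VISIBLE_MODULES={value!r}. "
            "It should be a comma-separated list of integers.")
    return [int(p) for p in parts]
```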

examples/nixl/run_benchmark_test.sh:106

  • The RANK environment variable is set to GPU_ID, which is calculated as i % get_num_gpus(). However, RANK is typically used for distributed training to identify the global rank of a process, and setting it to GPU_ID may not be appropriate. In the Python code, the worker's init receives both rank and local_rank as separate parameters, and at line 202 of hpu_worker.py, torch.hpu.set_device(self.local_rank) is called. Setting RANK to GPU_ID in the shell script may cause confusion between the global rank and local rank concepts. Consider whether RANK should be set differently or if the intended value is actually local_rank.
    BASE_CMD="HABANA_VISIBLE_MODULES='0,1,2,3' RANK=$GPU_ID UCX_TLS=rc,ud,ib VLLM_NIXL_SIDE_CHANNEL_PORT=$SIDE_CHANNEL_PORT vllm serve $model_name \

vllm_gaudi/v1/worker/hpu_worker.py:128

  • The method catches all exceptions when querying device information and silently continues. While this provides robustness, it may hide important errors that should be reported. Consider logging the exception at a debug or warning level so that issues with specific devices are visible during troubleshooting, rather than silently skipping them.
                except Exception:
                    continue
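
The suggested change amounts to logging the skipped device before continuing. A sketch under the assumption that device queries are abstracted behind a callable (the real code calls pyhlml directly):

```python
import logging

logger = logging.getLogger(__name__)

def collect_available(device_count, get_module_id):
    """Collect module IDs, logging (not hiding) per-device query failures.

    `get_module_id` stands in for the pyhlml handle/utilization/module-ID
    queries in the actual worker code.
    """
    available = []
    for i in range(device_count):
        try:
            available.append(get_module_id(i))
        except Exception as e:
            # Visible during troubleshooting instead of a silent skip.
            logger.warning("Skipping device %d: %s", i, e)
            continue
    return available
```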

tests/unit_tests/run_accuracy_test.sh:124

  • The RANK environment variable is set to GPU_ID, which is calculated as i % get_num_gpus(). However, RANK is typically used for distributed training to identify the global rank of a process, and setting it to GPU_ID may not be appropriate. In the Python code, the worker's init receives both rank and local_rank as separate parameters, and at line 202, torch.hpu.set_device(self.local_rank) is called. Setting RANK to GPU_ID in the shell script may cause confusion between the global rank and local rank concepts. Consider whether RANK should be set differently or if the intended value is actually local_rank.
    BASE_CMD="HABANA_VISIBLE_MODULES='0,1,2,3' RANK=$GPU_ID UCX_TLS=$UCX_TLS VLLM_NIXL_SIDE_CHANNEL_PORT=$SIDE_CHANNEL_PORT vllm serve $model_name \

tests/unit_tests/run_accuracy_test.sh:157

  • The RANK environment variable is set to GPU_ID, which is calculated as i % get_num_gpus(). However, RANK is typically used for distributed training to identify the global rank of a process, and setting it to GPU_ID may not be appropriate. In the Python code, the worker's init receives both rank and local_rank as separate parameters, and at line 202 of hpu_worker.py, torch.hpu.set_device(self.local_rank) is called. Setting RANK to GPU_ID in the shell script may cause confusion between the global rank and local rank concepts. Consider whether RANK should be set differently or if the intended value is actually local_rank.
    BASE_CMD="HABANA_VISIBLE_MODULES='4,5,6,7' RANK=$GPU_ID UCX_TLS=$UCX_TLS VLLM_NIXL_SIDE_CHANNEL_PORT=$SIDE_CHANNEL_PORT vllm serve $model_name \

examples/nixl/run_accuracy_test.sh:139

  • The RANK environment variable is set to GPU_ID, which is calculated as (i + NUM_PREFILL_INSTANCES) % get_num_gpus(). However, RANK is typically used for distributed training to identify the global rank of a process, and setting it to GPU_ID may not be appropriate. In the Python code, the worker's init receives both rank and local_rank as separate parameters, and at line 202 of hpu_worker.py, torch.hpu.set_device(self.local_rank) is called. Setting RANK to GPU_ID in the shell script may cause confusion between the global rank and local rank concepts. Consider whether RANK should be set differently or if the intended value is actually local_rank.
    BASE_CMD="HABANA_VISIBLE_MODULES='4,5,6,7' RANK=$GPU_ID UCX_TLS=tcp VLLM_NIXL_SIDE_CHANNEL_PORT=$SIDE_CHANNEL_PORT vllm serve $model_name \

vllm_gaudi/v1/worker/hpu_worker.py:106

  • The docstring mentions that the method needs to be called "before initializing the device," but it doesn't clearly specify what "initializing the device" means in this context. Since the method itself calls torch.hpu.device_count(), which is a device operation, it would be helpful to clarify that this method should be called before any device selection or allocation operations (like torch.hpu.set_device or model loading), but may need to query device information as part of its operation.
        '''
        The first avalible HPU (with minimum module ID) is assigned to the current worker if `set_device()` is not 
        called or `set_device("hpu")` is called. This allows the auto device selection for multiple processes on the 
        same node. While vLLM spawns multiple worker processes on the same node, each worker needs to select a 
        different HPU device based on its local rank by calling `set_device(local_rank)`. To achieve this, the 
        `HABANA_VISIBLE_MODULES` environment variable need to be set to include only the available modules explicitly 
        before initializing the device.

vllm_gaudi/v1/worker/hpu_worker.py:145

  • The error message uses "device" where it should probably say "devices". The message "Some device for HABANA_VISIBLE_MODULES=%s are not available" should be "Some devices for HABANA_VISIBLE_MODULES=%s are not available" to match the plural "are".
                    logger.warning("Some device for HABANA_VISIBLE_MODULES=%s are not available.", env_visible_modules)

examples/nixl/run_hpu_disagg_accuracy_test.sh:97

  • The RANK environment variable is set to GPU_ID, which is calculated as i % get_num_gpus(). However, RANK is typically used for distributed training to identify the global rank of a process, and setting it to GPU_ID may not be appropriate. In the Python code, the worker's init receives both rank and local_rank as separate parameters, and at line 202 of hpu_worker.py, torch.hpu.set_device(self.local_rank) is called. Setting RANK to GPU_ID in the shell script may cause confusion between the global rank and local rank concepts. Consider whether RANK should be set differently or if the intended value is actually local_rank.
    BASE_CMD="HABANA_VISIBLE_MODULES='0,1,2,3' RANK=$GPU_ID VLLM_NIXL_SIDE_CHANNEL_PORT=$SIDE_CHANNEL_PORT vllm serve $model_name \

examples/nixl/run_benchmark_profile.sh:109

  • The RANK environment variable is set to GPU_ID, which is calculated as i % get_num_gpus(). However, RANK is typically used for distributed training to identify the global rank of a process, and setting it to GPU_ID may not be appropriate. In the Python code, the worker's init receives both rank and local_rank as separate parameters, and at line 202 of hpu_worker.py, torch.hpu.set_device(self.local_rank) is called. Setting RANK to GPU_ID in the shell script may cause confusion between the global rank and local rank concepts. Consider whether RANK should be set differently or if the intended value is actually local_rank.
    BASE_CMD="HABANA_VISIBLE_MODULES='0,1,2,3' RANK=$GPU_ID UCX_TLS=rc,ud,ib VLLM_NIXL_SIDE_CHANNEL_PORT=$SIDE_CHANNEL_PORT vllm serve $model_name \

examples/nixl/run_benchmark_profile.sh:143

  • The RANK environment variable is set to GPU_ID, which is calculated as (i + NUM_PREFILL_INSTANCES) % get_num_gpus(). However, RANK is typically used for distributed training to identify the global rank of a process, and setting it to GPU_ID may not be appropriate. In the Python code, the worker's init receives both rank and local_rank as separate parameters, and at line 202 of hpu_worker.py, torch.hpu.set_device(self.local_rank) is called. Setting RANK to GPU_ID in the shell script may cause confusion between the global rank and local rank concepts. Consider whether RANK should be set differently or if the intended value is actually local_rank.
    BASE_CMD="HABANA_VISIBLE_MODULES='4,5,6,7' RANK=$GPU_ID UCX_TLS=rc,ud,ib VLLM_NIXL_SIDE_CHANNEL_PORT=$SIDE_CHANNEL_PORT vllm serve $model_name \


# Select available Habana modules before initializing the device.
self._configure_habana_visible_modules()

def _configure_habana_visible_modules(self):
Collaborator


Please add UTs for this, for instance a test that verifies what was failing in the original PR (1, 2, 4, 8 devices).
