Set device according to the local rank #1047
yangulei wants to merge 2 commits into vllm-project:main
Conversation
### Motivation

For a typical node with 8x Gaudi2E HPUs, the devices are split into two groups, each with 4 HPUs connected to a top board. The current random mapping between `local_rank` and `module_id` causes HCCL failures for `world_size > 4`.

### Changes

- Set the device according to the local rank.
- Use `pyhlml` to set `HABANA_VISIBLE_MODULES` to the available modules. This is necessary when multiple runs with `world_size=1/2/4` want to share the same node simultaneously, or when the available `module_ids` do not start with 0.

---------

Signed-off-by: Youlei Yang <youlei.yang@intel.com>
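The selection rule described in the changes can be sketched as follows. This is a minimal, hedged illustration, not the PR's actual code: `configure_visible_modules` is a hypothetical helper, and `module_utilization` stands in for the pyhlml utilization queries the real change performs.

```python
import os


def configure_visible_modules(module_utilization, world_size):
    """Expose only idle modules via HABANA_VISIBLE_MODULES.

    `module_utilization` maps module_id -> (aip_util, mem_util); it is a
    hypothetical stand-in for the pyhlml queries used by the actual change.
    """
    # A module is considered available when both its compute (aip) and
    # memory utilization are zero.
    idle = sorted(mid for mid, (aip, mem) in module_utilization.items()
                  if aip == 0 and mem == 0)
    if len(idle) < world_size:
        raise RuntimeError(
            f"Not enough available modules for world_size={world_size}.")
    os.environ["HABANA_VISIBLE_MODULES"] = ",".join(map(str, idle))
    return os.environ["HABANA_VISIBLE_MODULES"]
```

With the variable set this way, each worker's `set_device(local_rank)` maps rank `i` to the `i`-th idle module, regardless of whether the idle module IDs start at 0.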
@iboiko-habana
Pull request overview
This pull request reimplements the device selection logic for Habana Gaudi HPUs to properly map local ranks to device modules, fixing a runtime error from PR #946 that occurred when reverting PR #788. The changes ensure that multiple vLLM worker processes on the same node correctly select different HPU devices based on their local rank.
Changes:
- Added `_configure_habana_visible_modules()` method to dynamically configure HABANA_VISIBLE_MODULES based on available devices
- Modified `init_device()` to call `torch.hpu.set_device(self.local_rank)` for explicit device selection
- Updated shell scripts to support hl-smi for device enumeration and set HABANA_VISIBLE_MODULES with device distribution
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| vllm_gaudi/v1/worker/hpu_worker.py | Adds device configuration logic to query available Habana modules via pyhlml and set HABANA_VISIBLE_MODULES before device initialization; adds explicit device selection via torch.hpu.set_device(local_rank) |
| tests/unit_tests/run_accuracy_test.sh | Adds hl-smi support, enables dynamic GPU_ID calculation, and sets HABANA_VISIBLE_MODULES to segregate prefill (modules 0-3) and decode (modules 4-7) instances |
| examples/nixl/run_hpu_disagg_accuracy_test.sh | Adds hl-smi support, enables dynamic GPU_ID calculation, and sets HABANA_VISIBLE_MODULES for disaggregated prefill/decode setup |
| examples/nixl/run_benchmark_test.sh | Adds hl-smi support, enables dynamic GPU_ID calculation, and sets HABANA_VISIBLE_MODULES for disaggregated prefill/decode setup |
| examples/nixl/run_benchmark_profile.sh | Adds hl-smi support, enables dynamic GPU_ID calculation, and sets HABANA_VISIBLE_MODULES for disaggregated prefill/decode setup |
| examples/nixl/run_accuracy_test.sh | Adds hl-smi support, enables dynamic GPU_ID calculation, and sets HABANA_VISIBLE_MODULES for disaggregated prefill/decode setup |
Comments suppressed due to low confidence (19)
examples/nixl/run_accuracy_test.sh:106
- The RANK environment variable is set to GPU_ID, which is calculated as i % get_num_gpus(). However, RANK is typically used for distributed training to identify the global rank of a process, and setting it to GPU_ID may not be appropriate. In the Python code, the worker's init receives both rank and local_rank as separate parameters, and at line 202 of hpu_worker.py, torch.hpu.set_device(self.local_rank) is called. Setting RANK to GPU_ID in the shell script may cause confusion between the global rank and local rank concepts. Consider whether RANK should be set differently or if the intended value is actually local_rank.
BASE_CMD="HABANA_VISIBLE_MODULES='0,1,2,3' RANK=$GPU_ID UCX_TLS=tcp VLLM_NIXL_SIDE_CHANNEL_PORT=$SIDE_CHANNEL_PORT vllm serve $model_name \
vllm_gaudi/v1/worker/hpu_worker.py:114
- The method imports pyhlml and calls pyhlml.hlmlInit() but doesn't handle the case where pyhlml might not be installed or fails to initialize. If pyhlml is not available or hlmlInit() fails, it would raise an exception that would propagate up and prevent the worker from initializing. Consider adding appropriate error handling with a clear error message indicating that pyhlml is required for device configuration.
import pyhlml
pyhlml.hlmlInit()
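One way to address this (a sketch, not the PR's actual code) is to wrap the import and initialization and re-raise with an actionable message; `init_hlml` is a hypothetical helper name:

```python
def init_hlml():
    """Import and initialize pyhlml, failing with a clear message.

    Hypothetical helper: wraps the bare `import pyhlml` / `hlmlInit()`
    calls so a missing or broken installation surfaces an actionable
    error instead of an opaque traceback during worker init.
    """
    try:
        import pyhlml
    except ImportError as e:
        raise RuntimeError(
            "pyhlml is required to auto-configure HABANA_VISIBLE_MODULES; "
            "install it or set HABANA_VISIBLE_MODULES manually.") from e
    try:
        pyhlml.hlmlInit()
    except Exception as e:
        raise RuntimeError(f"Failed to initialize HLML: {e}") from e
    return pyhlml
```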
vllm_gaudi/v1/worker/hpu_worker.py:101
- Spelling error in the docstring. The word "avalible" should be spelled "available".
The first avalible HPU (with minimum module ID) is assigned to the current worker if `set_device()` is not
vllm_gaudi/v1/worker/hpu_worker.py:140
- The validation logic for HABANA_VISIBLE_MODULES assumes that comma-separated values will always have valid digits, but it doesn't handle whitespace around commas. For example, "0, 1, 2, 3" would fail validation because "c.isdigit()" would return False for " 1". Consider stripping whitespace from each part before validation.
if not all(c.isdigit() for c in env_visible_modules.split(",")):
examples/nixl/run_hpu_disagg_accuracy_test.sh:130
- The RANK environment variable is set to GPU_ID, which is calculated as (i + NUM_PREFILL_INSTANCES) % get_num_gpus(). However, RANK is typically used for distributed training to identify the global rank of a process, and setting it to GPU_ID may not be appropriate. In the Python code, the worker's init receives both rank and local_rank as separate parameters, and at line 202 of hpu_worker.py, torch.hpu.set_device(self.local_rank) is called. Setting RANK to GPU_ID in the shell script may cause confusion between the global rank and local rank concepts. Consider whether RANK should be set differently or if the intended value is actually local_rank.
BASE_CMD="HABANA_VISIBLE_MODULES='4,5,6,7' RANK=$GPU_ID VLLM_NIXL_SIDE_CHANNEL_PORT=$SIDE_CHANNEL_PORT vllm serve $model_name \
examples/nixl/run_benchmark_test.sh:140
- The RANK environment variable is set to GPU_ID, which is calculated as (i + NUM_PREFILL_INSTANCES) % get_num_gpus(). However, RANK is typically used for distributed training to identify the global rank of a process, and setting it to GPU_ID may not be appropriate. In the Python code, the worker's init receives both rank and local_rank as separate parameters, and at line 202 of hpu_worker.py, torch.hpu.set_device(self.local_rank) is called. Setting RANK to GPU_ID in the shell script may cause confusion between the global rank and local rank concepts. Consider whether RANK should be set differently or if the intended value is actually local_rank.
BASE_CMD="HABANA_VISIBLE_MODULES='4,5,6,7' RANK=$GPU_ID UCX_TLS=rc,ud,ib VLLM_NIXL_SIDE_CHANNEL_PORT=$SIDE_CHANNEL_PORT vllm serve $model_name \
vllm_gaudi/v1/worker/hpu_worker.py:138
- The method queries torch.hpu.device_count() before setting HABANA_VISIBLE_MODULES. This is actually correct behavior since we need to enumerate all devices to determine which ones are available. However, there's a potential issue: after setting HABANA_VISIBLE_MODULES to a subset of modules, subsequent calls to torch.hpu.device_count() will return the count of visible modules, not all modules. Later at line 202, torch.hpu.set_device(self.local_rank) is called. Since HABANA_VISIBLE_MODULES remaps device indices, if local_rank is within the count of visible modules, this should work. However, the logic assumes the environment variable takes effect immediately and that the HPU runtime respects it. According to PR #946, the original implementation had issues with "Module IDs should be between 0 and 0" errors, which the PR description says is fixed by removing range checking. Ensure that setting HABANA_VISIBLE_MODULES before any device initialization actually takes effect for subsequent torch.hpu operations.
device_count = torch.hpu.device_count()
if device_count < 1:
    raise RuntimeError("No Habana devices found.")
for i in range(device_count):
    try:
        device = pyhlml.hlmlDeviceGetHandleByIndex(i)
        utility = pyhlml.hlmlDeviceGetUtilizationRates(device)
        if utility.aip == 0 and utility.memory == 0:
            module_id = pyhlml.hlmlDeviceGetModuleID(device)
            available_module_ids.append(module_id)
    except Exception:
        continue
if len(available_module_ids) < 1:
    raise RuntimeError("No available Habana modules found. All modules are currently in use.")
env_visible_modules = os.getenv("HABANA_VISIBLE_MODULES")
if env_visible_modules is None:
    if len(available_module_ids) < self.parallel_config.world_size:
        raise RuntimeError(
            f"Not enough available modules for world_size={self.parallel_config.world_size}.")
    available_modules_str = ",".join(map(str, sorted(available_module_ids)))
    logger.info("HABANA_VISIBLE_MODULES is not set, using all available modules: %s", available_modules_str)
    os.environ["HABANA_VISIBLE_MODULES"] = available_modules_str
vllm_gaudi/v1/worker/hpu_worker.py:105
- Grammar issue in the docstring. The phrase "the HABANA_VISIBLE_MODULES environment variable need to be set" should use "needs" instead of "need" to match the singular subject "variable".
`HABANA_VISIBLE_MODULES` environment variable need to be set to include only the available modules explicitly
vllm_gaudi/v1/worker/hpu_worker.py:143
- The validation logic doesn't handle empty strings within the comma-separated list. For example, "0,1,,2" would pass the digit check but would fail when trying to convert an empty string to an integer on line 143. Consider adding validation to filter out empty strings after splitting.
if not all(c.isdigit() for c in env_visible_modules.split(",")):
    raise RuntimeError(f"Invalid HABANA_VISIBLE_MODULES={env_visible_modules}. "
                       "It should be a comma-separated list of integers.")
env_module_ids = list(map(int, env_visible_modules.split(",")))
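Both parsing issues raised above (whitespace around commas and empty entries) can be handled by stripping each part before validation. A minimal sketch, with `parse_visible_modules` as a hypothetical helper name:

```python
def parse_visible_modules(value):
    """Parse a HABANA_VISIBLE_MODULES-style string.

    Tolerates spaces around commas ("0, 1, 2") and rejects empty
    entries ("0,1,,2").
    """
    parts = [p.strip() for p in value.split(",")]
    # "".isdigit() is False, so empty entries fail this check too.
    if not all(p.isdigit() for p in parts):
        raise RuntimeError(f"Invalid HABANA_VISIBLE_MODULES={value!r}. "
                           "It should be a comma-separated list of integers.")
    return [int(p) for p in parts]
```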
examples/nixl/run_benchmark_test.sh:106
- The RANK environment variable is set to GPU_ID, which is calculated as i % get_num_gpus(). However, RANK is typically used for distributed training to identify the global rank of a process, and setting it to GPU_ID may not be appropriate. In the Python code, the worker's init receives both rank and local_rank as separate parameters, and at line 202 of hpu_worker.py, torch.hpu.set_device(self.local_rank) is called. Setting RANK to GPU_ID in the shell script may cause confusion between the global rank and local rank concepts. Consider whether RANK should be set differently or if the intended value is actually local_rank.
BASE_CMD="HABANA_VISIBLE_MODULES='0,1,2,3' RANK=$GPU_ID UCX_TLS=rc,ud,ib VLLM_NIXL_SIDE_CHANNEL_PORT=$SIDE_CHANNEL_PORT vllm serve $model_name \
vllm_gaudi/v1/worker/hpu_worker.py:128
- The method catches all exceptions when querying device information and silently continues. While this provides robustness, it may hide important errors that should be reported. Consider logging the exception at a debug or warning level so that issues with specific devices are visible during troubleshooting, rather than silently skipping them.
except Exception:
    continue
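A sketch of the enumeration loop with the failure logged rather than swallowed. Here `probe(i)` is a hypothetical stand-in for the pyhlml handle/utilization/module-ID queries, which keeps the pattern illustrable without hardware:

```python
import logging

logger = logging.getLogger("hpu_worker")


def collect_idle_modules(device_count, probe):
    """Enumerate devices, logging and skipping per-device failures.

    `probe(i)` returns the module ID of device i if idle, None if busy,
    or raises on error (a stand-in for the pyhlml calls).
    """
    available = []
    for i in range(device_count):
        try:
            module_id = probe(i)
        except Exception as e:
            # Visible during troubleshooting instead of a silent `continue`.
            logger.warning("Skipping device index %d: %s", i, e)
            continue
        if module_id is not None:
            available.append(module_id)
    return available
```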
tests/unit_tests/run_accuracy_test.sh:124
- The RANK environment variable is set to GPU_ID, which is calculated as i % get_num_gpus(). However, RANK is typically used for distributed training to identify the global rank of a process, and setting it to GPU_ID may not be appropriate. In the Python code, the worker's init receives both rank and local_rank as separate parameters, and at line 202, torch.hpu.set_device(self.local_rank) is called. Setting RANK to GPU_ID in the shell script may cause confusion between the global rank and local rank concepts. Consider whether RANK should be set differently or if the intended value is actually local_rank.
BASE_CMD="HABANA_VISIBLE_MODULES='0,1,2,3' RANK=$GPU_ID UCX_TLS=$UCX_TLS VLLM_NIXL_SIDE_CHANNEL_PORT=$SIDE_CHANNEL_PORT vllm serve $model_name \
tests/unit_tests/run_accuracy_test.sh:157
- The RANK environment variable is set to GPU_ID, which is calculated as i % get_num_gpus(). However, RANK is typically used for distributed training to identify the global rank of a process, and setting it to GPU_ID may not be appropriate. In the Python code, the worker's init receives both rank and local_rank as separate parameters, and at line 202 of hpu_worker.py, torch.hpu.set_device(self.local_rank) is called. Setting RANK to GPU_ID in the shell script may cause confusion between the global rank and local rank concepts. Consider whether RANK should be set differently or if the intended value is actually local_rank.
BASE_CMD="HABANA_VISIBLE_MODULES='4,5,6,7' RANK=$GPU_ID UCX_TLS=$UCX_TLS VLLM_NIXL_SIDE_CHANNEL_PORT=$SIDE_CHANNEL_PORT vllm serve $model_name \
examples/nixl/run_accuracy_test.sh:139
- The RANK environment variable is set to GPU_ID, which is calculated as (i + NUM_PREFILL_INSTANCES) % get_num_gpus(). However, RANK is typically used for distributed training to identify the global rank of a process, and setting it to GPU_ID may not be appropriate. In the Python code, the worker's init receives both rank and local_rank as separate parameters, and at line 202 of hpu_worker.py, torch.hpu.set_device(self.local_rank) is called. Setting RANK to GPU_ID in the shell script may cause confusion between the global rank and local rank concepts. Consider whether RANK should be set differently or if the intended value is actually local_rank.
BASE_CMD="HABANA_VISIBLE_MODULES='4,5,6,7' RANK=$GPU_ID UCX_TLS=tcp VLLM_NIXL_SIDE_CHANNEL_PORT=$SIDE_CHANNEL_PORT vllm serve $model_name \
vllm_gaudi/v1/worker/hpu_worker.py:106
- The docstring mentions that the method needs to be called "before initializing the device," but it doesn't clearly specify what "initializing the device" means in this context. Since the method itself calls torch.hpu.device_count(), which is a device operation, it would be helpful to clarify that this method should be called before any device selection or allocation operations (like torch.hpu.set_device or model loading), but may need to query device information as part of its operation.
'''
The first avalible HPU (with minimum module ID) is assigned to the current worker if `set_device()` is not
called or `set_device("hpu")` is called. This allows the auto device selection for multiple processes on the
same node. While vLLM spawns multiple worker processes on the same node, each worker needs to select a
different HPU device based on its local rank by calling `set_device(local_rank)`. To achieve this, the
`HABANA_VISIBLE_MODULES` environment variable need to be set to include only the available modules explicitly
before initializing the device.
vllm_gaudi/v1/worker/hpu_worker.py:145
- The error message uses "device" where it should probably say "devices". The message "Some device for HABANA_VISIBLE_MODULES=%s are not available" should be "Some devices for HABANA_VISIBLE_MODULES=%s are not available" to match the plural "are".
logger.warning("Some device for HABANA_VISIBLE_MODULES=%s are not available.", env_visible_modules)
examples/nixl/run_hpu_disagg_accuracy_test.sh:97
- The RANK environment variable is set to GPU_ID, which is calculated as i % get_num_gpus(). However, RANK is typically used for distributed training to identify the global rank of a process, and setting it to GPU_ID may not be appropriate. In the Python code, the worker's init receives both rank and local_rank as separate parameters, and at line 202 of hpu_worker.py, torch.hpu.set_device(self.local_rank) is called. Setting RANK to GPU_ID in the shell script may cause confusion between the global rank and local rank concepts. Consider whether RANK should be set differently or if the intended value is actually local_rank.
BASE_CMD="HABANA_VISIBLE_MODULES='0,1,2,3' RANK=$GPU_ID VLLM_NIXL_SIDE_CHANNEL_PORT=$SIDE_CHANNEL_PORT vllm serve $model_name \
examples/nixl/run_benchmark_profile.sh:109
- The RANK environment variable is set to GPU_ID, which is calculated as i % get_num_gpus(). However, RANK is typically used for distributed training to identify the global rank of a process, and setting it to GPU_ID may not be appropriate. In the Python code, the worker's init receives both rank and local_rank as separate parameters, and at line 202 of hpu_worker.py, torch.hpu.set_device(self.local_rank) is called. Setting RANK to GPU_ID in the shell script may cause confusion between the global rank and local rank concepts. Consider whether RANK should be set differently or if the intended value is actually local_rank.
BASE_CMD="HABANA_VISIBLE_MODULES='0,1,2,3' RANK=$GPU_ID UCX_TLS=rc,ud,ib VLLM_NIXL_SIDE_CHANNEL_PORT=$SIDE_CHANNEL_PORT vllm serve $model_name \
examples/nixl/run_benchmark_profile.sh:143
- The RANK environment variable is set to GPU_ID, which is calculated as (i + NUM_PREFILL_INSTANCES) % get_num_gpus(). However, RANK is typically used for distributed training to identify the global rank of a process, and setting it to GPU_ID may not be appropriate. In the Python code, the worker's init receives both rank and local_rank as separate parameters, and at line 202 of hpu_worker.py, torch.hpu.set_device(self.local_rank) is called. Setting RANK to GPU_ID in the shell script may cause confusion between the global rank and local rank concepts. Consider whether RANK should be set differently or if the intended value is actually local_rank.
BASE_CMD="HABANA_VISIBLE_MODULES='4,5,6,7' RANK=$GPU_ID UCX_TLS=rc,ud,ib VLLM_NIXL_SIDE_CHANNEL_PORT=$SIDE_CHANNEL_PORT vllm serve $model_name \
# Select available Habana modules before initializing the device.
self._configure_habana_visible_modules()

def _configure_habana_visible_modules(self):
Please add UTs for this, for instance a test that verifies what was failing in the original PR (1, 2, 4, 8 devices).
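A sketch of such a unit test, parameterized over 1/2/4/8 devices. The `visible_modules` helper below is a simplified, hypothetical stand-in for `_configure_habana_visible_modules` followed by `set_device(local_rank)`, under the assumption that rank `i` selects index `i` of the visible-module list:

```python
def visible_modules(available, world_size):
    # Simplified stand-in: expose the sorted available modules.
    if len(available) < world_size:
        raise RuntimeError("Not enough available modules.")
    return ",".join(map(str, sorted(available)))


def test_each_rank_gets_a_distinct_module():
    available = [3, 1, 7, 5, 0, 6, 2, 4]  # arbitrary discovery order
    for world_size in (1, 2, 4, 8):
        modules = visible_modules(available, world_size).split(",")
        # set_device(local_rank) picks index local_rank within the list.
        selected = [modules[rank] for rank in range(world_size)]
        assert len(set(selected)) == world_size  # one distinct device per rank


test_each_rank_gets_a_distinct_module()
```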
Reimplement #788 and fix the runtime error reported in #946 by removing the range checking for module IDs.