
Commit e3a0e43

[bugfix] Fix auto thread-binding when world_size > 1 in CPU backend and refactor code (#21032)
Signed-off-by: jiang1.li <[email protected]>
1 parent b3d8210 commit e3a0e43

7 files changed: +144 -150 lines changed

.buildkite/scripts/hardware_ci/run-cpu-test.sh

Lines changed: 2 additions & 2 deletions
@@ -24,8 +24,8 @@ numactl -C "$CORE_RANGE" -N "$NUMA_NODE" docker build --tag cpu-test-"$NUMA_NODE
 numactl -C "$CORE_RANGE" -N "$NUMA_NODE" docker build --build-arg VLLM_CPU_DISABLE_AVX512="true" --tag cpu-test-"$NUMA_NODE"-avx2 --target vllm-test -f docker/Dockerfile.cpu .

 # Run the image, setting --shm-size=4g for tensor parallel.
-docker run -itd --cpuset-cpus="$CORE_RANGE" --cpuset-mems="$NUMA_NODE" --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --privileged=true -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --env VLLM_CPU_OMP_THREADS_BIND="$OMP_CORE_RANGE" --env VLLM_CPU_CI_ENV=1 --shm-size=4g --name cpu-test-"$NUMA_NODE" cpu-test-"$NUMA_NODE"
-docker run -itd --cpuset-cpus="$CORE_RANGE" --cpuset-mems="$NUMA_NODE" --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --privileged=true -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --env VLLM_CPU_OMP_THREADS_BIND="$OMP_CORE_RANGE" --env VLLM_CPU_CI_ENV=1 --shm-size=4g --name cpu-test-"$NUMA_NODE"-avx2 cpu-test-"$NUMA_NODE"-avx2
+docker run -itd --cpuset-cpus="$CORE_RANGE" --cpuset-mems="$NUMA_NODE" --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --privileged=true -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --env VLLM_CPU_CI_ENV=1 --shm-size=4g --name cpu-test-"$NUMA_NODE" cpu-test-"$NUMA_NODE"
+docker run -itd --cpuset-cpus="$CORE_RANGE" --cpuset-mems="$NUMA_NODE" --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --privileged=true -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --env VLLM_CPU_CI_ENV=1 --shm-size=4g --name cpu-test-"$NUMA_NODE"-avx2 cpu-test-"$NUMA_NODE"-avx2

 function cpu_tests() {
   set -e

docs/getting_started/installation/cpu.md

Lines changed: 7 additions & 3 deletions
@@ -94,8 +94,8 @@ Currently, there are no pre-built CPU wheels.
 ## Related runtime environment variables

 - `VLLM_CPU_KVCACHE_SPACE`: specify the KV Cache size (e.g, `VLLM_CPU_KVCACHE_SPACE=40` means 40 GiB space for KV cache), larger setting will allow vLLM running more requests in parallel. This parameter should be set based on the hardware configuration and memory management pattern of users. Default value is `0`.
-- `VLLM_CPU_OMP_THREADS_BIND`: specify the CPU cores dedicated to the OpenMP threads. For example, `VLLM_CPU_OMP_THREADS_BIND=0-31` means there will be 32 OpenMP threads bound on 0-31 CPU cores. `VLLM_CPU_OMP_THREADS_BIND=0-31|32-63` means there will be 2 tensor parallel processes, 32 OpenMP threads of rank0 are bound on 0-31 CPU cores, and the OpenMP threads of rank1 are bound on 32-63 CPU cores. By setting to `auto`, the OpenMP threads of each rank are bound to the CPU cores in each NUMA node. By setting to `all`, the OpenMP threads of each rank uses all CPU cores available on the system. Default value is `auto`.
-- `VLLM_CPU_NUM_OF_RESERVED_CPU`: specify the number of CPU cores which are not dedicated to the OpenMP threads for each rank. The variable only takes effect when VLLM_CPU_OMP_THREADS_BIND is set to `auto`. Default value is `0`.
+- `VLLM_CPU_OMP_THREADS_BIND`: specify the CPU cores dedicated to the OpenMP threads; can be set to CPU id lists or `auto` (the default). For example, `VLLM_CPU_OMP_THREADS_BIND=0-31` means there will be 32 OpenMP threads bound on 0-31 CPU cores. `VLLM_CPU_OMP_THREADS_BIND=0-31|32-63` means there will be 2 tensor parallel processes, the 32 OpenMP threads of rank0 are bound on 0-31 CPU cores, and the OpenMP threads of rank1 are bound on 32-63 CPU cores. When set to `auto`, the OpenMP threads of each rank are bound to the CPU cores of one NUMA node respectively.
+- `VLLM_CPU_NUM_OF_RESERVED_CPU`: specify the number of CPU cores which are not dedicated to the OpenMP threads of each rank. The variable only takes effect when VLLM_CPU_OMP_THREADS_BIND is set to `auto`. Default value is `None`. If the value is not set and `auto` thread binding is used, no CPU is reserved when `world_size == 1`, and 1 CPU per rank is reserved when `world_size > 1`.
 - `VLLM_CPU_MOE_PREPACK` (x86 only): whether to use prepack for MoE layer. This will be passed to `ipex.llm.modules.GatedMLPMOE`. Default is `1` (True). On unsupported CPUs, you might need to set this to `0` (False).
 - `VLLM_CPU_SGL_KERNEL` (x86 only, Experimental): whether to use small-batch optimized kernels for linear layer and MoE layer, especially for low-latency requirements like online serving. The kernels require AMX instruction set, BFloat16 weight type and weight shapes divisible by 32. Default is `0` (False).

@@ -123,9 +123,13 @@ export VLLM_CPU_NUM_OF_RESERVED_CPU=1
 vllm serve facebook/opt-125m --dtype=bfloat16
 ```

+Note: it is recommended to manually reserve 1 CPU for the vLLM front-end process when `world_size == 1`.
+
 ### How to decide `VLLM_CPU_OMP_THREADS_BIND`?

-- Bind each OpenMP thread to a dedicated physical CPU core respectively, or use auto thread binding feature by default. On a hyper-threading enabled platform with 16 logical CPU cores / 8 physical CPU cores:
+- The default `auto` thread binding is recommended for most cases. Ideally, each OpenMP thread is bound to a dedicated physical core, the threads of each rank are bound to the same NUMA node, and 1 CPU per rank is reserved for other vLLM components when `world_size > 1`. If you hit performance problems or unexpected binding behaviour, try binding threads manually as shown below.
+
+- On a hyper-threading enabled platform with 16 logical CPU cores / 8 physical CPU cores:

 ??? console "Commands"
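For readers unfamiliar with the `rank0|rank1` binding syntax documented above, a minimal sketch of a parser for that core-list format is shown below. This is purely illustrative; the function name and behavior are assumptions, not vLLM's actual implementation:

```python
# Illustrative only: a hypothetical parser for "0-31|32-63" style
# VLLM_CPU_OMP_THREADS_BIND strings. NOT vLLM's actual parser.

def parse_threads_bind(bind_str: str) -> list[list[int]]:
    """Split the per-rank binding string on '|' and expand CPU id ranges."""
    ranks: list[list[int]] = []
    for rank_spec in bind_str.split("|"):
        cores: list[int] = []
        for part in rank_spec.split(","):
            if "-" in part:
                lo, hi = map(int, part.split("-"))
                cores.extend(range(lo, hi + 1))  # inclusive range "lo-hi"
            else:
                cores.append(int(part))
        ranks.append(cores)
    return ranks

print(parse_threads_bind("0-31|32-63")[1][:4])  # rank 1 -> [32, 33, 34, 35]
```

Calling `parse_threads_bind("0-31|32-63")` yields two 32-core lists, one per tensor-parallel rank, matching the documented example.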

requirements/cpu.txt

Lines changed: 0 additions & 2 deletions
@@ -24,6 +24,4 @@ datasets # for benchmark scripts
 # Intel Extension for PyTorch, only for x86_64 CPUs
 intel-openmp==2024.2.1; platform_machine == "x86_64"
 intel_extension_for_pytorch==2.6.0; platform_machine == "x86_64" # torch>2.6.0+cpu has performance regression on x86 platform, see https://github.com/pytorch/pytorch/pull/151218
-py-libnuma; platform_system != "Darwin"
-psutil; platform_system != "Darwin"
 triton==3.2.0; platform_machine == "x86_64" # Triton is required for torch 2.6+cpu, as it is imported in torch.compile.

vllm/envs.py

Lines changed: 3 additions & 2 deletions
@@ -44,7 +44,7 @@
 VLLM_PP_LAYER_PARTITION: Optional[str] = None
 VLLM_CPU_KVCACHE_SPACE: int = 0
 VLLM_CPU_OMP_THREADS_BIND: str = ""
-VLLM_CPU_NUM_OF_RESERVED_CPU: int = 0
+VLLM_CPU_NUM_OF_RESERVED_CPU: Optional[int] = None
 VLLM_CPU_MOE_PREPACK: bool = True
 VLLM_CPU_SGL_KERNEL: bool = False
 VLLM_XLA_CACHE_PATH: str = os.path.join(VLLM_CACHE_ROOT, "xla_cache")
@@ -442,7 +442,8 @@ def get_vllm_port() -> Optional[int]:
     # (CPU backend only) CPU cores not used by OMP threads.
     # Those CPU cores will not be used by OMP threads of a rank.
     "VLLM_CPU_NUM_OF_RESERVED_CPU":
-    lambda: int(os.getenv("VLLM_CPU_NUM_OF_RESERVED_CPU", "0")),
+    lambda: int(os.getenv("VLLM_CPU_NUM_OF_RESERVED_CPU", "0"))
+    if "VLLM_CPU_NUM_OF_RESERVED_CPU" in os.environ else None,

     # (CPU backend only) whether to use prepack for MoE layer. This will be
     # passed to ipex.llm.modules.GatedMLPMOE. On unsupported CPUs, you might
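The point of this change is that an unset variable now resolves to `None` instead of `0`, so the CPU backend can distinguish "explicitly reserve 0 CPUs" from "unset, apply the world-size-dependent default". A self-contained sketch of the same pattern (illustrative, not the vLLM source):

```python
import os
from typing import Optional

# Sketch of the Optional[int] env-var pattern used for
# VLLM_CPU_NUM_OF_RESERVED_CPU: unset -> None, set -> int(value).
def reserved_cpus() -> Optional[int]:
    if "VLLM_CPU_NUM_OF_RESERVED_CPU" in os.environ:
        return int(os.environ["VLLM_CPU_NUM_OF_RESERVED_CPU"])
    return None  # lets the backend pick a world_size-dependent default

os.environ.pop("VLLM_CPU_NUM_OF_RESERVED_CPU", None)
assert reserved_cpus() is None           # unset: defer to auto logic
os.environ["VLLM_CPU_NUM_OF_RESERVED_CPU"] = "0"
assert reserved_cpus() == 0              # an explicit 0 is preserved
```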

vllm/platforms/cpu.py

Lines changed: 64 additions & 0 deletions
@@ -1,9 +1,12 @@
 # SPDX-License-Identifier: Apache-2.0
 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project

+import json
 import os
 import platform
+import subprocess
 import sys
+from dataclasses import dataclass
 from importlib.util import find_spec
 from typing import TYPE_CHECKING, Optional

@@ -31,6 +34,35 @@ def get_max_threads(pid=0):
     raise NotImplementedError("Unsupported OS")


+@dataclass
+class LogicalCPUInfo:
+    id: int = -1
+    physical_core: int = -1
+    numa_node: int = -1
+
+    @classmethod
+    def _int(cls, value: str) -> int:
+        try:
+            int_value = int(value)
+        except Exception:
+            int_value = -1
+        return int_value
+
+    @staticmethod
+    def json_decoder(obj_dict: dict):
+        id = obj_dict.get("cpu")
+        physical_core = obj_dict.get("core")
+        numa_node = obj_dict.get("node")
+
+        if not (id is None or physical_core is None or numa_node is None):
+            return LogicalCPUInfo(
+                id=LogicalCPUInfo._int(id),
+                physical_core=LogicalCPUInfo._int(physical_core),
+                numa_node=LogicalCPUInfo._int(numa_node))
+        else:
+            return obj_dict
+
+
 class CpuPlatform(Platform):
     _enum = PlatformEnum.CPU
     device_name: str = "cpu"
@@ -240,6 +272,38 @@ def check_and_update_config(cls, vllm_config: VllmConfig) -> None:
                 vllm_config.scheduler_config.max_model_len,
                 DEFAULT_MAX_NUM_BATCHED_TOKENS)

+    @classmethod
+    def get_allowed_cpu_memory_node_list(
+            cls) -> tuple[list[int], list[LogicalCPUInfo]]:
+        assert platform.system() == "Linux"
+
+        # Init LogicalCPUInfo from lscpu
+        lscpu_output = subprocess.check_output("lscpu -J -e=CPU,CORE,NODE",
+                                               shell=True,
+                                               text=True)
+        logical_cpu_list: list[LogicalCPUInfo] = json.loads(
+            lscpu_output, object_hook=LogicalCPUInfo.json_decoder)['cpus']
+
+        # Filter CPUs with invalid attributes
+        logical_cpu_list = [
+            x for x in logical_cpu_list
+            if -1 not in (x.id, x.physical_core, x.numa_node)
+        ]
+
+        # Filter allowed CPUs
+        allowed_cpu_id_list = os.sched_getaffinity(0)
+        logical_cpu_list = [
+            x for x in logical_cpu_list if x.id in allowed_cpu_id_list
+        ]
+
+        # Get allowed NUMA nodes
+        allowed_numa_nodes = set()
+        for x in logical_cpu_list:
+            allowed_numa_nodes.add(x.numa_node)  # type: ignore
+        allowed_numa_nodes_list = sorted(allowed_numa_nodes)
+
+        return allowed_numa_nodes_list, logical_cpu_list
+
     @classmethod
     def is_pin_memory_available(cls) -> bool:
         logger.warning("Pin memory is not supported on CPU.")
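To see how the `object_hook` decoding works, the sketch below feeds a hand-written sample of `lscpu -J -e=CPU,CORE,NODE` output through the same decoder logic. The sample JSON is an assumption for illustration (field types vary across util-linux versions, which is why `_int` converts defensively):

```python
import json
from dataclasses import dataclass

# Replays the commit's LogicalCPUInfo decoding against a hand-written
# sample of `lscpu -J -e=CPU,CORE,NODE` output (illustrative, not from
# a real machine).

@dataclass
class LogicalCPUInfo:
    id: int = -1
    physical_core: int = -1
    numa_node: int = -1

    @staticmethod
    def _int(value) -> int:
        try:
            return int(value)
        except Exception:
            return -1

    @staticmethod
    def json_decoder(obj_dict: dict):
        cpu, core, node = (obj_dict.get(k) for k in ("cpu", "core", "node"))
        if None in (cpu, core, node):
            return obj_dict  # not a per-CPU entry; leave the dict untouched
        return LogicalCPUInfo(id=LogicalCPUInfo._int(cpu),
                              physical_core=LogicalCPUInfo._int(core),
                              numa_node=LogicalCPUInfo._int(node))

# Illustrative sample: 4 logical CPUs, 2 physical cores, 1 NUMA node.
sample = """
{"cpus": [
  {"cpu": 0, "core": 0, "node": 0}, {"cpu": 1, "core": 1, "node": 0},
  {"cpu": 2, "core": 0, "node": 0}, {"cpu": 3, "core": 1, "node": 0}
]}
"""
cpus = json.loads(sample, object_hook=LogicalCPUInfo.json_decoder)["cpus"]
print(sorted({c.numa_node for c in cpus}))  # -> [0]
```

The hook runs on every decoded dict, innermost first: per-CPU dicts become `LogicalCPUInfo` instances, while the outer `{"cpus": ...}` dict lacks a `"cpu"` key and passes through unchanged.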

vllm/v1/worker/cpu_model_runner.py

Lines changed: 4 additions & 3 deletions
@@ -45,9 +45,10 @@ def replace_tensor(obj: Any, cpu_attr_name: str,
             if k.endswith("_cpu_tensor") and isinstance(v, torch.Tensor):
                 replace_tensor(self.input_batch, k, k[:-11])

-        for k, v in vars(self.input_batch.block_table).items():
-            if k.endswith("_cpu") and isinstance(v, torch.Tensor):
-                replace_tensor(self.input_batch.block_table, k, k[:-4])
+        for block_table in self.input_batch.block_table.block_tables:
+            for k, v in vars(block_table).items():
+                if k.endswith("_cpu") and isinstance(v, torch.Tensor):
+                    replace_tensor(block_table, k, k[:-4])

     def load_model(self, eep_scale_up: bool = False) -> None:
         logger.info("Starting to load model %s...", self.model_config.model)
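The fix reflects that `input_batch.block_table` is a container of per-group block tables rather than a single table with `_cpu` tensor attributes of its own, so the swap has to iterate one level deeper. A toy sketch of the attribute-swap pattern (class names here are made up for illustration, not vLLM's actual types):

```python
import torch

# Toy illustration: every tensor attribute named "<x>_cpu" replaces its
# device-side twin "<x>". Class names are hypothetical, not vLLM's.

class ToyBlockTable:
    def __init__(self):
        self.slots_cpu = torch.zeros(4, dtype=torch.int32)
        self.slots = torch.empty_like(self.slots_cpu)

class ToyMultiGroupTable:
    def __init__(self, n_groups: int):
        self.block_tables = [ToyBlockTable() for _ in range(n_groups)]

def swap_to_cpu_tensors(obj) -> None:
    for k, v in list(vars(obj).items()):
        if k.endswith("_cpu") and isinstance(v, torch.Tensor):
            setattr(obj, k[:-4], v)  # point "slots" at "slots_cpu"

multi = ToyMultiGroupTable(n_groups=2)
# The fix: iterate the inner tables instead of the container itself,
# which holds no "_cpu" tensors directly.
for bt in multi.block_tables:
    swap_to_cpu_tensors(bt)
assert multi.block_tables[0].slots is multi.block_tables[0].slots_cpu
```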
