Closed
55 commits
251d173
merge ace w/o merged_prefill_for_v1 changes
shepark Jul 1, 2025
35c22a2
first nixl version
bukeao Jul 23, 2025
5605422
add logger.info to indicate the completion of kv transfer
bukeao Jul 24, 2025
f1fd3a0
add perf_counter to kv cache transfer
bukeao Jul 31, 2025
5702cc8
disable caching in kv_manager to fix second identical request failed …
bukeao Aug 4, 2025
e4e1418
Revert "disable caching in kv_manager to fix second identical request…
bukeao Aug 5, 2025
03a7284
Merge branch 'habana_main' into nixl2_buke
skaulintel Aug 5, 2025
39f9409
remove lmcache folder
skaulintel Aug 5, 2025
188ae95
add nixl connector from upstream
skaulintel Aug 6, 2025
c6ac725
add KVTransferParams class back to base.py and change block handling …
skaulintel Aug 6, 2025
e1404f3
modify kv buffer for hpu
skaulintel Aug 7, 2025
12949c0
add hpu functions
skaulintel Aug 7, 2025
094c136
adding indexing function
bukeao Aug 7, 2025
d015bbe
add block_size
bukeao Aug 7, 2025
b52daf0
change position of block_size
bukeao Aug 7, 2025
7cdd181
change position of block_size
bukeao Aug 7, 2025
a5f845a
change dtype of indices
bukeao Aug 7, 2025
b2065a2
change enumerate dict to dict.items
skaulintel Aug 8, 2025
e2e76ab
add test hpu disagg accuracy script
skaulintel Aug 8, 2025
79b7a2f
modify hpu accuracy test
skaulintel Aug 8, 2025
e7b89b5
original nixl acc script
skaulintel Aug 8, 2025
28430ac
hpu nixl accuracy script changes
skaulintel Aug 8, 2025
39ffb32
remove unnecessary stuff
skaulintel Aug 8, 2025
641af13
fix accuracy issue
bukeao Aug 11, 2025
cec85fd
move the interface functions from nixl to their appropriate locat…
bukeao Aug 13, 2025
31ba1df
temporarily comment out asserts
skaulintel Aug 13, 2025
f0eb4d2
temporary code to fix the input_batch.req_type has the right flag; te…
bukeao Aug 15, 2025
0786b65
Fix D node processing prompt
libinta Aug 16, 2025
61cb50d
change the chunked prefill max to 8k
libinta Aug 19, 2025
390cae5
add requirement and modification to accuracy.sh
libinta Aug 19, 2025
9ce78b8
improve copy func and revert get_prompts_and_decode
libinta Aug 20, 2025
753b5a4
revert benchmark test with more prompts
libinta Aug 20, 2025
69a10eb
modify log
libinta Aug 20, 2025
a025fcf
change input to 8k
libinta Aug 20, 2025
5c0116e
fix run_accuracy issue with allowing prefill on decode for case that …
libinta Aug 21, 2025
4ac12be
support pin memory
skaulintel Aug 21, 2025
16fd4a7
rewrite copy function and add benchmark profile
libinta Aug 22, 2025
d47a14e
test
libinta Aug 27, 2025
347f779
fix log
libinta Aug 27, 2025
ada1316
add more log
libinta Aug 28, 2025
dc4cf20
add more logs
libinta Aug 29, 2025
1be8753
test
libinta Aug 29, 2025
f1131b1
Added missing import
yeonsily Aug 29, 2025
3c103da
enable GDR
bukeao Sep 5, 2025
4594ca0
indent typo
bukeao Sep 5, 2025
ed864c1
fix assertion for non-pd case
libinta Sep 10, 2025
029f09d
add hpu p2d4
libinta Sep 18, 2025
b9d502b
clean up copy_kv_blocks
skaulintel Sep 24, 2025
9b56cc9
Update requirements.txt
libinta Sep 27, 2025
eb4a13f
pull PR 1711 and 1738 for performance improvement
libinta Sep 29, 2025
8eb428a
add fetch_from_cache_prompt
libinta Oct 6, 2025
1f7f0b8
Disable debug message
yeonsily Oct 8, 2025
0382e2d
Cache the blocks for the request only if enabled.
yeonsily Oct 8, 2025
bd3f65e
Disable debugging log
yeonsily Oct 9, 2025
7446c3b
Disable debug message in the test
yeonsily Oct 9, 2025
4 changes: 3 additions & 1 deletion benchmarks/backend_request_func.py
@@ -295,6 +295,7 @@ async def async_request_openai_completions(
url=api_url, json=payload, headers=headers
) as response:
if response.status == 200:

first_chunk_received = False
async for chunk_bytes in response.content:
chunk_bytes = chunk_bytes.strip()
@@ -318,7 +319,8 @@ async def async_request_openai_completions(
first_chunk_received = True
ttft = time.perf_counter() - st
output.ttft = ttft

#print(f'libin debug backend request {ttft=}')
sys.stdout.flush()
# Decoding phase
else:
output.itl.append(timestamp - most_recent_timestamp)
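The hunks above touch the timing logic of the streaming benchmark client: the first chunk sets the time to first token (TTFT), and every later chunk appends one inter-token latency (ITL) sample. A minimal sketch of that bookkeeping follows, assuming a generic iterable of byte chunks and an injectable clock. `measure_stream` and its parameters are hypothetical names for illustration, not part of `backend_request_func.py`.

```python
import time
from typing import Callable, Iterable, List, Optional, Tuple


def measure_stream(
    chunks: Iterable[bytes],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[float], List[float]]:
    """Return (ttft, itl) for a stream of chunks.

    Hypothetical helper: ttft is the delay until the first non-empty
    chunk, itl holds the gaps between consecutive non-empty chunks.
    """
    start = clock()
    most_recent = start
    ttft: Optional[float] = None
    itl: List[float] = []

    for chunk in chunks:
        if not chunk.strip():
            continue  # ignore empty keep-alive chunks
        now = clock()
        if ttft is None:
            ttft = now - start             # first token
        else:
            itl.append(now - most_recent)  # decoding phase
        most_recent = now
    return ttft, itl
```

Passing the clock as a parameter keeps the helper deterministic under test; a real client would leave the default `time.perf_counter`.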
2 changes: 1 addition & 1 deletion requirements/hpu.txt
@@ -7,7 +7,7 @@ ray
triton==3.1.0
setuptools>=77.0.3
setuptools-scm>=8
vllm-hpu-extension @ git+https://github.com/HabanaAI/vllm-hpu-extension.git@6b2f6fb
vllm-hpu-extension @ git+https://github.com/HabanaAI/vllm-hpu-extension.git@ce0e48

# Dependencies for HPU vllm docker image
datasets
4 changes: 4 additions & 0 deletions tests/v1/kv_connector/nixl_integration/requirements.txt
@@ -0,0 +1,4 @@
pytest
nixl==0.5.0
lm-eval
lm-eval[api]
75 changes: 56 additions & 19 deletions tests/v1/kv_connector/nixl_integration/run_accuracy_test.sh
@@ -5,6 +5,13 @@ set -xe
MODELS=(
"Qwen/Qwen3-0.6B"
)
#MODELS=(
# "meta-llama/Llama-3.1-8B"
#)

export VLLM_USE_V1=1
export VLLM_SKIP_WARMUP="true"
export PT_HPU_LAZY_MODE=1

# Number of prefill and decode instances to create
NUM_PREFILL_INSTANCES=${NUM_PREFILL_INSTANCES:-1} # Default to 1
@@ -13,9 +20,10 @@ PREFILLER_TP_SIZE=${PREFILLER_TP_SIZE:-1}
DECODER_TP_SIZE=${DECODER_TP_SIZE:-1}

# Find the git repository root directory
GIT_ROOT=$(git rev-parse --show-toplevel)
#GIT_ROOT=$(git rev-parse --show-toplevel)
GIT_ROOT="/home/vllm-nixl/vllm"

SMI_BIN=$(which nvidia-smi || which rocm-smi)
#SMI_BIN=$(which nvidia-smi || which rocm-smi)

# Trap the SIGINT signal (triggered by Ctrl+C)
trap 'kill $(jobs -pr)' SIGINT SIGTERM EXIT
@@ -25,7 +33,7 @@ wait_for_server() {
local port=$1
timeout 1200 bash -c "
until curl -s localhost:${port}/v1/completions > /dev/null; do
sleep 1
sleep 1
done" && return 0 || return 1
}
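
`wait_for_server` polls the completions endpoint once per second and gives up after 1200 seconds. The same pattern can be sketched in Python, assuming an injectable probe so the loop can be exercised without a live server. `wait_until` and `http_probe` are hypothetical names, not helpers from this PR. Like `curl -s` without `-f`, the probe treats any HTTP response, including an error status, as "server is up".

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(port: int) -> bool:
    """Return True once anything answers on the completions endpoint."""
    url = f"http://localhost:{port}/v1/completions"
    try:
        urllib.request.urlopen(url, timeout=5)
        return True
    except urllib.error.HTTPError:
        return True   # the server replied, even if with an error status
    except (urllib.error.URLError, OSError):
        return False  # connection refused, timeout, and similar


def wait_until(
    probe: Callable[[], bool],
    timeout_s: float = 1200.0,
    interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `probe` until it succeeds or `timeout_s` elapses."""
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A call such as `wait_until(lambda: http_probe(8300))` would mirror the shell function for the first prefill instance in this script.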

@@ -75,23 +83,24 @@ run_tests_for_model() {
# Start prefill instances
for i in $(seq 0 $((NUM_PREFILL_INSTANCES-1))); do
# Calculate GPU ID - we'll distribute across available GPUs
GPU_ID=$((i % $(get_num_gpus)))
#GPU_ID=$((i % $(get_num_gpus)))
GPU_ID=2

# Calculate port number (base port + instance number)
PORT=$((8100 + i))
PORT=$((8300 + i))
# Calculate side channel port. Avoid clash with TP workers.
SIDE_CHANNEL_PORT=$((5559 + i))

echo "Starting prefill instance $i on GPU $GPU_ID, port $PORT"

# Build the command with or without model-specific args
BASE_CMD="CUDA_VISIBLE_DEVICES=$GPU_ID VLLM_NIXL_SIDE_CHANNEL_PORT=$SIDE_CHANNEL_PORT vllm serve $model_name \
BASE_CMD="RANK=0 UCX_TLS=tcp VLLM_NIXL_SIDE_CHANNEL_PORT=$SIDE_CHANNEL_PORT vllm serve $model_name \
--port $PORT \
--enforce-eager \
--disable-log-requests \
--gpu-memory-utilization 0.2 \
--max_num_batched_tokens 8192 \
--gpu-memory-utilization 0.3 \
--tensor-parallel-size $PREFILLER_TP_SIZE \
--kv-transfer-config '{\"kv_connector\":\"NixlConnector\",\"kv_role\":\"kv_both\"}'"
--kv-transfer-config '{\"kv_connector\":\"NixlConnector\",\"kv_role\":\"kv_both\",\"kv_buffer_device\":\"cpu\"}'"

if [ -n "$model_args" ]; then
FULL_CMD="$BASE_CMD $model_args"
@@ -109,22 +118,22 @@ run_tests_for_model() {
# Start decode instances
for i in $(seq 0 $((NUM_DECODE_INSTANCES-1))); do
# Calculate GPU ID - we'll distribute across available GPUs, starting from after prefill GPUs
GPU_ID=$(((i + NUM_PREFILL_INSTANCES) % $(get_num_gpus)))
#GPU_ID=$(((i + NUM_PREFILL_INSTANCES) % $(get_num_gpus)))
# Calculate port number (base port + instance number)
PORT=$((8200 + i))
PORT=$((8400 + i))
# Calculate side channel port
SIDE_CHANNEL_PORT=$((5659 + i * $DECODER_TP_SIZE))

echo "Starting decode instance $i on GPU $GPU_ID, port $PORT"

# Build the command with or without model-specific args
BASE_CMD="CUDA_VISIBLE_DEVICES=$GPU_ID VLLM_NIXL_SIDE_CHANNEL_PORT=$SIDE_CHANNEL_PORT vllm serve $model_name \
BASE_CMD="RANK=1 UCX_TLS=tcp VLLM_NIXL_SIDE_CHANNEL_PORT=$SIDE_CHANNEL_PORT vllm serve $model_name \
--port $PORT \
--enforce-eager \
--disable-log-requests \
--gpu-memory-utilization 0.2 \
--max_num_batched_tokens 8192 \
--gpu-memory-utilization 0.3 \
--tensor-parallel-size $DECODER_TP_SIZE \
--kv-transfer-config '{\"kv_connector\":\"NixlConnector\",\"kv_role\":\"kv_both\"}'"
--kv-transfer-config '{\"kv_connector\":\"NixlConnector\",\"kv_role\":\"kv_both\",\"kv_buffer_device\":\"cpu\"}'"

if [ -n "$model_args" ]; then
FULL_CMD="$BASE_CMD $model_args"
@@ -151,7 +160,7 @@ run_tests_for_model() {
done

# Build the command for the proxy server with all the hosts and ports
PROXY_CMD="python ${GIT_ROOT}/tests/v1/kv_connector/nixl_integration/toy_proxy_server.py --port 8192"
PROXY_CMD="python toy_proxy_server.py --port 9192"

# Add all prefill hosts and ports
PROXY_CMD+=" --prefiller-hosts ${PREFILL_HOSTS[@]}"
@@ -166,11 +175,39 @@ run_tests_for_model() {
$PROXY_CMD &

# Wait for the proxy to start
sleep 5

sleep 10

# curl -X POST -s http://localhost:9192/v1/completions \
# -H "Content-Type: application/json" \
# -d '{
# "model": "meta-llama/Llama-3.1-8B",
# "prompt": "Mark Elliot Zuckerberg is an American businessman who co-founded the social media service Facebook and its parent company Meta Platforms, of which he is the chairman, chief executive officer, and controlling shareholder. Zuckerberg has been the subject of multiple lawsuits regarding the creation and ownership of the website as well as issues such as user privacy. Born in White Plains, New York, Zuckerberg briefly attended Harvard College, where he launched Facebook in February 2004 with his roommates Eduardo Saverin, Andrew McCollum, Dustin Moskovitz and Chris Hughes. Zuckerberg took the company public in May 2012 with majority shares. He became the worlds youngest self-made billionaire[a] in 2008, at age 23, and has consistently ranked among the worlds wealthiest individuals. According to Forbes, Zuckerbergs estimated net worth stood at US$221.2 billion as of May 2025, making him the second-richest individual in the world.[2]",
# "max_tokens": 5,
# "temperature": 0
# }'
sleep 5
echo "--------------------===================-------------"
#curl -X POST -s http://localhost:9192/v1/completions \
# -H "Content-Type: application/json" \
# -d '{
# "model": "meta-llama/Llama-3.1-8B",
# "prompt": "Mark Elliot Zuckerberg is an American businessman who co-founded the social media service Facebook and its parent company Meta Platforms, of which he is the chairman, chief executive officer, and controlling shareholder. Zuckerberg has been the subject of multiple lawsuits regarding the creation and ownership of the website as well as issues such as user privacy. Born in White Plains, New York, Zuckerberg briefly attended Harvard College, where he launched Facebook in February 2004 with his roommates Eduardo Saverin, Andrew McCollum, Dustin Moskovitz and Chris Hughes. Zuckerberg took the company public in May 2012 with majority shares. He became the worlds youngest self-made billionaire[a] in 2008, at age 23, and has consistently ranked among the worlds wealthiest individuals. According to Forbes, Zuckerbergs estimated net worth stood at US$221.2 billion as of May 2025, making him the second-richest individual in the world.[2] Intel opened its first international manufacturing facility in 1972, in Malaysia, which would host multiple Intel operations, before opening assembly facilities and semiconductor plants in Singapore and Jerusalem in the early 1980s, and manufacturing and development centers in China, India, and Costa Rica in the 1990s.[31] By the early 1980s, its business was dominated by DRAM chips. However, increased competition from Japanese semiconductor manufacturers had, by 1983, dramatically reduced the profitability of this market. The growing success of the IBM personal computer, based on an Intel microprocessor, was among factors that convinced Gordon Moore (CEO since 1975) to shift the companys focus to microprocessors and to change fundamental aspects of that business model. Moores decision to sole-source Intels 386 chip played into the companys continuing success.",
# "max_tokens": 5,
# "temperature": 0
# }'
# curl -X POST -s http://localhost:9192/v1/completions \
# -H "Content-Type: application/json" \
# -d '{
# "model": "meta-llama/Llama-3.1-8B",
# "prompt": ["This was a few months ago. It was my day off and the only thing I had to do was pick my girlfriend up from work at 9:00 pm. Other than that, I was free to loaf on the couch from morning to night, which is what I did. Around 8:00, I decided to shower before I left the house. Now, I have short hair that dries pretty quickly, but I am deeply vain about it, so I always dry it with the hairdryer right after I shower to ensure my hair doesnt get flat and weird. I never skip this step. So, I get out of the shower, start drying my hair... And then I wake up in bed. Its half an hour later. I feel like garbage, my entire body mysteriously hurts, and I am slowly realizing that I dont remember exiting the bathroom. My only clear thought is: oh shit, its 9:00! I have to pick up my girlfriend! Better shake myself awake. I dragged my aching carcass back to the bathroom, and this was when I noticed the massive blisters forming all over my hand. I was still pretty out of it, but I knew that this was a hospital visit kind of burn. My girlfriend then called to check in because I was running late and, despite my undoubtedly convincing argument that I was still perfectly fine to drive, she immediately knew something was wrong. She cabbed home and we got a ride to the ER. Turns out, I had my first ever seizure! It seems like during the seizure, I clenched the hairdryer in my fist and had it pointed at my other hand long enough to thoroughly cook it. The tissue loss is pretty deep in some areas and there was concerns about me retaining my mobility, but its been healing well so far.",
# "Mark Elliot Zuckerberg is an American businessman who co-founded the social media service Facebook and its parent company Meta Platforms, of which he is the chairman, chief executive officer, and controlling shareholder. Zuckerberg has been the subject of multiple lawsuits regarding the creation and ownership of the website as well as issues such as user privacy. Born in White Plains, New York, Zuckerberg briefly attended Harvard College, where he launched Facebook in February 2004 with his roommates Eduardo Saverin, Andrew McCollum, Dustin Moskovitz and Chris Hughes. Zuckerberg took the company public in May 2012 with majority shares. He became the worlds youngest self-made billionaire[a] in 2008, at age 23, and has consistently ranked among the worlds wealthiest individuals. According to Forbes, Zuckerbergs estimated net worth stood at US$221.2 billion as of May 2025, making him the second-richest individual in the world.[2]"],
# "max_tokens": 2,
# "temperature": 0
# }'
#sleep 10000
# Run lm eval for this model
echo "Running tests for $model_name"
TEST_MODEL=$model_name python -m pytest -s -x ${GIT_ROOT}/tests/v1/kv_connector/nixl_integration/test_accuracy.py
TEST_MODEL=$model_name python -m pytest -s -x test_accuracy.py

# Clean up before running next model
cleanup_instances
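
The script derives each instance's HTTP port and NIXL side-channel port from a base plus the instance index, and passes the KV-transfer settings as hand-escaped JSON. A minimal sketch of the same layout in Python follows, using the bases and config keys that appear in the new version of the script (8300 and 5559 for prefill, 8400 and 5659 for decode, `kv_buffer_device` set to `cpu`). `Instance`, `plan_instances`, and `kv_transfer_config` are hypothetical names, and the clash check is an addition for illustration rather than something the script performs.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Instance:
    role: str               # "prefill" or "decode"
    port: int               # value for `vllm serve --port`
    side_channel_port: int  # value for VLLM_NIXL_SIDE_CHANNEL_PORT


def plan_instances(
    num_prefill: int, num_decode: int, decoder_tp_size: int = 1
) -> List[Instance]:
    """Lay out ports the way the test script does and reject clashes."""
    plan = [Instance("prefill", 8300 + i, 5559 + i) for i in range(num_prefill)]
    plan += [
        Instance("decode", 8400 + i, 5659 + i * decoder_tp_size)
        for i in range(num_decode)
    ]
    ports = [p for inst in plan for p in (inst.port, inst.side_channel_port)]
    if len(ports) != len(set(ports)):
        raise ValueError("port clash in instance plan")
    return plan


def kv_transfer_config(buffer_device: str = "cpu") -> str:
    """Serialize the --kv-transfer-config value without manual escaping."""
    return json.dumps(
        {
            "kv_connector": "NixlConnector",
            "kv_role": "kv_both",
            "kv_buffer_device": buffer_device,
        },
        separators=(",", ":"),
    )
```

Generating the JSON with `json.dumps` avoids the nested backslash quoting used in `BASE_CMD`, at the cost of moving the launch logic out of plain shell.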