
Commit 19db319

Merge branch 'yeonsily/integ_internVL' into ovis-padding
2 parents 2fabbb1 + df93a5f commit 19db319

File tree

15 files changed: +807 -175 lines changed

.cd/README.md

Lines changed: 17 additions & 8 deletions

@@ -28,21 +28,30 @@ Supports a wide range of validated models including LLaMa, Mistral, and Qwen fam

 ## How to Use

+### 0. Clone the Repository
+
+Before proceeding with any of the steps below, make sure to clone the vLLM fork repository and navigate to the `.cd` directory. This ensures you have all necessary files and scripts for running the server or benchmarks.
+
+```bash
+git clone https://github.com/HabanaAI/vllm-fork.git
+cd vllm-fork/.cd/
+```
+
 ### 1. Run the server using Docker Compose

 The recommended and easiest way to start the vLLM server is with Docker Compose. At a minimum, set the following environment variables:

 - MODEL - Select a model from the table above.
 - HF_TOKEN - Your Hugging Face token (generate one at <https://huggingface.co>).
-- DOCKER_IMAGE - The vLLM Docker image URL from Gaudi or local repository.
+- DOCKER_IMAGE - The vLLM Docker image URL from Gaudi or local repository. When using the Gaudi repository, please select Docker images with the vllm-installer* prefix in the file name.

 **Example usage:**

 ```bash
 cd vllm-fork/.cd/
 MODEL="Qwen/Qwen2.5-14B-Instruct" \
 HF_TOKEN="<your huggingface token>" \
-DOCKER_IMAGE="<docker image url>" \
+DOCKER_IMAGE="vault.habana.ai/gaudi-docker/1.22.0/ubuntu22.04/habanalabs/vllm-installer-2.7.1:latest" \
 docker compose up
 ```

@@ -54,7 +63,7 @@ Supports a wide range of validated models including LLaMa, Mistral, and Qwen fam
 cd vllm-fork/.cd/
 MODEL="Qwen/Qwen2.5-14B-Instruct" \
 HF_TOKEN="<your huggingface token>" \
-DOCKER_IMAGE="<docker image url>" \
+DOCKER_IMAGE="vault.habana.ai/gaudi-docker/1.22.0/ubuntu22.04/habanalabs/vllm-installer-2.7.1:latest" \
 docker compose --profile benchmark up
 ```

@@ -81,7 +90,7 @@ Supports a wide range of validated models including LLaMa, Mistral, and Qwen fam
 cd vllm-fork/.cd/
 MODEL="Qwen/Qwen2.5-14B-Instruct" \
 HF_TOKEN="<your huggingface token>" \
-DOCKER_IMAGE="<docker image url>" \
+DOCKER_IMAGE="vault.habana.ai/gaudi-docker/1.22.0/ubuntu22.04/habanalabs/vllm-installer-2.7.1:latest" \
 TENSOR_PARALLEL_SIZE=1 \
 MAX_MODEL_LEN=2048 \
 docker compose up
@@ -102,7 +111,7 @@ Supports a wide range of validated models including LLaMa, Mistral, and Qwen fam
 cd vllm-fork/.cd/
 MODEL="Qwen/Qwen2.5-14B-Instruct" \
 HF_TOKEN="<your huggingface token>" \
-DOCKER_IMAGE="<docker image url>" \
+DOCKER_IMAGE="vault.habana.ai/gaudi-docker/1.22.0/ubuntu22.04/habanalabs/vllm-installer-2.7.1:latest" \
 INPUT_TOK=128 \
 OUTPUT_TOK=128 \
 CON_REQ=16 \
@@ -122,7 +131,7 @@ Supports a wide range of validated models including LLaMa, Mistral, and Qwen fam
 cd vllm-fork/.cd/
 MODEL="Qwen/Qwen2.5-14B-Instruct" \
 HF_TOKEN="<your huggingface token>" \
-DOCKER_IMAGE="<docker image url>" \
+DOCKER_IMAGE="vault.habana.ai/gaudi-docker/1.22.0/ubuntu22.04/habanalabs/vllm-installer-2.7.1:latest" \
 TENSOR_PARALLEL_SIZE=1 \
 MAX_MODEL_LEN=2048 \
 INPUT_TOK=128 \
@@ -147,7 +156,7 @@ Supports a wide range of validated models including LLaMa, Mistral, and Qwen fam

 ```bash
 HF_TOKEN=<your huggingface token> \
-DOCKER_IMAGE="<docker image url>" \
+DOCKER_IMAGE="vault.habana.ai/gaudi-docker/1.22.0/ubuntu22.04/habanalabs/vllm-installer-2.7.1:latest" \
 VLLM_SERVER_CONFIG_FILE=server/server_scenarios_text.yaml \
 VLLM_SERVER_CONFIG_NAME=llama31_8b_instruct \
 VLLM_BENCHMARK_CONFIG_FILE=benchmark/benchmark_scenarios_text.yaml \
@@ -178,7 +187,7 @@ Supports a wide range of validated models including LLaMa, Mistral, and Qwen fam
 -p 8000:8000 \
 -e HF_HOME='mnt/hf_cache'
 --name vllm-server \
-<docker image name>
+vault.habana.ai/gaudi-docker/1.22.0/ubuntu22.04/habanalabs/vllm-installer-2.7.1:latest
 ```

 This method gives you full flexibility over Docker runtime options.
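
Once the server is up (via Docker Compose or plain `docker run` as above), it can be exercised over vLLM's OpenAI-compatible HTTP API. A minimal sketch, assuming the default port mapping of 8000 shown above and that the served model name matches the `MODEL` value used at launch:

```python
# Minimal sketch: query a vLLM server started as described in the README above.
# Assumes the OpenAI-compatible endpoint is reachable on localhost:8000 and that
# the served model name matches the MODEL value passed to docker compose.
import requests

response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "Qwen/Qwen2.5-14B-Instruct",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64,
    },
    timeout=60,
)
print(response.json()["choices"][0]["message"]["content"])
```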

Lines changed: 1 addition & 1 deletion

@@ -1,5 +1,5 @@
 model_name: "/mnt/weka/data/pytorch/Qwen/Qwen2.5-VL-7B-Instruct/"
 dtype: "bfloat16"
-max_model_len: 32768
+max_model_len: 35840
 max_num_seqs: 32
 num_prompts: 4

requirements/common.txt

Lines changed: 1 addition & 0 deletions

@@ -48,3 +48,4 @@ opentelemetry-sdk>=1.26.0 # vllm.tracing
 opentelemetry-api>=1.26.0 # vllm.tracing
 opentelemetry-exporter-otlp>=1.26.0 # vllm.tracing
 opentelemetry-semantic-conventions-ai>=0.4.1 # vllm.tracing
+modelscope # required to support VLLM_USE_MODELSCOPE env
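
The new `modelscope` dependency backs vLLM's `VLLM_USE_MODELSCOPE` environment variable, which redirects model downloads from the Hugging Face Hub to ModelScope. A minimal sketch of the typical usage, with an illustrative model ID:

```python
# Minimal sketch: download the model from ModelScope instead of the HF Hub.
# The model ID is illustrative; it must exist on ModelScope for this to work.
import os

os.environ["VLLM_USE_MODELSCOPE"] = "True"  # must be set before the model is loaded

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```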

requirements/hpu.txt

Lines changed: 2 additions & 2 deletions

@@ -3,11 +3,11 @@

 # Dependencies for HPU code
 accelerate
-ray
+ray<2.49.0
 triton==3.1.0
 setuptools>=77.0.3
 setuptools-scm>=8
-vllm-hpu-extension @ git+https://github.com/HabanaAI/vllm-hpu-extension.git@b7ce4ba
+vllm-hpu-extension @ git+https://github.com/HabanaAI/vllm-hpu-extension.git@048015b

 # Dependencies for HPU vllm docker image
 datasets

vllm/entrypoints/chat_utils.py

Lines changed: 8 additions & 2 deletions

@@ -405,8 +405,14 @@ def _resolve_chat_template_content_format(
     jinja_text = (hf_chat_template if isinstance(hf_chat_template, str)
                   else load_chat_template(chat_template, is_literal=True))

-    detected_format = ("string" if jinja_text is None else
-                       _detect_content_format(jinja_text, default="string"))
+    # The InternVL template has mixed content access patterns that fail with automatic detection.
+    # Set string format for proper operation if InternVL is used.
+    model_type = getattr(model_config.hf_config, 'model_type', '')
+    if model_type == 'internvl_chat' or 'internvl' in model_config.model.lower():
+        detected_format = "string"
+    else:
+        detected_format = ("string" if jinja_text is None else
+                           _detect_content_format(jinja_text, default="string"))

     return detected_format
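
For context, the value returned here decides how incoming chat messages are normalized before templating: with the "string" format each message's `content` stays a single string, while the "openai" format keeps `content` as a list of typed parts. An illustrative sketch of the two message shapes the detector chooses between (values are made up):

```python
# Illustrative only: the two chat message shapes behind the content formats.

# "string" content format - the message content is one plain string.
string_style_message = {
    "role": "user",
    "content": "Describe this image.",
}

# "openai" content format - the content is a list of typed parts, which is
# what templates that iterate over content items expect.
openai_style_message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Describe this image."},
        {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
    ],
}
```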

vllm/model_executor/layers/sampler.py

Lines changed: 15 additions & 2 deletions

@@ -197,6 +197,10 @@ def __init__(self):
         # speculative decoding and when prompt embeddings are specified.
         self.include_gpu_probs_tensor = False
         self.should_modify_greedy_probs_inplace = False
+        # Add HPU cache class variables
+        self._prompt_tokens_hpu_cache: Optional[torch.Tensor] = None
+        self._output_tokens_hpu_cache: Optional[torch.Tensor] = None
+        self._cached_seq_ids: Optional[set] = None

     def _init_sampling_tensors(
         self,
@@ -216,8 +220,10 @@ def _init_sampling_tensors(

         # Initialize new sampling tensors
         (sampling_tensors, do_penalties, do_top_p_top_k, do_min_p,
-         top_k_scalar, top_p_scalar) = SamplingTensors.from_sampling_metadata(
-             sampling_metadata, vocab_size, logits.device, logits.dtype)
+         top_k_scalar, top_p_scalar, current_seq_ids) = \
+            SamplingTensors.from_sampling_metadata(
+                sampling_metadata, vocab_size, logits.device, logits.dtype, \
+                self._prompt_tokens_hpu_cache, self._output_tokens_hpu_cache, self._cached_seq_ids)

         self._sampling_tensors = sampling_tensors
         self._do_penalties = do_penalties
@@ -227,6 +233,13 @@ def _init_sampling_tensors(
         self._top_p_scalar = top_p_scalar

         self._apply_top_k_top_p_opt = ApplyToppTopkScalar(5)
+        # Check if batch composition changed - if so, invalidate prompt cache
+
+        # After tensors are created, update cache
+        if self._cached_seq_ids != current_seq_ids:
+            self._prompt_tokens_hpu_cache = None
+            self._output_tokens_hpu_cache = None
+            self._cached_seq_ids = current_seq_ids

     def forward(
         self,
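
The change above keys the cached HPU prompt/output token tensors to the set of sequence IDs in the batch: when the batch composition changes, both caches are dropped so stale tensors are never reused. A simplified, standalone sketch of that invalidation pattern; the class and method names here are illustrative, not the actual SamplingTensors API:

```python
from typing import Callable, Optional, Set

import torch


class SeqIdKeyedCache:
    """Illustrative cache invalidated when the batch's sequence ids change."""

    def __init__(self) -> None:
        self._cached_seq_ids: Optional[Set[int]] = None
        self._cached_tokens: Optional[torch.Tensor] = None

    def get_or_build(self, seq_ids: Set[int],
                     build: Callable[[], torch.Tensor]) -> torch.Tensor:
        # A different set of sequences means the cached tensor no longer
        # matches the batch layout, so drop it and rebuild.
        if self._cached_seq_ids != seq_ids or self._cached_tokens is None:
            self._cached_tokens = build()
            self._cached_seq_ids = set(seq_ids)
        return self._cached_tokens


cache = SeqIdKeyedCache()
first = cache.get_or_build({1, 2}, lambda: torch.zeros(2, 8, dtype=torch.long))
second = cache.get_or_build({1, 2}, lambda: torch.ones(2, 8, dtype=torch.long))
assert first is second  # same batch composition -> cache hit, builder not called
```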

vllm/model_executor/models/gemma3_mm.py

Lines changed: 16 additions & 18 deletions

@@ -569,11 +569,6 @@ def _process_image_input(
         pixel_values = image_input["pixel_values"]
         num_patches = image_input["num_patches"]

-        image_features = self._image_pixels_to_features(
-            self.vision_tower,
-            pixel_values,
-        )
-
         if is_hpu:
             batch_breakdown = greedy_plan(pixel_values.shape[0], \
                         self.vision_buckets.multimodal_buckets)
@@ -582,22 +577,25 @@ def _process_image_input(

             for i in batch_breakdown:
                 end_idx = start_idx + i
-                batch_sliced_image_features = \
-                    image_features[start_idx:end_idx, ...]
-                if is_lazy:
-                    image_embeds_multibatches += \
-                        [self.multi_modal_projector(
-                            batch_sliced_image_features,
-                            bypass_hpu_graphs=i
-                            not in self.graphed_multimodal_buckets
-                            and len(self.graphed_multimodal_buckets) > 0)]
-                else:
-                    image_embeds_multibatches += \
-                        [self.multi_modal_projector( \
-                            batch_sliced_image_features)]
+                indices = torch.arange(start_idx,
+                                       end_idx).to(pixel_values.device)
+                batch_sliced_pixel_values = torch.index_select(pixel_values,
+                                                               dim=0,
+                                                               index=indices)
+
+                image_features = self._image_pixels_to_features(
+                    self.vision_tower,
+                    batch_sliced_pixel_values,
+                )
+                image_embeds = self.multi_modal_projector(image_features)
+                image_embeds_multibatches += [image_embeds.clone()]
                 start_idx = end_idx
             image_embeds = torch.cat(image_embeds_multibatches, dim=0)
         else:
+            image_features = self._image_pixels_to_features(
+                self.vision_tower,
+                pixel_values,
+            )
             image_embeds = self.multi_modal_projector(image_features)
         return [
             e.flatten(0, 1) for e in image_embeds.split(num_patches.tolist())
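
The reworked HPU path above slices the raw `pixel_values` into bucket-sized chunks before the vision tower runs, encodes and projects each chunk separately, and concatenates the per-chunk embeddings, instead of encoding the whole batch first and slicing the features afterwards. A simplified, standalone sketch of that chunked-encoding pattern; the encoder and chunk plan below are stand-ins for the model's `vision_tower` and `greedy_plan`, not the real implementations:

```python
from typing import List

import torch


def encode_in_chunks(pixel_values: torch.Tensor,
                     chunk_plan: List[int],
                     encoder: torch.nn.Module) -> torch.Tensor:
    """Encode a batch in bucket-sized slices and concatenate the results."""
    outputs = []
    start_idx = 0
    for chunk in chunk_plan:
        end_idx = start_idx + chunk
        indices = torch.arange(start_idx, end_idx, device=pixel_values.device)
        # Slice the raw inputs first, then run the encoder on the smaller,
        # bucket-shaped batch.
        sliced = torch.index_select(pixel_values, dim=0, index=indices)
        outputs.append(encoder(sliced).clone())
        start_idx = end_idx
    return torch.cat(outputs, dim=0)


# Usage with a toy "encoder": 7 images split into buckets of 4 and 3.
images = torch.randn(7, 3 * 224 * 224)
toy_encoder = torch.nn.Linear(3 * 224 * 224, 16)
embeds = encode_in_chunks(images, chunk_plan=[4, 3], encoder=toy_encoder)
assert embeds.shape == (7, 16)
```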
