
Commit 042af0c

[Model][1/N] Support multiple poolers at model level (#21227)

Signed-off-by: DarkLight1337 <[email protected]>
Parent: 378d33c

File tree: 22 files changed, +550 −414 lines


docs/models/pooling_models.md

Lines changed: 39 additions & 14 deletions
```diff
@@ -11,26 +11,51 @@ before returning them.
 As shown in the [Compatibility Matrix](../features/compatibility_matrix.md), most vLLM features are not applicable to
 pooling models as they only work on the generation or decode stage, so performance may not improve as much.
 
-For pooling models, we support the following `--task` options.
-The selected option sets the default pooler used to extract the final hidden states:
+If the model doesn't implement this interface, you can set `--task` which tells vLLM
+to convert the model into a pooling model.
 
-| Task                            | Pooling Type | Normalization | Softmax |
-|---------------------------------|--------------|---------------|---------|
-| Embedding (`embed`)             | `LAST`       | ✅︎            |         |
-| Classification (`classify`)     | `LAST`       |               | ✅︎      |
-| Sentence Pair Scoring (`score`) | \*           | \*            | \*      |
+| `--task`   | Model type           | Supported pooling tasks       |
+|------------|----------------------|-------------------------------|
+| `embed`    | Embedding model      | `encode`, `embed`             |
+| `classify` | Classification model | `encode`, `classify`, `score` |
+| `reward`   | Reward model         | `encode`                      |
 
-\*The default pooler is always defined by the model.
+## Pooling Tasks
 
-!!! note
-    If the model's implementation in vLLM defines its own pooler, the default pooler is set to that instead of the one specified in this table.
+In vLLM, we define the following pooling tasks and corresponding APIs:
+
+| Task       | APIs               |
+|------------|--------------------|
+| `encode`   | `encode`           |
+| `embed`    | `embed`, `score`\* |
+| `classify` | `classify`         |
+| `score`    | `score`            |
+
+\*The `score` API falls back to `embed` task if the model does not support `score` task.
+
+Each pooling model in vLLM supports one or more of these tasks according to [Pooler.get_supported_tasks][vllm.model_executor.layers.Pooler.get_supported_tasks].
+
+By default, the pooler assigned to each task has the following attributes:
+
+| Task       | Pooling Type | Normalization | Softmax |
+|------------|--------------|---------------|---------|
+| `encode`   | `ALL`        |               |         |
+| `embed`    | `LAST`       | ✅︎            |         |
+| `classify` | `LAST`       |               | ✅︎      |
+
+These defaults may be overridden by the model's implementation in vLLM.
 
 When loading [Sentence Transformers](https://huggingface.co/sentence-transformers) models,
-we attempt to override the default pooler based on its Sentence Transformers configuration file (`modules.json`).
+we attempt to override the defaults based on its Sentence Transformers configuration file (`modules.json`),
+which takes priority over the model's defaults.
+
+You can further customize this via the `--override-pooler-config` option,
+which takes priority over both the model's and Sentence Transformers's defaults.
+
+!!! note
 
-!!! tip
-    You can customize the model's pooling method via the `--override-pooler-config` option,
-    which takes priority over both the model's and Sentence Transformers's defaults.
+    The above configuration may be disregarded if the model's implementation in vLLM defines its own pooler
+    that is not based on [PoolerConfig][vllm.config.PoolerConfig].
 
 ## Offline Inference
```
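The default per-task poolers documented above (`encode` keeps all hidden states, `embed` takes the last token and L2-normalizes, `classify` takes the last token and applies softmax) can be illustrated with a small self-contained sketch. This is plain Python for illustration only, not vLLM's actual pooler implementation; the function names are hypothetical:

```python
import math

# Toy hidden states for a 4-token sequence with hidden size 3.
hidden_states = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]

def pool_encode(states):
    # "encode" task, ALL pooling: return every token's hidden state unchanged.
    return states

def pool_embed(states):
    # "embed" task, LAST pooling + L2 normalization.
    last = states[-1]
    norm = math.sqrt(sum(x * x for x in last))
    return [x / norm for x in last]

def pool_classify(states):
    # "classify" task, LAST pooling + softmax over the (toy) class logits.
    last = states[-1]
    exps = [math.exp(x) for x in last]
    total = sum(exps)
    return [e / total for e in exps]

print(len(pool_encode(hidden_states)))                              # 4: one vector per token
print(round(sum(x * x for x in pool_embed(hidden_states)), 6))      # 1.0: unit norm
print(round(sum(pool_classify(hidden_states)), 6))                  # 1.0: probabilities sum to one
```

The `score` API, per the docs, reuses the `embed` path when a model has no dedicated `score` task, e.g. by comparing two normalized embeddings.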

tests/models/test_transformers.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -144,7 +144,7 @@ def test_quantization(
     "model",
     ["jason9693/Qwen2.5-1.5B-apeach"],
 )
-@pytest.mark.parametrize("dtype", ["half"])
+@pytest.mark.parametrize("dtype", ["float"])
 def test_classify(
     hf_runner,
     vllm_runner,
```

tests/plugins/vllm_add_dummy_model/vllm_add_dummy_model/my_gemma_embedding.py

Lines changed: 8 additions & 7 deletions
```diff
@@ -8,7 +8,7 @@
 import torch.nn as nn
 
 from vllm.config import VllmConfig
-from vllm.model_executor.layers.pooler import Pooler, PoolingType
+from vllm.model_executor.layers.pooler import DispatchPooler, Pooler
 from vllm.model_executor.models.gemma2 import Gemma2Model
 from vllm.model_executor.models.utils import WeightsMapper, maybe_prefix
 from vllm.sequence import IntermediateTensors
@@ -26,12 +26,13 @@ def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""):
         self.model = Gemma2Model(vllm_config=vllm_config,
                                  prefix=maybe_prefix(prefix, "model"))
 
-        self.pooler = Pooler.from_config_with_defaults(
-            vllm_config.model_config.pooler_config,
-            pooling_type=PoolingType.LAST,
-            normalize=True,
-            softmax=False,
-        )
+        pooler_config = vllm_config.model_config.pooler_config
+        assert pooler_config is not None
+
+        self.pooler = DispatchPooler({
+            "encode": Pooler.for_encode(pooler_config),
+            "embed": Pooler.for_embed(pooler_config),
+        })
 
         self.make_empty_intermediate_tensors = (
             self.model.make_empty_intermediate_tensors)
```
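The `DispatchPooler` pattern adopted above — one pooler per supported task, selected at call time — can be sketched as a toy stand-in in plain Python. The class and method names below mirror the diff but are illustrative, not vLLM's real implementation:

```python
from typing import Callable

# A pooler maps hidden states to pooled output (toy: flat lists of floats).
ToyPooler = Callable[[list[float]], list[float]]

class ToyDispatchPooler:
    """Toy dispatcher: routes each pooling task to its own pooler."""

    def __init__(self, poolers_by_task: dict[str, ToyPooler]) -> None:
        self.poolers_by_task = poolers_by_task

    def get_supported_tasks(self) -> set[str]:
        # A model supports exactly the tasks it registered poolers for.
        return set(self.poolers_by_task)

    def __call__(self, task: str, hidden_states: list[float]) -> list[float]:
        try:
            pooler = self.poolers_by_task[task]
        except KeyError:
            raise ValueError(f"Unsupported pooling task: {task!r}") from None
        return pooler(hidden_states)

pooler = ToyDispatchPooler({
    "encode": lambda hs: hs,      # identity over all positions
    "embed": lambda hs: hs[-1:],  # last position only
})

print(sorted(pooler.get_supported_tasks()))  # ['embed', 'encode']
print(pooler("embed", [1.0, 2.0, 3.0]))      # [3.0]
```

This is why the plugin model above declares only `encode` and `embed`: any other task raises instead of silently using a wrong default.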

vllm/config.py

Lines changed: 4 additions & 4 deletions
```diff
@@ -94,7 +94,7 @@
 TaskOption = Literal["auto", "generate", "embedding", "embed", "classify",
                      "score", "reward", "transcription", "draft"]
 
-_ResolvedTask = Literal["generate", "transcription", "pooling", "embed",
+_ResolvedTask = Literal["generate", "transcription", "encode", "embed",
                         "classify", "reward", "draft"]
 
 RunnerOption = Literal["auto", "generate", "pooling", "draft"]
@@ -103,7 +103,7 @@
 
 _RUNNER_TASKS: dict[RunnerType, list[_ResolvedTask]] = {
     "generate": ["generate", "transcription"],
-    "pooling": ["pooling", "embed", "classify", "reward"],
+    "pooling": ["encode", "embed", "classify", "reward"],
     "draft": [],
 }
 
@@ -579,7 +579,7 @@ def __post_init__(self) -> None:
         # user-selected task
         if runner_type == "pooling" and self.task == "auto":
             selected_task = all_supported_tasks[runner_type][-1]
-            assert selected_task != "pooling"
+            assert selected_task != "encode"
             self.task = selected_task
         self.supported_runner_types = supported_runner_types
         self.runner_type = runner_type
@@ -884,7 +884,7 @@ def _get_supported_pooling_tasks(
 
         supported_tasks = list[_ResolvedTask]()
         if registry.is_pooling_model(architectures):
-            supported_tasks.append("encode")
+            supported_tasks.append("encode")
 
         # For now, users must specify the task (other than "pooling")
         # to use for pooling models
```
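The renamed `--task auto` resolution above (pick the last supported task, never the generic `encode` task) can be sketched as a standalone function. This is a simplified illustration under assumed names, not vLLM's actual code path:

```python
# Mirror of _RUNNER_TASKS from the diff above, as plain data.
RUNNER_TASKS: dict[str, list[str]] = {
    "generate": ["generate", "transcription"],
    "pooling": ["encode", "embed", "classify", "reward"],
    "draft": [],
}

def resolve_task(runner_type: str, task: str,
                 supported: dict[str, list[str]]) -> str:
    """Pick a concrete task when the user leaves --task as 'auto'.

    For pooling models, the default is the *last* supported task in the
    list, and the generic 'encode' task is never chosen implicitly.
    """
    if runner_type == "pooling" and task == "auto":
        selected = supported[runner_type][-1]
        assert selected != "encode"
        return selected
    return task

# A model supporting only encode + embed defaults to "embed".
print(resolve_task("pooling", "auto", {"pooling": ["encode", "embed"]}))  # embed
```

Ordering the supported-task list from generic to specific is what makes `[-1]` a sensible default: the most specific capability wins.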

vllm/entrypoints/openai/api_server.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -1668,7 +1668,7 @@ async def init_app_state(
         request_logger=request_logger,
         chat_template=resolved_chat_template,
         chat_template_content_format=args.chat_template_content_format,
-    ) if "pooling" in model_config.supported_tasks else None
+    ) if "encode" in model_config.supported_tasks else None
     state.openai_serving_embedding = OpenAIServingEmbedding(
         engine_client,
         model_config,
```
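The gating shown above — constructing a serving handler only when the model supports the relevant task, otherwise leaving it `None` — can be sketched with toy stand-ins for the serving classes (names below are illustrative, not vLLM's API):

```python
from dataclasses import dataclass, field

@dataclass
class ToyModelConfig:
    supported_tasks: set[str] = field(default_factory=set)

@dataclass
class ToyServingPooling:
    model_config: ToyModelConfig

def init_app_state(model_config: ToyModelConfig) -> dict:
    # The pooling handler is only wired up when the model can "encode";
    # otherwise it stays None and the corresponding route is unavailable.
    return {
        "serving_pooling": ToyServingPooling(model_config)
        if "encode" in model_config.supported_tasks else None,
    }

state = init_app_state(ToyModelConfig(supported_tasks={"encode", "embed"}))
print(state["serving_pooling"] is not None)  # True
```

Keeping the handler `None` rather than raising lets one server binary expose only the endpoints the loaded model actually supports.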
