
Commit aab0102

[V0 deprecation] Remove more V0 references (#29088)
Signed-off-by: DarkLight1337 <[email protected]>
1 parent: b34129b

15 files changed: 31 additions, 75 deletions

docs/contributing/model/basic.md

Lines changed: 0 additions & 2 deletions
@@ -133,8 +133,6 @@ We consider 3 different scenarios:
 For case (1), we recommend looking at the implementation of [`MambaForCausalLM`](../../../vllm/model_executor/models/mamba.py) (for Mamba-1) or [`Mamba2ForCausalLM`](../../../vllm/model_executor/models/mamba2.py) (for Mamba-2) as a reference.
 The model should inherit protocol `IsAttentionFree` and also implement class methods `get_mamba_state_dtype_from_config` and `get_mamba_state_shape_from_config` to calculate the state shapes and data types from the config.
 For the mamba layers themselves, please use the [`MambaMixer`](../../../vllm/model_executor/layers/mamba/mamba_mixer.py) (for Mamba-1) or [`MambaMixer2`](../../../vllm/model_executor/layers/mamba/mamba_mixer2.py) (for Mamba-2) classes.
-Please *do not* use the `MambaCacheManager` (deprecated in V1) or replicate any of the V0-specific code paths in the existing model implementations.
-V0-only classes and code will be removed in the very near future.
 The model should also be added to the `MODELS_CONFIG_MAP` dictionary in [vllm/model_executor/models/config.py](../../../vllm/model_executor/models/config.py) to ensure that the runtime defaults are optimized.
 
 For case (2), we recommend using as a reference the implementation of [`JambaForCausalLM`](../../../vllm/model_executor/models/jamba.py) (for an example of a model that uses Mamba-1 and attention together) or [`BambaForCausalLM`](../../../vllm/model_executor/models/bamba.py) (for an example of a model that uses Mamba-2 and attention together).
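
For orientation, the remaining guidance in this doc maps to a small model surface. The skeleton below is a non-authoritative sketch of it: the class-method names come from the paragraph above, while the `vllm_config` parameter, the return values, and the constructor shape are illustrative assumptions rather than vLLM's exact signatures.

```python
# Hypothetical skeleton only; return values and signatures are illustrative assumptions.
import torch
from torch import nn

from vllm.model_executor.layers.mamba.mamba_mixer2 import MambaMixer2
from vllm.model_executor.models.interfaces import IsAttentionFree


class MyMamba2ForCausalLM(nn.Module, IsAttentionFree):
    """Sketch of an attention-free Mamba-2 model as described above."""

    @classmethod
    def get_mamba_state_dtype_from_config(cls, vllm_config):
        # Assumption: report the dtypes of the conv and SSM states so the
        # engine can size the Mamba cache without instantiating the model.
        return (torch.float32, torch.float32)

    @classmethod
    def get_mamba_state_shape_from_config(cls, vllm_config):
        # Assumption: report the per-layer (conv_state, ssm_state) shapes,
        # normally derived from hidden size, state size, head count, etc.
        return ((1024, 3), (32, 128, 128))

    def __init__(self, *, vllm_config, prefix: str = "") -> None:
        super().__init__()
        # Each decoder block would wrap a MambaMixer2 layer (MambaMixer for
        # Mamba-1); embedding, norm, and LM head wiring are omitted here.
        self.layers = nn.ModuleList()
```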

docs/design/prefix_caching.md

Lines changed: 0 additions & 3 deletions
@@ -94,9 +94,6 @@ To improve privacy in shared environments, vLLM supports isolating prefix cache
 
 With this setup, cache sharing is limited to users or requests that explicitly agree on a common salt, enabling cache reuse within a trust group while isolating others.
 
-!!! note
-    Cache isolation is not supported in engine V0.
-
 ## Data Structure
 
 The prefix caching in vLLM v1 is implemented in the KV cache manager. The basic building block is the “Block” data class (simplified):
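
For readers landing here from the doc page: the salt-based isolation described above is driven entirely by the request. The sketch below assumes the OpenAI-compatible server and a `cache_salt` request field (the field name, server address, and model are assumptions not shown in this diff); requests sending the same salt may share cached prefixes, while different salts keep them apart.

```python
# Sketch of salt-scoped prefix caching against a running vLLM server.
# Assumptions: server at localhost:8000, a `cache_salt` request field, any chat model.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

shared_prefix = "You are a support assistant for ACME Corp. Policy document: ..."

# Two requests from the same trust group use an identical salt, so the long
# shared prefix can be served from the prefix cache on the second request.
for question in ["How do refunds work?", "What is the warranty period?"]:
    resp = client.chat.completions.create(
        model="meta-llama/Llama-3.2-1B-Instruct",
        messages=[
            {"role": "system", "content": shared_prefix},
            {"role": "user", "content": question},
        ],
        extra_body={"cache_salt": "acme-team-salt"},  # different tenants use different salts
    )
    print(resp.choices[0].message.content)
```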

docs/usage/reproducibility.md

Lines changed: 2 additions & 7 deletions
@@ -1,10 +1,7 @@
 # Reproducibility
 
-vLLM does not guarantee the reproducibility of the results by default, for the sake of performance. You need to do the following to achieve
-reproducible results:
-
-- For V1: Turn off multiprocessing to make the scheduling deterministic by setting `VLLM_ENABLE_V1_MULTIPROCESSING=0`.
-- For V0: Set the global seed (see below).
+vLLM does not guarantee the reproducibility of the results by default, for the sake of performance. To achieve
+reproducible results, you need to turn off multiprocessing to make the scheduling deterministic by setting `VLLM_ENABLE_V1_MULTIPROCESSING=0`.
 
 Example: [examples/offline_inference/reproducibility.py](../../examples/offline_inference/reproducibility.py)
 
@@ -30,8 +27,6 @@ However, in some cases, setting the seed will also [change the random state in u
 
 ### Default Behavior
 
-In V0, the `seed` parameter defaults to `None`. When the `seed` parameter is `None`, the random states for `random`, `np.random`, and `torch.manual_seed` are not set. This means that each run of vLLM will produce different results if `temperature > 0`, as expected.
-
 In V1, the `seed` parameter defaults to `0` which sets the random state for each worker, so the results will remain consistent for each vLLM run even if `temperature > 0`.
 
 !!! note
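
In practice the updated recipe reduces to one environment variable plus the (already default) seed; a minimal sketch, using a small placeholder model:

```python
# Minimal reproducibility sketch per the doc above: make V1 scheduling
# deterministic by disabling engine multiprocessing, and keep the seeded
# (default 0) random state.
import os

# Must be set before the LLM/engine is constructed.
os.environ["VLLM_ENABLE_V1_MULTIPROCESSING"] = "0"

from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m", seed=0)  # seed=0 is already the V1 default
params = SamplingParams(temperature=0.8, max_tokens=32)
outputs = llm.generate(["The capital of France is"], params)
print(outputs[0].outputs[0].text)
```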

docs/usage/v1_guide.md

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@
 
 !!! announcement
 
-    We have started the process of deprecating V0. Please read [RFC #18571](https://github.com/vllm-project/vllm/issues/18571) for more details.
+    We have fully deprecated V0. Please read [RFC #18571](https://github.com/vllm-project/vllm/issues/18571) for more details.
 
 V1 is now enabled by default for all supported use cases, and we will gradually enable it for every use case we plan to support. Please share any feedback on [GitHub](https://github.com/vllm-project/vllm) or in the [vLLM Slack](https://inviter.co/vllm-slack).

examples/offline_inference/reproducibility.py

Lines changed: 2 additions & 6 deletions
@@ -11,13 +11,9 @@
 
 from vllm import LLM, SamplingParams
 
-# V1 only: Turn off multiprocessing to make the scheduling deterministic.
+# Turn off multiprocessing to make the scheduling deterministic.
 os.environ["VLLM_ENABLE_V1_MULTIPROCESSING"] = "0"
 
-# V0 only: Set the global seed. The default seed is None, which is
-# not reproducible.
-SEED = 42
-
 prompts = [
     "Hello, my name is",
     "The president of the United States is",
@@ -28,7 +24,7 @@
 
 
 def main():
-    llm = LLM(model="facebook/opt-125m", seed=SEED)
+    llm = LLM(model="facebook/opt-125m")
     outputs = llm.generate(prompts, sampling_params)
     print("-" * 50)
     for output in outputs:

examples/offline_inference/rlhf_utils.py

Lines changed: 4 additions & 4 deletions
@@ -30,8 +30,8 @@ class WorkerExtension:
     """
     The class for vLLM's worker to inherit from.
     By defining an extension class, the code can work no matter what is
-    the underlying worker class. This way, the code can be compatible
-    with both vLLM V0 and V1.
+    the underlying worker class.
+
     NOTE: we define this class in a separate module, and the main module
     should pass the full qualified name as `worker_extension_cls` argument.
     """
@@ -96,8 +96,8 @@ class ColocateWorkerExtension:
     """
    The class for vLLM's worker to inherit from, in the colocate setting.
     By defining an extension class, the code can work no matter what is
-    the underlying worker class. This way, the code can be compatible
-    with both vLLM V0 and V1.
+    the underlying worker class.
+
     NOTE: we define this class in a separate module, and the main module
     should pass the full qualified name as `worker_extension_cls` argument.
     """

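As the docstrings say, only the fully qualified class name is handed to vLLM; a hedged usage sketch (the module path, model, and the `check_weights_changed` method are taken from the wider RLHF example, not from this diff):

```python
# Sketch of plugging a worker extension into the engine (values are illustrative).
from vllm import LLM

llm = LLM(
    model="facebook/opt-125m",
    # Fully qualified name of the extension class; assumes rlhf_utils is importable
    # (e.g. examples/offline_inference is on PYTHONPATH).
    worker_extension_cls="rlhf_utils.WorkerExtension",
)

# Methods defined on the extension run inside every worker via RPC.
# `check_weights_changed` is a method of WorkerExtension in the broader RLHF
# example; any other method name here would be an assumption.
results = llm.collective_rpc("check_weights_changed", args=())
print(results)  # one return value per worker
```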
examples/offline_inference/save_sharded_state.py

Lines changed: 3 additions & 16 deletions
@@ -67,22 +67,9 @@ def main(args):
     Path(args.output).mkdir(exist_ok=True)
     # Dump worker states to output directory
 
-    # Check which engine version is being used
-    is_v1_engine = hasattr(llm.llm_engine, "engine_core")
-
-    if is_v1_engine:
-        # For V1 engine, we need to use engine_core.save_sharded_state
-        print("Using V1 engine save path")
-        llm.llm_engine.engine_core.save_sharded_state(
-            path=args.output, pattern=args.file_pattern, max_size=args.max_file_size
-        )
-    else:
-        # For V0 engine
-        print("Using V0 engine save path")
-        model_executor = llm.llm_engine.model_executor
-        model_executor.save_sharded_state(
-            path=args.output, pattern=args.file_pattern, max_size=args.max_file_size
-        )
+    llm.llm_engine.engine_core.save_sharded_state(
+        path=args.output, pattern=args.file_pattern, max_size=args.max_file_size
+    )
 
     # Copy metadata files to output directory
     for file in os.listdir(model_path):
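
The simplified save path above pairs with the sharded-state load format when the checkpoint is consumed later; a brief sketch (the output path and parallel settings are placeholders):

```python
# Sketch: loading a checkpoint written by save_sharded_state.py.
from vllm import LLM

llm = LLM(
    model="/path/to/sharded/output",  # directory produced by save_sharded_state.py
    load_format="sharded_state",      # read the pre-sharded worker states directly
    tensor_parallel_size=1,           # should match the TP size used when saving
)
print(llm.generate(["Hello"])[0].outputs[0].text)
```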

examples/offline_inference/spec_decode.py

Lines changed: 1 addition & 5 deletions
@@ -158,11 +158,7 @@ def main(args):
         print(f"generated text: {output.outputs[0].text}")
         print("-" * 50)
 
-    try:
-        metrics = llm.get_metrics()
-    except AssertionError:
-        print("Metrics are not supported in the V0 engine.")
-        return
+    metrics = llm.get_metrics()
 
     total_num_output_tokens = sum(
         len(output.outputs[0].token_ids) for output in outputs
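
Since the engine is always V1 now, `get_metrics()` no longer needs the guarded fallback; a short sketch of consuming its output (the attribute layout is summarized from memory and should be treated as an assumption):

```python
# Sketch: reading engine metrics after generation on a V1 LLM.
from vllm import LLM

llm = LLM(model="facebook/opt-125m")
llm.generate(["Hello, my name is"])

for metric in llm.get_metrics():
    # Counters and gauges expose a scalar `value`; histogram-like metrics
    # expose aggregate fields instead, so fall back to printing just the name.
    if hasattr(metric, "value"):
        print(f"{metric.name} = {metric.value}")
    else:
        print(metric.name)
```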

tests/model_executor/model_loader/test_sharded_state_loader.py

Lines changed: 2 additions & 11 deletions
@@ -60,18 +60,9 @@ def llama_3p2_1b_files():
 
 def _run_writer(input_dir, output_dir, weights_patterns, **kwargs):
     llm_sharded_writer = LLM(model=input_dir, **kwargs)
-    # Check which engine version is being used
-    is_v1_engine = hasattr(llm_sharded_writer.llm_engine, "engine_core")
+
     # Dump worker states to output directory
-    if is_v1_engine:
-        # For V1 engine, we need to use engine_core.save_sharded_state
-        print("Using V1 engine save path")
-        llm_sharded_writer.llm_engine.engine_core.save_sharded_state(path=output_dir)
-    else:
-        # For V0 engine
-        print("Using V0 engine save path")
-        model_executor = llm_sharded_writer.llm_engine.model_executor
-        model_executor.save_sharded_state(path=output_dir)
+    llm_sharded_writer.llm_engine.engine_core.save_sharded_state(path=output_dir)
 
     # Copy metadata files to output directory
     for file in os.listdir(input_dir):

tests/tool_use/utils.py

Lines changed: 13 additions & 12 deletions
@@ -140,21 +140,22 @@ def ensure_system_prompt(
         "without calling a tool. DO NOT CALL A TOOL THAT IS IRRELEVANT "
         "to the user's question - just respond to it normally.",
     },
-    # V1 Test: Passing locally but failing in CI. This runs the
-    # V0 Engine because of CPU offloading. Need to debug why.
+    # FIXME: This test currently fails, need to debug why.
     # "granite20b": {
-    #     "model":
-    #     "mbayser/granite-20b-functioncalling-FP8-KV",
+    #     "model": "mbayser/granite-20b-functioncalling-FP8-KV",
     #     "arguments": [
-    #         "--tool-call-parser", "granite-20b-fc", "--chat-template",
-    #         str(VLLM_PATH /
-    #             "examples/tool_chat_template_granite_20b_fc.jinja"),
-    #         "--max_num_seqs", "1", "--enforce-eager", "--cpu-offload-gb", "20"
+    #         "--tool-call-parser",
+    #         "granite-20b-fc",
+    #         "--chat-template",
+    #         str(VLLM_PATH / "examples/tool_chat_template_granite_20b_fc.jinja"),
+    #         "--max_num_seqs",
+    #         "1",
+    #         "--enforce-eager",
+    #         "--cpu-offload-gb",
+    #         "20",
     #     ],
-    #     "supports_parallel":
-    #     False,
-    #     "supports_rocm":
-    #     False,
+    #     "supports_parallel": False,
+    #     "supports_rocm": False,
     # },
     "granite-3.0-8b": {
         "model": "ibm-granite/granite-3.0-8b-instruct",
