Update vLLM examples to float32 and CPU-only execution

chrismoroney · chrismoroney · commit 769922675f6a · 2025-10-16T11:09:38.000-07:00
diff --git a/content/learning-paths/servers-and-cloud-computing/vllm/vllm-run.md b/content/learning-paths/servers-and-cloud-computing/vllm/vllm-run.md
@@ -31,21 +31,28 @@ To run inference with multiple prompts, you can create a simple Python script to
 Use a text editor to save the Python script below in a file called `batch.py`:
 
 ```python
+import os
 import json
 from vllm import LLM, SamplingParams
 
+# Force CPU-only execution
+os.environ["CUDA_VISIBLE_DEVICES"] = ""
+
 # Sample prompts.
 prompts = [
     "Write a hello world program in C",
     "Write a hello world program in Java",
     "Write a hello world program in Rust",
 ]
 
+# Modify model here
+MODEL = "Qwen/Qwen2.5-0.5B-Instruct"
+
 # Create a sampling params object.
 sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=256)
 
 # Create an LLM.
-llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct", dtype="bfloat16")
+llm = LLM(model=MODEL, dtype="float32", enforce_eager=True, tensor_parallel_size=1)
 
 # Generate texts from the prompts. The output is a list of RequestOutput objects
 # that contain the prompt, generated text, and other information.
diff --git a/content/learning-paths/servers-and-cloud-computing/vllm/vllm-server.md b/content/learning-paths/servers-and-cloud-computing/vllm/vllm-server.md
@@ -19,7 +19,7 @@ OpenAI compatibility means that you can reuse existing software which was design
 Run vLLM with the same `Qwen/Qwen2.5-0.5B-Instruct` model:
 
 ```bash
-python3 -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-0.5B-Instruct --dtype float16
+python3 -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-0.5B-Instruct --dtype float32
 ```
 
 The server output displays that it is ready for requests: