Fix OSS 20B vLLM example: add offline-serve workflow (no flash-infer sm7+) - Update run-vllm.md #2041

hrithiksagar-tih · 2025-08-11T06:11:27Z

Summary

This PR updates run-vllm.md to include a tested offline-serve workflow for the OSS 20B model using vLLM. The added snippet replaces the previous example, which failed on GPUs with compute capability below 8.0 with a “Required flash-infer sm7+” error. With the new code, users can load the 20B checkpoint locally, without flash-infer, and obtain correct responses on common data-center GPUs such as the A100, H100, and V100 series.

Motivation

The current cookbook example for running OSS 20B via vLLM does not execute on many setups because:

- vLLM defaults to flash-infer kernels, which require NVIDIA sm80+ GPUs.
- Most researchers running older A100/H100 or consumer RTX cards hit a runtime import error and cannot proceed.

By providing a drop-in replacement that disables flash-infer and switches to offline-serve, this PR:

- Restores out-of-the-box functionality for a widely used 20B checkpoint.
- Saves new users hours of debugging and forum searching.
- Keeps the cookbook authoritative and production-ready.
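
For readers who want a feel for the approach, here is a minimal sketch of an offline vLLM run that steers away from flash-infer on older GPUs. It is an illustration under stated assumptions, not the snippet added to run-vllm.md. The sm80 threshold comes from the motivation above. The helper names (`needs_fallback`, `configure_backend`, `run_offline`) are hypothetical, and the backend value `TRITON_ATTN_VLLM_V1` is an assumption: the names accepted by `VLLM_ATTENTION_BACKEND` vary between vLLM releases, so check the one you have installed.

```python
import os

# Threshold taken from the PR text: flash-infer kernels are said to need sm80+.
FLASHINFER_MIN_CAPABILITY = (8, 0)

# Assumption: this backend name is valid for the installed vLLM release.
# Accepted values differ between versions; adjust as needed.
FALLBACK_BACKEND = "TRITON_ATTN_VLLM_V1"


def needs_fallback(capability):
    """Return True when (major, minor) is below the sm80 threshold."""
    return tuple(capability) < FLASHINFER_MIN_CAPABILITY


def configure_backend(capability, env=os.environ, backend=FALLBACK_BACKEND):
    """Set VLLM_ATTENTION_BACKEND in `env` if the GPU needs a fallback.

    Must be called before vLLM is imported so the setting is picked up.
    Returns True if the variable was set, False otherwise.
    """
    if needs_fallback(capability):
        env["VLLM_ATTENTION_BACKEND"] = backend
        return True
    return False


def run_offline(model_path, prompt, max_tokens=128):
    """Generate one completion with vLLM's offline LLM class (hypothetical helper).

    `model_path` is a local checkpoint directory or a model identifier.
    """
    import torch  # imported lazily so the helpers above work without a GPU

    configure_backend(torch.cuda.get_device_capability())

    from vllm import LLM, SamplingParams  # import after the env var is set

    llm = LLM(model=model_path)
    params = SamplingParams(temperature=0.7, max_tokens=max_tokens)
    outputs = llm.generate([prompt], params)
    return outputs[0].outputs[0].text
```

Calling `run_offline("/path/to/checkpoint", "Hello")` on a machine with vLLM and a CUDA GPU would load the model in-process and return the generated text; the path here is a placeholder. Setting the environment variable before importing vLLM is the design choice that matters, since the backend is selected when the engine starts.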

For new content

When contributing new content, read through our contribution guidelines, and mark the following action items as completed:

I have added a new entry in registry.yaml (and, optionally, in authors.yaml) so that my content renders on the cookbook website.
I have conducted a self-review of my content based on the contribution guidelines:
- Relevance: This content is related to building with OpenAI technologies and is useful to others.
- Uniqueness: I have searched for related examples in the OpenAI Cookbook, and verified that my content offers new insights or unique information compared to existing documentation.
- Spelling and Grammar: I have checked for spelling or grammatical mistakes.
- Clarity: I have done a final read-through and verified that my submission is well-organized and easy to understand.
- Correctness: The information I include is correct and all of my code executes successfully.
- Completeness: I have explained everything fully, including all necessary references and citations.

We will rate each of these areas on a scale from 1 to 4, and will only accept contributions that score 3 or higher on all areas. Refer to our contribution guidelines for more details.

Update run-vllm.md

1c32222
