The first request uses a batch size of `DEFAULT_MIN_BATCH_SIZE`, and each subsequent request multiplies the previous batch size by `DEFAULT_BATCH_SIZE_GROWTH_FACTOR`, continuing until the batch size reaches `DEFAULT_BATCH_SIZE`. For example, with the default values the batch sizes are 1, 3, 9, 27, 50, 50, 50, .... You can also override these per request with the inputs `max_batch_size`, `min_batch_size`, and `batch_size_growth_factor`. This has nothing to do with vLLM's internal batching; it only controls how many tokens are sent in each HTTP request from the worker.
| Variable | Default | Type/Choices | Description |
|---|---|---|---|
| `DEFAULT_BATCH_SIZE` | `50` | `int` | Default and maximum batch size for token streaming, used to reduce the number of HTTP calls. |
| `DEFAULT_MIN_BATCH_SIZE` | `1` | `int` | Batch size for the first request, multiplied by the growth factor on each subsequent request. |
| `DEFAULT_BATCH_SIZE_GROWTH_FACTOR` | `3` | `float` | Growth factor for the dynamic batch size. |
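The growth schedule described above can be sketched in a few lines of Python. This is an illustrative model of the documented behavior, not the worker's actual code; the function name and `steps` parameter are made up for the example.

```python
def batch_size_schedule(min_batch_size=1, growth_factor=3, max_batch_size=50, steps=7):
    """Illustrative sketch of the dynamic streaming batch-size schedule:
    start at min_batch_size, multiply by growth_factor each request,
    and cap at max_batch_size."""
    sizes = []
    size = float(min_batch_size)
    for _ in range(steps):
        sizes.append(min(int(size), max_batch_size))
        size *= growth_factor
    return sizes

# With the defaults, this reproduces the sequence from the text above.
print(batch_size_schedule())  # [1, 3, 9, 27, 50, 50, 50]
```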
### OpenAI Compatibility Settings

| Variable | Default | Type/Choices | Description |
|---|---|---|---|
| `RAW_OPENAI_OUTPUT` | `1` | boolean as `int` | Enables raw OpenAI SSE-format string output when streaming. Must be enabled (it is by default) for OpenAI compatibility. |
| `OPENAI_SERVED_MODEL_NAME_OVERRIDE` | `None` | `str` | Overrides the served model name (normally the model repo/path) with the specified name; you can then use that value as the `model` parameter in OpenAI requests. |
| `OPENAI_RESPONSE_ROLE` | `assistant` | `str` | Role of the LLM's response in OpenAI Chat Completions. |
| `ENABLE_AUTO_TOOL_CHOICE` | `false` | `bool` | Enables automatic tool selection for supported models. Set to `true` to activate. |
| `TOOL_CALL_PARSER` | `None` | `str` | Specifies the parser for tool calls. Options: `mistral`, `hermes`, `llama3_json`, `llama4_json`, `llama4_pythonic`, `granite`, `granite-20b-fc`, `deepseek_v3`, `internlm`, `jamba`, `phi4_mini_json`, `pythonic` |
| `REASONING_PARSER` | `None` | `str` | Parser for reasoning-capable models (enables reasoning mode). Examples: `deepseek_r1`, `qwen3`, `granite`, `hunyuan_a13b`. Leave unset to disable. |
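To illustrate how `OPENAI_SERVED_MODEL_NAME_OVERRIDE` affects requests, here is a minimal sketch of an OpenAI-style chat-completion request body. The model name `my-model` is a placeholder for whatever value you set the override to; the messages are illustrative.

```python
import json

# Assuming the worker was started with OPENAI_SERVED_MODEL_NAME_OVERRIDE=my-model,
# OpenAI-compatible requests must reference the overridden name rather than the
# original Hugging Face repo/path.
SERVED_MODEL_NAME = "my-model"  # placeholder: the value of the override

request_body = json.dumps({
    "model": SERVED_MODEL_NAME,  # the override, not e.g. the original repo/path
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": True,  # with RAW_OPENAI_OUTPUT=1, streamed chunks keep the raw OpenAI SSE format
})
print(request_body)
```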
### Serverless & Concurrency Settings

| Variable | Default | Type/Choices | Description |
|---|---|---|---|
| `MAX_CONCURRENCY` | `30` | `int` | Maximum concurrent requests per worker. vLLM has an internal queue, so you don't need to limit this based on VRAM; it is used to improve scaling/load-balancing efficiency. |
| `DISABLE_LOG_STATS` | `False` | `bool` | Enables or disables vLLM stats logging. |
| `DISABLE_LOG_REQUESTS` | `False` | `bool` | Enables or disables vLLM request logging. |
### Advanced Settings

| Variable | Default | Type | Description |
|---|---|---|---|
| `MODEL_LOADER_EXTRA_CONFIG` | `None` | `dict` | Extra configuration for the model loader. |
| `PREEMPTION_MODE` | `None` | `str` | If `recompute`, the engine performs preemption-aware recomputation. If `save`, the engine saves activations to CPU memory when preemption occurs. |
| `PREEMPTION_CHECK_PERIOD` | `1.0` | `float` | How frequently the engine checks whether a preemption has occurred. |
| `PREEMPTION_CPU_CAPACITY` | `2` | `float` | The percentage of CPU memory used for saved activations. |
| `DISABLE_LOGGING_REQUEST` | `False` | `bool` | Disables request logging. |
| `MAX_LOG_LEN` | `None` | `int` | Maximum number of prompt characters or prompt IDs printed in logs. |
### Docker Build Arguments

These variables are used when building custom Docker images with models baked in:

| Variable | Default | Type | Description |
|---|---|---|---|
| `BASE_PATH` | `/runpod-volume` | `str` | Storage directory for the Hugging Face cache and model. |
| `WORKER_CUDA_VERSION` | `12.1.0` | `str` | CUDA version for the worker image. |
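A build invocation passing these arguments might look like the following. The image tag is a placeholder, and the values shown are simply the documented defaults; adjust them for your setup.

```shell
# Hypothetical example: build a custom worker image, overriding the
# documented build arguments. The tag "my-registry/worker-vllm:custom"
# is a placeholder, not a real image.
docker build \
  --build-arg BASE_PATH=/runpod-volume \
  --build-arg WORKER_CUDA_VERSION=12.1.0 \
  -t my-registry/worker-vllm:custom .
```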
### Deprecated Variables

⚠️ The following variables are deprecated and will be removed in future versions: