Merge pull request #269 from runpod-workers/feat/allow-engine-args-env

velaraptor-runpod · web-flow · commit 13fa71878e21 · 2026-02-27T15:34:23.000-06:00
feat: allow all AsyncEngineArgs as env vars
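
The doc changes in this commit describe the lookup rule: uppercase each engine-arg field name, read that env var, cast the value to the field's type, and ignore anything that is not a field. A minimal sketch of that rule follows. It is not the code in `src/engine_args.py`: `DemoEngineArgs`, `ALIASES`, `cast_value`, and `engine_args_from_env` are hypothetical names, and the dataclass is a small stand-in for vLLM's `AsyncEngineArgs` so the example runs without vLLM installed.

```python
import json
from dataclasses import dataclass
from typing import Optional, Union, get_args, get_origin, get_type_hints


@dataclass
class DemoEngineArgs:
    """Hypothetical stand-in for vLLM's AsyncEngineArgs (assumption)."""
    model: str = "facebook/opt-125m"
    max_model_len: Optional[int] = None
    enforce_eager: bool = False
    gpu_memory_utilization: float = 0.9


# Backward-compat alias taken from the docs: MODEL_NAME -> model.
ALIASES = {"MODEL_NAME": "model"}


def _base_type(tp):
    """Unwrap Optional[X] to X; reduce generics such as dict[str, int] to dict."""
    if get_origin(tp) is Union:
        return [a for a in get_args(tp) if a is not type(None)][0]
    return get_origin(tp) or tp


def cast_value(raw: str, tp):
    """Cast a raw env string to the field type; JSON for dict/list/tuple."""
    base = _base_type(tp)
    if base is bool:
        return raw.strip().lower() in ("1", "true", "yes")
    if base in (int, float, str):
        return base(raw)
    return json.loads(raw)


def engine_args_from_env(env: dict) -> DemoEngineArgs:
    """Build engine args from a mapping such as os.environ."""
    hints = get_type_hints(DemoEngineArgs)
    kwargs = {}
    # Aliases first, so an explicit uppercased field name overrides them.
    for env_name, field_name in ALIASES.items():
        if env_name in env:
            kwargs[field_name] = cast_value(env[env_name], hints[field_name])
    # Only known fields are looked up, so unknown keys are never applied.
    for field_name, tp in hints.items():
        raw = env.get(field_name.upper())
        if raw is not None:
            kwargs[field_name] = cast_value(raw, tp)
    return DemoEngineArgs(**kwargs)
```

Calling `engine_args_from_env(os.environ)` would apply the same rule to the real process environment. Letting the uppercased field name win over an alias is a design choice of this sketch, not something the docs specify.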
diff --git a/.runpod/README.md b/.runpod/README.md
@@ -28,6 +28,8 @@ All behaviour is controlled through environment variables:
 | `OPENAI_SERVED_MODEL_NAME_OVERRIDE` | Override served model name in API                 |                     | String                                                             |
 | `MAX_CONCURRENCY`                   | Maximum concurrent requests                       | 300                 | Integer                                                            |
 
+**Pass any vLLM engine arg** not listed above by setting an env var with the **UPPERCASED** field name (e.g. `MAX_MODEL_LEN=4096`, `ENABLE_CHUNKED_PREFILL=true`). The worker auto-discovers all `AsyncEngineArgs` fields from env. See the [vLLM engine args docs](https://docs.vllm.ai/en/latest/configuration/engine_args) for all available options.
+
 For complete configuration options, see the [full configuration documentation](https://github.com/runpod-workers/worker-vllm/blob/main/docs/configuration.md).
 
 ## API Usage
diff --git a/README.md b/README.md
@@ -59,6 +59,16 @@ Configure worker-vllm using environment variables:
 | `OPENAI_SERVED_MODEL_NAME_OVERRIDE` | Override served model name in API                 |                     | String                                                             |
 | `MAX_CONCURRENCY`                   | Maximum concurrent requests                       | 30                  | Integer                                                            |
 
+**Pass any vLLM engine arg** not listed above by setting an environment variable with the **UPPERCASED** field name (same names vLLM uses). The worker auto-discovers all `AsyncEngineArgs` fields from env. For example:
+
+| Environment Variable      | vLLM Engine Arg          | Example Value |
+| ------------------------- | ------------------------ | ------------- |
+| `MAX_MODEL_LEN`           | `max_model_len`          | `4096`        |
+| `ENFORCE_EAGER`           | `enforce_eager`          | `true`        |
+| `ENABLE_CHUNKED_PREFILL`  | `enable_chunked_prefill` | `true`        |
+
+Any env var whose name matches a valid `AsyncEngineArgs` field (uppercased) is applied automatically. Backward-compat aliases: `MODEL_NAME`, `TOKENIZER_NAME`, `MAX_CONTEXT_LEN_TO_CAPTURE`, `MODEL_REVISION`. This lets you configure any vLLM option without waiting for explicit worker support.
+
 For the complete list of all available environment variables, examples, and detailed descriptions: **[Configuration](docs/configuration.md)**
 
 ## Option 2: Build Docker Image with Model Inside
diff --git a/docs/configuration.md b/docs/configuration.md
@@ -156,6 +156,29 @@ The way this works is that the first request will have a batch size of `DEFAULT_
 | `DISABLE_LOGGING_REQUEST`   | False   | `bool`  | Disable logging requests.                                                                                                                              |
 | `MAX_LOG_LEN`               | None    | `int`   | Max number of prompt characters or prompt ID numbers being printed in log.                                                                             |
 
+## Pass Any Engine Arg via UPPERCASED Env Vars
+
+Any vLLM `AsyncEngineArgs` field can be set via an environment variable using the **UPPERCASED** field name (the same names vLLM uses). The worker auto-discovers all fields from the environment; no prefix is needed.
+
+**Format:** `<FIELD_NAME_UPPERCASED>=<value>` (e.g. `MAX_MODEL_LEN=4096`)
+
+**Examples:**
+
+| Environment Variable     | vLLM Engine Arg          | Example Value |
+| ------------------------ | ------------------------ | ------------- |
+| `MAX_MODEL_LEN`          | `max_model_len`          | `4096`        |
+| `ENFORCE_EAGER`          | `enforce_eager`          | `true`        |
+| `ENABLE_CHUNKED_PREFILL` | `enable_chunked_prefill` | `true`        |
+| `NUM_SCHEDULER_STEPS`    | `num_scheduler_steps`    | `8`           |
+| `TOKENIZER_POOL_SIZE`    | `tokenizer_pool_size`    | `4`           |
+
+**Backward-compat aliases:** `MODEL_NAME` → `model`, `TOKENIZER_NAME` → `tokenizer`, `MAX_CONTEXT_LEN_TO_CAPTURE` → `max_seq_len_to_capture`, `MODEL_REVISION` → `revision`.
+
+**Notes:**
+- Only valid `AsyncEngineArgs` fields are applied. Unknown keys are silently ignored.
+- Values are automatically cast to the correct type (`int`, `float`, `bool`, `str`, or JSON for `dict`/`list`/`tuple`).
+- For a full list of available engine args, see the [vLLM AsyncEngineArgs documentation](https://docs.vllm.ai/en/latest/configuration/engine_args/).
+
 ## Docker Build Arguments
 
 These variables are used when building custom Docker images with models baked in:
diff --git a/src/engine_args.py b/src/engine_args.py