Description
What feature would you like to see?
I propose a small extension to LocalProvider.launch() that lets users pass arbitrary SGLang server flags through via:
lm.launch_kwargs = {"extra_args": "..."}
This enables use cases like controlling CUDA graph batch sizes, enabling/disabling graph capture, tuning KV cache behavior, etc., without modifying dspy internals.
Motivation
SGLang exposes many runtime configuration flags that matter for performance (e.g. --cuda-graph-max-bs, --cuda-graph-bs, --disable-cuda-graph).
Currently, dspy offers no way to pass these flags through when launching a local model, because the launch command is fixed to --model-path, --port and --host. This limits advanced performance tuning, especially for GPU-memory-constrained deployments such as Modal instances.
Implementation Details
A small patch inside LocalProvider.launch(); only the lines under the "Allow user to supply extra CLI arguments" comment are new:
timeout = launch_kwargs.get("timeout", 1800)
command = f"python -m sglang.launch_server --model-path {model} --port {port} --host 0.0.0.0"

# Allow user to supply extra CLI arguments, e.g. CUDA graph flags
extra_args = launch_kwargs.get("extra_args")
if extra_args:
    command = f"{command} {extra_args}"

# We will manually stream & capture logs.
process = subprocess.Popen(
    command.replace("\\\n", " ").replace("\\", " ").split(),
    # ... remaining existing arguments unchanged ...
)
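One caveat with the patch as written: extra_args is concatenated into the command string, which is then split on whitespace (after backslashes are replaced), so a flag value that itself contains spaces, such as a file path, would be broken into several arguments. A minimal sketch of an alternative, assuming the launcher is free to build the argument list directly: `build_sglang_argv` is a hypothetical helper (not part of dspy) that keeps the same base command, tokenizes a string `extra_args` with `shlex.split` so shell-style quoting is respected, and, as an additional assumption, also accepts an already-split list.

```python
import shlex
from typing import Any, Mapping


def build_sglang_argv(model: str, port: int, launch_kwargs: Mapping[str, Any]) -> list[str]:
    """Hypothetical helper: build the SGLang launch command as an argv list.

    Uses the same base command as the proposed patch, then appends any
    user-supplied extra_args without going through str.split().
    """
    argv = [
        "python", "-m", "sglang.launch_server",
        "--model-path", model,
        "--port", str(port),
        "--host", "0.0.0.0",
    ]
    extra_args = launch_kwargs.get("extra_args")
    if isinstance(extra_args, str):
        # shlex.split honours shell-style quoting, so "a b" stays one token.
        argv.extend(shlex.split(extra_args))
    elif extra_args:
        # Assumption: a list/tuple of already-separated tokens is also accepted.
        argv.extend(str(arg) for arg in extra_args)
    return argv


if __name__ == "__main__":
    # "my-org/my-model" and port 30000 are placeholder values.
    print(build_sglang_argv(
        "my-org/my-model",
        30000,
        {"extra_args": "--cuda-graph-max-bs 96 --cuda-graph-bs 1 8 32 64 96"},
    ))
```

The resulting list could be handed to subprocess.Popen as-is, in place of the command.replace(...).split() expression; the design choice is simply to avoid round-tripping arguments through a single string.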
Example Usage
lm.launch_kwargs = {
"timeout": 1800,
"extra_args": "--cuda-graph-max-bs 96 --cuda-graph-bs 1 8 32 64 96"
}
lm.launch()
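Hand-writing the flag string becomes error-prone once several options are involved. As a sketch only, a hypothetical `format_sglang_flags` helper (not part of dspy or SGLang) could build the extra_args value from a dict. It assumes SGLang's boolean options are bare switches, as --disable-cuda-graph is, and that list-valued options such as --cuda-graph-bs take space-separated values, as in the example above.

```python
import shlex
from typing import Any, Mapping


def format_sglang_flags(options: Mapping[str, Any]) -> str:
    """Hypothetical helper: turn {"cuda_graph_max_bs": 96, ...} into a CLI string.

    - underscores in keys become dashes and a "--" prefix is added
    - True emits a bare switch (e.g. --disable-cuda-graph); False/None are skipped
    - lists/tuples emit one token per element (e.g. --cuda-graph-bs 1 8 32)
    """
    tokens: list[str] = []
    for key, value in options.items():
        flag = "--" + key.replace("_", "-")
        if value is None or value is False:
            continue  # option not set: emit nothing
        if value is True:
            tokens.append(flag)
        elif isinstance(value, (list, tuple)):
            tokens.append(flag)
            tokens.extend(str(v) for v in value)
        else:
            tokens.extend([flag, str(value)])
    # shlex.join leaves simple tokens unquoted and quotes anything unsafe.
    return shlex.join(tokens)


if __name__ == "__main__":
    print(format_sglang_flags({"cuda_graph_max_bs": 96, "cuda_graph_bs": [1, 8, 32, 64, 96]}))
    # prints: --cuda-graph-max-bs 96 --cuda-graph-bs 1 8 32 64 96
```

The returned string could then be assigned to lm.launch_kwargs["extra_args"]. Because shlex.join leaves simple tokens such as 96 untouched, the output for these flags matches the hand-written string exactly.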
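The proposed change is backwards compatible: when launch_kwargs has no "extra_args" key, launch_kwargs.get("extra_args") returns None, the new branch is skipped, and the launch command is exactly the same as today, so existing users are unaffected.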
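The backwards-compatibility claim can be checked in isolation. This sketch copies only the command-building lines into two hypothetical functions, `command_before` (today's behaviour) and `command_after` (with the patch); it does not import dspy or launch anything, and "my-model" and port 30000 are placeholders.

```python
from typing import Any, Mapping


def command_before(model: str, port: int, launch_kwargs: Mapping[str, Any]) -> str:
    # Command as built today: launch_kwargs cannot add any flags.
    return f"python -m sglang.launch_server --model-path {model} --port {port} --host 0.0.0.0"


def command_after(model: str, port: int, launch_kwargs: Mapping[str, Any]) -> str:
    # Command as built with the proposed patch.
    command = f"python -m sglang.launch_server --model-path {model} --port {port} --host 0.0.0.0"
    extra_args = launch_kwargs.get("extra_args")
    if extra_args:  # None and "" are both falsy, so neither changes the command
        command = f"{command} {extra_args}"
    return command


if __name__ == "__main__":
    for kwargs in ({}, {"timeout": 1800}, {"extra_args": None}, {"extra_args": ""}):
        assert command_before("my-model", 30000, kwargs) == command_after("my-model", 30000, kwargs)
    print(command_after("my-model", 30000, {"extra_args": "--disable-cuda-graph"}))
    # prints: python -m sglang.launch_server --model-path my-model --port 30000 --host 0.0.0.0 --disable-cuda-graph
```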
Would you like to contribute?
- Yes, I'd like to help implement this.
- No, I just want to request it.
Additional Context
No response