<div align="center">

<h1>SGLang Worker</h1>

🚀 | SGLang is a fast serving framework for large language models and vision language models.

</div>

## RunPod Worker Images
Below is a summary of the available RunPod Worker images, categorized by image stability.

| Stable Image Tag                    | Development Image Tag            |
| ----------------------------------- | -------------------------------- |
| `runpod/worker-sglang:v0.4.1stable` | `runpod/worker-sglang:v0.4.1dev` |
## 📖 | Getting Started
1. Clone this repository.
2. Build a Docker image: `docker build -t <your_username>/worker-sglang:v1 .`
3. Push the image: `docker push <your_username>/worker-sglang:v1`

**_Once you have built the Docker image and deployed the endpoint, you can use the code below to interact with the endpoint_**:
```python
import runpod

# Set your RunPod API key and the ID of your deployed endpoint
runpod.api_key = "your_runpod_api_key"
endpoint = runpod.Endpoint("your_endpoint_id")

# Start an asynchronous run request with your model input
run_request = endpoint.run({"your_model_input_key": "your_model_input_value"})

# Check the current status of the run request
print(run_request.status())

# Get the output of the endpoint run request, blocking until the run is complete
print(run_request.output())
```
### OpenAI compatible API
```python
from openai import OpenAI
import os

# Point the OpenAI client at your deployed RunPod endpoint.
# <endpoint_id> is a placeholder for your endpoint's ID; authentication
# uses your RunPod API key.
client = OpenAI(
    api_key=os.environ.get("RUNPOD_API_KEY"),
    base_url="https://api.runpod.ai/v2/<endpoint_id>/openai/v1",
)
```
`Chat Completions (Non-Streaming)`
```python
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Give two lines on Planet Earth."}],
    temperature=0,
    max_tokens=100,
)
print(f"Response: {response}")
```
`Chat Completions (Streaming)`
```python
response_stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Give two lines on Planet Earth."}],
    temperature=0,
    max_tokens=100,
    stream=True,
)
for response in response_stream:
    print(response.choices[0].delta.content or "", end="", flush=True)
```
## SGLang Server Configuration

When launching an endpoint, you can configure the SGLang server using environment variables. These variables allow you to customize various aspects of the server's behavior without modifying the code.
### How to Use

The SGLang server will read these variables at startup and configure itself accordingly. If a variable is not set, the server will use its default value.
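A minimal sketch of the fallback behavior described above, reading a variable and falling back to its documented default when unset (the defaults shown are the documented ones for `MODEL_PATH` and `PORT`):

```python
import os

# Each setting is read from the environment at startup; when a variable
# is not set, the documented default is used instead.
model_path = os.getenv("MODEL_PATH", "meta-llama/Meta-Llama-3-8B-Instruct")
port = int(os.getenv("PORT", "30000"))

print(f"Serving {model_path} on port {port}")
```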
### Available Environment Variables

The following table lists all available environment variables for configuring the SGLang server:

| Environment Variable | Description | Default | Options |
| -------------------- | ----------- | ------- | ------- |
| `MODEL_PATH` | Path of the model weights | "meta-llama/Meta-Llama-3-8B-Instruct" | Local folder or Hugging Face repo ID |
| `HOST` | Host of the server | "0.0.0.0" | |
| `PORT` | Port of the server | 30000 | |
| `TOKENIZER_PATH` | Path of the tokenizer | | |
| `ADDITIONAL_PORTS` | Additional ports for the server | | |
| `TOKENIZER_MODE` | Tokenizer mode | "auto" | "auto", "slow" |
| `LOAD_FORMAT` | Format of model weights to load | "auto" | "auto", "pt", "safetensors", "npcache", "dummy" |
| `DTYPE` | Data type for weights and activations | "auto" | "auto", "half", "float16", "bfloat16", "float", "float32" |
| `CONTEXT_LENGTH` | Model's maximum context length | | |
| `QUANTIZATION` | Quantization method | | "awq", "fp8", "gptq", "marlin", "gptq_marlin", "awq_marlin", "squeezellm", "bitsandbytes" |
| `SERVED_MODEL_NAME` | Override model name in API | | |
| `CHAT_TEMPLATE` | Chat template name or path | | |
| `MEM_FRACTION_STATIC` | Fraction of memory for static allocation | | |
| `MAX_RUNNING_REQUESTS` | Maximum number of running requests | | |
| `MAX_NUM_REQS` | Maximum requests in memory pool | | |
| `MAX_TOTAL_TOKENS` | Maximum tokens in memory pool | | |
| `CHUNKED_PREFILL_SIZE` | Max tokens in chunk for chunked prefill | | |
| `MAX_PREFILL_TOKENS` | Max tokens in prefill batch | | |
| `SCHEDULE_POLICY` | Request scheduling policy | | "lpm", "random", "fcfs", "dfs-weight" |
| `SCHEDULE_CONSERVATIVENESS` | Conservativeness of schedule policy | | |
| `TENSOR_PARALLEL_SIZE` | Tensor parallelism size | | |
| `STREAM_INTERVAL` | Streaming interval in token length | | |
| `RANDOM_SEED` | Random seed | | |
| `LOG_LEVEL` | Logging level for all loggers | | |
| `LOG_LEVEL_HTTP` | Logging level for HTTP server | | |
| `API_KEY` | API key for the server | | |
| `FILE_STORAGE_PTH` | Path of file storage in backend | | |
| `DATA_PARALLEL_SIZE` | Data parallelism size | | |
| `LOAD_BALANCE_METHOD` | Load balancing strategy | | "round_robin", "shortest_queue" |
| `NCCL_INIT_ADDR` | NCCL init address for multi-node | | |
| `NNODES` | Number of nodes | | |
| `NODE_RANK` | Node rank | | |

**Boolean Flags** (set to "true", "1", or "yes" to enable):

| Flag | Description |
| ---- | ----------- |
| `SKIP_TOKENIZER_INIT` | Skip tokenizer init |
| `TRUST_REMOTE_CODE` | Allow custom models from Hub |
| `LOG_REQUESTS` | Log inputs and outputs of requests |
| `SHOW_TIME_COST` | Show time cost of custom marks |
| `DISABLE_FLASHINFER` | Disable flashinfer attention kernels |
| `DISABLE_FLASHINFER_SAMPLING` | Disable flashinfer sampling kernels |
| `DISABLE_RADIX_CACHE` | Disable RadixAttention for prefix caching |
| `DISABLE_REGEX_JUMP_FORWARD` | Disable regex jump-forward |
| `DISABLE_CUDA_GRAPH` | Disable cuda graph |
| `DISABLE_DISK_CACHE` | Disable disk cache |
| `ENABLE_TORCH_COMPILE` | Optimize model with torch.compile |
| `ENABLE_P2P_CHECK` | Enable P2P check for GPU access |
| `ENABLE_MLA` | Enable Multi-head Latent Attention |
| `ATTENTION_REDUCE_IN_FP32` | Cast attention results to fp32 |
| `EFFICIENT_WEIGHT_LOAD` | Enable memory efficient weight loading |
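The flag parsing described above ("true", "1", or "yes") can be sketched as a small helper (a hypothetical illustration, not the worker's actual code):

```python
import os

def env_flag(name: str) -> bool:
    # A flag is enabled when set to "true", "1", or "yes" (case-insensitive).
    return os.getenv(name, "").strip().lower() in {"true", "1", "yes"}

# Example usage with one of the flags from the table above
os.environ["ENABLE_TORCH_COMPILE"] = "yes"
print(env_flag("ENABLE_TORCH_COMPILE"))  # True
os.environ["ENABLE_TORCH_COMPILE"] = "no"
print(env_flag("ENABLE_TORCH_COMPILE"))  # False
```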
## 💡 | Note

This is an initial and preview phase of the worker's development.