Commit d1c92f2

get rid of like 80% of the code
1 parent 9b901b9 commit d1c92f2

File tree

10 files changed: +394 additions, −487 deletions


Dockerfile

Lines changed: 31 additions & 38 deletions
````diff
@@ -1,49 +1,42 @@
 FROM nvidia/cuda:12.1.0-base-ubuntu22.04
 
+ENV DEBIAN_FRONTEND=noninteractive
 RUN apt-get update -y \
+    && apt-get dist-upgrade -y \
     && apt-get install -y python3-pip
 
 RUN ldconfig /usr/local/cuda-12.1/compat/
 
-# Install Python dependencies
-COPY builder/requirements.txt /requirements.txt
+# install sglang's dependencies
+
+# EFRON:
+# these guys are unbelievably huge - >80GiB. Took well over ten minutes to install on my machine and used 28GiB(!) of RAM.
+# we should consider having a base image with them pre-installed or seeing if we can knock it down a little bit.
+RUN python3 -m pip install "sglang[all]"
+RUN python3 -m pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.3
+
+
+# install _our_ dependencies
 RUN --mount=type=cache,target=/root/.cache/pip \
     python3 -m pip install --upgrade pip && \
     python3 -m pip install --upgrade -r /requirements.txt
 
-RUN python3 -m pip install "sglang[all]" && \
-    python3 -m pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.3
-
-# Setup for Option 2: Building the Image with the Model included
-ARG MODEL_NAME=""
-ARG TOKENIZER_NAME=""
-ARG BASE_PATH="/runpod-volume"
-ARG QUANTIZATION=""
-ARG MODEL_REVISION=""
-ARG TOKENIZER_REVISION=""
-
-ENV MODEL_NAME=$MODEL_NAME \
-    MODEL_REVISION=$MODEL_REVISION \
-    TOKENIZER_NAME=$TOKENIZER_NAME \
-    TOKENIZER_REVISION=$TOKENIZER_REVISION \
-    BASE_PATH=$BASE_PATH \
-    QUANTIZATION=$QUANTIZATION \
-    HF_DATASETS_CACHE="${BASE_PATH}/huggingface-cache/datasets" \
-    HUGGINGFACE_HUB_CACHE="${BASE_PATH}/huggingface-cache/hub" \
-    HF_HOME="${BASE_PATH}/huggingface-cache/hub" \
-    HF_HUB_ENABLE_HF_TRANSFER=1
-
-ENV PYTHONPATH="/:/vllm-workspace"
-
-
-COPY src /src
-RUN --mount=type=secret,id=HF_TOKEN,required=false \
-    if [ -f /run/secrets/HF_TOKEN ]; then \
-        export HF_TOKEN=$(cat /run/secrets/HF_TOKEN); \
-    fi && \
-    if [ -n "$MODEL_NAME" ]; then \
-        python3 /src/download_model.py; \
-    fi
-
-# Start the handler
-CMD ["python3", "/src/handler.py"]
+RUN mkdir app
+COPY requirements.txt ./app/requirements.txt
+
+# EFRON: no idea what this is doing: leaving it in in case it's important
+ENV BASE_PATH=$BASE_PATH
+ENV HF_DATASETS_CACHE="${BASE_PATH}/huggingface-cache/datasets"
+ENV HF_HOME="${BASE_PATH}/huggingface-cache/hub"
+ENV HF_HUB_ENABLE_HF_TRANSFER=1
+ENV HUGGINGFACE_HUB_CACHE="${BASE_PATH}/huggingface-cache/hub"
+ENV MODEL_NAME=$MODEL_NAME
+ENV MODEL_REVISION=$MODEL_REVISION
+ENV QUANTIZATION=$QUANTIZATION
+ENV TOKENIZER_NAME=$TOKENIZER_NAME
+ENV TOKENIZER_REVISION=$TOKENIZER_REVISION
+
+# not sure why this is here: is a vllm-workspace even in our image?
+ENV PYTHONPATH="/:/vllm-workspace"
+COPY ./src/handler.py ./app/handler.py
+CMD ["python3", "./app/handler.py"] # actually run the handler
````
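The slimmed-down image now passes every model setting to the handler through plain `ENV` variables rather than baking the model in at build time. The real `./app/handler.py` is not part of this diff, so the following is a purely hypothetical sketch (the `build_server_args` helper and the flag names are assumptions, not the worker's actual code) of how a handler could translate those variables into server launch arguments:

```python
import os

# Hypothetical mapping from the ENV vars set in the Dockerfile to server CLI flags.
ENV_TO_FLAG = {
    "MODEL_NAME": "--model-path",
    "MODEL_REVISION": "--revision",
    "TOKENIZER_NAME": "--tokenizer-path",
    "QUANTIZATION": "--quantization",
}


def build_server_args(env=None):
    """Emit a flag/value pair for each variable that is set and non-empty."""
    env = os.environ if env is None else env
    args = []
    for var, flag in ENV_TO_FLAG.items():
        value = env.get(var, "")
        if value:  # unset or empty vars are simply skipped
            args.extend([flag, value])
    return args
```

Skipping unset variables matches how the image behaves when no model is configured: nothing is forwarded, and the server falls back to its defaults.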

README.md

Lines changed: 70 additions & 68 deletions
````diff
@@ -3,25 +3,24 @@
 <h1> SgLang Worker</h1>
 
 🚀 | SGLang is a fast serving framework for large language models and vision language models.
+
 </div>
 
 ## RunPod Worker Images
 
 Below is a summary of the available RunPod Worker images, categorized by image stability
 
-| Stable Image Tag | Development Image Tag |
------------------------------------|-----------------------------------|
-`runpod/worker-sglang:v0.4.1stable` | `runpod/worker-sglang:v0.4.1dev` |
-
+| Stable Image Tag                    | Development Image Tag            |
+| ----------------------------------- | -------------------------------- |
+| `runpod/worker-sglang:v0.4.1stable` | `runpod/worker-sglang:v0.4.1dev` |
 
 ## 📖 | Getting Started
 
 1. Clone this repository.
-2. Build a docker image - ```docker build -t <your_username>:worker-sglang:v1 .```
-3. ```docker push <your_username>:worker-sglang:v1```
+2. Build a docker image - `docker build -t <your_username>:worker-sglang:v1 .`
+3. `docker push <your_username>:worker-sglang:v1`
 
-
-***Once you have built the Docker image and deployed the endpoint, you can use the code below to interact with the endpoint***:
+**_Once you have built the Docker image and deployed the endpoint, you can use the code below to interact with the endpoint_**:
 
 ```
 import runpod
@@ -38,10 +37,11 @@ run_request = endpoint.run({"your_model_input_key": "your_model_input_value"})
 print(run_request.status())
 
 # Get the output of the endpoint run request, blocking until the run is complete
-print(run_request.output())
+print(run_request.output())
 ```
 
 ### OpenAI compatible API
+
 ```python
 from openai import OpenAI
 import os
@@ -54,34 +54,35 @@ client = OpenAI(
 ```
 
 `Chat Completions (Non-Streaming)`
+
 ```python
 response = client.chat.completions.create(
     model="meta-llama/Meta-Llama-3-8B-Instruct",
     messages=[{"role": "user", "content": "Give two lines about Planet Earth."}],
     temperature=0,
     max_tokens=100,
-
+
 )
 print(f"Response: {response}")
 ```
 
 `Chat Completions (Streaming)`
+
 ```python
 response_stream = client.chat.completions.create(
     model="meta-llama/Meta-Llama-3-8B-Instruct",
     messages=[{"role": "user", "content": "Give two lines about Planet Earth."}],
     temperature=0,
    max_tokens=100,
     stream=True
-
+
 )
 for response in response_stream:
     print(response.choices[0].delta.content or "", end="", flush=True)
 ```
 
-
-
 ## SGLang Server Configuration
+
 When launching an endpoint, you can configure the SGLang server using environment variables. These variables allow you to customize various aspects of the server's behavior without modifying the code.
 
 ### How to Use
@@ -91,63 +92,64 @@ The SGLang server will read these variables at startup and configure itself accordingly.
 If a variable is not set, the server will use its default value.
 
 ### Available Environment Variables
-The following table lists all available environment variables for configuring the SGLang server:
 
+The following table lists all available environment variables for configuring the SGLang server:
 
-| Environment Variable | Description | Default | Options |
-|----------------------|-------------|---------|---------|
-| `MODEL_PATH` | Path of the model weights | "meta-llama/Meta-Llama-3-8B-Instruct" | Local folder or Hugging Face repo ID |
-| `HOST` | Host of the server | "0.0.0.0" | |
-| `PORT` | Port of the server | 30000 | |
-| `TOKENIZER_PATH` | Path of the tokenizer | | |
-| `ADDITIONAL_PORTS` | Additional ports for the server | | |
-| `TOKENIZER_MODE` | Tokenizer mode | "auto" | "auto", "slow" |
-| `LOAD_FORMAT` | Format of model weights to load | "auto" | "auto", "pt", "safetensors", "npcache", "dummy" |
-| `DTYPE` | Data type for weights and activations | "auto" | "auto", "half", "float16", "bfloat16", "float", "float32" |
-| `CONTEXT_LENGTH` | Model's maximum context length | | |
-| `QUANTIZATION` | Quantization method | | "awq", "fp8", "gptq", "marlin", "gptq_marlin", "awq_marlin", "squeezellm", "bitsandbytes" |
-| `SERVED_MODEL_NAME` | Override model name in API | | |
-| `CHAT_TEMPLATE` | Chat template name or path | | |
-| `MEM_FRACTION_STATIC` | Fraction of memory for static allocation | | |
-| `MAX_RUNNING_REQUESTS` | Maximum number of running requests | | |
-| `MAX_NUM_REQS` | Maximum requests in memory pool | | |
-| `MAX_TOTAL_TOKENS` | Maximum tokens in memory pool | | |
-| `CHUNKED_PREFILL_SIZE` | Max tokens in chunk for chunked prefill | | |
-| `MAX_PREFILL_TOKENS` | Max tokens in prefill batch | | |
-| `SCHEDULE_POLICY` | Request scheduling policy | | "lpm", "random", "fcfs", "dfs-weight" |
-| `SCHEDULE_CONSERVATIVENESS` | Conservativeness of schedule policy | | |
-| `TENSOR_PARALLEL_SIZE` | Tensor parallelism size | | |
-| `STREAM_INTERVAL` | Streaming interval in token length | | |
-| `RANDOM_SEED` | Random seed | | |
-| `LOG_LEVEL` | Logging level for all loggers | | |
-| `LOG_LEVEL_HTTP` | Logging level for HTTP server | | |
-| `API_KEY` | API key for the server | | |
-| `FILE_STORAGE_PTH` | Path of file storage in backend | | |
-| `DATA_PARALLEL_SIZE` | Data parallelism size | | |
-| `LOAD_BALANCE_METHOD` | Load balancing strategy | | "round_robin", "shortest_queue" |
-| `NCCL_INIT_ADDR` | NCCL init address for multi-node | | |
-| `NNODES` | Number of nodes | | |
-| `NODE_RANK` | Node rank | | |
+| Environment Variable | Description | Default | Options |
+| --------------------------- | ---------------------------------------- | ------------------------------------- | ----------------------------------------------------------------------------------------- |
+| `MODEL_PATH` | Path of the model weights | "meta-llama/Meta-Llama-3-8B-Instruct" | Local folder or Hugging Face repo ID |
+| `HOST` | Host of the server | "0.0.0.0" | |
+| `PORT` | Port of the server | 30000 | |
+| `TOKENIZER_PATH` | Path of the tokenizer | | |
+| `ADDITIONAL_PORTS` | Additional ports for the server | | |
+| `TOKENIZER_MODE` | Tokenizer mode | "auto" | "auto", "slow" |
+| `LOAD_FORMAT` | Format of model weights to load | "auto" | "auto", "pt", "safetensors", "npcache", "dummy" |
+| `DTYPE` | Data type for weights and activations | "auto" | "auto", "half", "float16", "bfloat16", "float", "float32" |
+| `CONTEXT_LENGTH` | Model's maximum context length | | |
+| `QUANTIZATION` | Quantization method | | "awq", "fp8", "gptq", "marlin", "gptq_marlin", "awq_marlin", "squeezellm", "bitsandbytes" |
+| `SERVED_MODEL_NAME` | Override model name in API | | |
+| `CHAT_TEMPLATE` | Chat template name or path | | |
+| `MEM_FRACTION_STATIC` | Fraction of memory for static allocation | | |
+| `MAX_RUNNING_REQUESTS` | Maximum number of running requests | | |
+| `MAX_NUM_REQS` | Maximum requests in memory pool | | |
+| `MAX_TOTAL_TOKENS` | Maximum tokens in memory pool | | |
+| `CHUNKED_PREFILL_SIZE` | Max tokens in chunk for chunked prefill | | |
+| `MAX_PREFILL_TOKENS` | Max tokens in prefill batch | | |
+| `SCHEDULE_POLICY` | Request scheduling policy | | "lpm", "random", "fcfs", "dfs-weight" |
+| `SCHEDULE_CONSERVATIVENESS` | Conservativeness of schedule policy | | |
+| `TENSOR_PARALLEL_SIZE` | Tensor parallelism size | | |
+| `STREAM_INTERVAL` | Streaming interval in token length | | |
+| `RANDOM_SEED` | Random seed | | |
+| `LOG_LEVEL` | Logging level for all loggers | | |
+| `LOG_LEVEL_HTTP` | Logging level for HTTP server | | |
+| `API_KEY` | API key for the server | | |
+| `FILE_STORAGE_PTH` | Path of file storage in backend | | |
+| `DATA_PARALLEL_SIZE` | Data parallelism size | | |
+| `LOAD_BALANCE_METHOD` | Load balancing strategy | | "round_robin", "shortest_queue" |
+| `NCCL_INIT_ADDR` | NCCL init address for multi-node | | |
+| `NNODES` | Number of nodes | | |
+| `NODE_RANK` | Node rank | | |
 
 **Boolean Flags** (set to "true", "1", or "yes" to enable):
 
-| Flag | Description |
-|------|-------------|
-| `SKIP_TOKENIZER_INIT` | Skip tokenizer init |
-| `TRUST_REMOTE_CODE` | Allow custom models from Hub |
-| `LOG_REQUESTS` | Log inputs and outputs of requests |
-| `SHOW_TIME_COST` | Show time cost of custom marks |
-| `DISABLE_FLASHINFER` | Disable flashinfer attention kernels |
-| `DISABLE_FLASHINFER_SAMPLING` | Disable flashinfer sampling kernels |
-| `DISABLE_RADIX_CACHE` | Disable RadixAttention for prefix caching |
-| `DISABLE_REGEX_JUMP_FORWARD` | Disable regex jump-forward |
-| `DISABLE_CUDA_GRAPH` | Disable cuda graph |
-| `DISABLE_DISK_CACHE` | Disable disk cache |
-| `ENABLE_TORCH_COMPILE` | Optimize model with torch.compile |
-| `ENABLE_P2P_CHECK` | Enable P2P check for GPU access |
-| `ENABLE_MLA` | Enable Multi-head Latent Attention |
-| `ATTENTION_REDUCE_IN_FP32` | Cast attention results to fp32 |
-| `EFFICIENT_WEIGHT_LOAD` | Enable memory efficient weight loading |
-
-## 💡 | Note:
-This is an initial and preview phase of the worker's development.
+| Flag | Description |
+| ----------------------------- | ----------------------------------------- |
+| `SKIP_TOKENIZER_INIT` | Skip tokenizer init |
+| `TRUST_REMOTE_CODE` | Allow custom models from Hub |
+| `LOG_REQUESTS` | Log inputs and outputs of requests |
+| `SHOW_TIME_COST` | Show time cost of custom marks |
+| `DISABLE_FLASHINFER` | Disable flashinfer attention kernels |
+| `DISABLE_FLASHINFER_SAMPLING` | Disable flashinfer sampling kernels |
+| `DISABLE_RADIX_CACHE` | Disable RadixAttention for prefix caching |
+| `DISABLE_REGEX_JUMP_FORWARD` | Disable regex jump-forward |
+| `DISABLE_CUDA_GRAPH` | Disable cuda graph |
+| `DISABLE_DISK_CACHE` | Disable disk cache |
+| `ENABLE_TORCH_COMPILE` | Optimize model with torch.compile |
+| `ENABLE_P2P_CHECK` | Enable P2P check for GPU access |
+| `ENABLE_MLA` | Enable Multi-head Latent Attention |
+| `ATTENTION_REDUCE_IN_FP32` | Cast attention results to fp32 |
+| `EFFICIENT_WEIGHT_LOAD` | Enable memory efficient weight loading |
+
+## 💡 | Note:
+
+This is an initial and preview phase of the worker's development.
````
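The README documents its boolean flags as enabled by "true", "1", or "yes". A minimal sketch of that check (the `flag_enabled` helper is illustrative, not the worker's actual parsing code, and case-insensitivity is our assumption):

```python
import os

TRUTHY = {"true", "1", "yes"}  # the values the README's flag table treats as "on"


def flag_enabled(name, env=None):
    """True when the variable is set to true/1/yes (compared case-insensitively - an assumption)."""
    env = os.environ if env is None else env
    return env.get(name, "").strip().lower() in TRUTHY
```

Anything else, including an unset variable or an explicit `"no"`, leaves the flag disabled.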

builder/setup.sh

Lines changed: 0 additions & 23 deletions
This file was deleted.

docker-bake.hcl

Lines changed: 0 additions & 32 deletions
This file was deleted.
Lines changed: 1 addition & 1 deletion
````diff
@@ -1,7 +1,7 @@
 ray
 pandas
 pyarrow
-runpod~=1.7.0
+runpod>=1.7.7
 huggingface-hub
 packaging
 typing-extensions==4.7.1
````
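The pin change is meaningful: `runpod~=1.7.0` is a compatible-release specifier, which caps the package below 1.8, while `runpod>=1.7.7` raises the floor and leaves the ceiling open. A stdlib-only sketch of the two semantics (the helper names are ours, and it handles plain `X.Y.Z` versions only, not pre-release tags):

```python
def parse(version):
    """Split 'X.Y.Z' into a comparable tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def compatible_release(version, base):
    """PEP 440 '~= X.Y.Z': at least X.Y.Z, but still within the X.Y series."""
    v, b = parse(version), parse(base)
    return v >= b and v[: len(b) - 1] == b[: len(b) - 1]


def at_least(version, floor):
    """'>= X.Y.Z': any version from the floor upward."""
    return parse(version) >= parse(floor)
```

So 1.7.7 satisfies both specifiers, but a future 1.8.0 would be rejected by `~=1.7.0` and accepted by `>=1.7.7`.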
