* Initial support for TP
* Use random initialization
* Fix PP forward
* Downgrade to torch 2.6.0
* Fix env setting for MAX_JOBS
* Downgrade to torch 2.5.1
* Fix TP group init
* Fix annotation
* Make llama compatible with TP
* Make chatglm compatible with TP
* Make Qwen3 compatible with TP
* Remove weight_loader in fused_moe
* Make fused_moe compatible with TP; Abstract weight load function
* Make qwen_moe compatible with TP
* Make mixtral compatible with TP
* Update readme
* Abstract module attention; Clean up code for TP attention; Clean up code for model weights loading for glm
* Add MoE tuning config for A100 PCIE 40GB
* Refactor scheduler.py and AllocatorID
* Refactor IDAllocator
* Refactor worker scheduler
* Update readme
* Make embed_tokens and lm_head compatible with TP
* Fix multi-node zmq_comm
* Bump version to 0.1.0
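Many of the commits above reshard per-layer weights so each tensor-parallel rank holds only a slice of the model. A minimal NumPy sketch of the column-parallel scheme (illustrative only — gLLM's actual implementation uses PyTorch and distributed collectives; all names here are hypothetical):

```python
# Illustrative column-parallel tensor parallelism for a linear layer.
# Each "rank" holds a column shard of the weight matrix; concatenating the
# per-shard outputs reproduces the full matmul, so the sharded model is
# numerically equivalent to the unsharded one.
import numpy as np

def shard_columns(weight: np.ndarray, tp_degree: int) -> list[np.ndarray]:
    """Split a (in_features, out_features) weight into tp_degree column shards."""
    return np.split(weight, tp_degree, axis=1)

def column_parallel_forward(x: np.ndarray, shards: list[np.ndarray]) -> np.ndarray:
    # Each rank computes x @ shard locally; an all-gather (modeled here as a
    # simple concatenate) assembles the full output.
    return np.concatenate([x @ w for w in shards], axis=-1)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
w = rng.standard_normal((8, 16))
full = x @ w
sharded = column_parallel_forward(x, shard_columns(w, tp_degree=4))
assert np.allclose(full, sharded)
```

Row-parallel layers work dually: the input is split, each rank computes a partial sum, and an all-reduce combines the partials.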
README.md: 6 additions & 6 deletions
@@ -15,9 +15,10 @@ Global Balanced Pipeline Parallelism System for Distributed LLM Serving with Tok
 <img src=doc/pic/overview.svg width=500>
 </p>
 
-Integreted with features like **continuous batching**, **paged attention**, **chunked prefill**, **prefix caching**, **token throttling**and **pipeline parallelism**, gLLM provides basic functionality (offline/online inference and interactive chat) to support large language model inference. gLLM provides **equivalent or superior** offline/online inference speed with mainstream inference engine and **minimal** (~4k loc) code base. You can also see gLLM as a LLM inference playground for doing experiment or academic research.
+Integrated with features like **continuous batching**, **paged attention**, **chunked prefill**, **prefix caching**, **token throttling**, **pipeline parallelism** and **tensor parallelism**, gLLM provides basic functionality (**offline/online inference and interactive chat**) for deploying distributed inference of LLMs (**supported on Hugging Face**). gLLM provides **equivalent or superior** offline/online inference speed compared with mainstream inference engines and a **minimal** (~6k loc) code base. You can also see gLLM as an LLM inference playground for experiments or academic research.
 
 *Latest News*:fire:
+-[2025/06/14]: Tensor parallelism is now integrated, allowing joint deployment with pipeline parallelism :sunglasses:
 -[2025/05/05]: MoE architecture is supported. Try Qwen2/3 MoE models :star_struck:
 -[2025/04/29]: Qwen3 day 1 support. Come and try Qwen3 :tada:
 -[2025/04/27]: gLLM is open sourced :earth_asia:
@@ -43,7 +44,7 @@ Integreted with features like **continuous batching**, **paged attention**, **ch
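The news item above announces joint deployment of tensor and pipeline parallelism. Under the usual convention that each pipeline stage is served by one tensor-parallel group (so the total worker count is pp × tp), rank grouping can be sketched as follows — function name and layout are hypothetical, not gLLM's actual API:

```python
# Hypothetical rank-to-group mapping for joint pipeline + tensor parallelism.
# With pp pipeline stages and tp tensor-parallel ranks per stage, the world
# size is pp * tp; consecutive ranks form one TP group per stage.
def build_parallel_groups(pp: int, tp: int) -> list[list[int]]:
    """Return one TP rank group per pipeline stage."""
    world_size = pp * tp
    ranks = list(range(world_size))
    return [ranks[stage * tp:(stage + 1) * tp] for stage in range(pp)]

groups = build_parallel_groups(pp=2, tp=4)
# 8 workers total: stage 0 holds ranks 0-3, stage 1 holds ranks 4-7.
assert groups == [[0, 1, 2, 3], [4, 5, 6, 7]]
```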
gllm/entrypoints/api_server.py: 2 additions & 0 deletions
@@ -100,6 +100,7 @@ async def run_server(args):
     parser.add_argument('--use-naive-schedule', help='Use scheduling policy in Sarathi-Serve', action='store_true')
     parser.add_argument('--enable-prefix-caching', help='Enable KV cache reuse across requests', action='store_true')
     parser.add_argument('--pp', type=int, help='Number of pipeline stages', default=1)
+    parser.add_argument('--tp', type=int, help='Number of tensor parallel degrees', default=1)
     parser.add_argument('--load-format', type=str, choices=['auto','dummy'], help='auto: actually load model weights; dummy: initialize the model with random values', default='auto')
     parser.add_argument('--assigned-layers', type=str, help='If the model have 64 layers, we can set it to 16,16,16,16 or 16,16,17,15', default=None)
     parser.add_argument('--use-async-worker', help='Experimental feature for worker implemented by async', action='store_true')
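A self-contained way to exercise the new `--tp` flag alongside the existing parallelism options, using only the arguments and defaults shown in the diff above (this is a standalone sketch, not the server's actual entrypoint):

```python
# Standalone reconstruction of the parallelism-related CLI flags from the
# diff above, to show how the new --tp option parses next to --pp.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--pp', type=int, help='Number of pipeline stages', default=1)
parser.add_argument('--tp', type=int, help='Number of tensor parallel degrees', default=1)
parser.add_argument('--assigned-layers', type=str,
                    help='If the model have 64 layers, we can set it to 16,16,16,16 or 16,16,17,15',
                    default=None)

# Both degrees default to 1, so existing single-GPU invocations are unchanged.
args = parser.parse_args(['--pp', '2', '--tp', '4'])
assert args.pp == 2 and args.tp == 4 and args.assigned_layers is None
```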