Releases · gty111/gLLM
v0.0.5
What's Changed
- Try to assign layers evenly by @gty111 in #121
- Use vllm precompiled wheel to simplify build by @gty111 in #122
- Use max_completion_tokens by @gty111 in #123
- Fix mrope for qwen vl by @gty111 in #124
- Use send/recv_pyobj and fix bugs by @gty111 in #125
- Fix is_finish by @gty111 in #126
- Add install.sh to simplify build by @gty111 in #127
- Support fp8 moe by @gty111 in #128
- Fix fp8 moe by @gty111 in #129
- Support Deepseek V3 by @gty111 in #130
- Make install.sh robust by @gty111 in #131
- Remove the "experimental feature for multi-node deployment" hint by @gty111 in #132
- Update Readme by @gty111 in #133
- [1/N] CUDA graph: create input buffer for model runner by @gty111 in #134
- [2/N] CUDA graph: output buffer by @gty111 in #135
- [3/N] CUDA graph: delay memory manager/KV cache init by @gty111 in #136
- [4/N] CUDA graph: add profile run by @gty111 in #137
- Separate sample operation by @gty111 in #139
- Simplify PP data transmission by @gty111 in #140
- [5/N] CUDA graph: support CUDA graph by @gty111 in #141 (see the capture/replay sketch below this release)
- VL fix by @gty111 in #142
- Fix image_grid_thw by @gty111 in #143
- Rename scheduler file by @gty111 in #144
- Add schedule_method and split_pd by @gty111 in #145
- Format by @gty111 in #146
- Bump up to version 0.0.5 by @gty111 in #147
Full Changelog: v0.0.4...v0.0.5
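
The CUDA graph series (#134–#141) follows the standard capture/replay recipe: allocate fixed input/output buffers, warm up on a side stream, record the forward pass once, then replay the recorded kernels each step after copying fresh data into the input buffer. Below is a minimal sketch of that pattern using PyTorch's `torch.cuda.graph` API (requires a CUDA device); the model and buffer names are illustrative, not gLLM's actual identifiers.

```python
import torch

# Illustrative stand-in for a decode-step forward pass.
model = torch.nn.Linear(1024, 1024).cuda().half()

# [1/N]-[2/N]: static input buffer the captured graph will read from.
static_in = torch.zeros(8, 1024, device="cuda", dtype=torch.half)

# [4/N]: warm-up runs on a side stream before capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        model(static_in)
torch.cuda.current_stream().wait_stream(s)

# [5/N]: capture once; the captured output tensor becomes the static output buffer.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_out = model(static_in)

def decode_step(hidden: torch.Tensor) -> torch.Tensor:
    static_in.copy_(hidden)  # refresh the captured input buffer in place
    graph.replay()           # relaunch the recorded kernels
    return static_out        # read results from the captured output buffer
```

The delayed memory-manager/KV-cache init plus profile run (#136, #137) presumably mirrors the common pattern of measuring peak activation memory first, then sizing the KV cache from what remains.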
v0.0.4
What's Changed
- Update supported CUDA archs by @gty111 in #82
- Remove AsyncLLM by @gty111 in #83
- Refactor LLM to use worker by @gty111 in #84
- Add arguments to chat.py by @gty111 in #85
- Truly use maxd; Optimize query start loc; Make evaluation use LLM by @gty111 in #86
- Update README.md by @gty111 in #87
- Change evaluation to online requests by @gty111 in #88
- Update README.md by @gty111 in #89
- Refactor LLM init by @gty111 in #90
- Fix decode token budget by @gty111 in #92
- Add handling for quant by @gty111 in #93
- Add support for quantization method fp8 (qwen3) by @gty111 in #94 (see the FP8 sketch below this release)
- Add FP8 device check by @gty111 in #95
- Fix weights loading for Qwen3 fp8 by @gty111 in #96
- Fix weights loading by @gty111 in #97
- Refactor weights loading by @gty111 in #98
- Pipeline prefill chunks by @gty111 in #99
- Fix #run in naive schedule by @gty111 in #100
- Refactor utils by @gty111 in #101
- Refactor moe import by @gty111 in #102
- Add support for Deepseek V2/3 by @gty111 in #103
- Support mla for deepseek V2/3 by @gty111 in #104
- Update ChatCompletionRequest by @gty111 in #105
- Upgrade torch to 2.7.1; Build fa3 by default by @gty111 in #106
- Upgrade flashattn by @gty111 in #107
- Use sync recv and modify sampler by @gty111 in #109
- Rename schedule method chunked prefill by @gty111 in #110
- Minor fix for use_mla by @gty111 in #111
- Support for qwen2_5_vl by @gty111 in #108
- Set top_p and top_k uniformly by @gty111 in #112
- Fix Mrope position by @gty111 in #113
- Fix input_embeds and other bugs related to vl support by @gty111 in #114
- Fix VL TP by @gty111 in #115
- Clean up unused code by @gty111 in #116
- Fix minor bugs about model_max_length by @gty111 in #117
- Update README.md by @gty111 in #118
- Update requirements.txt by @gty111 in #119
- Add torchvision in requirements.txt by @gty111 in #120
Full Changelog: v0.0.3...v0.0.4
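
Much of this release's FP8 work (#93–#98; MoE support followed in v0.0.5) reduces to a simple contract: store each weight in `float8_e4m3fn` alongside a scale, and dequantize for the matmul wherever a fused FP8 kernel is unavailable. A minimal per-tensor sketch, assuming PyTorch >= 2.1; the helper names are hypothetical, not gLLM's loader API.

```python
import torch

FP8_MAX = 448.0  # max representable magnitude of float8_e4m3fn

def quantize_fp8(w: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Per-tensor FP8 quantization: store a scale so w ~= w_fp8 * scale."""
    scale = w.abs().max().clamp(min=1e-12) / FP8_MAX
    w_fp8 = (w / scale).to(torch.float8_e4m3fn)
    return w_fp8, scale

def fp8_linear(x: torch.Tensor, w_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Dequantize-then-matmul fallback (real FP8 kernels fuse these steps)."""
    return x @ (w_fp8.to(x.dtype) * scale).t()

w = torch.randn(256, 128)
w_fp8, scale = quantize_fp8(w)
x = torch.randn(4, 128)
# Error stays small because only the weight is quantized.
print((fp8_linear(x, w_fp8, scale) - x @ w.t()).abs().max())
```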
v0.0.3
What's Changed
- Refactor Sequence by @gty111 in #73
- Fix MoE model weights loading by @gty111 in #74
- Optimize input_data creation by @gty111 in #75
- Optimize get_slot_mapping by @gty111 in #76
- Refactor Sequence by @gty111 in #77
- Prepare for overlap scheduling by @gty111 in #78
- Support Expert Parallelism by @gty111 in #79 (see the routing sketch below this release)
- Update readme for EP by @gty111 in #80
- Bump up to version 0.0.3 by @gty111 in #81
Full Changelog: v0.0.2...v0.0.3
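
Expert Parallelism (#79) shards a MoE layer's experts across ranks: each token picks its top-k experts via the router, and tokens whose experts live on another rank are exchanged (typically with an all-to-all) before the expert MLPs run. A single-process sketch of the routing and dispatch-planning half, assuming plain top-k softmax gating; the names are illustrative.

```python
import torch

num_experts, ep_size, top_k = 8, 2, 2
experts_per_rank = num_experts // ep_size  # experts 0-3 on rank 0, 4-7 on rank 1

hidden = torch.randn(16, 64)               # 16 tokens of hidden states
router = torch.nn.Linear(64, num_experts)

# Top-k gating: each token picks its k best experts with normalized weights.
logits = router(hidden)
weights, expert_ids = torch.topk(logits.softmax(dim=-1), top_k, dim=-1)
weights = weights / weights.sum(dim=-1, keepdim=True)

# EP dispatch plan: which rank owns each chosen expert.
owner_rank = expert_ids // experts_per_rank
for r in range(ep_size):
    tokens_for_r = (owner_rank == r).any(dim=-1).sum().item()
    print(f"rank {r} receives {tokens_for_r} of 16 tokens (via all-to-all)")
```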
v0.0.2
What's Changed
- Fix float bugs by @gty111 in #63
- Decouple sampler from model by @gty111 in #64
- Remove redundant dtype setting by @gty111 in #65
- Improve logging info by @gty111 in #66
- Use fused rmsnorm kernel for qwen3 by @gty111 in #67
- Support aborting requests by @gty111 in #68
- Fix naive schedule by @gty111 in #70
- Upgrade torch to 2.7.0 and flashattention by @gty111 in #71
- Support TP by @gty111 in #72 (see the tensor-parallel sketch below this release)
Full Changelog: v0.0.1...v0.0.2
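
Tensor Parallelism (#72) splits each large matmul across GPUs: a column-parallel linear shards the output features so every rank computes a slice, and the following row-parallel linear shards the input features so the partial products need only one all-reduce. Below is a single-process simulation of that pair (the explicit `sum` stands in for `dist.all_reduce`); these are not gLLM's actual layer classes.

```python
import torch

tp_size = 2
x = torch.randn(4, 128)
w1 = torch.randn(256, 128)  # column-parallel: shard rows of w1 (output features)
w2 = torch.randn(128, 256)  # row-parallel: shard columns of w2 (input features)

w1_shards = w1.chunk(tp_size, dim=0)
w2_shards = w2.chunk(tp_size, dim=1)

# Each "rank" computes its slice end-to-end; no communication inside the pair.
partials = [(x @ w1_shards[r].t()) @ w2_shards[r].t() for r in range(tp_size)]

# The all-reduce: summing partial outputs across ranks recovers the full result.
y_tp = sum(partials)
y_ref = (x @ w1.t()) @ w2.t()
print(torch.allclose(y_tp, y_ref, atol=1e-3))  # True: sharding + sum == full matmul
```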
v0.0.1
What's Changed
- Refactor Memory manager by @gty111 in #1
- Refactor Sampler; Choose KV store func according to batch size by @gty111 in #3
- Support ChatGLM3 by @gty111 in #4
- Add support for online serving by @gty111 in #5
- Add pipeline schedule which overlaps the overhead of schedule and output process by @gty111 in #6
- Update version of torch and vllm-flash-attn by @gty111 in #7
- Enable sample for each sequence by @gty111 in #8
- Add Prefix Caching by @gty111 in #9
- Update build logic for vllm flash attn and gllm by @gty111 in #10
- Merge schedule and output process by @gty111 in #11
- Incrementally pass seqs information to decrease CPU overhead by @gty111 in #12
- Change message passing backend to zeromq by @gty111 in #13
- Enable GPU process to run without schedule by @gty111 in #14
- Introduce PP to gLLM by @gty111 in #15
- Change PP communication by @gty111 in #16
- Optimize PP by @gty111 in #17
- Test interleaved PP feature by @gty111 in #18
- Initial preparation for chunked prefill by @gty111 in #19
- PP schedule optimization by @gty111 in #21
- Optimize PP schedule by @gty111 in #22
- Implement Chunked prefill 🙌 by @gty111 in #23 (see the chunking sketch below this release)
- Improve schedule policy by @gty111 in #26
- Supplementary Parameters by @gty111 in #27
- Revise by @gty111 in #28
- Implement AsyncWorker based on worker by @gty111 in #29
- Fix host of mmlu-pro by @gty111 in #30
- Enhance model loading progress by @gty111 in #31
- Add feature (sending requests in different stages) in benchmark_serving by @gty111 in #32
- Remove run_batch by @gty111 in #33
- Zmqcomm by @gty111 in #35
- Refactor zmqcomm by @gty111 in #36
- Decouple pp_schedule from worker by @gty111 in #37
- Refactor nccl init by @gty111 in #38
- Preparation for multi-node support by @gty111 in #39
- Add Multi node support by @gty111 in #40
- Update Readme and Fix offline inference by @gty111 in #41
- Update readme by @gty111 in #42
- Minor Fix by @gty111 in #43
- Update readme by @gty111 in #44
- Support Qwen3 by @gty111 in #46
- Unify encode/decode; Update args by @gty111 in #47
- Fix multi-node support and add warning by @gty111 in #48
- Create LICENSE by @gty111 in #51
- Support MoE models by @gty111 in #52
- Fix Qwen2 moe by @gty111 in #53
- Fix torch.cuda.set_device by @gty111 in #54
- Optimize error handling logic by @gty111 in #55
- Add Acknowledgment by @gty111 in #56
- Simplify dtype and device settings by @gty111 in #57
- Add function is_pp_last_rank by @gty111 in #58
- Add support for Mixtral by @gty111 in #59
- Uniformly set the index of the PP layer by @gty111 in #60
- Remove the redundant dtype setting by @gty111 in #61
- Add pyproject.toml by @gty111 in #62
Full Changelog: https://github.com/gty111/gLLM/commits/v0.0.1
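
Chunked prefill (#19, #23) caps how many prompt tokens a single forward pass may consume, so a long prompt is prefilled over several steps and can share iterations with other requests instead of monopolizing one. A minimal scheduler-side sketch built around a per-step token budget; the `Seq` class and `schedule_step` helper are hypothetical, not gLLM's scheduler API.

```python
from dataclasses import dataclass

@dataclass
class Seq:
    prompt_len: int
    computed: int = 0  # prompt tokens already prefilled

def schedule_step(waiting: list[Seq], token_budget: int) -> list[tuple[Seq, int]]:
    """Pick (sequence, chunk_size) pairs until the step's token budget is spent."""
    plan = []
    for seq in waiting:
        remaining = seq.prompt_len - seq.computed
        if remaining == 0 or token_budget == 0:
            continue
        chunk = min(remaining, token_budget)
        plan.append((seq, chunk))
        token_budget -= chunk
    return plan

seqs = [Seq(prompt_len=7000), Seq(prompt_len=300)]
step = 0
while any(s.computed < s.prompt_len for s in seqs):
    for seq, chunk in schedule_step(seqs, token_budget=2048):
        seq.computed += chunk
    step += 1
print(step)  # 4 steps: the 7000-token prompt is split into 2048-token chunks
```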