Releases · gty111/gLLM
v0.0.5
What's Changed
- Try to assign layers evenly by @gty111 in #121
- Use vllm precompiled wheel to simplify build by @gty111 in #122
- Use max_completion_tokens by @gty111 in #123
- Fix mrope for qwen vl by @gty111 in #124
- Use send/recv_pyobj and fix bugs by @gty111 in #125
- Fix is_finish by @gty111 in #126
- Add install.sh to simplify build by @gty111 in #127
- Support fp8 moe by @gty111 in #128
- Fix fp8 moe by @gty111 in #129
- Support Deepseek V3 by @gty111 in #130
- Make install.sh robust by @gty111 in #131
- Remove the "experimental feature for multi-node deployment" hint by @gty111 in #132
- Update Readme by @gty111 in #133
- [1/N] CUDA graph: create input buffer for model runner by @gty111 in #134
- [2/N] CUDA graph: output buffer by @gty111 in #135
- [3/N] CUDA graph: delay memory manager/KV cache init by @gty111 in #136
- [4/N] CUDA graph: add profile run by @gty111 in #137
- Separate sample operation by @gty111 in #139
- Simplify PP data transmission by @gty111 in #140
- [5/N] CUDA graph: support CUDA graph by @gty111 in #141 (see the capture/replay sketch below this release)
- VL fix by @gty111 in #142
- Fix image_grid_thw by @gty111 in #143
- Rename scheduler file by @gty111 in #144
- Add schedule_method and split_pd by @gty111 in #145
- Format by @gty111 in #146
- Bump up to version 0.0.5 by @gty111 in #147
Full Changelog: v0.0.4...v0.0.5
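
The CUDA graph series (#134–#141) follows the standard capture/replay recipe: allocate fixed input/output buffers, warm up on a side stream, record the forward pass once, then replay the recorded kernels each step after copying fresh data into the input buffer. Below is a minimal sketch of that pattern using PyTorch's `torch.cuda.graph` API (requires a CUDA device); the model and buffer names are illustrative, not gLLM's actual identifiers.

```python
import torch

# Illustrative stand-in for a decode-step forward pass.
model = torch.nn.Linear(1024, 1024).cuda().half()

# [1/N]-[2/N]: static input buffer the captured graph will read from.
static_in = torch.zeros(8, 1024, device="cuda", dtype=torch.half)

# [4/N]: warm-up runs on a side stream before capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        model(static_in)
torch.cuda.current_stream().wait_stream(s)

# [5/N]: capture once; the captured output tensor becomes the static output buffer.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_out = model(static_in)

def decode_step(hidden: torch.Tensor) -> torch.Tensor:
    static_in.copy_(hidden)  # refresh the captured input buffer in place
    graph.replay()           # relaunch the recorded kernels
    return static_out        # read results from the captured output buffer
```

The delayed memory-manager/KV-cache init plus profile run (#136, #137) presumably mirrors the common pattern of measuring peak activation memory first, then sizing the KV cache from what remains.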
v0.0.4
What's Changed
- Update supported CUDA archs by @gty111 in #82
- Remove AsyncLLM by @gty111 in #83
- Refactor LLM to use worker by @gty111 in #84
- Add arguments to chat.py by @gty111 in #85
- Truly use maxd; Optimize query start loc; Make evaluation use LLM by @gty111 in #86
- Update README.md by @gty111 in #87
- Change evaluation to online requests by @gty111 in #88
- Update README.md by @gty111 in #89
- Refactor LLM init by @gty111 in #90
- Fix decode token budget by @gty111 in #92
- Add handling for quant by @gty111 in #93
- Add support for quantization method fp8 (qwen3) by @gty111 in #94 (see the FP8 sketch below this release)
- Add FP8 device check by @gty111 in #95
- Fix weights loading for Qwen3 fp8 by @gty111 in #96
- Fix weights loading by @gty111 in #97
- Refactor weights loading by @gty111 in #98
- Pipeline prefill chunks by @gty111 in #99
- Fix #run in naive schedule by @gty111 in #100
- Refactor utils by @gty111 in #101
- Refactor moe import by @gty111 in #102
- Add support for Deepseek V2/3 by @gty111 in #103
- Support mla for deepseek V2/3 by @gty111 in #104
- Update ChatCompletionRequest by @gty111 in #105
- Upgrade torch to 2.7.1; Build fa3 by default by @gty111 in #106
- Upgrade flashattn by @gty111 in #107
- Use sync recv and modify sampler by @gty111 in #109
- Rename schedule method chunked prefill by @gty111 in #110
- Minor fix for use_mla by @gty111 in #111
- Support for qwen2_5_vl by @gty111 in #108
- Set top_p and top_k uniformly by @gty111 in #112
- Fix Mrope position by @gty111 in #113
- Fix input_embeds and other bugs related to vl support by @gty111 in #114
- Fix VL TP by @gty111 in #115
- Clean up unused code by @gty111 in #116
- Fix minor bugs about model_max_length by @gty111 in #117
- Update README.md by @gty111 in #118
- Update requirements.txt by @gty111 in #119
- Add torchvision in requirements.txt by @gty111 in #120
Full Changelog: v0.0.3...v0.0.4
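
Much of this release's FP8 work (#93–#98; MoE support followed in v0.0.5) reduces to a simple contract: store each weight in `float8_e4m3fn` alongside a scale, and dequantize for the matmul wherever a fused FP8 kernel is unavailable. A minimal per-tensor sketch, assuming PyTorch >= 2.1; the helper names are hypothetical, not gLLM's loader API.

```python
import torch

FP8_MAX = 448.0  # max representable magnitude of float8_e4m3fn

def quantize_fp8(w: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Per-tensor FP8 quantization: store a scale so w ~= w_fp8 * scale."""
    scale = w.abs().max().clamp(min=1e-12) / FP8_MAX
    w_fp8 = (w / scale).to(torch.float8_e4m3fn)
    return w_fp8, scale

def fp8_linear(x: torch.Tensor, w_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Dequantize-then-matmul fallback (real FP8 kernels fuse these steps)."""
    return x @ (w_fp8.to(x.dtype) * scale).t()

w = torch.randn(256, 128)
w_fp8, scale = quantize_fp8(w)
x = torch.randn(4, 128)
# Error stays small because only the weight is quantized.
print((fp8_linear(x, w_fp8, scale) - x @ w.t()).abs().max())
```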
v0.0.3
What's Changed
- Refactor Sequence by @gty111 in #73
- Fix MoE model weights loading by @gty111 in #74
- Optimize input_data creation by @gty111 in #75
- Optimize get_slot_mapping by @gty111 in #76
- Refactor Sequence by @gty111 in #77
- Prepare for overlap scheduling by @gty111 in #78
- Support Expert Parallelism by @gty111 in #79 (see the routing sketch below this release)
- Update readme for EP by @gty111 in #80
- Bump up to version 0.0.3 by @gty111 in #81
Full Changelog: v0.0.2...v0.0.3
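
Expert Parallelism (#79) shards a MoE layer's experts across ranks: each token picks its top-k experts via the router, and tokens whose experts live on another rank are exchanged (typically with an all-to-all) before the expert MLPs run. A single-process sketch of the routing and dispatch-planning half, assuming plain top-k softmax gating; the names are illustrative.

```python
import torch

num_experts, ep_size, top_k = 8, 2, 2
experts_per_rank = num_experts // ep_size  # experts 0-3 on rank 0, 4-7 on rank 1

hidden = torch.randn(16, 64)               # 16 tokens of hidden states
router = torch.nn.Linear(64, num_experts)

# Top-k gating: each token picks its k best experts with normalized weights.
logits = router(hidden)
weights, expert_ids = torch.topk(logits.softmax(dim=-1), top_k, dim=-1)
weights = weights / weights.sum(dim=-1, keepdim=True)

# EP dispatch plan: which rank owns each chosen expert.
owner_rank = expert_ids // experts_per_rank
for r in range(ep_size):
    tokens_for_r = (owner_rank == r).any(dim=-1).sum().item()
    print(f"rank {r} receives {tokens_for_r} of 16 tokens (via all-to-all)")
```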
v0.0.2
What's Changed
- Fix float bugs by @gty111 in #63
- Decouple sampler from model by @gty111 in #64
- Remove redundant dtype setting by @gty111 in #65
- Improve logging info by @gty111 in #66
- Use fused rmsnorm kernel for qwen3 by @gty111 in #67
- Support aborting requests by @gty111 in #68
- Fix naive schedule by @gty111 in #70
- Upgrade torch to 2.7.0 and flashattention by @gty111 in #71
- Support TP by @gty111 in #72 (see the tensor-parallel sketch below this release)
Full Changelog: v0.0.1...v0.0.2
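
Tensor Parallelism (#72) splits each large matmul across GPUs: a column-parallel linear shards the output features so every rank computes a slice, and the following row-parallel linear shards the input features so the partial products need only one all-reduce. Below is a single-process simulation of that pair (the explicit `sum` stands in for `dist.all_reduce`); these are not gLLM's actual layer classes.

```python
import torch

tp_size = 2
x = torch.randn(4, 128)
w1 = torch.randn(256, 128)  # column-parallel: shard rows of w1 (output features)
w2 = torch.randn(128, 256)  # row-parallel: shard columns of w2 (input features)

w1_shards = w1.chunk(tp_size, dim=0)
w2_shards = w2.chunk(tp_size, dim=1)

# Each "rank" computes its slice end-to-end; no communication inside the pair.
partials = [(x @ w1_shards[r].t()) @ w2_shards[r].t() for r in range(tp_size)]

# The all-reduce: summing partial outputs across ranks recovers the full result.
y_tp = sum(partials)
y_ref = (x @ w1.t()) @ w2.t()
print(torch.allclose(y_tp, y_ref, atol=1e-3))  # True: sharding + sum == full matmul
```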
v0.0.1
What's Changed
- Refactor Memory manager by @gty111 in #1
- Refactor Sampler; Choose KV store func according to batch size by @gty111 in #3
- Support ChatGLM3 by @gty111 in #4
- Add support for online serving by @gty111 in #5
- Add pipeline schedule which overlaps the overhead of schedule and output process by @gty111 in #6
- Update version of torch and vllm-flash-attn by @gty111 in #7
- Enable sample for each sequence by @gty111 in #8
- Add Prefix Caching by @gty111 in #9
- Update build logic for vllm flash attn and gllm by @gty111 in #10
- Merge schedule and output process by @gty111 in #11
- Incrementally pass seqs information to decrease CPU overhead by @gty111 in #12
- Change message passing backend to zeromq by @gty111 in #13
- Enable GPU process to run without schedule by @gty111 in #14
- Introduce PP to gLLM by @gty111 in #15
- Change PP communication by @gty111 in #16
- Optimize PP by @gty111 in #17
- Test interleaved PP feature by @gty111 in #18
- Initial preparation for chunked prefill by @gty111 in #19
- PP schedule optimization by @gty111 in #21
- Optimize PP schedule by @gty111 in #22
- Implement Chunked prefill 🙌 by @gty111 in #23 (see the chunking sketch below this release)
- Improve schedule policy by @gty111 in #26
- Supplementary Parameters by @gty111 in #27
- Revise by @gty111 in #28
- Implement AsyncWorker based on worker by @gty111 in #29
- Fix host of mmlu-pro by @gty111 in #30
- Enhance model loading progress by @gty111 in #31
- Add feature (sending requests in different stages) in benchmark_serving by @gty111 in #32
- Remove run_batch by @gty111 in #33
- Zmqcomm by @gty111 in #35
- Refactor zmqcomm by @gty111 in #36
- Decouple pp_schedule from worker by @gty111 in #37
- Refactor nccl init by @gty111 in #38
- Preparation for multi-node support by @gty111 in #39
- Add Multi node support by @gty111 in #40
- Update Readme and Fix offline inference by @gty111 in #41
- Update readme by @gty111 in #42
- Minor Fix by @gty111 in #43
- Update readme by @gty111 in #44
- Support Qwen3 by @gty111 in #46
- Unify encode/decode; Update args by @gty111 in #47
- Fix multi-node support and add warning by @gty111 in #48
- Create LICENSE by @gty111 in #51
- Support MoE models by @gty111 in #52
- Fix Qwen2 moe by @gty111 in #53
- Fix torch.cuda.set_device by @gty111 in #54
- Optimize error handling logic by @gty111 in #55
- Add Acknowledgment by @gty111 in #56
- Simplify dtype and device settings by @gty111 in #57
- Add function is_pp_last_rank by @gty111 in #58
- Add support for Mixtral by @gty111 in #59
- Uniformly set the index of the PP layer by @gty111 in #60
- Remove the redundant dtype setting by @gty111 in #61
- Add pyproject.toml by @gty111 in #62
Full Changelog: https://github.com/gty111/gLLM/commits/v0.0.1
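
Chunked prefill (#19, #23) caps how many prompt tokens a single forward pass may consume, so a long prompt is prefilled over several steps and can share iterations with other requests instead of monopolizing one. A minimal scheduler-side sketch built around a per-step token budget; the `Seq` class and `schedule_step` helper are hypothetical, not gLLM's scheduler API.

```python
from dataclasses import dataclass

@dataclass
class Seq:
    prompt_len: int
    computed: int = 0  # prompt tokens already prefilled

def schedule_step(waiting: list[Seq], token_budget: int) -> list[tuple[Seq, int]]:
    """Pick (sequence, chunk_size) pairs until the step's token budget is spent."""
    plan = []
    for seq in waiting:
        remaining = seq.prompt_len - seq.computed
        if remaining == 0 or token_budget == 0:
            continue
        chunk = min(remaining, token_budget)
        plan.append((seq, chunk))
        token_budget -= chunk
    return plan

seqs = [Seq(prompt_len=7000), Seq(prompt_len=300)]
step = 0
while any(s.computed < s.prompt_len for s in seqs):
    for seq, chunk in schedule_step(seqs, token_budget=2048):
        seq.computed += chunk
    step += 1
print(step)  # 4 steps: the 7000-token prompt is split into 2048-token chunks
```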