Releases: llm-d/llm-d-inference-sim
v0.6.0
What's Changed
- New requests queue by @irar2 in #214
- Make writing to channels non-blocking by @irar2 in #225
- Change packages' dependencies by @irar2 in #229
- Added port header to response by @irar2 in #232
- Test fix: the number of running requests can be one less while requests are being scheduled by @irar2 in #231
- fix occasional ttft and tpot metrics test failures by @mayabar in #233
- Configure the tool_choice option to use a specific tool by @MondayCha in #234
- Additional latency related metrics by @mayabar in #237
- Changed random from static to a field in the simulator by @irar2 in #238
- Made workers' requests channel non-blocking by @irar2 in #239
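PR #234 above adds support for forcing a specific tool via `tool_choice`. A minimal sketch of a chat-completions request body that does this, assuming the simulator follows the standard OpenAI schema (the model and tool names here are hypothetical):

```python
import json

# Hypothetical tool definition; the simulator accepts OpenAI-style tools.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return the weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

request_body = {
    "model": "my-model",  # hypothetical served model name
    "messages": [{"role": "user", "content": "Weather in Paris?"}],
    "tools": tools,
    # Force the named tool instead of "auto" / "none" / "required".
    "tool_choice": {"type": "function", "function": {"name": "get_weather"}},
}

payload = json.dumps(request_body)
```

With `tool_choice` set this way, the response should contain a tool call for `get_weather` rather than free-form text.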
New Contributors
- @MondayCha made their first contribution in #234
Full Changelog: v0.5.2...v0.6.0
v0.5.2
What's Changed
- Use custom dataset as response source by @pancak3 in #200
- Add vllm:time_per_output_token_seconds and vllm:time_to_first_token_seconds metrics by @mayabar in #217
- Use openai-go v3.6.1 in the tests by @irar2 in #223
- feat(metrics): add request prompt, generation, max_tokens and success metrics by @googs1025 in #202
Full Changelog: v0.5.1...v0.5.2
v0.5.1
New Features
- The llm-d-inference-sim server can run in TLS mode, with the certificate and key either supplied by the user or generated automatically.
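For the TLS mode described above, a client needs to trust the server's certificate. A minimal sketch using Python's standard `ssl` module; the CA file path is hypothetical, and the relaxed context is only for test setups with an automatically generated (self-signed) certificate:

```python
import ssl

# Standard client-side TLS context; certificate verification is on by default.
ctx = ssl.create_default_context()

# For a user-supplied certificate, trust its CA bundle (hypothetical path):
# ctx.load_verify_locations(cafile="ca.pem")

# For a self-signed certificate in a local test setup, verification can be
# relaxed -- do not do this in production:
insecure_ctx = ssl.create_default_context()
insecure_ctx.check_hostname = False
insecure_ctx.verify_mode = ssl.CERT_NONE
```

Either context can then be passed to any HTTP client that accepts an `ssl.SSLContext` when talking to the simulator over HTTPS.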
What's Changed
- Add golangci-lint version check by @npolshakova in #160
- feat(server): enables TLS mode by @bartoszmajsak in #205
- fix(make): properly resolves package manager for ZMQ installation by @bartoszmajsak in #204
- feat(make): simplifies local tooling installation by @bartoszmajsak in #203
New Contributors
- @bartoszmajsak made their first contribution in #205
Full Changelog: v0.5.0...v0.5.1
v0.5.0
New features
- Processing time is affected by server load
- Change TTFT parameter to be based on number of request tokens
- KV cache affects prefill time
- Support failure injection
- Implement Prometheus metrics for KV-cache usage and waiting LoRAs
- Randomize response length when max-tokens is defined in the request
- Support DP (data parallel)
- Support /tokenize endpoint
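As one example of the features above, here is a sketch of a `/tokenize` request body, assuming the simulator mirrors vLLM's tokenize endpoint (field names as in vLLM's schema; the model name is hypothetical):

```python
import json

# vLLM-style /tokenize request; a "prompt" string is the simplest form.
tokenize_request = {
    "model": "my-model",  # hypothetical served model name
    "prompt": "Hello, world!",
}

payload = json.dumps(tokenize_request)
# In vLLM's schema the response carries the token ids and their count,
# e.g. {"tokens": [...], "count": ...}.
```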
What's Changed
- Fix server interrupt by @npolshakova in #161
- Show final config in simulator default logger at Info level by @pancak3 in #154
- Cast bounds in tests to match function definitions: latency, interToken, and timeToFirst (to int) by @pancak3 in #163
- Remove unnecessary deferral of server close by @pancak3 in #162
- Fix: rand generator was not set in a test suite, resulting in a nil pointer access at runtime when that suite was run alone by @pancak3 in #166
- Use channels for metrics updates, added metrics tests by @irar2 in #171
- Remove rerun on comment action by @irar2 in #174
- Add failure injection mode to simulator by @smarunich in #131
- Add waiting loras list to loraInfo metrics by @mayabar in #175
- feat: generate response length based on a histogram when max_tokens is defined in the request by @mayabar in #169
- Extend response length bucket calculation to allow buckets that are not necessarily equally sized by @mayabar in #176
- Use dynamic ports in zmq tests by @pancak3 in #170
- Change time-to-first-token parameter to be based on number of request tokens #137 by @pancak3 in #165
- Bugfix: number of tokens was read from a nil variable; read it from the request instead by @pancak3 in #177
- feat: add helm charts for Kubernetes deployment by @Blackoutta in #182
- chore: Make the image smaller by @shmuelk in #183
- Take cached prompt tokens into account in prefill time calculation by @irar2 in #184
- Add ignore eos in request by @pancak3 in #187
- Support DP by @irar2 in #188
- Change RandomNorm from float types to int by @pancak3 in #190
- KV cache usage metric by @irar2 in #192
- Adjust request "processing time" to current load by @pancak3 in #189
- Updates for the new release of kv-cache-manager by @irar2 in #194
- DP bug fix: wait after starting rank 0 sim by @irar2 in #193
- Support /tokenize endpoint by @irar2 in #198
- add Service to expose vLLM deployment and update doc by @googs1025 in #201
- Split simulator.go into several files by @irar2 in #199
New Contributors
- @smarunich made their first contribution in #131
- @Blackoutta made their first contribution in #182
- @googs1025 made their first contribution in #201
Full Changelog: v0.4.0...v0.5.0
v0.4.0
New Features
- KV Cache support: request prompts are tokenized, divided into blocks, and hash values are calculated and stored in a cache. Batches of KV events are published when a block is stored in or removed from the cache.
- Fake metrics: the configuration can contain a predefined set of metrics to be sent to Prometheus as a substitute for the actual data. When specified, only these fake metrics will be reported — real metrics and fake metrics will never be reported together.
- Added pod name and namespace headers to responses, so tests can check which vLLM instance actually received the request.
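The KV cache bookkeeping above (tokenize, split into blocks, hash, store) can be sketched generically. This illustrates the idea only, not the simulator's actual hashing scheme; the block size and the hash chaining are assumptions:

```python
import hashlib

BLOCK_SIZE = 16  # assumed block size in tokens


def block_hashes(token_ids: list[int], block_size: int = BLOCK_SIZE) -> list[str]:
    """Split token ids into full fixed-size blocks and chain-hash them, so a
    block's hash depends on all tokens up to and including that block
    (prefix-aware, as KV-cache lookup requires). Trailing partial blocks
    are ignored."""
    hashes: list[str] = []
    parent = b""
    usable = len(token_ids) - len(token_ids) % block_size
    for start in range(0, usable, block_size):
        block = token_ids[start:start + block_size]
        digest = hashlib.sha256(parent + repr(block).encode()).hexdigest()
        hashes.append(digest)
        parent = digest.encode()
    return hashes


# Two prompts sharing a 16-token prefix produce the same first block hash.
a = block_hashes(list(range(40)))
b = block_hashes(list(range(16)) + list(range(100, 124)))
```

Because the hashes are chained, a cache hit on block *k* implies the whole prefix of *k* blocks matched, which is what makes prefix reuse sound.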
What's Changed
- Publish kv-cache events by @irar2 in #126
- Use same version of tokenizer in both Dockerfile and Makefile by @mayabar in #132
- use newer version of kvcache-manager, update code accordingly by @mayabar in #133
- Add support to echo the sim's pod name and namespace by @npolshakova in #128
- Create UUID string under a lock by @irar2 in #143
- Support fake metrics by @irar2 in #144
- fix: Makefile fixes for MacOS by @shmuelk in #146
- Use kv-cache-manager version v0.2.1 by @mayabar in #147
- Present kv-cache related configuration parameters in readme file by @mayabar in #149
- Updated README file: added environment variables by @mayabar in #151
- Fix zmq endpoints in test cases by @pancak3 in #150
- Change user to not be root in the dockerfile by @mayabar in #153
- Add ZMQ connection retry configuration by @zhengkezhou1 in #152
- Added CI automation by @shmuelk in #155
- small changes in texts by @mayabar in #156
New Contributors
- @npolshakova made their first contribution in #128
- @pancak3 made their first contribution in #150
- @zhengkezhou1 made their first contribution in #152
Full Changelog: v0.3.2...v0.4.0
v0.3.2
v0.3.1
What's Changed
- Support long responses
- Initial KV cache support (work in progress)
Change details
- Support long responses and additional fixes by @mayabar in #104
- Change project structure - separate main package to three by @mayabar in #105
- Code reorganization: moved configuration related code to common by @irar2 in #109
- Kv cache support without KV events by @mayabar in #107
- chore: Added a LICENSE file by @shmuelk in #117
- feat: Add new issue templates by @shmuelk in #114
- chore: Added common badges by @shmuelk in #118
- ZMQ publisher by @irar2 in #119
- Only create image with latest tag on release by @shmuelk in #120
- Kv events sender by @mayabar in #121
- Add definition of new action input by @shmuelk in #123
Full Changelog: v0.3.0...v0.3.1
v0.3.1-rc.2
What's Changed
- Support long responses
- Initial work on KV Cache event support, still Work In Progress
Details of changes
- Support long responses and additional fixes by @mayabar in #104
- Change project structure - separate main package to three by @mayabar in #105
- Code reorganization: moved configuration related code to common by @irar2 in #109
- Kv cache support without KV events by @mayabar in #107
- chore: Added a LICENSE file by @shmuelk in #117
- feat: Add new issue templates by @shmuelk in #114
- chore: Added common badges by @shmuelk in #118
- ZMQ publisher by @irar2 in #119
- Only create image with latest tag on release by @shmuelk in #120
- Kv events sender by @mayabar in #121
- Add definition of new action input by @shmuelk in #123
Full Changelog: v0.3.0...v0.3.1-rc.2
v0.3.0
Release Notes
Compatibility with vLLM
- Aligned command-line parameters with real vLLM. All parameters supported by both the simulator and vLLM now share the same name and format:
- Support for --served-model-name
- Support for --seed
- Support for --max-model-len
- Added support for tools in chat completions
- Included usage in the response
- Added object field to the response JSON
- Added support for multimodal inputs in chat completions
- Added health and readiness endpoints
- Added P/D support; the connector type must be set to nixl
Additional Features
- Introduced configuration file support. All parameters can now be loaded from a configuration file in addition to being set via the command line.
- Added new test coverage
- Changed the Docker base image
- Added the ability to randomize time to first token, inter-token latency, and KV-cache transfer latency
Migration Notes (for users upgrading from versions prior to v0.2.0)
- max-running-requests has been renamed to max-num-seqs
- lora has been replaced by lora-modules, which now accepts a list of JSON strings, e.g., '{"name": "name", "path": "lora_path", "base_model_name": "id"}'
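For the lora-modules migration, the JSON strings can be generated rather than hand-written. A small sketch using exactly the fields from the example above (the values are the placeholders from the release note, not real adapter names):

```python
import json

# Fields taken from the migration example above.
lora_module = {
    "name": "name",
    "path": "lora_path",
    "base_model_name": "id",
}

# Each --lora-modules argument is one JSON string of this shape.
arg = json.dumps(lora_module)
```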
Change details since v0.2.2
- feat: add max-model-len configuration and validation for context window (#82) by @mohitpalsingh in #85
- Fixed readme, removed error for --help by @irar2 in #89
- Pd support by @mayabar in #94
- fix: crash when omitted stream_options by @jasonmadigan in #95
- style: 🔨 splits all import blocks into different sections by @yafengio in #98
- Fixed deployment.yaml by @irar2 in #99
- Enable configuration of various parameters in tools by @irar2 in #100
- Choose latencies randomly by @irar2 in #103
New Contributors
- @mohitpalsingh made their first contribution in #85
- @jasonmadigan made their first contribution in #95
Full Changelog: v0.2.2...v0.3.0
v0.2.2
What's Changed
- Initialize rand once, added seed to configuration by @irar2 in #79
- use string when storing lora adapters in simulator by @mayabar in #81
- Improved support for empty command line arguments by @irar2 in #80
- Added tests for LoRA configuration, load and unload by @irar2 in #86
Full Changelog: v0.2.1...v0.2.2