Releases: llm-d/llm-d-inference-sim

v0.6.0

29 Oct 11:06
9a57299

What's Changed

  • New requests queue by @irar2 in #214
  • Make writing to channels non-blocking by @irar2 in #225 (see the sketch after this list)
  • Change packages' dependencies by @irar2 in #229
  • Added port header to response by @irar2 in #232
  • Test fix: the number of running requests can be one less when scheduling requests by @irar2 in #231
  • fix occasional ttft and tpot metrics test failures by @mayabar in #233
  • Configure the tool_choice option to use a specific tool by @MondayCha in #234
  • Additional latency related metrics by @mayabar in #237
  • Changed random from static to a field in the simulator by @irar2 in #238
  • Made workers' requests channel non-blocking by @irar2 in #239
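
Several of the changes above (#225, #239) make writes to internal channels non-blocking. A minimal sketch of that pattern in Go follows: a select with a default branch lets the sender drop an update when the channel buffer is full instead of stalling. The payload type and names here are illustrative, not the simulator's actual identifiers.

```go
package main

import "fmt"

// update is a placeholder for whatever payload the simulator sends on its
// internal channels (metrics updates, worker requests, etc.).
type update struct {
	name  string
	value float64
}

// trySend writes to ch without blocking: if the buffer is full, the update is
// dropped and the caller is told so, instead of stalling the request path.
func trySend(ch chan<- update, u update) bool {
	select {
	case ch <- u:
		return true
	default:
		return false
	}
}

func main() {
	ch := make(chan update, 1)
	fmt.Println(trySend(ch, update{"ttft_seconds", 0.12})) // true: buffer has room
	fmt.Println(trySend(ch, update{"tpot_seconds", 0.03})) // false: buffer full, dropped
}
```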

Full Changelog: v0.5.2...v0.6.0

v0.5.2

22 Oct 07:48
1c3d559

What's Changed

  • Use custom dataset as response source by @pancak3 in #200
  • Add vllm:time_per_output_token_seconds and vllm:time_to_first_token_seconds metrics by @mayabar in #217
  • Use openai-go v3.6.1 in the tests by @irar2 in #223
  • feat(metrics): add request prompt, generation, max_tokens and success metrics by @googs1025 in #202

Full Changelog: v0.5.1...v0.5.2

v0.5.1

18 Sep 15:08
b8eb7a4

New Features

  • The llm-d-inference-sim server can be run in TLS mode with the certificate and key supplied by the user or automatically generated.
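
A minimal sketch of what the automatically generated case could look like in Go: a self-signed certificate is created in memory with the standard library and used to serve TLS. This is illustrative only; the simulator's actual flags, certificate parameters, and server setup may differ.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/tls"
	"crypto/x509"
	"crypto/x509/pkix"
	"log"
	"math/big"
	"net/http"
	"time"
)

// selfSignedCert builds an in-memory, self-signed certificate so the server
// can start in TLS mode when the user does not supply a certificate and key.
func selfSignedCert() (tls.Certificate, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return tls.Certificate{}, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: "llm-d-inference-sim"},
		NotBefore:    time.Now(),
		NotAfter:     time.Now().Add(24 * time.Hour),
		DNSNames:     []string{"localhost"},
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return tls.Certificate{}, err
	}
	return tls.Certificate{Certificate: [][]byte{der}, PrivateKey: key}, nil
}

func main() {
	cert, err := selfSignedCert()
	if err != nil {
		log.Fatal(err)
	}
	srv := &http.Server{
		Addr:      ":8443",
		TLSConfig: &tls.Config{Certificates: []tls.Certificate{cert}},
	}
	// Empty cert/key paths: the certificate comes from TLSConfig instead.
	log.Fatal(srv.ListenAndServeTLS("", ""))
}
```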

Full Changelog: v0.5.0...v0.5.1

v0.5.0

16 Sep 06:54
9c541b9

New Features

  • Processing time is affected by server load
  • Change TTFT parameter to be based on number of request tokens
  • KV cache affects prefill time
  • Support failure injection
  • Implement kv-cache usage and waiting loras Prometheus metrics
  • Randomize response length based on a histogram when max-tokens is defined in the request
  • Support DP (data parallel)
  • Support /tokenize endpoint (see the sketch after this list)
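
As an illustration of the new /tokenize endpoint, the sketch below posts a prompt to a locally running simulator. The port, model name, and the request/response shapes are assumptions (modeled on vLLM's /tokenize API) and may differ from the simulator's exact schema.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// tokenizeResponse assumes a vLLM-style /tokenize reply; the simulator's
// actual field set may differ.
type tokenizeResponse struct {
	Count  int   `json:"count"`
	Tokens []int `json:"tokens"`
}

func main() {
	body, _ := json.Marshal(map[string]string{
		"model":  "meta-llama/Llama-3.1-8B-Instruct", // hypothetical served model name
		"prompt": "Hello from the inference simulator",
	})
	resp, err := http.Post("http://localhost:8000/tokenize", "application/json", bytes.NewReader(body))
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var out tokenizeResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("prompt tokenized into %d tokens: %v\n", out.Count, out.Tokens)
}
```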

What's Changed

  • Fix server interrupt by @npolshakova in #161
  • Show final config in simulator default logger at Info level by @pancak3 in #154
  • Cast bounds type in tests to func def: latency, interToken, and timeToFirst (to int) by @pancak3 in #163
  • Remove unnecessary deferral of server close by @pancak3 in #162
  • Fix: Rand generator is not set in a test suite, which results in accessing a nil pointer at runtime when only that test suite is run by @pancak3 in #166
  • Use channels for metrics updates, added metrics tests by @irar2 in #171
  • Remove rerun on comment action by @irar2 in #174
  • Add failure injection mode to simulator by @smarunich in #131
  • Add waiting loras list to loraInfo metrics by @mayabar in #175
  • feat: generate response length based on a histogram when max_tokens is defined in the request by @mayabar in #169
  • Extend response length bucket calculation to allow buckets that are not necessarily equally sized by @mayabar in #176
  • Use dynamic ports in zmq tests by @pancak3 in #170
  • Change time-to-first-token parameter to be based on number of request tokens #137 by @pancak3 in #165
  • Bugfix: the number of tokens was read from a nil variable; it is now taken from the request instead by @pancak3 in #177
  • feat: add helm charts for Kubernetes deployment by @Blackoutta in #182
  • chore: Make the image smaller by @shmuelk in #183
  • Take cached prompt tokens into account in prefill time calculation by @irar2 in #184
  • Add ignore eos in request by @pancak3 in #187
  • Support DP by @irar2 in #188
  • Change RandomNorm from float types to int by @pancak3 in #190
  • KV cache usage metric by @irar2 in #192
  • Adjust request "processing time" to current load by @pancak3 in #189
  • Updates for the new release of kv-cache-manager by @irar2 in #194
  • DP bug fix: wait after starting rank 0 sim by @irar2 in #193
  • Support /tokenize endpoint by @irar2 in #198
  • add Service to expose vLLM deployment and update doc by @googs1025 in #201
  • Split simulator.go into several files by @irar2 in #199

Full Changelog: v0.4.0...v0.5.0

v0.4.0

21 Aug 10:19
4076bd2

New Features

  • KV Cache support: request prompts are tokenized, divided into blocks, and hash values are calculated and stored in a cache. Batches of KV events are published when a block is stored in or removed from the cache (see the sketch after this list).
  • Fake metrics: the configuration can contain a predefined set of metrics to be sent to Prometheus in place of the actual data. When specified, only these fake metrics are reported; real and fake metrics are never reported together.
  • Adds pod name and namespace headers so that tests can check which vLLM instance actually received the request.
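
A rough sketch, in Go, of the block-and-hash idea described above: token IDs are split into fixed-size blocks and each block's hash is chained to the hash of the preceding block, so prompts that share a prefix also share the hashes of the blocks covering that prefix. The block size, hash function, and names here are illustrative assumptions, not the simulator's actual implementation.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
)

const blockSize = 16 // assumed block size; the real value would be configurable

// blockHashes splits token IDs into fixed-size blocks and hashes each block
// together with the hash of the preceding block, so two prompts that share a
// prefix also share the hashes of the blocks covering that prefix.
func blockHashes(tokens []int) []uint64 {
	var hashes []uint64
	var prev uint64
	for start := 0; start+blockSize <= len(tokens); start += blockSize {
		h := fnv.New64a()
		binary.Write(h, binary.LittleEndian, prev)
		for _, t := range tokens[start : start+blockSize] {
			binary.Write(h, binary.LittleEndian, int64(t))
		}
		prev = h.Sum64()
		hashes = append(hashes, prev)
	}
	return hashes
}

func main() {
	prompt := make([]int, 40)
	for i := range prompt {
		prompt[i] = i
	}
	// Blocks covering the first 32 tokens get hashes; the 8-token tail is
	// ignored until it fills a complete block.
	fmt.Println(blockHashes(prompt))
}
```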

Full Changelog: v0.3.2...v0.4.0

v0.3.2

07 Aug 08:48
9bbb64d

What's Changed

  • Work on the CI pipeline
  • Additional work on KV-Cache support, work in progress

Change details

  • KV cache and tokenization related configuration by @irar2 in #125
  • Another attempt at adding a latest tag only on release builds by @shmuelk in #124

Full Changelog: v0.3.1...v0.3.2

v0.3.1

06 Aug 14:53
0308c8f

What's Changed

  • Support long responses
  • Beginnings of KV cache support, work in progress

Full Changelog: v0.3.0...v0.3.1

v0.3.1-rc.2

06 Aug 13:58
0308c8f

Pre-release

What's Changed

  • Support long responses
  • Initial work on KV cache event support, still work in progress

Full Changelog: v0.3.0...v0.3.1-rc.2

v0.3.0

20 Jul 08:29
7f1f766

Pre-release

Release Notes

Compatibility with vLLM

  • Aligned command-line parameters with real vLLM. All parameters supported by both the simulator and vLLM now share the same name and format:
    • Support for --served-model-name
    • Support for --seed
    • Support for --max-model-len
  • Added support for tools in chat completions (see the sketch after this list)
  • Included usage in the response
  • Added object field to the response JSON
  • Added support for multimodal inputs in chat completions
  • Added health and readiness endpoints
  • Added P/D support; the connector type must be set to nixl
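
To illustrate the tools support mentioned above, the sketch below posts an OpenAI-style chat completion with a single tool to a locally running simulator. The port, model name, and tool definition are placeholders; the payload follows the public OpenAI chat completions schema rather than anything specific to the simulator.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"log"
	"net/http"
)

func main() {
	// OpenAI-style chat completion request with one tool; the model name and
	// tool definition are placeholders for illustration.
	payload := map[string]any{
		"model": "meta-llama/Llama-3.1-8B-Instruct",
		"messages": []map[string]string{
			{"role": "user", "content": "What is the weather in Paris?"},
		},
		"tool_choice": "auto",
		"tools": []map[string]any{
			{
				"type": "function",
				"function": map[string]any{
					"name":        "get_weather",
					"description": "Get the current weather for a city",
					"parameters": map[string]any{
						"type": "object",
						"properties": map[string]any{
							"city": map[string]string{"type": "string"},
						},
						"required": []string{"city"},
					},
				},
			},
		},
	}
	body, _ := json.Marshal(payload)
	resp, err := http.Post("http://localhost:8000/v1/chat/completions",
		"application/json", bytes.NewReader(body))
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	out, _ := io.ReadAll(resp.Body)
	fmt.Println(string(out))
}
```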

Additional Features

  • Introduced configuration file support. All parameters can now be loaded from a configuration file in addition to being set via the command line.
  • Added new test coverage
  • Changed the Docker base image
  • Added the ability to randomize time to first token, inter-token latency, and KV-cache transfer latency (see the sketch after this list)
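
One way the randomized latencies mentioned above could be drawn, as a minimal Go sketch: each latency is sampled from a normal distribution centered on the configured value and clamped at zero. The 30% spread and the clamping are illustrative assumptions, not the simulator's actual parameters.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// randomizedLatency returns a latency drawn from a normal distribution
// centered on mean, with the given relative standard deviation, clamped so it
// never goes negative.
func randomizedLatency(r *rand.Rand, mean time.Duration, stddevFrac float64) time.Duration {
	d := time.Duration(r.NormFloat64()*stddevFrac*float64(mean)) + mean
	if d < 0 {
		return 0
	}
	return d
}

func main() {
	r := rand.New(rand.NewSource(42)) // seeded, in the spirit of the --seed option
	ttft := 500 * time.Millisecond
	interToken := 50 * time.Millisecond
	for i := 0; i < 3; i++ {
		fmt.Printf("ttft=%v inter-token=%v\n",
			randomizedLatency(r, ttft, 0.3),       // 30% spread is an assumption
			randomizedLatency(r, interToken, 0.3))
	}
}
```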

Migration Notes (for users upgrading from versions prior to v0.2.0)

  • max-running-requests has been renamed to max-num-seqs
  • lora has been replaced by lora-modules, which now accepts a list of JSON strings, e.g., '{"name": "name", "path": "lora_path", "base_model_name": "id"}'

Change details since v0.2.2

  • feat: add max-model-len configuration and validation for context window (#82) by @mohitpalsingh in #85
  • Fixed readme, removed error for --help by @irar2 in #89
  • Pd support by @mayabar in #94
  • fix: crash when omitted stream_options by @jasonmadigan in #95
  • style: 🔨 splits all import blocks into different sections by @yafengio in #98
  • Fixed deployment.yaml by @irar2 in #99
  • Enable configuration of various parameters in tools by @irar2 in #100
  • Choose latencies randomly by @irar2 in #103

Full Changelog: v0.2.2...v0.3.0

v0.2.2

13 Jul 10:02
7656a3c

Pre-release

What's Changed

  • Initialize rand once, added seed to configuration by @irar2 in #79
  • use string when storing lora adapters in simulator by @mayabar in #81
  • Improved support for empty command line arguments by @irar2 in #80
  • Added tests for LoRA configuration, load and unload by @irar2 in #86

Full Changelog: v0.2.1...v0.2.2