Add failure injection mode to simulator #131
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@smarunich Thank you very much for the PR; failure generation support is part of our roadmap, thanks for picking this up! Some more or less general comments, so I am writing them here and not in the code:
- We think that failure generation should not be a mode. (We have to update the explanation of …)
- Please move failures.go to the llm-d-inference-sim package.
- Please don't call rand.Seed(); use …
- "Rate limit reached for %s in organization org-xxx on requests per min (RPM): Limit 3, Used 3, Requested 1." and "The model '%s-nonexistent' does not exist" can be defined as constants and reused.
- We may only have one function …
- In ErrorResponse …
Force-pushed from 8239ac1 to 52edd56
Publish kv-cache events:
* Publish kv-cache events
* Fix lint errors
* Review fixes
* Sleep to allow previous sub to close
Signed-off-by: Ira <[email protected]>
Signed-off-by: Sergey Marunich <[email protected]>
Introduces a 'failure' mode to the simulator, allowing random injection of OpenAI API-compatible error responses for testing error handling. Adds configuration options for failure injection rate and specific failure types, implements error response logic, and updates documentation and tests to cover the new functionality. Signed-off-by: Sergey Marunich <[email protected]>
Failure injection is now controlled by a dedicated 'failure-injection-rate' parameter instead of a separate 'failure' mode. Failure type constants are centralized, and error handling in the simulator is refactored to use a unified method for sending error responses. Documentation and tests are updated to reflect these changes, and the OpenAI error response format now includes an 'object' field. Signed-off-by: Sergey Marunich <[email protected]>
Extracts TOKENIZER_VERSION from the Dockerfile and uses it in the download-tokenizer target. This allows the Makefile to automatically use the correct tokenizer version specified in the Dockerfile, improving maintainability and consistency. Signed-off-by: Sergey Marunich <[email protected]>
Use same version of tokenizer in both Dockerfile and Makefile; fixes in readme file; updates according to the PR's review.
Signed-off-by: Maya Barnea <[email protected]>
Signed-off-by: Sergey Marunich <[email protected]>
Removed redundant lines and updated comments and help text to clarify that 'failure-injection-rate' is the probability of injecting failures, not specifically tied to failure mode. Signed-off-by: Sergey Marunich <[email protected]>
Force-pushed from aeb8bde to 08bcf08
Commits included in this force-push:
- KV cache and tokenization related configuration (llm-d#125). Signed-off-by: Ira <[email protected]>
- Publish kv-cache events (llm-d#126): fix lint errors, review fixes, sleep to allow previous sub to close. Signed-off-by: Ira <[email protected]>
- Use same version of tokenizer in both Dockerfile and Makefile (llm-d#132); fixes in readme file; updates according to the PR's review. Signed-off-by: Maya Barnea <[email protected]>
- Replaces usage of param.NewOpt with openai.Int for MaxTokens and openai.Bool with param.NewOpt for IncludeUsage in simulator_test.go to align with updated API usage. Signed-off-by: Sergey Marunich <[email protected]>
Force-pushed from 08bcf08 to 106e276
    Replaces usage of param.NewOpt with openai.Int for MaxTokens and openai.Bool with param.NewOpt for IncludeUsage in simulator_test.go to align with updated API usage.
Signed-off-by: Sergey Marunich <[email protected]>
Added descriptions for `failure-injection-rate` and `failure-types` configuration options to clarify their usage and defaults.
Changed the default value of FailureInjectionRate from 10 to 0 in newConfig, so failure injection is now disabled by default; it had effectively been enabled by default under the previous, now-deprecated 'failure' mode.
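For reference, a minimal configuration sketch based on the options described above (the failure type name in the comment is illustrative, since the allowed failure-types values are not listed in this thread):

```yaml
# Default after this change: failure injection disabled.
failure-injection-rate: 0

# To exercise error handling, raise the rate and optionally narrow the
# injected error types, e.g.:
# failure-injection-rate: 30
# failure-types:
#   - rate_limit
```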
@irar2 Thank you a lot for your feedback; please take an initial look and let me know what you think. I have incorporated your feedback, and am also adding this reference, vllm-project/vllm#12886, to the …

P.S. I am using the build with Envoy AI Gateway Demos for a provider failover demo: https://github.com/smarunich/envoy-ai-gateway-demos/tree/main/demos/03-provider-fallback. There is also a GitHub Actions run showcasing it: https://github.com/smarunich/envoy-ai-gateway-demos/actions/runs/16978406193/job/48132952395?pr=3
        
          
Review comment on pkg/llm-d-inference-sim/simulator.go (outdated):
    Param: nil,

    // The first parameter can be either a string message or a FailureSpec
    // isInjected indicates if this is an injected failure for logging purposes
    func (s *VllmSimulator) sendCompletionError(ctx *fasthttp.RequestCtx, errorInfo interface{}, isInjected bool) {
Please update according to the general comment
@smarunich Thanks for the links and the changes. We really appreciate your contribution. We did some more digging and found out that although the hierarchical error structure is already in the current vLLM code, it is not in the latest release. So we need to stay with the current structure for now and upgrade it later; I created issue #135 for this. In addition, I don't think there is a need for your FailureSpec struct: you can simply use the existing CompletionError and call this constructor to create an error to pass to sendCompletionError and to create a map of predefinedFailures. Please run … Please see additional comments inside the code.
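The shape the reviewer suggests might look like the sketch below. The type and constructor are hypothetical stand-ins, since the actual CompletionError API from the openai-server-api package is not shown in this thread; the message values are filled-in instances of the templates quoted earlier:

```go
package main

import "fmt"

// completionError is a hypothetical stand-in for the existing
// CompletionError type in the openai-server-api package.
type completionError struct {
	Message string
	Type    string
	Code    int
}

// newCompletionError stands in for the constructor the reviewer mentions.
func newCompletionError(msg, errType string, code int) completionError {
	return completionError{Message: msg, Type: errType, Code: code}
}

// One map of predefined failures, keyed by failure type, replaces the
// separate FailureSpec struct. Model name and type strings are illustrative.
var predefinedFailures = map[string]completionError{
	"rate_limit": newCompletionError(
		"Rate limit reached for my-model in organization org-xxx on requests per min (RPM): Limit 3, Used 3, Requested 1.",
		"rate_limit_exceeded", 429),
	"model_not_found": newCompletionError(
		"The model 'my-model-nonexistent' does not exist",
		"invalid_request_error", 404),
}

func main() {
	fmt.Println(predefinedFailures["rate_limit"].Code)
}
```

The same constructor then serves both paths: building the predefined map at startup and constructing ad-hoc errors to pass to sendCompletionError.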
@smarunich We are planning a release soon, and would like to include this feature in it. Will you have time to continue with the PR in the next couple of days? If your schedule doesn't allow this, please let us know and we will continue with this PR.
@irar2 I was off last week, let me resume by today and update the thread.
Failure handling in the simulator now uses the CompletionError struct from the openai-server-api package, replacing custom error fields with a unified structure. This improves consistency in error responses and simplifies error injection logic. Associated tests and error handling code have been updated to reflect this change. Signed-off-by: Sergey Marunich <[email protected]>
@irar2 Please feel free to take it forward. I have done a few updates, but I am running out of cycles for this week as I have run into conflicting priorities; I won't have cycles to work on this until the very beginning of next week, and meanwhile I might miss on quality. I don't want to delay you folks, and I truly appreciate your efforts; I would really appreciate it if you can take this forward to completion!
@smarunich Thanks a lot! We will take it from here then.
Fixes #135
        
          
Review comment on pkg/common/config.go (outdated):
    FakeMetrics *Metrics `yaml:"fake-metrics" json:"fake-metrics"`

    // FailureInjectionRate is the probability (0-100) of injecting failures
    FailureInjectionRate int `yaml:"failure-injection-rate"`
please add json annotation
    )
    })

    Describe("Failure injection mode", func() {
move this test to failure_test please
Hi Sergey @smarunich
@smarunich We added your signature to the extended comments of the squashed merge that we did.
Thank you @irar2 @mayabar for taking this forward to completion, truly appreciated!
The initial PR for a discussion on failure mode, as I am doing some tests locally with client and gateway testing. WDYT in general from the mode standpoint? How should it be introduced? Open to the feedback and happy to work on it.

Introduces a 'failure' mode to the simulator, allowing random injection of OpenAI API-compatible error responses for testing error handling. Adds configuration options for failure injection rate and specific failure types, implements error response logic, and updates documentation and tests to cover the new functionality.
This pull request adds a new "failure" simulation mode to the LLM-D inference simulator, enabling randomized injection of OpenAI-compatible API error responses for enhanced error-handling test scenarios. It introduces configuration options for controlling the failure injection rate and specifying which error types to inject, along with robust validation and documentation updates. The implementation includes a new error response system, supporting unit tests, and integration into the main completion request handler.
Simulator functionality expansion:
- A new failure simulation mode, which randomly injects OpenAI API-compatible error responses for testing purposes. This mode is selectable via the mode parameter and is documented in README.md. [1] [2]
- New configuration options failure-injection-rate (controls the probability of error injection) and failure-types (specifies which error types to inject), with validation for allowed values and documentation updates. [1] [2] [3] [4] [5] [6] [7] [8] [9]

Failure injection implementation:
- A new file, pkg/common/failures.go, defining error types, failure injection logic, and random selection of error responses based on configuration.
- Integration into the completion request handler in simulator.go, returning error responses when triggered. Added a dedicated method for sending structured error responses. [1] [2]

Testing and validation:
- New tests in pkg/common/failures_test.go to verify failure injection logic and error response generation.

Other improvements:
- Refactored error handling in simulator.go to use a consistent error response structure.
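Putting the thread together, an injected error body presumably resembles the flat vLLM-style OpenAI error structure the reviewers chose to keep for now, plus the object field mentioned in the commit messages. The concrete values below are illustrative, not taken from the implementation:

```json
{
  "object": "error",
  "message": "Rate limit reached for my-model in organization org-xxx on requests per min (RPM): Limit 3, Used 3, Requested 1.",
  "type": "rate_limit_exceeded",
  "param": null,
  "code": 429
}
```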