[Feature]: dump current requests on failure or on SIGKILL #11036

@okdimok

🚀 The feature, motivation and pitch

When TRT-LLM runs in production, certain request combinations occasionally cause it to crash or hang. Such failures are currently very hard to reproduce, because there is no way to learn which requests were in the batch when the crash happened.
I propose adding a config parameter that enables dumping the current requests to a file. The dump would be triggered on crashes, and when an external system kills the worker because it failed its health checks.
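
To make the request concrete, here is a rough sketch of the behavior I have in mind. It is not TRT-LLM code: `InflightRequestRegistry`, its methods, and the dump file layout are all hypothetical names made up for illustration. One caveat on the title: SIGKILL itself cannot be caught by a process, so the sketch assumes the external system sends a catchable signal such as SIGTERM first (as most orchestrators do before escalating to SIGKILL).

```python
import json
import os
import signal
import tempfile
import threading
import time
from contextlib import contextmanager


class InflightRequestRegistry:
    """Hypothetical helper that tracks in-flight requests and dumps them to a file."""

    def __init__(self, dump_path):
        self.dump_path = dump_path
        # RLock, so a signal handler running in the main thread can re-enter
        # even if the main thread was interrupted while holding the lock.
        self._lock = threading.RLock()
        self._requests = {}

    def add(self, request_id, payload):
        """Record a request when it enters the batch."""
        with self._lock:
            self._requests[request_id] = {
                "payload": payload,
                "started_at": time.time(),
            }

    def remove(self, request_id):
        """Forget a request once it has finished."""
        with self._lock:
            self._requests.pop(request_id, None)

    def dump(self, reason):
        """Write a snapshot of all in-flight requests to dump_path."""
        with self._lock:
            snapshot = dict(self._requests)
        record = {"reason": reason, "pid": os.getpid(), "requests": snapshot}
        # Write to a temporary file and rename it, so a reader never sees
        # a half-written dump.
        directory = os.path.dirname(os.path.abspath(self.dump_path))
        fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
        with os.fdopen(fd, "w") as f:
            json.dump(record, f, default=str)
        os.replace(tmp_path, self.dump_path)
        return record

    @contextmanager
    def dump_on_error(self):
        """Dump the in-flight requests if the wrapped block raises."""
        try:
            yield
        except Exception as exc:
            self.dump(f"exception: {exc!r}")
            raise

    def install_sigterm_handler(self):
        """Dump on SIGTERM, then terminate with the default behavior.

        Must be called from the main thread. SIGKILL cannot be handled.
        """

        def handler(signum, frame):
            self.dump("SIGTERM")
            signal.signal(signal.SIGTERM, signal.SIG_DFL)
            os.kill(os.getpid(), signal.SIGTERM)

        signal.signal(signal.SIGTERM, handler)
```

A worker would call `add` and `remove` around each request, wrap batch execution in `with registry.dump_on_error():`, and call `install_sigterm_handler()` once at startup. This only covers failures visible to Python; a hard crash in native code or a true hang would need a different trigger, for example a watchdog thread.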

cc @ltalal

Alternatives

Setting up a proxy that is fully aware of the state of every request being executed on every instance.

Additional context

No response


Labels

feature request: New feature or request. This includes new model, dtype, functionality support
