Provide a way not to double tokenize the prompt in case of the token-aware KV routing #1875

@atchernych

What would you like to be added:
I need a way for a plugin to modify the request body once the EPP has picked the worker.

I am not sure what the best way to go about this is. I envision:

  1. The director creates the LLMRequest once (with an empty annotations map) and hands the same object through scheduling, so the scorer’s annotation sticks around. This requires an extra Annotations field on LLMRequest.
  2. The EPP routing plugin fills in req.Annotations[tokenDataAnnotationKey] =
  3. After scheduling succeeds, runPreRequestPlugins invokes each plugin and, for mutators, passes the live body map. runPreRequestPlugins would need to be extended so that these body mutations are included in the outgoing request.
  4. Another pre-request plugin (pkg/epp/requestcontrol/plugins/new/plugin.go) reads this annotation and copies the token data into the body that is sent to the workers.
  5. Once the director returns, StreamingServer.Process marshals reqCtx.Request.Body (now containing token_data) and stores it in the ext-proc responses that go back to Envoy.
  6. From there Envoy forwards those mutated bytes downstream. In short: router plugin → request annotations → PreRequest mutator → marshalled body → Envoy → worker.
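
The flow above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual EPP code: the `LLMRequest` shape, the `BodyMutator` interface, the `tokenAwareScorer` and `tokenDataInjector` types, the `route` helper, the value of `tokenDataAnnotationKey`, and the toy tokenizer and worker selection are all assumptions made for the example. Only the names `LLMRequest`, `Annotations`, `tokenDataAnnotationKey`, `runPreRequestPlugins`, and `token_data` come from the description above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// tokenDataAnnotationKey is named in the issue; its value here is an assumption.
const tokenDataAnnotationKey = "token_data"

// LLMRequest is a simplified, hypothetical stand-in for the scheduler's
// request type, extended with the proposed Annotations field.
type LLMRequest struct {
	Prompt      string
	Annotations map[string]any
}

// fakeTokenize is a toy tokenizer (one "token" per word, ID = word length).
// A real plugin would call the model's tokenizer instead.
func fakeTokenize(prompt string) []int {
	words := strings.Fields(prompt)
	ids := make([]int, 0, len(words))
	for _, w := range words {
		ids = append(ids, len(w))
	}
	return ids
}

// tokenAwareScorer is a hypothetical routing plugin (step 2): it tokenizes
// the prompt once, records the tokens as an annotation, and picks a worker.
type tokenAwareScorer struct{}

func (tokenAwareScorer) Pick(req *LLMRequest, workers []string) string {
	tokens := fakeTokenize(req.Prompt)
	req.Annotations[tokenDataAnnotationKey] = tokens
	// Placeholder selection; real KV-aware scoring would go here.
	return workers[len(tokens)%len(workers)]
}

// BodyMutator is a hypothetical pre-request plugin interface that receives
// the live body map (step 3).
type BodyMutator interface {
	MutateBody(req *LLMRequest, body map[string]any)
}

// tokenDataInjector copies the annotation into the request body (step 4).
type tokenDataInjector struct{}

func (tokenDataInjector) MutateBody(req *LLMRequest, body map[string]any) {
	if tokens, ok := req.Annotations[tokenDataAnnotationKey]; ok {
		body["token_data"] = tokens
	}
}

// runPreRequestPlugins invokes every mutator with the same live body map.
func runPreRequestPlugins(req *LLMRequest, body map[string]any, plugins []BodyMutator) {
	for _, p := range plugins {
		p.MutateBody(req, body)
	}
}

// route walks through steps 1 to 5: create the request once, schedule,
// run the mutators, and marshal the mutated body for Envoy.
func route(prompt string, workers []string) (string, []byte, error) {
	req := &LLMRequest{Prompt: prompt, Annotations: map[string]any{}}
	body := map[string]any{"prompt": prompt}

	worker := tokenAwareScorer{}.Pick(req, workers)
	runPreRequestPlugins(req, body, []BodyMutator{tokenDataInjector{}})

	out, err := json.Marshal(body)
	return worker, out, err
}

func main() {
	worker, out, err := route("hello big world", []string{"worker-a", "worker-b"})
	if err != nil {
		panic(err)
	}
	fmt.Println(worker, string(out))
}
```

The key design point in this sketch is that the scorer and the mutator never talk to each other directly; they only share the `Annotations` map on the single request object, so the prompt is tokenized exactly once.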

Why is this needed:

I have an EPP routing plugin that does token-aware KV routing using the model's tokenizer. To identify the best worker, the plugin has to tokenize the prompt. I want to pass these tokens to the serving workers in the request body. Otherwise the workers will tokenize the same prompt again, which adds latency.
