refactor: dispatcher as forwarding decision maker for llm (#305)
LLM serving is stateful. When a request is served in a distributed
manner, it needs to be routed to the same set of workers for efficient
decoding. In particular, when an output token needs to go back to the
first layer, it has to go back to the worker that holds the KV cache.
Otherwise, decoding may be incorrect, if not broken outright. However,
when a serving pipeline is constructed as a mesh, the current
implementation doesn't guarantee correct forwarding. To allow correct
forwarding, we make a server (the dispatcher) act as the forwarding
decision maker for LLM serving. Since there is only one dispatcher, the
workers of the last stage can deterministically forward a generated
token back to the dispatcher. The dispatcher then determines whether
the token needs to be sent back to the first stage.
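
The forwarding decision described above can be sketched roughly as
follows. This is an illustrative sketch only, not the code in this
commit: the `Dispatcher` class, its `admit` and `on_token` methods, the
`EOS_TOKEN` value, the `max_new_tokens` limit, and the round-robin
pinning policy are all assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass, field

EOS_TOKEN = 0  # hypothetical end-of-sequence token id


@dataclass
class Dispatcher:
    """Hypothetical single dispatcher that owns every forwarding decision."""

    first_stage_workers: list[str]
    max_new_tokens: int = 4  # assumed stop condition for the sketch
    # request id -> first-stage worker assumed to hold its KV cache
    pinned: dict[str, str] = field(default_factory=dict)
    # request id -> tokens generated so far
    generated: dict[str, list[int]] = field(default_factory=dict)
    _next: int = 0  # round-robin cursor (assumed placement policy)

    def admit(self, request_id: str) -> str:
        """Pin a new request to one first-stage worker."""
        worker = self.first_stage_workers[self._next % len(self.first_stage_workers)]
        self._next += 1
        self.pinned[request_id] = worker
        self.generated[request_id] = []
        return worker

    def on_token(self, request_id: str, token: int) -> str | None:
        """Handle a token sent back by a last-stage worker.

        Returns the first-stage worker to forward the token to,
        or None when decoding for this request is finished.
        """
        tokens = self.generated[request_id]
        tokens.append(token)
        if token == EOS_TOKEN or len(tokens) >= self.max_new_tokens:
            # Decoding is done, so the request no longer needs a pinned worker.
            del self.pinned[request_id]
            return None
        # Keep decoding on the same worker so the KV cache is reused.
        return self.pinned[request_id]
```

Because every last-stage worker sends its token to the one dispatcher,
the mapping from request to first-stage worker lives in a single place,
and the mesh topology between stages does not affect where the next
decoding step lands.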