Replies: 4 comments 1 reply
-
There are two separate but related concerns:
There's also some overlap here with the work that @jakelorocco is doing on our current sprint. The async+lazy stuff provides an obvious opportunity for batching. Also, as we move toward more sophisticated KV handling, primitives like …
-
One immediate path forward would be to specify a batching interface in Backends, build explicit batching on top of it, and then also use that interface for more automated batch construction. This is a significant enough change that we should probably consider the options and write up a design doc. Rough sketch of what I mean below.
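A minimal sketch of what such an interface could look like (all names here are made up for illustration, not the existing Backend API). The idea is a default batch method that falls back to per-prompt calls, so existing backends keep working while engines with a real batched path can override it:

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Any, Sequence


class Backend(ABC):
    """Hypothetical base class; method names are illustrative only."""

    @abstractmethod
    async def generate(self, prompt: Any, **kwargs) -> Any:
        """Generate a completion for a single prompt."""
        ...

    async def generate_batch(self, prompts: Sequence[Any], **kwargs) -> list[Any]:
        """Default batching: just fan out per-prompt calls concurrently.

        Backends that support true batching (e.g. a local engine with a
        batched forward pass) would override this with a real batched path.
        """
        return list(await asyncio.gather(*(self.generate(p, **kwargs) for p in prompts)))
```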
-
Also, let's explicitly call out how we will measure the benefits of this in terms of a real-world use case, e.g. the TaP agent.
-
One issue here is that no provider / inference engine (at least none that we support / that I could find) supports batching via the chat completions API (which is what we use). If we swapped to the completions API and applied the chat templates ourselves, we could support this. Otherwise, the best we get is async calls (which are being worked on in this sprint) and hoping that the provider / inference engine handles them efficiently and batches them. I believe async actually gives us most of the benefits outlined above; we can still fire off multiple requests and process the results for sampling strategies as they arrive.
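For concreteness, here's a minimal sketch of the concurrent-requests approach, assuming an OpenAI-compatible endpoint is configured; the model name is a placeholder. The server/engine is free to batch these requests internally if it supports continuous batching:

```python
import asyncio

from openai import AsyncOpenAI

client = AsyncOpenAI()  # assumes credentials / base_url for an OpenAI-compatible endpoint


async def complete(prompt: str) -> str:
    # One chat-completions request per prompt.
    resp = await client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content


async def complete_all(prompts: list[str]) -> list[str]:
    # Fire off all requests concurrently and collect the results.
    return list(await asyncio.gather(*(complete(p) for p in prompts)))


# results = asyncio.run(complete_all(["prompt A", "prompt B", "prompt C"]))
```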
-
The existing SamplingStrategy.sample method operates iteratively, processing each prompt and its associated requirements one at a time until the requirements are satisfied or the loop budget is exhausted. This sequential process can lead to higher latency, since validating each prompt against its requirements requires separate calls and computations.
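Roughly, the current flow looks like this (a simplified illustration, not the real code; `with_feedback` is a hypothetical helper):

```python
def sample(prompt, requirements, backend, loop_budget):
    # One generation call and one validation pass per attempt, strictly sequential.
    for _ in range(loop_budget):
        result = backend.generate(prompt)
        failed = [r for r in requirements if not r.validate(result)]
        if not failed:
            return result
        prompt = prompt.with_feedback(failed)  # repair the prompt and retry
    return result
```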
Introduce a batching mechanism in the SamplingStrategy.sample method that groups multiple prompts, along with their respective requirements, into a single batch for simultaneous processing. The batching behavior should be configurable per sampling strategy; for example, a strategy could extrapolate from requirement validation and create multiple instructions/prompts that can be batched together, which should help performance. Depending on the sampling strategy, the prompts could be duplicated with additional context or with additional requirements. A sketch of the batched variant follows.
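A hypothetical batched variant, assuming a `generate_batch` method on the backend (as in the interface sketch above) and the same made-up `with_feedback` helper; none of this is the existing API:

```python
async def sample_batched(prompt, requirements, backend, loop_budget, batch_size=4):
    # Start with duplicated prompts; a strategy could instead vary context or requirements per candidate.
    candidates = [prompt] * batch_size
    for _ in range(loop_budget):
        # One batched generation call and one validation pass over all candidates.
        results = await backend.generate_batch(candidates)
        for result in results:
            if all(r.validate(result) for r in requirements):
                return result
        # No candidate passed: build the next batch from each failure's feedback.
        candidates = [
            prompt.with_feedback([r for r in requirements if not r.validate(res)])
            for res in results
        ]
    return results[0]
```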