Add support for async / streaming / lazy / etc... #103
-
@elronbandel @nrfulton @HendrikStrobelt, here is the write-up of our discussions on async / streaming / etc... I will incorporate comments / changes into a final doc. Thank you!
-
In this case, Contexts must be immutable. I think we decided on that already though, right? (It's a major change in the sense that we have to reach into the tutorials and documentation and remove references to things like …)
-
I guess the idea here is something like the following. The Context gets a stream and some listeners:

```python
# File: mellea/stdlib/base.py
import abc


class StreamEntry:
    ...


class StreamListener:
    ...


class Context(abc.ABC):
    ...
    _stream: list[StreamEntry] = ...
    _stream_listeners: list[StreamListener] = ...

    def register_stream_listener(self, sl: StreamListener):
        self._stream_listeners.append(sl)

    def stream_insert(self, se: StreamEntry):
        self._stream.append(se)
        for sl in self._stream_listeners:
            sl.insert_event(se)
```

The …
-
How do we get the `ModelOutputThunk` for a result whose computation hasn't been triggered yet? For example:

```python
i_1 = Instruction("stuff goes here")

# The next line does NOT trigger a generate call,
# but when a generate call happens on `i_1`, the output thunk will be used.
i_1_result: ModelOutputThunk = i_1.output_thunk()
```

The problem with this is that we have proposed using a …
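For concreteness, here is one shape this could take (purely illustrative; `fulfill` and the other internals below are hypothetical, not proposed API): the instruction hands out an empty thunk up front, and whoever eventually runs generation fills it in.

```python
class ModelOutputThunk:
    """Sketch: a placeholder that is filled in when generation finally runs."""

    def __init__(self):
        self._value: str | None = None
        self.closed = False

    def fulfill(self, value: str) -> None:
        self._value = value
        self.closed = True

    def value(self) -> str:
        if not self.closed:
            raise RuntimeError("generation has not been triggered yet")
        assert self._value is not None
        return self._value


class Instruction:
    def __init__(self, description: str):
        self.description = description
        self._thunk = ModelOutputThunk()

    def output_thunk(self) -> ModelOutputThunk:
        # Handing out the thunk does NOT trigger generation.
        return self._thunk

    def generate(self) -> None:
        # When generation eventually runs, the already-handed-out
        # thunk is the thing that gets filled in.
        self._thunk.fulfill(f"output for: {self.description}")
```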
-
The advantage of components knowing about their own KV cache is that we could hijack Python's own memory management, like this:
Punchline: …
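One reading of "hijacking Python's own memory management" (a sketch under that assumption; `KVCacheManager` and `Component` here are hypothetical stand-ins): register a `weakref.finalize` callback so that when a component is garbage collected, its KV cache entry is evicted along with it.

```python
import uuid
import weakref


class KVCacheManager:
    """Hypothetical registry mapping cache ids to live KV caches."""

    def __init__(self):
        self._caches: dict[uuid.UUID, object] = {}

    def store(self, cache_id: uuid.UUID, cache: object) -> None:
        self._caches[cache_id] = cache

    def evict(self, cache_id: uuid.UUID) -> None:
        self._caches.pop(cache_id, None)


class Component:
    def __init__(self, manager: KVCacheManager, cache: object):
        self.cache_id = uuid.uuid4()
        manager.store(self.cache_id, cache)
        # When Python garbage-collects this component, the finalizer
        # fires and the KV cache entry is evicted along with it.
        weakref.finalize(self, manager.evict, self.cache_id)
```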
-
Async-Stream-Batch-Parse-Lazy-Logit-Probs
Things to support
Todo List
Enabling Changes
`m.ctx.insert`: raise an `ObjectProtocolError` whenever the insert method is called on `ctx`, with a pointer to release notes describing this change.
Changes to Backend Classes
Make `Backend.generate*` calls `async`. Rename `Backend` to `InferenceEngine` while we're at it.
Changes to Requirements
`await` when we validate: …
Changes to ModelOutputThunks
Changes to Sessions
Changes to Sampling Strategies (and Other "Looping" Operations)
Design Decision 1: Backend.generate* calls are async
Rationale: you can always just wait.
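A minimal sketch of the shape this implies (class and method names are assumptions, not the settled API):

```python
import asyncio


class InferenceEngine:
    """Sketch: every generate* call is a coroutine."""

    async def generate_from_context(self, prompt: str) -> str:
        await asyncio.sleep(0.01)  # stand-in for a non-blocking inference call
        return f"completion for: {prompt}"


# "You can always just wait": synchronous callers block on the coroutine.
result = asyncio.run(InferenceEngine().generate_from_context("hello"))
```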
Design Decision 2: Sessions are synchronous by default
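A sketch of what synchronous-by-default could look like on top of an async backend (the `instruct` signature is an assumption; the engine is the `InferenceEngine` sketch above):

```python
import asyncio


class Session:
    """Sketch: a synchronous facade over an async inference engine."""

    def __init__(self, engine):
        self._engine = engine  # an InferenceEngine-like object

    def instruct(self, description: str) -> str:
        # Callers see ordinary blocking behavior by default; the async
        # machinery stays under the hood.
        return asyncio.run(self._engine.generate_from_context(description))
```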
Design Decision 3: Batch only when we control the loop
Batch when we can fully unroll the loop.
Most inference engines don't actually support batching with the chat endpoint. In those cases, batching will be replaced by async-await patterns.
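Where an engine has no batch endpoint, the async-await replacement could look like this (a sketch; assumes the async `generate_from_context` above):

```python
import asyncio


async def generate_all(engine, prompts: list[str]) -> list[str]:
    # The loop is fully under our control and can be unrolled, so issue
    # every request concurrently and await them together.
    return await asyncio.gather(
        *(engine.generate_from_context(p) for p in prompts)
    )
```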
Design Decision 4: Always stream under the hood (if possible)
Inference engines will always stream when possible. We will have different fields for accessing the stream and for getting the final value. We will need to make sure streaming doesn't cause performance degradations, though.
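A sketch of the two access paths (field and method names are assumptions): one for consuming the live stream, one for awaiting the final value.

```python
from typing import AsyncIterator


class ModelOutputThunk:
    """Sketch: separate access to the live stream and the final value."""

    def __init__(self, stream: AsyncIterator[str]):
        self._stream = stream
        self._chunks: list[str] = []
        self.closed = False

    async def astream(self) -> AsyncIterator[str]:
        # Expose chunks as they arrive, accumulating them along the way.
        async for chunk in self._stream:
            self._chunks.append(chunk)
            yield chunk
        self.closed = True

    async def avalue(self) -> str:
        # Getting the final value just drains the stream to completion.
        async for _ in self.astream():
            pass
        return "".join(self._chunks)
```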
Design Decision 5: Logits / Probs / Tokens are Optional
If available from a `.generate_*` call, backends will attach the Logits / Probs / Tokens to a `ModelOutputThunk`. Code that utilizes these fields will first have to check that they aren't empty (and that the `ModelOutputThunk` is `closed`). It will also have to decide how to fall back if such values aren't available.
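Consuming code might guard these optional fields like this (a sketch; `ThunkResult` and its fields are hypothetical stand-ins):

```python
from dataclasses import dataclass


@dataclass
class ThunkResult:  # hypothetical stand-in for a closed ModelOutputThunk
    closed: bool = False
    logprobs: list[float] | None = None


def mean_logprob(result: ThunkResult) -> float | None:
    # Check that generation finished and logprobs were attached;
    # fall back to None when they aren't available.
    if not result.closed or not result.logprobs:
        return None
    return sum(result.logprobs) / len(result.logprobs)
```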
Design Decision 6: Contexts Branch
Contexts will form a tree structure. Leaf nodes will be the primary place where new actions are inserted (although it should be possible to select a parent node and re-branch from there).
Sessions need to have helper functions that handle context branching.
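A sketch of branching, immutable contexts (the `insert` semantics here are an assumption, consistent with the immutability discussion above):

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Context:
    """Sketch: immutable contexts forming a tree via parent pointers."""

    entry: object | None = None
    parent: Context | None = None

    def insert(self, entry: object) -> Context:
        # Never mutates: returns a new leaf branching off this node.
        return Context(entry=entry, parent=self)


root = Context()
a = root.insert("turn 1")       # one branch
b = root.insert("alt turn 1")   # re-branch from the same parent node
```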
Design Decision 7: Sessions Have Streams
Sessions need to be able to communicate as events happen. They will use a listener/subscriber model for that.
We will need to define / standardize at least some of the events. These include: …
Events will likely also include a context identifier so that listeners can determine what context within a session an event is attached to.
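A sketch of the listener/subscriber shape with a context identifier attached to each event (the event kinds shown are placeholders, since the standardized set is still to be defined):

```python
import uuid
from dataclasses import dataclass
from typing import Callable


@dataclass
class SessionEvent:
    kind: str              # placeholder kinds, e.g. "started", "chunk", "closed"
    context_id: uuid.UUID  # which context within the session this belongs to
    payload: object | None = None


class SessionStream:
    def __init__(self):
        self._listeners: list[Callable[[SessionEvent], None]] = []

    def subscribe(self, listener: Callable[[SessionEvent], None]) -> None:
        self._listeners.append(listener)

    def publish(self, event: SessionEvent) -> None:
        # Fan each event out to every subscriber as it happens.
        for listener in self._listeners:
            listener(event)
```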
Design Decision 8: Spans
We will implement parts for Components. Parts is a list of dependencies / parts that make up that component. If any are `empty`, they will need to be generated first, before `.generate_*` can be called on the current component.
Scopes will be pointers to the context that created the component / cblock. With our current parsing strategy, we can really only generate cblocks.
CBlocks can be things like documents and large portions of text. As a result, they should also hold KV cache info. This will be either the cache itself, a dataclass that holds the backend, model_id, etc. that generated it, or a UUID that points to one of those things.
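Pulling Design Decision 8 together in one sketch (all names here are hypothetical): components carry `parts`, and CBlocks carry their text plus one of the three proposed forms of KV cache info.

```python
from __future__ import annotations

import uuid
from dataclasses import dataclass, field


@dataclass
class KVCacheInfo:
    """One proposed form: a record of what generated the cache."""

    backend: str
    model_id: str


@dataclass
class CBlock:
    text: str = ""
    # The cache itself, a KVCacheInfo record, or a UUID pointing at one.
    kv_cache: object | KVCacheInfo | uuid.UUID | None = None


@dataclass
class Component:
    parts: list[CBlock | Component] = field(default_factory=list)

    def ready(self) -> bool:
        # generate_* can only run once no part is empty; empty parts
        # must be generated first.
        return all(
            bool(p.text) if isinstance(p, CBlock) else p.ready()
            for p in self.parts
        )
```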