Add support for async / streaming / lazy / etc... #103
-
@elronbandel @nrfulton @HendrikStrobelt, here is the write-up of our discussions on async / streaming / etc... I will incorporate comments / changes into a final doc. Thank you!
-
In this case, Contexts must be immutable. I think we decided on that already though, right? (It's a major change in the sense that we have to reach into the tutorials and documentation and remove references to things like …)
-
I guess the idea here is something like the following. The Context gets a stream and some listeners:

```python
# File: mellea/stdlib/base.py
import abc


class StreamEntry:
    ...


class StreamListener:
    ...


class Context(abc.ABC):
    ...
    _stream: list[StreamEntry] = ...
    _stream_listeners: list[StreamListener] = ...

    def register_stream_listener(self, sl: StreamListener):
        self._stream_listeners.append(sl)

    def stream_insert(self, se: StreamEntry):
        self._stream.append(se)
        for sl in self._stream_listeners:
            sl.insert_event(se)
```

The …
-
How do we get the `ModelOutputThunk` for a result whose computation hasn't been triggered yet? For example:

```python
i_1 = Instruction("stuff goes here")

# The next line does NOT trigger a generate call,
# but when a generate call happens on `i_1`, the output thunk will be used.
i_1_result: ModelOutputThunk = i_1.output_thunk()
```

The problem with this is that we have proposed using a …
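For concreteness, here is one shape this could take (purely illustrative; `fulfill` and the other internals below are hypothetical, not proposed API): the instruction hands out an empty thunk up front, and whoever eventually runs generation fills it in.

```python
class ModelOutputThunk:
    """Sketch: a placeholder that is filled in when generation finally runs."""

    def __init__(self):
        self._value: str | None = None
        self.closed = False

    def fulfill(self, value: str) -> None:
        self._value = value
        self.closed = True

    def value(self) -> str:
        if not self.closed:
            raise RuntimeError("generation has not been triggered yet")
        assert self._value is not None
        return self._value


class Instruction:
    def __init__(self, description: str):
        self.description = description
        self._thunk = ModelOutputThunk()

    def output_thunk(self) -> ModelOutputThunk:
        # Handing out the thunk does NOT trigger generation.
        return self._thunk

    def generate(self) -> None:
        # When generation eventually runs, the already-handed-out
        # thunk is the thing that gets filled in.
        self._thunk.fulfill(f"output for: {self.description}")
```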
-
The advantage of components knowing about their own KV cache is that we could hijack Python's own memory management, like this:
Punchline: …
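One reading of "hijacking Python's own memory management" (a sketch under that assumption; `KVCacheManager` and `Component` here are hypothetical stand-ins): register a `weakref.finalize` callback so that when a component is garbage collected, its KV cache entry is evicted along with it.

```python
import uuid
import weakref


class KVCacheManager:
    """Hypothetical registry mapping cache ids to live KV caches."""

    def __init__(self):
        self._caches: dict[uuid.UUID, object] = {}

    def store(self, cache_id: uuid.UUID, cache: object) -> None:
        self._caches[cache_id] = cache

    def evict(self, cache_id: uuid.UUID) -> None:
        self._caches.pop(cache_id, None)


class Component:
    def __init__(self, manager: KVCacheManager, cache: object):
        self.cache_id = uuid.uuid4()
        manager.store(self.cache_id, cache)
        # When Python garbage-collects this component, the finalizer
        # fires and the KV cache entry is evicted along with it.
        weakref.finalize(self, manager.evict, self.cache_id)
```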
-
Async-Stream-Batch-Parse-Lazy-Logit-Probs
Things to support
Todo List
Enabling Changes
`m.ctx.insert`: raise an `ObjectProtocolError` whenever the insert method is called on `ctx`, with a pointer to release notes describing this change.
Changes to Backend Classes
Make `Backend.generate*` calls `async`. Rename `Backend` to `InferenceEngine` while we're at it.
Changes to Requirements
`await` when we validate: …
Changes to ModelOutputThunks
Changes to Sessions
Changes to Sampling Strategies (and Other "Looping" Operations)
Design Decision 1: Backend.generate* calls are async
Rationale: you can always just wait.
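A minimal sketch of the shape this implies (class and method names are assumptions, not the settled API):

```python
import asyncio


class InferenceEngine:
    """Sketch: every generate* call is a coroutine."""

    async def generate_from_context(self, prompt: str) -> str:
        await asyncio.sleep(0.01)  # stand-in for a non-blocking inference call
        return f"completion for: {prompt}"


# "You can always just wait": synchronous callers block on the coroutine.
result = asyncio.run(InferenceEngine().generate_from_context("hello"))
```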
Design Decision 2: Sessions are synchronous by default
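A sketch of what synchronous-by-default could look like on top of an async backend (the `instruct` signature is an assumption; the engine is the `InferenceEngine` sketch above):

```python
import asyncio


class Session:
    """Sketch: a synchronous facade over an async inference engine."""

    def __init__(self, engine):
        self._engine = engine  # an InferenceEngine-like object

    def instruct(self, description: str) -> str:
        # Callers see ordinary blocking behavior by default; the async
        # machinery stays under the hood.
        return asyncio.run(self._engine.generate_from_context(description))
```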
Design Decision 3: Batch only when we control the loop
Batch when we can fully unroll the loop.
Most inference engines don't actually support batching with the chat endpoint. In those cases, batching will be replaced by async-await patterns.
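Where an engine has no batch endpoint, the async-await replacement could look like this (a sketch; assumes the async `generate_from_context` above):

```python
import asyncio


async def generate_all(engine, prompts: list[str]) -> list[str]:
    # The loop is fully under our control and can be unrolled, so issue
    # every request concurrently and await them together.
    return await asyncio.gather(
        *(engine.generate_from_context(p) for p in prompts)
    )
```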
Design Decision 4: Always stream under the hood (if possible)
Inference engines will always stream when possible. We will have different fields for accessing the stream and for getting the final value. We will need to make sure streaming doesn't cause performance degradations, though.
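A sketch of the two access paths (field and method names are assumptions): one for consuming the live stream, one for awaiting the final value.

```python
from typing import AsyncIterator


class ModelOutputThunk:
    """Sketch: separate access to the live stream and the final value."""

    def __init__(self, stream: AsyncIterator[str]):
        self._stream = stream
        self._chunks: list[str] = []
        self.closed = False

    async def astream(self) -> AsyncIterator[str]:
        # Expose chunks as they arrive, accumulating them along the way.
        async for chunk in self._stream:
            self._chunks.append(chunk)
            yield chunk
        self.closed = True

    async def avalue(self) -> str:
        # Getting the final value just drains the stream to completion.
        async for _ in self.astream():
            pass
        return "".join(self._chunks)
```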
Design Decision 5: Logits / Probs / Tokens are Optional
If available from a `.generate_*` call, backends will attach the Logits / Probs / Tokens to a `ModelOutputThunk`. Code that utilizes these fields will first have to check that they aren't empty (and that the `ModelOutputThunk` is `closed`). It will also have to decide how to fall back if such values aren't available.
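Consuming code might guard these optional fields like this (a sketch; `ThunkResult` and its fields are hypothetical stand-ins):

```python
from dataclasses import dataclass


@dataclass
class ThunkResult:  # hypothetical stand-in for a closed ModelOutputThunk
    closed: bool = False
    logprobs: list[float] | None = None


def mean_logprob(result: ThunkResult) -> float | None:
    # Check that generation finished and logprobs were attached;
    # fall back to None when they aren't available.
    if not result.closed or not result.logprobs:
        return None
    return sum(result.logprobs) / len(result.logprobs)
```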
Design Decision 6: Contexts Branch
Contexts will form a tree structure. Leaf nodes will be the primary place where new actions are inserted (although it should be possible to select a parent node and re-branch from there).
Sessions need to have helper functions that handle context branching.
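A sketch of branching, immutable contexts (the `insert` semantics here are an assumption, consistent with the immutability discussion above):

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Context:
    """Sketch: immutable contexts forming a tree via parent pointers."""

    entry: object | None = None
    parent: Context | None = None

    def insert(self, entry: object) -> Context:
        # Never mutates: returns a new leaf branching off this node.
        return Context(entry=entry, parent=self)


root = Context()
a = root.insert("turn 1")       # one branch
b = root.insert("alt turn 1")   # re-branch from the same parent node
```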
Design Decision 7: Sessions Have Streams
Sessions need to be able to communicate as events happen. They will use a listener/subscriber model for that.
We will need to define / standardize at least some of the events. These include: …
Events will likely also include a context identifier so that listeners can determine what context within a session an event is attached to.
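A sketch of the listener/subscriber shape with a context identifier attached to each event (the event kinds shown are placeholders, since the standardized set is still to be defined):

```python
import uuid
from dataclasses import dataclass
from typing import Callable


@dataclass
class SessionEvent:
    kind: str              # placeholder kinds, e.g. "started", "chunk", "closed"
    context_id: uuid.UUID  # which context within the session this belongs to
    payload: object | None = None


class SessionStream:
    def __init__(self):
        self._listeners: list[Callable[[SessionEvent], None]] = []

    def subscribe(self, listener: Callable[[SessionEvent], None]) -> None:
        self._listeners.append(listener)

    def publish(self, event: SessionEvent) -> None:
        # Fan each event out to every subscriber as it happens.
        for listener in self._listeners:
            listener(event)
```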
Design Decision 8: Spans
We will implement parts for Components. Parts is a list of dependencies / parts that make up that component. If any are `empty`, they will need to be generated first, before `.generate_*` can be called on the current component.
Scopes will be pointers to the context that created the component / cblock. With our current parsing strategy, we can really only generate cblocks.
CBlocks can be things like documents and large portions of text. As a result, they should also hold KV cache info. This will be either the cache itself, a dataclass that holds the backend, model_id, etc. that generated it, or a UUID that points to one of those things.
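Pulling Design Decision 8 together in one sketch (all names here are hypothetical): components carry `parts`, and CBlocks carry their text plus one of the three proposed forms of KV cache info.

```python
from __future__ import annotations

import uuid
from dataclasses import dataclass, field


@dataclass
class KVCacheInfo:
    """One proposed form: a record of what generated the cache."""

    backend: str
    model_id: str


@dataclass
class CBlock:
    text: str = ""
    # The cache itself, a KVCacheInfo record, or a UUID pointing at one.
    kv_cache: object | KVCacheInfo | uuid.UUID | None = None


@dataclass
class Component:
    parts: list[CBlock | Component] = field(default_factory=list)

    def ready(self) -> bool:
        # generate_* can only run once no part is empty; empty parts
        # must be generated first.
        return all(
            bool(p.text) if isinstance(p, CBlock) else p.ready()
            for p in self.parts
        )
```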