Conversation
Codecov Report ✅ All modified and coverable lines are covered by tests.
@benlebrun PR ready for review!
benlebrun
left a comment
Left a few comments---I think this can be cleaned up a bit, but generally looks good.
One thing to consider: we don't actually need the timeout functionality for async batching. The basic idea is to have a queue and a background task that eagerly pulls from the queue and processes all requests together in a batch.
Because of the way that the asyncio scheduler works, all concurrent requests will be batched together. The background task grabs one item with an await, and then drains the queue with get_nowait() to form the batch. Because other coroutines that call _queue_request will run while the background loop awaits the first get(), those concurrent requests will land in the same batch.
A rough sketch looks like this:
```python
import asyncio
from typing import Any


class AutoBatchedSketch:
    def __init__(self):
        self._queue = None
        self._task = None

    def _start(self):
        if not self._task or self._task.done():
            self._queue = asyncio.Queue()
            self._task = asyncio.create_task(self._background_loop())

    def _queue_request(self, request):
        if not self._task or self._task.done():
            self._start()
        future = asyncio.get_running_loop().create_future()
        self._queue.put_nowait((request, future))
        return future

    async def next_token_logprobs(self, token_ids) -> Any:
        """Public API. Enqueue a request and await its result."""
        return await self._queue_request(token_ids)

    async def _background_loop(self):
        while True:
            # Block on the first item; concurrent callers enqueue meanwhile.
            requests = [await self._queue.get()]
            try:
                # Drain whatever else is already queued to form the batch.
                try:
                    while True:
                        requests.append(self._queue.get_nowait())
                except asyncio.QueueEmpty:
                    pass
                inputs, futures = zip(*requests)
                results = self._batch_call(inputs)
                for future, result in zip(futures, results):
                    future.set_result(result)
            except Exception as e:
                for _, future in requests:
                    if not future.done():
                        future.set_exception(e)
                raise
```

Not saying that we need to implement this approach now, but it is worth keeping in mind if you want a more efficient approach that doesn't require specifying a batch size and a timeout. We used it here: https://github.com/genlm/genlm-bytes/blob/e76ca6908b2360690e5ecf098b377395b342978a/genlm/bytes/trie.py#L484.
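To make the scheduling argument concrete, here is a hypothetical, self-contained demo of the pattern. The class names and the doubling `_batch_call` are placeholders for illustration, not the PR's actual model call; the point is that eight concurrent callers issued via `asyncio.gather` all land in a single batch, with no timeout involved.

```python
import asyncio


class AutoBatched:
    """Toy stand-in for the auto-batched sketch; `_batch_call` just doubles."""

    def __init__(self):
        self._queue = None
        self._task = None
        self.batch_sizes = []  # observed batch sizes, recorded for the demo

    def _start(self):
        self._queue = asyncio.Queue()
        self._task = asyncio.create_task(self._background_loop())

    async def call(self, x):
        if not self._task or self._task.done():
            self._start()
        future = asyncio.get_running_loop().create_future()
        self._queue.put_nowait((x, future))
        return await future

    def _batch_call(self, inputs):
        # Placeholder for a real batched model call.
        return [x * 2 for x in inputs]

    async def _background_loop(self):
        while True:
            # Block on the first item, then drain the rest without awaiting.
            requests = [await self._queue.get()]
            try:
                while True:
                    requests.append(self._queue.get_nowait())
            except asyncio.QueueEmpty:
                pass
            self.batch_sizes.append(len(requests))
            inputs, futures = zip(*requests)
            for future, result in zip(futures, self._batch_call(inputs)):
                future.set_result(result)


async def main():
    ab = AutoBatched()
    results = await asyncio.gather(*(ab.call(i) for i in range(8)))
    return results, ab.batch_sizes


results, batch_sizes = asyncio.run(main())
print(results, batch_sizes)  # all eight requests arrive in one batch
```

Because the background task is only scheduled after the caller tasks that `gather` created, every caller enqueues its request before the loop's first `get()` resolves, so `batch_sizes` records a single batch of eight.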
benlebrun
left a comment
Looks great! Just left a few comments, then should be good to merge!
```python
class Query:
```
We should be using a dataclass here, e.g.:

```python
from dataclasses import dataclass
from typing import Optional

import asyncio
import mlx.core as mx


@dataclass
class Query:
    prompt: str
    future: asyncio.Future
    past: Optional[mx.array] = None
```
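For reference, the suggested dataclass can be exercised like this. `Any` stands in for `mx.array` so the snippet runs without MLX installed, and the logprob values are made up for illustration.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Query:
    prompt: str
    future: asyncio.Future
    past: Optional[Any] = None  # stands in for Optional[mx.array]


async def demo():
    loop = asyncio.get_running_loop()
    q = Query(prompt="hello", future=loop.create_future())
    q.future.set_result([0.1, 0.9])  # e.g. logprobs computed for this query
    return q.prompt, await q.future, q.past


out = asyncio.run(demo())
print(out)
```

Compared with a bare tuple, the dataclass gives the batching code named fields and a default for the optional KV-cache slot.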
genlm/backend/llm/mlx.py
```python
self.generation_stream = mx.new_stream(mx.default_device())
# …
self.queries = []
self.batch_size = (
```
Add a warning to let the user know that the model is not batchable.
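One possible shape for that warning, as a hedged sketch: `supports_batching` and `check_batchable` are hypothetical names for illustration, not actual attributes of the mlx.py backend.

```python
import warnings


def check_batchable(supports_batching: bool, batch_size: int) -> int:
    """Clamp batch_size to 1 and warn when the model cannot batch.

    Hypothetical helper; the real backend would consult its own
    model capabilities rather than a boolean argument.
    """
    if not supports_batching and batch_size > 1:
        warnings.warn(
            "Model does not support batching; falling back to batch_size=1.",
            UserWarning,
        )
        return 1
    return batch_size
```

A `UserWarning` (rather than a hard error) lets existing single-request code keep working while surfacing the silent fallback.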
genlm/backend/llm/mlx.py
```python
@staticmethod
def _to_torch(logprobs):
    """Convert MLX arrays into PyTorch tensors."""
    if logprobs.dtype in [mx.bfloat16]:
```
… to new API, with dependency updated to enforce the lowest compatible version.
Adding batching function to MLX-LM backend with KV caching.