MLX-LM batching #52

Merged
shepardxia merged 41 commits into main from mlx-lm-batching
Nov 13, 2025

Conversation

Contributor

@shepardxia commented Oct 23, 2025

Adds a batching function to the MLX-LM backend, with KV caching.

@codecov

codecov bot commented Oct 23, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.


@shepardxia marked this pull request as ready for review October 29, 2025 21:21
@shepardxia
Contributor Author

@benlebrun PR ready for review!

Member

@benlebrun left a comment

Looks good overall. I left a few comments; I think this can be cleaned up a bit.

One thing to consider: we don't actually need the timeout functionality for async batching. The basic idea is to have a queue and a background task that eagerly pulls from the queue and processes all requests together in a batch.

Because of the way that the asyncio scheduler works, all concurrent requests will be batched together. The background task grabs one item with an await, and then drains the queue with get_nowait() to form the batch. Because other coroutines that call _queue_request will run while the background loop awaits the first get(), those concurrent requests will land in the same batch.

A rough sketch looks like this:

import asyncio
from typing import Any


class AutoBatchedSketch:

    def __init__(self):
        self._queue = None
        self._task = None

    def _start(self):
        if not self._task or self._task.done():
            self._queue = asyncio.Queue()
            self._task = asyncio.create_task(self._background_loop())

    def _queue_request(self, request):
        if not self._task or self._task.done():
            self._start()

        future = asyncio.get_running_loop().create_future()
        self._queue.put_nowait((request, future))
        return future

    async def next_token_logprobs(self, token_ids) -> Any:
        """Public API. Enqueue a request and await its result."""
        return await self._queue_request(token_ids)

    async def _background_loop(self):
        while True:
            requests = []
            try:
                # Block until at least one request arrives.
                requests = [await self._queue.get()]

                # Drain anything else already queued into the same batch.
                try:
                    while True:
                        requests.append(self._queue.get_nowait())
                except asyncio.QueueEmpty:
                    pass

                # _batch_call is the model-specific batched evaluation.
                inputs, futures = zip(*requests)
                results = self._batch_call(inputs)
                for future, result in zip(futures, results):
                    future.set_result(result)

            except Exception as e:
                for _, future in requests:
                    if not future.done():
                        future.set_exception(e)
                raise

Not saying that we need to implement this approach now, but it is worth keeping in mind as a more efficient approach that doesn't require specifying a batch size and a timeout. We used it here: https://github.com/genlm/genlm-bytes/blob/e76ca6908b2360690e5ecf098b377395b342978a/genlm/bytes/trie.py#L484.
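
For illustration only (not part of the PR): a small driver, assuming the AutoBatchedSketch class from the sketch above, showing how concurrent callers land in a single batch. EchoBatcher and its _batch_call are stand-ins for the real batched model evaluation.

import asyncio

class EchoBatcher(AutoBatchedSketch):
    def _batch_call(self, inputs):
        # Stand-in for the real batched model call.
        print(f"processing a batch of {len(inputs)} requests")
        return [f"result for {x}" for x in inputs]

async def main():
    batcher = EchoBatcher()
    # All three requests are scheduled concurrently, so the background
    # loop drains them into a single _batch_call.
    results = await asyncio.gather(
        batcher.next_token_logprobs([1, 2]),
        batcher.next_token_logprobs([3]),
        batcher.next_token_logprobs([4, 5, 6]),
    )
    print(results)

asyncio.run(main())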

@shepardxia requested a review from benlebrun November 12, 2025 22:03
Member

@benlebrun left a comment

Looks great! Just left a few comments, then should be good to merge!


else:

class Query:
Member

We should be using a data class here, e.g.,:

@dataclass
class Query:
    prompt: str
    future: asyncio.Future
    past: Optional[mx.array] = None

self.generation_stream = mx.new_stream(mx.default_device())

self.queries = []
self.batch_size = (
Member

add a warning to let the user know that the model is not batchable.
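
As a rough illustration (the helper name and arguments here are hypothetical, not the PR's actual code), the warning could look something like this:

import warnings

def resolve_batch_size(requested: int, batchable: bool) -> int:
    """Pick the effective batch size, warning when batching is unsupported."""
    if not batchable and requested > 1:
        warnings.warn(
            "Model does not support batched evaluation; "
            "falling back to batch_size=1.",
            UserWarning,
        )
        return 1
    return requested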

@staticmethod
def _to_torch(logprobs):
    """Convert MLX arrays into PyTorch tensors."""
    if logprobs.dtype in [mx.bfloat16]:
Member

Can use is here
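
For context, the line in question tests membership in a single-element list; a hedged sketch of the suggested direct comparison follows. Equality against the dtype constant is the conservative form; identity should also work if MLX exposes dtypes as singletons, but that is an assumption here.

import mlx.core as mx

x = mx.zeros((2, 3), dtype=mx.bfloat16)

# Current form in the diff: membership in a one-element list.
if x.dtype in [mx.bfloat16]:
    pass

# Suggested form: compare against the dtype directly.
if x.dtype == mx.bfloat16:
    pass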

… to new API, with dependency updated to enforce the lowest compatible version.
@shepardxia merged commit 73197e7 into main Nov 13, 2025
5 checks passed