Support for Batch Generation

I found Awni's [PR](https://github.com/ml-explore/mlx-lm/pull/443) on batch generation in mlx-lm really cool and experimented with it over the weekend. I was able to implement it in MLX Swift with *almost* the same benchmark numbers on my M5 MacBook Pro with Llama 3.2 3B 4-bit:

| Batch Size | MLX LM (t/s) | MLX Swift (t/s) |
|------------|--------------|-----------------|
| 1 | 61 | 62 |
| 2 | 122 | 118 |
| 32 | 349 | 344 |

There are probably some subtle improvements that I have not been able to find yet, and I think a review would help me track them down.
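
To put the remaining gap in numbers, here is a small sketch that computes the MLX Swift to MLX LM throughput ratio and the per-sequence rate for each batch size. The values are copied from the table above; the `summarize` helper is just illustrative, and I am assuming t/s means aggregate tokens per second across the whole batch.

```python
# Throughput in tokens/s, copied from the benchmark table above.
# Assumption: each figure is the aggregate rate across the whole batch.
results = {
    1: {"mlx_lm": 61, "mlx_swift": 62},
    2: {"mlx_lm": 122, "mlx_swift": 118},
    32: {"mlx_lm": 349, "mlx_swift": 344},
}


def summarize(results):
    """Return (batch_size, swift/lm ratio, swift tokens/s per sequence) rows."""
    rows = []
    for batch_size, r in sorted(results.items()):
        ratio = r["mlx_swift"] / r["mlx_lm"]      # 1.0 means parity with MLX LM
        per_sequence = r["mlx_swift"] / batch_size  # rate seen by one sequence
        rows.append((batch_size, ratio, per_sequence))
    return rows


for batch_size, ratio, per_sequence in summarize(results):
    print(f"batch={batch_size:>2}  swift/lm={ratio:.3f}  per-sequence={per_sequence:.2f} t/s")
```

On these numbers the ratio stays within about 4% of parity at every batch size, so whatever is missing looks like a small constant overhead, not a scaling problem.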

Creating this issue in case somebody else is already working on it. If not, I can clean up my [branch](https://github.com/rudrankriyam/mlx-swift-examples/tree/kv-cache-batching) and send a PR!

Support for Batch Generation #437
