Llama improvements #1

Now that Llama 3.2 has been added in [this PR](https://github.com/tracel-ai/models/pull/55), the following improvements remain:

- [x] Async checks (check the stop-token criterion in a background thread)
- [x] Tensor cache should be fixed size (slice assign, not concat)
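
To illustrate the fixed-size cache item, here is a minimal sketch in plain Rust (no Burn types) of the "slice assign, not concat" idea: the buffer is allocated once at full capacity and new rows are copied into the next free slots. `FixedCache`, its fields, and `append` are hypothetical names for this example only, not the actual cache in the models repo.

```rust
/// Hypothetical fixed-capacity cache for one sequence,
/// stored row-major as [max_seq_len, dim].
struct FixedCache {
    data: Vec<f32>,
    dim: usize,
    max_seq_len: usize,
    len: usize, // number of rows currently valid
}

impl FixedCache {
    fn new(max_seq_len: usize, dim: usize) -> Self {
        assert!(dim > 0, "dim must be non-zero");
        // Allocate the whole buffer up front; it never grows afterwards.
        Self { data: vec![0.0; max_seq_len * dim], dim, max_seq_len, len: 0 }
    }

    /// Copies `rows` (a whole number of `dim`-sized rows) into the next
    /// free slots and returns the valid prefix of the buffer.
    fn append(&mut self, rows: &[f32]) -> Result<&[f32], String> {
        if rows.len() % self.dim != 0 {
            return Err("rows must contain a whole number of rows".to_string());
        }
        let n = rows.len() / self.dim;
        if self.len + n > self.max_seq_len {
            return Err("cache is full".to_string());
        }
        let start = self.len * self.dim;
        // The "slice assign": write in place instead of concatenating.
        self.data[start..start + rows.len()].copy_from_slice(rows);
        self.len += n;
        Ok(&self.data[..self.len * self.dim])
    }
}
```

With a concat-based cache every step allocates a new, larger tensor; here the allocation happens once and each step only writes the new rows.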

Some ops needed for Top-P sampling are still missing, but the first release can ship with greedy sampling (argmax).
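
For reference, greedy sampling just picks the index of the largest logit. A minimal sketch over a plain slice (the function name `argmax` is chosen for this example; it is not the Burn tensor op):

```rust
/// Returns the index of the largest logit, or `None` for an empty slice.
/// On ties, `max_by` returns the last maximal element.
fn argmax(logits: &[f32]) -> Option<usize> {
    logits
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(index, _)| index)
}
```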

For Top-P sampling, we need:
- [ ] sorting (`tensor.sort_descending_with_indices` runs the default impl on CPU)
- [ ] cumsum ([Burn PR](https://github.com/tracel-ai/burn/pull/2664) missing cubecl kernel)
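
To show why both ops are needed, here is a rough sketch of Top-P (nucleus) sampling on the CPU in plain Rust: sort the probabilities in descending order while keeping their indices, accumulate them, keep the smallest prefix whose mass reaches `top_p`, then sample inside that prefix. `sample_top_p` is a hypothetical helper, and the uniform random number `u` in `[0, 1)` is passed in by the caller so the example needs no RNG crate. This is one common variant of the cutoff rule, not necessarily the one the Llama implementation will use.

```rust
/// Samples a token index from `probs` (assumed to sum to about 1)
/// restricted to the top-p nucleus. `u` is a uniform sample in [0, 1).
fn sample_top_p(probs: &[f32], top_p: f32, u: f32) -> Option<usize> {
    if probs.is_empty() {
        return None;
    }

    // 1. Sort descending, keeping the original token indices.
    let mut sorted: Vec<(usize, f32)> = probs.iter().copied().enumerate().collect();
    sorted.sort_by(|a, b| b.1.total_cmp(&a.1));

    // 2. Cumulative sum: keep the smallest prefix whose mass reaches top_p.
    let mut cumsum: f32 = 0.0;
    let mut cutoff = sorted.len();
    for (rank, &(_, p)) in sorted.iter().enumerate() {
        cumsum += p;
        if cumsum >= top_p {
            cutoff = rank + 1;
            break;
        }
    }
    let nucleus = &sorted[..cutoff];
    let mass: f32 = nucleus.iter().map(|&(_, p)| p).sum();

    // 3. Inverse-CDF sampling inside the nucleus; scaling `u` by `mass`
    //    is equivalent to renormalizing the kept probabilities.
    let target = u * mass;
    let mut acc: f32 = 0.0;
    for &(index, p) in nucleus {
        acc += p;
        if target < acc {
            return Some(index);
        }
    }
    // Fallback for rounding at the upper edge.
    nucleus.last().map(|&(index, _)| index)
}
```

For example, with `probs = [0.1, 0.6, 0.3]` and `top_p = 0.8`, the nucleus contains tokens 1 and 2 only, so token 0 is never returned regardless of `u`.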
