This issue relates to the LLM Pipeline and related backend/driver changes.
Current Implementation Information*:
- Pipeline is able to initialize the model and interpreter, allocate tensors, and set up KV caches.
- Pipeline accepts input in the format `std::vector<int>, int` (list of tokens, end_token_id).
- Pipeline is able to run inference and provide proper output.
- Output is in the format `std::vector<int>` (list of tokens).
- Pipeline can be configured to use a different delegate, a set number of output tokens, and a certain number of CPU threads (currently, these values are set in code only).
- Pipeline is able to call a `first_token_callback` provided by the driver, to report TTFT to LoadGen (see the interface sketch below).
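
For reference, here is a minimal sketch of what the pipeline surface described above could look like. All names in it (`PipelineConfig`, `LlmPipeline`, `Generate`) are hypothetical placeholders for illustration only; the actual headers are not shown in this issue.

```cpp
#include <functional>
#include <vector>

// Hypothetical configuration struct; per the notes above, these values
// are currently hard-coded in the pipeline rather than exposed like this.
struct PipelineConfig {
  int num_cpu_threads = 4;       // threads for the TFLite interpreter
  int max_output_tokens = 128;   // cap on generated tokens
  bool use_cpu_delegate = true;  // delegate selection (CPU by default)
};

// Hypothetical wrapper matching the behavior described above.
class LlmPipeline {
 public:
  explicit LlmPipeline(const PipelineConfig& config);

  // Input: prompt tokens plus the end-of-sequence token id.
  // Output: the generated tokens.
  // first_token_callback is invoked when the first token is produced,
  // so the driver can report TTFT to LoadGen.
  std::vector<int> Generate(const std::vector<int>& prompt_tokens,
                            int end_token_id,
                            std::function<void()> first_token_callback);
};
```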
Todo:
- Combine the first-token inference function into the normal inference function.**
- Code formatting and linting.
- Change the `issue_query()` signature to comply with the changes made to the backend interface (a hedged sketch follows this list).
- Resolve any remaining code quality and CI issues.
- Potentially provide logits to a cross-backend decoder instead of building a decoder inside the pipeline (discussion needed).
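
To anchor discussion on the last two items, a rough sketch of one possible shape. The backend-interface changes are not described in this issue, so the `issue_query()` signature below is purely illustrative, and `ExternalDecoder` is a hypothetical hook for the cross-backend decoder idea.

```cpp
#include <functional>
#include <vector>

// Hypothetical: per-step logits handed to an external, cross-backend
// decoder; the decoder picks and returns the next token id.
using ExternalDecoder = std::function<int(const std::vector<float>& logits)>;

// Purely illustrative shape for issue_query(); the real signature must
// come from the updated backend interface, which this issue does not show.
void issue_query(const std::vector<int>& prompt_tokens, int end_token_id,
                 ExternalDecoder decoder);
```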
* This relates to the default implementation using TFLite (LiteRT) on a CPU delegate.
** This only affects how the code looks; the inference is still functional.
Any other discussions or requirements relating to the pipeline should go here.