This issue relates to the LLM Pipeline and related backend/driver changes.
Current Implementation Information*:
- Pipeline is able to initialize the model and interpreter, allocate tensors, and set up KV caches.
- Pipeline accepts input in the format `std::vector<int>, int` (list of tokens, end_token_id).
- Pipeline is able to run inference and provide proper output.
- Output is in the format `std::vector<int>` (list of tokens).
- Pipeline can be configured to use a different delegate, a set number of output tokens, and a certain number of CPU threads (currently, these values are set in code only).
- Pipeline is able to call a `first_token_callback` provided by the driver, to report TTFT to LoadGen (see the interface sketch below).
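
For reference, here is a minimal sketch of what the pipeline surface described above could look like. All names in it (`PipelineConfig`, `LlmPipeline`, `Generate`) are hypothetical placeholders for illustration only; the actual headers are not shown in this issue.

```cpp
#include <functional>
#include <vector>

// Hypothetical configuration struct; per the notes above, these values
// are currently hard-coded in the pipeline rather than exposed like this.
struct PipelineConfig {
  int num_cpu_threads = 4;       // threads for the TFLite interpreter
  int max_output_tokens = 128;   // cap on generated tokens
  bool use_cpu_delegate = true;  // delegate selection (CPU by default)
};

// Hypothetical wrapper matching the behavior described above.
class LlmPipeline {
 public:
  explicit LlmPipeline(const PipelineConfig& config);

  // Input: prompt tokens plus the end-of-sequence token id.
  // Output: the generated tokens.
  // first_token_callback is invoked when the first token is produced,
  // so the driver can report TTFT to LoadGen.
  std::vector<int> Generate(const std::vector<int>& prompt_tokens,
                            int end_token_id,
                            std::function<void()> first_token_callback);
};
```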
Todo:
- Combine the first-token inference function into the normal inference function.**
- Code formatting and linting.
- Change the `issue_query()` signature to comply with the changes made to the backend interface (a hedged sketch follows this list).
- Resolve any remaining code quality and CI issues.
- Potentially provide logits to a cross-backend decoder instead of building a decoder inside the pipeline (discussion needed).
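
To anchor discussion on the last two items, a rough sketch of one possible shape. The backend-interface changes are not described in this issue, so the `issue_query()` signature below is purely illustrative, and `ExternalDecoder` is a hypothetical hook for the cross-backend decoder idea.

```cpp
#include <functional>
#include <vector>

// Hypothetical: per-step logits handed to an external, cross-backend
// decoder; the decoder picks and returns the next token id.
using ExternalDecoder = std::function<int(const std::vector<float>& logits)>;

// Purely illustrative shape for issue_query(); the real signature must
// come from the updated backend interface, which this issue does not show.
void issue_query(const std::vector<int>& prompt_tokens, int end_token_id,
                 ExternalDecoder decoder);
```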
* This relates to the default implementation using TFLite (LiteRT) on a CPU delegate.
** This only affects how the code looks; the inference is still functional.
Any other discussions or requirements relating to the pipeline should go here.