LLM Backend Implementation #1057

@farook-edev

Description

This issue relates to the LLM Pipeline and related backend/driver changes.

Current implementation status*:

  • Pipeline is able to initialize the model and interpreter, allocate tensors, and set up KV caches.
  • Pipeline accepts input in the format std::vector<int>, int (list of tokens, end_token_id).
  • Pipeline is able to run inference and produce correct output.
  • Output is in the format std::vector<int> (list of tokens).
  • Pipeline can be configured with a different delegate, a maximum number of output tokens, and a set number of CPU threads (currently these values are set in code only).
  • Pipeline is able to call a first_token_callback provided by the driver, to report TTFT (time to first token) to LoadGen. A rough interface sketch follows this list.
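
For orientation, here is a minimal sketch of what an interface with the behaviour described above could look like. All names here (PipelineConfig, LlmPipeline, Run, etc.) are illustrative assumptions, not the actual pipeline code:

```cpp
// Hypothetical sketch only -- names and signatures are illustrative.
// It mirrors the behaviour described above: token-ID input and output,
// an end_token_id, in-code configuration, and a TTFT callback.
#include <functional>
#include <string>
#include <vector>

struct PipelineConfig {
  // Currently these values are set in code only (see note above).
  int num_cpu_threads = 4;      // threads handed to the interpreter
  int max_output_tokens = 128;  // cap on generated tokens
  // A delegate selection would also live here (CPU by default).
};

class LlmPipeline {
 public:
  explicit LlmPipeline(const PipelineConfig& config);

  // Loads the model, builds the interpreter, allocates tensors and KV caches.
  bool Initialize(const std::string& model_path);

  // Runs inference on a list of token IDs until end_token_id or the token
  // cap is reached. first_token_callback is invoked once, when the first
  // output token is produced, so the driver can report TTFT to LoadGen.
  std::vector<int> Run(const std::vector<int>& input_tokens,
                       int end_token_id,
                       const std::function<void()>& first_token_callback);
};
```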

Todo:

  • Merge the first-token inference function into the normal inference function.**
  • Code formatting and linting.
  • Change the issue_query() signature to comply with the changes made to the backend interface.
  • Resolve any remaining code-quality and CI issues.
  • Potentially provide logits to a cross-backend decoder instead of building a decoder inside the pipeline (discussion needed); one possible shape is sketched after the footnotes below.

* This relates to the default implementation using TFLite (LiteRT) with the CPU delegate.
** This only affects how the code is organized; inference is still functional.
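
On the last todo item, one possible (purely illustrative) shape for the handoff is a per-step decoder callback. The StepDecoder name, its signature, and the GreedyDecode example below are assumptions offered for discussion, not an existing design:

```cpp
// Illustrative sketch only: one way to hand logits to a cross-backend
// decoder instead of decoding inside the pipeline. All names here are
// hypothetical and subject to the discussion mentioned above.
#include <functional>
#include <vector>

// Called once per decode step with the raw logits for the next token.
// The shared decoder returns the chosen token ID; the pipeline feeds it
// back into the model and stops when it equals end_token_id.
using StepDecoder = std::function<int(const std::vector<float>& logits)>;

// Example decoder: plain greedy argmax, kept outside any single backend.
int GreedyDecode(const std::vector<float>& logits) {
  int best = 0;
  for (int i = 1; i < static_cast<int>(logits.size()); ++i) {
    if (logits[i] > logits[best]) best = i;
  }
  return best;
}
```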

Any other discussions or requirements relating to the pipeline should go here.
