Make Cache statically configurable at model construction time #32500

@guangy10

Description

Feature request

Be able to construct and load a model with a statically configured cache, like this:

from transformers import AutoModelForCausalLM, GenerationConfig

# Placeholder values for illustration; substitute your own model repo and cache sizing.
hf_model_repo = "meta-llama/Llama-2-7b-hf"
cache_implementation = "static"
batch_size = 1
max_cache_len = 1024

model = AutoModelForCausalLM.from_pretrained(
    hf_model_repo,
    attn_implementation="sdpa",
    generation_config=GenerationConfig(
        use_cache=True,
        cache_implementation=cache_implementation,
        max_length=max_cache_len,
        cache_config={
            "batch_size": batch_size,
            "max_cache_len": max_cache_len,
        },
    ),
)

See additional context in #32253

Motivation

This feature request is to support torch.export(): it ensures the model is exportable in a way that can be further lowered and run in ExecuTorch with good performance out of the box.
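
For reference, a minimal sketch of the export flow this would enable, assuming the construction-time cache configuration above lands. The wrapper module, its forward signature, and the example shapes are illustrative assumptions, not an existing transformers API:

import torch

class ExportableModule(torch.nn.Module):
    # Hypothetical wrapper: with the cache statically configured at model
    # construction time, the cache buffers have fixed shapes, so
    # torch.export can trace the decode step without dynamic allocation.
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, input_ids, cache_position):
        outputs = self.model(
            input_ids=input_ids,
            cache_position=cache_position,
            use_cache=True,
        )
        return outputs.logits

# Export a single decode step with fixed batch size and sequence length.
example_input_ids = torch.zeros((1, 1), dtype=torch.long)
example_cache_position = torch.tensor([0], dtype=torch.long)

exported_program = torch.export.export(
    ExportableModule(model),
    args=(example_input_ids, example_cache_position),
)

The resulting exported_program could then be lowered and run through the ExecuTorch export stack.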

Your contribution

TBD
