@metascroy (Contributor) commented on Jan 13, 2025

This is a version of llama that delegates to ANE and handles a fixed sequence length (e.g., for prefill) with input_pos >= 0. To export the model, run model_export_script.sh, after modifying these variables in the script:

export MODEL_IN=$HOME/models/stories110M/stories110M.pt
export TOKENIZER=$HOME/models/stories110M/tokenizer.bin
export PARAMS=$HOME/models/stories110M/params.json
export MODEL_OUT_DIR=$HOME/models/stories110M
export STATIC_SEQ_LENGTH=500  

Set these to appropriate values for the model you are exporting. If STATIC_SEQ_LENGTH=1, the model can be used for decoding; if STATIC_SEQ_LENGTH > 1, it can be used for static prefill.

Note that although the arg enable_dynamic_shape is set, dynamic_shapes is overridden to None during export. We set enable_dynamic_shape because that is how llama_transformer.py handles seq_length > 1.
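
For reference, here is a minimal sketch of what this looks like at the export call site (the names model, tokens, and input_pos are illustrative placeholders, not the exact code in this PR):

import torch

# enable_dynamic_shape stays set in the model args so that
# llama_transformer.py takes its seq_length > 1 path, but the export
# itself uses fully static shapes.
exported_program = torch.export.export(
    model,                  # hypothetical eager llama module
    (tokens, input_pos),    # tokens: (1, STATIC_SEQ_LENGTH), input_pos: (1,)
    dynamic_shapes=None,    # overridden to None: no symbolic dims reach CoreML
)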

To avoid issues with CoreML delegation, we do the following (a sketch of this skip logic follows the list):

  • Run SDPA/KV-cache ops outside of CoreML by skipping them in the partitioner. The ops required to update the KV-cache are not supported on ANE.
  • We also skip any node that is a symbolic int or has a symbolic-int arg, since symbolic ints are not supported by CoreML.
  • Finally, we skip the embedding op because CoreML converts the token to uint16 when running on ANE, which will not work with llama3-type models.
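
A minimal sketch of the skip logic described above (the op names and the symbolic-int check are assumptions based on this description, not the authoritative code from this PR):

import torch
from executorch.backends.apple.coreml.partition import CoreMLPartitioner

def _is_symbolic_int(node: torch.fx.Node) -> bool:
    # A node whose value is a SymInt cannot be lowered by CoreML.
    return isinstance(node.meta.get("val"), torch.SymInt)

def should_skip(node: torch.fx.Node) -> bool:
    # Skip nodes that produce a symbolic int or take one as an arg.
    return _is_symbolic_int(node) or any(
        isinstance(arg, torch.fx.Node) and _is_symbolic_int(arg)
        for arg in node.args
    )

# Keep SDPA, KV-cache updates, and the embedding op out of the CoreML
# partition; this op list is illustrative.
partitioner = CoreMLPartitioner(
    skip_ops_for_coreml_delegation=[
        "aten.scaled_dot_product_attention.default",
        "aten.index_put_.default",   # KV-cache update, not supported on ANE
        "aten.embedding.default",    # ANE casts token ids to uint16
    ],
)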

For stories110M, I get:

  • 163 tokens/sec decode on iPhone 15 Pro
  • 56 ms for 500-token prefill

(I get a segfault in the SDPA op when STATIC_SEQ_LENGTH >= 512; this requires investigation.)

Note: you will encounter an error during CoreML conversion when running model_export_script.sh. You must first make the change below so that coremltools can handle negative infinity in sympy numbers.

Update the function _map_sympy_number_to_int in coremltools/converters/mil/frontend/torch/exir_utils.py as follows:

def _map_sympy_number_to_int(sympy_number: sympy.core.numbers.Number) -> int:
    # Clamp sympy values (including +/- sympy.oo) to the signed 32-bit range.
    MAX_DIM = 2**31 - 1
    MIN_DIM = -2**31
    if sympy_number == sympy.oo or sympy_number > MAX_DIM:
        return MAX_DIM
    elif sympy_number == -sympy.oo or sympy_number < MIN_DIM:
        return MIN_DIM
    else:
        return int(sympy_number)
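
As a quick sanity check of the patched behavior (assuming the function above is in scope):

import sympy

assert _map_sympy_number_to_int(sympy.oo) == 2**31 - 1      # +inf clamps to INT32_MAX
assert _map_sympy_number_to_int(-sympy.oo) == -(2**31)      # -inf clamps to INT32_MIN
assert _map_sympy_number_to_int(sympy.Integer(500)) == 500  # in-range values pass through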

pytorch-bot bot commented on Jan 13, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/7616

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures

As of commit d0e196b with merge base 3f9324c:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot added the CLA Signed label on Jan 13, 2025
github-actions bot commented
This PR needs a release notes: label

If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

metascroy changed the title from "coreml llama" to "[not for landing] coreml llama" on Jan 13, 2025
metascroy closed this on May 29, 2025