[not for landing] coreml llama #7616
Closed
This is a version of llama that delegates to ANE and can handle a fixed sequence length (e.g., for prefill) with input_pos >= 0. To export the model, run model_export_script.sh. In the script, modify the variables to values appropriate for the model you are trying to export. If STATIC_SEQ_LENGTH=1, the model can be used for decoding; if STATIC_SEQ_LENGTH > 1, it can be used for static prefill.
Note that although the enable_dynamic_shape arg is set, dynamic_shapes is overridden to None. enable_dynamic_shape is set because that is how llama_transformer.py handles seq_length > 1.
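As a rough illustration (not the actual export script), a static-shape export followed by a Core ML conversion could look like the sketch below. The StandInBlock module, the feature dimension, and the example STATIC_SEQ_LENGTH are placeholders, and converting a torch.export program assumes coremltools >= 8:

```python
# Minimal sketch, assuming coremltools >= 8 (which accepts torch.export
# programs). StandInBlock is a stand-in for the llama model; the real
# export is driven by model_export_script.sh.
import torch
import coremltools as ct

STATIC_SEQ_LENGTH = 64  # 1 => decode; > 1 => static prefill

class StandInBlock(torch.nn.Module):
    """Placeholder module; the PR exports the actual llama model."""
    def __init__(self, dim: int = 32):
        super().__init__()
        self.proj = torch.nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (1, STATIC_SEQ_LENGTH, dim); the length is baked in.
        return self.proj(x)

example_inputs = (torch.randn(1, STATIC_SEQ_LENGTH, 32),)
# Even with enable_dynamic_shape set upstream, dynamic_shapes=None here
# makes every shape static, matching the override described above.
exported = torch.export.export(StandInBlock(), example_inputs, dynamic_shapes=None)
mlmodel = ct.convert(exported)
```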
To avoid issues with CoreML delegation, we do the following:
For stories110M, I get:
(I get a segfault in the SDPA op when STATIC_SEQ_LENGTH >= 512; this requires investigation.)
Note: you will encounter an error during CoreML conversion when running model_export_script.sh. You must first make the change below so that coremltools can handle negative infinity in sympy numbers.
Update the function _map_sympy_number_to_int in coremltools/converters/mil/frontend/torch/exir_utils.py as follows:
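(The exact patch was omitted here. The following is a hedged sketch of the kind of change described, assuming the function converts sympy numbers to Python ints and currently fails on infinities; the int64 sentinel bounds are illustrative, not necessarily the author's actual values.)

```python
# Sketch only: clamp sympy's infinities to finite int bounds so the
# converter does not choke on unbounded symbolic ranges. The real patch
# in the PR may differ.
import sympy

INT64_MAX = 2**63 - 1
INT64_MIN = -(2**63)

def _map_sympy_number_to_int(value) -> int:
    if value == sympy.oo:    # positive infinity -> largest int64
        return INT64_MAX
    if value == -sympy.oo:   # negative infinity -> smallest int64
        return INT64_MIN
    return int(value)
```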