Conversation

@metascroy
Contributor

CoreML Multifunction Model Experiment

This PR adds tooling to create and benchmark CoreML multifunction models that combine prefill and decode functions into a single model package.

Overview

CoreML multifunction models allow multiple functions (e.g., prefill and decode) to share weights within a single model package. This experiment evaluates:

  • Memory usage of multifunction models vs. individual models
  • Performance characteristics when switching between prefill and decode
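
As background, a multifunction .mlpackage exposes each function by name at load time, and all functions read from the same weight blobs. Below is a minimal sketch of loading two functions with coremltools (this assumes coremltools >= 8.0; the package, function, and input names are illustrative, not the ones produced by this tooling):

import coremltools as ct
import numpy as np

# Load each function from the same package; the weights are shared.
prefill = ct.models.MLModel("combined.mlpackage", function_name="prefill")
decode = ct.models.MLModel("combined.mlpackage", function_name="decode")

# Hypothetical token input for a static export (seqlen 32 for prefill, 1 for decode).
prefill_out = prefill.predict({"tokens": np.zeros((1, 32), dtype=np.int32)})
decode_out = decode.predict({"tokens": np.zeros((1, 1), dtype=np.int32)})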

Step 1: Export Static Models

First, export two PTE files with different sequence lengths using export_static_llm_coreml.py:

# Export prefill model (seqlen=32)
python export_static_llm_coreml.py \
    --checkpoint <path_to_checkpoint> \
    --params <path_to_params.json> \
    --seq_length 32 \
    --output model_32.pte

# Export decode model (seqlen=1)
python export_static_llm_coreml.py \
    --checkpoint <path_to_checkpoint> \
    --params <path_to_params.json> \
    --seq_length 1 \
    --output model_1.pte

Step 2: Create Multifunction Models

Use create_multifunctions.py to combine the prefill and decode models:

python create_multifunctions.py \
    --prefill_model $HOME/Desktop/model_32.pte \
    --decode_model $HOME/Desktop/model_1.pte \
    --output_dir $HOME/Desktop/mods

This will:

  1. Extract CoreML models from both PTE files
  2. Create multifunction packages combining prefill/decode for each model piece
  3. Output mod1.mlpackage, mod2.mlpackage, mod3.mlpackage
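
The packaging step corresponds roughly to coremltools' multifunction utilities. Here is a minimal sketch of combining one prefill piece and one decode piece (assuming coremltools >= 8.0; the file and function names are illustrative, and create_multifunctions.py may differ in detail):

import coremltools as ct

# Merge two single-function packages into one multifunction package.
# Identical weights are deduplicated inside the resulting .mlpackage.
desc = ct.utils.MultiFunctionDescriptor()
desc.add_function("prefill_piece.mlpackage", src_function_name="main", target_function_name="prefill")
desc.add_function("decode_piece.mlpackage", src_function_name="main", target_function_name="decode")
desc.default_function_name = "prefill"
ct.utils.save_multifunction(desc, "mod1.mlpackage")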

Optional: Pre-compile Models

Add the --compile flag to pre-compile the models to .mlmodelc format:

python create_multifunctions.py \
    --prefill_model $HOME/Desktop/model_32.pte \
    --decode_model $HOME/Desktop/model_1.pte \
    --output_dir $HOME/Desktop/mods \
    --compile

This outputs mod1.mlmodelc, mod2.mlmodelc, mod3.mlmodelc instead. Pre-compiled models skip the compilation step at runtime.
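
Pre-compilation can also be reproduced directly with coremltools, since loading an .mlpackage compiles it to a temporary .mlmodelc. A minimal sketch (assuming coremltools >= 7.0; the --compile flag may take a different path internally):

import shutil
import coremltools as ct

# Loading compiles the package; copy the compiled bundle out so the app
# can load the .mlmodelc directly and skip on-device compilation.
m = ct.models.MLModel("mod1.mlpackage")
shutil.copytree(m.get_compiled_model_path(), "mod1.mlmodelc")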

Step 3: Benchmark with CoreML Test

  1. Open the Benchmark app Xcode project at extension/benchmark/apple/Benchmark
  2. Drag and drop the .mlpackage or .mlmodelc files into the Resources folder
  3. Run the benchmark: Product → Test

Configuring the Benchmark

Edit CoreMLTests.mm to configure the benchmark behavior:

// Enable/disable decode function benchmarking
const BOOL kEnableDecode = YES;

// Enable/disable individual model pieces
const BOOL kEnableMod1 = YES;  // Embedding piece
const BOOL kEnableMod2 = YES;  // Transformer piece
const BOOL kEnableMod3 = YES;  // Output piece

Benchmark Output

The benchmark runs:

  • Prefill 1: 30 iterations × enabled models
  • Decode: 50 iterations × enabled models (if kEnableDecode = YES)
  • Prefill 2: 30 iterations × enabled models

Output example:

=== Benchmark Results ===
Prefill 1: 30 iterations x 3 models, total time: 1234.56 ms (41.15 ms/iter)
Decode: 50 iterations x 3 models, total time: 567.89 ms (11.36 ms/iter)
Prefill 2: 30 iterations x 3 models, total time: 1230.12 ms (41.00 ms/iter)
Total time (prefill 1 + decode + prefill 2): 3032.57 ms
=========================

Observations

Memory Usage

Multifunction models do not appear to use significantly more memory than individual models. The weights are shared between the prefill and decode functions, so memory overhead is minimal.

Model Piece Memory

The embedding piece (mod1) uses significantly more memory than the other pieces. This can be observed by toggling kEnableMod1 = NO and comparing memory usage:

// To isolate memory usage of mod2 and mod3:
const BOOL kEnableMod1 = NO;   // Disable embedding piece
const BOOL kEnableMod2 = YES;
const BOOL kEnableMod3 = YES;

This suggests the embedding table is a major contributor to overall memory footprint.
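
A back-of-envelope estimate supports this. With hypothetical Llama-style dimensions (the vocabulary size, hidden size, and dtype below are assumptions, not values from this PR), the embedding table alone is on the order of hundreds of MiB:

# Rough embedding-table size estimate; all numbers are illustrative assumptions.
vocab_size = 128_256      # e.g., a Llama-3-style tokenizer
hidden_dim = 2048         # e.g., a ~1B-parameter model
bytes_per_param = 2       # fp16 weights

embedding_bytes = vocab_size * hidden_dim * bytes_per_param
print(f"Embedding table: {embedding_bytes / 2**20:.0f} MiB")  # ~501 MiB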

@pytorch-bot

pytorch-bot bot commented Jan 9, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/16514

Note: Links to docs will display an error until the docs builds have been completed.

❌ 8 New Failures, 1 Unrelated Failure

As of commit fcc943d with merge base 913436a:

NEW FAILURES - The following jobs have failed:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla bot added the CLA Signed label (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on Jan 9, 2026
@github-actions

github-actions bot commented Jan 9, 2026

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@JacobSzwejbka
Contributor

And there are no cache coordination problems because the static cache is I/O?
