If you find this useful, please ⭐ the repo!
Convert and run Apple Foundation Model (AFMTextV7) on Apple Silicon using MLX, bypassing the 4096 token context limit of Apple's Foundation Models API.
For exploration and academic curiosity.
Visit my other repo https://github.com/scouzi1966/MLXLMProbe for observing and probing MLX models.
⚠️ IMPORTANT LEGAL DISCLAIMER: The source model (AFMTextV7) requires Apple Developer credentials and is subject to Apple's licensing terms. Do not distribute the model weights or tokenizer files.
Review Apple's official terms before use:
- Extended Context: Run AFM7 with up to 32K tokens (vs the 4,096-token limit in Apple's API)
- Run on GPU: inference via MLX on the GPU, rather than the ANE used by Apple's on-device model
- Apple Silicon Optimized: Native MLX acceleration on M1/M2/M3/M4 chips
- Quantization Support: 4-bit and 8-bit quantization for faster inference
- Interactive Mode: Multi-turn conversation support
- Custom System Prompts: Configurable assistant behavior
AFMTextV7 is Apple's on-device foundation model with:
- 56 layers (35 regular + 21 KV-reuse)
- 2048 hidden dimension
- 16 query heads, 2 KV heads (Grouped Query Attention)
- 153,600 vocabulary size
- ~6B parameters
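As a quick summary, the specs above can be collected into a config dict (the key names here are illustrative, not mlx-lm's actual config schema):

```python
# Illustrative summary of the architecture listed above; the key names
# are invented for this sketch and are not mlx-lm's real config fields.
AFM7_SPEC = {
    "num_layers": 56,       # 35 regular + 21 KV-reuse
    "hidden_size": 2048,
    "num_query_heads": 16,
    "num_kv_heads": 2,      # grouped-query attention
    "vocab_size": 153_600,
}

# With GQA, each KV head serves 16 / 2 = 8 query heads.
gqa_ratio = AFM7_SPEC["num_query_heads"] // AFM7_SPEC["num_kv_heads"]
print(gqa_ratio)  # 8
```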
- macOS 26 (Tahoe)
- Apple Silicon (M1/M2/M3/M4)
- 16GB+ RAM (32GB recommended for float32)
```
pip install -r requirements.txt
```
Or install manually:
```
pip install mlx mlx-lm torch numpy sentencepiece safetensors
```
You need the PyTorch checkpoint and tokenizer from Apple's tamm library:
```
assets/
├── base-model.pt          # PyTorch checkpoint
├── tokenizer.model        # SentencePiece tokenizer
└── tokenizer-config.json
```
```
# Convert the checkpoint
python convert.py --checkpoint assets/base-model.pt --output mlx_afm7

# Generate text
python generate.py "What is the capital of France?"

# Optionally quantize, then run the quantized model
python quantize.py --model mlx_afm7 --bits 4
python generate.py --model mlx_afm7_q4 "Your prompt"
```
Convert PyTorch checkpoint to MLX format.
```
# Basic conversion
python convert.py

# Custom paths
python convert.py --checkpoint /path/to/model.pt --output ./my_model

# Verbose output
python convert.py --verbose
```
Options:
| Option | Default | Description |
|---|---|---|
| `--checkpoint` | `assets/base-model.pt` | PyTorch checkpoint path |
| `--output, -o` | `mlx_afm7` | Output directory |
| `--assets` | `assets` | Directory with tokenizer files |
| `--verbose, -v` | `False` | Detailed progress |
Generate text using the converted model.
```
# Simple prompt
python generate.py "What is 2+2?"

# With temperature
python generate.py -t 0.7 --max-tokens 200 "Write a poem"

# With system prompt
python generate.py --system "You are a pirate" "Tell me about ships"

# Interactive mode
python generate.py --interactive

# Use quantized model
python generate.py --model mlx_afm7_q4 "Your prompt"
```
Options:
| Option | Default | Description |
|---|---|---|
| `--model, -m` | `mlx_afm7` | Model directory |
| `--max-tokens` | `100` | Max tokens to generate (not context window) |
| `--temperature, -t` | `0.0` | Sampling temperature (0 = greedy) |
| `--system, -s` | `None` | System prompt |
| `--interactive, -i` | `False` | Interactive conversation mode |
| `--verbose, -v` | `False` | Show token-by-token generation |
| `--max-context` | `32768` | Maximum context window (input + output) |
Context Window Notes:
- Max context (input + output): 32,768 tokens (model's trained limit)
- `--max-tokens` limits generated output only
- `--max-context` sets the context window limit (default: 32768)
- The script warns if input + max-tokens exceeds max-context
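The budget check described above can be sketched as follows (the function and variable names are illustrative, not the actual script's API):

```python
def fits_context(prompt_tokens: int, max_tokens: int, max_context: int = 32768) -> bool:
    """Return True if the prompt plus the requested output budget
    fits inside the model's context window."""
    return prompt_tokens + max_tokens <= max_context

# A 30,000-token prompt leaves room for 100 new tokens, but not for 5,000.
print(fits_context(30_000, 100))    # True
print(fits_context(30_000, 5_000))  # False
```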
Quantize the model for reduced memory and faster inference.
```
# 4-bit quantization (recommended)
python quantize.py --model mlx_afm7 --bits 4

# 8-bit quantization
python quantize.py --model mlx_afm7 --bits 8

# Custom output
python quantize.py --model mlx_afm7 --bits 4 --output my_q4_model
```
Options:
| Option | Default | Description |
|---|---|---|
| `--model, -m` | `mlx_afm7` | Input model directory |
| `--output, -o` | `{model}_q{bits}` | Output directory |
| `--bits, -b` | `4` | Quantization bits (4 or 8) |
| `--group-size, -g` | `64` | Quantization group size |
| `--verbose, -v` | `False` | Detailed progress |
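The group size controls how many consecutive weights share one scale/offset pair. A minimal sketch of groupwise affine quantization (illustrative only; MLX's actual scheme differs in detail):

```python
import numpy as np

def quantize_group(w: np.ndarray, bits: int = 4):
    """Affine-quantize one group of weights to `bits` bits,
    returning the integer codes plus one (scale, offset) pair."""
    levels = 2 ** bits - 1
    lo, hi = w.min(), w.max()
    scale = (hi - lo) / levels or 1.0  # avoid divide-by-zero for a flat group
    q = np.round((w - lo) / scale).astype(np.uint8)
    return q, scale, lo

def dequantize_group(q, scale, lo):
    return q * scale + lo

# One group of 64 weights stores 64 4-bit codes plus a single scale/offset.
rng = np.random.default_rng(0)
w = rng.standard_normal(64).astype(np.float32)
q, scale, lo = quantize_group(w, bits=4)
w_hat = dequantize_group(q, scale, lo)
print(np.abs(w - w_hat).max())  # small reconstruction error, bounded by scale/2
```

Smaller groups give lower reconstruction error at the cost of storing more scale/offset pairs.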
Performance Comparison:
| Format | Size | Speed* |
|---|---|---|
| Float32 | 11.6 GB | ~50 tok/s |
| 8-bit | ~5.8 GB | ~80 tok/s |
| 4-bit | ~1.8 GB | ~170 tok/s |
*Approximate values on M1 Max
The mlx-lm library has bugs that affect AFM7. Our scripts apply patches automatically, but if you use mlx-lm directly, you need to apply these patches:
```python
import mlx_lm.models.afm7 as afm7_module

# Patch 1: Fix fake_8bit_quant (corrupts K,V when scale=1.0)
afm7_module.fake_8bit_quant = lambda x, scale: x

# Patch 2: Fix FusedLinear.to_quantized (missing mode parameter)
if hasattr(afm7_module, "FusedLinear"):
    orig = afm7_module.FusedLinear.to_quantized
    afm7_module.FusedLinear.to_quantized = lambda self, group_size=64, bits=4, mode=None: orig(self, group_size, bits)

# IMPORTANT: Apply patches BEFORE importing load/generate
from mlx_lm import load, generate
```
AFM7 uses a specific prompt format (NOT ChatML):
```
system
{system_prompt}<turn_end> user
{user_message}<turn_end> assistant
```
The scripts handle this formatting automatically.
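For illustration, that formatting can be sketched as a small helper (a hypothetical function, not the scripts' actual API):

```python
def format_afm7_prompt(user_message: str,
                       system_prompt: str = "You are a helpful assistant.") -> str:
    """Build an AFM7-style prompt following the template shown above."""
    return (
        "system\n"
        f"{system_prompt}<turn_end> user\n"
        f"{user_message}<turn_end> assistant\n"
    )

print(format_afm7_prompt("What is 2+2?"))
```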
| Token | ID | Usage |
|---|---|---|
| `<turn_start>` | 150000 | Start of turn |
| `<turn_end>` | 150001 | End of turn (EOS) |
| `<n>` | 4 | Newline character |
| `<pad>` | 0 | Padding |
| `<unk>` | 2 | Unknown token |
```
python generate.py "Explain quantum computing in simple terms"

python generate.py -t 0.8 --max-tokens 500 "Write a short story about a robot learning to paint"

python generate.py --system "You are an expert Python programmer" \
  "Write a function to find prime numbers using the Sieve of Eratosthenes"

python generate.py --interactive --system "You are a helpful coding assistant"

# First quantize
python quantize.py --bits 4

# Then use
python generate.py --model mlx_afm7_q4 --max-tokens 200 "Your prompt"
```
- Conversion Guide - Detailed guide on the conversion process
- Prompt Results - Example outputs from various prompts
Make sure you have the PyTorch checkpoint from tamm:
```
assets/base-model.pt
```
Make sure you have the tokenizer files:
```
assets/tokenizer.model
```
The mlx-lm patches may not have been applied. Make sure you're using our scripts or apply patches manually.
Try using a quantized model:
```
python quantize.py --bits 4
python generate.py --model mlx_afm7_q4 "Your prompt"
```
This toolkit (scripts and documentation) is provided for educational purposes.
The AFM7 model weights and tokenizer are proprietary to Apple. Review Apple's terms at the URLs listed in the disclaimer above before use. Do not distribute model files.
- Apple's tamm library for the original PyTorch implementation
- MLX team for the efficient Apple Silicon framework
- mlx-lm team for the afm7 model implementation