AFM7 MLX Toolkit

If you find this useful, please ⭐ the repo!

Convert and run Apple Foundation Model (AFMTextV7) on Apple Silicon using MLX, bypassing the 4,096-token context limit of Apple's Foundation Models API.

For exploration and academic curiosity.

See also my other repo, https://github.com/scouzi1966/MLXLMProbe, for inspecting and observing MLX models.

⚠️ IMPORTANT LEGAL DISCLAIMER

The source model (AFMTextV7) requires Apple Developer credentials and is subject to Apple's licensing terms. Do not distribute the model weights or tokenizer files.

Review Apple's official terms before use.

Features

  • Extended Context: Run AFM7 with up to 32K tokens (vs the 4,096-token limit in Apple's API)
  • GPU Execution: MLX runs on the GPU, whereas Apple's on-device model runs on the ANE
  • Apple Silicon Optimized: Native MLX acceleration on M1/M2/M3/M4 chips
  • Quantization Support: 4-bit and 8-bit quantization for faster inference
  • Interactive Mode: Multi-turn conversation support
  • Custom System Prompts: Configurable assistant behavior

Model Architecture

AFMTextV7 is Apple's on-device foundation model with:

  • 56 layers (35 regular + 21 KV-reuse)
  • 2048 hidden dimension
  • 16 query heads, 2 KV heads (Grouped Query Attention)
  • 153,600 vocabulary size
  • ~6B parameters
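The grouped-query attention numbers above can be sanity-checked with a few lines of arithmetic. This is a sketch only: it assumes head_dim = hidden_dim / n_query_heads, which is a common convention but is not read from the actual checkpoint config.

```python
# Sketch: derive GQA shapes from the architecture numbers listed above.
# head_dim = hidden_dim / n_query_heads is an assumption, not a checkpoint fact.
hidden_dim = 2048
n_query_heads = 16
n_kv_heads = 2

head_dim = hidden_dim // n_query_heads             # 128
queries_per_kv_head = n_query_heads // n_kv_heads  # 8 query heads share each KV head
kv_proj_dim = n_kv_heads * head_dim                # 256: K/V projections are 8x smaller than Q

print(head_dim, queries_per_kv_head, kv_proj_dim)  # 128 8 256
```

The 8:1 query-to-KV ratio is what makes the KV cache small enough to hold 32K tokens of context in memory on consumer Apple Silicon.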

Requirements

System Requirements

  • macOS 26 (Tahoe)
  • Apple Silicon (M1/M2/M3/M4)
  • 16GB+ RAM (32GB recommended for float32)

Python Dependencies

pip install -r requirements.txt

Or install manually:

pip install mlx mlx-lm torch numpy sentencepiece safetensors

Model Files

You need the PyTorch checkpoint and tokenizer from Apple's tamm library:

assets/
├── base-model.pt      # PyTorch checkpoint
├── tokenizer.model    # SentencePiece tokenizer
└── tokenizer-config.json

Quick Start

1. Convert the Model

python convert.py --checkpoint assets/base-model.pt --output mlx_afm7

2. Run Inference

python generate.py "What is the capital of France?"

3. (Optional) Quantize for Speed

python quantize.py --model mlx_afm7 --bits 4
python generate.py --model mlx_afm7_q4 "Your prompt"

Scripts

convert.py - Model Conversion

Convert PyTorch checkpoint to MLX format.

# Basic conversion
python convert.py

# Custom paths
python convert.py --checkpoint /path/to/model.pt --output ./my_model

# Verbose output
python convert.py --verbose

Options:

Option         Default               Description
--checkpoint   assets/base-model.pt  PyTorch checkpoint path
--output, -o   mlx_afm7              Output directory
--assets       assets                Directory with tokenizer files
--verbose, -v  False                 Detailed progress

generate.py - Text Generation

Generate text using the converted model.

# Simple prompt
python generate.py "What is 2+2?"

# With temperature
python generate.py -t 0.7 --max-tokens 200 "Write a poem"

# With system prompt
python generate.py --system "You are a pirate" "Tell me about ships"

# Interactive mode
python generate.py --interactive

# Use quantized model
python generate.py --model mlx_afm7_q4 "Your prompt"

Options:

Option             Default   Description
--model, -m        mlx_afm7  Model directory
--max-tokens       100       Max tokens to generate (not context window)
--temperature, -t  0.0       Sampling temperature (0 = greedy)
--system, -s       None      System prompt
--interactive, -i  False     Interactive conversation mode
--verbose, -v      False     Show token-by-token generation
--max-context      32768     Maximum context window (input + output)

Context Window Notes:

  • Max context (input + output): 32,768 tokens (model's trained limit)
  • --max-tokens limits generated output only
  • --max-context sets the context window limit (default: 32768)
  • Script warns if input + max-tokens exceeds max-context
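The warning described above amounts to a simple token-budget check. A minimal sketch of the idea (`check_context_budget` is a hypothetical helper for illustration, not a function in the toolkit):

```python
def check_context_budget(n_input_tokens: int, max_tokens: int,
                         max_context: int = 32768) -> bool:
    """Return True if input + requested output fits in the context window."""
    total = n_input_tokens + max_tokens
    if total > max_context:
        print(f"warning: input ({n_input_tokens}) + max-tokens ({max_tokens}) "
              f"= {total} exceeds max-context ({max_context})")
        return False
    return True

check_context_budget(31000, 2000)  # warns: 33000 > 32768
```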

quantize.py - Model Quantization

Quantize the model for reduced memory and faster inference.

# 4-bit quantization (recommended)
python quantize.py --model mlx_afm7 --bits 4

# 8-bit quantization
python quantize.py --model mlx_afm7 --bits 8

# Custom output
python quantize.py --model mlx_afm7 --bits 4 --output my_q4_model

Options:

Option            Default          Description
--model, -m       mlx_afm7         Input model directory
--output, -o      {model}_q{bits}  Output directory
--bits, -b        4                Quantization bits (4 or 8)
--group-size, -g  64               Quantization group size
--verbose, -v     False            Detailed progress
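The memory savings follow from the per-weight storage cost. Assuming MLX-style affine quantization, which stores one fp16 scale and one fp16 bias per group of weights (the MLX default), the bytes per weight work out roughly as:

```python
def bytes_per_weight(bits: int, group_size: int = 64) -> float:
    """Approximate storage per weight for grouped affine quantization.

    Packed weight bits, plus an fp16 scale (2 bytes) and fp16 bias (2 bytes)
    amortized over each group of `group_size` weights.
    """
    return bits / 8 + (2 + 2) / group_size

print(bytes_per_weight(4))  # 0.5625 bytes/weight (~7x smaller than float32)
print(bytes_per_weight(8))  # 1.0625 bytes/weight
```

A larger `--group-size` shrinks the scale/bias overhead slightly at some cost in quantization accuracy.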

Performance Comparison:

Format   Size     Speed*
Float32  11.6 GB  ~50 tok/s
8-bit    ~5.8 GB  ~80 tok/s
4-bit    ~1.8 GB  ~170 tok/s

*Approximate values on M1 Max

Important Notes

mlx-lm Patches Required

The mlx-lm library has bugs that affect AFM7. Our scripts apply patches automatically, but if you use mlx-lm directly, you need to apply these patches:

import mlx_lm.models.afm7 as afm7_module

# Patch 1: Fix fake_8bit_quant (corrupts K,V when scale=1.0)
afm7_module.fake_8bit_quant = lambda x, scale: x

# Patch 2: Fix FusedLinear.to_quantized (missing mode parameter)
if hasattr(afm7_module, "FusedLinear"):
    orig = afm7_module.FusedLinear.to_quantized
    afm7_module.FusedLinear.to_quantized = lambda self, group_size=64, bits=4, mode=None: orig(self, group_size, bits)

# IMPORTANT: Apply patches BEFORE importing load/generate
from mlx_lm import load, generate

Prompt Format

AFM7 uses a specific prompt format (NOT ChatML):

system
{system_prompt}<turn_end> user
 {user_message}<turn_end> assistant

The scripts handle this formatting automatically.
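As a rough sketch of what the scripts do under the hood (`build_prompt` is a hypothetical helper for illustration, not part of the toolkit, and the exact whitespace handling inside the real scripts may differ):

```python
def build_prompt(user_message: str,
                 system_prompt: str = "You are a helpful assistant.") -> str:
    # AFM7 turn structure: system and user turns each terminated by <turn_end>,
    # followed by an open "assistant" turn for the model to complete.
    return (
        f"system\n{system_prompt}<turn_end> "
        f"user\n {user_message}<turn_end> "
        f"assistant"
    )

print(build_prompt("What is 2+2?"))
```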

Special Tokens

Token         ID      Usage
<turn_start>  150000  Start of turn
<turn_end>    150001  End of turn (EOS)
<n>           4       Newline character
<pad>         0       Padding
<unk>         2       Unknown token
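In a hand-rolled generation loop, <turn_end> (ID 150001) is the token to stop on. A minimal sketch using the hard-coded ID from the table above:

```python
TURN_END_ID = 150001  # <turn_end>: AFM7 uses this as its EOS token

def collect_until_eos(token_stream):
    """Accumulate generated token IDs, stopping at <turn_end>."""
    out = []
    for tok in token_stream:
        if tok == TURN_END_ID:
            break
        out.append(tok)
    return out

print(collect_until_eos([10, 20, 150001, 30]))  # [10, 20]
```

The toolkit's generate.py handles this automatically; the sketch only matters if you drive mlx-lm yourself.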

Examples

Basic Chat

python generate.py "Explain quantum computing in simple terms"

Creative Writing

python generate.py -t 0.8 --max-tokens 500 "Write a short story about a robot learning to paint"

Code Generation

python generate.py --system "You are an expert Python programmer" \
    "Write a function to find prime numbers using the Sieve of Eratosthenes"

Interactive Conversation

python generate.py --interactive --system "You are a helpful coding assistant"

Using Quantized Model

# First quantize
python quantize.py --bits 4

# Then use
python generate.py --model mlx_afm7_q4 --max-tokens 200 "Your prompt"


Troubleshooting

"Checkpoint not found"

Make sure you have the PyTorch checkpoint from tamm:

assets/base-model.pt

"Tokenizer not found"

Make sure you have the tokenizer files:

assets/tokenizer.model

Garbled output

The mlx-lm patches may not have been applied. Make sure you're using our scripts or apply patches manually.

Out of memory

Try using a quantized model:

python quantize.py --bits 4
python generate.py --model mlx_afm7_q4 "Your prompt"

License & Legal

This toolkit (scripts and documentation) is provided for educational purposes.

The AFM7 model weights and tokenizer are proprietary to Apple. Review Apple's official licensing terms before use. Do not distribute model files.

Acknowledgments

  • Apple's tamm library for the original PyTorch implementation
  • MLX team for the efficient Apple Silicon framework
  • mlx-lm team for the afm7 model implementation
