AFM7 MLX Toolkit

If you find this useful, please ⭐ the repo!

Convert and run Apple Foundation Model (AFMTextV7) on Apple Silicon using MLX, bypassing the 4,096-token context limit of Apple's Foundation Models API.

For exploration and academic curiosity.

See also my other repo, https://github.com/scouzi1966/MLXLMProbe, for inspecting and observing MLX models.

⚠️ IMPORTANT LEGAL DISCLAIMER

The source model (AFMTextV7) requires Apple Developer credentials and is subject to Apple's licensing terms. Do not distribute the model weights or tokenizer files.

Review Apple's official terms before use.

Features

  • Extended Context: Run AFM7 with up to 32K tokens (vs the 4,096-token limit in Apple's API)
  • GPU Execution: MLX runs on the GPU, whereas Apple's on-device model runs on the ANE
  • Apple Silicon Optimized: Native MLX acceleration on M1/M2/M3/M4 chips
  • Quantization Support: 4-bit and 8-bit quantization for faster inference
  • Interactive Mode: Multi-turn conversation support
  • Custom System Prompts: Configurable assistant behavior

Model Architecture

AFMTextV7 is Apple's on-device foundation model with:

  • 56 layers (35 regular + 21 KV-reuse)
  • 2048 hidden dimension
  • 16 query heads, 2 KV heads (Grouped Query Attention)
  • 153,600 vocabulary size
  • ~6B parameters
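The grouped-query attention numbers above can be sanity-checked with a few lines of arithmetic. This is a sketch only: it assumes head_dim = hidden_dim / n_query_heads, which is a common convention but is not read from the actual checkpoint config.

```python
# Sketch: derive GQA shapes from the architecture numbers listed above.
# head_dim = hidden_dim / n_query_heads is an assumption, not a checkpoint fact.
hidden_dim = 2048
n_query_heads = 16
n_kv_heads = 2

head_dim = hidden_dim // n_query_heads             # 128
queries_per_kv_head = n_query_heads // n_kv_heads  # 8 query heads share each KV head
kv_proj_dim = n_kv_heads * head_dim                # 256: K/V projections are 8x smaller than Q

print(head_dim, queries_per_kv_head, kv_proj_dim)  # 128 8 256
```

The 8:1 query-to-KV ratio is what makes the KV cache small enough to hold 32K tokens of context in memory on consumer Apple Silicon.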

Requirements

System Requirements

  • macOS 26 (Tahoe)
  • Apple Silicon (M1/M2/M3/M4)
  • 16GB+ RAM (32GB recommended for float32)

Python Dependencies

pip install -r requirements.txt

Or install manually:

pip install mlx mlx-lm torch numpy sentencepiece safetensors

Model Files

You need the PyTorch checkpoint and tokenizer from Apple's tamm library:

assets/
├── base-model.pt      # PyTorch checkpoint
├── tokenizer.model    # SentencePiece tokenizer
└── tokenizer-config.json

Quick Start

1. Convert the Model

python convert.py --checkpoint assets/base-model.pt --output mlx_afm7

2. Run Inference

python generate.py "What is the capital of France?"

3. (Optional) Quantize for Speed

python quantize.py --model mlx_afm7 --bits 4
python generate.py --model mlx_afm7_q4 "Your prompt"

Scripts

convert.py - Model Conversion

Convert PyTorch checkpoint to MLX format.

# Basic conversion
python convert.py

# Custom paths
python convert.py --checkpoint /path/to/model.pt --output ./my_model

# Verbose output
python convert.py --verbose

Options:

Option         Default               Description
--checkpoint   assets/base-model.pt  PyTorch checkpoint path
--output, -o   mlx_afm7              Output directory
--assets       assets                Directory with tokenizer files
--verbose, -v  False                 Detailed progress

generate.py - Text Generation

Generate text using the converted model.

# Simple prompt
python generate.py "What is 2+2?"

# With temperature
python generate.py -t 0.7 --max-tokens 200 "Write a poem"

# With system prompt
python generate.py --system "You are a pirate" "Tell me about ships"

# Interactive mode
python generate.py --interactive

# Use quantized model
python generate.py --model mlx_afm7_q4 "Your prompt"

Options:

Option             Default   Description
--model, -m        mlx_afm7  Model directory
--max-tokens       100       Max tokens to generate (not context window)
--temperature, -t  0.0       Sampling temperature (0 = greedy)
--system, -s       None      System prompt
--interactive, -i  False     Interactive conversation mode
--verbose, -v      False     Show token-by-token generation
--max-context      32768     Maximum context window (input + output)

Context Window Notes:

  • Max context (input + output): 32,768 tokens (model's trained limit)
  • --max-tokens limits generated output only
  • --max-context sets the context window limit (default: 32768)
  • Script warns if input + max-tokens exceeds max-context
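The warning described above amounts to a simple token-budget check. A minimal sketch of the idea (`check_context_budget` is a hypothetical helper for illustration, not a function in the toolkit):

```python
def check_context_budget(n_input_tokens: int, max_tokens: int,
                         max_context: int = 32768) -> bool:
    """Return True if input + requested output fits in the context window."""
    total = n_input_tokens + max_tokens
    if total > max_context:
        print(f"warning: input ({n_input_tokens}) + max-tokens ({max_tokens}) "
              f"= {total} exceeds max-context ({max_context})")
        return False
    return True

check_context_budget(31000, 2000)  # warns: 33000 > 32768
```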

quantize.py - Model Quantization

Quantize the model for reduced memory and faster inference.

# 4-bit quantization (recommended)
python quantize.py --model mlx_afm7 --bits 4

# 8-bit quantization
python quantize.py --model mlx_afm7 --bits 8

# Custom output
python quantize.py --model mlx_afm7 --bits 4 --output my_q4_model

Options:

Option            Default          Description
--model, -m       mlx_afm7         Input model directory
--output, -o      {model}_q{bits}  Output directory
--bits, -b        4                Quantization bits (4 or 8)
--group-size, -g  64               Quantization group size
--verbose, -v     False            Detailed progress
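The memory savings follow from the per-weight storage cost. Assuming MLX-style affine quantization, which stores one fp16 scale and one fp16 bias per group of weights (the MLX default), the bytes per weight work out roughly as:

```python
def bytes_per_weight(bits: int, group_size: int = 64) -> float:
    """Approximate storage per weight for grouped affine quantization.

    Packed weight bits, plus an fp16 scale (2 bytes) and fp16 bias (2 bytes)
    amortized over each group of `group_size` weights.
    """
    return bits / 8 + (2 + 2) / group_size

print(bytes_per_weight(4))  # 0.5625 bytes/weight (~7x smaller than float32)
print(bytes_per_weight(8))  # 1.0625 bytes/weight
```

A larger `--group-size` shrinks the scale/bias overhead slightly at some cost in quantization accuracy.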

Performance Comparison:

Format   Size     Speed*
Float32  11.6 GB  ~50 tok/s
8-bit    ~5.8 GB  ~80 tok/s
4-bit    ~1.8 GB  ~170 tok/s

*Approximate values on M1 Max

Important Notes

mlx-lm Patches Required

The mlx-lm library has bugs that affect AFM7. Our scripts apply patches automatically, but if you use mlx-lm directly, you need to apply these patches:

import mlx_lm.models.afm7 as afm7_module

# Patch 1: Fix fake_8bit_quant (corrupts K,V when scale=1.0)
afm7_module.fake_8bit_quant = lambda x, scale: x

# Patch 2: Fix FusedLinear.to_quantized (missing mode parameter)
if hasattr(afm7_module, "FusedLinear"):
    orig = afm7_module.FusedLinear.to_quantized
    afm7_module.FusedLinear.to_quantized = lambda self, group_size=64, bits=4, mode=None: orig(self, group_size, bits)

# IMPORTANT: Apply patches BEFORE importing load/generate
from mlx_lm import load, generate

Prompt Format

AFM7 uses a specific prompt format (NOT ChatML):

system
{system_prompt}<turn_end> user
 {user_message}<turn_end> assistant

The scripts handle this formatting automatically.
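As a rough sketch of what the scripts do under the hood (`build_prompt` is a hypothetical helper for illustration, not part of the toolkit, and the exact whitespace handling inside the real scripts may differ):

```python
def build_prompt(user_message: str,
                 system_prompt: str = "You are a helpful assistant.") -> str:
    # AFM7 turn structure: system and user turns each terminated by <turn_end>,
    # followed by an open "assistant" turn for the model to complete.
    return (
        f"system\n{system_prompt}<turn_end> "
        f"user\n {user_message}<turn_end> "
        f"assistant"
    )

print(build_prompt("What is 2+2?"))
```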

Special Tokens

Token         ID      Usage
<turn_start>  150000  Start of turn
<turn_end>    150001  End of turn (EOS)
<n>           4       Newline character
<pad>         0       Padding
<unk>         2       Unknown token
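In a hand-rolled generation loop, <turn_end> (ID 150001) is the token to stop on. A minimal sketch using the hard-coded ID from the table above:

```python
TURN_END_ID = 150001  # <turn_end>: AFM7 uses this as its EOS token

def collect_until_eos(token_stream):
    """Accumulate generated token IDs, stopping at <turn_end>."""
    out = []
    for tok in token_stream:
        if tok == TURN_END_ID:
            break
        out.append(tok)
    return out

print(collect_until_eos([10, 20, 150001, 30]))  # [10, 20]
```

The toolkit's generate.py handles this automatically; the sketch only matters if you drive mlx-lm yourself.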

Examples

Basic Chat

python generate.py "Explain quantum computing in simple terms"

Creative Writing

python generate.py -t 0.8 --max-tokens 500 "Write a short story about a robot learning to paint"

Code Generation

python generate.py --system "You are an expert Python programmer" \
    "Write a function to find prime numbers using the Sieve of Eratosthenes"

Interactive Conversation

python generate.py --interactive --system "You are a helpful coding assistant"

Using Quantized Model

# First quantize
python quantize.py --bits 4

# Then use
python generate.py --model mlx_afm7_q4 --max-tokens 200 "Your prompt"


Troubleshooting

"Checkpoint not found"

Make sure you have the PyTorch checkpoint from tamm:

assets/base-model.pt

"Tokenizer not found"

Make sure you have the tokenizer files:

assets/tokenizer.model

Garbled output

The mlx-lm patches may not have been applied. Make sure you're using our scripts or apply patches manually.

Out of memory

Try using a quantized model:

python quantize.py --bits 4
python generate.py --model mlx_afm7_q4 "Your prompt"

License & Legal

This toolkit (scripts and documentation) is provided for educational purposes.

The AFM7 model weights and tokenizer are proprietary to Apple. Review Apple's official licensing terms before use. Do not distribute model files.

Acknowledgments

  • Apple's tamm library for the original PyTorch implementation
  • MLX team for the efficient Apple Silicon framework
  • mlx-lm team for the afm7 model implementation
