The fastest way to run LLMs in Node.js on Apple Silicon.
```bash
npm install node-mlx
```

Requirements: macOS 14+ (Sonoma) on Apple Silicon (M1/M2/M3/M4), Node.js 20+
```bash
npx node-mlx "What is 2+2?"
npx node-mlx --model phi-3-mini   # Interactive chat
```

```ts
import { loadModel, RECOMMENDED_MODELS } from "node-mlx"
// Load a model (downloads automatically on first use)
const model = loadModel(RECOMMENDED_MODELS["llama-3.2-1b"])
// Generate text
const result = model.generate("Explain quantum computing in simple terms:", {
maxTokens: 200,
temperature: 0.7
})
console.log(result.text)
console.log(`${result.tokensPerSecond} tokens/sec`)
// Clean up when done
model.unload()
```

```ts
import { loadModel } from "node-mlx"
// Use any model from mlx-community
const model = loadModel("mlx-community/Phi-4-mini-instruct-4bit")
const result = model.generate("Write a haiku about coding:", {
maxTokens: 50,
temperature: 0.8
})
console.log(result.text)
model.unload()
```

The `RECOMMENDED_MODELS` constant provides shortcuts to tested models:

```ts
import { RECOMMENDED_MODELS } from "node-mlx"
// Small & Fast
RECOMMENDED_MODELS["qwen-2.5-0.5b"] // Qwen 2.5 0.5B - Great for simple tasks
RECOMMENDED_MODELS["llama-3.2-1b"] // Llama 3.2 1B - Fast general purpose
RECOMMENDED_MODELS["qwen-2.5-1.5b"] // Qwen 2.5 1.5B - Good balance
// Medium
RECOMMENDED_MODELS["llama-3.2-3b"] // Llama 3.2 3B - Better quality
RECOMMENDED_MODELS["qwen-2.5-3b"] // Qwen 2.5 3B - Multilingual
RECOMMENDED_MODELS["phi-3-mini"] // Phi-3 Mini - Reasoning tasks
// Multimodal (text-only mode)
RECOMMENDED_MODELS["gemma-3n-2b"] // Gemma 3n 2B - Efficient
RECOMMENDED_MODELS["gemma-3n-4b"] // Gemma 3n 4B - Higher qualityYou can also use any model from mlx-community:
loadModel("mlx-community/Mistral-7B-Instruct-v0.3-4bit")
loadModel("mlx-community/Qwen3-30B-A3B-4bit") // MoE model- First use: Model downloads from HuggingFace (~2-8 GB depending on model)
- Cached: Models are stored in
~/.cache/huggingface/for future use - GPU ready: Model loads directly into Apple Silicon unified memory
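Cached models can add up. A rough sketch (assuming the default `~/.cache/huggingface/` location mentioned above) that reports how much disk space the cache currently uses:

```ts
// Rough sketch: total size of the local HuggingFace cache (default location).
// Walks ~/.cache/huggingface/ recursively; symlinked blobs may be counted twice.
import { readdirSync, statSync } from "node:fs"
import { join } from "node:path"
import { homedir } from "node:os"

function dirSizeBytes(dir: string): number {
  let total = 0
  for (const entry of readdirSync(dir, { withFileTypes: true })) {
    const full = join(dir, entry.name)
    total += entry.isDirectory() ? dirSizeBytes(full) : statSync(full).size
  }
  return total
}

const cache = join(homedir(), ".cache", "huggingface")
console.log(`Cached models: ${(dirSizeBytes(cache) / 1e9).toFixed(1)} GB in ${cache}`)
```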
```ts
// First call - downloads and caches
const model = loadModel("mlx-community/Llama-3.2-1B-Instruct-4bit")
// ⏳ Downloading... (one time only)
// Second call - instant from cache
const model2 = loadModel("mlx-community/Llama-3.2-1B-Instruct-4bit")
// ⚡ Ready immediately
```

For single generations without keeping the model loaded:

```ts
import { generate } from "node-mlx"
// Loads, generates, unloads automatically
const result = generate("mlx-community/Llama-3.2-1B-Instruct-4bit", "Hello, world!", {
maxTokens: 100
})
```

`loadModel()` accepts a HuggingFace model ID or a local path.
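For instance, loading from a local directory instead of the hub; the path below is purely hypothetical and assumes the directory contains an MLX-converted model (config, tokenizer, safetensors weights):

```ts
import { loadModel } from "node-mlx"

// Hypothetical local directory holding an MLX-converted model
const model = loadModel("/Users/me/models/Llama-3.2-1B-Instruct-4bit")
console.log(model.generate("Hello!", { maxTokens: 20 }).text)
model.unload()
```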
Generation options for `model.generate()`:

| Option | Type | Default | Description |
|---|---|---|---|
| `maxTokens` | number | 256 | Maximum tokens to generate |
| `temperature` | number | 0.7 | Sampling randomness (0 = deterministic) |
| `topP` | number | 0.9 | Nucleus sampling threshold |
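As a sketch of how these options combine, here is greedy (deterministic) decoding with a capped output length, using only the options from the table above:

```ts
import { loadModel, RECOMMENDED_MODELS } from "node-mlx"

const model = loadModel(RECOMMENDED_MODELS["llama-3.2-1b"])

// temperature 0 = deterministic: the same prompt yields the same completion
const result = model.generate("Name the planets of the solar system:", {
  maxTokens: 64,
  temperature: 0,
  topP: 0.9
})

console.log(result.text)
console.log(`${result.tokenCount} tokens at ${result.tokensPerSecond} tok/s`)
model.unload()
```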
Both `model.generate()` and the one-shot `generate()` return:

```ts
{
text: string // Generated text
tokenCount: number // Tokens generated
tokensPerSecond: number // Generation speed
}
```

```ts
import { isSupported, getVersion } from "node-mlx"
isSupported() // true on Apple Silicon Mac
getVersion()    // Library version
```

Benchmarks on Mac Studio M1 Ultra (64GB):
| Model | node-mlx | node-llama-cpp | Winner |
|---|---|---|---|
| Qwen3 30B (MoE) | 67 tok/s | 1 tok/s | 60x faster 🏆 |
| GPT-OSS 20B | 58 tok/s | 5 tok/s | 11x faster 🏆 |
| Ministral 8B | 101 tok/s | 51 tok/s | 2x faster 🏆 |
| Phi-4 14B | 56 tok/s | 32 tok/s | 1.8x faster 🏆 |
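These figures are machine- and model-dependent. A rough sketch for measuring your own tokens-per-second number with the one-shot `generate()` helper and the `tokensPerSecond` field it reports (assuming the one-shot helper accepts the same options as `model.generate()`):

```ts
import { generate } from "node-mlx"

// One-shot benchmark: load, generate a fixed number of tokens, report speed.
// No warm-up run is done here, so the number may include load/compile overhead.
const result = generate(
  "mlx-community/Llama-3.2-1B-Instruct-4bit",
  "Write a short story about a lighthouse keeper.",
  { maxTokens: 256, temperature: 0.7 }
)

console.log(`${result.tokenCount} tokens @ ${result.tokensPerSecond.toFixed(1)} tok/s`)
```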
Why is MLX faster?
- Unified Memory → No data copying between CPU and GPU
- Metal Optimization → Native Apple GPU kernels
- Lazy Evaluation → Fused operations, minimal memory bandwidth
- Native Quantization → 4-bit optimized for Apple Silicon
| Architecture | Example Models | Status |
|---|---|---|
| Qwen2 | Qwen 2.5, Qwen3 (MoE) | ✅ Full support |
| Llama | Llama 3.2, Mistral, Ministral | ✅ Full support |
| Phi3 | Phi-3, Phi-4 | ✅ Full support |
| GPT-OSS | GPT-OSS 20B (MoE) | ✅ Full support |
| Gemma3n | Gemma 3n (VLM text-only) | 🚧 Experimental |
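All of these architectures go through the same `loadModel()` call. A quick smoke-test sketch, one model per family, using model IDs that already appear in this README (each is a multi-gigabyte download on first use):

```ts
import { loadModel } from "node-mlx"

// One representative model per supported architecture family
const modelsByArchitecture = {
  qwen2: "mlx-community/Qwen3-30B-A3B-4bit",
  llama: "mlx-community/Mistral-7B-Instruct-v0.3-4bit",
  phi3: "mlx-community/Phi-4-mini-instruct-4bit",
}

for (const [arch, id] of Object.entries(modelsByArchitecture)) {
  const model = loadModel(id)
  const { text } = model.generate("Reply with one word: ready?", { maxTokens: 8 })
  console.log(`${arch}: ${text.trim()}`)
  model.unload() // free unified memory before loading the next model
}
```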
|  | node-mlx | node-llama-cpp |
|---|---|---|
| Platform | macOS Apple Silicon | Cross-platform |
| Backend | Apple MLX | llama.cpp |
| Memory | Unified CPU+GPU | Separate |
| Model Format | MLX/Safetensors | GGUF |
| MoE Support | ✅ Excellent | Slow (see MoE benchmark above) |
Choose node-mlx for maximum performance on Apple Silicon. Choose node-llama-cpp for cross-platform or GGUF compatibility.
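If an application also has to run off Apple Silicon, the `isSupported()` helper shown earlier can pick the backend at runtime; a minimal sketch:

```ts
import { isSupported, generate } from "node-mlx"

const prompt = "Summarize why unified memory helps on-device inference."

if (isSupported()) {
  // Apple Silicon: use node-mlx directly
  const result = generate("mlx-community/Llama-3.2-1B-Instruct-4bit", prompt, { maxTokens: 128 })
  console.log(result.text)
} else {
  // Elsewhere: hand the prompt to another backend (e.g. node-llama-cpp) or a remote API
  console.warn("node-mlx is unavailable on this platform; falling back")
}
```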
Everything below is for contributors and maintainers.
```bash
git clone https://github.com/sebastian-software/node-mlx.git
cd node-mlx
pnpm install
```

```bash
# Build everything
pnpm build:swift # Swift library
pnpm build:native # N-API addon (uses local Swift build)
pnpm build # TypeScript (all packages)
# Or for development with prebuilds
cd packages/node-mlx
pnpm prebuildify   # Create prebuilt binaries for Node 20/22/24
```

```bash
pnpm test            # All packages
pnpm test:coverage # With coverage
# Swift tests
cd packages/swift
swift test
```

```
node-mlx/
├── package.json               # Workspace root (private)
├── pnpm-workspace.yaml
├── turbo.json                 # Task orchestration
│
└── packages/
    ├── node-mlx/              # 📦 The npm package
    │   ├── package.json       # Published as "node-mlx"
    │   ├── src/               # TypeScript API
    │   ├── test/              # TypeScript tests
    │   ├── native/            # C++ N-API binding
    │   ├── prebuilds/         # Prebuilt binaries (generated)
    │   └── swift/             # Swift artifacts (generated)
    │
    ├── swift/                 # Swift Package
    │   ├── Package.swift
    │   ├── Sources/NodeMLXCore/   # Swift implementation
    │   └── Tests/             # Swift tests
    │
    ├── hf2swift/              # Model code generator
    │   ├── src/               # TypeScript generator
    │   └── tests/             # Generator tests
    │
    └── benchmarks/            # Performance benchmarks
        └── src/               # Benchmark scripts
```
```bash
# 1. Build Swift (copies to packages/node-mlx/swift/)
pnpm build:swift
# 2. Create prebuilds for Node 20/22/24
cd packages/node-mlx
pnpm prebuildify
# 3. Build TypeScript
pnpm build
# 4. Publish
npm publish
```

The published package includes:
- `dist/` → TypeScript (ESM + CJS)
- `prebuilds/darwin-arm64/node.node` → N-API binary (72 KB)
- `swift/libNodeMLX.dylib` → Swift ML library
- `swift/mlx-swift_Cmlx.bundle/` → Metal shaders
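A small sanity check before `npm publish` can confirm that the artifacts listed above actually made it into `packages/node-mlx/` (a sketch; run it from that directory):

```ts
// Sketch: confirm the build artifacts exist before publishing.
// Paths are the ones listed above, relative to packages/node-mlx/.
import { existsSync } from "node:fs"

const required = [
  "dist",
  "prebuilds/darwin-arm64/node.node",
  "swift/libNodeMLX.dylib",
  "swift/mlx-swift_Cmlx.bundle",
]

const missing = required.filter((p) => !existsSync(p))
if (missing.length > 0) {
  console.error(`Missing build artifacts: ${missing.join(", ")}`)
  process.exit(1)
}
console.log("All publish artifacts present ✅")
```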
```bash
pnpm hf2swift \
--model MyModel \
--source path/to/modeling_mymodel.py \
--config organization/model-name \
--output packages/swift/Sources/NodeMLXCore/Models/MyModel.swift
```

The hf2swift generator parses Python model code and produces Swift code using MLX primitives.
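As a toy illustration of the kind of translation involved (this is not the actual hf2swift implementation, just a single-pattern example), mapping one PyTorch layer declaration to an mlx-swift-style one:

```ts
// Toy illustration only: one pattern mapping a PyTorch module line to an
// mlx-swift-style declaration. The real hf2swift generator is far more involved.
const pyLine = "self.q_proj = nn.Linear(hidden_size, hidden_size, bias=False)"

const m = pyLine.match(/self\.(\w+)\s*=\s*nn\.Linear\((\w+),\s*(\w+),\s*bias=(True|False)\)/)
if (m) {
  const [, name, inDim, outDim, bias] = m
  // snake_case -> camelCase for the Swift property name
  const camel = name.replace(/_(\w)/g, (_, c: string) => c.toUpperCase())
  console.log(`let ${camel} = Linear(${inDim}, ${outDim}, bias: ${bias.toLowerCase()})`)
  // -> let qProj = Linear(hidden_size, hidden_size, bias: false)
}
```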
Built on MLX by Apple, mlx-swift, and swift-transformers by HuggingFace.
Special thanks to mlx-swift-lm – we adopted and adapted several core components from their excellent implementation:

- KV Cache management (`KVCacheSimple`, `RotatingKVCache`)
- Token sampling strategies (temperature, top-p, repetition penalty)
- RoPE implementations (Llama3, Yarn, LongRoPE)
- Attention utilities and quantization support
MIT © 2026 Sebastian Software GmbH