
node-mlx

The fastest way to run LLMs in Node.js on Apple Silicon.



Installation

npm install node-mlx

Requirements: macOS 14+ (Sonoma) on Apple Silicon (M1/M2/M3/M4), Node.js 20+

Try it with the CLI

npx node-mlx "What is 2+2?"
npx node-mlx --model phi-3-mini   # Interactive chat

Usage

Basic Example

import { loadModel, RECOMMENDED_MODELS } from "node-mlx"

// Load a model (downloads automatically on first use)
const model = loadModel(RECOMMENDED_MODELS["llama-3.2-1b"])

// Generate text
const result = model.generate("Explain quantum computing in simple terms:", {
  maxTokens: 200,
  temperature: 0.7
})

console.log(result.text)
console.log(`${result.tokensPerSecond} tokens/sec`)

// Clean up when done
model.unload()

Using Phi-4

import { loadModel } from "node-mlx"

// Use any model from mlx-community
const model = loadModel("mlx-community/Phi-4-mini-instruct-4bit")

const result = model.generate("Write a haiku about coding:", {
  maxTokens: 50,
  temperature: 0.8
})

console.log(result.text)
model.unload()

Available Models

The RECOMMENDED_MODELS constant provides shortcuts to tested models:

import { RECOMMENDED_MODELS } from "node-mlx"

// Small & Fast
RECOMMENDED_MODELS["qwen-2.5-0.5b"] // Qwen 2.5 0.5B - Great for simple tasks
RECOMMENDED_MODELS["llama-3.2-1b"] // Llama 3.2 1B - Fast general purpose
RECOMMENDED_MODELS["qwen-2.5-1.5b"] // Qwen 2.5 1.5B - Good balance

// Medium
RECOMMENDED_MODELS["llama-3.2-3b"] // Llama 3.2 3B - Better quality
RECOMMENDED_MODELS["qwen-2.5-3b"] // Qwen 2.5 3B - Multilingual
RECOMMENDED_MODELS["phi-3-mini"] // Phi-3 Mini - Reasoning tasks

// Multimodal (text-only mode)
RECOMMENDED_MODELS["gemma-3n-2b"] // Gemma 3n 2B - Efficient
RECOMMENDED_MODELS["gemma-3n-4b"] // Gemma 3n 4B - Higher quality

You can also use any model from mlx-community:

loadModel("mlx-community/Mistral-7B-Instruct-v0.3-4bit")
loadModel("mlx-community/Qwen3-30B-A3B-4bit") // MoE model

How Model Loading Works

  1. First use: Model downloads from HuggingFace (~2-8 GB depending on model)
  2. Cached: Models are stored in ~/.cache/huggingface/ for future use
  3. GPU ready: Model loads directly into Apple Silicon unified memory

// First call - downloads and caches
const model = loadModel("mlx-community/Llama-3.2-1B-Instruct-4bit")
// ⏳ Downloading... (one time only)

// Second call - instant from cache
const model2 = loadModel("mlx-community/Llama-3.2-1B-Instruct-4bit")
// ⚡ Ready immediately
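
You can verify the cache on disk with plain Node.js (a minimal sketch; only the top-level path from step 2 is assumed, the internal layout is managed by the downloader):

import { existsSync } from "node:fs"
import { homedir } from "node:os"
import { join } from "node:path"

// Cache location from step 2 above
const cacheDir = join(homedir(), ".cache", "huggingface")

console.log(
  existsSync(cacheDir)
    ? `HuggingFace cache found at ${cacheDir}`
    : "No cache yet; the first loadModel() call will create it"
)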

One-Shot Generation

For single generations without keeping the model loaded:

import { generate } from "node-mlx"

// Loads, generates, unloads automatically
const result = generate("mlx-community/Llama-3.2-1B-Instruct-4bit", "Hello, world!", {
  maxTokens: 100
})

console.log(result.text)
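
Conceptually this is the same as using the persistent API (a sketch mirroring the description above, not the actual internals):

import { loadModel } from "node-mlx"

// Roughly what the one-shot helper does: load, generate, unload
const model = loadModel("mlx-community/Llama-3.2-1B-Instruct-4bit")
const result = model.generate("Hello, world!", { maxTokens: 100 })
model.unload()

console.log(result.text)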

API Reference

loadModel(modelId: string): Model

Loads a model from HuggingFace or a local path.
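
For example (the local path below is a placeholder and assumes a directory containing MLX-converted weights):

// From the HuggingFace Hub
const hubModel = loadModel("mlx-community/Llama-3.2-1B-Instruct-4bit")

// From a local directory (placeholder path)
const localModel = loadModel("./models/Llama-3.2-1B-Instruct-4bit")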

model.generate(prompt, options): GenerationResult

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| maxTokens | number | 256 | Maximum tokens to generate |
| temperature | number | 0.7 | Sampling randomness (0 = deterministic) |
| topP | number | 0.9 | Nucleus sampling threshold |

GenerationResult

{
  text: string // Generated text
  tokenCount: number // Tokens generated
  tokensPerSecond: number // Generation speed
}
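
Putting the options and result fields together (prompt and values are illustrative):

const result = model.generate("Summarize MLX in one sentence:", {
  maxTokens: 64,
  temperature: 0, // deterministic output
  topP: 0.9
})

console.log(result.text)
console.log(`${result.tokenCount} tokens at ${result.tokensPerSecond} tok/s`)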

Utilities

import { isSupported, getVersion } from "node-mlx"

isSupported() // true on Apple Silicon Mac
getVersion() // Library version
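
A typical guard before loading a model (the error message is illustrative):

import { isSupported, loadModel, RECOMMENDED_MODELS } from "node-mlx"

if (!isSupported()) {
  throw new Error("node-mlx requires macOS on Apple Silicon")
}

const model = loadModel(RECOMMENDED_MODELS["llama-3.2-1b"])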

Performance

Benchmarks on Mac Studio M1 Ultra (64GB):

| Model | node-mlx | node-llama-cpp | Winner |
| --- | --- | --- | --- |
| Qwen3 30B (MoE) | 67 tok/s | 1 tok/s | 60x faster 🏆 |
| GPT-OSS 20B | 58 tok/s | 5 tok/s | 11x faster 🏆 |
| Ministral 8B | 101 tok/s | 51 tok/s | 2x faster 🏆 |
| Phi-4 14B | 56 tok/s | 32 tok/s | 1.8x faster 🏆 |

Why is MLX faster?

  1. Unified Memory – No data copying between CPU and GPU
  2. Metal Optimization – Native Apple GPU kernels
  3. Lazy Evaluation – Fused operations, minimal memory bandwidth
  4. Native Quantization – 4-bit optimized for Apple Silicon

Supported Architectures

| Architecture | Example Models | Status |
| --- | --- | --- |
| Qwen2 | Qwen 2.5, Qwen3 (MoE) | ✅ Full support |
| Llama | Llama 3.2, Mistral, Ministral | ✅ Full support |
| Phi3 | Phi-3, Phi-4 | ✅ Full support |
| GPT-OSS | GPT-OSS 20B (MoE) | ✅ Full support |
| Gemma3n | Gemma 3n (VLM text-only) | 🔧 Experimental |

vs. node-llama-cpp

| | node-mlx | node-llama-cpp |
| --- | --- | --- |
| Platform | macOS Apple Silicon | Cross-platform |
| Backend | Apple MLX | llama.cpp |
| Memory | Unified CPU+GPU | Separate |
| Model Format | MLX/Safetensors | GGUF |
| MoE Support | ✅ Excellent | ⚠️ Limited |

Choose node-mlx for maximum performance on Apple Silicon. Choose node-llama-cpp for cross-platform or GGUF compatibility.


Development

Everything below is for contributors and maintainers.

Setup

git clone https://github.com/sebastian-software/node-mlx.git
cd node-mlx
pnpm install

Build

# Build everything
pnpm build:swift    # Swift library
pnpm build:native   # N-API addon (uses local Swift build)
pnpm build          # TypeScript (all packages)

# Or for development with prebuilds
cd packages/node-mlx
pnpm prebuildify    # Create prebuilt binaries for Node 20/22/24

Test

pnpm test           # All packages
pnpm test:coverage  # With coverage

# Swift tests
cd packages/swift
swift test

Project Structure

node-mlx/
├── package.json                 # Workspace root (private)
├── pnpm-workspace.yaml
├── turbo.json                   # Task orchestration
│
└── packages/
    ├── node-mlx/                # 📦 The npm package
    │   ├── package.json         # Published as "node-mlx"
    │   ├── src/                 # TypeScript API
    │   ├── test/                # TypeScript tests
    │   ├── native/              # C++ N-API binding
    │   ├── prebuilds/           # Prebuilt binaries (generated)
    │   └── swift/               # Swift artifacts (generated)
    │
    ├── swift/                   # Swift Package
    │   ├── Package.swift
    │   ├── Sources/NodeMLXCore/ # Swift implementation
    │   └── Tests/               # Swift tests
    │
    ├── hf2swift/                # Model code generator
    │   ├── src/                 # TypeScript generator
    │   └── tests/               # Generator tests
    │
    └── benchmarks/              # Performance benchmarks
        └── src/                 # Benchmark scripts

Publishing

# 1. Build Swift (copies to packages/node-mlx/swift/)
pnpm build:swift

# 2. Create prebuilds for Node 20/22/24
cd packages/node-mlx
pnpm prebuildify

# 3. Build TypeScript
pnpm build

# 4. Publish
npm publish

The published package includes:

  • dist/ – TypeScript (ESM + CJS)
  • prebuilds/darwin-arm64/node.node – N-API binary (72 KB)
  • swift/libNodeMLX.dylib – Swift ML library
  • swift/mlx-swift_Cmlx.bundle/ – Metal shaders

Adding New Models

pnpm hf2swift \
  --model MyModel \
  --source path/to/modeling_mymodel.py \
  --config organization/model-name \
  --output packages/swift/Sources/NodeMLXCore/Models/MyModel.swift

The hf2swift generator parses Python model code and produces Swift using MLX primitives.


Credits

Built on MLX by Apple, mlx-swift, and swift-transformers by HuggingFace.

Special thanks to mlx-swift-lm – we adopted and adapted several core components from their excellent implementation:

  • KV Cache management (KVCacheSimple, RotatingKVCache)
  • Token sampling strategies (temperature, top-p, repetition penalty)
  • RoPE implementations (Llama3, Yarn, LongRoPE)
  • Attention utilities and quantization support

License

MIT © 2026 Sebastian Software GmbH
