<h1 align="center">native-llm</h1>

<p align="center">
  <strong>The easiest way to run AI models locally.</strong>
</p>

---
## 🚀 Quick Start

```bash
npm install native-llm
```

```typescript
import { LLMEngine } from "native-llm"

const engine = new LLMEngine({ model: "gemma" })

const result = await engine.generate({
  prompt: "Explain quantum computing to a 5-year-old"
})

console.log(result.text)
```

**That's it.** The model downloads automatically and the GPU is detected automatically. It just works.
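> ♻️ **Reuse the engine:** keeping a single instance around avoids reloading the model between calls
> (an assumption worth verifying for your workload; the sketch below only uses the API shown above):

```typescript
import { LLMEngine } from "native-llm"

// Assumption: one engine instance keeps the model loaded and can serve many
// generate() calls, so only the first call pays the download/load cost.
const engine = new LLMEngine({ model: "gemma" })

const first = await engine.generate({ prompt: "Explain quantum computing to a 5-year-old" })
const followUp = await engine.generate({ prompt: "Now explain it to a college student" })

console.log(first.text)
console.log(followUp.text)
```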
---

## 🎯 Why native-llm?

**A friendly wrapper around [llama.cpp](https://github.com/ggerganov/llama.cpp) that handles the hard parts:**

| Without native-llm         | With native-llm         |
| -------------------------- | ----------------------- |
| Find GGUF model URLs       | `model: "gemma"`        |
| Configure HuggingFace auth | Auto from `HF_TOKEN`    |
| 20+ lines of setup         | 3 lines                 |
| Handle Qwen3 thinking mode | Automatic               |
| Research model benchmarks  | Curated recommendations |
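> 🔑 **HuggingFace auth:** if a model you request is gated on HuggingFace, create an access token at
> [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens) and expose it as `HF_TOKEN`.
> Per the table above it is picked up automatically, so nothing changes in code. A minimal sketch
> (the token value is a placeholder):

```typescript
import { LLMEngine } from "native-llm"

// Start the process with the token in the environment, for example:
//   HF_TOKEN=hf_xxxxxxxxxxxx node app.js
// Assumption: native-llm reads process.env.HF_TOKEN when downloading gated models;
// no auth code is needed here.
const engine = new LLMEngine({ model: "gemma" })
const result = await engine.generate({ prompt: "Say hello" })
console.log(result.text)
```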
### Local vs Cloud

|             | ☁️ Cloud AI              | 🏠 native-llm        |
| ----------- | ------------------------ | -------------------- |
| **Cost**    | $0.001-$0.10 per query   | **Free forever**     |
| **Speed**   | 1-20 seconds             | **< 100ms**          |
| **Privacy** | Data sent to servers     | **100% local**       |
| **Limits**  | Rate limits & quotas     | **Unlimited**        |
| **Offline** | ❌ Requires internet      | ✅ **Works offline**  |
---

## 🎨 Models

### Simple Aliases

Use simple aliases — we handle the rest:

```typescript
new LLMEngine({ model: "gemma" })      // Best balance (default)
new LLMEngine({ model: "gemma-fast" }) // Maximum speed
new LLMEngine({ model: "qwen-coder" }) // Code generation
new LLMEngine({ model: "deepseek" })   // Complex reasoning
```
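Aliases are a convenience, not a requirement: any of the many GGUF models on HuggingFace can be used,
and a local GGUF file path should work as the `model` value too. A sketch (the path is a placeholder;
the custom-models section of the docs has the details):

```typescript
import { LLMEngine } from "native-llm"

// Placeholder path: point this at a GGUF file you have downloaded.
// Assumption: a file path is accepted wherever an alias would go.
const engine = new LLMEngine({ model: "/path/to/any-model.gguf" })
```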
### Smart Recommendations

```typescript
import { LLMEngine } from "native-llm"

// Get the right model for your use case
const codeModel = LLMEngine.getModelForUseCase("code")       // → qwen-2.5-coder-7b
const fastModel = LLMEngine.getModelForUseCase("fast")       // → gemma-3n-e2b
const qualityModel = LLMEngine.getModelForUseCase("quality") // → gemma-3-27b

// List all available models
const models = LLMEngine.listModels()
// → [{ id: "gemma-3n-e4b", name: "Gemma 3n E4B", size: "5 GB", ... }, ...]
```
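These helpers combine naturally with the constructor, assuming the returned id is accepted as the
`model` value (it matches the ids that `listModels()` reports):

```typescript
import { LLMEngine } from "native-llm"

// Assumption: the id returned by getModelForUseCase() can be passed directly as `model`.
const engine = new LLMEngine({ model: LLMEngine.getModelForUseCase("code") })

const result = await engine.generate({
  prompt: "Write a TypeScript function that reverses a string"
})
console.log(result.text)
```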
### Performance

Benchmarked on an Apple M1 Ultra with Metal GPU acceleration:

| Model                 | Size  | Speed        | Best For          |
| --------------------- | ----- | ------------ | ----------------- |
| 🚀 **Gemma 3n E2B**   | 3 GB  | **36 tok/s** | Maximum speed     |
| ⭐ **Gemma 3n E4B**    | 5 GB  | **18 tok/s** | Best balance      |
| 💻 **Qwen 2.5 Coder** | 5 GB  | **23 tok/s** | Code generation   |
| 🧠 **DeepSeek R1**    | 5 GB  | **9 tok/s**  | Complex reasoning |
| 👑 **Gemma 3 27B**    | 18 GB | **5 tok/s**  | Maximum quality   |
---

## ✨ Features

| Feature                | Description                                                 |
| ---------------------- | ----------------------------------------------------------- |
| 📦 **Zero Config**     | Models download automatically, GPU detected automatically   |
| 🎯 **Smart Defaults**  | Curated models, sensible parameters, thinking-mode handled  |
| 🔥 **Native Speed**    | Direct llama.cpp bindings — no Python, no subprocess        |
| 🍎 **Metal GPU**       | Full Apple Silicon acceleration out of the box              |
| 🖥️ **Cross-Platform**  | macOS, Linux, Windows with CUDA support                     |
| 🌊 **Streaming**       | Real-time token-by-token output                             |
| 📝 **TypeScript**      | Full type definitions included                              |
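> 🌊 **Streaming:** the exact streaming API is covered in the full documentation linked below. Purely
> as an illustration of the idea, a hypothetical callback-style call could look like this (the
> `onToken` option is not a confirmed signature):

```typescript
import { LLMEngine } from "native-llm"

const engine = new LLMEngine({ model: "gemma" })

// Hypothetical shape for illustration only; check the docs for the real streaming API.
await engine.generate({
  prompt: "Write a haiku about running models locally",
  onToken: (token: string) => process.stdout.write(token),
})
```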
---

## 📚 Documentation

**[→ Full Documentation](https://sebastian-software.github.io/native-llm/)** — Streaming, chat API,
custom models, and more.
<p align="center">
  <strong>MIT License</strong> · Made with ❤️ by <a href="https://sebastian-software.de">Sebastian Software</a>