Intel will not provide or guarantee development of or support for this project, including but not limited to, maintenance, bug fixes, new releases or updates.
Patches to this project are no longer accepted by Intel.
This project has been identified as having known security issues.
< English | 中文 >
IPEX-LLM is an LLM acceleration library for Intel GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max), NPU and CPU 1.
Note
IPEX-LLMprovides seamless integration with llama.cpp, Ollama, vLLM, HuggingFace transformers, LangChain, LlamaIndex, Text-Generation-WebUI, DeepSpeed-AutoTP, FastChat, Axolotl, HuggingFace PEFT, HuggingFace TRL, AutoGen, ModeScope, etc.- 70+ models have been optimized/verified on
ipex-llm(e.g., Llama, Phi, Mistral, Mixtral, DeepSeek, Qwen, ChatGLM, MiniCPM, Qwen-VL, MiniCPM-V and more), with state-of-art LLM optimizations, XPU acceleration and low-bit (FP8/FP6/FP4/INT4) support; see the complete list here.
- [2025/05] You can now run DeepSeek V3/R1 671B and Qwen3MoE 235B models with just 1 or 2 Intel Arc GPU (such as A770 or B580) using FlashMoE in
ipex-llm. - [2025/04] We released
ipex-llm 2.2.0, which includes Ollama Portable Zip and llama.cpp Portable Zip.⚠️ Warning (for llama.cpp Portable Zip)
mmap-based model loading in llama.cpp may leak data via side-channels in multi-tenant or shared-host environments.
To disablemmap, add:--no-mmap
- [2025/04] We added support of PyTorch 2.6 for Intel GPU.
- [2025/03] We added support for Gemma3 model in the latest llama.cpp Portable Zip.
- [2025/03] We can now run DeepSeek-R1-671B-Q4_K_M with 1 or 2 Arc A770 on Xeon using the latest llama.cpp Portable Zip.
- [2025/02] We added support of llama.cpp Portable Zip for Intel GPU (both Windows and Linux) and NPU (Windows only).
- [2025/02] We added support of Ollama Portable Zip to directly run Ollama on Intel GPU for both Windows and Linux (without the need of manual installations).
- [2025/02] We added support for running vLLM 0.6.6 on Intel Arc GPUs.
- [2025/01] We added the guide for running
ipex-llmon Intel Arc B580 GPU. - [2025/01] We added support for running Ollama 0.5.4 on Intel GPU.
- [2024/12] We added both Python and C++ support for Intel Core Ultra NPU (including 100H, 200V, 200K and 200H series).
More updates
- [2024/11] We added support for running vLLM 0.6.2 on Intel Arc GPUs.
- [2024/07] We added support for running Microsoft's GraphRAG using local LLM on Intel GPU; see the quickstart guide here.
- [2024/07] We added extensive support for Large Multimodal Models, including StableDiffusion, Phi-3-Vision, Qwen-VL, and more.
- [2024/07] We added FP6 support on Intel GPU.
- [2024/06] We added experimental NPU support for Intel Core Ultra processors; see the examples here.
- [2024/06] We added extensive support of pipeline parallel inference, which makes it easy to run large-sized LLM using 2 or more Intel GPUs (such as Arc).
- [2024/06] We added support for running RAGFlow with
ipex-llmon Intel GPU. - [2024/05]
ipex-llmnow supports Axolotl for LLM finetuning on Intel GPU; see the quickstart here. - [2024/05] You can now easily run
ipex-llminference, serving and finetuning using the Docker images. - [2024/05] You can now install
ipex-llmon Windows using just "one command". - [2024/04] You can now run Open WebUI on Intel GPU using
ipex-llm; see the quickstart here. - [2024/04] You can now run Llama 3 on Intel GPU using
llama.cppandollamawithipex-llm; see the quickstart here. - [2024/04]
ipex-llmnow supports Llama 3 on both Intel GPU and CPU. - [2024/04]
ipex-llmnow provides C++ interface, which can be used as an accelerated backend for running llama.cpp and ollama on Intel GPU. - [2024/03]
bigdl-llmhas now becomeipex-llm(see the migration guide here); you may find the originalBigDLproject here. - [2024/02]
ipex-llmnow supports directly loading model from ModelScope (魔搭). - [2024/02]
ipex-llmadded initial INT2 support (based on llama.cpp IQ2 mechanism), which makes it possible to run large-sized LLM (e.g., Mixtral-8x7B) on Intel GPU with 16GB VRAM. - [2024/02] Users can now use
ipex-llmthrough Text-Generation-WebUI GUI. - [2024/02]
ipex-llmnow supports Self-Speculative Decoding, which in practice brings ~30% speedup for FP16 and BF16 inference latency on Intel GPU and CPU respectively. - [2024/02]
ipex-llmnow supports a comprehensive list of LLM finetuning on Intel GPU (including LoRA, QLoRA, DPO, QA-LoRA and ReLoRA). - [2024/01] Using
ipex-llmQLoRA, we managed to finetune LLaMA2-7B in 21 minutes and LLaMA2-70B in 3.14 hours on 8 Intel Max 1550 GPU for Standford-Alpaca (see the blog here). - [2023/12]
ipex-llmnow supports ReLoRA (see "ReLoRA: High-Rank Training Through Low-Rank Updates"). - [2023/12]
ipex-llmnow supports Mixtral-8x7B on both Intel GPU and CPU. - [2023/12]
ipex-llmnow supports QA-LoRA (see "QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models"). - [2023/12]
ipex-llmnow supports FP8 and FP4 inference on Intel GPU. - [2023/11] Initial support for directly loading GGUF, AWQ and GPTQ models into
ipex-llmis available. - [2023/11]
ipex-llmnow supports vLLM continuous batching on both Intel GPU and CPU. - [2023/10]
ipex-llmnow supports QLoRA finetuning on both Intel GPU and CPU. - [2023/10]
ipex-llmnow supports FastChat serving on on both Intel CPU and GPU. - [2023/09]
ipex-llmnow supports Intel GPU (including iGPU, Arc, Flex and MAX). - [2023/09]
ipex-llmtutorial is released.
See demos of running local LLMs on Intel Core Ultra iGPU, Intel Core Ultra NPU, single-card Arc GPU, or multi-card Arc GPUs using ipex-llm below.
| Intel Core Ultra iGPU | Intel Core Ultra NPU | 2-Card Intel Arc dGPUs | Intel Xeon + Arc dGPU |
|
|
|
|
|
Ollama (Mistral-7B, Q4_K) |
HuggingFace (Llama3.2-3B, SYM_INT4) |
llama.cpp (DeepSeek-R1-Distill-Qwen-32B, Q4_K) |
FlashMoE (Qwen3MoE-235B, Q4_K) |
See the Token Generation Speed on Intel Core Ultra and Intel Arc GPU below1 (and refer to [2][3][4] for more details).
|
|
You may follow the Benchmarking Guide to run ipex-llm performance benchmark yourself.
Please see the Perplexity result below (tested on Wikitext dataset using the script here).
| Perplexity | sym_int4 | q4_k | fp6 | fp8_e5m2 | fp8_e4m3 | fp16 |
|---|---|---|---|---|---|---|
| Llama-2-7B-chat-hf | 6.364 | 6.218 | 6.092 | 6.180 | 6.098 | 6.096 |
| Mistral-7B-Instruct-v0.2 | 5.365 | 5.320 | 5.270 | 5.273 | 5.246 | 5.244 |
| Baichuan2-7B-chat | 6.734 | 6.727 | 6.527 | 6.539 | 6.488 | 6.508 |
| Qwen1.5-7B-chat | 8.865 | 8.816 | 8.557 | 8.846 | 8.530 | 8.607 |
| Llama-3.1-8B-Instruct | 6.705 | 6.566 | 6.338 | 6.383 | 6.325 | 6.267 |
| gemma-2-9b-it | 7.541 | 7.412 | 7.269 | 7.380 | 7.268 | 7.270 |
| Baichuan2-13B-Chat | 6.313 | 6.160 | 6.070 | 6.145 | 6.086 | 6.031 |
| Llama-2-13b-chat-hf | 5.449 | 5.422 | 5.341 | 5.384 | 5.332 | 5.329 |
| Qwen1.5-14B-Chat | 7.529 | 7.520 | 7.367 | 7.504 | 7.297 | 7.334 |
- Ollama: running Ollama on Intel GPU without the need of manual installations
- llama.cpp: running llama.cpp on Intel GPU without the need of manual installations
- Arc B580: running
ipex-llmon Intel Arc B580 GPU for Ollama, llama.cpp, PyTorch, HuggingFace, etc. - NPU: running
ipex-llmon Intel NPU in both Python/C++ or llama.cpp API. - PyTorch/HuggingFace: running PyTorch, HuggingFace, LangChain, LlamaIndex, etc. (using Python interface of
ipex-llm) on Intel GPU for Windows and Linux - vLLM: running
ipex-llmin vLLM on both Intel GPU and CPU - FastChat: running
ipex-llmin FastChat serving on on both Intel GPU and CPU - Serving on multiple Intel GPUs: running
ipex-llmserving on multiple Intel GPUs by leveraging DeepSpeed AutoTP and FastAPI - Text-Generation-WebUI: running
ipex-llminoobaboogaWebUI - Axolotl: running
ipex-llmin Axolotl for LLM finetuning - Benchmarking: running (latency and throughput) benchmarks for
ipex-llmon Intel CPU and GPU
- GPU Inference in C++: running
llama.cpp,ollama, etc., withipex-llmon Intel GPU - GPU Inference in Python : running HuggingFace
transformers,LangChain,LlamaIndex,ModelScope, etc. withipex-llmon Intel GPU - vLLM on GPU: running
vLLMserving withipex-llmon Intel GPU - vLLM on CPU: running
vLLMserving withipex-llmon Intel CPU - FastChat on GPU: running
FastChatserving withipex-llmon Intel GPU - VSCode on GPU: running and developing
ipex-llmapplications in Python using VSCode on Intel GPU
- GraphRAG: running Microsoft's
GraphRAGusing local LLM withipex-llm - RAGFlow: running
RAGFlow(an open-source RAG engine) withipex-llm - LangChain-Chatchat: running
LangChain-Chatchat(Knowledge Base QA using RAG pipeline) withipex-llm - Coding copilot: running
Continue(coding copilot in VSCode) withipex-llm - Open WebUI: running
Open WebUIwithipex-llm - PrivateGPT: running
PrivateGPTto interact with documents withipex-llm - Dify platform: running
ipex-llminDify(production-ready LLM app development platform)
- Windows GPU: installing
ipex-llmon Windows with Intel GPU - Linux GPU: installing
ipex-llmon Linux with Intel GPU - For more details, please refer to the full installation guide
-
- INT4 inference: INT4 LLM inference on Intel GPU and CPU
- FP8/FP6/FP4 inference: FP8, FP6 and FP4 LLM inference on Intel GPU
- INT8 inference: INT8 LLM inference on Intel GPU and CPU
- INT2 inference: INT2 LLM inference (based on llama.cpp IQ2 mechanism) on Intel GPU
-
- FP16 LLM inference on Intel GPU, with possible self-speculative decoding optimization
- BF16 LLM inference on Intel CPU, with possible self-speculative decoding optimization
-
- Low-bit models: saving and loading
ipex-llmlow-bit models (INT4/FP4/FP6/INT8/FP8/FP16/etc.) - GGUF: directly loading GGUF models into
ipex-llm - AWQ: directly loading AWQ models into
ipex-llm - GPTQ: directly loading GPTQ models into
ipex-llm
- Low-bit models: saving and loading
- Tutorials
Over 70 models have been optimized/verified on ipex-llm, including LLaMA/LLaMA2, Mistral, Mixtral, Gemma, LLaVA, Whisper, ChatGLM2/ChatGLM3, Baichuan/Baichuan2, Qwen/Qwen-1.5, InternLM and more; see the list below.
| Model | CPU Example | GPU Example | NPU Example |
|---|---|---|---|
| LLaMA | link1, link2 | link | |
| LLaMA 2 | link1, link2 | link | Python link, C++ link |
| LLaMA 3 | link | link | Python link, C++ link |
| LLaMA 3.1 | link | link | |
| LLaMA 3.2 | link | Python link, C++ link | |
| LLaMA 3.2-Vision | link | ||
| ChatGLM | link | ||
| ChatGLM2 | link | link | |
| ChatGLM3 | link | link | |
| GLM-4 | link | link | |
| GLM-4V | link | link | |
| GLM-Edge | link | Python link | |
| GLM-Edge-V | link | ||
| Mistral | link | link | |
| Mixtral | link | link | |
| Falcon | link | link | |
| MPT | link | link | |
| Dolly-v1 | link | link | |
| Dolly-v2 | link | link | |
| Replit Code | link | link | |
| RedPajama | link1, link2 | ||
| Phoenix | link1, link2 | ||
| StarCoder | link1, link2 | link | |
| Baichuan | link | link | |
| Baichuan2 | link | link | Python link |
| InternLM | link | link | |
| InternVL2 | link | ||
| Qwen | link | link | |
| Qwen1.5 | link | link | |
| Qwen2 | link | link | Python link, C++ link |
| Qwen2.5 | link | Python link, C++ link | |
| Qwen-VL | link | link | |
| Qwen2-VL | link | ||
| Qwen2-Audio | link | ||
| Aquila | link | link | |
| Aquila2 | link | link | |
| MOSS | link | ||
| Whisper | link | link | |
| Phi-1_5 | link | link | |
| Flan-t5 | link | link | |
| LLaVA | link | link | |
| CodeLlama | link | link | |
| Skywork | link | ||
| InternLM-XComposer | link | ||
| WizardCoder-Python | link | ||
| CodeShell | link | ||
| Fuyu | link | ||
| Distil-Whisper | link | link | |
| Yi | link | link | |
| BlueLM | link | link | |
| Mamba | link | link | |
| SOLAR | link | link | |
| Phixtral | link | link | |
| InternLM2 | link | link | |
| RWKV4 | link | ||
| RWKV5 | link | ||
| Bark | link | link | |
| SpeechT5 | link | ||
| DeepSeek-MoE | link | ||
| Ziya-Coding-34B-v1.0 | link | ||
| Phi-2 | link | link | |
| Phi-3 | link | link | |
| Phi-3-vision | link | link | |
| Yuan2 | link | link | |
| Gemma | link | link | |
| Gemma2 | link | ||
| DeciLM-7B | link | link | |
| Deepseek | link | link | |
| StableLM | link | link | |
| CodeGemma | link | link | |
| Command-R/cohere | link | link | |
| CodeGeeX2 | link | link | |
| MiniCPM | link | link | Python link, C++ link |
| MiniCPM3 | link | ||
| MiniCPM-V | link | ||
| MiniCPM-V-2 | link | link | |
| MiniCPM-Llama3-V-2_5 | link | Python link | |
| MiniCPM-V-2_6 | link | link | Python link |
| MiniCPM-o-2_6 | link | ||
| Janus-Pro | link | ||
| Moonlight | link | ||
| StableDiffusion | link | ||
| Bce-Embedding-Base-V1 | Python link | ||
| Speech_Paraformer-Large | Python link |
- Please report a bug or raise a feature request by opening a Github Issue
- Please report a vulnerability by opening a draft GitHub Security Advisory
Footnotes
-
Performance varies by use, configuration and other factors.
ipex-llmmay not optimize to the same degree for non-Intel products. Learn more at www.Intel.com/PerformanceIndex. ↩ ↩2