| Quick Start | DeepSeek Deployment Guide | Qwen3 Deployment Guide | Changelog |
fastllm is a high-performance LLM inference library implemented in C++ with no backend dependencies (e.g., PyTorch).
It enables hybrid inference of MoE models, achieving 20+ tokens/s for DeepSeek R1 671B INT4 inference on a single consumer-grade GPU (e.g., RTX 4090).
Deployment discussion QQ group: 831641348
- 🚀 DeepSeek hybrid inference - deploy with multi-concurrency on consumer-grade single GPUs
- 🚀 Multi-NUMA node acceleration support
- 🚀 Dynamic batch and streaming output
- 🚀 Multi-GPU deployment and GPU+CPU hybrid deployment
- 🚀 Frontend-backend separation design for easy support of new computing devices
- 🚀 ROCm support, enabling inference on AMD GPUs
- 🚀 Pure C++ backend for easy cross-platform porting (can be directly compiled on Android)
- 🚀 Support for customizing model structures in Python
- PIP install (currently Nvidia GPU only)
Linux systems can try direct pip installation:
pip install ftllm -U
(Note: Due to PyPI size limitations, the package doesn't include CUDA dependencies - manual installation of CUDA 12+ is recommended)
wget https://developer.download.nvidia.com/compute/cuda/12.8.1/local_installers/cuda_12.8.1_570.124.06_linux.run
sudo sh cuda_12.8.1_570.124.06_linux.run
If pip installation fails or you have special requirements, you can build from source:
This project is built with CMake. It requires pre-installed gcc and g++ (7.5+ tested, 9.4+ recommended), make, and cmake (3.23+ recommended).
GPU compilation requires a CUDA environment (9.2+ is tested). Use the newest CUDA version possible.
Compilation commands:
bash install.sh -DUSE_CUDA=ON -D CMAKE_CUDA_COMPILER=$(which nvcc) # GPU version
# bash install.sh -DUSE_CUDA=ON -DCUDA_ARCH=89 -D CMAKE_CUDA_COMPILER=$(which nvcc) # Specify CUDA arch (e.g. 89 for RTX 4090)
# bash install.sh # CPU-only version
For compilation instructions on other platforms, please refer to the documentation:
If you run into problems during compilation, see the FAQ document.
Taking the Qwen/Qwen3-0.6B model as an example:
ftllm run Qwen/Qwen3-0.6B
ftllm webui Qwen/Qwen3-0.6B
ftllm server Qwen/Qwen3-0.6B
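Once `ftllm server` is running, you can query it over HTTP. The sketch below assumes the server exposes an OpenAI-compatible `/v1/chat/completions` endpoint on port 8080; the URL, port, and payload fields are assumptions to adapt to your setup.

```python
# Minimal sketch of querying a locally running `ftllm server`.
# ASSUMPTION: the server exposes an OpenAI-compatible
# /v1/chat/completions endpoint on port 8080.
import json
import urllib.request


def build_chat_request(model: str, prompt: str) -> dict:
    """Build an OpenAI-style chat completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }


def query(url: str, payload: dict) -> dict:
    """POST the payload as JSON and return the decoded response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Requires `ftllm server Qwen/Qwen3-0.6B --port 8080` to be running.
    payload = build_chat_request("Qwen/Qwen3-0.6B", "Hello!")
    print(query("http://127.0.0.1:8080/v1/chat/completions", payload))
```
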
You can launch a locally downloaded Hugging Face model. Assuming the local model path is /mnt/Qwen/Qwen3-0.6B/, use the following command (similar for webui and server):
ftllm run /mnt/Qwen/Qwen3-0.6B/
If you can't remember the exact model name, you can input an approximate name (matching is not guaranteed).
For example:
ftllm run qwen2-7b-awq
ftllm run deepseek-v3-0324-int4
If you don't want to use the default cache directory, you can set it via parameter --cache_dir, for example:
ftllm run deepseek-v3-0324-int4 --cache_dir /mnt/
Or you can set it via the environment variable FASTLLM_CACHEDIR. For example, on Linux:
export FASTLLM_CACHEDIR=/mnt/
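The two mechanisms combine in the usual way: an explicit `--cache_dir` flag takes priority over the `FASTLLM_CACHEDIR` environment variable. A small sketch of that precedence (the built-in default path shown here is an assumption, not fastllm's actual value):

```python
# Illustrative precedence for the model cache directory:
#   --cache_dir flag > FASTLLM_CACHEDIR env var > built-in default.
# The default path below is an assumption for illustration only.
import os


def resolve_cache_dir(cli_cache_dir=None):
    if cli_cache_dir:  # explicit --cache_dir flag wins
        return cli_cache_dir
    env_dir = os.environ.get("FASTLLM_CACHEDIR")
    if env_dir:  # environment variable is the fallback
        return env_dir
    return os.path.expanduser("~/.cache/fastllm")  # assumed default
```
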
The following are common parameters when running the ftllm module:
- `-t` or `--threads`:
  - Description: Sets the number of CPU threads to use.
  - Example: `-t 27`
- `--dtype`:
  - Description: Specifies the data type of the model.
  - Options: `int4` or other supported data types.
  - Example: `--dtype int4`
- `--device`:
  - Description: Specifies the computing device for the model.
  - Common Values: `cpu`, `cuda`, or `numa`.
  - Example: `--device cpu` or `--device cuda`
- `--moe_device`:
  - Description: Specifies the computing device for the MoE (Mixture of Experts) layers.
  - Common Values: `cpu`, `cuda`, or `numa`.
  - Example: `--moe_device cpu`
- `--moe_experts`:
  - Description: Specifies the number of experts to use in the MoE layers. If not set, it follows the model's configuration. Reducing the number of experts may speed up inference but could lower accuracy.
  - Example: `--moe_experts 6`
- `--port`:
  - Description: Specifies the port number for the service.
  - Example: `--port 8080`
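Putting several of these flags together, a hybrid CPU/GPU deployment of a large MoE model might look like the following. This is an illustrative command-line sketch: the model name and flag values are examples, not required settings.

```shell
# Illustrative: serve an INT4 MoE model with dense layers on the GPU,
# expert (MoE) layers on the CPU, 27 threads, listening on port 8080.
ftllm server deepseek-v3-0324-int4 \
    --device cuda \
    --moe_device cpu \
    --moe_experts 6 \
    -t 27 --port 8080
```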
Please read Arguments for Demos for further information.
Use the following command to download a model locally:
ftllm download deepseek-ai/DeepSeek-R1
If using quantized model loading (e.g., --dtype int4), the model will be quantized online each time it is loaded, which can be slow.
ftllm.export is a tool for exporting and converting model weights. It supports converting model weights to different data types. Below are detailed instructions on how to use ftllm.export.
ftllm export <model_path> -o <output_path> --dtype <data_type> -t <threads>
For example:
ftllm export /mnt/DeepSeek-V3 -o /mnt/DeepSeek-V3-INT4 --dtype int4 -t 16
You can specify --moe_dtype for mixed precision of a MoE model, for example:
ftllm export /mnt/DeepSeek-V3 -o /mnt/DeepSeek-V3-FP16INT4 --dtype float16 --moe_dtype int4 -t 16
The exported model can be used similarly to the original model. The --dtype parameter will be ignored when using the exported model.
For example:
ftllm run /mnt/DeepSeek-V3-INT4/
Fastllm supports original, AWQ, and FASTLLM model formats. Please refer to Supported Models for older models.
