From d323eae3f157c0766dc8f837a942cf64614c6633 Mon Sep 17 00:00:00 2001 From: MUHAMMAD ANAS Date: Tue, 9 Sep 2025 13:45:11 +0500 Subject: [PATCH 1/2] Update installation instructions and formatting in README Corrected formatting and improved clarity in the installation instructions for Windows and Debian/Ubuntu users. Added emphasis on initializing submodules for building the project. --- README.md | 206 ++++++++++-------------------------------------------- 1 file changed, 35 insertions(+), 171 deletions(-) diff --git a/README.md b/README.md index 798c0e95..33e33317 100644 --- a/README.md +++ b/README.md @@ -13,7 +13,7 @@ The first release of bitnet.cpp is to support inference on CPUs. bitnet.cpp achi m2_performance m2_performance ->The tested models are dummy setups used in a research context to demonstrate the inference performance of bitnet.cpp. +> The tested models are dummy setups used in a research context to demonstrate the inference performance of bitnet.cpp. ## Demo @@ -35,6 +35,7 @@ https://github.com/user-attachments/assets/7f46b736-edec-4828-b809-4be780a3e5b1 ## Acknowledgements This project is based on the [llama.cpp](https://github.com/ggerganov/llama.cpp) framework. We would like to thank all the authors for their contributions to the open-source community. Also, bitnet.cpp's kernels are built on top of the Lookup Table methodologies pioneered in [T-MAC](https://github.com/microsoft/T-MAC/). For inference of general low-bit LLMs beyond ternary models, we recommend using T-MAC. + ## Official Models @@ -153,23 +154,22 @@ This project is based on the [llama.cpp](https://github.com/ggerganov/llama.cpp)
- - ## Installation ### Requirements - python>=3.9 - cmake>=3.22 - clang>=18 - - For Windows users, install [Visual Studio 2022](https://visualstudio.microsoft.com/downloads/). In the installer, toggle on at least the following options(this also automatically installs the required additional tools like CMake): - - Desktop-development with C++ - - C++-CMake Tools for Windows - - Git for Windows - - C++-Clang Compiler for Windows - - MS-Build Support for LLVM-Toolset (clang) - - For Debian/Ubuntu users, you can download with [Automatic installation script](https://apt.llvm.org/) - - `bash -c "$(wget -O - https://apt.llvm.org/llvm.sh)"` + - For Windows users, install [Visual Studio 2022](https://visualstudio.microsoft.com/downloads/). In the installer, toggle on at least the following options (this also automatically installs the required additional tools like CMake): + - Desktop-development with C++ + - C++-CMake Tools for Windows + - Git for Windows + - C++-Clang Compiler for Windows + - MS-Build Support for LLVM-Toolset (clang) + - For Debian/Ubuntu users, you can download with [Automatic installation script](https://apt.llvm.org/) + ``` + bash -c "$(wget -O - https://apt.llvm.org/llvm.sh)" + ``` - conda (highly recommend) ### Build from source @@ -177,162 +177,26 @@ This project is based on the [llama.cpp](https://github.com/ggerganov/llama.cpp) > [!IMPORTANT] > If you are using Windows, please remember to always use a Developer Command Prompt / PowerShell for VS2022 for the following commands. Please refer to the FAQs below if you see any issues. -1. Clone the repo -```bash -git clone --recursive https://github.com/microsoft/BitNet.git -cd BitNet -``` -2. Install the dependencies -```bash -# (Recommended) Create a new conda environment -conda create -n bitnet-cpp python=3.9 -conda activate bitnet-cpp - -pip install -r requirements.txt -``` -3. Build the project -```bash -# Manually download the model and run with local path -huggingface-cli download microsoft/BitNet-b1.58-2B-4T-gguf --local-dir models/BitNet-b1.58-2B-4T -python setup_env.py -md models/BitNet-b1.58-2B-4T -q i2_s - -``` -
-usage: setup_env.py [-h] [--hf-repo {1bitLLM/bitnet_b1_58-large,1bitLLM/bitnet_b1_58-3B,HF1BitLLM/Llama3-8B-1.58-100B-tokens,tiiuae/Falcon3-1B-Instruct-1.58bit,tiiuae/Falcon3-3B-Instruct-1.58bit,tiiuae/Falcon3-7B-Instruct-1.58bit,tiiuae/Falcon3-10B-Instruct-1.58bit}] [--model-dir MODEL_DIR] [--log-dir LOG_DIR] [--quant-type {i2_s,tl1}] [--quant-embd]
-                    [--use-pretuned]
-
-Setup the environment for running inference
-
-optional arguments:
-  -h, --help            show this help message and exit
-  --hf-repo {1bitLLM/bitnet_b1_58-large,1bitLLM/bitnet_b1_58-3B,HF1BitLLM/Llama3-8B-1.58-100B-tokens,tiiuae/Falcon3-1B-Instruct-1.58bit,tiiuae/Falcon3-3B-Instruct-1.58bit,tiiuae/Falcon3-7B-Instruct-1.58bit,tiiuae/Falcon3-10B-Instruct-1.58bit}, -hr {1bitLLM/bitnet_b1_58-large,1bitLLM/bitnet_b1_58-3B,HF1BitLLM/Llama3-8B-1.58-100B-tokens,tiiuae/Falcon3-1B-Instruct-1.58bit,tiiuae/Falcon3-3B-Instruct-1.58bit,tiiuae/Falcon3-7B-Instruct-1.58bit,tiiuae/Falcon3-10B-Instruct-1.58bit}
-                        Model used for inference
-  --model-dir MODEL_DIR, -md MODEL_DIR
-                        Directory to save/load the model
-  --log-dir LOG_DIR, -ld LOG_DIR
-                        Directory to save the logging info
-  --quant-type {i2_s,tl1}, -q {i2_s,tl1}
-                        Quantization type
-  --quant-embd          Quantize the embeddings to f16
-  --use-pretuned, -p    Use the pretuned kernel parameters
-
-## Usage -### Basic usage -```bash -# Run inference with the quantized model -python run_inference.py -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf -p "You are a helpful assistant" -cnv -``` -
-usage: run_inference.py [-h] [-m MODEL] [-n N_PREDICT] -p PROMPT [-t THREADS] [-c CTX_SIZE] [-temp TEMPERATURE] [-cnv]
-
-Run inference
-
-optional arguments:
-  -h, --help            show this help message and exit
-  -m MODEL, --model MODEL
-                        Path to model file
-  -n N_PREDICT, --n-predict N_PREDICT
-                        Number of tokens to predict when generating text
-  -p PROMPT, --prompt PROMPT
-                        Prompt to generate text from
-  -t THREADS, --threads THREADS
-                        Number of threads to use
-  -c CTX_SIZE, --ctx-size CTX_SIZE
-                        Size of the prompt context
-  -temp TEMPERATURE, --temperature TEMPERATURE
-                        Temperature, a hyperparameter that controls the randomness of the generated text
-  -cnv, --conversation  Whether to enable chat mode or not (for instruct models.)
-                        (When this option is turned on, the prompt specified by -p will be used as the system prompt.)
-
- -### Benchmark -We provide scripts to run the inference benchmark providing a model. - -``` -usage: e2e_benchmark.py -m MODEL [-n N_TOKEN] [-p N_PROMPT] [-t THREADS] - -Setup the environment for running the inference - -required arguments: - -m MODEL, --model MODEL - Path to the model file. - -optional arguments: - -h, --help - Show this help message and exit. - -n N_TOKEN, --n-token N_TOKEN - Number of generated tokens. - -p N_PROMPT, --n-prompt N_PROMPT - Prompt to generate text from. - -t THREADS, --threads THREADS - Number of threads to use. -``` - -Here's a brief explanation of each argument: - -- `-m`, `--model`: The path to the model file. This is a required argument that must be provided when running the script. -- `-n`, `--n-token`: The number of tokens to generate during the inference. It is an optional argument with a default value of 128. -- `-p`, `--n-prompt`: The number of prompt tokens to use for generating text. This is an optional argument with a default value of 512. -- `-t`, `--threads`: The number of threads to use for running the inference. It is an optional argument with a default value of 2. -- `-h`, `--help`: Show the help message and exit. Use this argument to display usage information. - -For example: - -```sh -python utils/e2e_benchmark.py -m /path/to/model -n 200 -p 256 -t 4 -``` - -This command would run the inference benchmark using the model located at `/path/to/model`, generating 200 tokens from a 256 token prompt, utilizing 4 threads. - -For the model layout that do not supported by any public model, we provide scripts to generate a dummy model with the given model layout, and run the benchmark on your machine: - -```bash -python utils/generate-dummy-bitnet-model.py models/bitnet_b1_58-large --outfile models/dummy-bitnet-125m.tl1.gguf --outtype tl1 --model-size 125M - -# Run benchmark with the generated model, use -m to specify the model path, -p to specify the prompt processed, -n to specify the number of token to generate -python utils/e2e_benchmark.py -m models/dummy-bitnet-125m.tl1.gguf -p 512 -n 128 -``` - -### Convert from `.safetensors` Checkpoints - -```sh -# Prepare the .safetensors model file -huggingface-cli download microsoft/bitnet-b1.58-2B-4T-bf16 --local-dir ./models/bitnet-b1.58-2B-4T-bf16 - -# Convert to gguf model -python ./utils/convert-helper-bitnet.py ./models/bitnet-b1.58-2B-4T-bf16 -``` - -### FAQ (Frequently Asked Questions)📌 - -#### Q1: The build dies with errors building llama.cpp due to issues with std::chrono in log.cpp? - -**A:** -This is an issue introduced in recent version of llama.cpp. Please refer to this [commit](https://github.com/tinglou/llama.cpp/commit/4e3db1e3d78cc1bcd22bcb3af54bd2a4628dd323) in the [discussion](https://github.com/abetlen/llama-cpp-python/issues/1942) to fix this issue. - -#### Q2: How to build with clang in conda environment on windows? - -**A:** -Before building the project, verify your clang installation and access to Visual Studio tools by running: -``` -clang -v -``` - -This command checks that you are using the correct version of clang and that the Visual Studio tools are available. If you see an error message such as: -``` -'clang' is not recognized as an internal or external command, operable program or batch file. -``` - -It indicates that your command line window is not properly initialized for Visual Studio tools. 
-
-• If you are using Command Prompt, run:
-```
-"C:\Program Files\Microsoft Visual Studio\2022\Professional\Common7\Tools\VsDevCmd.bat" -startdir=none -arch=x64 -host_arch=x64
-```
-
-• If you are using Windows PowerShell, run the following commands:
-```
-Import-Module "C:\Program Files\Microsoft Visual Studio\2022\Professional\Common7\Tools\Microsoft.VisualStudio.DevShell.dll" Enter-VsDevShell 3f0e31ad -SkipAutomaticLocation -DevCmdArguments "-arch=x64 -host_arch=x64"
-```
-
-These steps will initialize your environment and allow you to use the correct Visual Studio tools.
+1. **Clone the repo**
+   ```bash
+   git clone --recursive https://github.com/microsoft/BitNet.git
+   cd BitNet
+   ```
+
+2. **Install the dependencies**
+   ```bash
+   # (Recommended) Create a new conda environment
+   conda create -n bitnet-cpp python=3.9
+   conda activate bitnet-cpp
+
+   pip install -r requirements.txt
+   ```
+
+3. **Make sure `llama.cpp` is present (submodule or manual clone)**
+   Many build issues happen when `3rdparty/llama.cpp` is empty because the repo was cloned without submodules.
+
+   **Preferred: initialize the submodule**
+   ```bash
+   # from repo root
+   git submodule sync --recursive
+   git submodule update --init --recursive
+   ```

From 0cec2338995e57de6d99a85bee9b78642dfcae29 Mon Sep 17 00:00:00 2001
From: MUHAMMAD ANAS
Date: Tue, 9 Sep 2025 13:50:54 +0500
Subject: [PATCH 2/2] Update README with build and usage instructions

---
 README.md | 147 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 147 insertions(+)

diff --git a/README.md b/README.md
index 33e33317..21235d17 100644
--- a/README.md
+++ b/README.md
@@ -200,3 +200,150 @@ This project is based on the [llama.cpp](https://github.com/ggerganov/llama.cpp)
 git submodule sync --recursive
 git submodule update --init --recursive
    ```
+
+4. **Build the project**
+```bash
+# Manually download the model and run with local path
+huggingface-cli download microsoft/BitNet-b1.58-2B-4T-gguf --local-dir models/BitNet-b1.58-2B-4T
+python setup_env.py -md models/BitNet-b1.58-2B-4T -q i2_s
+
+```
+<pre>
+usage: setup_env.py [-h] [--hf-repo {1bitLLM/bitnet_b1_58-large,1bitLLM/bitnet_b1_58-3B,HF1BitLLM/Llama3-8B-1.58-100B-tokens,tiiuae/Falcon3-1B-Instruct-1.58bit,tiiuae/Falcon3-3B-Instruct-1.58bit,tiiuae/Falcon3-7B-Instruct-1.58bit,tiiuae/Falcon3-10B-Instruct-1.58bit}] [--model-dir MODEL_DIR] [--log-dir LOG_DIR] [--quant-type {i2_s,tl1}] [--quant-embd]
+                    [--use-pretuned]
+
+Setup the environment for running inference
+
+optional arguments:
+  -h, --help            show this help message and exit
+  --hf-repo {1bitLLM/bitnet_b1_58-large,1bitLLM/bitnet_b1_58-3B,HF1BitLLM/Llama3-8B-1.58-100B-tokens,tiiuae/Falcon3-1B-Instruct-1.58bit,tiiuae/Falcon3-3B-Instruct-1.58bit,tiiuae/Falcon3-7B-Instruct-1.58bit,tiiuae/Falcon3-10B-Instruct-1.58bit}, -hr {1bitLLM/bitnet_b1_58-large,1bitLLM/bitnet_b1_58-3B,HF1BitLLM/Llama3-8B-1.58-100B-tokens,tiiuae/Falcon3-1B-Instruct-1.58bit,tiiuae/Falcon3-3B-Instruct-1.58bit,tiiuae/Falcon3-7B-Instruct-1.58bit,tiiuae/Falcon3-10B-Instruct-1.58bit}
+                        Model used for inference
+  --model-dir MODEL_DIR, -md MODEL_DIR
+                        Directory to save/load the model
+  --log-dir LOG_DIR, -ld LOG_DIR
+                        Directory to save the logging info
+  --quant-type {i2_s,tl1}, -q {i2_s,tl1}
+                        Quantization type
+  --quant-embd          Quantize the embeddings to f16
+  --use-pretuned, -p    Use the pretuned kernel parameters
+
+## Usage +### Basic usage +```bash +# Run inference with the quantized model +python run_inference.py -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf -p "You are a helpful assistant" -cnv +``` +
+usage: run_inference.py [-h] [-m MODEL] [-n N_PREDICT] -p PROMPT [-t THREADS] [-c CTX_SIZE] [-temp TEMPERATURE] [-cnv]
+
+Run inference
+
+optional arguments:
+  -h, --help            show this help message and exit
+  -m MODEL, --model MODEL
+                        Path to model file
+  -n N_PREDICT, --n-predict N_PREDICT
+                        Number of tokens to predict when generating text
+  -p PROMPT, --prompt PROMPT
+                        Prompt to generate text from
+  -t THREADS, --threads THREADS
+                        Number of threads to use
+  -c CTX_SIZE, --ctx-size CTX_SIZE
+                        Size of the prompt context
+  -temp TEMPERATURE, --temperature TEMPERATURE
+                        Temperature, a hyperparameter that controls the randomness of the generated text
+  -cnv, --conversation  Whether to enable chat mode or not (for instruct models.)
+                        (When this option is turned on, the prompt specified by -p will be used as the system prompt.)
+
+</pre>
+
+### Benchmark
+We provide scripts to run the inference benchmark with a given model.
+
+```
+usage: e2e_benchmark.py -m MODEL [-n N_TOKEN] [-p N_PROMPT] [-t THREADS]
+
+Setup the environment for running the inference
+
+required arguments:
+  -m MODEL, --model MODEL
+        Path to the model file.
+
+optional arguments:
+  -h, --help
+        Show this help message and exit.
+  -n N_TOKEN, --n-token N_TOKEN
+        Number of generated tokens.
+  -p N_PROMPT, --n-prompt N_PROMPT
+        Prompt to generate text from.
+  -t THREADS, --threads THREADS
+        Number of threads to use.
+```
+
+Here's a brief explanation of each argument:
+
+- `-m`, `--model`: The path to the model file. This is a required argument that must be provided when running the script.
+- `-n`, `--n-token`: The number of tokens to generate during the inference. It is an optional argument with a default value of 128.
+- `-p`, `--n-prompt`: The number of prompt tokens to use for generating text. This is an optional argument with a default value of 512.
+- `-t`, `--threads`: The number of threads to use for running the inference. It is an optional argument with a default value of 2.
+- `-h`, `--help`: Show the help message and exit. Use this argument to display usage information.
+
+For example:
+
+```sh
+python utils/e2e_benchmark.py -m /path/to/model -n 200 -p 256 -t 4
+```
+
+This command would run the inference benchmark using the model located at `/path/to/model`, generating 200 tokens from a 256-token prompt, using 4 threads.
+
+For model layouts that are not supported by any public model, we provide scripts to generate a dummy model with the given layout and run the benchmark on your machine:
+
+```bash
+python utils/generate-dummy-bitnet-model.py models/bitnet_b1_58-large --outfile models/dummy-bitnet-125m.tl1.gguf --outtype tl1 --model-size 125M
+
+# Run benchmark with the generated model; use -m to specify the model path, -p the number of prompt tokens to process, and -n the number of tokens to generate
+python utils/e2e_benchmark.py -m models/dummy-bitnet-125m.tl1.gguf -p 512 -n 128
+```
+
+### Convert from `.safetensors` Checkpoints
+
+```sh
+# Prepare the .safetensors model file
+huggingface-cli download microsoft/bitnet-b1.58-2B-4T-bf16 --local-dir ./models/bitnet-b1.58-2B-4T-bf16
+
+# Convert to gguf model
+python ./utils/convert-helper-bitnet.py ./models/bitnet-b1.58-2B-4T-bf16
+```
+
+### FAQ (Frequently Asked Questions)📌
+
+#### Q1: The build fails with errors about std::chrono in log.cpp while compiling llama.cpp?
+
+**A:**
+This is an issue introduced in a recent version of llama.cpp. Please refer to this [commit](https://github.com/tinglou/llama.cpp/commit/4e3db1e3d78cc1bcd22bcb3af54bd2a4628dd323) in the [discussion](https://github.com/abetlen/llama-cpp-python/issues/1942) to fix this issue.
+
+#### Q2: How to build with clang in a conda environment on Windows?
+
+**A:**
+Before building the project, verify your clang installation and access to Visual Studio tools by running:
+```
+clang -v
+```
+
+This command checks that you are using the correct version of clang and that the Visual Studio tools are available. If you see an error message such as:
+```
+'clang' is not recognized as an internal or external command, operable program or batch file.
+```
+
+This indicates that your command line window is not properly initialized for Visual Studio tools.
+
+• If you are using Command Prompt, run:
+```
+"C:\Program Files\Microsoft Visual Studio\2022\Professional\Common7\Tools\VsDevCmd.bat" -startdir=none -arch=x64 -host_arch=x64
+```
+
+• If you are using Windows PowerShell, run the following commands:
+```
+Import-Module "C:\Program Files\Microsoft Visual Studio\2022\Professional\Common7\Tools\Microsoft.VisualStudio.DevShell.dll"
+Enter-VsDevShell 3f0e31ad -SkipAutomaticLocation -DevCmdArguments "-arch=x64 -host_arch=x64"
+```
+
+These steps will initialize your environment and allow you to use the correct Visual Studio tools.
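+
+After the environment is initialized, a quick sanity check (assuming clang and CMake are already installed) is to confirm that both tools are visible from the same shell:
+```
+# Both commands should print version information without errors
+clang -v
+cmake --version
+```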