From d323eae3f157c0766dc8f837a942cf64614c6633 Mon Sep 17 00:00:00 2001 From: MUHAMMAD ANAS Date: Tue, 9 Sep 2025 13:45:11 +0500 Subject: [PATCH 1/2] Update installation instructions and formatting in README Corrected formatting and improved clarity in the installation instructions for Windows and Debian/Ubuntu users. Added emphasis on initializing submodules for building the project. --- README.md | 206 ++++++++++-------------------------------------------- 1 file changed, 35 insertions(+), 171 deletions(-) diff --git a/README.md b/README.md index 798c0e95..33e33317 100644 --- a/README.md +++ b/README.md @@ -13,7 +13,7 @@ The first release of bitnet.cpp is to support inference on CPUs. bitnet.cpp achi m2_performance m2_performance ->The tested models are dummy setups used in a research context to demonstrate the inference performance of bitnet.cpp. +> The tested models are dummy setups used in a research context to demonstrate the inference performance of bitnet.cpp. ## Demo @@ -35,6 +35,7 @@ https://github.com/user-attachments/assets/7f46b736-edec-4828-b809-4be780a3e5b1 ## Acknowledgements This project is based on the [llama.cpp](https://github.com/ggerganov/llama.cpp) framework. We would like to thank all the authors for their contributions to the open-source community. Also, bitnet.cpp's kernels are built on top of the Lookup Table methodologies pioneered in [T-MAC](https://github.com/microsoft/T-MAC/). For inference of general low-bit LLMs beyond ternary models, we recommend using T-MAC. + ## Official Models @@ -153,23 +154,22 @@ This project is based on the [llama.cpp](https://github.com/ggerganov/llama.cpp)
- - ## Installation ### Requirements - python>=3.9 - cmake>=3.22 - clang>=18 - - For Windows users, install [Visual Studio 2022](https://visualstudio.microsoft.com/downloads/). In the installer, toggle on at least the following options(this also automatically installs the required additional tools like CMake): - - Desktop-development with C++ - - C++-CMake Tools for Windows - - Git for Windows - - C++-Clang Compiler for Windows - - MS-Build Support for LLVM-Toolset (clang) - - For Debian/Ubuntu users, you can download with [Automatic installation script](https://apt.llvm.org/) - - `bash -c "$(wget -O - https://apt.llvm.org/llvm.sh)"` + - For Windows users, install [Visual Studio 2022](https://visualstudio.microsoft.com/downloads/). In the installer, toggle on at least the following options (this also automatically installs the required additional tools like CMake): + - Desktop-development with C++ + - C++-CMake Tools for Windows + - Git for Windows + - C++-Clang Compiler for Windows + - MS-Build Support for LLVM-Toolset (clang) + - For Debian/Ubuntu users, you can download with [Automatic installation script](https://apt.llvm.org/) + ``` + bash -c "$(wget -O - https://apt.llvm.org/llvm.sh)" + ``` - conda (highly recommend) ### Build from source @@ -177,162 +177,26 @@ This project is based on the [llama.cpp](https://github.com/ggerganov/llama.cpp) > [!IMPORTANT] > If you are using Windows, please remember to always use a Developer Command Prompt / PowerShell for VS2022 for the following commands. Please refer to the FAQs below if you see any issues. -1. Clone the repo -```bash -git clone --recursive https://github.com/microsoft/BitNet.git -cd BitNet -``` -2. Install the dependencies -```bash -# (Recommended) Create a new conda environment -conda create -n bitnet-cpp python=3.9 -conda activate bitnet-cpp - -pip install -r requirements.txt -``` -3. Build the project -```bash -# Manually download the model and run with local path -huggingface-cli download microsoft/BitNet-b1.58-2B-4T-gguf --local-dir models/BitNet-b1.58-2B-4T -python setup_env.py -md models/BitNet-b1.58-2B-4T -q i2_s - -``` -
-usage: setup_env.py [-h] [--hf-repo {1bitLLM/bitnet_b1_58-large,1bitLLM/bitnet_b1_58-3B,HF1BitLLM/Llama3-8B-1.58-100B-tokens,tiiuae/Falcon3-1B-Instruct-1.58bit,tiiuae/Falcon3-3B-Instruct-1.58bit,tiiuae/Falcon3-7B-Instruct-1.58bit,tiiuae/Falcon3-10B-Instruct-1.58bit}] [--model-dir MODEL_DIR] [--log-dir LOG_DIR] [--quant-type {i2_s,tl1}] [--quant-embd]
-                    [--use-pretuned]
-
-Setup the environment for running inference
-
-optional arguments:
-  -h, --help            show this help message and exit
-  --hf-repo {1bitLLM/bitnet_b1_58-large,1bitLLM/bitnet_b1_58-3B,HF1BitLLM/Llama3-8B-1.58-100B-tokens,tiiuae/Falcon3-1B-Instruct-1.58bit,tiiuae/Falcon3-3B-Instruct-1.58bit,tiiuae/Falcon3-7B-Instruct-1.58bit,tiiuae/Falcon3-10B-Instruct-1.58bit}, -hr {1bitLLM/bitnet_b1_58-large,1bitLLM/bitnet_b1_58-3B,HF1BitLLM/Llama3-8B-1.58-100B-tokens,tiiuae/Falcon3-1B-Instruct-1.58bit,tiiuae/Falcon3-3B-Instruct-1.58bit,tiiuae/Falcon3-7B-Instruct-1.58bit,tiiuae/Falcon3-10B-Instruct-1.58bit}
-                        Model used for inference
-  --model-dir MODEL_DIR, -md MODEL_DIR
-                        Directory to save/load the model
-  --log-dir LOG_DIR, -ld LOG_DIR
-                        Directory to save the logging info
-  --quant-type {i2_s,tl1}, -q {i2_s,tl1}
-                        Quantization type
-  --quant-embd          Quantize the embeddings to f16
-  --use-pretuned, -p    Use the pretuned kernel parameters
-
-## Usage -### Basic usage -```bash -# Run inference with the quantized model -python run_inference.py -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf -p "You are a helpful assistant" -cnv -``` -
-usage: run_inference.py [-h] [-m MODEL] [-n N_PREDICT] -p PROMPT [-t THREADS] [-c CTX_SIZE] [-temp TEMPERATURE] [-cnv]
-
-Run inference
-
-optional arguments:
-  -h, --help            show this help message and exit
-  -m MODEL, --model MODEL
-                        Path to model file
-  -n N_PREDICT, --n-predict N_PREDICT
-                        Number of tokens to predict when generating text
-  -p PROMPT, --prompt PROMPT
-                        Prompt to generate text from
-  -t THREADS, --threads THREADS
-                        Number of threads to use
-  -c CTX_SIZE, --ctx-size CTX_SIZE
-                        Size of the prompt context
-  -temp TEMPERATURE, --temperature TEMPERATURE
-                        Temperature, a hyperparameter that controls the randomness of the generated text
-  -cnv, --conversation  Whether to enable chat mode or not (for instruct models.)
-                        (When this option is turned on, the prompt specified by -p will be used as the system prompt.)
-
- -### Benchmark -We provide scripts to run the inference benchmark providing a model. - -``` -usage: e2e_benchmark.py -m MODEL [-n N_TOKEN] [-p N_PROMPT] [-t THREADS] - -Setup the environment for running the inference - -required arguments: - -m MODEL, --model MODEL - Path to the model file. - -optional arguments: - -h, --help - Show this help message and exit. - -n N_TOKEN, --n-token N_TOKEN - Number of generated tokens. - -p N_PROMPT, --n-prompt N_PROMPT - Prompt to generate text from. - -t THREADS, --threads THREADS - Number of threads to use. -``` - -Here's a brief explanation of each argument: - -- `-m`, `--model`: The path to the model file. This is a required argument that must be provided when running the script. -- `-n`, `--n-token`: The number of tokens to generate during the inference. It is an optional argument with a default value of 128. -- `-p`, `--n-prompt`: The number of prompt tokens to use for generating text. This is an optional argument with a default value of 512. -- `-t`, `--threads`: The number of threads to use for running the inference. It is an optional argument with a default value of 2. -- `-h`, `--help`: Show the help message and exit. Use this argument to display usage information. - -For example: - -```sh -python utils/e2e_benchmark.py -m /path/to/model -n 200 -p 256 -t 4 -``` - -This command would run the inference benchmark using the model located at `/path/to/model`, generating 200 tokens from a 256 token prompt, utilizing 4 threads. - -For the model layout that do not supported by any public model, we provide scripts to generate a dummy model with the given model layout, and run the benchmark on your machine: - -```bash -python utils/generate-dummy-bitnet-model.py models/bitnet_b1_58-large --outfile models/dummy-bitnet-125m.tl1.gguf --outtype tl1 --model-size 125M - -# Run benchmark with the generated model, use -m to specify the model path, -p to specify the prompt processed, -n to specify the number of token to generate -python utils/e2e_benchmark.py -m models/dummy-bitnet-125m.tl1.gguf -p 512 -n 128 -``` - -### Convert from `.safetensors` Checkpoints - -```sh -# Prepare the .safetensors model file -huggingface-cli download microsoft/bitnet-b1.58-2B-4T-bf16 --local-dir ./models/bitnet-b1.58-2B-4T-bf16 - -# Convert to gguf model -python ./utils/convert-helper-bitnet.py ./models/bitnet-b1.58-2B-4T-bf16 -``` - -### FAQ (Frequently Asked Questions)📌 - -#### Q1: The build dies with errors building llama.cpp due to issues with std::chrono in log.cpp? - -**A:** -This is an issue introduced in recent version of llama.cpp. Please refer to this [commit](https://github.com/tinglou/llama.cpp/commit/4e3db1e3d78cc1bcd22bcb3af54bd2a4628dd323) in the [discussion](https://github.com/abetlen/llama-cpp-python/issues/1942) to fix this issue. - -#### Q2: How to build with clang in conda environment on windows? - -**A:** -Before building the project, verify your clang installation and access to Visual Studio tools by running: -``` -clang -v -``` - -This command checks that you are using the correct version of clang and that the Visual Studio tools are available. If you see an error message such as: -``` -'clang' is not recognized as an internal or external command, operable program or batch file. -``` - -It indicates that your command line window is not properly initialized for Visual Studio tools. 
-
-• If you are using Command Prompt, run:
-```
-"C:\Program Files\Microsoft Visual Studio\2022\Professional\Common7\Tools\VsDevCmd.bat" -startdir=none -arch=x64 -host_arch=x64
-```
-
-• If you are using Windows PowerShell, run the following commands:
-```
-Import-Module "C:\Program Files\Microsoft Visual Studio\2022\Professional\Common7\Tools\Microsoft.VisualStudio.DevShell.dll" Enter-VsDevShell 3f0e31ad -SkipAutomaticLocation -DevCmdArguments "-arch=x64 -host_arch=x64"
-```
-
-These steps will initialize your environment and allow you to use the correct Visual Studio tools.
+1. **Clone the repo**
+   ```bash
+   git clone --recursive https://github.com/microsoft/BitNet.git
+   cd BitNet
+   ```
+
+2. **Install the dependencies**
+   ```bash
+   # (Recommended) Create a new conda environment
+   conda create -n bitnet-cpp python=3.9
+   conda activate bitnet-cpp
+
+   pip install -r requirements.txt
+   ```
+
+3. **Make sure `llama.cpp` is present (submodule or manual clone)**
+   Many build issues happen when `3rdparty/llama.cpp` is empty because the repo was cloned without submodules.
+
+   **Preferred: initialize the submodule**
+   ```bash
+   # from repo root
+   git submodule sync --recursive
+   git submodule update --init --recursive
+   ```

From 0cec2338995e57de6d99a85bee9b78642dfcae29 Mon Sep 17 00:00:00 2001
From: MUHAMMAD ANAS
Date: Tue, 9 Sep 2025 13:50:54 +0500
Subject: [PATCH 2/2] Update README with build and usage instructions

---
 README.md | 147 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 147 insertions(+)

diff --git a/README.md b/README.md
index 33e33317..21235d17 100644
--- a/README.md
+++ b/README.md
@@ -200,3 +200,150 @@ This project is based on the [llama.cpp](https://github.com/ggerganov/llama.cpp)
 git submodule sync --recursive
 git submodule update --init --recursive
    ```
+
+4. **Build the project**
+```bash
+# Manually download the model and run with local path
+huggingface-cli download microsoft/BitNet-b1.58-2B-4T-gguf --local-dir models/BitNet-b1.58-2B-4T
+python setup_env.py -md models/BitNet-b1.58-2B-4T -q i2_s
+
+```
+<pre>
+usage: setup_env.py [-h] [--hf-repo {1bitLLM/bitnet_b1_58-large,1bitLLM/bitnet_b1_58-3B,HF1BitLLM/Llama3-8B-1.58-100B-tokens,tiiuae/Falcon3-1B-Instruct-1.58bit,tiiuae/Falcon3-3B-Instruct-1.58bit,tiiuae/Falcon3-7B-Instruct-1.58bit,tiiuae/Falcon3-10B-Instruct-1.58bit}] [--model-dir MODEL_DIR] [--log-dir LOG_DIR] [--quant-type {i2_s,tl1}] [--quant-embd]
+                    [--use-pretuned]
+
+Setup the environment for running inference
+
+optional arguments:
+  -h, --help            show this help message and exit
+  --hf-repo {1bitLLM/bitnet_b1_58-large,1bitLLM/bitnet_b1_58-3B,HF1BitLLM/Llama3-8B-1.58-100B-tokens,tiiuae/Falcon3-1B-Instruct-1.58bit,tiiuae/Falcon3-3B-Instruct-1.58bit,tiiuae/Falcon3-7B-Instruct-1.58bit,tiiuae/Falcon3-10B-Instruct-1.58bit}, -hr {1bitLLM/bitnet_b1_58-large,1bitLLM/bitnet_b1_58-3B,HF1BitLLM/Llama3-8B-1.58-100B-tokens,tiiuae/Falcon3-1B-Instruct-1.58bit,tiiuae/Falcon3-3B-Instruct-1.58bit,tiiuae/Falcon3-7B-Instruct-1.58bit,tiiuae/Falcon3-10B-Instruct-1.58bit}
+                        Model used for inference
+  --model-dir MODEL_DIR, -md MODEL_DIR
+                        Directory to save/load the model
+  --log-dir LOG_DIR, -ld LOG_DIR
+                        Directory to save the logging info
+  --quant-type {i2_s,tl1}, -q {i2_s,tl1}
+                        Quantization type
+  --quant-embd          Quantize the embeddings to f16
+  --use-pretuned, -p    Use the pretuned kernel parameters
+
+## Usage +### Basic usage +```bash +# Run inference with the quantized model +python run_inference.py -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf -p "You are a helpful assistant" -cnv +``` +
+usage: run_inference.py [-h] [-m MODEL] [-n N_PREDICT] -p PROMPT [-t THREADS] [-c CTX_SIZE] [-temp TEMPERATURE] [-cnv]
+
+Run inference
+
+optional arguments:
+  -h, --help            show this help message and exit
+  -m MODEL, --model MODEL
+                        Path to model file
+  -n N_PREDICT, --n-predict N_PREDICT
+                        Number of tokens to predict when generating text
+  -p PROMPT, --prompt PROMPT
+                        Prompt to generate text from
+  -t THREADS, --threads THREADS
+                        Number of threads to use
+  -c CTX_SIZE, --ctx-size CTX_SIZE
+                        Size of the prompt context
+  -temp TEMPERATURE, --temperature TEMPERATURE
+                        Temperature, a hyperparameter that controls the randomness of the generated text
+  -cnv, --conversation  Whether to enable chat mode or not (for instruct models.)
+                        (When this option is turned on, the prompt specified by -p will be used as the system prompt.)
+
+</pre>
+
+### Benchmark
+We provide scripts to run the inference benchmark with a given model.
+
+```
+usage: e2e_benchmark.py -m MODEL [-n N_TOKEN] [-p N_PROMPT] [-t THREADS]
+
+Setup the environment for running the inference
+
+required arguments:
+  -m MODEL, --model MODEL
+        Path to the model file.
+
+optional arguments:
+  -h, --help
+        Show this help message and exit.
+  -n N_TOKEN, --n-token N_TOKEN
+        Number of generated tokens.
+  -p N_PROMPT, --n-prompt N_PROMPT
+        Prompt to generate text from.
+  -t THREADS, --threads THREADS
+        Number of threads to use.
+```
+
+Here's a brief explanation of each argument:
+
+- `-m`, `--model`: The path to the model file. This is a required argument that must be provided when running the script.
+- `-n`, `--n-token`: The number of tokens to generate during the inference. It is an optional argument with a default value of 128.
+- `-p`, `--n-prompt`: The number of prompt tokens to use for generating text. This is an optional argument with a default value of 512.
+- `-t`, `--threads`: The number of threads to use for running the inference. It is an optional argument with a default value of 2.
+- `-h`, `--help`: Show the help message and exit. Use this argument to display usage information.
+
+For example:
+
+```sh
+python utils/e2e_benchmark.py -m /path/to/model -n 200 -p 256 -t 4
+```
+
+This command would run the inference benchmark using the model located at `/path/to/model`, generating 200 tokens from a 256-token prompt, using 4 threads.
+
+For model layouts that are not supported by any public model, we provide scripts to generate a dummy model with the given layout and run the benchmark on your machine:
+
+```bash
+python utils/generate-dummy-bitnet-model.py models/bitnet_b1_58-large --outfile models/dummy-bitnet-125m.tl1.gguf --outtype tl1 --model-size 125M
+
+# Run benchmark with the generated model; use -m to specify the model path, -p the number of prompt tokens to process, and -n the number of tokens to generate
+python utils/e2e_benchmark.py -m models/dummy-bitnet-125m.tl1.gguf -p 512 -n 128
+```
+
+### Convert from `.safetensors` Checkpoints
+
+```sh
+# Prepare the .safetensors model file
+huggingface-cli download microsoft/bitnet-b1.58-2B-4T-bf16 --local-dir ./models/bitnet-b1.58-2B-4T-bf16
+
+# Convert to gguf model
+python ./utils/convert-helper-bitnet.py ./models/bitnet-b1.58-2B-4T-bf16
+```
+
+### FAQ (Frequently Asked Questions)📌
+
+#### Q1: The build fails with errors about std::chrono in log.cpp while compiling llama.cpp?
+
+**A:**
+This is an issue introduced in a recent version of llama.cpp. Please refer to this [commit](https://github.com/tinglou/llama.cpp/commit/4e3db1e3d78cc1bcd22bcb3af54bd2a4628dd323) in the [discussion](https://github.com/abetlen/llama-cpp-python/issues/1942) to fix this issue.
+
+#### Q2: How to build with clang in a conda environment on Windows?
+
+**A:**
+Before building the project, verify your clang installation and access to Visual Studio tools by running:
+```
+clang -v
+```
+
+This command checks that you are using the correct version of clang and that the Visual Studio tools are available. If you see an error message such as:
+```
+'clang' is not recognized as an internal or external command, operable program or batch file.
+```
+
+This indicates that your command line window is not properly initialized for Visual Studio tools.
+
+• If you are using Command Prompt, run:
+```
+"C:\Program Files\Microsoft Visual Studio\2022\Professional\Common7\Tools\VsDevCmd.bat" -startdir=none -arch=x64 -host_arch=x64
+```
+
+• If you are using Windows PowerShell, run the following commands:
+```
+Import-Module "C:\Program Files\Microsoft Visual Studio\2022\Professional\Common7\Tools\Microsoft.VisualStudio.DevShell.dll"
+Enter-VsDevShell 3f0e31ad -SkipAutomaticLocation -DevCmdArguments "-arch=x64 -host_arch=x64"
+```
+
+These steps will initialize your environment and allow you to use the correct Visual Studio tools.
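+
+After the environment is initialized, a quick sanity check (assuming clang and CMake are already installed) is to confirm that both tools are visible from the same shell:
+```
+# Both commands should print version information without errors
+clang -v
+cmake --version
+```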