I tested building and running the Dockerfile on a MacBook Pro M4 Max running Rancher Desktop. I wasn't able to convert to TL1 for a tensor-optimized look-up table, so I used I2_S (Integer 2-bit Symmetric) instead.
You might need to increase the resources available to Rancher Desktop (or Docker Desktop) to get decent performance; I was seeing ~20-30 tokens/s. I used QEMU emulation and haven't yet tested the Apple Virtualization framework.
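Before tuning anything, you can check what the VM is actually exposing to containers; docker info reports the CPU count and memory the engine sees (the format string below is just one way to pull out those two fields).
# Check the CPUs and memory available inside the Rancher Desktop / Docker Desktop VM
docker info --format 'CPUs: {{.NCPU}}  Memory: {{.MemTotal}} bytes'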
I put together the steps to get this working on ARM from Bijan Bowen's blog.
git clone https://github.com/ajsween/bitnet-b1-58-arm-docker.git
cd bitnet-b1-58-arm-docker
docker build -t bitnet-b1.58-2b-4t-arm:latest .
docker run -it --rm bitnet-b1.58-2b-4t-arm:latest
docker run --rm bitnet-b1.58-2b-4t-arm:latest \
-m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
-p "How do I change a tire?\n" \
-t 4 \
-c 4096 \
--temp 0.4 \
-n 1024 2>/dev/null
I find the statistics that come through when STDERR is included in STDOUT useful, but if you want to see only the prompts and responses, remove the pseudo-TTY option (-t) so STDERR can be redirected separately (see the variant after the usage listing below).
docker run -i --rm bitnet-b1.58-2b-4t-arm:latest
usage: run_inference.py [-h] [-m MODEL] [-n N_PREDICT] -p PROMPT [-t THREADS] [-c CTX_SIZE] [-temp TEMPERATURE] [-cnv]
Run inference
optional arguments:
  -h, --help            show this help message and exit
  -m MODEL, --model MODEL
                        Path to model file
  -n N_PREDICT, --n-predict N_PREDICT
                        Number of tokens to predict when generating text
  -p PROMPT, --prompt PROMPT
                        Prompt to generate text from
  -t THREADS, --threads THREADS
                        Number of threads to use
  -c CTX_SIZE, --ctx-size CTX_SIZE
                        Size of the prompt context
  -temp TEMPERATURE, --temperature TEMPERATURE
                        Temperature, a hyperparameter that controls the randomness of the generated text
  -cnv, --conversation  Whether to enable chat mode or not (for instruct models.)
                        (When this option is turned on, the prompt specified by -p will be used as the system prompt.)
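If you'd rather keep those statistics than discard them, leave -t off and redirect STDERR to a file. The run below reuses the model path and prompt from the one-shot example above; bitnet-stats.log is just a name I picked.
# Same one-shot run as above, but capture the generation statistics in a file
# (without a TTY, STDERR stays separate from the response on STDOUT)
docker run --rm bitnet-b1.58-2b-4t-arm:latest \
    -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
    -p "How do I change a tire?\n" \
    -n 1024 2>bitnet-stats.log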
apt update && apt install -y \
python3-pip python3-dev cmake build-essential \
git software-properties-common wget
wget -O - https://apt.llvm.org/llvm.sh | bash -s 18
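It's worth confirming the toolchain at this point, since the export CC/CXX step below assumes the versioned clang-18 binaries that the LLVM script installs.
# Sanity-check that the LLVM 18 toolchain and CMake are on the PATH
clang-18 --version
clang++-18 --version
cmake --version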
git clone --recursive https://github.com/microsoft/BitNet.git
cd BitNet
pip install -r requirements.txt
python utils/codegen_tl1.py \
--model bitnet_b1_58-3B \
--BM 160,320,320 \
--BK 64,128,64 \
--bm 32,64,32
export CC=clang-18 CXX=clang++-18
rm -rf build && mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)
cd ..
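A quick listing confirms the build produced the inference binaries; with this llama.cpp-style CMake layout they should end up under build/bin, which, as far as I can tell, is where run_inference.py looks for them.
# Confirm the build produced binaries before moving on to the model download
ls -l build/bin/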
huggingface-cli download microsoft/BitNet-b1.58-2B-4T-gguf \
--local-dir models/BitNet-b1.58-2B-4T
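If the TL1 codegen/CMake route above gives you trouble (the TL1 conversion is what failed for me), the upstream BitNet repo also ships a setup_env.py helper that prepares an I2_S build against the model directory downloaded above. The flags here are my reading of the Microsoft BitNet README, so verify them against your checkout.
# Assumed alternative from the upstream README: set up an I2_S build instead of TL1
python setup_env.py -md models/BitNet-b1.58-2B-4T -q i2_s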
python run_inference.py \
-m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
-p "Hello from BitNet on Pi4!" -cnv
python run_inference.py \
-m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
-p "Hello from BitNet running on ARM in a container on a Apple M4 Max!" \
-cnv -t 4 -c 2048