Bitnet #95
-
I was curious to see Microsoft's Bitnet performance on
The script warns that this is a debug build. Here is what I get with this repo:
22X (!!!) difference in prompt processing speed. 2.8X difference in token generation (TG) speed. TG is memory bound, so let's check what we get with just 1 thread. First theirs (be patient if you try it):
Then ours
Aha. 12.8X. Perhaps they did not turn on
Oops. Perhaps
Arghh. Comment out the
Running
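The single-thread check above makes sense because token generation at batch size 1 has to stream the entire weight tensor through memory for every token, so TG speed is capped by memory bandwidth no matter how many threads run. A back-of-the-envelope sketch of that ceiling (all constants below are assumptions for illustration, not measured values):

```python
# Roofline-style ceiling for token generation (TG) at batch size 1.
# Every generated token streams all weights through memory once, so
# TG tokens/s <= memory_bandwidth / bytes_of_weights, independent of
# thread count. Constants are assumed, not measured.

MEM_BW_GBPS = 400          # assumed M2 Max memory bandwidth, GB/s
PARAMS = 3.3e9             # the 3.3B Bitnet model discussed in the thread
BITS_PER_WEIGHT = 1.625    # ~1.58-bit ternary weights plus assumed packing overhead

bytes_per_token = PARAMS * BITS_PER_WEIGHT / 8          # weights streamed per token
tg_upper_bound = MEM_BW_GBPS * 1e9 / bytes_per_token    # tokens/s ceiling

print(f"weights streamed per token: {bytes_per_token / 1e9:.2f} GB")
print(f"TG ceiling at {MEM_BW_GBPS} GB/s: {tg_upper_bound:.0f} t/s")
```

With these assumed numbers the ceiling is a few hundred t/s; how close an implementation gets to it is what the single-thread experiment probes.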
-
OK, here is an apples-to-apples performance comparison on my M2-Max laptop between Microsoft's
The difference in performance decreases with model size, but that's just a matter of memory bandwidth saturation.
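The saturation argument can be made concrete with a toy model: time per generated token is a fixed per-token overhead plus the time to stream the weights. If the slower implementation mostly loses on overhead while both stream at similar bandwidth, its disadvantage shrinks as the streaming term grows with model size. All numbers below are hypothetical, purely to show the trend:

```python
# Toy model of why the TG gap narrows with model size:
# time/token = per-token overhead + weight-streaming time.
# As the streaming term grows, a fixed overhead difference matters less.
# All constants are hypothetical, for illustration only.

MEM_BW = 400e9                                  # assumed bytes/s
OVERHEAD_FAST, OVERHEAD_SLOW = 0.2e-3, 2.0e-3   # hypothetical seconds/token

ratios = []
for params_b in (0.7, 3.3, 8.0):                # hypothetical model sizes, billions
    stream_s = params_b * 1e9 * 1.625 / 8 / MEM_BW   # time to stream ~1.58-bit weights
    ratios.append((OVERHEAD_SLOW + stream_s) / (OVERHEAD_FAST + stream_s))
    print(f"{params_b}B: slow/fast TG time ratio = {ratios[-1]:.2f}x")
```

The ratio falls monotonically with model size, matching the observation above; at full bandwidth saturation both implementations tie.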
-
They updated the repo with the first official model (all previous models were just supported models and had far less training): https://huggingface.co/microsoft/bitnet-b1.58-2B-4T. It looks competitive at its size, as it was trained on 4T tokens.
-
A Microsoft team has released CPU inference code for 1.58-bit Bitnets. The repo, which is based 100% on llama.cpp and only adds Bitnet CPU kernels (ARM_NEON, AVX2), has 2.1k stars as of this writing. As per @Dampfinchen, "this is just insanity". Well, we have had Bitnet inference here for a while, for CPU and GPU, and faster than Microsoft's by quite some margin.
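For context on those "1.58-bit" kernels: log2(3) ≈ 1.58, because each weight is ternary, taking values in {-1, 0, +1}. One simple way to pack such weights is base-3 encoding of five weights per byte (3^5 = 243 ≤ 256, i.e. 1.6 bits/weight). A minimal sketch of that idea, which is not necessarily the layout either repo actually uses:

```python
import math

# Ternary weights {-1, 0, +1} carry log2(3) ~ 1.58 bits of information.
# A simple packing: base-3 encode 5 weights into one byte (3**5 = 243 <= 256),
# giving 1.6 bits/weight. Illustrative only; real kernels may pack differently.

def pack5(trits):
    """Pack 5 weights from {-1, 0, +1} into one byte (base-3 encoding)."""
    b = 0
    for t in reversed(trits):
        b = b * 3 + (t + 1)        # map -1,0,+1 -> 0,1,2
    return b

def unpack5(b):
    """Recover the 5 ternary weights from one packed byte."""
    out = []
    for _ in range(5):
        out.append(b % 3 - 1)
        b //= 3
    return out

w = [1, -1, 0, 0, 1]
assert unpack5(pack5(w)) == w      # lossless round trip
print(f"log2(3) = {math.log2(3):.4f} bits/weight, packed byte: {pack5(w)}")
```

The fast kernels then decode and multiply such packed blocks with SIMD (ARM_NEON/AVX2), which is where the implementations differ.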
There is a screen recording in their repo demoing the 3.3B Bitnet model writing a 900 token essay and achieving 71 t/s on M2 Ultra. Here is a screen recording from my M2-Max laptop (~1/2 the computing power and memory bandwidth of M2 Ultra) getting 74 t/s on the same prompt.
m2_max_cpu.mp4
And here it is running on the M2-Max 30-core GPU
m2_max_gpu.mp4
Finally, here it is running on an RTX-4080
cuda.mp4
The prompt is very short (9 tokens), but it is still worth noting that Microsoft's implementation processes the prompt at a rate of 85 t/s, while here we get 157 t/s with half the computing power.
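Under the stated assumption that the M2 Max has roughly half the M2 Ultra's computing power, the prompt-processing numbers above can be normalized for hardware:

```python
# Prompt-processing comparison from the post: bitnet.cpp at 85 t/s on
# M2 Ultra vs 157 t/s here on M2 Max, with the M2 Max assumed (as stated
# above) to have about half the M2 Ultra's computing power.

THEIRS_TPS, OURS_TPS = 85.0, 157.0   # prompt processing, tokens/s
HW_RATIO = 0.5                       # assumed M2 Max : M2 Ultra compute

raw_speedup = OURS_TPS / THEIRS_TPS
per_hw_speedup = raw_speedup / HW_RATIO   # speedup per unit of hardware
print(f"raw: {raw_speedup:.2f}x, hardware-normalized: ~{per_hw_speedup:.1f}x")
```

So even on this tiny 9-token prompt the gap is roughly 1.8x raw, or on the order of 3.7x once the hardware difference is factored in.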