
Conversation

@jiqing-feng
Contributor

The fused kernel gives roughly a 4x speed-up in TPOT (time per output token) for 4-bit model inference compared to the dequant+matmul path. For the next optimization, targeting TTFT (time to first token), we need to bring in libxsmm.
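As a rough illustration of the intended user-facing path (not part of this PR's diff; the checkpoint id is a placeholder, and whether generation actually hits the fused CPU kernel depends on the AVX512-BF16 and inference-only gating discussed later in this thread):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.2-1B"  # placeholder; any causal LM checkpoint

# Standard 4-bit NF4 quantization config; bf16 compute matches the AVX512-BF16 path.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="cpu"
)

inputs = tokenizer("The fused 4-bit CPU kernel", return_tensors="pt")
with torch.inference_mode():  # inference only: the fused path is gated on not training
    out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```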

@jiqing-feng jiqing-feng marked this pull request as ready for review November 14, 2025 05:30
@github-actions

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@jiqing-feng
Contributor Author

Hi @matthewdouglas. All CI has passed and I have rebased the PR. Please let me know what needs to be changed before merging. Thanks!

Comment on lines 519 to 529
if (
not self.enable_optimized_cpu
and x.device.type == "cpu"
and has_avx512bf16()
and not self.training
and x.requires_grad == False
):
self.weight.data, quant_state = convert_weight_packed_for_cpu(self.weight.data, quant_state)
self.enable_optimized_cpu = True
quant_state.enable_optimized_cpu = True

Member

There are a couple of things I'm wondering about:

When we serialize from CPU after running through forward(), we probably still want to be compatible with other devices. I'm thinking that when serializing we want to undo this transformation if it's present (see the sketch below).

Possibly an edge concern, but if we do a forward pass on CPU and then move to an accelerator, what would happen? I assume the weights are then in the wrong order?

@SunMarc I would appreciate any feedback you might have on this part!
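For illustration, a minimal sketch of that undo step, assuming a reverse helper exists (the helper name and subclass below are hypothetical, not the PR's actual code):

```python
import bitsandbytes as bnb


def revert_weight_packed_for_cpu(weight, quant_state):
    """Hypothetical inverse of convert_weight_packed_for_cpu; stands in for
    whatever reverse conversion gets added."""
    raise NotImplementedError


class PortableSaveLinear4bit(bnb.nn.Linear4bit):
    def _save_to_state_dict(self, destination, prefix, keep_vars):
        # If the weights were repacked for the CPU fused kernel, restore the
        # device-agnostic packing before they are written to the state dict.
        if getattr(self, "enable_optimized_cpu", False):
            self.weight.data, self.weight.quant_state = revert_weight_packed_for_cpu(
                self.weight.data, self.weight.quant_state
            )
            self.enable_optimized_cpu = False
            self.weight.quant_state.enable_optimized_cpu = False
        super()._save_to_state_dict(destination, prefix, keep_vars)
```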

Contributor

For me, I prefer that we stick with only one packing format for serialization, and have all other hardware / kernels convert from this packing format at initialization or during the forward pass, as we do here.

So we need a way to disable serialization, or to send a warning when someone tries to do that (see the sketch below). This is probably something that we can do in transformers, as I think most of the models are serialized from there.

Also, instead of enable_optimized_cpu, maybe we can rename it to packing_format?

Possibly an edge concern, but if we do a forward pass on CPU and then move to an accelerator, what would happen? I assume the weights are then in the wrong order?

Either we re-convert the weights for CUDA (but this opens the door to many conversion functions between all the packing formats), or we just raise an error asking users to only run the model on one device.
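A hedged sketch of the warning option (illustrative only; it keeps the enable_optimized_cpu flag from the diff, which could become packing_format as suggested above):

```python
import warnings

import torch.nn as nn


def warn_if_cpu_packed(model: nn.Module) -> None:
    # Scan for Linear4bit modules whose weights were repacked for the CPU fused
    # kernel; such a check could run in transformers before save_pretrained().
    packed = [
        name
        for name, module in model.named_modules()
        if getattr(module, "enable_optimized_cpu", False)
    ]
    if packed:
        warnings.warn(
            f"{len(packed)} module(s) (e.g. {packed[0]!r}) use the CPU-optimized "
            "4-bit packing; the saved checkpoint may not load correctly on other "
            "devices. Revert to the default packing before serializing."
        )
```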

Contributor

@SunMarc SunMarc left a comment


Left a comment !


@jiqing-feng
Contributor Author

jiqing-feng commented Nov 20, 2025

Hi @matthewdouglas. bitsandbytes only loads one native library at a time (CPU, CUDA, or XPU), which means we can only build one .so file for bnb. But we cannot build the CPU and XPU backends together, because the CPU kernels rely on Intel OpenMP (libiomp5.so) while XPU relies on GNU OpenMP (libgomp.so); building them together raises errors like: libbitsandbytes_xpu.so: undefined symbol: __kmpc_for_static_init_8. I suppose it's the same for CUDA. And without OpenMP, the CPU kernel might be even slower than the Python ops, and there might be other incompatible flags across backends.

At the current stage we can only build one backend, so the CPU packing format will not be triggered on other backends. Even so, I added the reverse logic in case we want to support multiple backends in the future (see the sketch below).

cc @SunMarc
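For reference, a hedged sketch of what a revert-on-move hook could look like (the revert helper is hypothetical, as in the earlier sketch; the PR's actual reverse logic is not shown in this thread):

```python
import torch

import bitsandbytes as bnb


def revert_weight_packed_for_cpu(weight, quant_state):
    """Hypothetical inverse of convert_weight_packed_for_cpu."""
    raise NotImplementedError


class AutoRevertLinear4bit(bnb.nn.Linear4bit):
    def _apply(self, fn, recurse=True):
        # .to()/.cuda()/.xpu() all funnel through _apply; probe where tensors
        # would land, and restore the portable packing before leaving CPU.
        if getattr(self, "enable_optimized_cpu", False):
            target = fn(torch.empty(0, device="cpu")).device
            if target.type != "cpu":
                self.weight.data, self.weight.quant_state = revert_weight_packed_for_cpu(
                    self.weight.data, self.weight.quant_state
                )
                self.enable_optimized_cpu = False
                self.weight.quant_state.enable_optimized_cpu = False
        return super()._apply(fn, recurse)
```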
