
Conversation

@jiqing-feng
Contributor

The fused kernel gives roughly a 4x speed-up in TPOT (time per output token) for 4-bit model inference compared to the dequant+matmul path. For the next optimization, targeting TTFT (time to first token), we need to bring in libxsmm.
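As a rough illustration of the intended user-facing path (not part of this PR's diff; the checkpoint id is a placeholder, and whether generation actually hits the fused CPU kernel depends on the AVX512-BF16 and inference-only gating discussed later in this thread):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.2-1B"  # placeholder; any causal LM checkpoint

# Standard 4-bit NF4 quantization config; bf16 compute matches the AVX512-BF16 path.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="cpu"
)

inputs = tokenizer("The fused 4-bit CPU kernel", return_tensors="pt")
with torch.inference_mode():  # inference only: the fused path is gated on not training
    out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```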

@jiqing-feng jiqing-feng marked this pull request as ready for review November 14, 2025 05:30
@github-actions

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@jiqing-feng
Contributor Author

Hi @matthewdouglas. All CI has passed and I have rebased the PR. Please let me know what needs to be changed before merging. Thanks!

Comment on lines 519 to 529
if (
not self.enable_optimized_cpu
and x.device.type == "cpu"
and has_avx512bf16()
and not self.training
and x.requires_grad == False
):
self.weight.data, quant_state = convert_weight_packed_for_cpu(self.weight.data, quant_state)
self.enable_optimized_cpu = True
quant_state.enable_optimized_cpu = True

Member

There are a couple of things I'm wondering about:

When we serialize from CPU after running through forward(), we probably still want to be compatible with other devices. I'm thinking that when serializing we want to undo this transformation if it's present (see the sketch below).

Possibly an edge concern, but if we do a forward pass on CPU and then move to an accelerator, what would happen? I assume the weights are then in the wrong order?

@SunMarc I would appreciate any feedback you might have on this part!
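For illustration, a minimal sketch of that undo step, assuming a reverse helper exists (the helper name and subclass below are hypothetical, not the PR's actual code):

```python
import bitsandbytes as bnb


def revert_weight_packed_for_cpu(weight, quant_state):
    """Hypothetical inverse of convert_weight_packed_for_cpu; stands in for
    whatever reverse conversion gets added."""
    raise NotImplementedError


class PortableSaveLinear4bit(bnb.nn.Linear4bit):
    def _save_to_state_dict(self, destination, prefix, keep_vars):
        # If the weights were repacked for the CPU fused kernel, restore the
        # device-agnostic packing before they are written to the state dict.
        if getattr(self, "enable_optimized_cpu", False):
            self.weight.data, self.weight.quant_state = revert_weight_packed_for_cpu(
                self.weight.data, self.weight.quant_state
            )
            self.enable_optimized_cpu = False
            self.weight.quant_state.enable_optimized_cpu = False
        super()._save_to_state_dict(destination, prefix, keep_vars)
```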

Contributor

For me, I prefer that we stick with only one packing format for serialization, and have all other hardware / kernels convert from this packing format at initialization or during the forward pass, as we do here.

So we need a way to disable serialization, or to send a warning when someone tries to do that (see the sketch below). This is probably something that we can do in transformers, as I think most of the models are serialized from there.

Also, instead of enable_optimized_cpu, maybe we can rename it to packing_format?

Possibly an edge concern, but if we do a forward pass on CPU and then move to an accelerator, what would happen? I assume the weights are then in the wrong order?

Either we re-convert the weights for CUDA (but this opens the door to many conversion functions between all the packing formats), or we just raise an error asking users to only run the model on one device.
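A hedged sketch of the warning option (illustrative only; it keeps the enable_optimized_cpu flag from the diff, which could become packing_format as suggested above):

```python
import warnings

import torch.nn as nn


def warn_if_cpu_packed(model: nn.Module) -> None:
    # Scan for Linear4bit modules whose weights were repacked for the CPU fused
    # kernel; such a check could run in transformers before save_pretrained().
    packed = [
        name
        for name, module in model.named_modules()
        if getattr(module, "enable_optimized_cpu", False)
    ]
    if packed:
        warnings.warn(
            f"{len(packed)} module(s) (e.g. {packed[0]!r}) use the CPU-optimized "
            "4-bit packing; the saved checkpoint may not load correctly on other "
            "devices. Revert to the default packing before serializing."
        )
```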

Contributor

@SunMarc SunMarc left a comment


Left a comment !


@jiqing-feng
Contributor Author

jiqing-feng commented Nov 20, 2025

Hi @matthewdouglas. bitsandbytes only loads one native library at a time (CPU, CUDA, or XPU), which means we can only build one .so file for bnb. But we cannot build the CPU and XPU backends together, because the CPU kernels rely on Intel OpenMP (libiomp5.so) while XPU relies on GNU OpenMP (libgomp.so); building them together raises errors like: libbitsandbytes_xpu.so: undefined symbol: __kmpc_for_static_init_8. I suppose it's the same for CUDA. And without OpenMP, the CPU kernel might be even slower than the Python ops, and there might be other incompatible flags across backends.

At the current stage we can only build one backend, so the CPU packing format will not be triggered on other backends. Even so, I added the reverse logic in case we want to support multiple backends in the future (see the sketch below).

cc @SunMarc
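For reference, a hedged sketch of what a revert-on-move hook could look like (the revert helper is hypothetical, as in the earlier sketch; the PR's actual reverse logic is not shown in this thread):

```python
import torch

import bitsandbytes as bnb


def revert_weight_packed_for_cpu(weight, quant_state):
    """Hypothetical inverse of convert_weight_packed_for_cpu."""
    raise NotImplementedError


class AutoRevertLinear4bit(bnb.nn.Linear4bit):
    def _apply(self, fn, recurse=True):
        # .to()/.cuda()/.xpu() all funnel through _apply; probe where tensors
        # would land, and restore the portable packing before leaving CPU.
        if getattr(self, "enable_optimized_cpu", False):
            target = fn(torch.empty(0, device="cpu")).device
            if target.type != "cpu":
                self.weight.data, self.weight.quant_state = revert_weight_packed_for_cpu(
                    self.weight.data, self.weight.quant_state
                )
                self.enable_optimized_cpu = False
                self.weight.quant_state.enable_optimized_cpu = False
        return super()._apply(fn, recurse)
```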
