Commit d9aa3c1
[SW-235047] Use W8A8 path for per_channel scaling to fix a performance regression (#1629)
https://jira.habana-labs.com/browse/SW-235047

## Essential Elements of an Effective PR Description Checklist

- [X] The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
- [ ] The test plan, such as providing a test command.
- [ ] The test results, such as pasting a before/after comparison or e2e results.

## Purpose

Previously, every HPU fp8 linear went through `hpu_ops.apply_fp8_linear_hpu`. That is unnecessary for per_channel scaling, since the W8A8 path also supports HPU, and it introduced a performance regression for WOQ models. This PR skips per_channel-scaled fp8 in `hpu_ops.apply_fp8_linear_hpu` so those cases take the W8A8 path instead.

## Test Plan

## Test Result

Signed-off-by: Chendi.Xue <[email protected]>
Parent: 93b8bad

1 file changed: +1 −1

vllm/model_executor/layers/quantization/fp8.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -473,7 +473,7 @@ def apply(self,
             use_aiter_and_is_supported=self.use_aiter_and_is_supported,
         )

-        if current_platform.is_hpu():
+        if self.block_quant and current_platform.is_hpu():
             if layer.weight_scale.dim() > 1:
                 weight_scale = layer.weight_scale.transpose(0, 1)
             else:
```
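The effect of the one-line change is easiest to see as a dispatch condition. Below is a minimal, hypothetical sketch, not vLLM's actual code: `block_quant` and `is_hpu` stand in for `self.block_quant` and `current_platform.is_hpu()` in `fp8.py`, and the returned strings merely label the two paths.

```python
def choose_fp8_path(block_quant: bool, is_hpu: bool) -> str:
    """Hypothetical helper labeling which fp8 linear path is taken.

    Before this commit, any fp8 linear on HPU took the HPU-specific path,
    including per_channel scaling, which regressed WOQ performance.
    After this commit, only block-quantized fp8 on HPU takes
    hpu_ops.apply_fp8_linear_hpu; per_channel scaling falls through to
    the generic W8A8 path, which HPU also supports.
    """
    if block_quant and is_hpu:
        return "hpu_ops.apply_fp8_linear_hpu"
    return "w8a8"


# Per_channel scaling on HPU now uses the W8A8 path:
assert choose_fp8_path(block_quant=False, is_hpu=True) == "w8a8"
# Block-quantized fp8 on HPU still uses the HPU-specific path:
assert choose_fp8_path(block_quant=True, is_hpu=True) == "hpu_ops.apply_fp8_linear_hpu"
```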
