feat (quant/mx): Added midmax scale rounding option to MX types#1409

Merged
nickfraser merged 12 commits into Xilinx:dev from nickfraser:feat/midmax
Dec 2, 2025
Conversation


@nickfraser nickfraser commented Nov 6, 2025

Adds a "midmax" rounding mode for the shared scale in MX datatypes, and plugs MidMax scaling into the LLM example. The standard mode (as referenced in the OCP MX datatype spec) is "floor" and computes:

$$ po2\_shared\_scale = \lfloor \log_2 (\| x \|_\infty) \rfloor $$

Midmax replaces this floor operation with a special rounding mode that reduces the rounding error in the maximum value in $x$, at the cost of potentially increasing the rounding error in the smallest values in $x$.
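To illustrate the idea, here is a minimal sketch of the two scale-rounding modes. The `floor` rule follows the OCP MX spec formula above; the `midmax` variant is a hypothetical reconstruction (not the actual Brevitas implementation), which picks between the floor scale and the next power of two based on which leaves the block maximum with the smaller rounding error. The integer element grid and `elem_max` value are simplifying assumptions for illustration only:

```python
import math

def floor_po2_scale(absmax: float) -> float:
    # Standard OCP MX rule: shared scale is 2**floor(log2(max|x|)).
    return 2.0 ** math.floor(math.log2(absmax))

def midmax_po2_scale(absmax: float, elem_max: int = 6) -> float:
    # Hypothetical "midmax"-style rule (a sketch, NOT the Brevitas
    # implementation): try the floor scale and the next power of two,
    # and keep whichever leaves the block maximum with the smaller
    # rounding error on a simple integer element grid capped at
    # elem_max (e.g. 6, the largest magnitude of FP4 e2m1).
    lo = floor_po2_scale(absmax)

    def max_error(scale: float) -> float:
        q = min(round(absmax / scale), elem_max)
        return abs(absmax - q * scale)

    return min((lo, 2.0 * lo), key=max_error)
```

Either way, the scale stays a power of two, so only the rounding error of the block maximum (and, indirectly, of the smallest elements) changes between the two modes.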

Rerunning the experiments from the Post-Training Model Expansion paper, we get the following results:

| model | spinquant | expansion_step | scale_round_func | float_ppl | quant_ppl | ARC-C | ARC-E | HS | WG | PIQA | all_acc |
|---|---|---|---|---|---|---|---|---|---|---|---|
| meta-llama/Llama-3.2-1B | False | 0 | floor | 8.938 | 11.694 | 0.289 | 0.597 | 0.418 | 0.559 | 0.693 | 0.511 |
| meta-llama/Llama-3.2-1B | False | 0 | midmax | 8.938 | 11.574 | 0.289 | 0.610 | 0.425 | 0.556 | 0.712 | 0.519 |
| meta-llama/Llama-3.2-1B | False | 7 | floor | 8.938 | 11.452 | 0.283 | 0.611 | 0.426 | 0.575 | 0.701 | 0.519 |
| meta-llama/Llama-3.2-1B | False | 7 | midmax | 8.938 | 11.241 | 0.272 | 0.619 | 0.430 | 0.574 | 0.701 | 0.519 |
| meta-llama/Llama-3.2-1B | True | 0 | floor | 8.938 | 11.518 | 0.305 | 0.628 | 0.422 | 0.569 | 0.709 | 0.527 |
| meta-llama/Llama-3.2-1B | True | 0 | midmax | 8.938 | 11.552 | 0.293 | 0.595 | 0.424 | 0.562 | 0.707 | 0.516 |
| meta-llama/Llama-3.2-1B | True | 7 | floor | 8.938 | 11.359 | 0.298 | 0.590 | 0.433 | 0.556 | 0.697 | 0.515 |
| meta-llama/Llama-3.2-1B | True | 7 | midmax | 8.938 | 11.294 | 0.303 | 0.606 | 0.430 | 0.559 | 0.699 | 0.519 |

Note that MidMax has not been thoroughly tested beyond the OCP MX v1 spec datatypes, and should be tested further before being applied to other types.

Also, while adding this feature, I took the opportunity to factor some code duplicated between MXWeightMixin and MXActMixin into a parent class (MXMixin).

@nickfraser nickfraser marked this pull request as draft November 6, 2025 12:56
@nickfraser nickfraser added the do not merge This should not be merged just yet label Nov 6, 2025
@nickfraser nickfraser self-assigned this Nov 6, 2025
@pablomlago pablomlago self-requested a review November 6, 2025 15:56
@nickfraser nickfraser removed the do not merge This should not be merged just yet label Nov 10, 2025
@nickfraser nickfraser marked this pull request as ready for review November 10, 2025 14:21
@nickfraser nickfraser changed the title feat (ex/llm): Added midmax rounding to LLM example feat (quant/mx): Added midmax scale rounding option to MX types Nov 19, 2025
@nickfraser nickfraser requested a review from Giuseppe5 November 21, 2025 15:03
Collaborator Author

@nickfraser nickfraser left a comment

1 comment, otherwise ready for review.

@nickfraser nickfraser requested review from Giuseppe5 and removed request for Giuseppe5 and pablomlago December 2, 2025 12:31
Collaborator

@Giuseppe5 Giuseppe5 left a comment

One small change, then it can be merged.

@nickfraser nickfraser merged commit 587494a into Xilinx:dev Dec 2, 2025
29 checks passed
@nickfraser nickfraser deleted the feat/midmax branch December 2, 2025 16:42