[NVFP4] Use observers to generate global weight scales (#1504)
SUMMARY:
- Requires: neuralmagic/compressed-tensors#339
- Uses observers to generate global weight scales. These were previously
generated during the init function in compressed-tensors; however,
using observers is more consistent with our workflows and parameter
lifecycle
- Also moves the fused layer update step into llmcompressor. This can
be removed once we have an update from vLLM. For now, it
requires us to split up the `update_weight_global_scale` and
`update_weight_zp_scale` steps; these can be combined once the vLLM
change is made
- Updates examples to include sample generation, which is now very quick
thanks to this PR:
neuralmagic/compressed-tensors#336
Note: The MSE observer is tightly coupled to generating a scale and
zero-point, so it can't be used for global scale generation at the
moment. We will have to decouple this functionality in order to support
general scale optimization.
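
For readers unfamiliar with the global scale, the following is a minimal sketch of how a min-max style observer could derive it, and of how a fused group (for example q/k/v) could end up sharing one value. It is not the llmcompressor or compressed-tensors implementation: the function names are hypothetical, the formula `FP8_E4M3_MAX * FP4_E2M1_MAX / amax` is an assumption about the NVFP4 scheme, and taking the minimum across the fused group is likewise an assumption about the fused layer update step.

```python
# Illustrative sketch only. Function names are hypothetical and are not
# part of the llmcompressor or compressed-tensors API.
FP8_E4M3_MAX = 448.0  # largest finite FP8 E4M3 value (format of the local scales)
FP4_E2M1_MAX = 6.0    # largest finite FP4 E2M1 value (format of the weights)


def minmax_global_scale(weights):
    """Derive a per-tensor global scale from the observed absolute maximum."""
    amax = max(abs(w) for row in weights for w in row)
    if amax == 0.0:
        return 1.0  # assumption: fall back to 1.0 for an all-zero tensor
    return FP8_E4M3_MAX * FP4_E2M1_MAX / amax


def fuse_global_scales(scales):
    """Assumed fused-layer rule: every layer in the group shares the smallest scale."""
    return min(scales)


# Toy 2x2 "weights" standing in for fused q/k/v projections.
q = [[0.5, -2.0], [1.0, 0.25]]
k = [[4.0, -1.0], [0.5, 3.0]]
v = [[-0.75, 0.1], [0.2, 0.3]]

per_layer = [minmax_global_scale(w) for w in (q, k, v)]
shared = fuse_global_scales(per_layer)
```

Under these assumptions the smallest scale belongs to the layer with the largest absolute weight, so sharing it keeps every layer in the group within range. It also suggests why the two steps are split: the per-group scales would be computed only after the shared global scale is settled.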
TEST PLAN:
- Tested e2e with nvfp4 and nvfp4a16
- Validated that existing workflows work e2e (w4a16, sparse2of4 + fp8, w8a8
int8)