
AutoFP8 to llmcompressor migration for FP8 quantization#2701

Merged
siddvenk merged 1 commit into deepjavalibrary:master from a-ys:llm-compressor-upgrade
Feb 2, 2025

Conversation


@a-ys a-ys commented Feb 1, 2025

Description

Migrates the existing FP8 quantization functionality from AutoFP8 to llmcompressor. The quantization recipe is FP8 weight and activation quantization with static activation scales. Calibration uses the cnn-dailymail dataset, defaulting to 512 samples at a sequence length of 2048.

Pins the previous version of llm-compressor (0.3.1) due to a compatibility issue between llm-compressor 0.4.0 and transformers 2.5.2, which is the current version in djl-serving.
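The recipe above (FP8 weights and activations with static, calibration-derived activation scales) reduces to computing a fixed per-tensor scale that maps the observed dynamic range onto the FP8 E4M3 format. A minimal, illustrative sketch of that scale math, independent of llmcompressor (function names are hypothetical, not the library's API):

```python
FP8_E4M3_MAX = 448.0  # largest finite value in the OCP FP8 E4M3 format


def static_fp8_scale(calibration_max_abs: float) -> float:
    """Per-tensor scale: maps the max absolute value seen during
    calibration onto the FP8 E4M3 representable range."""
    return calibration_max_abs / FP8_E4M3_MAX


def fake_quantize(x: float, scale: float) -> float:
    """Simulated quantize/dequantize: scale down, clamp to the FP8
    range, scale back up. Real kernels additionally round to the
    nearest representable FP8 value and store the FP8 bits."""
    q = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x / scale))
    return q * scale
```

Because the scale is fixed at calibration time ("static"), inference needs no per-batch max reduction, at the cost of clipping activations that exceed the calibrated range.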

TODOs

  • Update CI tests in lmi-distro to verify these changes.
  • Remove AutoFP8 dependency from container.
  • Future feature support (prioritization tbd)
    • MoE model quantization. (May run with existing code, but will not ignore the correct layers.)
    • KV cache quantization with static scales.
    • Calibration dataset selection
    • AWQ migration (not yet supported in llmcompressor)

Type of change

  • New feature (non-breaking change which adds functionality)

Checklist:

  • Please add the link of Integration Tests Executor run with related tests.
  • Have you manually built the docker image and verified the change?
  • Have you run related tests? Check how to set up the test environment here; One example would be pytest tests.py -k "TestCorrectnessLmiDist" -m "lmi_dist"
  • Have you added tests that prove your fix is effective or that this feature works?
  • Has code been commented, particularly in hard-to-understand areas?
  • Have you made corresponding changes to the documentation?

Feature/Issue validation/testing

  • Tested quantization of tinyllama with llmcompressor and serving with the v14 v2 preview container through the Neo workflow.

@a-ys a-ys requested review from a team and zachgk as code owners February 1, 2025 05:40
"will not include this field.")

if output_properties.get("option.quantize") == "fp8":
output_properties["option.quantize"] = "compressed-tensors"
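The diff hunk above rewrites the `option.quantize` property so that downstream serving loads the llmcompressor-produced checkpoint through the compressed-tensors path. A standalone sketch of that remapping, with a plain dict standing in for the real properties object (the helper name is hypothetical):

```python
def remap_quantize_property(output_properties: dict) -> dict:
    """FP8 checkpoints produced by llmcompressor are saved in the
    compressed-tensors format, so the quantize property is rewritten
    to match what the serving engine expects to load."""
    if output_properties.get("option.quantize") == "fp8":
        output_properties["option.quantize"] = "compressed-tensors"
    return output_properties
```

Any other value (or an absent key) passes through unchanged, so only the migrated FP8 path is affected.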

does this work for both lmi-dist (vllm 0.6.3.post1) and vanilla vllm (0.7.0)?

@siddvenk siddvenk merged commit d4f5ee7 into deepjavalibrary:master Feb 2, 2025
9 checks passed

siddvenk commented Feb 2, 2025

merging this for now - lgtm. let's validate this again with the vllm update for both lmi-dist/vllm

@a-ys mentioned this pull request Feb 5, 2025