feat: add KL-divergence evaluation tool for quantized models#2439
NJX-njx wants to merge 1 commit into `vllm-project:main`
Conversation
Ref vllm-project#2031

Implement a tool to evaluate how well quantized models preserve the original model's probability distribution, using KL divergence.

Features:
- Computes KLD(base || target) and KLD(target || base), since KL divergence is asymmetric
- Reports per-sample and aggregate statistics
- Supports any HuggingFace causal LM and dataset
- Python API via `evaluate_kl_divergence()`
- CLI via `python -m llmcompressor.evaluation.kl_divergence`
- Memory-efficient: logits are moved to CPU immediately to free GPU VRAM

Files added:
- `src/llmcompressor/evaluation/__init__.py`
- `src/llmcompressor/evaluation/__main__.py`
- `src/llmcompressor/evaluation/kl_divergence.py`
- `tests/llmcompressor/evaluation/test_kl_divergence.py`
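The core metric described above, per-token KLD(base || target) computed from the two models' logits, can be sketched as follows. This is a minimal illustration of the technique, not the PR's exact implementation; the helper name `kl_per_token` is hypothetical.

```python
import torch
import torch.nn.functional as F


def kl_per_token(base_logits: torch.Tensor, target_logits: torch.Tensor) -> torch.Tensor:
    """Per-token KLD(base || target) from raw logits of shape (seq_len, vocab)."""
    base_logp = F.log_softmax(base_logits, dim=-1)
    target_logp = F.log_softmax(target_logits, dim=-1)
    # KL(p || q) = sum_v p(v) * (log p(v) - log q(v)), summed over the vocab axis
    return (base_logp.exp() * (base_logp - target_logp)).sum(dim=-1)


# Sanity check: a distribution has zero divergence from itself
logits = torch.randn(10, 100)
print(torch.allclose(kl_per_token(logits, logits), torch.zeros(10), atol=1e-5))  # → True
```

The reverse direction, KLD(target || base), is the same call with the arguments swapped, which is why the tool reports both.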
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: this is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a KL-divergence evaluation tool designed to measure the statistical difference between the probability distributions of a base model and its quantized counterpart. By offering both a Python API and a command-line interface, it provides flexible options for developers and researchers to quantify the impact of quantization on model output distributions, thereby aiding in the development and validation of compressed models.
Activity
Code Review
The pull request introduces a new KL-divergence evaluation tool for comparing quantized models against their base counterparts, complete with both a Python API and a CLI. The implementation includes a KLDivergenceResult dataclass for storing statistics, core functions for calculating token-level KL divergence, and a comprehensive evaluate_kl_divergence function. Unit tests cover the core computation and integration with mocked models. Overall, the feature is well-structured and tested, but there are a few areas for improvement regarding tokenizer handling and unused parameters.
```python
else:
    base_model_obj = base_model
    base_model_id = getattr(base_model, "name_or_path", "base_model")
    tokenizer = AutoTokenizer.from_pretrained(base_model_id)
```
When base_model is an already-loaded torch.nn.Module object, base_model_id falls back to the string "base_model" if name_or_path is not present. Subsequently, AutoTokenizer.from_pretrained(base_model_id) will attempt to load a tokenizer using the literal string "base_model", which is unlikely to be a valid HuggingFace model ID or local path. This will lead to a runtime error when trying to load the tokenizer.
To fix this, if a model object is passed, the corresponding tokenizer should either be passed explicitly as an argument to evaluate_kl_divergence, or the base_model object must guarantee a valid name_or_path attribute that can be used to load its tokenizer. A safer approach would be to add a tokenizer parameter to evaluate_kl_divergence that can be optionally provided when base_model is an object.
```python
base_model_id = getattr(base_model, "name_or_path", None)
if tokenizer is None and base_model_id is None:
    raise ValueError(
        "If base_model is an object and no tokenizer is provided, "
        "base_model must have a 'name_or_path' attribute to load the tokenizer."
    )
if tokenizer is None:
    tokenizer = AutoTokenizer.from_pretrained(base_model_id)
```

```python
    text_column: str = "text",
    num_samples: int = 128,
    max_seq_length: int = 512,
    batch_size: int = 1,
```
The batch_size parameter is currently not utilized in the evaluate_kl_divergence function, as indicated by the comment "not used yet". If batching is not planned for immediate implementation, it might be clearer to remove this parameter for now to avoid confusion and reintroduce it when the functionality is ready. If it's a placeholder for future work, a more explicit comment on its purpose and future plans would be beneficial.
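If batching were implemented later, one consideration the parameter hides is that padded positions must be excluded from the KL average. A hedged sketch of that masking step (the helper `masked_mean_kl` is hypothetical, not part of the PR):

```python
import torch


def masked_mean_kl(per_token_kl: torch.Tensor, attention_mask: torch.Tensor) -> float:
    """Average per-token KL over real (non-padded) positions only.

    per_token_kl:   (batch, seq_len) KL values, padded positions included
    attention_mask: (batch, seq_len) with 1 for real tokens, 0 for padding
    """
    mask = attention_mask.to(per_token_kl.dtype)
    # Zero out padded positions, then divide by the count of real tokens
    return ((per_token_kl * mask).sum() / mask.sum()).item()


kl = torch.ones(2, 4)
mask = torch.tensor([[1, 1, 0, 0], [1, 1, 1, 0]])
print(masked_mean_kl(kl, mask))  # → 1.0
```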
Pull request overview
Adds a KL-divergence evaluation utility to compare a quantized/target causal LM against a base model, reporting forward and reverse KLD, with both a Python API and a CLI entrypoint.
Changes:
- Introduces
evaluate_kl_divergence()+KLDivergenceResultfor forward/reverse/symmetric KLD reporting. - Adds a CLI runnable module and exposes the API via
llmcompressor.evaluation. - Adds unit tests for the core KL computation plus network-dependent integration-style tests using tiny HF models/datasets.
Reviewed changes
Copilot reviewed 4 out of 5 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| `src/llmcompressor/evaluation/kl_divergence.py` | Implements the KLD evaluation API, core KL computation, dataset/model loading, and a CLI `main()`. |
| `src/llmcompressor/evaluation/__main__.py` | Adds a package-level `-m` entrypoint wrapper around the CLI. |
| `src/llmcompressor/evaluation/__init__.py` | Re-exports the evaluation API from the new evaluation package. |
| `tests/llmcompressor/evaluation/test_kl_divergence.py` | Adds unit + integration-style tests for the new evaluation functionality. |
```python
import argparse
import json
import logging
import sys
```
sys is imported but never used, which will fail linting (ruff/pyflakes) and add noise. Remove the unused import.
Suggested change: delete the `import sys` line.
```python
    base_model_id = base_model
else:
    base_model_obj = base_model
    base_model_id = getattr(base_model, "name_or_path", "base_model")
```
When base_model is passed as an already-loaded torch.nn.Module, the code derives base_model_id via getattr(base_model, "name_or_path", "base_model") and then calls AutoTokenizer.from_pretrained(base_model_id). For non-HF modules (or HF models without a meaningful name_or_path), this will raise at runtime (e.g., trying to load tokenizer for literal "base_model"). Consider either requiring an HF PreTrainedModel here, or add an explicit tokenizer (or tokenizer_id) parameter and use it when models are preloaded.
Suggested change:

```python
base_model_id = getattr(base_model, "name_or_path", None)
if not base_model_id:
    raise TypeError(
        "When passing a preloaded 'base_model', it must be a "
        "Hugging Face transformers.PreTrainedModel instance with a "
        "valid 'name_or_path' attribute so that the corresponding "
        "tokenizer can be loaded."
    )
```
```python
# Compute KL divergence per token (skip the last position which has no
# next-token prediction to compare against, but for logit comparison
# all positions are valid)
forward_kld = _kl_divergence_per_token(base_log_probs, target_log_probs)
reverse_kld = _kl_divergence_per_token(target_log_probs, base_log_probs)
```
The comment says to "skip the last position" but the implementation computes KL for all positions. Since KL here is just comparing distributions between models (no ground-truth next token needed), either update the comment to match the behavior or actually slice off the last timestep if that’s the intended metric.
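If the next-token-only interpretation is the intended one, the slicing the reviewer describes is a one-liner. A sketch under that assumption (the tensors here are stand-ins for the PR's per-position log-prob tensors):

```python
import torch

# Hypothetical stand-ins for the PR's base/target log-prob tensors, shape (seq_len, vocab)
base_log_probs = torch.log_softmax(torch.randn(8, 32), dim=-1)
target_log_probs = torch.log_softmax(torch.randn(8, 32), dim=-1)

# Drop the final timestep before computing KL, since the last position
# has no next-token prediction to compare against
forward_kld = (base_log_probs[:-1].exp()
               * (base_log_probs[:-1] - target_log_probs[:-1])).sum(dim=-1)
print(forward_kld.shape)  # → torch.Size([7])
```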
```python
"""Allow running KL-divergence evaluation as ``python -m llmcompressor.evaluation.kl_divergence``."""
```
This module docstring says it can be run as python -m llmcompressor.evaluation.kl_divergence, but src/llmcompressor/evaluation/__main__.py is actually executed by python -m llmcompressor.evaluation. Update the docstring to reflect the correct invocation (or drop this file if it’s not intended).
Suggested change:

```python
"""Allow running KL-divergence evaluation as ``python -m llmcompressor.evaluation``."""
```
```python
from llmcompressor.evaluation.kl_divergence import main

main()
```
Calling main() at import time makes llmcompressor.evaluation.__main__ unsafe to import (it will execute the CLI immediately). Even though __main__ is typically only executed via python -m, it’s safer to wrap the call in an if __name__ == "__main__": guard.
Suggested change:

```python
if __name__ == "__main__":
    main()
```
```python
class TestKLDivergencePerToken:
    """Tests for the core KL divergence computation."""

    def test_identical_distributions(self):
        """KL divergence of identical distributions should be zero."""
        logits = torch.randn(10, 100)
```
Most tests in this repo are explicitly marked (e.g. @pytest.mark.unit). Consider marking the pure tensor-level tests in this file as unit to match the suite’s marker conventions.
```python
class TestEvaluateKLDivergenceWithMocks:
    """Integration-style tests using tiny models."""

    @pytest.fixture
    def tiny_models(self):
        """Create two tiny randomly-initialized models for testing."""
        try:
```
TestEvaluateKLDivergenceWithMocks performs network-dependent HF config/dataset loads (and conditionally skips). To keep CI selection consistent, it should be marked @pytest.mark.integration like other network/third-party integration tests in this repo.
```python
# --- Evaluate ---
result = KLDivergenceResult(num_samples=len(ds))
total_forward_kld = 0.0
total_reverse_kld = 0.0
total_tokens = 0

for sample in tqdm(ds, desc="Evaluating KL divergence"):
    text = sample[text_column]
    inputs = tokenizer(
        text,
        return_tensors="pt",
        max_length=max_seq_length,
        truncation=True,
        padding=False,
    )

    input_ids = inputs["input_ids"]
    attention_mask = inputs.get("attention_mask")

    if input_ids.shape[1] < 2:
        continue
```
KLDivergenceResult.num_samples is initialized to len(ds), but the loop can continue (e.g., sequences that tokenize to <2 tokens). That makes num_samples inconsistent with the actual number of evaluated samples (and with the per-sample arrays lengths). Track an evaluated_samples counter and set result.num_samples from that (or add a separate field if you want both attempted and evaluated counts).
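A hedged sketch of the counter the reviewer suggests. `KLDivergenceResult` here is a minimal stand-in for the PR's dataclass, and the whitespace tokenizer is a placeholder for the real one; only the skip-then-count pattern is the point:

```python
from dataclasses import dataclass


@dataclass
class KLDivergenceResult:  # minimal stand-in for the PR's dataclass
    num_samples: int = 0


def count_evaluated(samples: list[str], min_tokens: int = 2) -> KLDivergenceResult:
    """Count only samples that survive the <2-token skip, then record that count."""
    evaluated = 0
    for text in samples:
        token_count = len(text.split())  # stand-in for the tokenizer's output length
        if token_count < min_tokens:
            continue
        evaluated += 1  # ... evaluate KL for this sample here ...
    return KLDivergenceResult(num_samples=evaluated)


print(count_evaluated(["a", "two tokens", "three token text"]).num_samples)  # → 2
```

This keeps `num_samples` consistent with the lengths of any per-sample arrays accumulated inside the loop.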
```python
    Run a forward pass and return logits (on CPU to save GPU memory).

    :param model: the causal LM
    :param input_ids: shape (1, seq_len)
    :param attention_mask: shape (1, seq_len) or None
    :return: logits tensor of shape (seq_len, vocab_size)
    """
    outputs = model(
        input_ids=input_ids.to(model.device),
        attention_mask=(
            attention_mask.to(model.device) if attention_mask is not None else None
        ),
    )
    # Move to CPU immediately to free GPU memory
    return outputs.logits[0].float().cpu()
```
_collect_logits always moves full (seq_len, vocab) logits to CPU (.cpu()). For GPU runs this can dominate runtime via device-to-host transfers, especially for large vocabularies. Consider computing log_softmax + KL on the model device and only moving the reduced per-token/per-sample scalars to CPU (or gate CPU offload behind an option).
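A device-side variant along the lines the reviewer suggests could look like this. It is a sketch with a hypothetical helper name; the key change is that only the `(seq_len,)` reduction crosses the device boundary, not the full `(seq_len, vocab)` logits:

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def forward_kl_on_device(base_logits: torch.Tensor,
                         target_logits: torch.Tensor) -> torch.Tensor:
    """Compute per-token KLD(base || target) on the logits' own device,
    then transfer only the per-position scalars to CPU."""
    base_logp = F.log_softmax(base_logits, dim=-1)
    target_logp = F.log_softmax(target_logits, dim=-1)
    per_token = (base_logp.exp() * (base_logp - target_logp)).sum(dim=-1)
    return per_token.cpu()  # tiny transfer: seq_len floats, not seq_len * vocab


kl = forward_kl_on_device(torch.randn(6, 1000), torch.randn(6, 1000))
print(kl.shape)  # → torch.Size([6])
```

For large vocabularies this trades the per-sample device-to-host copy down by a factor of the vocab size, at the cost of keeping both models' logits resident on the device during the reduction.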
```python
"""
Unit tests for the KL-divergence evaluation tool.

Tests core computation logic using small synthetic models and tensors.
Does not require GPU or large model downloads.
"""
```
The file-level docstring says these tests "do not require GPU or large model downloads", but the integration-style tests fetch a HF config and dataset over the network (and skip if unavailable). Either revise the docstring to reflect the network dependency, or rework the tests to be fully offline.
Suggested change:

```python
"""
Unit and integration tests for the KL-divergence evaluation tool.

Tests core computation logic using small synthetic models and tensors, and
includes integration-style tests with tiny models that may download small
configs/datasets over the network (tests are skipped if unavailable). These
tests do not require a GPU or large model downloads.
"""
```
Ref #2031 - Implements a KL-divergence eval tool with forward and reverse KLD, a Python API, and a CLI