feat: add KL-divergence evaluation tool for quantized models#2439
NJX-njx wants to merge 1 commit into `vllm-project:main`
Conversation
Ref vllm-project#2031

Implement a tool to evaluate how well quantized models preserve the original model's probability distribution, using KL divergence.

Features:
- Computes KLD(base || target) and KLD(target || base), since KL divergence is asymmetric
- Reports per-sample and aggregate statistics
- Supports any HuggingFace causal LM and dataset
- Python API via `evaluate_kl_divergence()`
- CLI via `python -m llmcompressor.evaluation.kl_divergence`
- Memory-efficient: logits are moved to CPU immediately to free GPU VRAM

Files added:
- `src/llmcompressor/evaluation/__init__.py`
- `src/llmcompressor/evaluation/__main__.py`
- `src/llmcompressor/evaluation/kl_divergence.py`
- `tests/llmcompressor/evaluation/test_kl_divergence.py`
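The core metric described above, per-token KLD(base || target) computed from the two models' logits, can be sketched as follows. This is a minimal illustration of the technique, not the PR's exact implementation; the helper name `kl_per_token` is hypothetical.

```python
import torch
import torch.nn.functional as F


def kl_per_token(base_logits: torch.Tensor, target_logits: torch.Tensor) -> torch.Tensor:
    """Per-token KLD(base || target) from raw logits of shape (seq_len, vocab)."""
    base_logp = F.log_softmax(base_logits, dim=-1)
    target_logp = F.log_softmax(target_logits, dim=-1)
    # KL(p || q) = sum_v p(v) * (log p(v) - log q(v)), summed over the vocab axis
    return (base_logp.exp() * (base_logp - target_logp)).sum(dim=-1)


# Sanity check: a distribution has zero divergence from itself
logits = torch.randn(10, 100)
print(torch.allclose(kl_per_token(logits, logits), torch.zeros(10), atol=1e-5))  # → True
```

The reverse direction, KLD(target || base), is the same call with the arguments swapped, which is why the tool reports both.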
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: this is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a KL-divergence evaluation tool designed to measure the statistical difference between the probability distributions of a base model and its quantized counterpart. By offering both a Python API and a command-line interface, it provides flexible options for developers and researchers to quantify the impact of quantization on model output distributions, thereby aiding in the development and validation of compressed models.
Activity
Code Review
The pull request introduces a new KL-divergence evaluation tool for comparing quantized models against their base counterparts, complete with both a Python API and a CLI. The implementation includes a KLDivergenceResult dataclass for storing statistics, core functions for calculating token-level KL divergence, and a comprehensive evaluate_kl_divergence function. Unit tests cover the core computation and integration with mocked models. Overall, the feature is well-structured and tested, but there are a few areas for improvement regarding tokenizer handling and unused parameters.
```python
else:
    base_model_obj = base_model
    base_model_id = getattr(base_model, "name_or_path", "base_model")
    tokenizer = AutoTokenizer.from_pretrained(base_model_id)
```
When base_model is an already-loaded torch.nn.Module object, base_model_id falls back to the string "base_model" if name_or_path is not present. Subsequently, AutoTokenizer.from_pretrained(base_model_id) will attempt to load a tokenizer using the literal string "base_model", which is unlikely to be a valid HuggingFace model ID or local path. This will lead to a runtime error when trying to load the tokenizer.
To fix this, if a model object is passed, the corresponding tokenizer should either be passed explicitly as an argument to evaluate_kl_divergence, or the base_model object must guarantee a valid name_or_path attribute that can be used to load its tokenizer. A safer approach would be to add a tokenizer parameter to evaluate_kl_divergence that can be optionally provided when base_model is an object.
```python
base_model_id = getattr(base_model, "name_or_path", None)
if tokenizer is None and base_model_id is None:
    raise ValueError(
        "If base_model is an object and no tokenizer is provided, "
        "base_model must have a 'name_or_path' attribute to load the tokenizer."
    )
if tokenizer is None:
    tokenizer = AutoTokenizer.from_pretrained(base_model_id)
```

```python
    text_column: str = "text",
    num_samples: int = 128,
    max_seq_length: int = 512,
    batch_size: int = 1,
```
The batch_size parameter is currently not utilized in the evaluate_kl_divergence function, as indicated by the comment "not used yet". If batching is not planned for immediate implementation, it might be clearer to remove this parameter for now to avoid confusion and reintroduce it when the functionality is ready. If it's a placeholder for future work, a more explicit comment on its purpose and future plans would be beneficial.
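If batching were implemented later, one consideration the parameter hides is that padded positions must be excluded from the KL average. A hedged sketch of that masking step (the helper `masked_mean_kl` is hypothetical, not part of the PR):

```python
import torch


def masked_mean_kl(per_token_kl: torch.Tensor, attention_mask: torch.Tensor) -> float:
    """Average per-token KL over real (non-padded) positions only.

    per_token_kl:   (batch, seq_len) KL values, padded positions included
    attention_mask: (batch, seq_len) with 1 for real tokens, 0 for padding
    """
    mask = attention_mask.to(per_token_kl.dtype)
    # Zero out padded positions, then divide by the count of real tokens
    return ((per_token_kl * mask).sum() / mask.sum()).item()


kl = torch.ones(2, 4)
mask = torch.tensor([[1, 1, 0, 0], [1, 1, 1, 0]])
print(masked_mean_kl(kl, mask))  # → 1.0
```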
Pull request overview
Adds a KL-divergence evaluation utility to compare a quantized/target causal LM against a base model, reporting forward and reverse KLD, with both a Python API and a CLI entrypoint.
Changes:
- Introduces
evaluate_kl_divergence()+KLDivergenceResultfor forward/reverse/symmetric KLD reporting. - Adds a CLI runnable module and exposes the API via
llmcompressor.evaluation. - Adds unit tests for the core KL computation plus network-dependent integration-style tests using tiny HF models/datasets.
Reviewed changes
Copilot reviewed 4 out of 5 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| `src/llmcompressor/evaluation/kl_divergence.py` | Implements the KLD evaluation API, core KL computation, dataset/model loading, and a CLI `main()`. |
| `src/llmcompressor/evaluation/__main__.py` | Adds a package-level `-m` entrypoint wrapper around the CLI. |
| `src/llmcompressor/evaluation/__init__.py` | Re-exports the evaluation API from the new evaluation package. |
| `tests/llmcompressor/evaluation/test_kl_divergence.py` | Adds unit + integration-style tests for the new evaluation functionality. |
```python
import argparse
import json
import logging
import sys
```
sys is imported but never used, which will fail linting (ruff/pyflakes) and add noise. Remove the unused import.
Suggested change: delete the `import sys` line.
```python
    base_model_id = base_model
else:
    base_model_obj = base_model
    base_model_id = getattr(base_model, "name_or_path", "base_model")
```
When base_model is passed as an already-loaded torch.nn.Module, the code derives base_model_id via getattr(base_model, "name_or_path", "base_model") and then calls AutoTokenizer.from_pretrained(base_model_id). For non-HF modules (or HF models without a meaningful name_or_path), this will raise at runtime (e.g., trying to load tokenizer for literal "base_model"). Consider either requiring an HF PreTrainedModel here, or add an explicit tokenizer (or tokenizer_id) parameter and use it when models are preloaded.
Suggested change:

```python
base_model_id = getattr(base_model, "name_or_path", None)
if not base_model_id:
    raise TypeError(
        "When passing a preloaded 'base_model', it must be a "
        "Hugging Face transformers.PreTrainedModel instance with a "
        "valid 'name_or_path' attribute so that the corresponding "
        "tokenizer can be loaded."
    )
```
```python
# Compute KL divergence per token (skip the last position which has no
# next-token prediction to compare against, but for logit comparison
# all positions are valid)
forward_kld = _kl_divergence_per_token(base_log_probs, target_log_probs)
reverse_kld = _kl_divergence_per_token(target_log_probs, base_log_probs)
```
The comment says to "skip the last position" but the implementation computes KL for all positions. Since KL here is just comparing distributions between models (no ground-truth next token needed), either update the comment to match the behavior or actually slice off the last timestep if that’s the intended metric.
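If the next-token-only interpretation is the intended one, the slicing the reviewer describes is a one-liner. A sketch under that assumption (the tensors here are stand-ins for the PR's per-position log-prob tensors):

```python
import torch

# Hypothetical stand-ins for the PR's base/target log-prob tensors, shape (seq_len, vocab)
base_log_probs = torch.log_softmax(torch.randn(8, 32), dim=-1)
target_log_probs = torch.log_softmax(torch.randn(8, 32), dim=-1)

# Drop the final timestep before computing KL, since the last position
# has no next-token prediction to compare against
forward_kld = (base_log_probs[:-1].exp()
               * (base_log_probs[:-1] - target_log_probs[:-1])).sum(dim=-1)
print(forward_kld.shape)  # → torch.Size([7])
```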
```python
"""Allow running KL-divergence evaluation as ``python -m llmcompressor.evaluation.kl_divergence``."""
```
This module docstring says it can be run as python -m llmcompressor.evaluation.kl_divergence, but src/llmcompressor/evaluation/__main__.py is actually executed by python -m llmcompressor.evaluation. Update the docstring to reflect the correct invocation (or drop this file if it’s not intended).
Suggested change:

```python
"""Allow running KL-divergence evaluation as ``python -m llmcompressor.evaluation``."""
```
```python
from llmcompressor.evaluation.kl_divergence import main

main()
```
Calling main() at import time makes llmcompressor.evaluation.__main__ unsafe to import (it will execute the CLI immediately). Even though __main__ is typically only executed via python -m, it’s safer to wrap the call in an if __name__ == "__main__": guard.
Suggested change:

```python
if __name__ == "__main__":
    main()
```
```python
class TestKLDivergencePerToken:
    """Tests for the core KL divergence computation."""

    def test_identical_distributions(self):
        """KL divergence of identical distributions should be zero."""
        logits = torch.randn(10, 100)
```
Most tests in this repo are explicitly marked (e.g. @pytest.mark.unit). Consider marking the pure tensor-level tests in this file as unit to match the suite’s marker conventions.
```python
class TestEvaluateKLDivergenceWithMocks:
    """Integration-style tests using tiny models."""

    @pytest.fixture
    def tiny_models(self):
        """Create two tiny randomly-initialized models for testing."""
        try:
```
TestEvaluateKLDivergenceWithMocks performs network-dependent HF config/dataset loads (and conditionally skips). To keep CI selection consistent, it should be marked @pytest.mark.integration like other network/third-party integration tests in this repo.
```python
# --- Evaluate ---
result = KLDivergenceResult(num_samples=len(ds))
total_forward_kld = 0.0
total_reverse_kld = 0.0
total_tokens = 0

for sample in tqdm(ds, desc="Evaluating KL divergence"):
    text = sample[text_column]
    inputs = tokenizer(
        text,
        return_tensors="pt",
        max_length=max_seq_length,
        truncation=True,
        padding=False,
    )

    input_ids = inputs["input_ids"]
    attention_mask = inputs.get("attention_mask")

    if input_ids.shape[1] < 2:
        continue
```
KLDivergenceResult.num_samples is initialized to len(ds), but the loop can continue (e.g., sequences that tokenize to <2 tokens). That makes num_samples inconsistent with the actual number of evaluated samples (and with the per-sample arrays lengths). Track an evaluated_samples counter and set result.num_samples from that (or add a separate field if you want both attempted and evaluated counts).
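A hedged sketch of the counter the reviewer suggests. `KLDivergenceResult` here is a minimal stand-in for the PR's dataclass, and the whitespace tokenizer is a placeholder for the real one; only the skip-then-count pattern is the point:

```python
from dataclasses import dataclass


@dataclass
class KLDivergenceResult:  # minimal stand-in for the PR's dataclass
    num_samples: int = 0


def count_evaluated(samples: list[str], min_tokens: int = 2) -> KLDivergenceResult:
    """Count only samples that survive the <2-token skip, then record that count."""
    evaluated = 0
    for text in samples:
        token_count = len(text.split())  # stand-in for the tokenizer's output length
        if token_count < min_tokens:
            continue
        evaluated += 1  # ... evaluate KL for this sample here ...
    return KLDivergenceResult(num_samples=evaluated)


print(count_evaluated(["a", "two tokens", "three token text"]).num_samples)  # → 2
```

This keeps `num_samples` consistent with the lengths of any per-sample arrays accumulated inside the loop.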
```python
    Run a forward pass and return logits (on CPU to save GPU memory).

    :param model: the causal LM
    :param input_ids: shape (1, seq_len)
    :param attention_mask: shape (1, seq_len) or None
    :return: logits tensor of shape (seq_len, vocab_size)
    """
    outputs = model(
        input_ids=input_ids.to(model.device),
        attention_mask=(
            attention_mask.to(model.device) if attention_mask is not None else None
        ),
    )
    # Move to CPU immediately to free GPU memory
    return outputs.logits[0].float().cpu()
```
_collect_logits always moves full (seq_len, vocab) logits to CPU (.cpu()). For GPU runs this can dominate runtime via device-to-host transfers, especially for large vocabularies. Consider computing log_softmax + KL on the model device and only moving the reduced per-token/per-sample scalars to CPU (or gate CPU offload behind an option).
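A device-side variant along the lines the reviewer suggests could look like this. It is a sketch with a hypothetical helper name; the key change is that only the `(seq_len,)` reduction crosses the device boundary, not the full `(seq_len, vocab)` logits:

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def forward_kl_on_device(base_logits: torch.Tensor,
                         target_logits: torch.Tensor) -> torch.Tensor:
    """Compute per-token KLD(base || target) on the logits' own device,
    then transfer only the per-position scalars to CPU."""
    base_logp = F.log_softmax(base_logits, dim=-1)
    target_logp = F.log_softmax(target_logits, dim=-1)
    per_token = (base_logp.exp() * (base_logp - target_logp)).sum(dim=-1)
    return per_token.cpu()  # tiny transfer: seq_len floats, not seq_len * vocab


kl = forward_kl_on_device(torch.randn(6, 1000), torch.randn(6, 1000))
print(kl.shape)  # → torch.Size([6])
```

For large vocabularies this trades the per-sample device-to-host copy down by a factor of the vocab size, at the cost of keeping both models' logits resident on the device during the reduction.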
```python
"""
Unit tests for the KL-divergence evaluation tool.

Tests core computation logic using small synthetic models and tensors.
Does not require GPU or large model downloads.
"""
```
The file-level docstring says these tests "do not require GPU or large model downloads", but the integration-style tests fetch a HF config and dataset over the network (and skip if unavailable). Either revise the docstring to reflect the network dependency, or rework the tests to be fully offline.
Suggested change:

```python
"""
Unit and integration tests for the KL-divergence evaluation tool.

Tests core computation logic using small synthetic models and tensors, and
includes integration-style tests with tiny models that may download small
configs/datasets over the network (tests are skipped if unavailable). These
tests do not require a GPU or large model downloads.
"""
```
Ref #2031 - Implements a KL-divergence eval tool with forward and reverse KLD, a Python API, and a CLI