
feat: add KL-divergence evaluation tool for quantized models#2439

Closed
NJX-njx wants to merge 1 commit into vllm-project:main from NJX-njx:feat/kl-divergence-eval-tool

Conversation

@NJX-njx NJX-njx commented Mar 4, 2026

Ref #2031 - Implements KL-divergence eval tool with both forward and reverse KLD, Python API and CLI

Ref vllm-project#2031

Implement a tool to evaluate how well quantized models preserve the
original model's probability distribution using KL divergence.

Features:
- Computes KLD(base || target) and KLD(target || base) since KL
  divergence is asymmetric
- Reports per-sample and aggregate statistics
- Supports any HuggingFace causal LM and dataset
- Python API via evaluate_kl_divergence() function
- CLI via python -m llmcompressor.evaluation.kl_divergence
- Memory-efficient: logits moved to CPU immediately to free GPU VRAM

Files added:
- src/llmcompressor/evaluation/__init__.py
- src/llmcompressor/evaluation/__main__.py
- src/llmcompressor/evaluation/kl_divergence.py
- tests/llmcompressor/evaluation/test_kl_divergence.py
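
As a hedged sketch of the metric the description refers to (not the PR's actual implementation; names are illustrative), forward and reverse KLD can be computed per token from two models' logits like this:

```python
import torch
import torch.nn.functional as F

def kl_per_token(p_logits: torch.Tensor, q_logits: torch.Tensor) -> torch.Tensor:
    """KL(P || Q) at each position, from raw logits of shape (seq_len, vocab_size)."""
    p_log = F.log_softmax(p_logits, dim=-1)
    q_log = F.log_softmax(q_logits, dim=-1)
    # sum over the vocab dimension: p(v) * (log p(v) - log q(v))
    return (p_log.exp() * (p_log - q_log)).sum(dim=-1)

base_logits = torch.randn(10, 100)    # stand-in for the base model's logits
target_logits = torch.randn(10, 100)  # stand-in for the quantized model's logits
forward_kld = kl_per_token(base_logits, target_logits)   # KLD(base || target)
reverse_kld = kl_per_token(target_logits, base_logits)   # KLD(target || base)
```

Because KL divergence is asymmetric, `forward_kld` and `reverse_kld` generally differ; both are non-negative, and both vanish when the two logit tensors induce the same distribution.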
Copilot AI review requested due to automatic review settings March 4, 2026 10:32
@chatgpt-codex-connector

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

@github-actions

github-actions bot commented Mar 4, 2026

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a robust KL-divergence evaluation tool designed to measure the statistical difference between the probability distributions of a base model and its quantized counterpart. By offering both a Python API and a command-line interface, it provides flexible options for developers and researchers to quantify the impact of quantization on model output distributions, thereby aiding in the development and validation of compressed models.

Highlights

  • New KL-Divergence Evaluation Tool: A new tool has been introduced to evaluate the KL-divergence between a base (unquantized) model and a target (quantized) model. This helps in assessing the quality and fidelity of quantized models.
  • Forward and Reverse KLD Calculation: The tool computes both forward KLD (base || target) and reverse KLD (target || base) to provide a comprehensive understanding of distribution differences, acknowledging the asymmetric nature of KL-divergence.
  • Python API and Command-Line Interface (CLI): The KL-divergence evaluation functionality is accessible via a flexible Python API for programmatic use and a convenient command-line interface for direct execution.
  • Comprehensive Testing: Unit and integration tests have been added to ensure the correctness of the KL-divergence calculations and the overall functionality of the evaluation tool, using synthetic and tiny mock models.


Changelog
  • src/llmcompressor/evaluation/__init__.py
    • Initialized the evaluation package.
    • Exported KLDivergenceResult and evaluate_kl_divergence for public API access.
  • src/llmcompressor/evaluation/__main__.py
    • Enabled direct execution of the KL-divergence evaluation tool via python -m llmcompressor.evaluation.kl_divergence.
  • src/llmcompressor/evaluation/kl_divergence.py
    • Implemented the KLDivergenceResult dataclass to store per-sample and aggregate KL-divergence statistics.
    • Developed _kl_divergence_per_token function for calculating token-level KL divergence from log-probabilities.
    • Created _collect_logits function to efficiently retrieve model logits and move them to CPU.
    • Provided the main evaluate_kl_divergence function, handling model loading, dataset processing, and KLD computation.
    • Integrated an argparse-based command-line interface (CLI) for easy execution and result output.
  • tests/llmcompressor/evaluation/test_kl_divergence.py
    • Added unit tests for _kl_divergence_per_token to verify KL-divergence calculations with identical, different, and known distributions.
    • Included tests for the KLDivergenceResult dataclass, checking summary generation and dictionary conversion.
    • Implemented integration-style tests for evaluate_kl_divergence using tiny mock models, ensuring end-to-end functionality and verifying near-zero KLD for identical models.
Activity
  • No human activity (comments, reviews) has been recorded on this pull request yet.

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

The pull request introduces a new KL-divergence evaluation tool for comparing quantized models against their base counterparts, complete with both a Python API and a CLI. The implementation includes a KLDivergenceResult dataclass for storing statistics, core functions for calculating token-level KL divergence, and a comprehensive evaluate_kl_divergence function. Unit tests cover the core computation and integration with mocked models. Overall, the feature is well-structured and tested, but there are a few areas for improvement regarding tokenizer handling and unused parameters.

else:
    base_model_obj = base_model
    base_model_id = getattr(base_model, "name_or_path", "base_model")
tokenizer = AutoTokenizer.from_pretrained(base_model_id)

critical

When base_model is an already-loaded torch.nn.Module object, base_model_id falls back to the string "base_model" if name_or_path is not present. Subsequently, AutoTokenizer.from_pretrained(base_model_id) will attempt to load a tokenizer using the literal string "base_model", which is unlikely to be a valid HuggingFace model ID or local path. This will lead to a runtime error when trying to load the tokenizer.

To fix this, if a model object is passed, the corresponding tokenizer should either be passed explicitly as an argument to evaluate_kl_divergence, or the base_model object must guarantee a valid name_or_path attribute that can be used to load its tokenizer. A safer approach would be to add a tokenizer parameter to evaluate_kl_divergence that can be optionally provided when base_model is an object.

        base_model_id = getattr(base_model, "name_or_path", None)
        if tokenizer is None and base_model_id is None:
            raise ValueError(
                "If base_model is an object and no tokenizer is provided, "
                "base_model must have a 'name_or_path' attribute to load the tokenizer."
            )
        if tokenizer is None:
            tokenizer = AutoTokenizer.from_pretrained(base_model_id)

text_column: str = "text",
num_samples: int = 128,
max_seq_length: int = 512,
batch_size: int = 1,

medium

The batch_size parameter is currently not utilized in the evaluate_kl_divergence function, as indicated by the comment "not used yet". If batching is not planned for immediate implementation, it might be clearer to remove this parameter for now to avoid confusion and reintroduce it when the functionality is ready. If it's a placeholder for future work, a more explicit comment on its purpose and future plans would be beneficial.


Copilot AI left a comment


Pull request overview

Adds a KL-divergence evaluation utility to compare a quantized/target causal LM against a base model, reporting forward and reverse KLD, with both a Python API and a CLI entrypoint.

Changes:

  • Introduces evaluate_kl_divergence() + KLDivergenceResult for forward/reverse/symmetric KLD reporting.
  • Adds a CLI runnable module and exposes the API via llmcompressor.evaluation.
  • Adds unit tests for the core KL computation plus network-dependent integration-style tests using tiny HF models/datasets.

Reviewed changes

Copilot reviewed 4 out of 5 changed files in this pull request and generated 10 comments.

File Description
src/llmcompressor/evaluation/kl_divergence.py Implements the KLD evaluation API, core KL computation, dataset/model loading, and a CLI main().
src/llmcompressor/evaluation/__main__.py Adds a package-level -m entrypoint wrapper around the CLI.
src/llmcompressor/evaluation/__init__.py Re-exports the evaluation API from the new evaluation package.
tests/llmcompressor/evaluation/test_kl_divergence.py Adds unit + integration-style tests for the new evaluation functionality.


import argparse
import json
import logging
import sys

Copilot AI Mar 4, 2026


sys is imported but never used, which will fail linting (ruff/pyflakes) and adds noise. Remove the unused import.

Suggested change
import sys

Copilot uses AI. Check for mistakes.
    base_model_id = base_model
else:
    base_model_obj = base_model
    base_model_id = getattr(base_model, "name_or_path", "base_model")

Copilot AI Mar 4, 2026


When base_model is passed as an already-loaded torch.nn.Module, the code derives base_model_id via getattr(base_model, "name_or_path", "base_model") and then calls AutoTokenizer.from_pretrained(base_model_id). For non-HF modules (or HF models without a meaningful name_or_path), this will raise at runtime (e.g., trying to load tokenizer for literal "base_model"). Consider either requiring an HF PreTrainedModel here, or add an explicit tokenizer (or tokenizer_id) parameter and use it when models are preloaded.

Suggested change
base_model_id = getattr(base_model, "name_or_path", "base_model")
base_model_id = getattr(base_model, "name_or_path", None)
if not base_model_id:
    raise TypeError(
        "When passing a preloaded 'base_model', it must be a "
        "Hugging Face transformers.PreTrainedModel instance with a "
        "valid 'name_or_path' attribute so that the corresponding "
        "tokenizer can be loaded."
    )

Comment on lines +263 to +267
# Compute KL divergence per token (skip the last position which has no
# next-token prediction to compare against, but for logit comparison
# all positions are valid)
forward_kld = _kl_divergence_per_token(base_log_probs, target_log_probs)
reverse_kld = _kl_divergence_per_token(target_log_probs, base_log_probs)

Copilot AI Mar 4, 2026


The comment says to "skip the last position" but the implementation computes KL for all positions. Since KL here is just comparing distributions between models (no ground-truth next token needed), either update the comment to match the behavior or actually slice off the last timestep if that’s the intended metric.

@@ -0,0 +1,5 @@
"""Allow running KL-divergence evaluation as ``python -m llmcompressor.evaluation.kl_divergence``."""

Copilot AI Mar 4, 2026


This module docstring says it can be run as python -m llmcompressor.evaluation.kl_divergence, but src/llmcompressor/evaluation/__main__.py is actually executed by python -m llmcompressor.evaluation. Update the docstring to reflect the correct invocation (or drop this file if it’s not intended).

Suggested change
"""Allow running KL-divergence evaluation as ``python -m llmcompressor.evaluation.kl_divergence``."""
"""Allow running KL-divergence evaluation as ``python -m llmcompressor.evaluation``."""


from llmcompressor.evaluation.kl_divergence import main

main()

Copilot AI Mar 4, 2026


Calling main() at import time makes llmcompressor.evaluation.__main__ unsafe to import (it will execute the CLI immediately). Even though __main__ is typically only executed via python -m, it’s safer to wrap the call in an if __name__ == "__main__": guard.

Suggested change
main()
if __name__ == "__main__":
    main()

Comment on lines +19 to +24
class TestKLDivergencePerToken:
    """Tests for the core KL divergence computation."""

    def test_identical_distributions(self):
        """KL divergence of identical distributions should be zero."""
        logits = torch.randn(10, 100)

Copilot AI Mar 4, 2026


Most tests in this repo are explicitly marked (e.g. @pytest.mark.unit). Consider marking the pure tensor-level tests in this file as unit to match the suite’s marker conventions.

Comment on lines +101 to +107
class TestEvaluateKLDivergenceWithMocks:
    """Integration-style tests using tiny models."""

    @pytest.fixture
    def tiny_models(self):
        """Create two tiny randomly-initialized models for testing."""
        try:

Copilot AI Mar 4, 2026


TestEvaluateKLDivergenceWithMocks performs network-dependent HF config/dataset loads (and conditionally skips). To keep CI selection consistent, it should be marked @pytest.mark.integration like other network/third-party integration tests in this repo.

Comment on lines +231 to +252
# --- Evaluate ---
result = KLDivergenceResult(num_samples=len(ds))
total_forward_kld = 0.0
total_reverse_kld = 0.0
total_tokens = 0

for sample in tqdm(ds, desc="Evaluating KL divergence"):
    text = sample[text_column]
    inputs = tokenizer(
        text,
        return_tensors="pt",
        max_length=max_seq_length,
        truncation=True,
        padding=False,
    )

    input_ids = inputs["input_ids"]
    attention_mask = inputs.get("attention_mask")

    if input_ids.shape[1] < 2:
        continue


Copilot AI Mar 4, 2026


KLDivergenceResult.num_samples is initialized to len(ds), but the loop can continue (e.g., sequences that tokenize to <2 tokens). That makes num_samples inconsistent with the actual number of evaluated samples (and with the per-sample arrays lengths). Track an evaluated_samples counter and set result.num_samples from that (or add a separate field if you want both attempted and evaluated counts).
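A minimal, self-contained sketch of the counter this comment suggests (names hypothetical; the real loop iterates a tokenized dataset rather than a list of lengths):

```python
def count_evaluated(sample_token_counts, min_tokens=2):
    """Count only samples that survive the length filter, mirroring the
    loop's `continue` for sequences shorter than two tokens."""
    evaluated = 0
    for n in sample_token_counts:
        if n < min_tokens:
            continue  # skipped samples must not count toward num_samples
        evaluated += 1
    return evaluated

# result.num_samples would then be set from this count rather than len(ds)
assert count_evaluated([1, 5, 0, 3]) == 2
```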

Comment on lines +119 to +134
    Run a forward pass and return logits (on CPU to save GPU memory).

    :param model: the causal LM
    :param input_ids: shape (1, seq_len)
    :param attention_mask: shape (1, seq_len) or None
    :return: logits tensor of shape (seq_len, vocab_size)
    """
    outputs = model(
        input_ids=input_ids.to(model.device),
        attention_mask=(
            attention_mask.to(model.device) if attention_mask is not None else None
        ),
    )
    # Move to CPU immediately to free GPU memory
    return outputs.logits[0].float().cpu()


Copilot AI Mar 4, 2026


_collect_logits always moves full (seq_len, vocab) logits to CPU (.cpu()). For GPU runs this can dominate runtime via device-to-host transfers, especially for large vocabularies. Consider computing log_softmax + KL on the model device and only moving the reduced per-token/per-sample scalars to CPU (or gate CPU offload behind an option).
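One possible shape for that on-device reduction (a sketch under the reviewer's suggestion, not the PR's code; the function name is illustrative, and it runs on whatever device the logits already occupy):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def per_token_kld(base_logits: torch.Tensor, target_logits: torch.Tensor) -> torch.Tensor:
    """Reduce (seq_len, vocab_size) logits to seq_len KL scalars on the
    logits' own device, then move only those scalars to the host."""
    base_log = F.log_softmax(base_logits.float(), dim=-1)
    target_log = F.log_softmax(target_logits.float(), dim=-1)
    kld = (base_log.exp() * (base_log - target_log)).sum(dim=-1)
    return kld.cpu()  # seq_len floats cross the bus instead of seq_len * vocab
```

Compared with offloading the full logit tensor, this cuts the device-to-host transfer by a factor of the vocabulary size, at the cost of keeping the softmax computation on the accelerator.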

Comment on lines +2 to +5
Unit tests for the KL-divergence evaluation tool.

Tests core computation logic using small synthetic models and tensors.
Does not require GPU or large model downloads.

Copilot AI Mar 4, 2026


The file-level docstring says these tests "do not require GPU or large model downloads", but the integration-style tests fetch a HF config and dataset over the network (and skip if unavailable). Either revise the docstring to reflect the network dependency, or rework the tests to be fully offline.

Suggested change
Unit tests for the KL-divergence evaluation tool.
Tests core computation logic using small synthetic models and tensors.
Does not require GPU or large model downloads.

Unit and integration tests for the KL-divergence evaluation tool.
Tests core computation logic using small synthetic models and tensors, and
includes integration-style tests with tiny models that may download small
configs/datasets over the network (tests are skipped if unavailable). These
tests do not require a GPU or large model downloads.

@dsikka dsikka closed this Mar 4, 2026