Releases: cvs-health/uqlm

v0.5.7

13 Mar 16:56
35aa5ef

What's Changed

Full Changelog: v0.5.6...v0.5.7

v0.5.6

02 Mar 19:18
6ee9312

Highlights

  • package upgrades from dependabot
  • update badge colors and links
  • update citation information

What's Changed

Full Changelog: v0.5.5...v0.5.6

v0.5.5

25 Feb 20:56
d367798

Highlights

  • replace agenerate with ainvoke where generation is used, since ainvoke appears to be the better-maintained and better-documented LangChain method
  • replace poetry with uv for dependency management
  • update badges
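
The agenerate-to-ainvoke change can be pictured with a minimal sketch. StubChatModel below is a hypothetical stand-in, not a LangChain or uqlm class; it only mirrors the ainvoke method name so the concurrency pattern runs without any dependencies. A real LangChain chat model typically returns a message object rather than a plain string.

```python
import asyncio


class StubChatModel:
    """Hypothetical stand-in that exposes an async `ainvoke`, like a LangChain Runnable."""

    async def ainvoke(self, prompt: str) -> str:
        await asyncio.sleep(0)  # yield control, as a real network call would
        return f"echo: {prompt}"


async def generate_all(llm, prompts):
    # One ainvoke call per prompt, awaited concurrently; results keep prompt order.
    return await asyncio.gather(*(llm.ainvoke(p) for p in prompts))


responses = asyncio.run(generate_all(StubChatModel(), ["a", "b"]))
```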

What's Changed

Full Changelog: v0.5.4...v0.5.5

v0.5.4

30 Jan 17:06
84a6121

Highlights

1. Add new white-box scorers to UQEnsemble accepted scorers list:

Top-logprobs scorers (3):

  • min_token_negentropy - Minimum negentropy across tokens
  • mean_token_negentropy - Average negentropy across tokens
  • probability_margin - Mean difference between top-2 token probabilities

Sampled-logprobs scorers (4):

  • semantic_negentropy - Entropy based on semantic clustering
  • semantic_density - Density-based confidence measure
  • monte_carlo_probability - Average sequence probability across samples
  • consistency_and_confidence - Cosine similarity × response probability

P(True) scorer (1):

  • p_true - LLM's estimate of P(response is true)
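
As a rough illustration of the top-logprobs scorers listed above, the sketch below computes a probability margin and token-level negentropy from per-token top-K probabilities. The function names are hypothetical, and the negentropy formula (one minus entropy divided by log K, after renormalizing the top-K probabilities) is an assumption made for illustration; the library's exact definitions may differ.

```python
import math


def probability_margin(top_probs_per_token):
    """Mean gap between the two most likely tokens at each position."""
    margins = []
    for probs in top_probs_per_token:
        first, second = sorted(probs, reverse=True)[:2]
        margins.append(first - second)
    return sum(margins) / len(margins)


def token_negentropy(probs):
    """Assumed definition: 1 - H(p) / log(K), with K >= 2 top probabilities renormalized."""
    total = sum(probs)
    normed = [p / total for p in probs]
    entropy = -sum(p * math.log(p) for p in normed if p > 0)
    return 1.0 - entropy / math.log(len(normed))


# Two generated tokens, each with its top-3 probabilities (made-up values).
top_probs = [[0.9, 0.05, 0.05], [0.5, 0.3, 0.2]]
negentropies = [token_negentropy(p) for p in top_probs]
min_negentropy = min(negentropies)
mean_negentropy = sum(negentropies) / len(negentropies)
margin = probability_margin(top_probs)
```

Under this assumed definition, a uniform top-K distribution scores 0 and a fully peaked one scores 1, so higher values indicate higher confidence.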

2. Fix embeddings model specification for cosine_sim and consistency_and_confidence, enable with WhiteBoxUQ

Corrects how the embedding model is specified through the sentence_transformer parameter of BlackBoxUQ. Previously, the string was forced to begin with "sentence-transformers/"; now the full model string is passed through the parameter.

Previous: sentence_transformer=all-MiniLM-L12-v2 was specified and then "sentence-transformers/" was prepended to the string when storing the class attribute.

Now: sentence_transformer=sentence-transformers/all-MiniLM-L12-v2 is specified. This allows other embedding models that don't start with "sentence-transformers/", such as jinaai/jina-embeddings-v2-base-code, to be specified.

Also adds the missing sentence_transformer parameter to WhiteBoxUQ.
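
The before/after behavior can be sketched with two small functions. Both helper names are hypothetical and exist only to illustrate the string handling described above; they are not part of uqlm.

```python
def resolve_embedding_model_old(sentence_transformer: str) -> str:
    # Pre-v0.5.4 behavior as described above: the prefix was always prepended.
    return "sentence-transformers/" + sentence_transformer


def resolve_embedding_model_new(sentence_transformer: str) -> str:
    # v0.5.4 behavior as described above: the full string is used as given.
    return sentence_transformer


old_default = resolve_embedding_model_old("all-MiniLM-L12-v2")
new_default = resolve_embedding_model_new("sentence-transformers/all-MiniLM-L12-v2")

# A model outside the sentence-transformers namespace only resolves correctly
# under the new behavior; the old behavior would produce a mangled identifier.
old_jina = resolve_embedding_model_old("jinaai/jina-embeddings-v2-base-code")
new_jina = resolve_embedding_model_new("jinaai/jina-embeddings-v2-base-code")
```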

What's Changed

Full Changelog: v0.5.3...v0.5.4

v0.5.3

20 Jan 20:51
6d26749

Highlights

  • added new demo notebook to illustrate langgraph-uqlm integration
  • upgrade package versions per dependabot
  • fix some LaTeX in docs site
  • fix links in readme

What's Changed

Full Changelog: v0.5.2...v0.5.3

v0.5.2

14 Jan 16:05
7acd188

Highlights

  • Create uqlm.nli.EntailmentClassifier class for LLM-based entailment classification. This is well-suited for long-text scoring when responses exceed the length that can be handled by the Hugging Face NLI model.
  • Update LongTextGraph, LongTextUQ, UnitResponseScorer, GraphScorer and associated notebooks to allow for LLM-based entailment classification.
  • Update unit tests
  • Misc. docs site cleanup

What's Changed

Full Changelog: v0.5.1...v0.5.2

v0.5.1

09 Jan 14:38
7bd62f1

Highlights

  • fixes rendering of long-form scorer content on the docs site
  • adds missing uqlm/longform subpackage to pyproject.toml so it appears in API reference on docs site
  • misc. docs site cleanup

What's Changed

Full Changelog: v0.5.0...v0.5.1

v0.5.0

08 Jan 17:57
4ce62b2

New Methods: Long-Form UQ

Short-form UQ methods have been shown to generalize poorly to long-form LLM outputs. Fine-grained methods for long-form UQ address these limitations by first decomposing responses into granular units (sentences or claims) and then scoring each unit.

Response Decomposition

We enable decomposition of responses into sentences or claims using our ResponseDecomposer class. This class implements claim decomposition using an LLM or sentence decomposition using a rule-based approach.
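
A rule-based sentence split can be sketched in a few lines. The function name and the regular expression below are assumptions for illustration only; they are not the rules used by ResponseDecomposer, and this simple pattern will mis-split abbreviations such as "Dr. Smith".

```python
import re


def split_sentences(text: str) -> list[str]:
    """Split after '.', '!' or '?' when followed by whitespace (illustrative rule only)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


units = split_sentences("Paris is in France. It has a tower! Is it tall?")
```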

Scoring methods

We add four families of fine-grained scorers for long-form uncertainty quantification: Unit-Response, Matched-Unit, Unit-QA, and Graph-Based.

1. Unit-Response (Based on the LUQ/LUQ-Atomic methods)

These scorers measure whether sampled responses entail each unit (sentence or claim) in the original response and average across sampled responses to obtain unit-level confidence scores. This is implemented with the uqlm.scorers.longform.LongTextUQ class.
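
The averaging step can be sketched as follows. The entailment check here is a toy word-overlap stand-in for an NLI model or LLM judge, and both function names are hypothetical; this is not the LongTextUQ implementation.

```python
def toy_entails(premise: str, hypothesis: str) -> float:
    """Toy stand-in for NLI: 1.0 if every word of the hypothesis occurs in the premise."""
    premise_words = set(premise.lower().split())
    hypothesis_words = set(hypothesis.lower().split())
    return 1.0 if hypothesis_words <= premise_words else 0.0


def unit_response_scores(units, sampled_responses, entails=toy_entails):
    """For each unit, average the entailment score over all sampled responses."""
    return [
        sum(entails(response, unit) for response in sampled_responses) / len(sampled_responses)
        for unit in units
    ]


units = ["paris is the capital", "paris has 5 million residents"]
samples = [
    "paris is the capital of france",
    "paris is the capital and paris has 5 million residents",
    "paris is a city",
]
scores = unit_response_scores(units, samples)
```

In this made-up example the first unit is supported by two of the three samples and the second by only one, so the second unit receives the lower confidence score.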

2. Matched-Unit (Based on the LUQ-pair method)

These scorers work by matching each original sentence or claim to its most similar counterpart in sampled responses before computing entailment scores. Matched scores are then averaged across sampled responses to obtain a confidence score for each unit in the original response. This is implemented with the uqlm.scorers.longform.LongTextUQ class.
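
A minimal sketch of the match-then-score idea follows, assuming each sampled response has already been decomposed into at least one unit. Jaccard word overlap stands in for the similarity measure and a word-subset check stands in for entailment; both are illustrative assumptions, and the function names are hypothetical.

```python
def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity (illustrative similarity measure)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def toy_entails(premise: str, hypothesis: str) -> float:
    """Toy stand-in for NLI: 1.0 if every word of the hypothesis occurs in the premise."""
    return 1.0 if set(hypothesis.lower().split()) <= set(premise.lower().split()) else 0.0


def matched_unit_scores(units, sampled_units, similarity=jaccard, entails=toy_entails):
    """Match each original unit to its closest unit in every sample, then average entailment."""
    scores = []
    for unit in units:
        per_sample = []
        for candidates in sampled_units:
            best = max(candidates, key=lambda c: similarity(unit, c))
            per_sample.append(entails(best, unit))
        scores.append(sum(per_sample) / len(per_sample))
    return scores


units = ["paris is the capital", "paris has 5 million residents"]
sampled_units = [
    ["paris is the capital of france", "paris has 2 million residents"],
    ["paris is the capital", "paris has 5 million residents"],
]
scores = matched_unit_scores(units, sampled_units)
```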

3. Unit-QA (Based on the Longform Semantic Entropy method)

These scorers work by decomposing a response into granular units (sentences or claims), generating questions whose answers are the claims given context, sampling multiple answers to each question, and computing black-box UQ scores across these answers. This is implemented with the uqlm.scorers.longform.LongTextQA class.

4. Graph-Based (Based on Jiang et al., 2024)

Graph-based scorers decompose original and sampled responses into claims, obtain the union of unique claims across all responses, and compute graph centrality metrics on the bipartite graph of claim-response entailment to measure uncertainty. This is implemented with the uqlm.scorers.longform.LongTextGraph class.
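
The simplest centrality on such a bipartite graph is degree centrality: the fraction of responses that entail a claim. The sketch below illustrates only that one metric, with a toy word-subset entailment check and hypothetical function names; it is not the LongTextGraph implementation, which may use other centrality metrics.

```python
def toy_entails(response: str, claim: str) -> bool:
    """Toy stand-in for NLI: True if every word of the claim occurs in the response."""
    return set(claim.lower().split()) <= set(response.lower().split())


def claim_degree_centrality(claims, responses, entails=toy_entails):
    """Degree of each claim node, normalized by the number of response nodes."""
    n_responses = len(responses)
    return {
        claim: sum(1 for response in responses if entails(response, claim)) / n_responses
        for claim in claims
    }


# Union of unique claims across the original and sampled responses (made-up data).
claims = ["paris is the capital", "lyon is the capital"]
responses = [
    "paris is the capital of france",
    "paris is the capital",
    "lyon is the capital",
]
centrality = claim_degree_centrality(claims, responses)
```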

These scorer classes all share the same parent class: uqlm.scorers.longform.baseclass.LongFormUQ.

Response Refinement with Uncertainty Aware Decoding

Response refinement works by dropping claims with confidence scores (specified with claim_filtering_scorer parameter) below a specified threshold (specified with response_refinement_threshold parameter) and reconstructing the response from the retained claims. This functionality is available in combination with any of the four methods described above by setting response_refinement=True in the constructor of the corresponding scorer class.
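
The filtering step can be sketched as below. The function name is hypothetical, and joining the retained claims with spaces is a simplifying assumption standing in for whatever reconstruction step the library performs.

```python
def refine_response(claims, confidence_scores, threshold):
    """Drop claims scoring below the threshold and rebuild a response from the rest."""
    kept = [claim for claim, score in zip(claims, confidence_scores) if score >= threshold]
    # Assumption: plain concatenation stands in for the reconstruction step.
    return " ".join(kept)


claims = ["Paris is the capital of France.", "It has 50 million residents.", "It lies on the Seine."]
scores = [0.9, 0.2, 0.5]
refined = refine_response(claims, scores, threshold=0.5)
```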

Performance Evaluation

We enable FactScore-based grading using an LLM. This works by comparing units (sentences or claims) in a generated response to a FactScore question against the corresponding text of the subject's Wikipedia article.

New docs site pages

We have added a "Scorer Definitions" tab to the docs site, intended to serve as an 'encyclopedia' of available scoring methods. It provides formal definitions, explanations in simple terms, and code snippets for all available methods.

Other changes

  • uqlm.scorers has been refactored into two subfolders: uqlm.scorers.shortform, which contains the existing scorer classes as of v0.4, and uqlm.scorers.longform, which contains the classes that implement the scoring methods described above
  • the readme has been updated to reflect new longform scorers, and a new readme has been added inside the examples/ directory to provide more details on the available tutorials
  • various package upgrades to address security vulnerabilities identified by dependabot

Breaking changes

  • normalized_probability has been deprecated from acceptable white-box scorer list in WhiteBoxUQ and UQEnsemble in favor of sequence_probability with length_normalize=True (default). This also affects the key/column names in the returned UQResult object.
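
To illustrate what length normalization does to a sequence probability, the sketch below assumes it means the geometric mean of token probabilities, that is, the exponential of the mean token log-probability. This standalone function is hypothetical and only borrows the scorer and parameter names from the note above; it is not the library's API.

```python
import math


def sequence_probability(token_logprobs, length_normalize=True):
    """Joint probability of a token sequence, optionally normalized by length."""
    total = sum(token_logprobs)
    if length_normalize:
        # Assumed normalization: geometric mean of the token probabilities.
        return math.exp(total / len(token_logprobs))
    return math.exp(total)


logprobs = [math.log(0.5), math.log(0.5)]
normalized = sequence_probability(logprobs)
unnormalized = sequence_probability(logprobs, length_normalize=False)
```

Without normalization, longer responses are penalized simply for containing more tokens; the normalized value stays comparable across lengths.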

What's Changed

Full Changelog: v0.4.5...v0.5.0

v0.4.5

08 Dec 16:18
9d30468

Highlights

  • fix bug in model name string checking when retrieving logprobs, per issue #284

What's Changed

Full Changelog: v0.4.4...v0.4.5

v0.4.4

04 Dec 15:42
e9305e4

Highlights

  • add max_length parameter to WhiteBoxUQ to avoid CUDA OutOfMemoryError
  • updates demo and docstring accordingly

What's Changed

Full Changelog: v0.4.3...v0.4.4