
Commit 6d26749

Merge pull request #326 from cvs-health/patch/v0.5.3
Patch release: `v0.5.3`
2 parents 7acd188 + 0d9d93e commit 6d26749

File tree

10 files changed: +1010 −106 lines


docs/source/_notebooks/examples/long_text_graph_demo.ipynb

Lines changed: 6 additions & 4 deletions

@@ -12,10 +12,12 @@
 " of how to use these methods with <code>uqlm</code>. The available scorers and papers from which they are adapted are below:\n",
 " </p>\n",
 " \n",
-"* Long-text Uncertainty Quantification (LUQ) ([Zhang et al., 2024](https://arxiv.org/abs/2403.20279))\n",
-"* LUQ-Atomic ([Zhang et al., 2024](https://arxiv.org/abs/2403.20279))\n",
-"* LUQ-pair ([Zhang et al., 2024](https://arxiv.org/abs/2403.20279))\n",
-"* Generalized LUQ-pair ([Zhang et al., 2024](https://arxiv.org/abs/2403.20279))\n",
+"* Closeness Centrality ([Jiang et al., 2024](https://arxiv.org/abs/2410.20783))\n",
+"* Betweenness Centrality ([Jiang et al., 2024](https://arxiv.org/abs/2410.20783))\n",
+"* PageRank ([Jiang et al., 2024](https://arxiv.org/abs/2410.20783))\n",
+"* Degree Centrality ([Zhang et al., 2024](https://arxiv.org/abs/2403.20279))\n",
+"* Harmonic Centrality\n",
+"* Laplacian Centrality\n",
 "\n",
 "</div>\n",
 "\n",

docs/source/scorer_definitions/long_text/graph.rst

Lines changed: 2 additions & 2 deletions

@@ -11,7 +11,7 @@ Graph-based scorers, proposed by Jiang et al. (2024), decompose original and sam

 * **Degree Centrality** - :math:`\frac{1}{m} \sum_{j=1}^m P(\text{entail}|y_j, s)` is the average edge weight, measured by entailment probability for claim node `s`.

-* **Betweenness Centrality** - :math:`\frac{1}{B_{\text{max}}}\sum_{u \neq v \neq s} \frac{\sigma_{uv}(s)}{\sigma_{uv}}` measures uncertainty by calculating the proportion of shortest paths between node pairs that pass through node :math:`s`, where :math:`\sigma_{uv}` represents all shortest paths between nodes :math:`u` and :math:`v`, and :math:`B_{\text{max}}` is the maximum possible value, given by :math:`B_{\text{max}}=\frac{1}{2} [m^2 (p + 1)^2 + m (p + 1)(2t - p - 1) - t (2p - t + 3)]`, `p = \frac{(|\mathbf{s}| - 1)}{m}`, and `t = (|\mathbf{s}| - 1) \mod m`.
+* **Betweenness Centrality** - :math:`\frac{1}{B_{\text{max}}}\sum_{u \neq v \neq s} \frac{\sigma_{uv}(s)}{\sigma_{uv}}` measures uncertainty by calculating the proportion of shortest paths between node pairs that pass through node :math:`s`, where :math:`\sigma_{uv}` represents all shortest paths between nodes :math:`u` and :math:`v`, and :math:`B_{\text{max}}` is the maximum possible value, given by :math:`B_{\text{max}}=\frac{1}{2} [m^2 (p + 1)^2 + m (p + 1)(2t - p - 1) - t (2p - t + 3)]`, :math:`p = \frac{(|\mathbf{s}| - 1)}{m}`, and :math:`t = (|\mathbf{s}| - 1) \mod m`.

 * **Closeness Centrality** - :math:`\frac{m + 2(|\mathbf{s}| - 1) }{\sum_{v \neq s}dist(s, v)}` measures the inverse sum of distances to all other nodes, normalized by the minimum possible distance.

@@ -27,7 +27,7 @@ where :math:`\mathbf{y}^{(s)}_{\text{cand}} = \{y_1^{(s)}, ..., y_m^{(s)}\}` are

 **Key Properties:**

 - Claim or sentence-level scoring
-- Less complex (cost and latency) than other long-form scoring methods
+- More complex (cost and latency) than LUQ-style scoring methods
 - Score range: :math:`[0, 1]`

 How It Works
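The degree-centrality definition above (mean entailment probability over a claim node's edges) and the shortest-path centralities can be illustrated on a toy graph. This is a minimal sketch, not the uqlm implementation: the edge weights are made-up placeholder values rather than real NLI outputs, the node names are hypothetical, and `networkx` is used with its own normalization conventions, which differ from the :math:`B_{\text{max}}` and closeness normalizations given in the text.

```python
import networkx as nx

# Toy entailment probabilities between the claim node "s" and candidate
# response nodes -- placeholder values, not output of a real NLI model.
edges = {
    ("s", "y1"): 0.9,
    ("s", "y2"): 0.7,
    ("s", "y3"): 0.8,
    ("y1", "y2"): 0.5,
}

G = nx.Graph()
for (u, v), p in edges.items():
    # Edge weight = entailment probability; use 1 - p as a distance so
    # that strongly entailed pairs are "close" for path-based centralities.
    G.add_edge(u, v, weight=p, distance=1.0 - p)

# Degree centrality of claim node "s": average incident edge weight,
# matching (1/m) * sum_j P(entail | y_j, s) from the definition above.
m = G.degree("s")
degree_score = sum(G["s"][v]["weight"] for v in G.neighbors("s")) / m
print(round(degree_score, 3))  # mean of 0.9, 0.7, 0.8 -> 0.8

# Shortest-path centralities on the same graph (networkx applies its own
# normalization, not the B_max / minimum-distance constants in the text).
betweenness = nx.betweenness_centrality(G, weight="distance")["s"]
closeness = nx.closeness_centrality(G, distance="distance")["s"]
```

With fractional distances, `closeness_centrality` can exceed 1; the text's scorers rescale such quantities into :math:`[0, 1]`.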

docs/source/scorer_definitions/long_text/luq.rst

Lines changed: 2 additions & 2 deletions

@@ -6,13 +6,13 @@ Long-Text Uncertainty Quantification (LUQ)
 Definition
 ----------

-The Long-text UQ (LUQ) approach demonstrated here is adapted from Zhang et al. (2024). Similar to standard black-box UQ, this approach requires generating a original response and sampled candidate responses to the same prompt. The original response is then decomposed into units (claims or sentences). Unit-level confidence scores are then obtained by averaging entailment probabilities across candidate responses:
+The Long-text UQ (LUQ) approach demonstrated here is adapted from Zhang et al. (2024). Similar to standard black-box UQ, this approach requires generating an original response and sampled candidate responses to the same prompt. The original response :math:`y` is then decomposed into units (claims or sentences). A confidence score for each unit :math:`s` is then obtained by averaging entailment probabilities across candidate responses:

 .. math::

     c_g(s; \mathbf{y}_{\text{cand}}) = \frac{1}{m} \sum_{j=1}^m P(\text{entail}|y_j, s)

-where :math:`\mathbf{y}^{(s)}_{\text{cand}} = {y_1^{(s)}, ..., y_m^{(s)}}` are :math:`m` candidate responses, and :math:`P(\text{entail}|y_j, s)` denotes the NLI-estimated probability that :math:`s` is entailed in :math:`y_j`.
+where :math:`\mathbf{y}^{(s)}_{\text{cand}} = \{y_1^{(s)}, ..., y_m^{(s)}\}` are :math:`m` candidate responses, and :math:`P(\text{entail}|y_j, s)` denotes the NLI-estimated probability that :math:`s` is entailed in :math:`y_j`.

 **Key Properties:**
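The unit-level formula :math:`c_g(s; \mathbf{y}_{\text{cand}}) = \frac{1}{m} \sum_{j=1}^m P(\text{entail}|y_j, s)` is just an average, which a few lines of Python make concrete. This is a sketch, not the uqlm API: `luq_confidence` and `entail_prob` are hypothetical names, and the stub scores stand in for a real NLI model.

```python
def luq_confidence(claim, candidates, entail_prob):
    """Average NLI entailment probability of `claim` across candidates:
    c_g(s; y_cand) = (1/m) * sum_j P(entail | y_j, s)."""
    return sum(entail_prob(y_j, claim) for y_j in candidates) / len(candidates)

# Stub entailment scores standing in for a real NLI model's outputs.
scores = {"y1": 0.95, "y2": 0.40, "y3": 0.75}
conf = luq_confidence(
    "Paris is in France.",
    ["y1", "y2", "y3"],
    lambda y, s: scores[y],
)
print(round(conf, 2))  # (0.95 + 0.40 + 0.75) / 3 -> 0.7
```

A unit contradicted by most sampled candidates (low entailment probabilities) thus receives a low confidence score, flagging a likely hallucination.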

docs/source/scorer_definitions/long_text/qa.rst

Lines changed: 2 additions & 2 deletions

@@ -6,7 +6,7 @@ QA-Based Uncertainty Quantification (LUQ)
 Definition
 ----------

-The Claim-QA approach demonstrated here is adapted from Farquhar et al. (2024). It uses an LLM to convert each unit (sentence or claim) into a question for which that unit would be the answer. The method measures consistency across multiple responses to these questions, effectively applying standard black-box uncertainty quantification to those sampled responses to the unit questions. Formally, a claim-QA scorer :math:`c_g(s;\cdot)` is defined as follows:
+The Claim-QA approach demonstrated here is adapted from Farquhar et al. (2024). The original response :math:`y` is decomposed into units (claims or sentences) and an LLM is used to convert each unit :math:`s` into a question for which that unit would be the answer. The method measures consistency across multiple responses to these questions, effectively applying standard black-box uncertainty quantification to those sampled responses to the unit questions. Formally, a claim-QA scorer :math:`c_g(s;\cdot)` is defined as follows:

 .. math::

@@ -17,7 +17,7 @@ where :math:`y_0^{(s)}` is the original unit response, :math:`\mathbf{y}^{(s)}_{

 **Key Properties:**

 - Claim or sentence-level scoring
-- Less complex (cost and latency) than other long-form scoring methods
+- More complex (cost and latency) than LUQ-style scoring methods
 - Score range: :math:`[0, 1]`

 How It Works
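The Claim-QA pipeline described above (unit → generated question → sampled answers → consistency score) can be sketched as follows. This is a hedged illustration, not the uqlm or Farquhar et al. implementation: `claim_qa_score`, `ask_question`, `answer`, and `match_prob` are hypothetical stand-ins for LLM and NLI calls.

```python
def claim_qa_score(unit, ask_question, answer, match_prob, m=3):
    """Claim-QA sketch: turn `unit` into a question, sample m answers to
    that question, and average how well each answer matches the unit."""
    question = ask_question(unit)                    # LLM call in practice
    answers = [answer(question) for _ in range(m)]   # sampled LLM responses
    # Black-box consistency: mean match/entailment score vs. the unit.
    return sum(match_prob(unit, a) for a in answers) / m

# Stub callables standing in for real LLM / NLI calls.
score = claim_qa_score(
    "The Eiffel Tower is 330 m tall.",
    ask_question=lambda u: "How tall is the Eiffel Tower?",
    answer=lambda q: "330 metres",
    match_prob=lambda u, a: 0.9,
)
print(round(score, 2))  # 0.9 with these stub scores
```

In practice any standard black-box consistency scorer (e.g. pairwise NLI or semantic similarity) can fill the `match_prob` role, which is why the method inherits the cost profile of multiple generations per unit.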

examples/README.md

Lines changed: 17 additions & 17 deletions

@@ -10,35 +10,35 @@ The notebooks are organized into core methods, long-form techniques, and advance

 | Tutorial | Great fit for... | LLM Compatibility | Added Cost/Latency |
 |----------|-------------|-------------------|--------------|
-| [Black-Box UQ](black_box_demo.ipynb) | Quick setup with any LLM; no need for model internals | All LLMs (API-only access) | Medium-High (multiple generations and comparisons) |
-| [White-Box UQ (Single-Generation)](white_box_single_generation_demo.ipynb) | Fastest and most efficient UQ when you have token probabilities | Requires token probability access | Negligible (single generation) |
-| [White-Box UQ (Multi-Generation)](white_box_multi_generation_demo.ipynb) | Higher accuracy UQ when compute budget allows | Requires token probability access | Medium-High (multiple generations) |
-| [LLM-as-a-Judge](judges_demo.ipynb) | Leveraging one or more LLMs to assess hallucination likelihood | All LLMs (API-only access) | Low-Medium (depends on which judge(s)) |
-| [Train a UQ Ensemble](ensemble_tuning_demo.ipynb) | Maximizing performance by combining multiple UQ methods | Depends on ensemble components | Low-High (depends on selected components) |
+| [Black-Box UQ](https://github.com/cvs-health/uqlm/blob/main/examples/black_box_demo.ipynb) | Quick setup with any LLM; no need for model internals | All LLMs (API-only access) | Medium-High (multiple generations and comparisons) |
+| [White-Box UQ (Single-Generation)](https://github.com/cvs-health/uqlm/blob/main/examples/white_box_single_generation_demo.ipynb) | Fastest and most efficient UQ when you have token probabilities | Requires token probability access | Negligible (single generation) |
+| [White-Box UQ (Multi-Generation)](https://github.com/cvs-health/uqlm/blob/main/examples/white_box_multi_generation_demo.ipynb) | Higher accuracy UQ when compute budget allows | Requires token probability access | Medium-High (multiple generations) |
+| [LLM-as-a-Judge](https://github.com/cvs-health/uqlm/blob/main/examples/judges_demo.ipynb) | Leveraging one or more LLMs to assess hallucination likelihood | All LLMs (API-only access) | Low-Medium (depends on which judge(s)) |
+| [Train a UQ Ensemble](https://github.com/cvs-health/uqlm/blob/main/examples/ensemble_tuning_demo.ipynb) | Maximizing performance by combining multiple UQ methods | Depends on ensemble components | Low-High (depends on selected components) |

 ### Tutorials for Long-Form Uncertainty Quantification Methods (for long-text outputs)

 | Tutorial | Great fit for... | LLM Compatibility | Added Cost/Latency |
 |----------|-------------|-------------------|--------------|
-| [LUQ method](luq_demo.ipynb) | Detecting claim-level hallucinations in long-form text without model internals | All LLMs (API-only access) | Medium-High (operates over all claims/sentences in original response) |
-| [Graph-based method](graph_based_demo.ipynb) | Analyzing claim relationships in complex responses | All LLMs (API-only access) | Very High (operates over all claims/sentences in original response and sampled responses) |
-| [Generalized Long-form semantic entropy](long_form_semantic_entropy_demo.ipynb) | Reflexlive, detailed approach to claim-level hallucination detection | All LLMs (API-only access) | High (operates over all claims/sentences in original response) |
+| [LUQ method](https://github.com/cvs-health/uqlm/blob/main/examples/long_text_uq_demo.ipynb) | Detecting claim-level hallucinations in long-form text without model internals | All LLMs (API-only access) | Medium-High (operates over all claims/sentences in original response) |
+| [Graph-based method](https://github.com/cvs-health/uqlm/blob/main/examples/long_text_graph_demo.ipynb) | Analyzing claim relationships in complex responses | All LLMs (API-only access) | Very High (operates over all claims/sentences in original response and sampled responses) |
+| [Generalized Long-form semantic entropy](https://github.com/cvs-health/uqlm/blob/main/examples/long_text_qa_demo.ipynb) | Reflexive, detailed approach to claim-level hallucination detection | All LLMs (API-only access) | High (operates over all claims/sentences in original response) |

 ### Other Tutorials and SOTA Method Examples

 | Tutorial | Great fit for... | LLM Compatibility | Added Cost/Latency |
 |----------|-------------|-------------------|--------------|
-| [Multimodal UQ](multimodal_demo.ipynb) | Uncertainty quantification with image+text inputs | Requires image-to-text model | Varies by method |
-| [Score Calibration](score_calibration_demo.ipynb) | Converting raw scores to calibrated probabilities as a postprocessing step | Works with any UQ method | Negligible |
-| [Semantic Entropy](semantic_entropy_demo.ipynb) | State-of-the-art UQ when token probabilities are available | Requires token probability access | Medium-High (multiple generations and comparisons) |
-| [Semantic Density](semantic_density_demo.ipynb) | Newest SOTA method for high-accuracy UQ | Requires token probability access | Medium-High (multiple generations and comparisons) |
-| [BS Detector Off-the-Shelf UQ Ensemble](ensemble_off_the_shelf_demo.ipynb) | Ready-to-use ensemble without training | Depends on ensemble components | Medium-High (multiple generations and comparisons) |
+| [Multimodal UQ](https://github.com/cvs-health/uqlm/blob/main/examples/multimodal_demo.ipynb) | Uncertainty quantification with image+text inputs | Requires image-to-text model | Varies by method |
+| [Score Calibration](https://github.com/cvs-health/uqlm/blob/main/examples/score_calibration_demo.ipynb) | Converting raw scores to calibrated probabilities as a postprocessing step | Works with any UQ method | Negligible |
+| [Semantic Entropy](https://github.com/cvs-health/uqlm/blob/main/examples/semantic_entropy_demo.ipynb) | State-of-the-art UQ when token probabilities are available | Requires token probability access | Medium-High (multiple generations and comparisons) |
+| [Semantic Density](https://github.com/cvs-health/uqlm/blob/main/examples/semantic_density_demo.ipynb) | Newest SOTA method for high-accuracy UQ | Requires token probability access | Medium-High (multiple generations and comparisons) |
+| [BS Detector Off-the-Shelf UQ Ensemble](https://github.com/cvs-health/uqlm/blob/main/examples/ensemble_off_the_shelf_demo.ipynb) | Ready-to-use ensemble without training | Depends on ensemble components | Medium-High (multiple generations and comparisons) |

-## Getting Started
+## Where should I start?

-We recommend starting with the [Black-Box UQ](black_box_demo.ipynb) notebook if you're new to uncertainty quantification or don't have access to model internals.
+We recommend starting with the [Black-Box UQ](https://github.com/cvs-health/uqlm/blob/main/examples/black_box_demo.ipynb) notebook if you're new to uncertainty quantification or don't have access to model internals.

-For the most efficient approach with minimal compute requirements, try the [White-Box UQ (Single-Generation)](white_box_single_generation_demo.ipynb) notebook if you have access to token probabilities.
+For the most efficient approach with minimal compute requirements, try the [White-Box UQ (Single-Generation)](https://github.com/cvs-health/uqlm/blob/main/examples/white_box_single_generation_demo.ipynb) notebook if you have access to token probabilities.

-For long-form text evaluation, the [LUQ method](luq_demo.ipynb) provides a good starting point that works with any LLM API.
+For long-form text evaluation, the [LUQ method](https://github.com/cvs-health/uqlm/blob/main/examples/long_text_uq_demo.ipynb) provides a good starting point that works with any LLM API.
