docs/source/scorer_definitions/long_text/graph.rst
Graph-based scorers, proposed by Jiang et al. (2024), decompose original and sampled responses into units and score each unit using centrality measures over an entailment graph:
* **Degree Centrality** - :math:`\frac{1}{m} \sum_{j=1}^m P(\text{entail}|y_j, s)` is the average edge weight, measured by entailment probability, for claim node :math:`s`.
* **Betweenness Centrality** - :math:`\frac{1}{B_{\text{max}}}\sum_{u \neq v \neq s} \frac{\sigma_{uv}(s)}{\sigma_{uv}}` measures uncertainty by calculating the proportion of shortest paths between node pairs that pass through node :math:`s`, where :math:`\sigma_{uv}` represents all shortest paths between nodes :math:`u` and :math:`v`, and :math:`B_{\text{max}}` is the maximum possible value, given by :math:`B_{\text{max}}=\frac{1}{2} [m^2 (p + 1)^2 + m (p + 1)(2t - p - 1) - t (2p - t + 3)]`, :math:`p = \frac{(|\mathbf{s}| - 1)}{m}`, and :math:`t = (|\mathbf{s}| - 1) \mod m`.
* **Closeness Centrality** - :math:`\frac{m + 2(|\mathbf{s}| - 1)}{\sum_{v \neq s}\text{dist}(s, v)}` measures the inverse sum of distances to all other nodes, normalized by the minimum possible distance.
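Of these measures, degree centrality is the simplest to compute directly. Below is a minimal sketch with made-up entailment probabilities standing in for NLI model outputs; it illustrates the formula only and is not the library's implementation:

```python
import numpy as np

# Hypothetical entailment probabilities P(entail | y_j, s): one row per
# claim node s from the original response, one column per sampled
# response y_j (here m = 3). In practice these come from an NLI model.
entail_probs = np.array([
    [0.90, 0.80, 0.95],  # claim 1: consistently entailed -> high confidence
    [0.20, 0.10, 0.15],  # claim 2: rarely entailed -> likely hallucinated
])

# Degree centrality of claim node s: average edge weight
# (1/m) * sum_j P(entail | y_j, s)
degree = entail_probs.mean(axis=1)
```

Claims whose degree centrality is low are weakly supported by the sampled responses and are candidates for flagging as hallucinations.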
**Key Properties:**
- Claim- or sentence-level scoring
- More complex (cost and latency) than LUQ-style scoring methods
The Long-text UQ (LUQ) approach demonstrated here is adapted from Zhang et al. (2024). Similar to standard black-box UQ, this approach requires generating an original response and sampled candidate responses to the same prompt. The original response :math:`y` is then decomposed into units (claims or sentences). A confidence score for each unit :math:`s` is then obtained by averaging entailment probabilities across candidate responses:
.. math::

   \frac{1}{m} \sum_{j=1}^m P(\text{entail}|y_j, s)
where :math:`\mathbf{y}^{(s)}_{\text{cand}} = \{y_1^{(s)}, ..., y_m^{(s)}\}` are :math:`m` candidate responses, and :math:`P(\text{entail}|y_j, s)` denotes the NLI-estimated probability that :math:`s` is entailed in :math:`y_j`.
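This averaging step can be sketched as follows; the function name and the toy stand-in for the NLI model are illustrative only, not the uqlm API:

```python
from typing import Callable, List

def luq_confidence(units: List[str], candidates: List[str],
                   entail_prob: Callable[[str, str], float]) -> List[float]:
    """LUQ-style unit confidence: average P(entail | y_j, s) over the m
    candidate responses. `entail_prob(candidate, unit)` stands in for an
    NLI model's entailment probability."""
    m = len(candidates)
    return [sum(entail_prob(y_j, s) for y_j in candidates) / m for s in units]

# Toy stand-in for an NLI model: "entails" iff the unit appears verbatim.
toy_nli = lambda candidate, unit: 1.0 if unit in candidate else 0.0

scores = luq_confidence(
    units=["Paris is in France.", "Paris has 10M people."],
    candidates=["Paris is in France. It is large.", "Paris is in France."],
    entail_prob=toy_nli,
)
# scores == [1.0, 0.0]: the first unit is entailed by both candidates,
# the second by neither.
```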
The Claim-QA approach demonstrated here is adapted from Farquhar et al. (2024). The original response :math:`y` is decomposed into units (claims or sentences), and an LLM is used to convert each unit :math:`s` into a question for which that unit would be the answer. The method measures consistency across multiple responses to these questions, effectively applying standard black-box uncertainty quantification to the sampled responses to the unit questions. Formally, a claim-QA scorer :math:`c_g(s;\cdot)` is defined as follows:
.. math::

   c_g(s; \cdot) = g\left(y_0^{(s)}, \mathbf{y}^{(s)}_{\text{cand}}\right)

where :math:`y_0^{(s)}` is the original unit response, :math:`\mathbf{y}^{(s)}_{\text{cand}}` are the sampled candidate responses to the unit question, and :math:`g` is a standard black-box UQ scorer.
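The claim-QA pipeline described above can be sketched as below; every callable (the question-generation "LLM", the answer sampler, and the black-box scorer) is a hypothetical stand-in, not the library's actual interface:

```python
from typing import Callable, List

def claim_qa_score(unit: str,
                   make_question: Callable[[str], str],
                   sample_answers: Callable[[str, int], List[str]],
                   blackbox_scorer: Callable[[str, List[str]], float],
                   m: int = 5) -> float:
    """Convert a unit into a question, sample answers to it, and apply a
    standard black-box UQ scorer to the original unit response vs. the
    sampled candidates."""
    question = make_question(unit)
    y0 = sample_answers(question, 1)[0]       # original unit response y_0
    candidates = sample_answers(question, m)  # candidate responses y_1..y_m
    return blackbox_scorer(y0, candidates)

# Toy stubs: a deterministic "LLM" and an exact-match consistency scorer.
make_q = lambda u: f"What fact does this statement assert: {u}"
answers = lambda q, k: ["Paris is the capital of France."] * k
match_rate = lambda y0, cands: sum(y == y0 for y in cands) / len(cands)

score = claim_qa_score("Paris is the capital of France.",
                       make_q, answers, match_rate)
# score == 1.0: every sampled answer agrees with the original response.
```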
**Key Properties:**
- Claim- or sentence-level scoring
- More complex (cost and latency) than LUQ-style scoring methods
| Tutorial | Great fit for... | LLM Compatibility | Added Cost/Latency |
|----------|------------------|-------------------|--------------------|
|[Black-Box UQ](https://github.com/cvs-health/uqlm/blob/main/examples/black_box_demo.ipynb)| Quick setup with any LLM; no need for model internals | All LLMs (API-only access) | Medium-High (multiple generations and comparisons) |
|[White-Box UQ (Single-Generation)](https://github.com/cvs-health/uqlm/blob/main/examples/white_box_single_generation_demo.ipynb)| Fastest and most efficient UQ when you have token probabilities | Requires token probability access | Negligible (single generation) |
|[White-Box UQ (Multi-Generation)](https://github.com/cvs-health/uqlm/blob/main/examples/white_box_multi_generation_demo.ipynb)| Higher accuracy UQ when compute budget allows | Requires token probability access | Medium-High (multiple generations) |
|[LLM-as-a-Judge](https://github.com/cvs-health/uqlm/blob/main/examples/judges_demo.ipynb)| Leveraging one or more LLMs to assess hallucination likelihood | All LLMs (API-only access) | Low-Medium (depends on which judge(s)) |
|[Train a UQ Ensemble](https://github.com/cvs-health/uqlm/blob/main/examples/ensemble_tuning_demo.ipynb)| Maximizing performance by combining multiple UQ methods | Depends on ensemble components | Low-High (depends on selected components) |
### Tutorials for Long-Form Uncertainty Quantification Methods (for long-text outputs)
| Tutorial | Great fit for... | LLM Compatibility | Added Cost/Latency |
|----------|------------------|-------------------|--------------------|
|[LUQ method](https://github.com/cvs-health/uqlm/blob/main/examples/long_text_uq_demo.ipynb)| Detecting claim-level hallucinations in long-form text without model internals | All LLMs (API-only access) | Medium-High (operates over all claims/sentences in original response) |
|[Graph-based method](https://github.com/cvs-health/uqlm/blob/main/examples/long_text_graph_demo.ipynb)| Analyzing claim relationships in complex responses | All LLMs (API-only access) | Very High (operates over all claims/sentences in original response and sampled responses) |
|[Generalized Long-form semantic entropy](https://github.com/cvs-health/uqlm/blob/main/examples/long_text_qa_demo.ipynb)| Flexible, detailed approach to claim-level hallucination detection | All LLMs (API-only access) | High (operates over all claims/sentences in original response) |
### Other Tutorials and SOTA Method Examples
| Tutorial | Great fit for... | LLM Compatibility | Added Cost/Latency |
|----------|------------------|-------------------|--------------------|
|[Multimodal UQ](https://github.com/cvs-health/uqlm/blob/main/examples/multimodal_demo.ipynb)| Uncertainty quantification with image+text inputs | Requires image-to-text model | Varies by method |
|[Score Calibration](https://github.com/cvs-health/uqlm/blob/main/examples/score_calibration_demo.ipynb)| Converting raw scores to calibrated probabilities as a postprocessing step | Works with any UQ method | Negligible |
|[Semantic Entropy](https://github.com/cvs-health/uqlm/blob/main/examples/semantic_entropy_demo.ipynb)| State-of-the-art UQ when token probabilities are available | Requires token probability access | Medium-High (multiple generations and comparisons) |
|[Semantic Density](https://github.com/cvs-health/uqlm/blob/main/examples/semantic_density_demo.ipynb)| Newest SOTA method for high-accuracy UQ | Requires token probability access | Medium-High (multiple generations and comparisons) |
|[BS Detector Off-the-Shelf UQ Ensemble](https://github.com/cvs-health/uqlm/blob/main/examples/ensemble_off_the_shelf_demo.ipynb)| Ready-to-use ensemble without training | Depends on ensemble components | Medium-High (multiple generations and comparisons) |
## Where should I start?
We recommend starting with the [Black-Box UQ](https://github.com/cvs-health/uqlm/blob/main/examples/black_box_demo.ipynb) notebook if you're new to uncertainty quantification or don't have access to model internals.
For the most efficient approach with minimal compute requirements, try the [White-Box UQ (Single-Generation)](https://github.com/cvs-health/uqlm/blob/main/examples/white_box_single_generation_demo.ipynb) notebook if you have access to token probabilities.
For long-form text evaluation, the [LUQ method](https://github.com/cvs-health/uqlm/blob/main/examples/long_text_uq_demo.ipynb) provides a good starting point that works with any LLM API.