diff --git a/INTEGRATION_GUIDE.md b/INTEGRATION_GUIDE.md
new file mode 100644
index 0000000..93df8f2
--- /dev/null
+++ b/INTEGRATION_GUIDE.md
@@ -0,0 +1,527 @@
# Semantic Similarity Rating (SSR) - Integration Guide for LLM Applications

## Overview

This guide explains how to integrate Semantic Similarity Rating (SSR) logic into applications that generate LLM responses and need to convert them into Likert-scale probability distributions.

**What SSR Does**: Converts free-text LLM responses (e.g., "I somewhat agree") into probability distributions across Likert scale points (e.g., `[0.05, 0.15, 0.35, 0.35, 0.10]` for a 5-point scale from "Strongly Disagree" to "Strongly Agree").

**Why This Matters**: Instead of forcing LLMs to output a single numeric rating (1-5), SSR preserves uncertainty and nuance by quantifying the semantic similarity between the response and each scale point.

---

## Core Concepts

### 1. The SSR Equation

SSR computes a probability mass function (PMF) using cosine similarity between embeddings:

```
p[r] = (similarity[r] - min_similarity + ε·δ[r, r_min]) /
       (sum_all_similarities - n_points * min_similarity + ε)
```

Where:
- `similarity[r]`: Cosine similarity between the response embedding and reference statement `r`
- `min_similarity`: Minimum similarity across all scale points (baseline subtraction)
- `n_points`: Number of Likert scale points (typically 5)
- `ε` (epsilon): Optional regularization parameter (default: 0.0)
- `δ[r, r_min]`: 1 for the minimum-similarity point `r_min`, 0 otherwise, so the probabilities still sum to 1

**Key Insight**: Subtracting the minimum similarity creates a "relative similarity" that emphasizes distinctions between scale points.

### 2. Temperature Scaling (Optional)

After computing the base PMF, you can apply temperature scaling to control distribution sharpness:

```
p_scaled[i] = p[i]^(1/T) / sum(p[j]^(1/T) for all j)
```

- **T = 0**: One-hot encoding (argmax of probabilities)
- **T = 1**: No scaling (identity)
- **T > 1**: Softer distribution (more uniform)
- **T < 1**: Sharper distribution (more peaked)

---

## Implementation Components to Borrow

### 1. **Core Math Functions** (`compute.py`)

#### `response_embeddings_to_pmf(matrix_responses, matrix_likert_sentences, epsilon=0.0)`

**Purpose**: Converts embeddings to probability distributions using the SSR equation.

**Input**:
- `matrix_responses`: numpy array of shape `(n_responses, embedding_dim)` - LLM response embeddings
- `matrix_likert_sentences`: numpy array of shape `(embedding_dim, n_scale_points)` - Reference embeddings (transposed)
- `epsilon`: Regularization parameter (default: 0.0)

**Output**:
- numpy array of shape `(n_responses, n_scale_points)` - Probability distributions

**Algorithm**:
```python
import numpy as np

# 1. Normalize embeddings to unit L2 norm
M_left = matrix_responses / np.linalg.norm(matrix_responses, axis=1, keepdims=True)
M_right = matrix_likert_sentences / np.linalg.norm(matrix_likert_sentences, axis=0, keepdims=True)

# 2. Compute cosine similarities, rescaled from [-1, 1] to [0, 1]
cos = (1 + M_left @ M_right) / 2

# 3. Subtract each response's minimum similarity
#    (kronecker_delta is 1 at the minimum-similarity entry, 0 elsewhere)
cos_min = cos.min(axis=1, keepdims=True)
numerator = cos - cos_min + epsilon * kronecker_delta

# 4. Normalize so each row sums to 1
denominator = cos.sum(axis=1, keepdims=True) - n_points * cos_min + epsilon
pmf = numerator / denominator
```

#### `scale_pmf(pmf, temperature, max_temp=inf)`

**Purpose**: Apply temperature scaling to control distribution sharpness.

**Input**:
- `pmf`: 1D array of probabilities
- `temperature`: Scaling parameter (0 to inf)
- `max_temp`: Optional ceiling on the temperature

**Output**: Scaled PMF (still sums to 1)
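
For orientation, here is a minimal, self-contained sketch of the two functions above. It is pure numpy; the names `ssr_pmf` and `temper` are our own, and the placement of ε on the minimum-similarity entry follows the equation above - consult `compute.py` for the library's exact behavior:

```python
import numpy as np

def ssr_pmf(responses: np.ndarray, references: np.ndarray, epsilon: float = 0.0) -> np.ndarray:
    """SSR sketch: responses (n, d) and references (d, k) -> PMFs (n, k)."""
    L = responses / np.linalg.norm(responses, axis=1, keepdims=True)
    R = references / np.linalg.norm(references, axis=0, keepdims=True)
    cos = (1 + L @ R) / 2                     # cosine similarity mapped to [0, 1]
    cos_min = cos.min(axis=1, keepdims=True)
    delta = np.zeros_like(cos)                # epsilon mass on the minimum entry
    delta[np.arange(len(cos)), cos.argmin(axis=1)] = 1.0
    numerator = cos - cos_min + epsilon * delta
    denominator = cos.sum(axis=1, keepdims=True) - cos.shape[1] * cos_min + epsilon
    return numerator / denominator

def temper(pmf: np.ndarray, temperature: float) -> np.ndarray:
    """Temperature-scaling sketch; T=0 collapses to one-hot at the argmax."""
    if temperature == 0:
        out = np.zeros_like(pmf)
        out[np.argmax(pmf)] = 1.0
        return out
    p = pmf ** (1.0 / temperature)
    return p / p.sum()

# Smoke test with random embeddings: every row should sum to 1.
rng = np.random.default_rng(0)
pmfs = ssr_pmf(rng.normal(size=(3, 384)), rng.normal(size=(384, 5)), epsilon=0.01)
print(pmfs.sum(axis=1))
```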

---

### 2. **Orchestration Class** (`response_rater.py`)

The `ResponseRater` class provides a higher-level interface that manages:
- Multiple reference sets (different phrasings of Likert scales)
- Automatic embedding computation
- Reference set selection and averaging

**Key Features for Integration**:

#### Dual Operating Modes

**Text Mode** (recommended for most applications):
```python
# Automatically computes embeddings using sentence-transformers
rater = ResponseRater(df_references)  # No embedding column
pmfs = rater.get_response_pmfs('set1', ["I agree", "Not sure"])
```

**Embedding Mode** (for custom embedding pipelines):
```python
# Uses pre-computed embeddings
rater = ResponseRater(df_references_with_embeddings)
pmfs = rater.get_response_pmfs('set1', embedding_matrix)
```

#### Reference Set Management

```python
# Use a specific reference set
pmfs = rater.get_response_pmfs('set1', responses)

# Average across all reference sets (more robust)
pmfs = rater.get_response_pmfs('mean', responses)

# Get a survey-level aggregate (average of the PMFs)
survey_pmf = rater.get_survey_response_pmf(pmfs)
```

---

## Data Structures

### Reference Sentences DataFrame

**Required Structure** (Polars DataFrame, or convert from pandas):

```python
import polars as po

df_references = po.DataFrame({
    'id': ['set1', 'set1', 'set1', 'set1', 'set1',   # Reference set ID
           'set2', 'set2', 'set2', 'set2', 'set2'],
    'int_response': [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],  # Must be 1-5 for each set
    'sentence': [
        # Set 1 (formal phrasing)
        'Strongly disagree', 'Disagree', 'Neutral',
        'Agree', 'Strongly agree',
        # Set 2 (informal phrasing)
        'Disagree a lot', 'Kinda disagree', 'Don\'t know',
        'Kinda agree', 'Agree a lot'
    ]
})
```

**Validation Requirements** (a sketch of a pre-flight check follows this list):
- Each reference set (`id`) must have exactly 5 sentences
- `int_response` must be [1, 2, 3, 4, 5] for each set
- Reserved ID: `'mean'` cannot be used (it is reserved for averaging)
- Optional `embedding` column for pre-computed embeddings
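
If you assemble this DataFrame dynamically, it is worth validating it before handing it to `ResponseRater`. A minimal sketch of such a check (the helper name `validate_references` is our own, not part of the library):

```python
import polars as po

def validate_references(df: po.DataFrame) -> None:
    """Check a reference-sentence DataFrame against the requirements above."""
    set_ids = df['id'].unique().to_list()
    if 'mean' in set_ids:
        raise ValueError("reference set id 'mean' is reserved for averaging")
    for set_id in set_ids:
        group = df.filter(po.col('id') == set_id)
        if group.height != 5:
            raise ValueError(f"reference set {set_id!r} must have exactly 5 sentences")
        if sorted(group['int_response'].to_list()) != [1, 2, 3, 4, 5]:
            raise ValueError(f"reference set {set_id!r} must cover int_response 1-5")

validate_references(df_references)  # raises on malformed input, returns None otherwise
```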

---

## Integration Workflow

### Minimal Integration (Just the Math)

If your app already has embeddings:

```python
import numpy as np
from semantic_similarity_rating.compute import response_embeddings_to_pmf, scale_pmf

# Your app generates these
llm_response_embeddings = np.array([[...], [...]])      # Shape: (n_responses, 384)
reference_embeddings = np.array([[...], [...], ...]).T  # Shape: (384, 5) - TRANSPOSED!

# Convert to PMFs
pmfs = response_embeddings_to_pmf(llm_response_embeddings, reference_embeddings)

# Optional: Apply temperature scaling
pmfs_scaled = np.array([scale_pmf(pmf, temperature=0.8) for pmf in pmfs])

# Get survey aggregate
survey_pmf = pmfs_scaled.mean(axis=0)
```

**Key Detail**: Reference embeddings must be **transposed** (shape: `embedding_dim x n_scale_points`).

---

### Full Integration (Text to PMF)

If your app generates text responses:

```python
import polars as po
from semantic_similarity_rating import ResponseRater

# 1. Set up reference sentences (one-time setup)
df_references = po.DataFrame({
    'id': ['likert_v1'] * 5,
    'int_response': [1, 2, 3, 4, 5],
    'sentence': [
        'Strongly disagree',
        'Disagree',
        'Neutral',
        'Agree',
        'Strongly agree'
    ]
})

# 2. Initialize rater (loads the sentence-transformer model)
rater = ResponseRater(df_references, model_name='all-MiniLM-L6-v2')

# 3. Your app generates LLM responses
llm_responses = [
    "I completely agree with this",
    "I'm not really sure about this",
    "I strongly disagree"
]

# 4. Convert to PMFs
pmfs = rater.get_response_pmfs(
    reference_set_id='likert_v1',
    llm_responses=llm_responses,
    temperature=1.0,
    epsilon=0.0
)

# 5. Get aggregate survey response
survey_pmf = rater.get_survey_response_pmf(pmfs)

print("Individual PMFs:")
print(pmfs)        # Shape: (3, 5)
print("\nSurvey-level PMF:")
print(survey_pmf)  # Shape: (5,)
```

---

## Advanced: Multiple Reference Sets

Using multiple phrasings improves robustness:

```python
df_references = po.DataFrame({
    'id': ['formal'] * 5 + ['casual'] * 5 + ['academic'] * 5,
    'int_response': [1, 2, 3, 4, 5] * 3,
    'sentence': [
        # Formal
        'Strongly disagree', 'Disagree', 'Neutral', 'Agree', 'Strongly agree',
        # Casual
        'Totally disagree', 'Kinda disagree', 'Meh', 'Kinda agree', 'Totally agree',
        # Academic
        'Reject entirely', 'Reject partially', 'Withhold judgment',
        'Accept partially', 'Accept entirely'
    ]
})

rater = ResponseRater(df_references)

# Use a specific set
pmfs_formal = rater.get_response_pmfs('formal', responses)

# Average across all sets (recommended for robustness)
pmfs_averaged = rater.get_response_pmfs('mean', responses)
```

---

## Technical Considerations

### 1. **Embedding Model Selection**

**Default**: `all-MiniLM-L6-v2` (sentence-transformers)
- **Pros**: Fast, lightweight (80MB), good general performance
- **Embedding dim**: 384
- **Max tokens**: 256

**Alternatives**:
- `all-mpnet-base-v2`: Better quality, slower (420MB, 768-dim)
- `paraphrase-multilingual-mpnet-base-v2`: Multilingual support
- Custom models via sentence-transformers or the OpenAI/Cohere APIs

**Integration Tip**: If using custom embeddings (OpenAI, Cohere, etc.), use **embedding mode**:

```python
# Pre-compute reference embeddings with your provider
# (shown here with the OpenAI Python client; any embedding API works)
from openai import OpenAI

client = OpenAI()
result = client.embeddings.create(
    model='text-embedding-3-small',
    input=[ref1, ref2, ref3, ref4, ref5]
)
reference_embeddings = [item.embedding for item in result.data]

df_with_embeddings = po.DataFrame({
    'id': ['set1'] * 5,
    'int_response': [1, 2, 3, 4, 5],
    'sentence': [...],
    'embedding': reference_embeddings  # List of lists or arrays
})

rater = ResponseRater(df_with_embeddings)  # Auto-detects embedding mode
```

### 2. **Performance Optimization**

- **Batch Processing**: `model.encode()` is vectorized - pass lists instead of looping
- **GPU Acceleration**: Set `device='cuda'` in the ResponseRater constructor
- **Caching**: Pre-compute reference embeddings once and reuse them across requests
- **Embedding Size**: Smaller models (384-dim) are 2-3x faster than large ones (768-dim)

### 3. **Parameter Tuning**

**Epsilon (ε)**:
- **Default**: 0.0 (no regularization)
- **Use case**: Prevents numerical instability when all similarities are equal
- **Typical range**: 0.0 to 0.01
- **Effect**: Gives the minimum-similarity scale point a small nonzero probability instead of exactly zero

**Temperature (T)** (a short worked example follows this list):
- **Default**: 1.0 (no scaling)
- **T < 1**: Sharper distributions (more confident)
- **T > 1**: Softer distributions (less confident)
- **T = 0**: One-hot encoding (forces a single choice)
- **Typical range**: 0.5 to 2.0
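
To make the temperature behavior concrete, here is a small worked example using the `scale_pmf` signature described above (the input numbers are illustrative):

```python
import numpy as np
from semantic_similarity_rating.compute import scale_pmf

pmf = np.array([0.05, 0.15, 0.35, 0.30, 0.15])

print(scale_pmf(pmf, temperature=1.0))  # unchanged: [0.05 0.15 0.35 0.30 0.15]
print(scale_pmf(pmf, temperature=0.5))  # sharper: mass concentrates on points 3-4
print(scale_pmf(pmf, temperature=2.0))  # softer: closer to uniform
print(scale_pmf(pmf, temperature=0.0))  # one-hot: [0. 0. 1. 0. 0.] (argmax at point 3)
```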
### 4. **Edge Cases**

**Empty responses**:
```python
# Returns an empty array
pmfs = response_embeddings_to_pmf(np.empty((0, 384)), reference_matrix)
# Shape: (0, 5)
```

**Identical embeddings**:
- All similarities equal → the PMF becomes the uniform distribution `[0.2, 0.2, 0.2, 0.2, 0.2]`
- A small epsilon keeps this degenerate case numerically stable

**Temperature = 0 with ties**:
- If multiple scale points share the maximum probability → returns the original PMF
- Otherwise → one-hot at the argmax
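
These behaviors can be sanity-checked against your installed version with a short snippet (a sketch using only the two documented functions; expected results per the notes above):

```python
import numpy as np
from semantic_similarity_rating.compute import response_embeddings_to_pmf, scale_pmf

reference_matrix = np.random.default_rng(0).normal(size=(384, 5))

# Empty input -> empty output
print(response_embeddings_to_pmf(np.empty((0, 384)), reference_matrix).shape)  # (0, 5)

# Identical embeddings -> all similarities equal; expected ~uniform per the notes above
print(response_embeddings_to_pmf(np.ones((1, 384)), np.ones((384, 5)), epsilon=0.01))

# Temperature 0 with a tie at the max -> the original PMF comes back unchanged
print(scale_pmf(np.array([0.4, 0.4, 0.1, 0.05, 0.05]), temperature=0.0))
```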

---

## Example Application Integration

### Survey Application Flow

```python
import numpy as np
from semantic_similarity_rating import ResponseRater


class SurveyProcessor:
    def __init__(self, reference_sentences_df):
        self.rater = ResponseRater(
            reference_sentences_df,
            model_name='all-MiniLM-L6-v2',
            device='cpu'  # or 'cuda'
        )

    def process_survey_question(
        self,
        question: str,
        llm_responses: list[str],
        temperature: float = 1.0
    ) -> dict:
        """
        Process LLM responses for a single survey question.

        Returns:
            {
                'individual_pmfs': array of shape (n_responses, 5),
                'survey_pmf': array of shape (5,),
                'expected_value': float (1-5 scale),
                'distribution_entropy': float
            }
        """
        # Get PMFs (averaged across reference sets for robustness)
        pmfs = self.rater.get_response_pmfs(
            reference_set_id='mean',
            llm_responses=llm_responses,
            temperature=temperature
        )

        # Aggregate to survey level
        survey_pmf = self.rater.get_survey_response_pmf(pmfs)

        # Compute summary statistics
        scale_points = np.array([1, 2, 3, 4, 5])
        expected_value = np.dot(survey_pmf, scale_points)
        entropy = -np.sum(survey_pmf * np.log(survey_pmf + 1e-10))

        return {
            'individual_pmfs': pmfs,
            'survey_pmf': survey_pmf,
            'expected_value': expected_value,
            'distribution_entropy': entropy
        }

    def process_full_survey(
        self,
        questions: list[str],
        responses_per_question: list[list[str]]
    ) -> list[dict]:
        """Process all questions in a survey."""
        return [
            self.process_survey_question(q, responses)
            for q, responses in zip(questions, responses_per_question)
        ]
```

**Usage**:
```python
# Setup
processor = SurveyProcessor(df_references)

# Your app generates these
llm_responses = [
    "I think this is pretty good",
    "Not convinced about this",
    "Absolutely love it!"
]

# Process
results = processor.process_survey_question(
    question="How satisfied are you with the product?",
    llm_responses=llm_responses,
    temperature=1.0
)

print(f"Expected rating: {results['expected_value']:.2f}/5.0")
print(f"Distribution: {results['survey_pmf']}")
print(f"Uncertainty (entropy): {results['distribution_entropy']:.3f}")
```

---

## Dependencies

**Minimal** (just the math):
```
numpy>=1.24.0
scipy>=1.10.0
```

**Full** (with text embedding):
```
numpy>=1.24.0
scipy>=1.10.0
polars>=0.20.0
sentence-transformers>=2.2.0
beartype>=0.15.0
```

**Installation**:
```bash
pip install numpy scipy polars sentence-transformers beartype
```

Or install from this repository:
```bash
pip install git+https://github.com/pymc-labs/semantic-similarity-rating.git
```

---

## Key Files to Reference

- **`semantic_similarity_rating/compute.py`** (123 lines)
  Core mathematical functions - can be extracted as a standalone module

- **`semantic_similarity_rating/response_rater.py`** (368 lines)
  Orchestration layer - adapt it to your application's needs

- **`tests/test_compute.py`** (8KB)
  Test cases showing expected behavior and edge cases

- **`tests/test_response_rater.py`** (9KB)
  Integration test examples

---

## Citation

If you use this methodology, please cite:

```
Maier, B. F., Aslak, U., Fiaschi, L., Pappas, K., & Wiecki, T. (2025).
Measuring Synthetic Consumer Purchase Intent Using Semantic-Similarity Ratings.
```

---

## License

MIT License - free to use and modify for commercial and non-commercial applications.

---

## Quick Reference Card

| **Task** | **Code** |
|----------|----------|
| Convert text responses to PMFs | `rater.get_response_pmfs('set1', ["response1", "response2"])` |
| Use averaged reference sets | `rater.get_response_pmfs('mean', responses)` |
| Apply temperature scaling | `rater.get_response_pmfs('set1', responses, temperature=0.8)` |
| Get survey aggregate | `rater.get_survey_response_pmf(pmfs)` |
| Just the math (embeddings → PMF) | `compute.response_embeddings_to_pmf(resp_emb, ref_emb)` |
| Scale an existing PMF | `compute.scale_pmf(pmf, temperature=0.5)` |
| Use a custom embedding model | Include an `'embedding'` column in the DataFrame |
| Check available reference sets | `rater.available_reference_sets` |
| Get model info | `rater.model_info` |

---

## Common Pitfalls

1. **Reference embedding shape**: Must be `(embedding_dim, 5)`, not `(5, embedding_dim)` - transpose if needed!
2. **Reference DataFrame validation**: Each set needs exactly 5 sentences with `int_response` = [1, 2, 3, 4, 5]
3. **Mode confusion**: Text mode expects `list[str]`; embedding mode expects `np.ndarray`
4. **Temperature = 0**: Only use it if you want hard classification (one-hot output)
5. **Polars vs Pandas**: Use `polars.DataFrame` or convert: `po.from_pandas(df)`

---

## Support

- **Original Repository**: https://github.com/pymc-labs/semantic-similarity-rating
- **Paper**: Maier et al. (2025)
- **License**: MIT