70 changes: 68 additions & 2 deletions README.md
@@ -262,7 +262,73 @@ You will notice that the data is ingested into the `data/cache` directory and st

These datasets are also stored as wandb artifacts in the project defined in the environment variable `WANDB_PROJECT` and can be accessed from the [wandb dashboard](https://wandb.ai/wandb/wandbot-dev).

#### Ingestion pipeline debugging
### Evaluating a file with Precomputed Answers

Instead of hitting the wandbot endpoint, you can pass a `.json` file of precomputed answers for evaluation by the `WandbotCorrectnessScorer`. To do so, pass a file path to the `precomputed_answers_json_path` parameter in the `EvalConfig`. The file should contain a JSON list of objects, each with a question and its precomputed answer.

The evaluation system will try to match questions from your evaluation dataset (defined by `eval_dataset` in `EvalConfig`) to the `question` field in these objects using exact string matching (after stripping leading/trailing whitespace).

Each object in the JSON list should have the following structure:

**Required Fields:**

* `question` (string): The question text. This is used to match against questions in the evaluation dataset.
* `generated_answer` (string): The precomputed answer text. This will be used as `EvalChatResponse.answer`.

**Fields for Contextual Scoring:**
To enable context-based scoring (e.g., by the `WandbotCorrectnessScorer`), you can provide context information through one of the following fields in each JSON object, although it is not required:

* `retrieved_contexts` (List of Dicts): A list of context documents. Each dictionary in the list should represent a document and ideally have `"source"` (string, URL) and `"content"` (string, text of the document) keys. Minimally, a `"content"` key is needed for the scorer.
*Example*: `[{"source": "http://example.com/doc1", "content": "Context snippet 1."}, {"content": "Context snippet 2."}]`
* **OR** `source_documents` (string): A raw string representation of source documents that can be parsed by the system (specifically, by the `parse_text_to_json` function in `eval.py`). This string usually contains multiple documents, each prefixed by something like "source: http://...".

If only `source_documents` (string) is provided and `retrieved_contexts` (list) is not, the system will attempt to parse `source_documents` to populate the `retrieved_contexts` field for the `EvalChatResponse`. If neither is provided, context-based scoring for that precomputed answer will operate with empty context.
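
As a rough illustration of the parsing fallback described above, the sketch below shows one way a `source_documents` string could be split into context dictionaries. This is not the actual `parse_text_to_json` implementation in `eval.py` (which may behave differently); it only assumes each document begins with a `source: <url>` line followed by its content:

```python
# Illustrative sketch only; the real parsing is done by parse_text_to_json in eval.py.
def parse_source_documents(source_documents: str) -> list[dict]:
    contexts: list[dict] = []
    current: dict | None = None
    for line in source_documents.splitlines():
        if line.startswith("source: "):
            if current is not None:
                contexts.append(current)
            # Start a new document keyed by its source URL.
            current = {"source": line[len("source: "):].strip(), "content": ""}
        elif current is not None:
            # Accumulate content lines for the current document.
            current["content"] = (current["content"] + "\n" + line).strip()
    if current is not None:
        contexts.append(current)
    return contexts
```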

**Optional Fields (to fully populate `EvalChatResponse` and mimic live API calls):**

* `system_prompt` (string): The system prompt used.
* `sources` (string): A string listing sources (can be similar to `source_documents` or a different format).
* `model` (string): The name of the model that generated the answer (e.g., "precomputed_gpt-4").
* `total_tokens` (int): Total tokens used.
* `prompt_tokens` (int): Prompt tokens used.
* `completion_tokens` (int): Completion tokens used.
* `time_taken` (float): Time taken for the call.
* `api_call_statuses` (dict): Dictionary of API call statuses.
* `start_time` (string): ISO 8601 formatted start time of the call (e.g., `"2023-10-27T10:00:00Z"`).
* `end_time` (string): ISO 8601 formatted end time of the call.
* `has_error` (boolean): Set to `true` if this precomputed item represents an error response.
* `error_message` (string): The error message if `has_error` is `true`.

**Example Object:**
```json
{
"question": "What is Weights & Biases?",
"generated_answer": "Weights & Biases is an MLOps platform.",
"retrieved_contexts": [
{"source": "http://example.com/docA", "content": "Content of document A talking about W&B."},
{"content": "Another piece of context."}
],
"model": "precomputed_from_file",
"system_prompt": "You are a helpful assistant.",
"total_tokens": 50,
"has_error": false
}
```
Or using `source_documents`:
```json
{
"question": "What is Weights & Biases?",
"generated_answer": "Weights & Biases is an MLOps platform.",
"source_documents": "source: http://example.com/docA\\nContent of document A talking about W&B.\\nsource: http://example.com/docB\\nContent of document B.",
"model": "precomputed_from_file",
"system_prompt": "You are a helpful assistant.",
"total_tokens": 50,
"has_error": false
}
```
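
To use such a file, set `precomputed_answers_json_path` on your `EvalConfig`. Below is a minimal sketch, assuming `EvalConfig` is instantiated directly in a small eval script; the import path and dataset reference are placeholders rather than the project's actual values:

```python
# Sketch only: the import path and dataset URI are hypothetical placeholders.
from wandbot.evaluation.eval_config import EvalConfig  # hypothetical module path

config = EvalConfig(
    eval_dataset="weave:///<entity>/<project>/object/<dataset>:<version>",  # your eval dataset ref
    precomputed_answers_json_path="data/precomputed_answers.json",  # file in the format shown above
)
```

The `WandbotCorrectnessScorer` then scores the precomputed `generated_answer` for each matched question instead of calling the live endpoint.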


### Ingestion pipeline debugging

To help with debugging, you can use the `steps` and `include_sources` flags to run only sub-components of the pipeline and only certain document sources. For example, if you wanted to stop the pipeline before it creates the vector DB, the artifacts, and the W&B report, and you only wanted to process the Weave documentation, you would do the following:

@@ -291,4 +357,4 @@ B. If you don't compute a diff or want a simple way to do this
[] **Deployment:** Clone the repo to a prod environment. Deploy updated version. Test via CLI and slackbot that the endpoint is working and the correct response is received.
[] **GitHub:** Update evaluation table at top of README with latest eval score, Weave Eval link and data ingestion Report link
[] **GitHub:** Update git tag
[] **GitHub:** Create GitHub release
4 changes: 2 additions & 2 deletions src/wandbot/configs/chat_config.py
@@ -44,10 +44,10 @@ class ChatConfig(BaseSettings):

# Response synthesis model settings
response_synthesizer_provider: str = "anthropic"
response_synthesizer_model: str = "claude-3-7-sonnet-20250219"
response_synthesizer_model: str = "claude-4-opus-20250514" # "claude-4-sonnet-20250514" #"claude-3-7-sonnet-20250219"
response_synthesizer_temperature: float = 0.1
response_synthesizer_fallback_provider: str = "anthropic"
response_synthesizer_fallback_model: str = "claude-3-7-sonnet-20250219"
response_synthesizer_fallback_model: str = "claude-4-opus-20250514" # "claude-4-sonnet-20250514" # "claude-3-7-sonnet-20250219"
response_synthesizer_fallback_temperature: float = 0.1

# Translation models settings
139 changes: 139 additions & 0 deletions src/wandbot/evaluation/data_utils.py
@@ -0,0 +1,139 @@
import json
import logging
from typing import Dict, List, Optional, Any

import weave

# From eval.py
def sanitize_precomputed_item_recursive(item: Any) -> Any:
"""Recursively sanitize an item by converting None values to empty strings."""
if isinstance(item, dict):
return {k: sanitize_precomputed_item_recursive(v) for k, v in item.items()}
elif isinstance(item, list):
return [sanitize_precomputed_item_recursive(elem) for elem in item]
elif item is None:
return ""
return item


def load_and_prepare_precomputed_data(
file_path: Optional[str], logger: logging.Logger
) -> Optional[Dict[str, Dict]]:
"""Loads, sanitizes, and prepares precomputed answers from a JSON file into a map."""
if not file_path:
return None

logger.info(f"Loading precomputed answers from: {file_path}")
try:
with open(file_path, "r") as f:
loaded_answers_raw = json.load(f)

if not isinstance(loaded_answers_raw, list):
raise ValueError("Precomputed answers JSON must be a list of items.")

loaded_answers_sanitized = []
for raw_item in loaded_answers_raw:
if not isinstance(raw_item, dict):
raise ValueError(f"Skipping non-dictionary item in precomputed answers: {raw_item}")
sanitized_item = sanitize_precomputed_item_recursive(raw_item)
loaded_answers_sanitized.append(sanitized_item)
logger.debug(f"Sanitized precomputed item: {sanitized_item}")

precomputed_answers_map = {}
for i, item in enumerate(loaded_answers_sanitized):
if not isinstance(item, dict):
raise ValueError(
f"Item at original index {i} in precomputed answers (post-sanitization) is not a dictionary."
)

item_index_str: str
raw_item_index = item.get("index")

if raw_item_index is None or str(raw_item_index).strip() == "":
logger.warning(
f"Item (original index {i}) is missing 'index' or index is empty after sanitization. "
f"Content: {str(item.get('question', 'N/A'))[:50]+'...'}. Using list index {i} as fallback string key."
)
item_index_str = str(i)
else:
item_index_str = str(raw_item_index).strip()
if not item_index_str:
logger.warning(
f"Item (original index {i}) had whitespace-only 'index' after sanitization. "
f"Content: {str(item.get('question', 'N/A'))[:50]+'...'}. Using list index {i} as fallback string key."
)
item_index_str = str(i)

if item_index_str in precomputed_answers_map:
logger.warning(
f"Duplicate string index '{item_index_str}' found in precomputed answers. "
f"Overwriting with item from original list at index {i}."
)
precomputed_answers_map[item_index_str] = item

Comment on lines +67 to +73

🛠️ Refactor suggestion

Consider failing fast on duplicate indices

When the same index appears twice, the later entry silently overwrites the former.
For evaluation reproducibility it’s usually safer to raise, or at least surface a stronger warning, because duplicates often indicate a bug in the pre-computed file.

-            if item_index_str in precomputed_answers_map:
-                logger.warning(
-                    f"Duplicate string index '{item_index_str}' found in precomputed answers. "
-                    f"Overwriting with item from original list at index {i}."
-                )
-            precomputed_answers_map[item_index_str] = item
+            if item_index_str in precomputed_answers_map:
+                raise ValueError(
+                    f"Duplicate string index '{item_index_str}' encountered (item positions "
+                    f"{precomputed_answers_map[item_index_str].get('__orig_pos__', 'N/A')} and {i})."
+                )
+            # store original position for better error messages later
+            item["__orig_pos__"] = i
+            precomputed_answers_map[item_index_str] = item
🤖 Prompt for AI Agents
In src/wandbot/evaluation/data_utils.py around lines 67 to 73, the code
currently logs a warning and overwrites entries when duplicate indices are found
in precomputed_answers_map. To fail fast and improve reproducibility, replace
the warning and overwrite behavior with raising an exception when a duplicate
index is detected. This will immediately surface the issue and prevent silent
data overwrites.

logger.info(
f"Loaded {len(precomputed_answers_map)} precomputed answers into map from {len(loaded_answers_sanitized)} sanitized items."
)
return precomputed_answers_map

except FileNotFoundError:
logger.error(f"Precomputed answers JSON file not found: {file_path}")
raise
except ValueError as e:
logger.error(f"Invalid format in precomputed answers JSON: {e}")
raise
except Exception as e:
logger.error(f"Failed to load or parse precomputed answers JSON: {e}")
raise


def load_and_prepare_dataset_rows(
dataset_ref_uri: str, is_debug: bool, n_debug_samples: int, logger: logging.Logger
) -> List[Dict]:
"""Loads dataset rows from a Weave reference, applies debug sampling, and prepares them for evaluation."""
dataset_ref = weave.ref(dataset_ref_uri).get()
question_rows = dataset_ref.rows

if is_debug:
question_rows = question_rows[:n_debug_samples]

question_rows_for_eval = []
for i, row in enumerate(question_rows):
if not isinstance(row, dict):
logger.warning(f"Dataset item at original index {i} is not a dictionary, skipping: {row}")
continue

dataset_row_index_str: str
raw_dataset_index = row.get("index")
if raw_dataset_index is None or str(raw_dataset_index).strip() == "":
logger.warning(
f"Dataset item (original list index {i}, question: {str(row.get('question', 'N/A'))[:50] + '...'}) "
f"is missing 'index' or index is empty. Using list index {i} as fallback string key."
)
dataset_row_index_str = str(i)
else:
dataset_row_index_str = str(raw_dataset_index).strip()

question = row.get("question")
ground_truth = row.get("answer")
notes = row.get("notes")

if question is None:
logger.warning(f"Dataset item at index {dataset_row_index_str} is missing 'question'. Using empty string.")
question = ""
if ground_truth is None:
logger.warning(f"Dataset item at index {dataset_row_index_str} is missing 'answer'. Using empty string.")
ground_truth = ""
if notes is None:
logger.warning(f"Dataset item at index {dataset_row_index_str} is missing 'notes'. Using empty string.")
notes = ""

question_rows_for_eval.append(
{
"index": dataset_row_index_str,
"question": str(question),
"ground_truth": str(ground_truth),
"notes": str(notes),
}
)
return question_rows_for_eval