70 changes: 68 additions & 2 deletions README.md
@@ -262,7 +262,73 @@ You will notice that the data is ingested into the `data/cache` directory and st

These datasets are also stored as wandb artifacts in the project defined in the environment variable `WANDB_PROJECT` and can be accessed from the [wandb dashboard](https://wandb.ai/wandb/wandbot-dev).

#### Ingestion pipeline debugging
### Evaluating a file with Precomputed Answers

Instead of hitting the wandbot endpoint, you can pass a `.json` file of precomputed answers for evaluation by the `WandbotCorrectnessScorer`. To do so, pass a file path to the `precomputed_answers_json_path` parameter in the `EvalConfig`. The file should contain a JSON list of objects, each with a question and its precomputed answer.

The evaluation system will try to match questions from your evaluation dataset (defined by `eval_dataset` in `EvalConfig`) to the `question` field in these objects using exact string matching (after stripping leading/trailing whitespace).

Each object in the JSON list should have the following structure:

**Required Fields:**

* `question` (string): The question text. This is used to match against questions in the evaluation dataset.
* `generated_answer` (string): The precomputed answer text. This will be used as `EvalChatResponse.answer`.

**Fields for Contextual Scoring:**
To enable context-based scoring (e.g., by the `WandbotCorrectnessScorer`), you can provide context information through one of the following fields in each JSON object, although it is not required:

* `retrieved_contexts` (List of Dicts): A list of context documents. Each dictionary in the list should represent a document and ideally have `"source"` (string, URL) and `"content"` (string, text of the document) keys. Minimally, a `"content"` key is needed for the scorer.
*Example*: `[{"source": "http://example.com/doc1", "content": "Context snippet 1."}, {"content": "Context snippet 2."}]`
* **OR** `source_documents` (string): A raw string representation of source documents that can be parsed by the system (specifically, by the `parse_text_to_json` function in `eval.py`). This string usually contains multiple documents, each prefixed by something like "source: http://...".

If only `source_documents` (string) is provided and `retrieved_contexts` (list) is not, the system will attempt to parse `source_documents` to populate the `retrieved_contexts` field for the `EvalChatResponse`. If neither is provided, context-based scoring for that precomputed answer will operate with empty context.
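
As a rough illustration of the parsing fallback described above, the sketch below shows one way a `source_documents` string could be split into context dictionaries. This is not the actual `parse_text_to_json` implementation in `eval.py` (which may behave differently); it only assumes each document begins with a `source: <url>` line followed by its content:

```python
# Illustrative sketch only; the real parsing is done by parse_text_to_json in eval.py.
def parse_source_documents(source_documents: str) -> list[dict]:
    contexts: list[dict] = []
    current: dict | None = None
    for line in source_documents.splitlines():
        if line.startswith("source: "):
            if current is not None:
                contexts.append(current)
            # Start a new document keyed by its source URL.
            current = {"source": line[len("source: "):].strip(), "content": ""}
        elif current is not None:
            # Accumulate content lines for the current document.
            current["content"] = (current["content"] + "\n" + line).strip()
    if current is not None:
        contexts.append(current)
    return contexts
```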

**Optional Fields (to fully populate `EvalChatResponse` and mimic live API calls):**

* `system_prompt` (string): The system prompt used.
* `sources` (string): A string listing sources (can be similar to `source_documents` or a different format).
* `model` (string): The name of the model that generated the answer (e.g., "precomputed_gpt-4").
* `total_tokens` (int): Total tokens used.
* `prompt_tokens` (int): Prompt tokens used.
* `completion_tokens` (int): Completion tokens used.
* `time_taken` (float): Time taken for the call.
* `api_call_statuses` (dict): Dictionary of API call statuses.
* `start_time` (string): ISO 8601 formatted start time of the call (e.g., `"2023-10-27T10:00:00Z"`).
* `end_time` (string): ISO 8601 formatted end time of the call.
* `has_error` (boolean): Set to `true` if this precomputed item represents an error response.
* `error_message` (string): The error message if `has_error` is `true`.

**Example Object:**
```json
{
"question": "What is Weights & Biases?",
"generated_answer": "Weights & Biases is an MLOps platform.",
"retrieved_contexts": [
{"source": "http://example.com/docA", "content": "Content of document A talking about W&B."},
{"content": "Another piece of context."}
],
"model": "precomputed_from_file",
"system_prompt": "You are a helpful assistant.",
"total_tokens": 50,
"has_error": false
}
```
Or using `source_documents`:
```json
{
"question": "What is Weights & Biases?",
"generated_answer": "Weights & Biases is an MLOps platform.",
"source_documents": "source: http://example.com/docA\\nContent of document A talking about W&B.\\nsource: http://example.com/docB\\nContent of document B.",
"model": "precomputed_from_file",
"system_prompt": "You are a helpful assistant.",
"total_tokens": 50,
"has_error": false
}
```
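
To use such a file, set `precomputed_answers_json_path` on your `EvalConfig`. Below is a minimal sketch, assuming `EvalConfig` is instantiated directly in a small eval script; the import path and dataset reference are placeholders rather than the project's actual values:

```python
# Sketch only: the import path and dataset URI are hypothetical placeholders.
from wandbot.evaluation.eval_config import EvalConfig  # hypothetical module path

config = EvalConfig(
    eval_dataset="weave:///<entity>/<project>/object/<dataset>:<version>",  # your eval dataset ref
    precomputed_answers_json_path="data/precomputed_answers.json",  # file in the format shown above
)
```

The `WandbotCorrectnessScorer` then scores the precomputed `generated_answer` for each matched question instead of calling the live endpoint.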


### Ingestion pipeline debugging

To help with debugging, you can use the `steps` and `include_sources` flags to run only sub-components of the pipeline and only certain document sources. For example, if you wanted to stop the pipeline before it creates the vector DB, the artifacts, and the W&B report, and you only wanted to process the Weave documentation, you would do the following:

@@ -291,4 +357,4 @@ B. If you don't compute a diff or want a simple way to do this
[] **Deployment:** Clone the repo to a prod environment. Deploy updated version. Test via CLI and slackbot that the endpoint is working and the correct response is received.
[] **GitHub:** Update evaluation table at top of README with latest eval score, Weave Eval link and data ingestion Report link
[] **GitHub:** Update git tag
[] **GitHub:** Create GitHub release
4 changes: 2 additions & 2 deletions src/wandbot/configs/chat_config.py
@@ -44,10 +44,10 @@ class ChatConfig(BaseSettings):

# Response synthesis model settings
response_synthesizer_provider: str = "anthropic"
response_synthesizer_model: str = "claude-3-7-sonnet-20250219"
response_synthesizer_model: str = "claude-4-opus-20250514" # "claude-4-sonnet-20250514" #"claude-3-7-sonnet-20250219"
response_synthesizer_temperature: float = 0.1
response_synthesizer_fallback_provider: str = "anthropic"
response_synthesizer_fallback_model: str = "claude-3-7-sonnet-20250219"
response_synthesizer_fallback_model: str = "claude-4-opus-20250514" # "claude-4-sonnet-20250514" # "claude-3-7-sonnet-20250219"
response_synthesizer_fallback_temperature: float = 0.1

# Translation models settings
139 changes: 139 additions & 0 deletions src/wandbot/evaluation/data_utils.py
@@ -0,0 +1,139 @@
import json
import logging
from typing import Dict, List, Optional, Any

import weave

# From eval.py
def sanitize_precomputed_item_recursive(item: Any) -> Any:
"""Recursively sanitize an item by converting None values to empty strings."""
if isinstance(item, dict):
return {k: sanitize_precomputed_item_recursive(v) for k, v in item.items()}
elif isinstance(item, list):
return [sanitize_precomputed_item_recursive(elem) for elem in item]
elif item is None:
return ""
return item


def load_and_prepare_precomputed_data(
file_path: Optional[str], logger: logging.Logger
) -> Optional[Dict[str, Dict]]:
"""Loads, sanitizes, and prepares precomputed answers from a JSON file into a map."""
if not file_path:
return None

logger.info(f"Loading precomputed answers from: {file_path}")
try:
with open(file_path, "r") as f:
loaded_answers_raw = json.load(f)

if not isinstance(loaded_answers_raw, list):
raise ValueError("Precomputed answers JSON must be a list of items.")

loaded_answers_sanitized = []
for raw_item in loaded_answers_raw:
if not isinstance(raw_item, dict):
raise ValueError(f"Skipping non-dictionary item in precomputed answers: {raw_item}")
sanitized_item = sanitize_precomputed_item_recursive(raw_item)
loaded_answers_sanitized.append(sanitized_item)
logger.debug(f"Sanitized precomputed item: {sanitized_item}")

precomputed_answers_map = {}
for i, item in enumerate(loaded_answers_sanitized):
if not isinstance(item, dict):
raise ValueError(
f"Item at original index {i} in precomputed answers (post-sanitization) is not a dictionary."
)

item_index_str: str
raw_item_index = item.get("index")

if raw_item_index is None or str(raw_item_index).strip() == "":
logger.warning(
f"Item (original index {i}) is missing 'index' or index is empty after sanitization. "
f"Content: {str(item.get('question', 'N/A'))[:50]+'...'}. Using list index {i} as fallback string key."
)
item_index_str = str(i)
else:
item_index_str = str(raw_item_index).strip()
if not item_index_str:
logger.warning(
f"Item (original index {i}) had whitespace-only 'index' after sanitization. "
f"Content: {str(item.get('question', 'N/A'))[:50]+'...'}. Using list index {i} as fallback string key."
)
item_index_str = str(i)

if item_index_str in precomputed_answers_map:
logger.warning(
f"Duplicate string index '{item_index_str}' found in precomputed answers. "
f"Overwriting with item from original list at index {i}."
)
precomputed_answers_map[item_index_str] = item

Comment on lines +67 to +73

🛠️ Refactor suggestion

Consider failing fast on duplicate indices

When the same index appears twice, the later entry silently overwrites the former.
For evaluation reproducibility it’s usually safer to raise, or at least surface a stronger warning, because duplicates often indicate a bug in the pre-computed file.

-            if item_index_str in precomputed_answers_map:
-                logger.warning(
-                    f"Duplicate string index '{item_index_str}' found in precomputed answers. "
-                    f"Overwriting with item from original list at index {i}."
-                )
-            precomputed_answers_map[item_index_str] = item
+            if item_index_str in precomputed_answers_map:
+                raise ValueError(
+                    f"Duplicate string index '{item_index_str}' encountered (item positions "
+                    f"{precomputed_answers_map[item_index_str].get('__orig_pos__', 'N/A')} and {i})."
+                )
+            # store original position for better error messages later
+            item["__orig_pos__"] = i
+            precomputed_answers_map[item_index_str] = item
🤖 Prompt for AI Agents
In src/wandbot/evaluation/data_utils.py around lines 67 to 73, the code
currently logs a warning and overwrites entries when duplicate indices are found
in precomputed_answers_map. To fail fast and improve reproducibility, replace
the warning and overwrite behavior with raising an exception when a duplicate
index is detected. This will immediately surface the issue and prevent silent
data overwrites.

logger.info(
f"Loaded {len(precomputed_answers_map)} precomputed answers into map from {len(loaded_answers_sanitized)} sanitized items."
)
return precomputed_answers_map

except FileNotFoundError:
logger.error(f"Precomputed answers JSON file not found: {file_path}")
raise
except ValueError as e:
logger.error(f"Invalid format in precomputed answers JSON: {e}")
raise
except Exception as e:
logger.error(f"Failed to load or parse precomputed answers JSON: {e}")
raise


def load_and_prepare_dataset_rows(
dataset_ref_uri: str, is_debug: bool, n_debug_samples: int, logger: logging.Logger
) -> List[Dict]:
"""Loads dataset rows from a Weave reference, applies debug sampling, and prepares them for evaluation."""
dataset_ref = weave.ref(dataset_ref_uri).get()
question_rows = dataset_ref.rows

if is_debug:
question_rows = question_rows[:n_debug_samples]

question_rows_for_eval = []
for i, row in enumerate(question_rows):
if not isinstance(row, dict):
logger.warning(f"Dataset item at original index {i} is not a dictionary, skipping: {row}")
continue

dataset_row_index_str: str
raw_dataset_index = row.get("index")
if raw_dataset_index is None or str(raw_dataset_index).strip() == "":
logger.warning(
f"Dataset item (original list index {i}, question: {str(row.get('question', 'N/A'))[:50] + '...'}) "
f"is missing 'index' or index is empty. Using list index {i} as fallback string key."
)
dataset_row_index_str = str(i)
else:
dataset_row_index_str = str(raw_dataset_index).strip()

question = row.get("question")
ground_truth = row.get("answer")
notes = row.get("notes")

if question is None:
logger.warning(f"Dataset item at index {dataset_row_index_str} is missing 'question'. Using empty string.")
question = ""
if ground_truth is None:
logger.warning(f"Dataset item at index {dataset_row_index_str} is missing 'answer'. Using empty string.")
ground_truth = ""
if notes is None:
logger.warning(f"Dataset item at index {dataset_row_index_str} is missing 'notes'. Using empty string.")
notes = ""

question_rows_for_eval.append(
{
"index": dataset_row_index_str,
"question": str(question),
"ground_truth": str(ground_truth),
"notes": str(notes),
}
)
return question_rows_for_eval