Conversation
Walkthrough

This update introduces modular data loading and evaluation utilities, an extensible evaluation schema, and robust Wandbot model and scorer classes for correctness evaluation. The main evaluation script is refactored to use these new modules, and configuration options are expanded to support precomputed answer files. Documentation is updated accordingly.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
participant User
participant EvalMain
participant DataUtils
participant WeaveModel
participant WeaveScorer
participant WandbotAPI
User->>EvalMain: Run evaluation script
EvalMain->>DataUtils: Load dataset rows
alt Precomputed answers provided
EvalMain->>DataUtils: Load precomputed answers
EvalMain->>WeaveModel: Initialize with precomputed data
else
EvalMain->>WeaveModel: Initialize with Wandbot config
end
loop For each dataset row
EvalMain->>WeaveModel: predict(index, question)
alt Precomputed data
WeaveModel-->>EvalMain: Return precomputed answer
else
WeaveModel->>WandbotAPI: Query for answer
WandbotAPI-->>WeaveModel: Return API response
WeaveModel-->>EvalMain: Return parsed answer
end
EvalMain->>WeaveScorer: score(index, question, ground_truth, notes, model_output)
WeaveScorer-->>EvalMain: Return score result
end
EvalMain->>EvalMain: Save results to JSON
```
Actionable comments posted: 7
🧹 Nitpick comments (11)
src/wandbot/evaluation/eval_config.py (1)
27-33: Avoid hard-coding URIs inside the property

The weave URIs are effectively configuration. Placing them inline makes them hard to update and re-use (e.g., when the dataset version bumps again).
A tiny refactor keeps the data in a constant mapping, making future updates one-liner changes and enabling potential CLI overrides.
```diff
-    def eval_dataset(self) -> str:
-        if self.lang == "ja":
-            return "weave:///wandbot/wandbot-eval/object/wandbot_eval_data_jp:I2BlFnw1VnPn8lFG72obBWN1sCokB3EYk4G4vSKg23g"
-        return "weave:///wandbot/wandbot-eval/object/wandbot_eval_data:ZZUQa2CCAqPDFWiB90VANCm4EdT8qtc125NazaWUrdI"
+    def eval_dataset(self) -> str:
+        DATASET_URIS = {
+            "en": "weave:///wandbot/wandbot-eval/object/wandbot_eval_data:ZZUQa2CCAqPDFWiB90VANCm4EdT8qtc125NazaWUrdI",
+            "ja": "weave:///wandbot/wandbot-eval/object/wandbot_eval_data_jp:I2BlFnw1VnPn8lFG72obBWN1sCokB3EYk4G4vSKg23g",
+        }
+        return DATASET_URIS[self.lang]
```

src/wandbot/evaluation/eval_schemas.py (2)
16-31: pydantic-v2 config style mismatch

When using Pydantic v2, the recommended way to enable attribute loading is via `model_config = ConfigDict(from_attributes=True)` instead of the v1-style `class Config`.
While the current approach still works (Pydantic keeps a shim), adopting the v2 idiom prevents deprecation warnings down the road.

```diff
-from datetime import datetime
+# pyright: reportUnknownArgumentType=ignore
+from datetime import datetime
@@
-class Config:
-    from_attributes = True
+from pydantic import ConfigDict
+
+model_config = ConfigDict(from_attributes=True)
```
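For reference, a minimal sketch of the v2 idiom applied inside a schema class such as `EvalChatResponse` (the field shown is illustrative, not copied from the PR):

```python
from pydantic import BaseModel, ConfigDict


class EvalChatResponse(BaseModel):
    # Pydantic v2 style: a class-level ConfigDict replaces the nested v1 `class Config`.
    model_config = ConfigDict(from_attributes=True)

    answer: str = ""  # illustrative field only
```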
33-38: Minor: `retrieved_contexts` default mutability

`default_factory=list` is 👍🏻 for avoiding a shared default, but because the list elements are themselves dicts the scorer might mutate them.
If mutation is unintended, consider returning a tuple (`default_factory=tuple`) or using `typing.Sequence` to make intent explicit.

README.md (1)
265-286: Grammar & Markdown lint

Small wording tweak (“its” → “it’s”) and switch to dash markers to appease
`markdownlint` (MD004).

```diff
-To enable context-based scoring (e.g., by `WandbotCorrectnessScorer`), you can provide the context information through one of the following fields in each JSON object, although its not essential:
-* `retrieved_contexts` ...
+To enable context-based scoring (e.g., by `WandbotCorrectnessScorer`), you can provide context information through one of the following fields in each JSON object, although **it’s** not essential:
+- `retrieved_contexts` ...
```

🧰 Tools
🪛 LanguageTool
[uncategorized] ~279-~279: “its” (belonging to it) seems less likely than “it’s” (it is)
Context: ...ng fields in each JSON object, although its not essential: * `retrieved_contexts`... (AI_HYDRA_LEO_CPT_ITS_ITIS)
🪛 markdownlint-cli2 (0.17.2)
275-275: Unordered list style. Expected: dash; Actual: asterisk (MD004, ul-style)
276-276: Unordered list style. Expected: dash; Actual: asterisk (MD004, ul-style)
281-281: Unordered list style. Expected: dash; Actual: asterisk (MD004, ul-style)
283-283: Unordered list style. Expected: dash; Actual: asterisk (MD004, ul-style)
src/wandbot/evaluation/eval.py (3)
7-8: `requests` import no longer needed

`get_wandbot_configs` is the only consumer of `requests`, and that call is skipped when `precomputed_answers_json_path` is set.
Switching to `httpx` (already used elsewhere) or deferring the import inside the function would remove an otherwise unused hard dependency.
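For illustration, a sketch of what an `httpx`-based version could look like, reusing the existing `/configs` endpoint, `EvalConfig.wandbot_url`, the module-level `logger`, and the fallback dict from the current implementation; shown with the synchronous `httpx` API for brevity and not taken from the PR:

```python
import httpx


def get_wandbot_configs(config: EvalConfig) -> dict:
    """Fetch wandbot's configs and repo git info without depending on requests."""
    try:
        response = httpx.get(f"{config.wandbot_url}/configs", timeout=30.0)
        response.raise_for_status()
        return response.json()
    except httpx.HTTPError as e:
        logger.error(f"Error making request to wandbot configs: {e}")
        return {"chat_config": {}, "vector_store_config": {}, "git_info": {}, "app_config": {}}
```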
96-104: Typo in attribute key
"wandbot_vectore_store_config"contains an extra “e”.
Down-stream dashboards querying this attribute will miss the data.- "wandbot_vectore_store_config": wandbot_info.get("vector_store_config", {}) ... + "wandbot_vector_store_config": wandbot_info.get("vector_store_config", {}) ...
110-113: Multiline f-string readability

The back-slash continuation inside the f-string injects a literal newline character that splits the log message.
Prefer explicit `\n` or separate `logger.info` calls for clarity.

```diff
-    logger.info(
-        f"Starting evaluation of {len(question_rows_for_eval)} samples with {config.n_trials} trials, \
-{len(question_rows_for_eval) * config.n_trials} calls in total."
-    )
+    total_calls = len(question_rows_for_eval) * config.n_trials
+    logger.info(
+        f"Starting evaluation of {len(question_rows_for_eval)} samples "
+        f"with {config.n_trials} trials ({total_calls} total requests)."
+    )
```

src/wandbot/evaluation/data_utils.py (1)
28-30: Use explicit UTF-8 encoding when opening JSON files
`open(file_path, "r")` relies on the system default encoding. On non-UTF-8 locales (e.g. Windows with cp1252) this will break for Unicode content.

```diff
-    with open(file_path, "r") as f:
+    with open(file_path, "r", encoding="utf-8") as f:
```

src/wandbot/evaluation/weave_scorer.py (2)
35-38: `index` is declared as `int` but used as a string elsewhere

`load_and_prepare_dataset_rows` stores indices as strings and `WandbotModel.predict` expects a `str`. Declaring the argument as `int` may confuse type-checkers and future readers.

```diff
-    self, index: int, question: str, ground_truth: str, notes: str, model_output: dict
+    self, index: str, question: str, ground_truth: str, notes: str, model_output: dict
```
113-117: `json.dumps(model_output)` may raise for non-serialisable objects

If `model_output` ever contains datetimes or custom classes the dump will raise and break evaluation. Consider `default=str` or `orjson.dumps` with an option to handle non-serialisable types gracefully.

```diff
-    "notes": notes, "model_output": json.dumps(model_output),
+    "notes": notes, "model_output": json.dumps(model_output, default=str),
```
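As a quick illustration of the `default=str` fallback (a generic sketch, not code from the PR):

```python
import json
from datetime import datetime, timezone

payload = {"generated_answer": "...", "created_at": datetime.now(timezone.utc)}

# Without a default, json.dumps raises TypeError on the datetime value;
# default=str coerces any non-serialisable value to its string form instead.
print(json.dumps(payload, default=str))
```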
src/wandbot/evaluation/weave_model.py (1)

201-204: Parsing numeric fields can raise – protect with try/except

Direct `int(...)` / `float(...)` casts will explode if the source contains non-numeric text (e.g., “N/A”). A helper with safe coercion avoids crashing the whole run.

```python
def _to_int(val, default=0):
    try:
        return int(val)
    except (TypeError, ValueError):
        return default

…
"total_tokens": _to_int(item.get("total_tokens")),
```
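Fleshed out slightly, a safe-coercion sketch using the token fields from the response dict shown earlier (the sample `item` values are illustrative only):

```python
def _to_int(val, default=0):
    """Coerce to int, falling back to a default on None or non-numeric text."""
    try:
        return int(val)
    except (TypeError, ValueError):
        return default


def _to_float(val, default=0.0):
    """Coerce to float with the same fallback behaviour."""
    try:
        return float(val)
    except (TypeError, ValueError):
        return default


item = {"total_tokens": "123", "prompt_tokens": None, "time_taken": "N/A"}  # sample precomputed row
record = {
    "total_tokens": _to_int(item.get("total_tokens")),
    "prompt_tokens": _to_int(item.get("prompt_tokens")),
    "completion_tokens": _to_int(item.get("completion_tokens")),
    "time_taken": _to_float(item.get("time_taken")),
}
```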
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
README.md (2 hunks)
src/wandbot/evaluation/data_utils.py (1 hunks)
src/wandbot/evaluation/eval.py (2 hunks)
src/wandbot/evaluation/eval_config.py (1 hunks)
src/wandbot/evaluation/eval_schemas.py (1 hunks)
src/wandbot/evaluation/weave_model.py (1 hunks)
src/wandbot/evaluation/weave_scorer.py (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
src/wandbot/evaluation/weave_model.py (3)
tests/evaluation/test_eval.py (2)
json (43-46)
raise_for_status (48-50)
src/wandbot/evaluation/eval_schemas.py (1)
EvalChatResponse (9-41)
src/wandbot/utils.py (2)
get_logger (52-98)
model (222-225)
🪛 LanguageTool
README.md
[uncategorized] ~279-~279: “its” (belonging to it) seems less likely than “it’s” (it is)
Context: ...ng fields in each JSON object, although its not essential: * `retrieved_contexts...
(AI_HYDRA_LEO_CPT_ITS_ITIS)
🪛 markdownlint-cli2 (0.17.2)
README.md
275-275: Unordered list style. Expected: dash; Actual: asterisk (MD004, ul-style)
276-276: Unordered list style. Expected: dash; Actual: asterisk (MD004, ul-style)
281-281: Unordered list style. Expected: dash; Actual: asterisk (MD004, ul-style)
283-283: Unordered list style. Expected: dash; Actual: asterisk (MD004, ul-style)
289-289: Unordered list style. Expected: dash; Actual: asterisk (MD004, ul-style)
290-290: Unordered list style. Expected: dash; Actual: asterisk (MD004, ul-style)
291-291: Unordered list style. Expected: dash; Actual: asterisk (MD004, ul-style)
292-292: Unordered list style. Expected: dash; Actual: asterisk (MD004, ul-style)
293-293: Unordered list style. Expected: dash; Actual: asterisk (MD004, ul-style)
294-294: Unordered list style. Expected: dash; Actual: asterisk (MD004, ul-style)
295-295: Unordered list style. Expected: dash; Actual: asterisk (MD004, ul-style)
296-296: Unordered list style. Expected: dash; Actual: asterisk (MD004, ul-style)
297-297: Unordered list style. Expected: dash; Actual: asterisk (MD004, ul-style)
298-298: Unordered list style. Expected: dash; Actual: asterisk (MD004, ul-style)
299-299: Unordered list style. Expected: dash; Actual: asterisk (MD004, ul-style)
300-300: Unordered list style. Expected: dash; Actual: asterisk (MD004, ul-style)
```python
precomputed_answers_json_path: str | None = sp.field(default=None, help="Path to a JSON file containing precomputed answers. If provided, network calls to wandbot will be skipped.")
```
🛠️ Refactor suggestion
Consider validating the path at parse-time
precomputed_answers_json_path is declared but never validated.
If a user fat-fingers the path, the evaluation will silently fall back to live calls, which might be surprising and very expensive.
```diff
+    @property
+    def precomputed_answers_path(self) -> Path | None:
+        """Return a `Path` object and verify the file exists (if provided)."""
+        if self.precomputed_answers_json_path is None:
+            return None
+        p = Path(self.precomputed_answers_json_path).expanduser()
+        if not p.is_file():
+            raise FileNotFoundError(f"Pre-computed answers file not found: {p}")
+        return p
```

Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In src/wandbot/evaluation/eval_config.py around lines 25 to 26, the
precomputed_answers_json_path field lacks validation, which can cause silent
fallback to live calls if the path is incorrect. Add validation logic during
parsing to check if the provided path exists and is a valid file. If the path is
invalid, raise an appropriate error or warning to prevent unexpected behavior
and costly live calls.
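If `EvalConfig` is a plain dataclass parsed with simple_parsing (which the `sp.field` usage suggests), one way to validate at parse time is a `__post_init__` check; this is a sketch under that assumption, not the PR's code:

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class EvalConfig:
    precomputed_answers_json_path: str | None = None
    # ... other fields elided ...

    def __post_init__(self):
        # Fail fast instead of silently falling back to live wandbot calls.
        if self.precomputed_answers_json_path is not None:
            p = Path(self.precomputed_answers_json_path).expanduser()
            if not p.is_file():
                raise FileNotFoundError(f"Pre-computed answers file not found: {p}")
```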
```python
# Load environment variables from .env in project root
ENV_PATH = Path(__file__).parent.parent.parent.parent / '.env'
ENV_PATH = Path(__file__).parent.parent.parent.parent / ".env"
load_dotenv(ENV_PATH, override=True)

logger = get_logger(__name__)
```
Set WEAVE_PARALLELISM before weave.init
weave.init() reads WEAVE_PARALLELISM once at import-time.
Currently the env var is exported on line 56, after initialization, so the desired parallelism is ignored.
```diff
-    weave.init(f"{config.wandb_entity}/{config.wandb_project}")
-    os.environ["WEAVE_PARALLELISM"] = str(config.n_weave_parallelism)
+    os.environ["WEAVE_PARALLELISM"] = str(config.n_weave_parallelism)
+    weave.init(f"{config.wandb_entity}/{config.wandb_project}")
```

Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In src/wandbot/evaluation/eval.py around lines 20 to 24, the environment
variable WEAVE_PARALLELISM must be set before calling weave.init() because
weave.init() reads this variable at import time. Move the code that sets or
exports WEAVE_PARALLELISM to a position before weave.init() is called to ensure
the desired parallelism setting is applied correctly.
```python
def get_wandbot_configs(config: EvalConfig) -> dict:  # config is now required
    """Get wandbot's configs and repo git info"""
    url = config.wandbot_url if config else "http://0.0.0.0:8000"
    url = config.wandbot_url  # No default, use from config
    try:
        response = requests.get(f"{url}/configs")
        response.raise_for_status()
        return response.json()
    except requests.RequestException as e:
        logger.error(f"Error making request to wandbot configs: {e}")
        return {
            "chat_config": {},
            "vector_store_config": {},
            "git_info": {},
            "app_config": {}
        }


@retry(
    stop=stop_after_attempt(5),
    wait=wait_exponential(multiplier=1, min=10, max=300),
    retry=retry_if_exception_type(httpx.HTTPError),
    before_sleep=lambda retry_state: logger.warning(
        f"Attempt {retry_state.attempt_number} failed. Retrying in {retry_state.next_action.sleep} seconds..."
    ),
    after=after_log(logger, logging.ERROR)
)
async def make_request(url: str, question: str, application: str = "api-eval", language: str = "en") -> dict:
    """Make HTTP request to wandbot API with retry logic."""
    request_timeout = 120.0
    request_connect_timeout = 30.0
    async with httpx.AsyncClient(timeout=httpx.Timeout(timeout=request_timeout, connect=request_connect_timeout)) as client:
        try:
            response = await client.post(
                f"{url}/chat/query",
                json={"question": question, "application": application, "language": language}
            )
            response.raise_for_status()
            return response.json()
        except httpx.ReadTimeout:
            logger.error(f"Request timed out after {request_timeout} seconds")
            raise
        except httpx.ConnectTimeout:
            logger.error(f"Connection timed out after {request_connect_timeout} seconds")
            raise


async def get_answer(question: str, wandbot_url: str, application: str = "api-eval", language: str = "en") -> str:
    """Get answer from wandbot API."""
    try:
        result = await make_request(wandbot_url, question, application, language)
        return json.dumps(result)
    except Exception as e:
        logger.error(f"Failed to get answer: {str(e)}")
        return json.dumps({
            "error": str(e),
            "answer": "",
            "system_prompt": "",
            "source_documents": "",
            "model": "",
            "total_tokens": 0,
            "prompt_tokens": 0,
            "completion_tokens": 0,
            "time_taken": 0
        })


def parse_text_to_json(text):
    # Split the text into documents
    documents = re.split(r"source: https?://", text)[1:]
    result = []
    for doc in documents:
        source_url = "https://" + doc.split("\n")[0].strip()
        content = "\n".join(doc.split("\n")[1:]).strip()
        document = {"source": source_url, "content": content}
        result.append(document)
    return result


@weave.op
async def get_record(question: str, wandbot_url: str, application: str = "api-eval", language: str = "en") -> dict:
    try:
        response = await get_answer(question, wandbot_url=wandbot_url, application=application, language=language)
        response_dict = json.loads(response)

        if not response_dict:
            try:
                error_data = json.loads(response)
                error_msg = error_data.get("error", "Unknown API error")
            except json.JSONDecodeError:
                error_msg = response if response else "Empty response from API"

            logger.error(error_msg)
            return {
                "system_prompt": "",
                "generated_answer": "",
                "response_synthesis_llm_messages": [],
                "retrieved_contexts": [],
                "model": "",
                "total_tokens": 0,
                "prompt_tokens": 0,
                "completion_tokens": 0,
                "time_taken": 0,
                "has_error": True,
                "api_call_statuses": {},
                "error_message": error_msg
            }

        return {
            "system_prompt": response_dict.get("system_prompt", ""),
            "generated_answer": response_dict.get("answer", ""),
            "response_synthesis_llm_messages": response_dict.get("response_synthesis_llm_messages", []),
            "retrieved_contexts": parse_text_to_json(
                response_dict.get("source_documents", "")
            ),
            "model": response_dict.get("model", ""),
            "total_tokens": response_dict.get("total_tokens", 0),
            "prompt_tokens": response_dict.get("prompt_tokens", 0),
            "completion_tokens": response_dict.get("completion_tokens", 0),
            "time_taken": response_dict.get("time_taken", 0),
            "api_call_statuses": response_dict.get("api_call_statuses", {}),
            "has_error": False,
            "error_message": None
        }
    except Exception as e:
        error_msg = f"Error getting response from wandbotAPI: {str(e)}"
        logger.error(error_msg)
        return {
            "system_prompt": "",
            "generated_answer": "",
            "response_synthesis_llm_messages": [],
            "retrieved_contexts": [],
            "model": "",
            "total_tokens": 0,
            "prompt_tokens": 0,
            "completion_tokens": 0,
            "time_taken": 0,
            "has_error": True,
            "api_call_statuses": {},
            "error_message": error_msg
        }


class WandbotModel(weave.Model):
    language: str = "en"
    application: str = "api-eval"
    wandbot_url: str = "http://0.0.0.0:8000"
    wandbot_config: dict = {}

    @weave.op
    async def predict(self, question: str) -> dict:
        prediction = await get_record(question,
                                      wandbot_url=self.wandbot_url,
                                      application=self.application,
                                      language=self.language)
        return prediction


class WandbotCorrectnessScorer(weave.Scorer):
    config: EvalConfig
    correctness_evaluator: WandBotCorrectnessEvaluator = None
    debug: bool = False

    def __init__(self, config: EvalConfig):
        super().__init__(config=config)
        self.debug = config.debug
        self.correctness_evaluator = WandBotCorrectnessEvaluator(
            provider=config.eval_judge_provider,
            model_name=config.eval_judge_model,
            temperature=config.eval_judge_temperature,
            max_retries=config.max_evaluator_retries,
            timeout=config.evaluator_timeout,
        )

    @weave.op
    async def score(self, question: str, ground_truth: str, notes: str, model_output: dict) -> dict:

        if self.debug:
            if model_output is not None:
                logger.debug(f"In WandbotCorrectnessScorer, model_output keys:\n{model_output.keys()}")
            else:
                logger.error("model_output is None")

        try:
            contexts = [c["content"] for c in model_output.get("retrieved_contexts", [])] if model_output.get("retrieved_contexts") else []

            if model_output.get("generated_answer", "") == "":
                error_msg = "Generated answer is empty"
                logger.error(error_msg)
                return {
                    "answer_correct": False,
                    "reasoning": error_msg,
                    "score": 1.0,
                    "has_error": True,
                    "error_message": error_msg
                }
            if model_output.get("has_error", False):
                error_msg = model_output.get("error_message", "Unknown error")
                logger.error(error_msg)
                return {
                    "answer_correct": False,
                    "reasoning": error_msg,
                    "score": 1.0,
                    "has_error": True,
                    "error_message": error_msg
                }

            # If not error from wandbot generation, run the correctness evaluator
            return await self.correctness_evaluator.aevaluate(
                query=question,
                response=model_output.get("generated_answer", ""),
                reference=ground_truth,
                contexts=contexts,
                reference_notes=notes,
            )

        except Exception as e:
            error_msg = f"Error evaluating answer: {str(e)}"
            logger.error(error_msg)
            return CorrectnessEvaluationResult(
                query=question,
                response=model_output.get("generated_answer", ""),
                contexts=contexts,
                passing=False,
                score=1.0,
                reasoning=error_msg,
                has_error=True,
                error_message=error_msg
            )

return {"chat_config": {}, "vector_store_config": {}, "git_info": {}, "app_config": {}}
```
🛠️ Refactor suggestion
Network call still executed when only pre-computed answers are used
Even when precomputed_answers_json_path is supplied, get_wandbot_configs issues an HTTP request that is guaranteed to fail in offline workflows and needlessly clutters the logs.
A quick guard avoids the call and keeps the logic simple:
```diff
-    wandbot_info = get_wandbot_configs(config)  # Pass config directly
+    wandbot_info = (
+        {} if config.precomputed_answers_json_path
+        else get_wandbot_configs(config)
+    )
```

Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In src/wandbot/evaluation/eval.py around lines 27 to 37, the function
get_wandbot_configs makes a network request even when
precomputed_answers_json_path is provided, causing unnecessary failures and log
clutter in offline workflows. Add a guard at the start of the function to check
if precomputed_answers_json_path is set in the config, and if so, skip the HTTP
request and return an appropriate default or empty config dictionary
immediately.
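For completeness, a sketch of the in-function guard this prompt describes, reusing the names from the snippet above (not the committed fix):

```python
def get_wandbot_configs(config: EvalConfig) -> dict:
    """Get wandbot's configs and repo git info."""
    # Offline runs with precomputed answers never need the live config endpoint.
    if config.precomputed_answers_json_path:
        return {"chat_config": {}, "vector_store_config": {}, "git_info": {}, "app_config": {}}
    try:
        response = requests.get(f"{config.wandbot_url}/configs")
        response.raise_for_status()
        return response.json()
    except requests.RequestException as e:
        logger.error(f"Error making request to wandbot configs: {e}")
        return {"chat_config": {}, "vector_store_config": {}, "git_info": {}, "app_config": {}}
```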
```python
if item_index_str in precomputed_answers_map:
    logger.warning(
        f"Duplicate string index '{item_index_str}' found in precomputed answers. "
        f"Overwriting with item from original list at index {i}."
    )
precomputed_answers_map[item_index_str] = item
```
🛠️ Refactor suggestion
Consider failing fast on duplicate indices
When the same index appears twice, the later entry silently overwrites the former.
For evaluation reproducibility it’s usually safer to raise, or at least surface a stronger warning, because duplicates often indicate a bug in the pre-computed file.
```diff
-if item_index_str in precomputed_answers_map:
-    logger.warning(
-        f"Duplicate string index '{item_index_str}' found in precomputed answers. "
-        f"Overwriting with item from original list at index {i}."
-    )
-precomputed_answers_map[item_index_str] = item
+if item_index_str in precomputed_answers_map:
+    raise ValueError(
+        f"Duplicate string index '{item_index_str}' encountered (item positions "
+        f"{precomputed_answers_map[item_index_str].get('__orig_pos__', 'N/A')} and {i})."
+    )
+# store original position for better error messages later
+item["__orig_pos__"] = i
+precomputed_answers_map[item_index_str] = item
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
if item_index_str in precomputed_answers_map:
    raise ValueError(
        f"Duplicate string index '{item_index_str}' encountered (item positions "
        f"{precomputed_answers_map[item_index_str].get('__orig_pos__', 'N/A')} and {i})."
    )
# store original position for better error messages later
item["__orig_pos__"] = i
precomputed_answers_map[item_index_str] = item
```
🤖 Prompt for AI Agents
In src/wandbot/evaluation/data_utils.py around lines 67 to 73, the code
currently logs a warning and overwrites entries when duplicate indices are found
in precomputed_answers_map. To fail fast and improve reproducibility, replace
the warning and overwrite behavior with raising an exception when a duplicate
index is detected. This will immediately surface the issue and prevent silent
data overwrites.
```python
score_result_dict = CorrectnessEvaluationResult(
    reason=_error_msg,
    score=1.0,
    decision="incorrect",
    answer_correct=False,
    has_error=True,
    error_message=_error_msg
).model_dump(warnings=False)
```
🛠️ Refactor suggestion
Scoring semantics look inverted (score=1.0 for incorrect answers)
CorrectnessEvaluationResult is instantiated with score=1.0 in error/incorrect
paths. Unless your convention is “1 = worst”, this will mark every failure as
perfect. Double-check the metric definition; you probably want 0.0 here.
🤖 Prompt for AI Agents
In src/wandbot/evaluation/weave_scorer.py around lines 56 to 63, the score is
set to 1.0 for incorrect answers, which likely inverts the intended scoring
semantics. Change the score value from 1.0 to 0.0 in the error or incorrect
paths to correctly reflect that these are failing cases, assuming the convention
is 1.0 for perfect and 0.0 for worst.
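If that convention (1.0 = perfect, 0.0 = worst) is indeed what the rest of the pipeline assumes, the error path would look like this sketch (fields copied from the snippet above; not a verified fix):

```python
score_result_dict = CorrectnessEvaluationResult(
    reason=_error_msg,
    score=0.0,  # failing case scores 0.0 under a "1.0 = perfect" convention
    decision="incorrect",
    answer_correct=False,
    has_error=True,
    error_message=_error_msg,
).model_dump(warnings=False)
```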
| "question": str(item.get("question", "")), | ||
| "system_prompt": str(item.get("system_prompt", "")), | ||
| "answer": str(item.get("generated_answer", "")), | ||
| "model": str(item.get("model", "precomputed")), | ||
| "sources": str(item.get("sources", "")), |
🛠️ Refactor suggestion
Fall back to 'answer' when pre-computed item lacks 'generated_answer'
Many datasets still use the API’s original answer key. Relying solely on
generated_answer will treat such rows as empty answers → false negatives.
- "answer": str(item.get("generated_answer", "")),
+ "answer": str(item.get("generated_answer") or item.get("answer", "")),📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| "question": str(item.get("question", "")), | |
| "system_prompt": str(item.get("system_prompt", "")), | |
| "answer": str(item.get("generated_answer", "")), | |
| "model": str(item.get("model", "precomputed")), | |
| "sources": str(item.get("sources", "")), | |
| "question": str(item.get("question", "")), | |
| "system_prompt": str(item.get("system_prompt", "")), | |
| "answer": str(item.get("generated_answer") or item.get("answer", "")), | |
| "model": str(item.get("model", "precomputed")), | |
| "sources": str(item.get("sources", "")), |
🤖 Prompt for AI Agents
In src/wandbot/evaluation/weave_model.py around lines 195 to 199, the code
currently uses only the 'generated_answer' key to get the answer, which causes
issues with datasets that use the original 'answer' key. Modify the code to
first try 'generated_answer' and if it is missing or empty, fall back to using
the 'answer' key instead to avoid false negatives.
```python
async with httpx.AsyncClient(
    timeout=httpx.Timeout(timeout=request_timeout, connect=request_connect_timeout)
) as client:
```
Incorrect httpx.Timeout keyword – will raise TypeError
httpx.Timeout expects total=…, not timeout=….
As written, the first request will blow up before any retry logic kicks in.
```diff
-timeout=httpx.Timeout(timeout=request_timeout, connect=request_connect_timeout)
+timeout=httpx.Timeout(total=request_timeout, connect=request_connect_timeout)
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
async with httpx.AsyncClient(
    timeout=httpx.Timeout(total=request_timeout, connect=request_connect_timeout)
) as client:
```
🤖 Prompt for AI Agents
In src/wandbot/evaluation/weave_model.py around lines 31 to 33, the
httpx.AsyncClient is incorrectly using the keyword argument 'timeout' instead of
'total' when creating the httpx.Timeout object. Replace
'timeout=request_timeout' with 'total=request_timeout' to correctly specify the
total timeout duration and prevent a TypeError from being raised.
Summary by CodeRabbit

- New Features
- Documentation
- Refactor
- Bug Fixes