
add passing precomputed answers for eval #96

Open

morganmcg1 wants to merge 4 commits into main from add_passing_precomputed_answers_for_eval

Conversation

morganmcg1 (Member) commented May 22, 2025

  • update the wandbot evaluation datasets to a dataset version with an "index" column
  • add README instructions on how to pass precomputed answers
  • add the ability to pass a precomputed-answers JSON file, and refactor the eval script and helpers (see the sketch after this list)
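
For a quick sense of the workflow, a minimal sketch of a precomputed-answers file and how it might be passed to the eval script. Field names beyond `index`, `question`, and `generated_answer`, plus the module path and CLI flag, are assumptions based on the changes reviewed below:

import json

# Illustrative only: the "index" values must match the "index" column of the
# evaluation dataset; fields like "retrieved_contexts" are optional.
precomputed = [
    {
        "index": "0",
        "question": "How do I log a wandb artifact?",
        "generated_answer": "Use wandb.Artifact(...) and run.log_artifact(...).",
        "retrieved_contexts": [
            {"source": "https://docs.wandb.ai/guides/artifacts", "content": "..."}
        ],
    }
]

with open("precomputed_answers.json", "w", encoding="utf-8") as f:
    json.dump(precomputed, f, indent=2)

# Then point the eval at it, e.g. (flag name assumed from the new config field):
#   python -m wandbot.evaluation.eval --precomputed_answers_json_path precomputed_answers.json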

Summary by CodeRabbit

  • New Features

    • Added support for evaluating files with precomputed answers, allowing users to bypass live API calls by providing a JSON file.
    • Introduced new configuration options to specify precomputed answer files and updated dataset references.
    • Enhanced evaluation workflows to support both live and precomputed answer modes.
    • Integrated a new correctness scorer with detailed error handling and result tracking.
    • Updated response synthesizer models to newer versions for improved performance.
  • Documentation

    • Expanded the README with detailed instructions and examples for using precomputed answers in evaluations.
  • Refactor

    • Streamlined evaluation logic for improved modularity and maintainability, delegating data loading and scoring to dedicated modules.
    • Removed internal Wandbot API implementations, replacing them with external modular components.
  • Bug Fixes

    • Improved error handling and logging during evaluation and scoring processes.

coderabbitai bot commented May 22, 2025

Walkthrough

This update introduces modular data loading and evaluation utilities, an extensible evaluation schema, and robust Wandbot model and scorer classes for correctness evaluation. The main evaluation script is refactored to use these new modules, and configuration options are expanded to support precomputed answer files. Documentation is updated accordingly.
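
A rough sketch of how the refactored pieces fit together; the module and class names come from the changes listed below, but the function signatures and constructor arguments are assumptions:

import weave

from wandbot.evaluation.eval_config import EvalConfig
from wandbot.evaluation.data_utils import load_and_prepare_dataset_rows
from wandbot.evaluation.weave_model import WandbotModel
from wandbot.evaluation.weave_scorer import WandbotCorrectnessScorer

async def run_eval(config: EvalConfig) -> None:
    weave.init(f"{config.wandb_entity}/{config.wandb_project}")
    rows = load_and_prepare_dataset_rows(config)          # dataset rows keyed by "index"
    model = WandbotModel(wandbot_url=config.wandbot_url)  # or initialized with precomputed answers
    scorer = WandbotCorrectnessScorer(config=config)
    evaluation = weave.Evaluation(dataset=rows, scorers=[scorer])
    await evaluation.evaluate(model)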

Changes

| File(s) | Change Summary |
| --- | --- |
| README.md | Added a detailed section on evaluating with precomputed answers, explaining the JSON structure, optional fields, and usage examples. Adjusted heading levels in the debugging section. |
| src/wandbot/evaluation/data_utils.py | New module providing functions to sanitize, load, and prepare precomputed answer data and evaluation datasets, with structured logging and error handling. |
| src/wandbot/evaluation/eval.py | Refactored to delegate data loading, Wandbot API/model, and scoring logic to new modules. Rewrote evaluation orchestration to use modular helpers, updated config handling, and added result persistence with error handling. |
| src/wandbot/evaluation/eval_config.py | Added `precomputed_answers_json_path` to `EvalConfig` for specifying precomputed answers. Updated dataset URIs in the `eval_dataset` property. |
| src/wandbot/evaluation/eval_schemas.py | Introduced `EvalChatResponse` (sketched below), extending `ChatResponse` with optional and error-tracking fields, token/timing metrics, and parsed context support for evaluation. |
| src/wandbot/evaluation/weave_model.py | New module implementing robust async Wandbot API interaction with retry logic, error handling, response parsing, and a Weave model supporting both live and precomputed answers for evaluation. |
| src/wandbot/evaluation/weave_scorer.py | New scorer class for correctness evaluation, with async scoring logic, error handling, result tracking, and integration with Weave. |
| src/wandbot/configs/chat_config.py | Updated the default response synthesizer model and fallback model to a newer version. |
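
For orientation, a minimal sketch of what `EvalChatResponse` might look like based on the summary above; the base class is stubbed here and the exact field names and types are assumptions:

from typing import Optional
from pydantic import BaseModel, Field

class ChatResponse(BaseModel):      # stand-in for the real wandbot ChatResponse
    answer: str = ""
    model: str = ""

class EvalChatResponse(ChatResponse):
    has_error: bool = False
    error_message: Optional[str] = None
    total_tokens: int = 0
    prompt_tokens: int = 0
    completion_tokens: int = 0
    time_taken: float = 0.0
    retrieved_contexts: list[dict] = Field(default_factory=list)  # parsed source documents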

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant EvalMain
    participant DataUtils
    participant WeaveModel
    participant WeaveScorer
    participant WandbotAPI

    User->>EvalMain: Run evaluation script
    EvalMain->>DataUtils: Load dataset rows
    alt Precomputed answers provided
        EvalMain->>DataUtils: Load precomputed answers
        EvalMain->>WeaveModel: Initialize with precomputed data
    else
        EvalMain->>WeaveModel: Initialize with Wandbot config
    end
    loop For each dataset row
        EvalMain->>WeaveModel: predict(index, question)
        alt Precomputed data
            WeaveModel-->>EvalMain: Return precomputed answer
        else
            WeaveModel->>WandbotAPI: Query for answer
            WandbotAPI-->>WeaveModel: Return API response
            WeaveModel-->>EvalMain: Return parsed answer
        end
        EvalMain->>WeaveScorer: score(index, question, ground_truth, notes, model_output)
        WeaveScorer-->>EvalMain: Return score result
    end
    EvalMain->>EvalMain: Save results to JSON
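
A sketch of the predict() branching shown in the diagram above; the attribute names and the live-API helper are assumptions, not the exact weave_model.py implementation:

import weave

async def query_wandbot_api(question: str, wandbot_url: str) -> dict:
    """Placeholder for the live-API path (the retrying HTTP call in weave_model.py)."""
    raise NotImplementedError

class WandbotModel(weave.Model):
    wandbot_url: str = "http://0.0.0.0:8000"
    precomputed_answers: dict | None = None  # maps dataset "index" -> precomputed record

    @weave.op
    async def predict(self, index: str, question: str) -> dict:
        if self.precomputed_answers is not None:
            # precomputed mode: no network call, just look the row up by index
            return self.precomputed_answers.get(
                index,
                {"has_error": True, "error_message": f"No precomputed answer for index {index}"},
            )
        return await query_wandbot_api(question, wandbot_url=self.wandbot_url)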

Poem

In the warren of code, new modules appear,
For loading and scoring, the purpose is clear.
Precomputed or live, answers now flow,
With schemas and configs, evaluations grow.
The rabbits rejoice, their work now refined—
Hopping through data, correctness in mind!
🐇✨



📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ceca3ac and 337b766.

📒 Files selected for processing (1)
  • src/wandbot/configs/chat_config.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • src/wandbot/configs/chat_config.py


coderabbitai bot left a comment

Actionable comments posted: 7

🧹 Nitpick comments (11)
src/wandbot/evaluation/eval_config.py (1)

27-33: Avoid hard-coding URIs inside the property

The weave URIs are effectively configuration. Placing them inline makes them hard to update and re-use (e.g., when the dataset version bumps again).

A tiny refactor keeps the data in a constant mapping, making future updates one-liner changes and enabling potential CLI overrides.

-    def eval_dataset(self) -> str:
-        if self.lang == "ja":
-            return "weave:///wandbot/wandbot-eval/object/wandbot_eval_data_jp:I2BlFnw1VnPn8lFG72obBWN1sCokB3EYk4G4vSKg23g"
-        return "weave:///wandbot/wandbot-eval/object/wandbot_eval_data:ZZUQa2CCAqPDFWiB90VANCm4EdT8qtc125NazaWUrdI"
+    def eval_dataset(self) -> str:
+        DATASET_URIS = {
+            "en": "weave:///wandbot/wandbot-eval/object/wandbot_eval_data:ZZUQa2CCAqPDFWiB90VANCm4EdT8qtc125NazaWUrdI",
+            "ja": "weave:///wandbot/wandbot-eval/object/wandbot_eval_data_jp:I2BlFnw1VnPn8lFG72obBWN1sCokB3EYk4G4vSKg23g",
+        }
+        return DATASET_URIS[self.lang]
src/wandbot/evaluation/eval_schemas.py (2)

16-31: pydantic-v2 config style mismatch

When using Pydantic v2, the recommended way to enable attribute loading is via model_config = ConfigDict(from_attributes=True) instead of the v1-style class Config.
While the current approach still works (Pydantic keeps a shim), adopting the v2 idiom prevents deprecation warnings down the road.

-from datetime import datetime
+# pyright: reportUnknownArgumentType=ignore
+from datetime import datetime
@@
-class Config:
-    from_attributes = True
+from pydantic import ConfigDict
+
+model_config = ConfigDict(from_attributes=True)
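
For clarity, the v2 idiom in context would look roughly like this (class body abbreviated, names illustrative):

from pydantic import BaseModel, ConfigDict

class EvalChatResponse(BaseModel):
    model_config = ConfigDict(from_attributes=True)

    generated_answer: str = ""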

33-38: Minor: retrieved_contexts default mutability

default_factory=list is 👍🏻 for avoiding a shared default, but because the list elements are themselves dicts the scorer might mutate them.
If mutation is unintended, consider returning a tuple (default_factory=tuple) or using typing.Sequence to make intent explicit.
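
A sketch of the more defensive default, if mutation is indeed unintended (field shown in isolation for illustration):

from typing import Sequence
from pydantic import BaseModel, Field

class EvalChatResponse(BaseModel):
    # a tuple default signals that retrieved contexts should not be mutated in place
    retrieved_contexts: Sequence[dict] = Field(default_factory=tuple)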

README.md (1)

265-286: Grammar & Markdown lint

Small wording tweak (“its” → “it’s”) and switch to dash markers to appease markdownlint (MD004).

-To enable context-based scoring (e.g., by `WandbotCorrectnessScorer`), you can provide the context information through one of the following fields in each JSON object, although its not essential:
-*   `retrieved_contexts` ...
+To enable context-based scoring (e.g., by `WandbotCorrectnessScorer`), you can provide context information through one of the following fields in each JSON object, although **it’s** not essential:
+- `retrieved_contexts` ...
🧰 Tools
🪛 LanguageTool

[uncategorized] ~279-~279: “its” (belonging to it) seems less likely than “it’s” (it is)
Context: ...ng fields in each JSON object, although its not essential: * `retrieved_contexts...

(AI_HYDRA_LEO_CPT_ITS_ITIS)

🪛 markdownlint-cli2 (0.17.2)

Lines 275, 276, 281, 283: Unordered list style. Expected: dash; Actual: asterisk (MD004, ul-style)

src/wandbot/evaluation/eval.py (3)

7-8: requests import no longer needed

get_wandbot_configs is the only consumer of requests, and that call is skipped when precomputed_answers_json_path is set.
Switching to httpx async client (already used elsewhere) or deferring the import inside the function would remove an otherwise unused hard dependency.
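
A minimal sketch of the deferred-import variant; the early-return guard here is an assumption tied to the offline workflow discussed later in this review:

from wandbot.evaluation.eval_config import EvalConfig

def get_wandbot_configs(config: EvalConfig) -> dict:
    if config.precomputed_answers_json_path:
        # offline run: no live wandbot instance to query
        return {"chat_config": {}, "vector_store_config": {}, "git_info": {}, "app_config": {}}
    import requests  # deferred import keeps requests out of the module-level dependencies
    response = requests.get(f"{config.wandbot_url}/configs")
    response.raise_for_status()
    return response.json()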


96-104: Typo in attribute key

"wandbot_vectore_store_config" contains an extra “e”.
Down-stream dashboards querying this attribute will miss the data.

-        "wandbot_vectore_store_config": wandbot_info.get("vector_store_config", {}) ...
+        "wandbot_vector_store_config": wandbot_info.get("vector_store_config", {}) ...

110-113: Multiline f-string readability

The back-slash continuation inside the f-string injects a literal newline character that splits the log message.
Prefer explicit \n or separate logger.info calls for clarity.

-    logger.info(
-        f"Starting evaluation of {len(question_rows_for_eval)} samples with {config.n_trials} trials, \
-{len(question_rows_for_eval) * config.n_trials} calls in total."
-    )
+    total_calls = len(question_rows_for_eval) * config.n_trials
+    logger.info(
+        f"Starting evaluation of {len(question_rows_for_eval)} samples "
+        f"with {config.n_trials} trials ({total_calls} total requests)."
+    )
src/wandbot/evaluation/data_utils.py (1)

28-30: Use explicit UTF-8 encoding when opening JSON files

open(file_path, "r") relies on the system default encoding. On non-UTF-8 locales (e.g. Windows with cp1252) this will break for Unicode content.

-        with open(file_path, "r") as f:
+        with open(file_path, "r", encoding="utf-8") as f:
src/wandbot/evaluation/weave_scorer.py (2)

35-38: index is declared as int but used as a string elsewhere

load_and_prepare_dataset_rows stores indices as strings and WandbotModel.predict
expects a str. Declaring the argument as int may confuse type-checkers and
future readers.

-        self, index: int, question: str, ground_truth: str, notes: str, model_output: dict
+        self, index: str, question: str, ground_truth: str, notes: str, model_output: dict

113-117: json.dumps(model_output) may raise for non-serialisable objects

If model_output ever contains datetimes or custom classes the dump will raise
and break evaluation. Consider default=str or orjson.dumps with option
to handle non-serialisable types gracefully.

-            "notes": notes, "model_output": json.dumps(model_output),
+            "notes": notes, "model_output": json.dumps(model_output, default=str),
src/wandbot/evaluation/weave_model.py (1)

201-204: Parsing numeric fields can raise – protect with try/except

Direct int(...) / float(...) casts will explode if the source contains
non-numeric text (e.g., "N/A"). A helper with safe coercion avoids crashing
the whole run.

def _to_int(val, default=0):
    try:
        return int(val)
    except (TypeError, ValueError):
        return default

"total_tokens": _to_int(item.get("total_tokens")),
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 29d248b and ceca3ac.

📒 Files selected for processing (7)
  • README.md (2 hunks)
  • src/wandbot/evaluation/data_utils.py (1 hunks)
  • src/wandbot/evaluation/eval.py (2 hunks)
  • src/wandbot/evaluation/eval_config.py (1 hunks)
  • src/wandbot/evaluation/eval_schemas.py (1 hunks)
  • src/wandbot/evaluation/weave_model.py (1 hunks)
  • src/wandbot/evaluation/weave_scorer.py (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
src/wandbot/evaluation/weave_model.py (3)
tests/evaluation/test_eval.py (2)
  • json (43-46)
  • raise_for_status (48-50)
src/wandbot/evaluation/eval_schemas.py (1)
  • EvalChatResponse (9-41)
src/wandbot/utils.py (2)
  • get_logger (52-98)
  • model (222-225)
🪛 LanguageTool
README.md

[uncategorized] ~279-~279: “its” (belonging to it) seems less likely than “it’s” (it is)
Context: ...ng fields in each JSON object, although its not essential: * `retrieved_contexts...

(AI_HYDRA_LEO_CPT_ITS_ITIS)

🪛 markdownlint-cli2 (0.17.2)
README.md

Lines 275, 276, 281, 283, 289-300: Unordered list style. Expected: dash; Actual: asterisk (MD004, ul-style)

Comment on lines +25 to 26
precomputed_answers_json_path: str | None = sp.field(default=None, help="Path to a JSON file containing precomputed answers. If provided, network calls to wandbot will be skipped.")


🛠️ Refactor suggestion

Consider validating the path at parse-time

precomputed_answers_json_path is declared but never validated.
If a user fat-fingers the path, the evaluation will silently fall back to live calls, which might be surprising and very expensive.

+    @property
+    def precomputed_answers_path(self) -> Path | None:
+        """Return a `Path` object and verify the file exists (if provided)."""
+        if self.precomputed_answers_json_path is None:
+            return None
+        p = Path(self.precomputed_answers_json_path).expanduser()
+        if not p.is_file():
+            raise FileNotFoundError(f"Pre-computed answers file not found: {p}")
+        return p

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In src/wandbot/evaluation/eval_config.py around lines 25 to 26, the
precomputed_answers_json_path field lacks validation, which can cause silent
fallback to live calls if the path is incorrect. Add validation logic during
parsing to check if the provided path exists and is a valid file. If the path is
invalid, raise an appropriate error or warning to prevent unexpected behavior
and costly live calls.

Comment on lines 20 to 24
# Load environment variables from .env in project root
ENV_PATH = Path(__file__).parent.parent.parent.parent / ".env"
load_dotenv(ENV_PATH, override=True)

logger = get_logger(__name__)

⚠️ Potential issue

Set WEAVE_PARALLELISM before weave.init

weave.init() reads WEAVE_PARALLELISM once at import-time.
Currently the env var is exported on line 56, after initialization, so the desired parallelism is ignored.

-    weave.init(f"{config.wandb_entity}/{config.wandb_project}")
-    os.environ["WEAVE_PARALLELISM"] = str(config.n_weave_parallelism)
+    os.environ["WEAVE_PARALLELISM"] = str(config.n_weave_parallelism)
+    weave.init(f"{config.wandb_entity}/{config.wandb_project}")

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In src/wandbot/evaluation/eval.py around lines 20 to 24, the environment
variable WEAVE_PARALLELISM must be set before calling weave.init() because
weave.init() reads this variable at import time. Move the code that sets or
exports WEAVE_PARALLELISM to a position before weave.init() is called to ensure
the desired parallelism setting is applied correctly.

Comment on lines +27 to 37
def get_wandbot_configs(config: EvalConfig) -> dict:  # config is now required
    """Get wandbot's configs and repo git info"""
    url = config.wandbot_url  # No default, use from config
    try:
        response = requests.get(f"{url}/configs")
        response.raise_for_status()
        return response.json()
    except requests.RequestException as e:
        logger.error(f"Error making request to wandbot configs: {e}")
        return {
            "chat_config": {},
            "vector_store_config": {},
            "git_info": {},
            "app_config": {}
        }

@retry(
    stop=stop_after_attempt(5),
    wait=wait_exponential(multiplier=1, min=10, max=300),
    retry=retry_if_exception_type(httpx.HTTPError),
    before_sleep=lambda retry_state: logger.warning(
        f"Attempt {retry_state.attempt_number} failed. Retrying in {retry_state.next_action.sleep} seconds..."
    ),
    after=after_log(logger, logging.ERROR)
)
async def make_request(url: str, question: str, application: str = "api-eval", language: str = "en") -> dict:
    """Make HTTP request to wandbot API with retry logic."""
    request_timeout = 120.0
    request_connect_timeout = 30.0
    async with httpx.AsyncClient(timeout=httpx.Timeout(timeout=request_timeout, connect=request_connect_timeout)) as client:
        try:
            response = await client.post(
                f"{url}/chat/query",
                json={"question": question, "application": application, "language": language}
            )
            response.raise_for_status()
            return response.json()
        except httpx.ReadTimeout:
            logger.error(f"Request timed out after {request_timeout} seconds")
            raise
        except httpx.ConnectTimeout:
            logger.error(f"Connection timed out after {request_connect_timeout} seconds")
            raise

async def get_answer(question: str, wandbot_url: str, application: str = "api-eval", language: str = "en") -> str:
    """Get answer from wandbot API."""
    try:
        result = await make_request(wandbot_url, question, application, language)
        return json.dumps(result)
    except Exception as e:
        logger.error(f"Failed to get answer: {str(e)}")
        return json.dumps({
            "error": str(e),
            "answer": "",
            "system_prompt": "",
            "source_documents": "",
            "model": "",
            "total_tokens": 0,
            "prompt_tokens": 0,
            "completion_tokens": 0,
            "time_taken": 0
        })

def parse_text_to_json(text):
    # Split the text into documents
    documents = re.split(r"source: https?://", text)[1:]
    result = []
    for doc in documents:
        source_url = "https://" + doc.split("\n")[0].strip()
        content = "\n".join(doc.split("\n")[1:]).strip()
        document = {"source": source_url, "content": content}
        result.append(document)
    return result


@weave.op
async def get_record(question: str, wandbot_url: str, application: str = "api-eval", language: str = "en") -> dict:
    try:
        response = await get_answer(question, wandbot_url=wandbot_url, application=application, language=language)
        response_dict = json.loads(response)

        if not response_dict:
            try:
                error_data = json.loads(response)
                error_msg = error_data.get("error", "Unknown API error")
            except json.JSONDecodeError:
                error_msg = response if response else "Empty response from API"

            logger.error(error_msg)
            return {
                "system_prompt": "",
                "generated_answer": "",
                "response_synthesis_llm_messages": [],
                "retrieved_contexts": [],
                "model": "",
                "total_tokens": 0,
                "prompt_tokens": 0,
                "completion_tokens": 0,
                "time_taken": 0,
                "has_error": True,
                "api_call_statuses": {},
                "error_message": error_msg
            }

        return {
            "system_prompt": response_dict.get("system_prompt", ""),
            "generated_answer": response_dict.get("answer", ""),
            "response_synthesis_llm_messages": response_dict.get("response_synthesis_llm_messages", []),
            "retrieved_contexts": parse_text_to_json(
                response_dict.get("source_documents", "")
            ),
            "model": response_dict.get("model", ""),
            "total_tokens": response_dict.get("total_tokens", 0),
            "prompt_tokens": response_dict.get("prompt_tokens", 0),
            "completion_tokens": response_dict.get("completion_tokens", 0),
            "time_taken": response_dict.get("time_taken", 0),
            "api_call_statuses": response_dict.get("api_call_statuses", {}),
            "has_error": False,
            "error_message": None
        }
    except Exception as e:
        error_msg = f"Error getting response from wandbotAPI: {str(e)}"
        logger.error(error_msg)
        return {
            "system_prompt": "",
            "generated_answer": "",
            "response_synthesis_llm_messages": [],
            "retrieved_contexts": [],
            "model": "",
            "total_tokens": 0,
            "prompt_tokens": 0,
            "completion_tokens": 0,
            "time_taken": 0,
            "has_error": True,
            "api_call_statuses": {},
            "error_message": error_msg
        }


class WandbotModel(weave.Model):
    language: str = "en"
    application: str = "api-eval"
    wandbot_url: str = "http://0.0.0.0:8000"
    wandbot_config: dict = {}

    @weave.op
    async def predict(self, question: str) -> dict:
        prediction = await get_record(question,
                                      wandbot_url=self.wandbot_url,
                                      application=self.application,
                                      language=self.language)
        return prediction


class WandbotCorrectnessScorer(weave.Scorer):
    config: EvalConfig
    correctness_evaluator: WandBotCorrectnessEvaluator = None
    debug: bool = False

    def __init__(self, config: EvalConfig):
        super().__init__(config=config)
        self.debug = config.debug
        self.correctness_evaluator = WandBotCorrectnessEvaluator(
            provider=config.eval_judge_provider,
            model_name=config.eval_judge_model,
            temperature=config.eval_judge_temperature,
            max_retries=config.max_evaluator_retries,
            timeout=config.evaluator_timeout,
        )

    @weave.op
    async def score(self, question: str, ground_truth: str, notes: str, model_output: dict) -> dict:

        if self.debug:
            if model_output is not None:
                logger.debug(f"In WandbotCorrectnessScorer, model_output keys:\n{model_output.keys()}")
            else:
                logger.error("model_output is None")

        try:
            contexts = [c["content"] for c in model_output.get("retrieved_contexts", [])] if model_output.get("retrieved_contexts") else []

            if model_output.get("generated_answer", "") == "":
                error_msg = "Generated answer is empty"
                logger.error(error_msg)
                return {
                    "answer_correct": False,
                    "reasoning": error_msg,
                    "score": 1.0,
                    "has_error": True,
                    "error_message": error_msg
                }
            if model_output.get("has_error", False):
                error_msg = model_output.get("error_message", "Unknown error")
                logger.error(error_msg)
                return {
                    "answer_correct": False,
                    "reasoning": error_msg,
                    "score": 1.0,
                    "has_error": True,
                    "error_message": error_msg
                }

            # If not error from wandbot generation, run the correctness evaluator
            return await self.correctness_evaluator.aevaluate(
                query=question,
                response=model_output.get("generated_answer", ""),
                reference=ground_truth,
                contexts=contexts,
                reference_notes=notes,
            )

        except Exception as e:
            error_msg = f"Error evaluating answer: {str(e)}"
            logger.error(error_msg)
            return CorrectnessEvaluationResult(
                query=question,
                response=model_output.get("generated_answer", ""),
                contexts=contexts,
                passing=False,
                score=1.0,
                reasoning=error_msg,
                has_error=True,
                error_message=error_msg
            )


🛠️ Refactor suggestion

Network call still executed when only pre-computed answers are used

Even when precomputed_answers_json_path is supplied, get_wandbot_configs issues an HTTP request that is guaranteed to fail in offline workflows and needlessly clutters the logs.

A quick guard avoids the call and keeps the logic simple:

-    wandbot_info = get_wandbot_configs(config)  # Pass config directly
+    wandbot_info = (
+        {} if config.precomputed_answers_json_path
+        else get_wandbot_configs(config)
+    )

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In src/wandbot/evaluation/eval.py around lines 27 to 37, the function
get_wandbot_configs makes a network request even when
precomputed_answers_json_path is provided, causing unnecessary failures and log
clutter in offline workflows. Add a guard at the start of the function to check
if precomputed_answers_json_path is set in the config, and if so, skip the HTTP
request and return an appropriate default or empty config dictionary
immediately.

Comment on lines +67 to +73
if item_index_str in precomputed_answers_map:
    logger.warning(
        f"Duplicate string index '{item_index_str}' found in precomputed answers. "
        f"Overwriting with item from original list at index {i}."
    )
precomputed_answers_map[item_index_str] = item


🛠️ Refactor suggestion

Consider failing fast on duplicate indices

When the same index appears twice, the later entry silently overwrites the former.
For evaluation reproducibility it’s usually safer to raise, or at least surface a stronger warning, because duplicates often indicate a bug in the pre-computed file.

-            if item_index_str in precomputed_answers_map:
-                logger.warning(
-                    f"Duplicate string index '{item_index_str}' found in precomputed answers. "
-                    f"Overwriting with item from original list at index {i}."
-                )
-            precomputed_answers_map[item_index_str] = item
+            if item_index_str in precomputed_answers_map:
+                raise ValueError(
+                    f"Duplicate string index '{item_index_str}' encountered (item positions "
+                    f"{precomputed_answers_map[item_index_str].get('__orig_pos__', 'N/A')} and {i})."
+                )
+            # store original position for better error messages later
+            item["__orig_pos__"] = i
+            precomputed_answers_map[item_index_str] = item
🤖 Prompt for AI Agents
In src/wandbot/evaluation/data_utils.py around lines 67 to 73, the code
currently logs a warning and overwrites entries when duplicate indices are found
in precomputed_answers_map. To fail fast and improve reproducibility, replace
the warning and overwrite behavior with raising an exception when a duplicate
index is detected. This will immediately surface the issue and prevent silent
data overwrites.

Comment on lines +56 to +63
score_result_dict = CorrectnessEvaluationResult(
    reason=_error_msg,
    score=1.0,
    decision="incorrect",
    answer_correct=False,
    has_error=True,
    error_message=_error_msg
).model_dump(warnings=False)

🛠️ Refactor suggestion

Scoring semantics look inverted (score=1.0 for incorrect answers)

CorrectnessEvaluationResult is instantiated with score=1.0 in error/incorrect
paths. Unless your convention is “1 = worst”, this will mark every failure as
perfect. Double-check the metric definition; you probably want 0.0 here.
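
A minimal sketch of the fix, assuming 0.0 is the failing score under this convention:

-            score=1.0,
+            score=0.0,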

🤖 Prompt for AI Agents
In src/wandbot/evaluation/weave_scorer.py around lines 56 to 63, the score is
set to 1.0 for incorrect answers, which likely inverts the intended scoring
semantics. Change the score value from 1.0 to 0.0 in the error or incorrect
paths to correctly reflect that these are failing cases, assuming the convention
is 1.0 for perfect and 0.0 for worst.

Comment on lines +195 to +199
"question": str(item.get("question", "")),
"system_prompt": str(item.get("system_prompt", "")),
"answer": str(item.get("generated_answer", "")),
"model": str(item.get("model", "precomputed")),
"sources": str(item.get("sources", "")),

🛠️ Refactor suggestion

Fall back to 'answer' when pre-computed item lacks 'generated_answer'

Many datasets still use the API’s original answer key. Relying solely on
generated_answer will treat such rows as empty answers → false negatives.

-            "answer": str(item.get("generated_answer", "")),
+            "answer": str(item.get("generated_answer") or item.get("answer", "")),
🤖 Prompt for AI Agents
In src/wandbot/evaluation/weave_model.py around lines 195 to 199, the code
currently uses only the 'generated_answer' key to get the answer, which causes
issues with datasets that use the original 'answer' key. Modify the code to
first try 'generated_answer' and if it is missing or empty, fall back to using
the 'answer' key instead to avoid false negatives.

Comment on lines +31 to +33
async with httpx.AsyncClient(
    timeout=httpx.Timeout(timeout=request_timeout, connect=request_connect_timeout)
) as client:

⚠️ Potential issue

Incorrect httpx.Timeout keyword – will raise TypeError

httpx.Timeout expects total=…, not timeout=….
As written, the first request will blow up before any retry logic kicks in.

-        timeout=httpx.Timeout(timeout=request_timeout, connect=request_connect_timeout)
+        timeout=httpx.Timeout(total=request_timeout, connect=request_connect_timeout)
🤖 Prompt for AI Agents
In src/wandbot/evaluation/weave_model.py around lines 31 to 33, the
httpx.AsyncClient is incorrectly using the keyword argument 'timeout' instead of
'total' when creating the httpx.Timeout object. Replace
'timeout=request_timeout' with 'total=request_timeout' to correctly specify the
total timeout duration and prevent a TypeError from being raised.
