
add passing precomputed answers for eval #96

Open

morganmcg1 wants to merge 4 commits into main from add_passing_precomputed_answers_for_eval

Conversation

morganmcg1 (Member) commented May 22, 2025

  • update the wandbot evaluation datasets to a dataset version with an "index" column
  • add README instructions on how to pass precomputed answers
  • add the ability to pass a precomputed-answers JSON file, and refactor the eval script and helpers (see the sketch after this list)
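
For a quick sense of the workflow, a minimal sketch of a precomputed-answers file and how it might be passed to the eval script. Field names beyond `index`, `question`, and `generated_answer`, plus the module path and CLI flag, are assumptions based on the changes reviewed below:

import json

# Illustrative only: the "index" values must match the "index" column of the
# evaluation dataset; fields like "retrieved_contexts" are optional.
precomputed = [
    {
        "index": "0",
        "question": "How do I log a wandb artifact?",
        "generated_answer": "Use wandb.Artifact(...) and run.log_artifact(...).",
        "retrieved_contexts": [
            {"source": "https://docs.wandb.ai/guides/artifacts", "content": "..."}
        ],
    }
]

with open("precomputed_answers.json", "w", encoding="utf-8") as f:
    json.dump(precomputed, f, indent=2)

# Then point the eval at it, e.g. (flag name assumed from the new config field):
#   python -m wandbot.evaluation.eval --precomputed_answers_json_path precomputed_answers.json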

Summary by CodeRabbit

  • New Features

    • Added support for evaluating files with precomputed answers, allowing users to bypass live API calls by providing a JSON file.
    • Introduced new configuration options to specify precomputed answer files and updated dataset references.
    • Enhanced evaluation workflows to support both live and precomputed answer modes.
    • Integrated a new correctness scorer with detailed error handling and result tracking.
    • Updated response synthesizer models to newer versions for improved performance.
  • Documentation

    • Expanded the README with detailed instructions and examples for using precomputed answers in evaluations.
  • Refactor

    • Streamlined evaluation logic for improved modularity and maintainability, delegating data loading and scoring to dedicated modules.
    • Removed internal Wandbot API implementations, replacing them with external modular components.
  • Bug Fixes

    • Improved error handling and logging during evaluation and scoring processes.

coderabbitai bot commented May 22, 2025

Walkthrough

This update introduces modular data loading and evaluation utilities, an extensible evaluation schema, and robust Wandbot model and scorer classes for correctness evaluation. The main evaluation script is refactored to use these new modules, and configuration options are expanded to support precomputed answer files. Documentation is updated accordingly.
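
A rough sketch of how the refactored pieces fit together; the module and class names come from the changes listed below, but the function signatures and constructor arguments are assumptions:

import weave

from wandbot.evaluation.eval_config import EvalConfig
from wandbot.evaluation.data_utils import load_and_prepare_dataset_rows
from wandbot.evaluation.weave_model import WandbotModel
from wandbot.evaluation.weave_scorer import WandbotCorrectnessScorer

async def run_eval(config: EvalConfig) -> None:
    weave.init(f"{config.wandb_entity}/{config.wandb_project}")
    rows = load_and_prepare_dataset_rows(config)          # dataset rows keyed by "index"
    model = WandbotModel(wandbot_url=config.wandbot_url)  # or initialized with precomputed answers
    scorer = WandbotCorrectnessScorer(config=config)
    evaluation = weave.Evaluation(dataset=rows, scorers=[scorer])
    await evaluation.evaluate(model)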

Changes

| File(s) | Change Summary |
| --- | --- |
| README.md | Added a detailed section on evaluating with precomputed answers, explaining the JSON structure, optional fields, and usage examples. Adjusted heading levels in the debugging section. |
| src/wandbot/evaluation/data_utils.py | New module providing functions to sanitize, load, and prepare precomputed answer data and evaluation datasets, with structured logging and error handling. |
| src/wandbot/evaluation/eval.py | Refactored to delegate data loading, Wandbot API/model, and scoring logic to new modules. Rewrote evaluation orchestration to use modular helpers, updated config handling, and added result persistence with error handling. |
| src/wandbot/evaluation/eval_config.py | Added `precomputed_answers_json_path` to `EvalConfig` for specifying precomputed answers. Updated dataset URIs in the `eval_dataset` property. |
| src/wandbot/evaluation/eval_schemas.py | Introduced `EvalChatResponse` (sketched below), extending `ChatResponse` with optional and error-tracking fields, token/timing metrics, and parsed context support for evaluation. |
| src/wandbot/evaluation/weave_model.py | New module implementing robust async Wandbot API interaction with retry logic, error handling, response parsing, and a Weave model supporting both live and precomputed answers for evaluation. |
| src/wandbot/evaluation/weave_scorer.py | New scorer class for correctness evaluation, with async scoring logic, error handling, result tracking, and integration with Weave. |
| src/wandbot/configs/chat_config.py | Updated the default response synthesizer model and fallback model to a newer version. |
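
For orientation, a minimal sketch of what `EvalChatResponse` might look like based on the summary above; the base class is stubbed here and the exact field names and types are assumptions:

from typing import Optional
from pydantic import BaseModel, Field

class ChatResponse(BaseModel):      # stand-in for the real wandbot ChatResponse
    answer: str = ""
    model: str = ""

class EvalChatResponse(ChatResponse):
    has_error: bool = False
    error_message: Optional[str] = None
    total_tokens: int = 0
    prompt_tokens: int = 0
    completion_tokens: int = 0
    time_taken: float = 0.0
    retrieved_contexts: list[dict] = Field(default_factory=list)  # parsed source documents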

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant EvalMain
    participant DataUtils
    participant WeaveModel
    participant WeaveScorer
    participant WandbotAPI

    User->>EvalMain: Run evaluation script
    EvalMain->>DataUtils: Load dataset rows
    alt Precomputed answers provided
        EvalMain->>DataUtils: Load precomputed answers
        EvalMain->>WeaveModel: Initialize with precomputed data
    else
        EvalMain->>WeaveModel: Initialize with Wandbot config
    end
    loop For each dataset row
        EvalMain->>WeaveModel: predict(index, question)
        alt Precomputed data
            WeaveModel-->>EvalMain: Return precomputed answer
        else
            WeaveModel->>WandbotAPI: Query for answer
            WandbotAPI-->>WeaveModel: Return API response
            WeaveModel-->>EvalMain: Return parsed answer
        end
        EvalMain->>WeaveScorer: score(index, question, ground_truth, notes, model_output)
        WeaveScorer-->>EvalMain: Return score result
    end
    EvalMain->>EvalMain: Save results to JSON
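
A sketch of the predict() branching shown in the diagram above; the attribute names and the live-API helper are assumptions, not the exact weave_model.py implementation:

import weave

async def query_wandbot_api(question: str, wandbot_url: str) -> dict:
    """Placeholder for the live-API path (the retrying HTTP call in weave_model.py)."""
    raise NotImplementedError

class WandbotModel(weave.Model):
    wandbot_url: str = "http://0.0.0.0:8000"
    precomputed_answers: dict | None = None  # maps dataset "index" -> precomputed record

    @weave.op
    async def predict(self, index: str, question: str) -> dict:
        if self.precomputed_answers is not None:
            # precomputed mode: no network call, just look the row up by index
            return self.precomputed_answers.get(
                index,
                {"has_error": True, "error_message": f"No precomputed answer for index {index}"},
            )
        return await query_wandbot_api(question, wandbot_url=self.wandbot_url)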

Poem

In the warren of code, new modules appear,
For loading and scoring, the purpose is clear.
Precomputed or live, answers now flow,
With schemas and configs, evaluations grow.
The rabbits rejoice, their work now refined—
Hopping through data, correctness in mind!
🐇✨



📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ceca3ac and 337b766.

📒 Files selected for processing (1)
  • src/wandbot/configs/chat_config.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • src/wandbot/configs/chat_config.py


coderabbitai bot left a comment

Actionable comments posted: 7

🧹 Nitpick comments (11)
src/wandbot/evaluation/eval_config.py (1)

27-33: Avoid hard-coding URIs inside the property

The weave URIs are effectively configuration. Placing them inline makes them hard to update and re-use (e.g., when the dataset version bumps again).

A tiny refactor keeps the data in a constant mapping, making future updates one-liner changes and enabling potential CLI overrides.

-    def eval_dataset(self) -> str:
-        if self.lang == "ja":
-            return "weave:///wandbot/wandbot-eval/object/wandbot_eval_data_jp:I2BlFnw1VnPn8lFG72obBWN1sCokB3EYk4G4vSKg23g"
-        return "weave:///wandbot/wandbot-eval/object/wandbot_eval_data:ZZUQa2CCAqPDFWiB90VANCm4EdT8qtc125NazaWUrdI"
+    def eval_dataset(self) -> str:
+        DATASET_URIS = {
+            "en": "weave:///wandbot/wandbot-eval/object/wandbot_eval_data:ZZUQa2CCAqPDFWiB90VANCm4EdT8qtc125NazaWUrdI",
+            "ja": "weave:///wandbot/wandbot-eval/object/wandbot_eval_data_jp:I2BlFnw1VnPn8lFG72obBWN1sCokB3EYk4G4vSKg23g",
+        }
+        return DATASET_URIS[self.lang]
src/wandbot/evaluation/eval_schemas.py (2)

16-31: pydantic-v2 config style mismatch

When using Pydantic v2, the recommended way to enable attribute loading is via model_config = ConfigDict(from_attributes=True) instead of the v1-style class Config.
While the current approach still works (Pydantic keeps a shim), adopting the v2 idiom prevents deprecation warnings down the road.

-from datetime import datetime
+# pyright: reportUnknownArgumentType=ignore
+from datetime import datetime
@@
-class Config:
-    from_attributes = True
+from pydantic import ConfigDict
+
+model_config = ConfigDict(from_attributes=True)
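
For clarity, the v2 idiom in context would look roughly like this (class body abbreviated, names illustrative):

from pydantic import BaseModel, ConfigDict

class EvalChatResponse(BaseModel):
    model_config = ConfigDict(from_attributes=True)

    generated_answer: str = ""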

33-38: Minor: retrieved_contexts default mutability

default_factory=list is 👍🏻 for avoiding a shared default, but because the list elements are themselves dicts the scorer might mutate them.
If mutation is unintended, consider returning a tuple (default_factory=tuple) or using typing.Sequence to make intent explicit.
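
A sketch of the more defensive default, if mutation is indeed unintended (field shown in isolation for illustration):

from typing import Sequence
from pydantic import BaseModel, Field

class EvalChatResponse(BaseModel):
    # a tuple default signals that retrieved contexts should not be mutated in place
    retrieved_contexts: Sequence[dict] = Field(default_factory=tuple)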

README.md (1)

265-286: Grammar & Markdown lint

Small wording tweak (“its” → “it’s”) and switch to dash markers to appease markdownlint (MD004).

-To enable context-based scoring (e.g., by `WandbotCorrectnessScorer`), you can provide the context information through one of the following fields in each JSON object, although its not essential:
-*   `retrieved_contexts` ...
+To enable context-based scoring (e.g., by `WandbotCorrectnessScorer`), you can provide context information through one of the following fields in each JSON object, although **it’s** not essential:
+- `retrieved_contexts` ...
🧰 Tools
🪛 LanguageTool

[uncategorized] ~279-~279: “its” (belonging to it) seems less likely than “it’s” (it is)
Context: ...ng fields in each JSON object, although its not essential: * `retrieved_contexts...

(AI_HYDRA_LEO_CPT_ITS_ITIS)

🪛 markdownlint-cli2 (0.17.2)

Lines 275, 276, 281, 283: Unordered list style. Expected: dash; Actual: asterisk (MD004, ul-style)

src/wandbot/evaluation/eval.py (3)

7-8: requests import no longer needed

get_wandbot_configs is the only consumer of requests, and that call is skipped when precomputed_answers_json_path is set.
Switching to httpx async client (already used elsewhere) or deferring the import inside the function would remove an otherwise unused hard dependency.
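
A minimal sketch of the deferred-import variant; the early-return guard here is an assumption tied to the offline workflow discussed later in this review:

from wandbot.evaluation.eval_config import EvalConfig

def get_wandbot_configs(config: EvalConfig) -> dict:
    if config.precomputed_answers_json_path:
        # offline run: no live wandbot instance to query
        return {"chat_config": {}, "vector_store_config": {}, "git_info": {}, "app_config": {}}
    import requests  # deferred import keeps requests out of the module-level dependencies
    response = requests.get(f"{config.wandbot_url}/configs")
    response.raise_for_status()
    return response.json()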


96-104: Typo in attribute key

"wandbot_vectore_store_config" contains an extra “e”.
Down-stream dashboards querying this attribute will miss the data.

-        "wandbot_vectore_store_config": wandbot_info.get("vector_store_config", {}) ...
+        "wandbot_vector_store_config": wandbot_info.get("vector_store_config", {}) ...

110-113: Multiline f-string readability

The back-slash continuation inside the f-string injects a literal newline character that splits the log message.
Prefer explicit \n or separate logger.info calls for clarity.

-    logger.info(
-        f"Starting evaluation of {len(question_rows_for_eval)} samples with {config.n_trials} trials, \
-{len(question_rows_for_eval) * config.n_trials} calls in total."
-    )
+    total_calls = len(question_rows_for_eval) * config.n_trials
+    logger.info(
+        f"Starting evaluation of {len(question_rows_for_eval)} samples "
+        f"with {config.n_trials} trials ({total_calls} total requests)."
+    )
src/wandbot/evaluation/data_utils.py (1)

28-30: Use explicit UTF-8 encoding when opening JSON files

open(file_path, "r") relies on the system default encoding. On non-UTF-8 locales (e.g. Windows with cp1252) this will break for Unicode content.

-        with open(file_path, "r") as f:
+        with open(file_path, "r", encoding="utf-8") as f:
src/wandbot/evaluation/weave_scorer.py (2)

35-38: index is declared as int but used as a string elsewhere

load_and_prepare_dataset_rows stores indices as strings and WandbotModel.predict
expects a str. Declaring the argument as int may confuse type-checkers and
future readers.

-        self, index: int, question: str, ground_truth: str, notes: str, model_output: dict
+        self, index: str, question: str, ground_truth: str, notes: str, model_output: dict

113-117: json.dumps(model_output) may raise for non-serialisable objects

If model_output ever contains datetimes or custom classes the dump will raise
and break evaluation. Consider default=str or orjson.dumps with option
to handle non-serialisable types gracefully.

-            "notes": notes, "model_output": json.dumps(model_output),
+            "notes": notes, "model_output": json.dumps(model_output, default=str),
src/wandbot/evaluation/weave_model.py (1)

201-204: Parsing numeric fields can raise – protect with try/except

Direct int(...) / float(...) casts will explode if the source contains
non-numeric text (e.g., "N/A"). A helper with safe coercion avoids crashing
the whole run.

def _to_int(val, default=0):
    try:
        return int(val)
    except (TypeError, ValueError):
        return default

"total_tokens": _to_int(item.get("total_tokens")),
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 29d248b and ceca3ac.

📒 Files selected for processing (7)
  • README.md (2 hunks)
  • src/wandbot/evaluation/data_utils.py (1 hunks)
  • src/wandbot/evaluation/eval.py (2 hunks)
  • src/wandbot/evaluation/eval_config.py (1 hunks)
  • src/wandbot/evaluation/eval_schemas.py (1 hunks)
  • src/wandbot/evaluation/weave_model.py (1 hunks)
  • src/wandbot/evaluation/weave_scorer.py (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
src/wandbot/evaluation/weave_model.py (3)
tests/evaluation/test_eval.py (2)
  • json (43-46)
  • raise_for_status (48-50)
src/wandbot/evaluation/eval_schemas.py (1)
  • EvalChatResponse (9-41)
src/wandbot/utils.py (2)
  • get_logger (52-98)
  • model (222-225)
🪛 LanguageTool
README.md

[uncategorized] ~279-~279: “its” (belonging to it) seems less likely than “it’s” (it is)
Context: ...ng fields in each JSON object, although its not essential: * `retrieved_contexts...

(AI_HYDRA_LEO_CPT_ITS_ITIS)

🪛 markdownlint-cli2 (0.17.2)
README.md

Lines 275, 276, 281, 283, 289-300: Unordered list style. Expected: dash; Actual: asterisk (MD004, ul-style)

Comment on lines +25 to 26
precomputed_answers_json_path: str | None = sp.field(default=None, help="Path to a JSON file containing precomputed answers. If provided, network calls to wandbot will be skipped.")


🛠️ Refactor suggestion

Consider validating the path at parse-time

precomputed_answers_json_path is declared but never validated.
If a user fat-fingers the path, the evaluation will silently fall back to live calls, which might be surprising and very expensive.

+    @property
+    def precomputed_answers_path(self) -> Path | None:
+        """Return a `Path` object and verify the file exists (if provided)."""
+        if self.precomputed_answers_json_path is None:
+            return None
+        p = Path(self.precomputed_answers_json_path).expanduser()
+        if not p.is_file():
+            raise FileNotFoundError(f"Pre-computed answers file not found: {p}")
+        return p

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In src/wandbot/evaluation/eval_config.py around lines 25 to 26, the
precomputed_answers_json_path field lacks validation, which can cause silent
fallback to live calls if the path is incorrect. Add validation logic during
parsing to check if the provided path exists and is a valid file. If the path is
invalid, raise an appropriate error or warning to prevent unexpected behavior
and costly live calls.

Comment on lines 20 to 24
# Load environment variables from .env in project root
ENV_PATH = Path(__file__).parent.parent.parent.parent / ".env"
load_dotenv(ENV_PATH, override=True)

logger = get_logger(__name__)

⚠️ Potential issue

Set WEAVE_PARALLELISM before weave.init

weave.init() reads WEAVE_PARALLELISM once at import-time.
Currently the env var is exported on line 56, after initialization, so the desired parallelism is ignored.

-    weave.init(f"{config.wandb_entity}/{config.wandb_project}")
-    os.environ["WEAVE_PARALLELISM"] = str(config.n_weave_parallelism)
+    os.environ["WEAVE_PARALLELISM"] = str(config.n_weave_parallelism)
+    weave.init(f"{config.wandb_entity}/{config.wandb_project}")

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In src/wandbot/evaluation/eval.py around lines 20 to 24, the environment
variable WEAVE_PARALLELISM must be set before calling weave.init() because
weave.init() reads this variable at import time. Move the code that sets or
exports WEAVE_PARALLELISM to a position before weave.init() is called to ensure
the desired parallelism setting is applied correctly.

Comment on lines +27 to 37
def get_wandbot_configs(config: EvalConfig) -> dict:  # config is now required
    """Get wandbot's configs and repo git info"""
    url = config.wandbot_url  # No default, use from config
    try:
        response = requests.get(f"{url}/configs")
        response.raise_for_status()
        return response.json()
    except requests.RequestException as e:
        logger.error(f"Error making request to wandbot configs: {e}")
        return {
            "chat_config": {},
            "vector_store_config": {},
            "git_info": {},
            "app_config": {}
        }

@retry(
    stop=stop_after_attempt(5),
    wait=wait_exponential(multiplier=1, min=10, max=300),
    retry=retry_if_exception_type(httpx.HTTPError),
    before_sleep=lambda retry_state: logger.warning(
        f"Attempt {retry_state.attempt_number} failed. Retrying in {retry_state.next_action.sleep} seconds..."
    ),
    after=after_log(logger, logging.ERROR)
)
async def make_request(url: str, question: str, application: str = "api-eval", language: str = "en") -> dict:
    """Make HTTP request to wandbot API with retry logic."""
    request_timeout = 120.0
    request_connect_timeout = 30.0
    async with httpx.AsyncClient(timeout=httpx.Timeout(timeout=request_timeout, connect=request_connect_timeout)) as client:
        try:
            response = await client.post(
                f"{url}/chat/query",
                json={"question": question, "application": application, "language": language}
            )
            response.raise_for_status()
            return response.json()
        except httpx.ReadTimeout:
            logger.error(f"Request timed out after {request_timeout} seconds")
            raise
        except httpx.ConnectTimeout:
            logger.error(f"Connection timed out after {request_connect_timeout} seconds")
            raise

async def get_answer(question: str, wandbot_url: str, application: str = "api-eval", language: str = "en") -> str:
    """Get answer from wandbot API."""
    try:
        result = await make_request(wandbot_url, question, application, language)
        return json.dumps(result)
    except Exception as e:
        logger.error(f"Failed to get answer: {str(e)}")
        return json.dumps({
            "error": str(e),
            "answer": "",
            "system_prompt": "",
            "source_documents": "",
            "model": "",
            "total_tokens": 0,
            "prompt_tokens": 0,
            "completion_tokens": 0,
            "time_taken": 0
        })

def parse_text_to_json(text):
    # Split the text into documents
    documents = re.split(r"source: https?://", text)[1:]
    result = []
    for doc in documents:
        source_url = "https://" + doc.split("\n")[0].strip()
        content = "\n".join(doc.split("\n")[1:]).strip()
        document = {"source": source_url, "content": content}
        result.append(document)
    return result


@weave.op
async def get_record(question: str, wandbot_url: str, application: str = "api-eval", language: str = "en") -> dict:
    try:
        response = await get_answer(question, wandbot_url=wandbot_url, application=application, language=language)
        response_dict = json.loads(response)

        if not response_dict:
            try:
                error_data = json.loads(response)
                error_msg = error_data.get("error", "Unknown API error")
            except json.JSONDecodeError:
                error_msg = response if response else "Empty response from API"

            logger.error(error_msg)
            return {
                "system_prompt": "",
                "generated_answer": "",
                "response_synthesis_llm_messages": [],
                "retrieved_contexts": [],
                "model": "",
                "total_tokens": 0,
                "prompt_tokens": 0,
                "completion_tokens": 0,
                "time_taken": 0,
                "has_error": True,
                "api_call_statuses": {},
                "error_message": error_msg
            }

        return {
            "system_prompt": response_dict.get("system_prompt", ""),
            "generated_answer": response_dict.get("answer", ""),
            "response_synthesis_llm_messages": response_dict.get("response_synthesis_llm_messages", []),
            "retrieved_contexts": parse_text_to_json(
                response_dict.get("source_documents", "")
            ),
            "model": response_dict.get("model", ""),
            "total_tokens": response_dict.get("total_tokens", 0),
            "prompt_tokens": response_dict.get("prompt_tokens", 0),
            "completion_tokens": response_dict.get("completion_tokens", 0),
            "time_taken": response_dict.get("time_taken", 0),
            "api_call_statuses": response_dict.get("api_call_statuses", {}),
            "has_error": False,
            "error_message": None
        }
    except Exception as e:
        error_msg = f"Error getting response from wandbotAPI: {str(e)}"
        logger.error(error_msg)
        return {
            "system_prompt": "",
            "generated_answer": "",
            "response_synthesis_llm_messages": [],
            "retrieved_contexts": [],
            "model": "",
            "total_tokens": 0,
            "prompt_tokens": 0,
            "completion_tokens": 0,
            "time_taken": 0,
            "has_error": True,
            "api_call_statuses": {},
            "error_message": error_msg
        }


class WandbotModel(weave.Model):
    language: str = "en"
    application: str = "api-eval"
    wandbot_url: str = "http://0.0.0.0:8000"
    wandbot_config: dict = {}

    @weave.op
    async def predict(self, question: str) -> dict:
        prediction = await get_record(question,
                                      wandbot_url=self.wandbot_url,
                                      application=self.application,
                                      language=self.language)
        return prediction


class WandbotCorrectnessScorer(weave.Scorer):
    config: EvalConfig
    correctness_evaluator: WandBotCorrectnessEvaluator = None
    debug: bool = False

    def __init__(self, config: EvalConfig):
        super().__init__(config=config)
        self.debug = config.debug
        self.correctness_evaluator = WandBotCorrectnessEvaluator(
            provider=config.eval_judge_provider,
            model_name=config.eval_judge_model,
            temperature=config.eval_judge_temperature,
            max_retries=config.max_evaluator_retries,
            timeout=config.evaluator_timeout,
        )

    @weave.op
    async def score(self, question: str, ground_truth: str, notes: str, model_output: dict) -> dict:

        if self.debug:
            if model_output is not None:
                logger.debug(f"In WandbotCorrectnessScorer, model_output keys:\n{model_output.keys()}")
            else:
                logger.error("model_output is None")

        try:
            contexts = [c["content"] for c in model_output.get("retrieved_contexts", [])] if model_output.get("retrieved_contexts") else []

            if model_output.get("generated_answer", "") == "":
                error_msg = "Generated answer is empty"
                logger.error(error_msg)
                return {
                    "answer_correct": False,
                    "reasoning": error_msg,
                    "score": 1.0,
                    "has_error": True,
                    "error_message": error_msg
                }
            if model_output.get("has_error", False):
                error_msg = model_output.get("error_message", "Unknown error")
                logger.error(error_msg)
                return {
                    "answer_correct": False,
                    "reasoning": error_msg,
                    "score": 1.0,
                    "has_error": True,
                    "error_message": error_msg
                }

            # If not error from wandbot generation, run the correctness evaluator
            return await self.correctness_evaluator.aevaluate(
                query=question,
                response=model_output.get("generated_answer", ""),
                reference=ground_truth,
                contexts=contexts,
                reference_notes=notes,
            )

        except Exception as e:
            error_msg = f"Error evaluating answer: {str(e)}"
            logger.error(error_msg)
            return CorrectnessEvaluationResult(
                query=question,
                response=model_output.get("generated_answer", ""),
                contexts=contexts,
                passing=False,
                score=1.0,
                reasoning=error_msg,
                has_error=True,
                error_message=error_msg
            )


🛠️ Refactor suggestion

Network call still executed when only pre-computed answers are used

Even when precomputed_answers_json_path is supplied, get_wandbot_configs issues an HTTP request that is guaranteed to fail in offline workflows and needlessly clutters the logs.

A quick guard avoids the call and keeps the logic simple:

-    wandbot_info = get_wandbot_configs(config)  # Pass config directly
+    wandbot_info = (
+        {} if config.precomputed_answers_json_path
+        else get_wandbot_configs(config)
+    )

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In src/wandbot/evaluation/eval.py around lines 27 to 37, the function
get_wandbot_configs makes a network request even when
precomputed_answers_json_path is provided, causing unnecessary failures and log
clutter in offline workflows. Add a guard at the start of the function to check
if precomputed_answers_json_path is set in the config, and if so, skip the HTTP
request and return an appropriate default or empty config dictionary
immediately.

Comment on lines +67 to +73
if item_index_str in precomputed_answers_map:
    logger.warning(
        f"Duplicate string index '{item_index_str}' found in precomputed answers. "
        f"Overwriting with item from original list at index {i}."
    )
precomputed_answers_map[item_index_str] = item


🛠️ Refactor suggestion

Consider failing fast on duplicate indices

When the same index appears twice, the later entry silently overwrites the former.
For evaluation reproducibility it’s usually safer to raise, or at least surface a stronger warning, because duplicates often indicate a bug in the pre-computed file.

-            if item_index_str in precomputed_answers_map:
-                logger.warning(
-                    f"Duplicate string index '{item_index_str}' found in precomputed answers. "
-                    f"Overwriting with item from original list at index {i}."
-                )
-            precomputed_answers_map[item_index_str] = item
+            if item_index_str in precomputed_answers_map:
+                raise ValueError(
+                    f"Duplicate string index '{item_index_str}' encountered (item positions "
+                    f"{precomputed_answers_map[item_index_str].get('__orig_pos__', 'N/A')} and {i})."
+                )
+            # store original position for better error messages later
+            item["__orig_pos__"] = i
+            precomputed_answers_map[item_index_str] = item
🤖 Prompt for AI Agents
In src/wandbot/evaluation/data_utils.py around lines 67 to 73, the code
currently logs a warning and overwrites entries when duplicate indices are found
in precomputed_answers_map. To fail fast and improve reproducibility, replace
the warning and overwrite behavior with raising an exception when a duplicate
index is detected. This will immediately surface the issue and prevent silent
data overwrites.

Comment on lines +56 to +63
score_result_dict = CorrectnessEvaluationResult(
    reason=_error_msg,
    score=1.0,
    decision="incorrect",
    answer_correct=False,
    has_error=True,
    error_message=_error_msg
).model_dump(warnings=False)

🛠️ Refactor suggestion

Scoring semantics look inverted (score=1.0 for incorrect answers)

CorrectnessEvaluationResult is instantiated with score=1.0 in error/incorrect
paths. Unless your convention is “1 = worst”, this will mark every failure as
perfect. Double-check the metric definition; you probably want 0.0 here.
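
A minimal sketch of the fix, assuming 0.0 is the failing score under this convention:

-            score=1.0,
+            score=0.0,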

🤖 Prompt for AI Agents
In src/wandbot/evaluation/weave_scorer.py around lines 56 to 63, the score is
set to 1.0 for incorrect answers, which likely inverts the intended scoring
semantics. Change the score value from 1.0 to 0.0 in the error or incorrect
paths to correctly reflect that these are failing cases, assuming the convention
is 1.0 for perfect and 0.0 for worst.

Comment on lines +195 to +199
"question": str(item.get("question", "")),
"system_prompt": str(item.get("system_prompt", "")),
"answer": str(item.get("generated_answer", "")),
"model": str(item.get("model", "precomputed")),
"sources": str(item.get("sources", "")),

🛠️ Refactor suggestion

Fall back to 'answer' when pre-computed item lacks 'generated_answer'

Many datasets still use the API’s original answer key. Relying solely on
generated_answer will treat such rows as empty answers → false negatives.

-            "answer": str(item.get("generated_answer", "")),
+            "answer": str(item.get("generated_answer") or item.get("answer", "")),
🤖 Prompt for AI Agents
In src/wandbot/evaluation/weave_model.py around lines 195 to 199, the code
currently uses only the 'generated_answer' key to get the answer, which causes
issues with datasets that use the original 'answer' key. Modify the code to
first try 'generated_answer' and if it is missing or empty, fall back to using
the 'answer' key instead to avoid false negatives.

Comment on lines +31 to +33
async with httpx.AsyncClient(
    timeout=httpx.Timeout(timeout=request_timeout, connect=request_connect_timeout)
) as client:

⚠️ Potential issue

Incorrect httpx.Timeout keyword – will raise TypeError

httpx.Timeout expects total=…, not timeout=….
As written, the first request will blow up before any retry logic kicks in.

-        timeout=httpx.Timeout(timeout=request_timeout, connect=request_connect_timeout)
+        timeout=httpx.Timeout(total=request_timeout, connect=request_connect_timeout)
🤖 Prompt for AI Agents
In src/wandbot/evaluation/weave_model.py around lines 31 to 33, the
httpx.AsyncClient is incorrectly using the keyword argument 'timeout' instead of
'total' when creating the httpx.Timeout object. Replace
'timeout=request_timeout' with 'total=request_timeout' to correctly specify the
total timeout duration and prevent a TypeError from being raised.
