feat: Add comprehensive benchmarking system for standard-agent #107
aviralgarg05 wants to merge 11 commits into jentic:main from
Conversation
- Implement BenchmarkRunner with support for ReACT and ReWOO agents
- Add deterministic testing with mock components (DeterministicLLM, DeterministicTools, DeterministicReasoner)
- Provide CLI interface with configurable scenarios, iterations, and output formats
- Include statistical analysis (mean, median, std dev) for performance metrics
- Add comprehensive documentation with usage examples and best practices
- Integrate Codacy configuration for code quality analysis
- Support JSON output for automated CI/CD integration
- Implement robust error handling and edge case management

This benchmarking system enables consistent performance measurement and comparison across different agent configurations, supporting both development and production monitoring use cases.
Thanks so much for this PR and for taking the time to contribute to Standard Agent — welcome aboard 🙌
I will go through the review and let you know what can be done.
|
Hey @aviralgarg05! 👋 Thanks so much for putting this together and contributing to Standard Agent. I can see you've invested significant effort here - the code is well-structured, the docs are thorough, and the implementation is clean. I have left a few comments on the PR; I'd be happy to hear what you think about them when you get a chance. Welcome to the Standard Agent team! 🙌🚀
- Update benchmarking.md to clarify real mode is default, deterministic is opt-in
- Fix LiteLLM tests to explicitly pass model parameter
- Use pytest.approx for float comparisons in tests
@CodeRabbit review this PR
✅ Actions performed: Review triggered.
📝 Walkthrough

Adds a new benchmarking CLI and runner with deterministic test doubles, a benchmarking guide, test adjustments to explicit model usage and approximate assertions, and Codacy configuration plus a .gitignore update. All changes are additive.
Sequence Diagram(s)

sequenceDiagram
participant CLI as CLI Interface
participant Runner as BenchmarkRunner
participant Mem as MemoryTracker
participant Agent as StandardAgent
participant LLM as DeterministicLLM
participant Tools as DeterministicTools
CLI->>Runner: start(scenarios, options)
loop per scenario
loop per iteration
Runner->>Mem: start measurement
Runner->>Agent: create agent (deterministic? real?)
Agent->>LLM: prompt / completion
Agent->>Tools: search / load / execute
Tools-->>Agent: tool results
LLM-->>Agent: responses
Runner->>Agent: run operation (init/solve/memory)
Runner->>Mem: stop & sample peak/current
Runner->>Runner: record BenchmarkResult
end
end
Runner->>Runner: aggregate -> BenchmarkSummary
Runner-->>CLI: return summaries
CLI->>CLI: print / save output
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In `@.codacy/codacy.yaml`:
- Around line 1-15: Update the pinned tool versions in the runtimes/tools list:
bump pylint from 3.3.6 to the suggested 4.0.4, semgrep from 1.78.0 to 1.145.0,
and trivy from 0.66.0 to 0.68.2 in the .codacy/codacy.yaml file (update the
entries under "tools" for pylint@..., semgrep@..., and trivy@...); ensure syntax
remains valid YAML and run a quick CI lint to verify no formatting/compatibility
issues after the version changes.
In `@docs/benchmarking.md`:
- Around line 90-94: Docs claim deterministic mode uses "Real Reasoners" but the
codepath in scripts/benchmark.py uses DeterministicReasoner (a simplified/mock),
so update one of the two: either change the docs to state deterministic mode
uses Simplified/DeterministicReasoner mocks (mention DeterministicReasoner,
DeterministicLLM, DeterministicTools) and remove "Real Reasoners" phrasing, or
change the benchmark implementation to instantiate the real ReACT/ReWOO
reasoners with mocked LLM/tools when the --deterministic flag is set (modify
scripts/benchmark.py to construct the real reasoner classes instead of
DeterministicReasoner and wire in DeterministicLLM/DeterministicTools). Ensure
the documentation and scripts/benchmark.py behavior are consistent.
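For reference, a minimal sketch of the second option (real reasoners with the deterministic doubles injected) could look like the following. The reasoner import paths and constructor signatures are assumptions, not verified against the repository; `StandardAgent`, `DictMemory`, `DeterministicLLM`, and `DeterministicTools` are names that appear elsewhere in this PR.

```python
# Sketch only: reasoner import paths and constructor signatures are assumed.
# DeterministicLLM / DeterministicTools are the doubles defined in the benchmark script.
from agents.standard_agent import StandardAgent
from agents.memory.dict_memory import DictMemory
from agents.reasoner.react import ReACTReasoner    # hypothetical import path
from agents.reasoner.rewoo import ReWOOReasoner    # hypothetical import path


def _create_deterministic_agent(agent_type: str) -> StandardAgent:
    llm = DeterministicLLM()        # mocked LLM defined in the benchmark script
    tools = DeterministicTools()    # mocked tool registry defined in the benchmark script
    memory = DictMemory()
    reasoner_cls = ReACTReasoner if agent_type == "react" else ReWOOReasoner
    reasoner = reasoner_cls(llm=llm, tools=tools, memory=memory)   # signature assumed
    return StandardAgent(llm=llm, tools=tools, memory=memory, reasoner=reasoner)  # wiring assumed
```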
In `@scripts/benchmark.py`:
- Around line 454-455: The file write uses open(filename, 'w') without an
explicit encoding which can lead to inconsistent behavior; update the open call
in the write block that handles json.dump(output_data, ...) to include
encoding='utf-8' (i.e., with open(filename, 'w', encoding='utf-8') as f:) so the
file output is consistently UTF-8 encoded across platforms.
- Around line 69-74: DeterministicLLM omits calling BaseLLM.__init__, which can
leave attributes like self.model and self.temperature unset; update
DeterministicLLM.__init__ to call super().__init__(...) with appropriate
defaults (e.g., model and temperature) consistent with LiteLLM's signature so
inherited methods work, and then initialize self.response_time_ms and
self.call_count; ensure the super call appears before accessing any base
attributes.
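A minimal sketch of that constructor, assuming `BaseLLM.__init__` accepts `model` and `temperature` keywords mirroring `LiteLLM` (not verified against the repository); the remaining abstract methods are omitted here.

```python
# Sketch: base-class keyword names are assumed from the review comment.
from agents.llm.base_llm import BaseLLM


class DeterministicLLM(BaseLLM):
    def __init__(self, model: str = "deterministic-mock", temperature: float = 0.0,
                 response_time_ms: int = 50):
        # Initialize base attributes first so inherited methods can rely on self.model etc.
        super().__init__(model=model, temperature=temperature)
        self.response_time_ms = response_time_ms
        self.call_count = 0
```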
🧹 Nitpick comments (4)
.gitignore (1)

25-26: Minor: Add space after `#` for comment consistency. Other comments in this file use `# ` (with a space). Consider aligning this comment with the same style:

    -#Ignore vscode AI rules
    +# Ignore vscode AI rules

scripts/benchmark.py (2)

23-29: Consider removing unused `noqa` directives. Static analysis indicates E402 is not enabled, making these `# noqa: E402` comments unnecessary. If this is intentional for compatibility with other linting setups, you can ignore this suggestion.

♻️ Suggested cleanup

    -from agents.standard_agent import StandardAgent  # noqa: E402
    -from agents.prebuilt import ReACTAgent, ReWOOAgent  # noqa: E402
    -from agents.reasoner.base import BaseReasoner, ReasoningResult  # noqa: E402
    -from agents.llm.base_llm import BaseLLM  # noqa: E402
    -from agents.tools.base import JustInTimeToolingBase, ToolBase  # noqa: E402
    -from agents.memory.dict_memory import DictMemory  # noqa: E402
    -from utils.logger import get_logger  # noqa: E402
    +from agents.standard_agent import StandardAgent
    +from agents.prebuilt import ReACTAgent, ReWOOAgent
    +from agents.reasoner.base import BaseReasoner, ReasoningResult
    +from agents.llm.base_llm import BaseLLM
    +from agents.tools.base import JustInTimeToolingBase, ToolBase
    +from agents.memory.dict_memory import DictMemory
    +from utils.logger import get_logger

246-246: Consider using structured logging parameters. For better observability and log parsing, consider using structured logging parameters instead of f-string interpolation:

    -self.logger.info(f"Running benchmark: {scenario_name}/{operation} ({iterations} iterations)")
    +self.logger.info("Running benchmark", scenario=scenario_name, operation=operation, iterations=iterations)

docs/benchmarking.md (1)

131-131: Add language specifier to fenced code block. Static analysis suggests adding a language specifier. For plain text output, use `text`:

    -```
    +```text
    BENCHMARK RESULTS
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- .codacy/codacy.yaml
- .gitignore
- docs/benchmarking.md
- scripts/benchmark.py
- tests/llm/test_litellm.py
🧰 Additional context used
🧬 Code graph analysis (1)
tests/llm/test_litellm.py (1)
agents/llm/litellm.py (1)
LiteLLM (9-101)
🪛 markdownlint-cli2 (0.18.1)
docs/benchmarking.md
131-131: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🪛 Ruff (0.14.11)
scripts/benchmark.py
23-23: Unused noqa directive (non-enabled: E402)
Remove unused noqa directive
(RUF100)
24-24: Unused noqa directive (non-enabled: E402)
Remove unused noqa directive
(RUF100)
25-25: Unused noqa directive (non-enabled: E402)
Remove unused noqa directive
(RUF100)
26-26: Unused noqa directive (non-enabled: E402)
Remove unused noqa directive
(RUF100)
27-27: Unused noqa directive (non-enabled: E402)
Remove unused noqa directive
(RUF100)
28-28: Unused noqa directive (non-enabled: E402)
Remove unused noqa directive
(RUF100)
29-29: Unused noqa directive (non-enabled: E402)
Remove unused noqa directive
(RUF100)
76-76: Unused method argument: messages
(ARG002)
76-76: Unused method argument: kwargs
(ARG002)
86-86: Unused method argument: text
(ARG002)
86-86: Unused method argument: max_retries
(ARG002)
131-131: Unused method argument: query
(ARG002)
268-268: Do not catch blind exception: Exception
(BLE001)
🔇 Additional comments (4)
tests/llm/test_litellm.py (2)
22-22: Good practice: Using `pytest.approx` for floating-point comparison. This properly handles floating-point precision issues in temperature comparisons.

151-152: Good improvement: Explicit model parameter for test isolation. Providing `model="test-model"` explicitly removes the dependency on environment variables, making tests more deterministic and self-contained. This pattern is correctly applied across all affected test methods.

scripts/benchmark.py (1)

35-67: Well-structured dataclasses for benchmark results. The `BenchmarkResult` and `BenchmarkSummary` dataclasses provide a clean, typed representation of benchmark data with appropriate fields for duration, memory, and metadata.

docs/benchmarking.md (1)

1-7: Comprehensive and well-organized documentation. The documentation provides thorough coverage of the benchmarking system, including quick start guides, CLI options, scenario explanations, result interpretation, and troubleshooting. This will be valuable for users adopting the benchmarking workflow.
✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.
@aviralgarg05 Apologies for the delay in PR review. I have left a few comments and unresolved a few issues that I have concerns about. Also, according to the design principles of Standard Agent, it would be best to keep the root directory as clean as possible. Adding the benchmark to the examples folder seems more apt, and I feel the benchmark doc should co-exist with the script in the same folder.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@examples/benchmarking.md`:
- Around line 269-277: In the troubleshooting examples at the end of the
benchmarking markdown, update the incorrect references to "scripts/benchmark.py"
so they point to the actual example script "examples/benchmark.py"; specifically
change any occurrences in the Permission Denied and Memory Errors snippets to
use examples/benchmark.py and ensure the chmod and python commands reflect that
corrected path.
- Around line 230-233: The docs incorrectly reference a non-existent file name;
update the documentation to point to the actual module and class by replacing
references to benchmark_runner.py with examples/benchmark.py and the
BenchmarkRunner class (e.g., use BenchmarkRunner in examples/benchmark.py) and
ensure any steps that say "Add scenario to `benchmark_runner.py`" or "Implement
agent creation logic" refer to the correct module and symbols in
examples/benchmark.py (e.g., the BenchmarkRunner class and its methods) and
adjust CLI argument docs to match the real implementation.
- Around line 202-222: The documentation uses the wrong script paths: update all
occurrences of "scripts/benchmark.py" to "examples/benchmark.py" and likewise
change "scripts/compare_benchmarks.py" to "examples/compare_benchmarks.py" in
the benchmarking docs so commands in the examples and CI section point to the
actual scripts; search for these string literals in examples/benchmarking.md and
replace them accordingly to ensure the CLI examples (including flags like
--output, --deterministic, --iterations) reference the correct example scripts.
♻️ Duplicate comments (1)
examples/benchmark.py (1)
151-193: Deterministic mode still bypasses real reasoning logic. The `DeterministicReasoner` implementation simulates a Think→Act→Observe loop but doesn't exercise the actual `ReACTAgent` or `ReWOOAgent` reasoning paths. As discussed in prior review comments, this means deterministic benchmarks measure mock overhead rather than actual agent framework performance.

The author previously agreed to refactor `_create_deterministic_agent` to use real reasoners with `DeterministicLLM` and `DeterministicTools` injected as dependencies.
🧹 Nitpick comments (2)
examples/benchmark.py (2)
346-347: Minor: Inconsistent indentation. Line 347 has an extra space before `history.append`.

📝 Suggested fix

     for i in range(5):
    -     history.append({"goal": f"test_goal_{i}", "result": f"test_result_{i}" * 50})
    +    history.append({"goal": f"test_goal_{i}", "result": f"test_result_{i}" * 50})

449-449: Minor: Missing space in output formatting. The "Avg Memory Current" line is missing a space before the value, unlike the other lines.

📝 Suggested fix

    -print(f"Avg Memory Current:{summary.avg_memory_current_mb:.2f}MB")
    +print(f"Avg Memory Current: {summary.avg_memory_current_mb:.2f}MB")
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- .codacy/codacy.yaml
- examples/benchmark.py
- examples/benchmarking.md
🚧 Files skipped from review as they are similar to previous changes (1)
- .codacy/codacy.yaml
🧰 Additional context used
🪛 Ruff (0.14.11)
examples/benchmark.py
77-77: Unused method argument: messages
(ARG002)
77-77: Unused method argument: kwargs
(ARG002)
87-87: Unused method argument: text
(ARG002)
87-87: Unused method argument: max_retries
(ARG002)
132-132: Unused method argument: query
(ARG002)
269-269: Do not catch blind exception: Exception
(BLE001)
🔇 Additional comments (6)
examples/benchmark.py (6)
35-67: LGTM! The `BenchmarkResult` and `BenchmarkSummary` dataclasses are well-structured with appropriate fields for capturing benchmark metrics including duration, memory usage, success/failure tracking, and metadata.

69-91: LGTM! The `DeterministicLLM` implementation correctly calls `super().__init__()` and provides consistent mock responses with simulated latency. The unused arguments (flagged by static analysis) are intentional since this is a mock implementation conforming to the `BaseLLM` interface.

93-148: LGTM! The `DeterministicTool` and `DeterministicTools` classes provide a well-structured mock tooling interface with varied simulated execution times (25ms, 75ms, 50ms) for realistic benchmarking.
242-301: LGTM! The `run_benchmark` method has solid structure:

- Proper memory tracing with `tracemalloc`
- Accurate timing with `time.perf_counter()`
- Comprehensive result capture including metadata extraction from `ReasoningResult`
- Appropriate error handling for benchmark robustness (the broad exception catch is acceptable here since benchmarks should be resilient to various failure modes)
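As an illustration of that structure, a generic per-iteration measurement helper might look like the following; names are illustrative rather than the PR's exact code.

```python
# Generic sketch of the pattern: tracemalloc for memory, perf_counter for wall
# time, and a broad except so one bad run does not abort the benchmark.
import time
import tracemalloc
from typing import Any, Callable, Dict


def measure_once(benchmark_fn: Callable[[], Any]) -> Dict[str, Any]:
    tracemalloc.start()
    start = time.perf_counter()
    success, error = True, None
    try:
        benchmark_fn()
    except Exception as exc:  # benchmarks should survive any failure mode
        success, error = False, str(exc)
    duration_ms = (time.perf_counter() - start) * 1000
    current, peak = tracemalloc.get_traced_memory()  # bytes
    tracemalloc.stop()
    return {
        "duration_ms": duration_ms,
        "memory_current_mb": current / 1_048_576,  # bytes -> MiB
        "memory_peak_mb": peak / 1_048_576,
        "success": success,
        "error": error,
    }
```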
362-431: LGTM! The summary generation logic is well-factored with good edge-case handling:

- `stdev` is correctly guarded for single-element lists (line 377)
- `_create_summary_from_results` returns `None` for all-failed scenarios (line 394)
- Clean separation of concerns between grouping, stats calculation, and summary creation
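For context, the single-sample guard matters because `statistics.stdev` raises when given fewer than two values; a self-contained sketch of guarded stats:

```python
# Illustrative guard for the single-sample case mentioned above.
from statistics import mean, median, stdev
from typing import Dict, List


def duration_stats(durations_ms: List[float]) -> Dict[str, float]:
    return {
        "avg": mean(durations_ms),
        "median": median(durations_ms),
        "min": min(durations_ms),
        "max": max(durations_ms),
        "stddev": stdev(durations_ms) if len(durations_ms) > 1 else 0.0,
    }
```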
470-540: LGTM! The CLI implementation is well-designed with:
- Sensible defaults for scenarios, iterations, and goals
- Clear help text for each argument
- Smart runtime management (limiting goals to 2 and halving iterations for goal-solving)
- Proper orchestration of benchmarks across scenarios
✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.
- Removed DeterministicReasoner class that bypassed actual agent logic
- Updated _create_deterministic_agent to instantiate real ReACTReasoner/ReWOOReasoner
- Inject DeterministicLLM and DeterministicTools as dependencies
- Now exercises complete agent code paths (prompt construction, state management, reasoning loops)
- Maintains deterministic behavior for consistent benchmarking
- Addresses PR review feedback to ensure benchmarks measure real agent performance
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@examples/benchmark.py`:
- Around line 309-310: The history.append line inside the for loop has an extra
leading space causing inconsistent indentation; edit the loop in
examples/benchmark.py so the statement that references history
(history.append({"goal": f"test_goal_{i}", "result": f"test_result_{i}" * 50}))
is indented to match the for loop body (same indentation as the for's first
child) to fix the mismatch.
- Around line 79-93: The mock prompt_to_json() must return JSON shapes that
match what ReWOOReasoner.reflect() and ReACTReasoner expect (not just {"mock":
True, "response_id": N}); update the prompt_to_json implementation to include an
"action" key and appropriate additional fields depending on intent (e.g.,
{"action":"give_up"} for immediate give-up decisions,
{"action":"rephrase_step","rephrase":"..."} for rephrase requests,
{"action":"change_tool","tool":"tool_name","params":{...}} for tool-switches,
and {"action":"retry_params","retry_params":{...}} for parameter retries), and
ensure completion()/prompt() still increment call_count and simulate
latency—this keeps the benchmark mocks compatible with ReWOOReasoner and
ReACTReasoner reflection and parameter-generation checks by providing the
expected keys and structures.
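A hedged sketch of a `prompt_to_json` mock shaped the way this comment asks for; the keyword-based routing and exact keys are assumptions drawn from the review text, not the reasoners' real schemas.

```python
# Sketch: routes on keywords in the prompt and returns action/step_type-shaped dicts.
import time
from typing import Any, Dict


class DeterministicLLMJson:
    def __init__(self, response_time_ms: int = 10):
        self.response_time_ms = response_time_ms
        self.call_count = 0

    def prompt_to_json(self, text: str, max_retries: int = 0) -> Dict[str, Any]:
        _ = max_retries
        time.sleep(self.response_time_ms / 1000)  # keep simulating latency
        self.call_count += 1
        lowered = text.lower()
        if "reflect" in lowered:
            # ReWOO-style reflection decision
            return {"action": "give_up"}
        if "think" in lowered:
            # ReACT-style step: stop immediately to keep the run bounded
            return {"step_type": "STOP", "text": "Deterministic final answer"}
        # Fallback: mock parameters for tool parameter generation
        return {"params": {"query": "mock"}}
```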
- Fixed indentation issue on line 310 - Updated prompt_to_json to return proper JSON structures with: - step_type=STOP and text field for ReACT reasoner - action field for ReWOO reflection - Mock parameters for parameter generation - Now completes without errors when exercising real ReACT/ReWOO logic - Addresses CodeRabbit review feedback
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@examples/benchmark.py`:
- Around line 368-397: The function _create_summary_from_results currently
returns None when all runs failed, dropping those scenarios; change it to always
return a BenchmarkSummary: detect if successful is empty and in that case set
duration_stats to sensible defaults (e.g., avg/median/min/max/stddev = 0 or
None), set avg_memory_peak_mb and avg_memory_current_mb to 0.0 (or None), set
success_rate = 0.0, and build metadata from either successful or group_results
as appropriate; ensure calls to _calculate_duration_stats and
_calculate_metadata handle empty input (call _calculate_duration_stats only when
durations non-empty or handle empty inside that function) and return values
match the keys used when constructing BenchmarkSummary so no KeyError occurs.
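One possible shape for that all-failed branch, sketched with plain dicts since the real `BenchmarkSummary` fields beyond those named above are not shown here.

```python
# Sketch of the "always return a summary" behavior; field names beyond those
# mentioned in the review are assumptions.
from statistics import mean, median, stdev
from typing import Any, Dict, List


def create_summary(results: List[Dict[str, Any]]) -> Dict[str, Any]:
    successful = [r for r in results if r.get("success")]
    if not successful:
        # Keep the scenario visible instead of dropping it: zeroed stats, 0% success.
        zero = {"avg": 0.0, "median": 0.0, "min": 0.0, "max": 0.0, "stddev": 0.0}
        return {
            "duration_stats": zero,
            "avg_memory_peak_mb": 0.0,
            "avg_memory_current_mb": 0.0,
            "success_rate": 0.0,
            "metadata": {"errors": [r.get("error") for r in results]},
        }
    durations = [r["duration_ms"] for r in successful]
    return {
        "duration_stats": {
            "avg": mean(durations),
            "median": median(durations),
            "min": min(durations),
            "max": max(durations),
            "stddev": stdev(durations) if len(durations) > 1 else 0.0,
        },
        "avg_memory_peak_mb": mean(r["memory_peak_mb"] for r in successful),
        "avg_memory_current_mb": mean(r["memory_current_mb"] for r in successful),
        "success_rate": len(successful) / len(results),
        "metadata": {},
    }
```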
🧹 Nitpick comments (2)
examples/benchmark.py (2)
79-90: Silence Ruff ARG002 unused-argument warnings for overrides. `completion`, `prompt_to_json`, and `search` intentionally ignore some parameters; consider explicitly marking them as used to keep lint clean.

♻️ Suggested tweak

     def completion(self, messages: List[Dict[str, str]], **kwargs) -> str:
    +    _ = messages, kwargs
         time.sleep(self.response_time_ms / 1000)  # Simulate network latency
         self.call_count += 1
         return f"Mock response #{self.call_count}"

     def prompt_to_json(self, text: str, max_retries: int = 0) -> Dict[str, Any]:
    +    _ = max_retries
         time.sleep(self.response_time_ms / 1000)
         self.call_count += 1
    @@
     def search(self, query: str, *, top_k: int = 10) -> List[ToolBase]:
         """Search for tools matching a query - returns all tools for testing."""
    +    _ = query
         return self.tools[:min(top_k, len(self.tools))]

Also applies to: 149-151
433-444: Create the output directory before writing results. If `--output` points to a non-existent folder, `open()` will fail.

♻️ Suggested tweak

    -with open(filename, 'w', encoding='utf-8') as f:
    +output_path = Path(filename)
    +output_path.parent.mkdir(parents=True, exist_ok=True)
    +with output_path.open('w', encoding='utf-8') as f:
         json.dump(output_data, f, indent=2)
- Silence Ruff ARG002 warnings by marking intentionally unused parameters
- Handle all-failed benchmark scenarios with sensible defaults
- Ensure output directory is created before writing results
- Addresses final CodeRabbit review feedback
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@examples/benchmark.py`:
- Around line 326-332: The trimming simulation is unreachable because the loop
"for i in range(5)" appends exactly 5 items so the condition "if len(history) >
5" never holds; fix by making the loop append more than 5 entries (e.g., use
range(6) or range(N>5)) or adjust the condition to trigger at 5 (e.g., "if
len(history) > 4" or "if len(history) >= 5") so that the simulated window
trimming of agent.memory["conversation_history"] runs; update whichever of the
two (the loop or the condition) in the examples/benchmark.py snippet to ensure
trimming behavior is exercised.
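A tiny self-contained illustration of the off-by-one (the window size and trimming slice are assumptions based on this comment):

```python
# With range(5) the history never exceeds the window of 5, so the trimming
# branch is dead code; one extra entry makes it reachable.
history = []
for i in range(6):  # was range(5)
    history.append({"goal": f"test_goal_{i}", "result": f"test_result_{i}" * 50})
if len(history) > 5:
    history = history[-5:]  # simulated conversation-window trimming
assert len(history) == 5
```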
🧹 Nitpick comments (2)
examples/benchmark.py (2)
90-109: Pattern-based prompt detection is functional but fragile. The keyword matching (`"think" in text.lower()`, `"reflect" in text.lower()`) works for current reasoner prompts but could break if prompt templates change. This is acceptable for benchmark mocking, but consider documenting these assumptions or using more robust detection (e.g., checking for specific JSON schema requests in the prompt).

250-255: Bare exception catch is acceptable for benchmarking, but consider logging the traceback. The static analysis flags `except Exception`, but for benchmarking this is intentional—we want to capture all failure types. However, the traceback is lost, making debugging difficult if benchmarks fail unexpectedly.

💡 Optional: Preserve traceback for debugging

     except Exception as e:
         success = False
         error_message = str(e)
    -    metadata = {"error_type": type(e).__name__}
    +    metadata = {
    +        "error_type": type(e).__name__,
    +        "traceback": traceback.format_exc()
    +    }
         iterations_count = 0
         tool_calls_count = 0

This would require adding `import traceback` at the top of the file.
…nd fix loop range
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@examples/benchmark.py`:
- Around line 292-318: The except block that catches all exceptions for each
iteration (surrounding benchmark_fn) should explicitly log the exception
traceback so the linter knows the swallowing is intentional; in the except for
benchmark_fn, call the module logger (or use traceback.format_exc()) to record
the full exception and traceback along with context (e.g., iteration index,
benchmark_fn.__name__), then continue to set success=False,
error_message=str(e), metadata={"error_type": type(e).__name__} and reset
iterations_count and tool_calls_count as currently done; update the except block
that handles the ReasoningResult extraction to include this logging so BLE001 is
satisfied.
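A sketch of what that logged except block could look like, using `logger.exception` to capture the full traceback; the surrounding variable names mirror the review text and are otherwise assumptions.

```python
# Sketch: log the traceback so the broad catch is clearly intentional,
# then record the failure as before.
import logging

logger = logging.getLogger(__name__)


def run_iteration(benchmark_fn, iteration: int):
    try:
        benchmark_fn()
        return True, None, {}
    except Exception as e:
        logger.exception(
            "Benchmark iteration %d failed in %s", iteration, benchmark_fn.__name__
        )
        success = False
        error_message = str(e)
        metadata = {"error_type": type(e).__name__}
        return success, error_message, metadata
```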
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@examples/benchmark.py`:
- Around line 140-143: The RETURNS in the ACT/STOP branch are over-indented
causing an IndentationError; within the method that checks self.call_count (the
if self.call_count % 4 != 0 / else block), align the two return statements to
the same indentation level as the if/else so they are direct children of the
blocks and return the dictionaries ({"step_type": "ACT", "text": ...} and
{"step_type": "STOP", "text": ...}) without the extra leading whitespace.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@examples/benchmark.py`:
- Around line 33-35: DEFAULT_MODEL is hard-coded to "gpt-3.5-turbo"; make the
model configurable via an environment variable and CLI flag and thread it into
the real agent creation so benchmarks don't break if a model is deprecated.
Update CLI arg parsing (or use os.environ) to accept a --model / MODEL env var,
pass that value into _create_real_agent (and any callers of _create_real_agent),
and default to DEFAULT_MODEL only when no override is provided; also replace
other hard-coded uses of "gpt-3.5-turbo" in the file (references around
_create_real_agent and lines noted) to use the configured value.
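A minimal sketch of the override chain (CLI flag, then `MODEL` environment variable, then `DEFAULT_MODEL`); only the names mentioned in this comment are taken from the PR, the rest is illustrative.

```python
# Sketch of making the benchmark model configurable.
import argparse
import os
from typing import Optional

DEFAULT_MODEL = "gpt-3.5-turbo"


def resolve_model(cli_value: Optional[str] = None) -> str:
    # Precedence: --model flag, then MODEL environment variable, then DEFAULT_MODEL.
    return cli_value or os.environ.get("MODEL") or DEFAULT_MODEL


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Model selection sketch")
    parser.add_argument("--model", default=None, help="LLM model to benchmark against")
    args = parser.parse_args()
    print(resolve_model(args.model))
```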
🧹 Nitpick comments (1)
examples/benchmark.py (1)
17-20: Avoid mutating `sys.path` on import. This executes on import and can leak into tests/tools that import this module. Consider moving the path bootstrap + project imports into `main()` or guarding under `if __name__ == "__main__":`.
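One way to keep the bootstrap out of import time, sketched under the assumption that the project root is two levels above the script; the project import is illustrative.

```python
# Sketch: mutate sys.path only when the file runs as a script, never on import.
import sys
from pathlib import Path


def main() -> None:
    # Project imports happen here, after the bootstrap, so importing this module
    # from tests never mutates sys.path.
    from agents.standard_agent import StandardAgent  # noqa: F401  (illustrative)
    ...


if __name__ == "__main__":
    sys.path.insert(0, str(Path(__file__).resolve().parent.parent))  # assumed repo root
    main()
```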
This benchmarking system enables consistent performance measurement and comparison across different agent configurations, supporting both development and production monitoring use cases.
This is for Hacktoberfest, solving issue #90
Summary by CodeRabbit
New Features
Documentation
Tests
Chores
✏️ Tip: You can customize this high-level summary in your review settings.