feat: Add comprehensive benchmarking system for standard-agent #107
aviralgarg05 wants to merge 11 commits into jentic:main from
Conversation
- Implement BenchmarkRunner with support for ReACT and ReWOO agents
- Add deterministic testing with mock components (DeterministicLLM, DeterministicTools, DeterministicReasoner)
- Provide CLI interface with configurable scenarios, iterations, and output formats
- Include statistical analysis (mean, median, std dev) for performance metrics
- Add comprehensive documentation with usage examples and best practices
- Integrate Codacy configuration for code quality analysis
- Support JSON output for automated CI/CD integration
- Implement robust error handling and edge case management

This benchmarking system enables consistent performance measurement and comparison across different agent configurations, supporting both development and production monitoring use cases.
Thanks so much for this PR and for taking the time to contribute to Standard Agent — welcome aboard 🙌
I will go through the review and let you know what can be done.
|
Hey @aviralgarg05! 👋 Thanks so much for putting this together and contributing to Standard Agent. I can see you've invested significant effort here - the code is well-structured, the docs are thorough, and the implementation is clean. I have left a few comments on the PR; I'd be happy to hear what you think about them when you get a chance. Welcome to the Standard Agent team! 🙌🚀
- Update benchmarking.md to clarify real mode is default, deterministic is opt-in
- Fix LiteLLM tests to explicitly pass model parameter
- Use pytest.approx for float comparisons in tests
@CodeRabbit review this PR
✅ Actions performed: Review triggered.
📝 Walkthrough

Adds a new benchmarking CLI and runner with deterministic test doubles, a benchmarking guide, test adjustments to explicit model usage and approximate assertions, and Codacy configuration plus a .gitignore update. All changes are additive.
Sequence Diagram(s)

sequenceDiagram
participant CLI as CLI Interface
participant Runner as BenchmarkRunner
participant Mem as MemoryTracker
participant Agent as StandardAgent
participant LLM as DeterministicLLM
participant Tools as DeterministicTools
CLI->>Runner: start(scenarios, options)
loop per scenario
loop per iteration
Runner->>Mem: start measurement
Runner->>Agent: create agent (deterministic? real?)
Agent->>LLM: prompt / completion
Agent->>Tools: search / load / execute
Tools-->>Agent: tool results
LLM-->>Agent: responses
Runner->>Agent: run operation (init/solve/memory)
Runner->>Mem: stop & sample peak/current
Runner->>Runner: record BenchmarkResult
end
end
Runner->>Runner: aggregate -> BenchmarkSummary
Runner-->>CLI: return summaries
CLI->>CLI: print / save output
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In `@.codacy/codacy.yaml`:
- Around line 1-15: Update the pinned tool versions in the runtimes/tools list:
bump pylint from 3.3.6 to the suggested 4.0.4, semgrep from 1.78.0 to 1.145.0,
and trivy from 0.66.0 to 0.68.2 in the .codacy/codacy.yaml file (update the
entries under "tools" for pylint@..., semgrep@..., and trivy@...); ensure syntax
remains valid YAML and run a quick CI lint to verify no formatting/compatibility
issues after the version changes.
In `@docs/benchmarking.md`:
- Around line 90-94: Docs claim deterministic mode uses "Real Reasoners" but the
codepath in scripts/benchmark.py uses DeterministicReasoner (a simplified/mock),
so update one of the two: either change the docs to state deterministic mode
uses Simplified/DeterministicReasoner mocks (mention DeterministicReasoner,
DeterministicLLM, DeterministicTools) and remove "Real Reasoners" phrasing, or
change the benchmark implementation to instantiate the real ReACT/ReWOO
reasoners with mocked LLM/tools when the --deterministic flag is set (modify
scripts/benchmark.py to construct the real reasoner classes instead of
DeterministicReasoner and wire in DeterministicLLM/DeterministicTools). Ensure
the documentation and scripts/benchmark.py behavior are consistent.
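For reference, a minimal sketch of the second option (real reasoners with the deterministic doubles injected) could look like the following. The reasoner import paths and constructor signatures are assumptions, not verified against the repository; `StandardAgent`, `DictMemory`, `DeterministicLLM`, and `DeterministicTools` are names that appear elsewhere in this PR.

```python
# Sketch only: reasoner import paths and constructor signatures are assumed.
# DeterministicLLM / DeterministicTools are the doubles defined in the benchmark script.
from agents.standard_agent import StandardAgent
from agents.memory.dict_memory import DictMemory
from agents.reasoner.react import ReACTReasoner    # hypothetical import path
from agents.reasoner.rewoo import ReWOOReasoner    # hypothetical import path


def _create_deterministic_agent(agent_type: str) -> StandardAgent:
    llm = DeterministicLLM()        # mocked LLM defined in the benchmark script
    tools = DeterministicTools()    # mocked tool registry defined in the benchmark script
    memory = DictMemory()
    reasoner_cls = ReACTReasoner if agent_type == "react" else ReWOOReasoner
    reasoner = reasoner_cls(llm=llm, tools=tools, memory=memory)   # signature assumed
    return StandardAgent(llm=llm, tools=tools, memory=memory, reasoner=reasoner)  # wiring assumed
```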
In `@scripts/benchmark.py`:
- Around line 454-455: The file write uses open(filename, 'w') without an
explicit encoding which can lead to inconsistent behavior; update the open call
in the write block that handles json.dump(output_data, ...) to include
encoding='utf-8' (i.e., with open(filename, 'w', encoding='utf-8') as f:) so the
file output is consistently UTF-8 encoded across platforms.
- Around line 69-74: DeterministicLLM omits calling BaseLLM.__init__, which can
leave attributes like self.model and self.temperature unset; update
DeterministicLLM.__init__ to call super().__init__(...) with appropriate
defaults (e.g., model and temperature) consistent with LiteLLM's signature so
inherited methods work, and then initialize self.response_time_ms and
self.call_count; ensure the super call appears before accessing any base
attributes.
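A minimal sketch of that constructor, assuming `BaseLLM.__init__` accepts `model` and `temperature` keywords mirroring `LiteLLM` (not verified against the repository); the remaining abstract methods are omitted here.

```python
# Sketch: base-class keyword names are assumed from the review comment.
from agents.llm.base_llm import BaseLLM


class DeterministicLLM(BaseLLM):
    def __init__(self, model: str = "deterministic-mock", temperature: float = 0.0,
                 response_time_ms: int = 50):
        # Initialize base attributes first so inherited methods can rely on self.model etc.
        super().__init__(model=model, temperature=temperature)
        self.response_time_ms = response_time_ms
        self.call_count = 0
```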
🧹 Nitpick comments (4)
.gitignore (1)

25-26: Minor: Add space after `#` for comment consistency. Other comments in this file use `# ` (with a space). Consider aligning this comment with the same style:

    -#Ignore vscode AI rules
    +# Ignore vscode AI rules

scripts/benchmark.py (2)

23-29: Consider removing unused `noqa` directives. Static analysis indicates E402 is not enabled, making these `# noqa: E402` comments unnecessary. If this is intentional for compatibility with other linting setups, you can ignore this suggestion.

♻️ Suggested cleanup

    -from agents.standard_agent import StandardAgent  # noqa: E402
    -from agents.prebuilt import ReACTAgent, ReWOOAgent  # noqa: E402
    -from agents.reasoner.base import BaseReasoner, ReasoningResult  # noqa: E402
    -from agents.llm.base_llm import BaseLLM  # noqa: E402
    -from agents.tools.base import JustInTimeToolingBase, ToolBase  # noqa: E402
    -from agents.memory.dict_memory import DictMemory  # noqa: E402
    -from utils.logger import get_logger  # noqa: E402
    +from agents.standard_agent import StandardAgent
    +from agents.prebuilt import ReACTAgent, ReWOOAgent
    +from agents.reasoner.base import BaseReasoner, ReasoningResult
    +from agents.llm.base_llm import BaseLLM
    +from agents.tools.base import JustInTimeToolingBase, ToolBase
    +from agents.memory.dict_memory import DictMemory
    +from utils.logger import get_logger

246-246: Consider using structured logging parameters. For better observability and log parsing, consider using structured logging parameters instead of f-string interpolation:

    -self.logger.info(f"Running benchmark: {scenario_name}/{operation} ({iterations} iterations)")
    +self.logger.info("Running benchmark", scenario=scenario_name, operation=operation, iterations=iterations)

docs/benchmarking.md (1)

131-131: Add language specifier to fenced code block. Static analysis suggests adding a language specifier. For plain text output, use `text`:

    -```
    +```text
    BENCHMARK RESULTS
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- .codacy/codacy.yaml
- .gitignore
- docs/benchmarking.md
- scripts/benchmark.py
- tests/llm/test_litellm.py
🧰 Additional context used
🧬 Code graph analysis (1)
tests/llm/test_litellm.py (1)
agents/llm/litellm.py (1)
LiteLLM (9-101)
🪛 markdownlint-cli2 (0.18.1)
docs/benchmarking.md
131-131: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🪛 Ruff (0.14.11)
scripts/benchmark.py
23-23: Unused noqa directive (non-enabled: E402)
Remove unused noqa directive
(RUF100)
24-24: Unused noqa directive (non-enabled: E402)
Remove unused noqa directive
(RUF100)
25-25: Unused noqa directive (non-enabled: E402)
Remove unused noqa directive
(RUF100)
26-26: Unused noqa directive (non-enabled: E402)
Remove unused noqa directive
(RUF100)
27-27: Unused noqa directive (non-enabled: E402)
Remove unused noqa directive
(RUF100)
28-28: Unused noqa directive (non-enabled: E402)
Remove unused noqa directive
(RUF100)
29-29: Unused noqa directive (non-enabled: E402)
Remove unused noqa directive
(RUF100)
76-76: Unused method argument: messages
(ARG002)
76-76: Unused method argument: kwargs
(ARG002)
86-86: Unused method argument: text
(ARG002)
86-86: Unused method argument: max_retries
(ARG002)
131-131: Unused method argument: query
(ARG002)
268-268: Do not catch blind exception: Exception
(BLE001)
🔇 Additional comments (4)
tests/llm/test_litellm.py (2)
22-22: Good practice: Using `pytest.approx` for floating-point comparison. This properly handles floating-point precision issues in temperature comparisons.

151-152: Good improvement: Explicit model parameter for test isolation. Providing `model="test-model"` explicitly removes the dependency on environment variables, making tests more deterministic and self-contained. This pattern is correctly applied across all affected test methods.

scripts/benchmark.py (1)

35-67: Well-structured dataclasses for benchmark results. The `BenchmarkResult` and `BenchmarkSummary` dataclasses provide a clean, typed representation of benchmark data with appropriate fields for duration, memory, and metadata.

docs/benchmarking.md (1)

1-7: Comprehensive and well-organized documentation. The documentation provides thorough coverage of the benchmarking system, including quick start guides, CLI options, scenario explanations, result interpretation, and troubleshooting. This will be valuable for users adopting the benchmarking workflow.
✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.
@aviralgarg05 Apologies for the delay in PR review. I have left a few comments and unresolved a few issues that I have concerns about. Also, according to the design principles of Standard Agent, it would be best to keep the root directory as clean as possible. Adding the benchmark to the examples folder seems more apt, and I feel the benchmark doc should co-exist with the script in the same folder.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@examples/benchmarking.md`:
- Around line 269-277: In the troubleshooting examples at the end of the
benchmarking markdown, update the incorrect references to "scripts/benchmark.py"
so they point to the actual example script "examples/benchmark.py"; specifically
change any occurrences in the Permission Denied and Memory Errors snippets to
use examples/benchmark.py and ensure the chmod and python commands reflect that
corrected path.
- Around line 230-233: The docs incorrectly reference a non-existent file name;
update the documentation to point to the actual module and class by replacing
references to benchmark_runner.py with examples/benchmark.py and the
BenchmarkRunner class (e.g., use BenchmarkRunner in examples/benchmark.py) and
ensure any steps that say "Add scenario to `benchmark_runner.py`" or "Implement
agent creation logic" refer to the correct module and symbols in
examples/benchmark.py (e.g., the BenchmarkRunner class and its methods) and
adjust CLI argument docs to match the real implementation.
- Around line 202-222: The documentation uses the wrong script paths: update all
occurrences of "scripts/benchmark.py" to "examples/benchmark.py" and likewise
change "scripts/compare_benchmarks.py" to "examples/compare_benchmarks.py" in
the benchmarking docs so commands in the examples and CI section point to the
actual scripts; search for these string literals in examples/benchmarking.md and
replace them accordingly to ensure the CLI examples (including flags like
--output, --deterministic, --iterations) reference the correct example scripts.
♻️ Duplicate comments (1)
examples/benchmark.py (1)
151-193: Deterministic mode still bypasses real reasoning logic. The `DeterministicReasoner` implementation simulates a Think→Act→Observe loop but doesn't exercise the actual `ReACTAgent` or `ReWOOAgent` reasoning paths. As discussed in prior review comments, this means deterministic benchmarks measure mock overhead rather than actual agent framework performance.

The author previously agreed to refactor `_create_deterministic_agent` to use real reasoners with `DeterministicLLM` and `DeterministicTools` injected as dependencies.
🧹 Nitpick comments (2)
examples/benchmark.py (2)
346-347: Minor: Inconsistent indentation. Line 347 has an extra space before `history.append`.

📝 Suggested fix

     for i in range(5):
    -     history.append({"goal": f"test_goal_{i}", "result": f"test_result_{i}" * 50})
    +    history.append({"goal": f"test_goal_{i}", "result": f"test_result_{i}" * 50})

449-449: Minor: Missing space in output formatting. The "Avg Memory Current" line is missing a space before the value, unlike the other lines.

📝 Suggested fix

    -print(f"Avg Memory Current:{summary.avg_memory_current_mb:.2f}MB")
    +print(f"Avg Memory Current: {summary.avg_memory_current_mb:.2f}MB")
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- .codacy/codacy.yaml
- examples/benchmark.py
- examples/benchmarking.md
🚧 Files skipped from review as they are similar to previous changes (1)
- .codacy/codacy.yaml
🧰 Additional context used
🪛 Ruff (0.14.11)
examples/benchmark.py
77-77: Unused method argument: messages
(ARG002)
77-77: Unused method argument: kwargs
(ARG002)
87-87: Unused method argument: text
(ARG002)
87-87: Unused method argument: max_retries
(ARG002)
132-132: Unused method argument: query
(ARG002)
269-269: Do not catch blind exception: Exception
(BLE001)
🔇 Additional comments (6)
examples/benchmark.py (6)
35-67: LGTM! The `BenchmarkResult` and `BenchmarkSummary` dataclasses are well-structured with appropriate fields for capturing benchmark metrics including duration, memory usage, success/failure tracking, and metadata.

69-91: LGTM! The `DeterministicLLM` implementation correctly calls `super().__init__()` and provides consistent mock responses with simulated latency. The unused arguments (flagged by static analysis) are intentional since this is a mock implementation conforming to the `BaseLLM` interface.

93-148: LGTM! The `DeterministicTool` and `DeterministicTools` classes provide a well-structured mock tooling interface with varied simulated execution times (25ms, 75ms, 50ms) for realistic benchmarking.
242-301: LGTM! The `run_benchmark` method has solid structure:

- Proper memory tracing with `tracemalloc`
- Accurate timing with `time.perf_counter()`
- Comprehensive result capture including metadata extraction from `ReasoningResult`
- Appropriate error handling for benchmark robustness (the broad exception catch is acceptable here since benchmarks should be resilient to various failure modes)
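As an illustration of that structure, a generic per-iteration measurement helper might look like the following; names are illustrative rather than the PR's exact code.

```python
# Generic sketch of the pattern: tracemalloc for memory, perf_counter for wall
# time, and a broad except so one bad run does not abort the benchmark.
import time
import tracemalloc
from typing import Any, Callable, Dict


def measure_once(benchmark_fn: Callable[[], Any]) -> Dict[str, Any]:
    tracemalloc.start()
    start = time.perf_counter()
    success, error = True, None
    try:
        benchmark_fn()
    except Exception as exc:  # benchmarks should survive any failure mode
        success, error = False, str(exc)
    duration_ms = (time.perf_counter() - start) * 1000
    current, peak = tracemalloc.get_traced_memory()  # bytes
    tracemalloc.stop()
    return {
        "duration_ms": duration_ms,
        "memory_current_mb": current / 1_048_576,  # bytes -> MiB
        "memory_peak_mb": peak / 1_048_576,
        "success": success,
        "error": error,
    }
```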
362-431: LGTM! The summary generation logic is well-factored with good edge-case handling:

- `stdev` is correctly guarded for single-element lists (line 377)
- `_create_summary_from_results` returns `None` for all-failed scenarios (line 394)
- Clean separation of concerns between grouping, stats calculation, and summary creation
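For context, the single-sample guard matters because `statistics.stdev` raises when given fewer than two values; a self-contained sketch of guarded stats:

```python
# Illustrative guard for the single-sample case mentioned above.
from statistics import mean, median, stdev
from typing import Dict, List


def duration_stats(durations_ms: List[float]) -> Dict[str, float]:
    return {
        "avg": mean(durations_ms),
        "median": median(durations_ms),
        "min": min(durations_ms),
        "max": max(durations_ms),
        "stddev": stdev(durations_ms) if len(durations_ms) > 1 else 0.0,
    }
```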
470-540: LGTM! The CLI implementation is well-designed with:
- Sensible defaults for scenarios, iterations, and goals
- Clear help text for each argument
- Smart runtime management (limiting goals to 2 and halving iterations for goal-solving)
- Proper orchestration of benchmarks across scenarios
✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.
- Removed DeterministicReasoner class that bypassed actual agent logic
- Updated _create_deterministic_agent to instantiate real ReACTReasoner/ReWOOReasoner
- Inject DeterministicLLM and DeterministicTools as dependencies
- Now exercises complete agent code paths (prompt construction, state management, reasoning loops)
- Maintains deterministic behavior for consistent benchmarking
- Addresses PR review feedback to ensure benchmarks measure real agent performance
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@examples/benchmark.py`:
- Around line 309-310: The history.append line inside the for loop has an extra
leading space causing inconsistent indentation; edit the loop in
examples/benchmark.py so the statement that references history
(history.append({"goal": f"test_goal_{i}", "result": f"test_result_{i}" * 50}))
is indented to match the for loop body (same indentation as the for's first
child) to fix the mismatch.
- Around line 79-93: The mock prompt_to_json() must return JSON shapes that
match what ReWOOReasoner.reflect() and ReACTReasoner expect (not just {"mock":
True, "response_id": N}); update the prompt_to_json implementation to include an
"action" key and appropriate additional fields depending on intent (e.g.,
{"action":"give_up"} for immediate give-up decisions,
{"action":"rephrase_step","rephrase":"..."} for rephrase requests,
{"action":"change_tool","tool":"tool_name","params":{...}} for tool-switches,
and {"action":"retry_params","retry_params":{...}} for parameter retries), and
ensure completion()/prompt() still increment call_count and simulate
latency—this keeps the benchmark mocks compatible with ReWOOReasoner and
ReACTReasoner reflection and parameter-generation checks by providing the
expected keys and structures.
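A hedged sketch of a `prompt_to_json` mock shaped the way this comment asks for; the keyword-based routing and exact keys are assumptions drawn from the review text, not the reasoners' real schemas.

```python
# Sketch: routes on keywords in the prompt and returns action/step_type-shaped dicts.
import time
from typing import Any, Dict


class DeterministicLLMJson:
    def __init__(self, response_time_ms: int = 10):
        self.response_time_ms = response_time_ms
        self.call_count = 0

    def prompt_to_json(self, text: str, max_retries: int = 0) -> Dict[str, Any]:
        _ = max_retries
        time.sleep(self.response_time_ms / 1000)  # keep simulating latency
        self.call_count += 1
        lowered = text.lower()
        if "reflect" in lowered:
            # ReWOO-style reflection decision
            return {"action": "give_up"}
        if "think" in lowered:
            # ReACT-style step: stop immediately to keep the run bounded
            return {"step_type": "STOP", "text": "Deterministic final answer"}
        # Fallback: mock parameters for tool parameter generation
        return {"params": {"query": "mock"}}
```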
- Fixed indentation issue on line 310 - Updated prompt_to_json to return proper JSON structures with: - step_type=STOP and text field for ReACT reasoner - action field for ReWOO reflection - Mock parameters for parameter generation - Now completes without errors when exercising real ReACT/ReWOO logic - Addresses CodeRabbit review feedback
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@examples/benchmark.py`:
- Around line 368-397: The function _create_summary_from_results currently
returns None when all runs failed, dropping those scenarios; change it to always
return a BenchmarkSummary: detect if successful is empty and in that case set
duration_stats to sensible defaults (e.g., avg/median/min/max/stddev = 0 or
None), set avg_memory_peak_mb and avg_memory_current_mb to 0.0 (or None), set
success_rate = 0.0, and build metadata from either successful or group_results
as appropriate; ensure calls to _calculate_duration_stats and
_calculate_metadata handle empty input (call _calculate_duration_stats only when
durations non-empty or handle empty inside that function) and return values
match the keys used when constructing BenchmarkSummary so no KeyError occurs.
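One possible shape for that all-failed branch, sketched with plain dicts since the real `BenchmarkSummary` fields beyond those named above are not shown here.

```python
# Sketch of the "always return a summary" behavior; field names beyond those
# mentioned in the review are assumptions.
from statistics import mean, median, stdev
from typing import Any, Dict, List


def create_summary(results: List[Dict[str, Any]]) -> Dict[str, Any]:
    successful = [r for r in results if r.get("success")]
    if not successful:
        # Keep the scenario visible instead of dropping it: zeroed stats, 0% success.
        zero = {"avg": 0.0, "median": 0.0, "min": 0.0, "max": 0.0, "stddev": 0.0}
        return {
            "duration_stats": zero,
            "avg_memory_peak_mb": 0.0,
            "avg_memory_current_mb": 0.0,
            "success_rate": 0.0,
            "metadata": {"errors": [r.get("error") for r in results]},
        }
    durations = [r["duration_ms"] for r in successful]
    return {
        "duration_stats": {
            "avg": mean(durations),
            "median": median(durations),
            "min": min(durations),
            "max": max(durations),
            "stddev": stdev(durations) if len(durations) > 1 else 0.0,
        },
        "avg_memory_peak_mb": mean(r["memory_peak_mb"] for r in successful),
        "avg_memory_current_mb": mean(r["memory_current_mb"] for r in successful),
        "success_rate": len(successful) / len(results),
        "metadata": {},
    }
```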
🧹 Nitpick comments (2)
examples/benchmark.py (2)
79-90: Silence Ruff ARG002 unused-argument warnings for overrides. `completion`, `prompt_to_json`, and `search` intentionally ignore some parameters; consider explicitly marking them as used to keep lint clean.

♻️ Suggested tweak

     def completion(self, messages: List[Dict[str, str]], **kwargs) -> str:
    +    _ = messages, kwargs
         time.sleep(self.response_time_ms / 1000)  # Simulate network latency
         self.call_count += 1
         return f"Mock response #{self.call_count}"

     def prompt_to_json(self, text: str, max_retries: int = 0) -> Dict[str, Any]:
    +    _ = max_retries
         time.sleep(self.response_time_ms / 1000)
         self.call_count += 1
    @@
     def search(self, query: str, *, top_k: int = 10) -> List[ToolBase]:
         """Search for tools matching a query - returns all tools for testing."""
    +    _ = query
         return self.tools[:min(top_k, len(self.tools))]

Also applies to: 149-151
433-444: Create the output directory before writing results. If `--output` points to a non-existent folder, `open()` will fail.

♻️ Suggested tweak

    -with open(filename, 'w', encoding='utf-8') as f:
    +output_path = Path(filename)
    +output_path.parent.mkdir(parents=True, exist_ok=True)
    +with output_path.open('w', encoding='utf-8') as f:
         json.dump(output_data, f, indent=2)
- Silence Ruff ARG002 warnings by marking intentionally unused parameters
- Handle all-failed benchmark scenarios with sensible defaults
- Ensure output directory is created before writing results
- Addresses final CodeRabbit review feedback
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@examples/benchmark.py`:
- Around line 326-332: The trimming simulation is unreachable because the loop
"for i in range(5)" appends exactly 5 items so the condition "if len(history) >
5" never holds; fix by making the loop append more than 5 entries (e.g., use
range(6) or range(N>5)) or adjust the condition to trigger at 5 (e.g., "if
len(history) > 4" or "if len(history) >= 5") so that the simulated window
trimming of agent.memory["conversation_history"] runs; update whichever of the
two (the loop or the condition) in the examples/benchmark.py snippet to ensure
trimming behavior is exercised.
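A tiny self-contained illustration of the off-by-one (the window size and trimming slice are assumptions based on this comment):

```python
# With range(5) the history never exceeds the window of 5, so the trimming
# branch is dead code; one extra entry makes it reachable.
history = []
for i in range(6):  # was range(5)
    history.append({"goal": f"test_goal_{i}", "result": f"test_result_{i}" * 50})
if len(history) > 5:
    history = history[-5:]  # simulated conversation-window trimming
assert len(history) == 5
```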
🧹 Nitpick comments (2)
examples/benchmark.py (2)
90-109: Pattern-based prompt detection is functional but fragile. The keyword matching (`"think" in text.lower()`, `"reflect" in text.lower()`) works for current reasoner prompts but could break if prompt templates change. This is acceptable for benchmark mocking, but consider documenting these assumptions or using more robust detection (e.g., checking for specific JSON schema requests in the prompt).

250-255: Bare exception catch is acceptable for benchmarking, but consider logging the traceback. The static analysis flags `except Exception`, but for benchmarking this is intentional—we want to capture all failure types. However, the traceback is lost, making debugging difficult if benchmarks fail unexpectedly.

💡 Optional: Preserve traceback for debugging

     except Exception as e:
         success = False
         error_message = str(e)
    -    metadata = {"error_type": type(e).__name__}
    +    metadata = {
    +        "error_type": type(e).__name__,
    +        "traceback": traceback.format_exc()
    +    }
         iterations_count = 0
         tool_calls_count = 0

This would require adding `import traceback` at the top of the file.
…nd fix loop range
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@examples/benchmark.py`:
- Around line 292-318: The except block that catches all exceptions for each
iteration (surrounding benchmark_fn) should explicitly log the exception
traceback so the linter knows the swallowing is intentional; in the except for
benchmark_fn, call the module logger (or use traceback.format_exc()) to record
the full exception and traceback along with context (e.g., iteration index,
benchmark_fn.__name__), then continue to set success=False,
error_message=str(e), metadata={"error_type": type(e).__name__} and reset
iterations_count and tool_calls_count as currently done; update the except block
that handles the ReasoningResult extraction to include this logging so BLE001 is
satisfied.
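A sketch of what that logged except block could look like, using `logger.exception` to capture the full traceback; the surrounding variable names mirror the review text and are otherwise assumptions.

```python
# Sketch: log the traceback so the broad catch is clearly intentional,
# then record the failure as before.
import logging

logger = logging.getLogger(__name__)


def run_iteration(benchmark_fn, iteration: int):
    try:
        benchmark_fn()
        return True, None, {}
    except Exception as e:
        logger.exception(
            "Benchmark iteration %d failed in %s", iteration, benchmark_fn.__name__
        )
        success = False
        error_message = str(e)
        metadata = {"error_type": type(e).__name__}
        return success, error_message, metadata
```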
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@examples/benchmark.py`:
- Around line 140-143: The RETURNS in the ACT/STOP branch are over-indented
causing an IndentationError; within the method that checks self.call_count (the
if self.call_count % 4 != 0 / else block), align the two return statements to
the same indentation level as the if/else so they are direct children of the
blocks and return the dictionaries ({"step_type": "ACT", "text": ...} and
{"step_type": "STOP", "text": ...}) without the extra leading whitespace.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@examples/benchmark.py`:
- Around line 33-35: DEFAULT_MODEL is hard-coded to "gpt-3.5-turbo"; make the
model configurable via an environment variable and CLI flag and thread it into
the real agent creation so benchmarks don't break if a model is deprecated.
Update CLI arg parsing (or use os.environ) to accept a --model / MODEL env var,
pass that value into _create_real_agent (and any callers of _create_real_agent),
and default to DEFAULT_MODEL only when no override is provided; also replace
other hard-coded uses of "gpt-3.5-turbo" in the file (references around
_create_real_agent and lines noted) to use the configured value.
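A minimal sketch of the override chain (CLI flag, then `MODEL` environment variable, then `DEFAULT_MODEL`); only the names mentioned in this comment are taken from the PR, the rest is illustrative.

```python
# Sketch of making the benchmark model configurable.
import argparse
import os
from typing import Optional

DEFAULT_MODEL = "gpt-3.5-turbo"


def resolve_model(cli_value: Optional[str] = None) -> str:
    # Precedence: --model flag, then MODEL environment variable, then DEFAULT_MODEL.
    return cli_value or os.environ.get("MODEL") or DEFAULT_MODEL


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Model selection sketch")
    parser.add_argument("--model", default=None, help="LLM model to benchmark against")
    args = parser.parse_args()
    print(resolve_model(args.model))
```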
🧹 Nitpick comments (1)
examples/benchmark.py (1)
17-20: Avoid mutating `sys.path` on import. This executes on import and can leak into tests/tools that import this module. Consider moving the path bootstrap + project imports into `main()` or guarding under `if __name__ == "__main__":`.
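One way to keep the bootstrap out of import time, sketched under the assumption that the project root is two levels above the script; the project import is illustrative.

```python
# Sketch: mutate sys.path only when the file runs as a script, never on import.
import sys
from pathlib import Path


def main() -> None:
    # Project imports happen here, after the bootstrap, so importing this module
    # from tests never mutates sys.path.
    from agents.standard_agent import StandardAgent  # noqa: F401  (illustrative)
    ...


if __name__ == "__main__":
    sys.path.insert(0, str(Path(__file__).resolve().parent.parent))  # assumed repo root
    main()
```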
This benchmarking system enables consistent performance measurement and comparison across different agent configurations, supporting both development and production monitoring use cases.
This is for Hacktoberfest, solving issue #90
Summary by CodeRabbit
New Features
Documentation
Tests
Chores
✏️ Tip: You can customize this high-level summary in your review settings.