Add benchmark metrics persistence #2237
Conversation
Reviewer's Guide: Adds a persistent benchmarks subsystem that parses llama-bench JSON output, stores it in a versioned SQLite schema with migrations, wires it into the bench transport, introduces CONFIG.benchmarks settings, and exposes a new `ramalama benchmarks` CLI command.
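The review does not quote the new settings object itself; a rough sketch of what it might look like, inferred from the `CONFIG.benchmarks.db_path` and `CONFIG.benchmarks.disable` accesses quoted later in this thread (field names and defaults are assumptions, not the PR's actual definition):

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class BenchmarksConfig:
    # Field names inferred from the CONFIG.benchmarks.db_path / .disable
    # usage quoted in the review comments below; defaults are illustrative.
    db_path: Optional[Path] = None  # location of the SQLite benchmarks DB
    disable: bool = False           # opt out of persisting bench results
```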
Summary of Changes: Hello @ieaves, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the benchmarking capabilities by introducing a robust system for persisting and managing benchmark results. It allows users to save detailed benchmark results and review them later via new CLI commands.
Hey there - I've reviewed your changes - here's some feedback:
Blocking issues:
- Avoiding SQL string concatenation: untrusted input concatenated with a raw SQL query can result in SQL injection. In order to execute a raw query safely, a prepared statement should be used. SQLAlchemy provides TextualSQL to easily use prepared statements with named parameters. For complex SQL composition, use the SQL Expression Language or Schema Definition Language. In most cases, the SQLAlchemy ORM will be a better option. (link)
- Detected subprocess function 'CompletedProcess' without a static string. If this data can be controlled by a malicious actor, it may be an instance of command injection. Audit the use of this call to ensure it is not controllable by an external resource. You may consider using 'shlex.escape()'. (link)
- Detected subprocess function 'CompletedProcess' without a static string. If this data can be controlled by a malicious actor, it may be an instance of command injection. Audit the use of this call to ensure it is not controllable by an external resource. You may consider using 'shlex.escape()'. (link)
General comments:
- DBManager.save_llama_bench_result()’s docstring says it returns the number of rows inserted but the implementation doesn’t return anything; either return cursor.rowcount (or similar) or update the docstring/tests to avoid implying a return value.
- The benchmarks logger level is derived from CONFIG.log_level at import time in manager.py, so subsequent changes to CONFIG.log_level won’t affect it; consider configuring the logger lazily or in an explicit init path instead of at module import.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- DBManager.save_llama_bench_result()’s docstring says it returns the number of rows inserted but the implementation doesn’t return anything; either return cursor.rowcount (or similar) or update the docstring/tests to avoid implying a return value.
- The benchmarks logger level is derived from CONFIG.log_level at import time in manager.py, so subsequent changes to CONFIG.log_level won’t affect it; consider configuring the logger lazily or in an explicit init path instead of at module import.
## Individual Comments
### Comment 1
<location> `ramalama/cli.py:611` </location>
<code_context>
+ else:
+ test_type = "-"
+
+ tokens_sec = f"{row['tokens_per_sec']:.1f}" if row["tokens_per_sec"] else "-"
+ engine = (row["inference_engine"] or "")[:15]
+ device = (row["accel"] or "cpu")[:10]
</code_context>
<issue_to_address>
**issue (bug_risk):** Zero tokens/sec are rendered as '-' instead of '0.0', which can hide valid results.
Because this uses a truthiness check, a valid `tokens_per_sec` of `0`/`0.0` is rendered as `-`, conflating “very slow” with “no data”. Instead, check explicitly for `None`:
```python
value = row["tokens_per_sec"]
tokens_sec = f"{value:.1f}" if value is not None else "-"
```
</issue_to_address>
### Comment 2
<location> `ramalama/benchmarks/manager.py:272-273` </location>
<code_context>
+ ORDER BY lb.created_at DESC, lb.id DESC
+ """
+
+ if limit:
+ query += f" LIMIT {limit} OFFSET {offset}"
+
+ with get_conn(self.db_path) as connection:
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Query builds LIMIT/OFFSET via f-strings instead of parameterized SQL, and silently ignores limit=0.
Consider making the `limit` handling explicit and fully parameterized, e.g.:
```python
params: list[Any] = []
if limit is not None:
query += " LIMIT ? OFFSET ?"
params.extend([limit, offset])
cursor.execute(query, params)
```
This avoids surprising behavior for `limit=0` and keeps the query consistent with parameterized SQL practices.
Suggested implementation:
```python
FROM llama_bench lb
JOIN container_configuration cc ON lb.test_configuration_id = cc.id
JOIN user_device ud ON cc.user_device_id = ud.id
ORDER BY lb.created_at DESC, lb.id DESC
"""
params: list[Any] = []
if limit is not None:
query += " LIMIT ? OFFSET ?"
params.extend([limit, offset])
```
To fully integrate this change, you should also:
1. Ensure `Any` is imported at the top of `ramalama/benchmarks/manager.py` (if not already present), e.g.:
```python
from typing import Any
```
2. Update the corresponding `cursor.execute` call in this method to pass `params`:
- If it currently does `cursor.execute(query)`, change it to `cursor.execute(query, params)`.
- If there are already existing query parameters, instead of initializing `params: list[Any] = []`, you should:
- Initialize `params` with the existing parameters before this snippet.
- Keep the `params.extend([limit, offset])` so that `LIMIT` and `OFFSET` are appended correctly.
3. Make sure this `params` variable is in the same scope as the `cursor.execute` call that runs this query.
</issue_to_address>
### Comment 3
<location> `ramalama/benchmarks/manager.py:209-214` </location>
<code_context>
+ migrate(self.db_path)
+ self.user_device_id = update_user_device(self.db_path)
+
+ def save_llama_bench_result(
+ self, configuration: llama_bench.TestConfiguration, result: llama_bench.LlamaBenchResult
+ ):
+ """Insert a single llama-bench result row. Returns rows inserted (1 or 0)."""
+
+ configuration_id = self.save_container_configuration(configuration)
+
+ columns = ["test_configuration_id"] + llama_bench.TABLE_COLUMNS
</code_context>
<issue_to_address>
**suggestion (performance):** Saving each benchmark result re-inserts identical container configuration rows, which can bloat the DB.
`save_llama_bench_result()` calls `save_container_configuration()` for every result in `bench()`, so the same configuration is inserted repeatedly. This unnecessarily grows `container_configuration` and complicates querying distinct configs.
Instead, persist the configuration once per run, reuse its `id` for all results, and bulk-insert the result rows using that shared `configuration_id`. Exposing a method that returns `configuration_id` would support this pattern.
Suggested implementation:
```python
def save_llama_bench_result(
self, configuration: llama_bench.TestConfiguration, result: llama_bench.LlamaBenchResult
) -> int:
"""Insert a single llama-bench result row. Returns rows inserted (1 or 0).
For optimal performance across a run, prefer:
1) Persist the configuration once via `save_llama_bench_configuration()`
2) Bulk-insert results via `save_llama_bench_results(configuration_id, results)`
"""
configuration_id = self.save_container_configuration(configuration)
return self.save_llama_bench_results(configuration_id, [result])
def save_llama_bench_configuration(
self, configuration: llama_bench.TestConfiguration
) -> int:
"""Persist a llama-bench container configuration and return its id.
This is a thin wrapper over `save_container_configuration()` to make the
llama-bench workflow explicit and allow re-use of the configuration id
across multiple result inserts in a single run.
"""
return self.save_container_configuration(configuration)
def save_llama_bench_results(
self,
configuration_id: int,
results: "Collection[llama_bench.LlamaBenchResult]",
) -> int:
"""Bulk-insert llama-bench result rows sharing a configuration id.
Returns the number of rows inserted.
"""
# Fast path: nothing to insert.
if not results:
return 0
columns = ["test_configuration_id"] + llama_bench.TABLE_COLUMNS
placeholders = ", ".join("?" for _ in columns)
insert_sql = f"INSERT INTO llama_bench ({', '.join(columns)}) VALUES ({placeholders})"
rows = [(configuration_id, *result.as_db_tuple()) for result in results]
with get_conn(self.db_path) as connection:
cursor = connection.cursor()
cursor.executemany(insert_sql, rows)
return cursor.rowcount
def save_container_configuration(self, configuration: llama_bench.TestConfiguration) -> int:
"""Insert a container_configuration row and return its id."""
```
1. At the top of `ramalama/benchmarks/manager.py`, import `Collection` from `collections.abc` and use the non-string annotation if your type-checking configuration supports it:
```python
from collections.abc import Collection
```
Then update the type annotation in `save_llama_bench_results` to:
```python
def save_llama_bench_results(
self,
configuration_id: int,
results: Collection[llama_bench.LlamaBenchResult],
) -> int:
```
2. Update the llama-bench `bench()` workflow (wherever it lives) to avoid re-inserting identical configurations:
```python
configuration_id = db_manager.save_llama_bench_configuration(configuration)
db_manager.save_llama_bench_results(configuration_id, results)
```
instead of calling `save_llama_bench_result()` for every `result`.
</issue_to_address>
### Comment 4
<location> `ramalama/transports/base.py:506-387` </location>
<code_context>
+ return None
+ if isinstance(value, bool):
+ return 1 if value else 0
+ try:
+ return int(value)
+ except (ValueError, TypeError):
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Catch-all around benchmark execution may hide failures unrelated to result capture/parsing.
Using a broad `try/except Exception` and falling back to `self.execute_command()` will mask unrelated errors (e.g., logic or DB issues) as a simple command execution, making real failures harder to see and debug.
If the goal is to recover only from capture/parsing issues, narrow the `try` to just that logic or catch specific exception types so genuine runtime errors still propagate while malformed output is handled gracefully.
</issue_to_address>
### Comment 5
<location> `ramalama/benchmarks/manager.py:20-21` </location>
<code_context>
+from ramalama.rag import rag_image
+from ramalama.version import version
+
+logger = logging.getLogger("ramalama.benchmarks")
+logger.setLevel(CONFIG.log_level or LogLevel.WARNING)
+
+
</code_context>
<issue_to_address>
**suggestion:** Setting the logger level at import time couples logging configuration to CONFIG and may override application-level logging.
This also assumes `CONFIG` is fully initialized at import time. Instead, consider leaving the module logger unconfigured here and have the main application/CLI set log levels from `CONFIG`, so consumers (including tests or embedding apps) can control logging without surprises.
Suggested implementation:
```python
logger = logging.getLogger("ramalama.benchmarks")
```
If `LogLevel` and/or `CONFIG` are only used here for setting the logger level and not elsewhere in `manager.py`, you should also remove those unused imports:
- Remove `CONFIG` from `from ramalama.config import CONFIG, ...`
- Remove `LogLevel` from `from ramalama.log_levels import LogLevel`
Logging configuration based on `CONFIG` should instead be done in the main application/CLI startup code (e.g., where `CONFIG` is loaded), by calling `logging.getLogger("ramalama.benchmarks").setLevel(...)` there.
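A minimal sketch of that startup wiring, assuming the import paths named above (`ramalama.config.CONFIG`, `ramalama.log_levels.LogLevel`); the function name is illustrative, not an existing ramalama API:

```python
import logging

from ramalama.config import CONFIG
from ramalama.log_levels import LogLevel


def configure_benchmark_logging() -> None:
    # Call from CLI startup once CONFIG is fully loaded; calling it again
    # picks up any later change to CONFIG.log_level.
    level = CONFIG.log_level or LogLevel.WARNING
    logging.getLogger("ramalama.benchmarks").setLevel(level)
```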
</issue_to_address>
### Comment 6
<location> `test/unit/test_benchmarks_manager.py:49-58` </location>
<code_context>
+def test_save_llama_bench_result_inserts_all(tmp_path, monkeypatch):
</code_context>
<issue_to_address>
**suggestion (testing):** Clarify expectations for `save_llama_bench_result` return value and assert on DB state only
In `test_save_llama_bench_result_inserts_all`, `save_llama_bench_result` currently returns `None`, so `inserted in (None, 1)` will always be true via the `None` branch and the `1` case is dead code. I’d suggest removing `inserted` altogether and asserting only on DB state (row counts/contents), which you’re already validating. If you do want this function to return an insertion count, then update the implementation accordingly and assert on that specific value instead of a union of possibilities.
Suggested implementation: the tool's diff survived here only as fragmentary search/replace snippets. The intended edit is to call `db.save_llama_bench_result(...)` directly instead of assigning its return value to `inserted`, remove the `assert inserted in (None, 1)` line, and keep the existing DB-state assertions.
These edits assume the test currently:
1. Assigns the result of `db.save_llama_bench_result(...)` to a variable named `inserted`.
2. Has a line `assert inserted in (None, 1)`.
If the exact spelling/spacing differs, you may need to adjust the `SEARCH` strings accordingly. No changes to `save_llama_bench_result` itself are required unless you later decide to give it a meaningful return value and explicitly assert on that.
</issue_to_address>
### Comment 7
<location> `test/unit/test_benchmarks_manager.py:32-29` </location>
<code_context>
+ con.close()
+
+
+def test_update_user_device_idempotent(tmp_path, monkeypatch):
+ _patch_config(monkeypatch, tmp_path)
+ db_path = tmp_path / "bench.sqlite"
+ manager.migrate(db_path)
+
+ first_id = manager.update_user_device(db_path)
+ second_id = manager.update_user_device(db_path)
+
+ assert first_id == second_id
+
+ con = sqlite3.connect(db_path)
+ cur = con.cursor()
+ cur.execute("SELECT COUNT(*) FROM user_device")
+ assert cur.fetchone()[0] == 1
+ con.close()
+
+
</code_context>
<issue_to_address>
**suggestion (testing):** Add test coverage for `DBManager.list_benchmarks` to validate filtering and joins
Current tests don’t exercise `DBManager.list_benchmarks`, so regressions in its join/ordering/limit logic wouldn’t be caught and could break `benchmarks list`. Please add a test that:
- inserts at least two `llama_bench` rows (with associated config/device data),
- calls `list_benchmarks()` with and without `limit`/`offset`, and
- checks the key fields in the returned rows (e.g. `tokens_per_sec`, `inference_engine`, `hostname`, `accel`, `created_at`) and that results are ordered most-recent-first.
This will better validate the read path against the persisted data.
Suggested implementation:
```python
from ramalama.benchmarks import llama_bench, manager
def _patch_config(monkeypatch, tmp_path):
# Avoid external engine calls and keep paths isolated per test.
monkeypatch.setattr(manager.CONFIG, "engine", None)
monkeypatch.setattr(manager.CONFIG, "runtime", "llama.cpp")
monkeypatch.setattr(manager.CONFIG, "store", str(tmp_path / "store"))
monkeypatch.setattr(manager.CONFIG, "container", False)
monkeypatch.setattr(manager, "get_accel", lambda: "none")
def test_list_benchmarks_filters_joins_and_orders(tmp_path, monkeypatch):
# Arrange: configure isolated environment and migrate schema
_patch_config(monkeypatch, tmp_path)
db_path = tmp_path / "bench.sqlite"
manager.migrate(db_path)
# Ensure we have a user_device row for the join
first_device_id = manager.update_user_device(db_path)
second_device_id = manager.update_user_device(db_path)
assert first_device_id == second_device_id
user_device_id = first_device_id
con = sqlite3.connect(db_path)
try:
cur = con.cursor()
# Insert two container configurations (for join/inference_engine/runtime)
cur.execute(
"""
INSERT INTO container_configuration (inference_engine, runtime, extra_args)
VALUES (?, ?, ?)
""",
("engine-a", "llama.cpp", "--foo"),
)
config_a_id = cur.lastrowid
cur.execute(
"""
INSERT INTO container_configuration (inference_engine, runtime, extra_args)
VALUES (?, ?, ?)
""",
("engine-b", "llama.cpp", "--bar"),
)
config_b_id = cur.lastrowid
# Insert two llama_bench rows with different timestamps and TPS
# so that ordering (most recent first) and limits can be asserted.
cur.execute(
"""
INSERT INTO llama_bench (
container_configuration_id,
user_device_id,
tokens_per_sec,
created_at
)
VALUES (?, ?, ?, ?)
""",
(config_a_id, user_device_id, 100.0, "2024-01-01T00:00:00Z"),
)
older_id = cur.lastrowid
cur.execute(
"""
INSERT INTO llama_bench (
container_configuration_id,
user_device_id,
tokens_per_sec,
created_at
)
VALUES (?, ?, ?, ?)
""",
(config_b_id, user_device_id, 200.0, "2024-01-02T00:00:00Z"),
)
newer_id = cur.lastrowid
assert older_id != newer_id
con.commit()
finally:
con.close()
# Act: fetch benchmarks via the DBManager read path
all_rows = list(manager.list_benchmarks(db_path))
limited_rows = list(manager.list_benchmarks(db_path, limit=1))
offset_rows = list(manager.list_benchmarks(db_path, limit=1, offset=1))
# Assert: we got both rows, ordered most-recent-first
assert len(all_rows) == 2
assert all_rows[0]["tokens_per_sec"] == 200.0
assert all_rows[1]["tokens_per_sec"] == 100.0
# Key joined / projected fields should be present and consistent
first = all_rows[0]
second = all_rows[1]
# These fields exercise the join between llama_bench, container_configuration,
# and user_device; exact field names mirror list_benchmarks' SELECT.
assert first["inference_engine"] == "engine-b"
assert second["inference_engine"] == "engine-a"
# hostname/accel come from user_device; created_at from llama_bench
assert isinstance(first["hostname"], str)
assert first["hostname"]
assert isinstance(first["accel"], str)
assert first["accel"] == "none"
assert first["created_at"] > second["created_at"]
# Limit and offset behaviour:
# - limit=1 should return only the most recent row
assert len(limited_rows) == 1
assert limited_rows[0]["tokens_per_sec"] == 200.0
# - limit=1, offset=1 should return the second-most-recent row
assert len(offset_rows) == 1
assert offset_rows[0]["tokens_per_sec"] == 100.0
```
This test assumes the following about the existing schema and API, which you may need to align with the actual code:
1. Table schemas:
- `container_configuration` has columns `(id INTEGER PRIMARY KEY, inference_engine TEXT, runtime TEXT, extra_args TEXT, ...)`.
- `llama_bench` has columns `(id INTEGER PRIMARY KEY, container_configuration_id INTEGER, user_device_id INTEGER, tokens_per_sec REAL, created_at TEXT, ...)`.
If your actual column names differ (e.g. `tps` instead of `tokens_per_sec`, `config_id` instead of `container_configuration_id`, or no `extra_args` column), update the `INSERT` statements accordingly.
2. Foreign keys:
- `llama_bench.container_configuration_id` must reference `container_configuration.id`.
- `llama_bench.user_device_id` must reference `user_device.id`.
If your FK column names differ, adjust both the `INSERT` column lists and the join expectations in `list_benchmarks`.
3. `list_benchmarks` API:
- This test calls `manager.list_benchmarks(db_path, limit=None, offset=None)` and expects an iterable of dict-like rows keyed by:
`tokens_per_sec`, `inference_engine`, `hostname`, `accel`, and `created_at`.
If the function is a method (e.g. `DBManager(db_path).list_benchmarks(...)`) or returns rows with different keys or types, update the invocation and assertions to match.
4. Ordering:
- The test assumes `list_benchmarks` orders by `created_at` descending (most recent first).
If it orders differently (e.g. by `id`), ensure that logic is updated or adjust the assertions to reflect the intended contract.
</issue_to_address>
### Comment 8
<location> `ramalama/benchmarks/manager.py:223` </location>
<code_context>
cursor.execute(insert_sql, values)
</code_context>
<issue_to_address>
**security (python.sqlalchemy.security.sqlalchemy-execute-raw-query):** Avoiding SQL string concatenation: untrusted input concatenated with a raw SQL query can result in SQL injection. In order to execute a raw query safely, a prepared statement should be used. SQLAlchemy provides TextualSQL to easily use prepared statements with named parameters. For complex SQL composition, use the SQL Expression Language or Schema Definition Language. In most cases, the SQLAlchemy ORM will be a better option.
*Source: opengrep*
</issue_to_address>
### Comment 9
<location> `ramalama/transports/base.py:395` </location>
<code_context>
return subprocess.CompletedProcess(args=cmd, returncode=0, stdout="", stderr="")
</code_context>
<issue_to_address>
**security (python.lang.security.audit.dangerous-subprocess-use-audit):** Detected subprocess function 'CompletedProcess' without a static string. If this data can be controlled by a malicious actor, it may be an instance of command injection. Audit the use of this call to ensure it is not controllable by an external resource. You may consider using 'shlex.escape()'.
*Source: opengrep*
</issue_to_address>
### Comment 10
<location> `ramalama/transports/base.py:401` </location>
<code_context>
return subprocess.CompletedProcess(args=cmd, returncode=0, stdout="", stderr="")
</code_context>
<issue_to_address>
**security (python.lang.security.audit.dangerous-subprocess-use-audit):** Detected subprocess function 'CompletedProcess' without a static string. If this data can be controlled by a malicious actor, it may be an instance of command injection. Audit the use of this call to ensure it is not controllable by an external resource. You may consider using 'shlex.escape()'.
*Source: opengrep*
</issue_to_address>
Code Review
This pull request introduces a significant new feature: persisting benchmark results to a local SQLite database. It includes a versioned database schema with migration support, data models for benchmark results, and new CLI commands (`ramalama benchmarks list`) to view the stored data. The existing `ramalama bench` command is enhanced to parse and store its output.
The implementation is well-structured and robust, with good separation of concerns between data models, database management, and CLI. The use of a context manager for database connections and a dedicated migration manager are excellent practices.
I've identified a few areas for improvement, mainly related to making exception handling more specific and improving code clarity in a couple of places. Overall, this is a great addition to the project.
New security issues found
Hey - I've found 1 security issue, 5 other issues, and left some high level feedback:
Security issues:
- Detected subprocess function 'CompletedProcess' without a static string. If this data can be controlled by a malicious actor, it may be an instance of command injection. Audit the use of this call to ensure it is not controllable by an external resource. You may consider using 'shlex.escape()'. (link)
General comments:
- In `BaseTransport.bench` the broad outer `except Exception` followed by a fallback to `self.execute_command(cmd, args)` means a benchmark command that already ran (but failed parsing/persistence) will be executed a second time; consider distinguishing parse/persistence failures from command-execution failures and avoid re-running the benchmark in the error path.
- The migration path in `run_next_migration`/`DBMigrationManager.run_migrations` only explicitly handles version 0 → 1 and otherwise raises, which will make adding schema version 2+ fragile; it may be worth wiring migrations by filename or a mapping so future upgrades can be chained without modifying this function's control flow (a minimal mapping sketch follows this list).
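A minimal sketch of the mapping idea, assuming plain sqlite3 connections; the function and table names here are illustrative, not the PR's actual migration API:

```python
import sqlite3
from typing import Callable


def _migrate_v0_to_v1(connection: sqlite3.Connection) -> None:
    # Placeholder for the real v0 -> v1 schema migration.
    connection.execute("CREATE TABLE IF NOT EXISTS meta (key TEXT PRIMARY KEY, value TEXT)")


# One entry per single-version upgrade; schema version 2+ slots in here
# without touching the control flow of run_migrations().
MIGRATIONS: dict[int, Callable[[sqlite3.Connection], None]] = {
    0: _migrate_v0_to_v1,
    # 1: _migrate_v1_to_v2,
}


def run_migrations(connection: sqlite3.Connection, current_version: int) -> int:
    version = current_version
    while version in MIGRATIONS:
        MIGRATIONS[version](connection)
        version += 1
    return version
```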
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In `BaseTransport.bench` the broad outer `except Exception` followed by a fallback to `self.execute_command(cmd, args)` means a benchmark command that already ran (but failed parsing/persistence) will be executed a second time; consider distinguishing parse/persistence failures from command-execution failures and avoid re-running the benchmark in the error path.
- The migration path in `run_next_migration`/`DBMigrationManager.run_migrations` only explicitly handles version 0 → 1 and otherwise raises, which will make adding schema version 2+ fragile; it may be worth wiring migrations by filename or a mapping so future upgrades can be chained without modifying this function’s control flow.
## Individual Comments
### Comment 1
<location> `ramalama/transports/base.py:425-429` </location>
<code_context>
+ try:
+ bench_results = parse_json(result.stdout)
+ except (json.JSONDecodeError, ValueError):
+ message = ("Could not parse benchmark output. Expected json but got: \n", result.stdout)
+ raise ValueError(message)
+
+ if output_format == "json":
</code_context>
<issue_to_address>
**suggestion:** The parse error message is built as a tuple, which produces an unhelpful exception string.
Building `message` as a tuple makes the `ValueError` look like `("Could not parse ...", "<stdout>")` instead of a clear string, which hurts readability in logs and debugging. Please build a single formatted string instead, for example:
```python
message = f"Could not parse benchmark output. Expected JSON but got:\n{result.stdout}"
raise ValueError(message)
```
```suggestion
try:
bench_results = parse_json(result.stdout)
except (json.JSONDecodeError, ValueError):
message = f"Could not parse benchmark output. Expected JSON but got:\n{result.stdout}"
raise ValueError(message)
```
</issue_to_address>
### Comment 2
<location> `ramalama/benchmarks/manager.py:284-286` </location>
<code_context>
+ LIMIT ? OFFSET ?
+ """
+
+ effective_limit = limit if limit is not None else -1
+ effective_offset = offset if limit is not None else 0
+ params = [effective_limit, effective_offset]
+
+ with get_conn(self.db_path) as connection:
</code_context>
<issue_to_address>
**issue:** Offset is silently ignored when no limit is provided, which may surprise CLI users.
In `list_benchmarks`, when `limit` is `None` you force `effective_offset = 0`, so a user-specified `offset` is ignored. From the CLI, `--offset` appears usable on its own. Either honor `offset` regardless of `limit`, or explicitly reject `offset` when `limit` is not set so the behavior is clear to users.
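One way to honor `offset` on its own is to lean on SQLite's convention that a negative LIMIT means no upper bound; a small hypothetical helper (not the PR's code) to illustrate:

```python
from typing import Optional


def limit_offset_params(limit: Optional[int], offset: Optional[int]) -> list[int]:
    # SQLite treats a negative LIMIT as "no upper bound", so OFFSET still
    # applies even when the caller did not request a limit.
    effective_limit = limit if limit is not None else -1
    effective_offset = offset or 0
    return [effective_limit, effective_offset]


# usage: cursor.execute(query, limit_offset_params(limit, offset))
```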
</issue_to_address>
### Comment 3
<location> `test/unit/test_benchmarks_manager.py:15-24` </location>
<code_context>
+ monkeypatch.setattr(manager, "get_accel", lambda: "none")
+
+
+def test_migrate_creates_schema(tmp_path, monkeypatch):
+ _patch_config(monkeypatch, tmp_path)
+ db_path = tmp_path / "bench.sqlite"
+
+ manager.migrate(db_path)
+
+ con = sqlite3.connect(db_path)
+ cur = con.cursor()
+ cur.execute("SELECT name FROM sqlite_master WHERE type='table'")
+ table_names = {row[0] for row in cur.fetchall()}
+ assert {"meta", "user_device", "container_configuration", "llama_bench"} <= table_names
+
+ cur.execute("SELECT value FROM meta WHERE key='schema_version'")
+ assert cur.fetchone()[0] == "1"
+ con.close()
+
</code_context>
<issue_to_address>
**suggestion (testing):** Add a test for DBManager initialization with a missing DB path
We currently test `migrate`, `update_user_device`, and the happy-path `DBManager` constructor, but we never verify that `DBManager(None)` raises `MissingDBPathError` as designed. Please add a small test that constructs `DBManager` with `None` (or no config path) and asserts that this exception is raised, so this behaviour is covered and guarded against regressions.
Suggested implementation:
```python
import pytest
from ramalama.benchmarks import llama_bench, manager
```
```python
def _patch_config(monkeypatch, tmp_path):
# Avoid external engine calls and keep paths isolated per test.
monkeypatch.setattr(manager.CONFIG, "engine", None)
monkeypatch.setattr(manager.CONFIG, "runtime", "llama.cpp")
monkeypatch.setattr(manager.CONFIG, "store", str(tmp_path / "store"))
monkeypatch.setattr(manager.CONFIG, "container", False)
monkeypatch.setattr(manager, "get_accel", lambda: "none")
def test_dbmanager_missing_db_path_raises(tmp_path, monkeypatch):
_patch_config(monkeypatch, tmp_path)
with pytest.raises(manager.MissingDBPathError):
manager.DBManager(None)
```
</issue_to_address>
### Comment 4
<location> `test/unit/test_benchmarks_manager.py:107-116` </location>
<code_context>
+def test_list_benchmarks_orders_and_limits(tmp_path, monkeypatch):
</code_context>
<issue_to_address>
**suggestion (testing):** Consider adding coverage for the empty-results case in list_benchmarks
Right now we only verify ordering and pagination. Since `DBManager.list_benchmarks` is also expected to return an empty `LlamaBenchResultCollection` when there are no rows, please add a test that calls `list_benchmarks` on a fresh DB (no inserts) and asserts that the collection is empty, so downstream code can rely on that behavior.
Suggested implementation:
```python
assert bench_row[1:] == ("abc123", "model.gguf", 2, 8, 16, 1.5, 0.1)
con.close()
def test_list_benchmarks_empty_returns_empty_collection(tmp_path, monkeypatch):
_patch_config(monkeypatch, tmp_path)
db_path = tmp_path / "bench.sqlite"
db = manager.DBManager(db_path)
collection = db.list_benchmarks()
# list_benchmarks on a fresh DB should return an empty LlamaBenchResultCollection
assert isinstance(collection, llama_bench.LlamaBenchResultCollection)
assert not collection
def test_list_benchmarks_orders_and_limits(tmp_path, monkeypatch):
_patch_config(monkeypatch, tmp_path)
db_path = tmp_path / "bench.sqlite"
db = manager.DBManager(db_path)
cfg_a = llama_bench.TestConfiguration(
container_image="image-a",
container_runtime="docker",
inference_engine="engine-a",
runtime_args={"threads": 1},
)
```
If `LlamaBenchResultCollection` is not currently imported into this test module (e.g. via `import llama_bench` or `from ... import LlamaBenchResultCollection`), make sure the existing imports already provide access to `llama_bench.LlamaBenchResultCollection`. If they do not, add the appropriate import at the top of the file to ensure `llama_bench` is available.
</issue_to_address>
### Comment 5
<location> `ramalama/transports/base.py:405` </location>
<code_context>
def bench(self, args, cmd: list[str]):
set_accel_env_vars()
- self.execute_command(cmd, args)
</code_context>
<issue_to_address>
**issue (complexity):** Consider refactoring the new bench logic into small helper methods for running, parsing, rendering, and saving results so bench stays a simple orchestrator with narrower try/except scopes.
You can keep all the new behavior but lower complexity by pulling the major branches out of `bench` and tightening exception scopes.
### 1. Extract process execution + parsing
Move the dryrun/container/local logic and JSON parsing into small helpers. This keeps `bench` as an orchestrator and isolates the catch‑all fallback to just the execution/parsing step.
```python
def _run_bench_process(self, args, cmd: list[str]) -> subprocess.CompletedProcess:
if args.dryrun:
if args.container:
self.engine.dryrun()
else:
dry_run(cmd)
return subprocess.CompletedProcess(args=cmd, returncode=0, stdout="", stderr="")
if args.container:
self.setup_container(args)
self.setup_mounts(args)
self.engine.add([args.image] + cmd)
return self.engine.run_process()
return run_cmd(cmd, encoding="utf-8")
def _parse_bench_results(self, stdout: str):
try:
return parse_json(stdout)
except (json.JSONDecodeError, ValueError):
msg = ("Could not parse benchmark output. Expected json but got: \n", stdout)
raise ValueError(msg)
```
Then `bench` becomes simpler and the outer `try/except` only guards the execution/parsing part that falls back to `execute_command`:
```python
def bench(self, args, cmd: list[str]):
set_accel_env_vars()
output_format = getattr(args, "format", "table")
try:
result = self._run_bench_process(args, cmd)
bench_results = self._parse_bench_results(result.stdout)
except Exception as e:
logger.debug(f"Failed to capture benchmark output: {e}")
self.execute_command(cmd, args)
return
self._render_bench_results(output_format, result.stdout, bench_results)
self._save_bench_results(args, bench_results)
```
### 2. Extract rendering logic
Avoid branching for formatting directly in `bench`:
```python
def _render_bench_results(self, output_format: str, raw_stdout: str, bench_results):
if output_format == "json":
print(raw_stdout)
else:
print_bench_results(bench_results)
```
### 3. Isolate DB side effects and narrow exception scope
Keep DB exceptions local to the save helper, log and continue:
```python
def _save_bench_results(self, args, bench_results):
db_path = CONFIG.benchmarks.db_path
if not (db_path and not CONFIG.benchmarks.disable):
return
try:
test_config = TestConfiguration.from_args(args)
db = DBManager(db_path)
inserted = 0
if bench_results:
configuration_id = db.save_llama_bench_configuration(test_config)
inserted = db.save_llama_bench_results(configuration_id, bench_results)
logger.debug(f"Saved {inserted} benchmark result(s) to database")
except Exception as e:
logger.warning(f"Failed to save benchmark results to database: {e}")
```
This keeps all existing behavior (dry-run/container/local, parse, print, DB save, fallback to `execute_command`) but flattens the control flow and isolates responsibilities into testable units.
</issue_to_address>
### Comment 6
<location> `ramalama/transports/base.py:416` </location>
<code_context>
result = subprocess.CompletedProcess(args=cmd, returncode=0, stdout="", stderr="")
</code_context>
<issue_to_address>
**security (python.lang.security.audit.dangerous-subprocess-use-audit):** Detected subprocess function 'CompletedProcess' without a static string. If this data can be controlled by a malicious actor, it may be an instance of command injection. Audit the use of this call to ensure it is not controllable by an external resource. You may consider using 'shlex.escape()'.
*Source: opengrep*
</issue_to_address>
@olliewalsh @rhatdan This adds a persistence mechanism for benchmark results so that users can track performance metrics over time and against different image / accelerator configurations. Would you be willing to take a look?
@ieaves hey, good idea. I would be concerned about the maintenance cost of this though.
I would be inclined to stop at just adding support for json output. The user could then feed the results into $whatever they want to use to store and query performance data, e.g. Grafana or Elastic/Kibana.
docs/ramalama-bench.1.md (outdated diff excerpt):

> #### **--format**
> Set the output format of the benchmark results (default: table).
should list the options
@olliewalsh I have to admit I'm squeamish about adding sqlite as well (particularly without dependencies to help manage schema) but I had in mind adding tracking for real-time chat performance and potentially other test results beyond those from llama-bench next. Since different benchmarking tools and the real-time results would all follow different schemas we'd end up having to mix them together in a single json file and then manually manage loading / parsing / filtering them informally throughout the rest of the benchmarking code (not to mention the potential performance costs of deserializing the entire file on every read). It's not a slam dunk case for going the db route either but all roads felt a little icky in different ways. Bundling all of the system / image / benchmark data into a json record and dumping to a jsonl file is definitely easier to manage though and it'd be pretty easy for me to switch the implementation over. I think json is unambiguously the better option if we only care about discrete benchmark test results from llama-bench.
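For comparison, a rough sketch of the jsonl alternative being weighed here; the record fields and file name are made up for illustration and are not part of the PR:

```python
import json
from pathlib import Path


def append_benchmark_record(path: Path, record: dict) -> None:
    # One self-describing JSON object per line; readers still have to load
    # and filter the whole file to query across runs.
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")


append_benchmark_record(
    Path("benchmarks.jsonl"),
    {"source": "llama-bench", "model": "model.gguf", "tokens_per_sec": 42.0},
)
```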
Force-pushed from 5dc2c43 to e3671da
@olliewalsh I've refactored this to use a
Summary by Sourcery
Add persistent storage and management for llama-bench metrics via a local SQLite database and expose them through new CLI commands and configuration.
New Features:
Enhancements:
Documentation:
Tests: