
Conversation

Collaborator

@ieaves ieaves commented Dec 12, 2025

Summary by Sourcery

Add persistent storage and management for llama-bench metrics via a local SQLite database and expose them through new CLI commands and configuration.

New Features:

  • Introduce a benchmarks SQLite schema and migration to store llama-bench runs with device and container configuration metadata.
  • Add a new ramalama benchmarks CLI with a list subcommand to page through historical benchmark results in table or JSON format.
  • Extend ramalama bench to support choosing output format and to persist parsed llama-bench results into the benchmarks database when enabled.

Enhancements:

  • Add benchmark-related configuration (including default database path and disable flag) to the global configuration model.
  • Format benchmark results into a human-readable table, including optional IDs, engine and timestamp metadata when available.

Documentation:

  • Document the new ramalama.benchmarks config table and db_path option in ramalama.conf and its manpage.
  • Add manpages for the new ramalama-benchmarks command and the --format option for ramalama-bench.
  • Update the top-level ramalama manpage to reference the benchmarks command.

Tests:

  • Add unit tests covering database migration, user device upserts, benchmark result persistence, and ordered, paginated listing of benchmark entries.

Contributor

sourcery-ai bot commented Dec 12, 2025

Reviewer's Guide

Adds a persistent benchmarks subsystem: parses llama-bench JSON output, stores it in a versioned SQLite schema with migrations, wires it into the bench transport, introduces CONFIG.benchmarks settings, and exposes a new ramalama benchmarks CLI with table/JSON views over stored results, along with updated documentation and tests.
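For orientation, here is a minimal sketch of what parsing `llama-bench -o json` output into rows could look like. Field names such as `model_filename`, `n_gpu_layers`, and `avg_ts` come from upstream llama-bench's JSON output; the PR's actual `parse_json`/`LlamaBenchResult` in `ramalama/benchmarks/llama_bench.py` may differ.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class BenchRow:
    """One llama-bench measurement (field names are illustrative, not the PR's exact schema)."""
    model: Optional[str]
    n_threads: Optional[int]
    n_gpu_layers: Optional[int]
    tokens_per_sec: Optional[float]


def parse_llama_bench_output(stdout: str) -> list[BenchRow]:
    """Parse `llama-bench -o json` output: either a JSON array or JSON Lines."""
    stdout = stdout.strip()
    try:
        records = json.loads(stdout)  # `-o json` emits a single array of result objects
        if isinstance(records, dict):
            records = [records]
    except json.JSONDecodeError:
        # Fall back to JSONL: one JSON object per line.
        records = [json.loads(line) for line in stdout.splitlines() if line.strip()]
    return [
        BenchRow(
            model=r.get("model_filename") or r.get("model"),
            n_threads=r.get("n_threads"),
            n_gpu_layers=r.get("n_gpu_layers"),
            tokens_per_sec=r.get("avg_ts"),  # llama-bench reports average tokens/sec as avg_ts
        )
        for r in records
    ]
```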

File-Level Changes

Extend CLI with benchmark formatting and a new benchmarks subcommand to list stored results from the SQLite database.
  • Add --format {table,json} option to bench subcommand and document it in ramalama-bench.1.md
  • Introduce benchmarks top-level subcommand with list subcommand, pagination and table/JSON output selection
  • Implement benchmarks_list_cli to open DB via DBManager, handle missing/disabled DB path, and render results via JSON or table helper
  • Register new subcommand in configure_subcommands and add manpage + link from main ramalama.1.md
  Files: ramalama/cli.py, docs/ramalama-bench.1.md, docs/ramalama.1.md, docs/ramalama-benchmarks.1.md
Capture llama-bench JSON output in the transport, render it as table/JSON, and persist it to the benchmarks database with configuration metadata.
  • Refactor TransportBase.bench to run the benchmark command, capture stdout, and parse it as JSON into domain objects
  • On success, print either raw JSON or a formatted table via print_bench_results depending on --format
  • When CONFIG.benchmarks.db_path is set and not disabled, build a TestConfiguration from CLI args and save results via DBManager
  • On any failure in the enhanced path, fall back to the previous execute_command behavior while logging debug info
  Files: ramalama/transports/base.py, ramalama/benchmarks/llama_bench.py, ramalama/benchmarks/manager.py, ramalama/benchmarks/errors.py
Introduce configuration and documentation for persistent benchmark storage, including default DB path resolution and config-doc consistency updates.
  • Add Benchmarks dataclass with db_path (defaulting to first existing DEFAULT_CONFIG_DIR/db.sql) and disable flag, referenced from BaseConfig as benchmarks
  • Adjust config post-init and helper utilities where necessary, and update config documentation tests to account for new subsection
  • Document ramalama.benchmarks table and db_path option in docs/ramalama.conf and docs/ramalama.conf.5.md
  Files: ramalama/config.py, docs/ramalama.conf, docs/ramalama.conf.5.md, test/unit/test_config_documentation.py
Create SQLite-backed benchmarks persistence layer with schema migrations, device/config tracking, query helpers, and unit tests.
  • Add initial migration (00001.sql) defining meta, user_device, container_configuration, and llama_bench tables plus indexes and created_at columns
  • Implement connection helper with PRAGMAs, migration framework (schema versioning and upgrade loop), and DBManager API for saving configurations and llama-bench results
  • Track per-user-device info (hostname/OS/CPU/GPU/accel) and container configuration deduplication via unique index and upsert logic
  • Expose list_benchmarks(limit, offset) returning rich result items (including engine and formatted created_at) and support tabular rendering with optional columns
  • Add unit tests for migration creation, idempotent user_device updates, inserting benchmark results, and verifying ordering/limit/offset semantics
  Files: ramalama/benchmarks/manager.py, ramalama/benchmarks/migrations/00001.sql, ramalama/benchmarks/llama_bench.py, test/unit/test_benchmarks_manager.py
Update llama.cpp bench spec so llama-bench emits JSON suitable for parsing and persistence.
  • Override llama.cpp llama-bench command options to explicitly set --model, GPU-layer flags, --threads, and -o json for JSON output
  • Ensure the runtime args in TestConfiguration align with threads/ctx/gpu/batch used when invoking llama-bench
  Files: inference-spec/engines/llama.cpp.yaml
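The connection helper and migration loop described in the persistence-layer entry above might look roughly like the sketch below. It is illustrative only: the PRAGMAs, the meta/schema_version bookkeeping, and the filename-driven upgrade order (which would also address the later review note about hard-coded 0 to 1 upgrades) are assumptions about the approach, not the PR's exact implementation.

```python
import sqlite3
from contextlib import contextmanager
from pathlib import Path

MIGRATIONS_DIR = Path(__file__).parent / "migrations"  # e.g. 00001.sql, 00002.sql, ...


@contextmanager
def get_conn(db_path):
    """Open a connection with the usual PRAGMAs and commit on clean exit."""
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute("PRAGMA journal_mode = WAL")
    try:
        yield conn
        conn.commit()
    finally:
        conn.close()


def migrate(db_path):
    """Apply numbered migrations above the stored schema_version, in filename order."""
    with get_conn(db_path) as conn:
        # Simplified: in the PR the meta table presumably comes from 00001.sql itself.
        conn.execute("CREATE TABLE IF NOT EXISTS meta (key TEXT PRIMARY KEY, value TEXT)")
        row = conn.execute("SELECT value FROM meta WHERE key = 'schema_version'").fetchone()
        current = int(row["value"]) if row else 0
        for script in sorted(MIGRATIONS_DIR.glob("*.sql")):
            version = int(script.stem)  # 00001.sql -> 1
            if version <= current:
                continue
            conn.executescript(script.read_text())
            conn.execute(
                "INSERT INTO meta (key, value) VALUES ('schema_version', ?) "
                "ON CONFLICT(key) DO UPDATE SET value = excluded.value",
                (str(version),),
            )
            current = version
```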


@gemini-code-assist
Contributor

Summary of Changes

Hello @ieaves, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the benchmarking capabilities by introducing a robust system for persisting and managing benchmark results. It allows users to save detailed llama-bench outputs to a SQLite database, providing a historical record of performance across different configurations and environments. This feature is exposed through a new CLI command, enabling users to easily list and inspect past benchmark runs, which is crucial for performance analysis and regression detection.

Highlights

  • Benchmark Metrics Persistence: Introduced the capability to persist benchmark results to a SQLite database, allowing for historical tracking of performance metrics.
  • New 'benchmarks' CLI Command: Added a new top-level CLI command benchmarks with a list subcommand to view and manage stored benchmark results, supporting both table and JSON output formats.
  • Llama-bench Result Parsing and Data Models: Implemented new Python data models (LlamaBenchResult, TestConfiguration) and parsing logic to handle and standardize llama-bench JSON/JSONL output for database storage.
  • Database Management and Migration: Developed a DBManager and DBMigrationManager to handle SQLite database connections, schema creation, and migrations, ensuring the database structure is up-to-date.
  • Configuration for Benchmarks: Added a new benchmarks section to the main configuration (BaseConfig) to manage benchmark-related settings, such as the database path.

Contributor

@sourcery-ai sourcery-ai bot left a comment


Hey there - I've reviewed your changes - here's some feedback:

Blocking issues:

  • Avoiding SQL string concatenation: untrusted input concatenated with raw SQL query can result in SQL Injection. In order to execute raw query safely, prepared statement should be used. SQLAlchemy provides TextualSQL to easily used prepared statement with named parameters. For complex SQL composition, use SQL Expression Language or Schema Definition Language. In most cases, SQLAlchemy ORM will be a better option. (link)
  • Detected subprocess function 'CompletedProcess' without a static string. If this data can be controlled by a malicious actor, it may be an instance of command injection. Audit the use of this call to ensure it is not controllable by an external resource. You may consider using 'shlex.escape()'. (link)
  • Detected subprocess function 'CompletedProcess' without a static string. If this data can be controlled by a malicious actor, it may be an instance of command injection. Audit the use of this call to ensure it is not controllable by an external resource. You may consider using 'shlex.escape()'. (link)

General comments:

  • DBManager.save_llama_bench_result()’s docstring says it returns the number of rows inserted but the implementation doesn’t return anything; either return cursor.rowcount (or similar) or update the docstring/tests to avoid implying a return value.
  • The benchmarks logger level is derived from CONFIG.log_level at import time in manager.py, so subsequent changes to CONFIG.log_level won’t affect it; consider configuring the logger lazily or in an explicit init path instead of at module import.
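One minimal way to reconcile the docstring with the return value, reusing names that appear in this review's code excerpts (`get_conn`, `TABLE_COLUMNS`, and `as_db_tuple` are taken from the snippets below and may not match the PR exactly):

```python
def save_llama_bench_result(self, configuration, result) -> int:
    """Insert a single llama-bench result row. Returns rows inserted (1 or 0)."""
    configuration_id = self.save_container_configuration(configuration)
    columns = ["test_configuration_id"] + llama_bench.TABLE_COLUMNS
    placeholders = ", ".join("?" for _ in columns)
    insert_sql = f"INSERT INTO llama_bench ({', '.join(columns)}) VALUES ({placeholders})"
    with get_conn(self.db_path) as connection:
        cursor = connection.cursor()
        cursor.execute(insert_sql, (configuration_id, *result.as_db_tuple()))
        return cursor.rowcount  # 1 on success, matching the docstring
```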

## Individual Comments

### Comment 1
<location> `ramalama/cli.py:611` </location>
<code_context>
+        else:
+            test_type = "-"
+
+        tokens_sec = f"{row['tokens_per_sec']:.1f}" if row["tokens_per_sec"] else "-"
+        engine = (row["inference_engine"] or "")[:15]
+        device = (row["accel"] or "cpu")[:10]
</code_context>

<issue_to_address>
**issue (bug_risk):** Zero tokens/sec are rendered as '-' instead of '0.0', which can hide valid results.

Because this uses a truthiness check, a valid `tokens_per_sec` of `0`/`0.0` is rendered as `-`, conflating “very slow” with “no data”. Instead, check explicitly for `None`:

```python
value = row["tokens_per_sec"]
tokens_sec = f"{value:.1f}" if value is not None else "-"
```
</issue_to_address>
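Building on the fix above, a self-contained sketch of a renderer that applies the explicit `None` check; the column set, names, and widths here are illustrative assumptions, not the PR's actual table layout:

```python
def format_benchmark_rows(rows):
    """Render benchmark rows as a plain-text table (column set and widths are illustrative)."""
    header = f"{'ID':>4}  {'TEST':<8}  {'TOK/S':>8}  {'ENGINE':<15}  {'DEVICE':<10}"
    lines = [header, "-" * len(header)]
    for row in rows:
        value = row.get("tokens_per_sec")
        # Explicit None check so a genuine 0.0 result is still shown.
        tokens_sec = f"{value:.1f}" if value is not None else "-"
        engine = (row.get("inference_engine") or "")[:15]
        device = (row.get("accel") or "cpu")[:10]
        lines.append(
            f"{row.get('id', '-'):>4}  {str(row.get('test_type', '-')):<8}  "
            f"{tokens_sec:>8}  {engine:<15}  {device:<10}"
        )
    return "\n".join(lines)


print(format_benchmark_rows([
    {"id": 1, "test_type": "pp512", "tokens_per_sec": 812.4, "inference_engine": "llama.cpp", "accel": "cuda"},
    {"id": 2, "test_type": "tg128", "tokens_per_sec": 0.0, "inference_engine": "llama.cpp", "accel": "cuda"},
]))
```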

### Comment 2
<location> `ramalama/benchmarks/manager.py:272-273` </location>
<code_context>
+            ORDER BY lb.created_at DESC, lb.id DESC
+        """
+
+        if limit:
+            query += f" LIMIT {limit} OFFSET {offset}"
+
+        with get_conn(self.db_path) as connection:
</code_context>

<issue_to_address>
**suggestion (bug_risk):** Query builds LIMIT/OFFSET via f-strings instead of parameterized SQL, and silently ignores limit=0.

Consider making the `limit` handling explicit and fully parameterized, e.g.:

```python
params: list[Any] = []
if limit is not None:
    query += " LIMIT ? OFFSET ?"
    params.extend([limit, offset])

cursor.execute(query, params)
```

This avoids surprising behavior for `limit=0` and keeps the query consistent with parameterized SQL practices.

Suggested implementation:

```python
            FROM llama_bench lb
            JOIN container_configuration cc ON lb.test_configuration_id = cc.id
            JOIN user_device ud ON cc.user_device_id = ud.id
            ORDER BY lb.created_at DESC, lb.id DESC
        """

        params: list[Any] = []
        if limit is not None:
            query += " LIMIT ? OFFSET ?"
            params.extend([limit, offset])

```

To fully integrate this change, you should also:

1. Ensure `Any` is imported at the top of `ramalama/benchmarks/manager.py` (if not already present), e.g.:
   ```python
   from typing import Any
   ```
2. Update the corresponding `cursor.execute` call in this method to pass `params`:
   - If it currently does `cursor.execute(query)`, change it to `cursor.execute(query, params)`.
   - If there are already existing query parameters, instead of initializing `params: list[Any] = []`, you should:
     - Initialize `params` with the existing parameters before this snippet.
     - Keep the `params.extend([limit, offset])` so that `LIMIT` and `OFFSET` are appended correctly.
3. Make sure this `params` variable is in the same scope as the `cursor.execute` call that runs this query.
</issue_to_address>

### Comment 3
<location> `ramalama/benchmarks/manager.py:209-214` </location>
<code_context>
+        migrate(self.db_path)
+        self.user_device_id = update_user_device(self.db_path)
+
+    def save_llama_bench_result(
+        self, configuration: llama_bench.TestConfiguration, result: llama_bench.LlamaBenchResult
+    ):
+        """Insert a single llama-bench result row. Returns rows inserted (1 or 0)."""
+
+        configuration_id = self.save_container_configuration(configuration)
+
+        columns = ["test_configuration_id"] + llama_bench.TABLE_COLUMNS
</code_context>

<issue_to_address>
**suggestion (performance):** Saving each benchmark result re-inserts identical container configuration rows, which can bloat the DB.

`save_llama_bench_result()` calls `save_container_configuration()` for every result in `bench()`, so the same configuration is inserted repeatedly. This unnecessarily grows `container_configuration` and complicates querying distinct configs.

Instead, persist the configuration once per run, reuse its `id` for all results, and bulk-insert the result rows using that shared `configuration_id`. Exposing a method that returns `configuration_id` would support this pattern.

Suggested implementation:

```python
    def save_llama_bench_result(
        self, configuration: llama_bench.TestConfiguration, result: llama_bench.LlamaBenchResult
    ) -> int:
        """Insert a single llama-bench result row. Returns rows inserted (1 or 0).

        For optimal performance across a run, prefer:
        1) Persist the configuration once via `save_llama_bench_configuration()`
        2) Bulk-insert results via `save_llama_bench_results(configuration_id, results)`
        """

        configuration_id = self.save_container_configuration(configuration)
        return self.save_llama_bench_results(configuration_id, [result])

    def save_llama_bench_configuration(
        self, configuration: llama_bench.TestConfiguration
    ) -> int:
        """Persist a llama-bench container configuration and return its id.

        This is a thin wrapper over `save_container_configuration()` to make the
        llama-bench workflow explicit and allow re-use of the configuration id
        across multiple result inserts in a single run.
        """
        return self.save_container_configuration(configuration)

    def save_llama_bench_results(
        self,
        configuration_id: int,
        results: "Collection[llama_bench.LlamaBenchResult]",
    ) -> int:
        """Bulk-insert llama-bench result rows sharing a configuration id.

        Returns the number of rows inserted.
        """
        # Fast path: nothing to insert.
        if not results:
            return 0

        columns = ["test_configuration_id"] + llama_bench.TABLE_COLUMNS
        placeholders = ", ".join("?" for _ in columns)
        insert_sql = f"INSERT INTO llama_bench ({', '.join(columns)}) VALUES ({placeholders})"

        rows = [(configuration_id, *result.as_db_tuple()) for result in results]

        with get_conn(self.db_path) as connection:
            cursor = connection.cursor()
            cursor.executemany(insert_sql, rows)
            return cursor.rowcount

    def save_container_configuration(self, configuration: llama_bench.TestConfiguration) -> int:
        """Insert a container_configuration row and return its id."""

```

1. At the top of `ramalama/benchmarks/manager.py`, import `Collection` from `collections.abc` and use the non-string annotation if your type-checking configuration supports it:

   ```python
   from collections.abc import Collection
   ```

   Then update the type annotation in `save_llama_bench_results` to:

   ```python
   def save_llama_bench_results(
       self,
       configuration_id: int,
       results: Collection[llama_bench.LlamaBenchResult],
   ) -> int:
   ```

2. Update the llama-bench `bench()` workflow (wherever it lives) to avoid re-inserting identical configurations:

   ```python
   configuration_id = db_manager.save_llama_bench_configuration(configuration)
   db_manager.save_llama_bench_results(configuration_id, results)
   ```

   instead of calling `save_llama_bench_result()` for every `result`.
</issue_to_address>

### Comment 4
<location> `ramalama/transports/base.py:506-387` </location>
<code_context>
+        return None
+    if isinstance(value, bool):
+        return 1 if value else 0
+    try:
+        return int(value)
+    except (ValueError, TypeError):
</code_context>

<issue_to_address>
**suggestion (bug_risk):** Catch-all around benchmark execution may hide failures unrelated to result capture/parsing.

Using a broad `try/except Exception` and falling back to `self.execute_command()` will mask unrelated errors (e.g., logic or DB issues) as a simple command execution, making real failures harder to see and debug.

If the goal is to recover only from capture/parsing issues, narrow the `try` to just that logic or catch specific exception types so genuine runtime errors still propagate while malformed output is handled gracefully.
</issue_to_address>

### Comment 5
<location> `ramalama/benchmarks/manager.py:20-21` </location>
<code_context>
+from ramalama.rag import rag_image
+from ramalama.version import version
+
+logger = logging.getLogger("ramalama.benchmarks")
+logger.setLevel(CONFIG.log_level or LogLevel.WARNING)
+
+
</code_context>

<issue_to_address>
**suggestion:** Setting the logger level at import time couples logging configuration to CONFIG and may override application-level logging.

This also assumes `CONFIG` is fully initialized at import time. Instead, consider leaving the module logger unconfigured here and have the main application/CLI set log levels from `CONFIG`, so consumers (including tests or embedding apps) can control logging without surprises.

Suggested implementation:

```python
logger = logging.getLogger("ramalama.benchmarks")

```

If `LogLevel` and/or `CONFIG` are only used here for setting the logger level and not elsewhere in `manager.py`, you should also remove those unused imports:

- Remove `CONFIG` from `from ramalama.config import CONFIG, ...`
- Remove `LogLevel` from `from ramalama.log_levels import LogLevel`

Logging configuration based on `CONFIG` should instead be done in the main application/CLI startup code (e.g., where `CONFIG` is loaded), by calling `logging.getLogger("ramalama.benchmarks").setLevel(...)` there.
</issue_to_address>

### Comment 6
<location> `test/unit/test_benchmarks_manager.py:49-58` </location>
<code_context>
+def test_save_llama_bench_result_inserts_all(tmp_path, monkeypatch):
</code_context>

<issue_to_address>
**suggestion (testing):** Clarify expectations for `save_llama_bench_result` return value and assert on DB state only

In `test_save_llama_bench_result_inserts_all`, `save_llama_bench_result` currently returns `None`, so `inserted in (None, 1)` will always be true via the `None` branch and the `1` case is dead code. I’d suggest removing `inserted` altogether and asserting only on DB state (row counts/contents), which you’re already validating. If you do want this function to return an insertion count, then update the implementation accordingly and assert on that specific value instead of a union of possibilities.

Suggested implementation:

```python
def test_save_llama_bench_result_inserts_all(tmp_path, monkeypatch):

```

```python
    db = manager.DBManager(db_path)

    cfg = llama_bench.TestConfiguration(

```

```python
    )

```

```python
    db.save_llama_bench_result(

```

```python

```

These edits assume the test currently:
1. Assigns the result of `db.save_llama_bench_result(...)` to a variable named `inserted`.
2. Has a line `assert inserted in (None, 1)`.

If the exact spelling/spacing differs, you may need to adjust the `SEARCH` strings accordingly. No changes to `save_llama_bench_result` itself are required unless you later decide to give it a meaningful return value and explicitly assert on that.
</issue_to_address>

### Comment 7
<location> `test/unit/test_benchmarks_manager.py:32-29` </location>
<code_context>
+    con.close()
+
+
+def test_update_user_device_idempotent(tmp_path, monkeypatch):
+    _patch_config(monkeypatch, tmp_path)
+    db_path = tmp_path / "bench.sqlite"
+    manager.migrate(db_path)
+
+    first_id = manager.update_user_device(db_path)
+    second_id = manager.update_user_device(db_path)
+
+    assert first_id == second_id
+
+    con = sqlite3.connect(db_path)
+    cur = con.cursor()
+    cur.execute("SELECT COUNT(*) FROM user_device")
+    assert cur.fetchone()[0] == 1
+    con.close()
+
+
</code_context>

<issue_to_address>
**suggestion (testing):** Add test coverage for `DBManager.list_benchmarks` to validate filtering and joins

Current tests don’t exercise `DBManager.list_benchmarks`, so regressions in its join/ordering/limit logic wouldn’t be caught and could break `benchmarks list`. Please add a test that:

- inserts at least two `llama_bench` rows (with associated config/device data),
- calls `list_benchmarks()` with and without `limit`/`offset`, and
- checks the key fields in the returned rows (e.g. `tokens_per_sec`, `inference_engine`, `hostname`, `accel`, `created_at`) and that results are ordered most-recent-first.

This will better validate the read path against the persisted data.

Suggested implementation:

```python
from ramalama.benchmarks import llama_bench, manager


def _patch_config(monkeypatch, tmp_path):
    # Avoid external engine calls and keep paths isolated per test.
    monkeypatch.setattr(manager.CONFIG, "engine", None)
    monkeypatch.setattr(manager.CONFIG, "runtime", "llama.cpp")
    monkeypatch.setattr(manager.CONFIG, "store", str(tmp_path / "store"))
    monkeypatch.setattr(manager.CONFIG, "container", False)
    monkeypatch.setattr(manager, "get_accel", lambda: "none")


def test_list_benchmarks_filters_joins_and_orders(tmp_path, monkeypatch):
    # Arrange: configure isolated environment and migrate schema
    _patch_config(monkeypatch, tmp_path)
    db_path = tmp_path / "bench.sqlite"
    manager.migrate(db_path)

    # Ensure we have a user_device row for the join
    first_device_id = manager.update_user_device(db_path)
    second_device_id = manager.update_user_device(db_path)
    assert first_device_id == second_device_id
    user_device_id = first_device_id

    con = sqlite3.connect(db_path)
    try:
        cur = con.cursor()

        # Insert two container configurations (for join/inference_engine/runtime)
        cur.execute(
            """
            INSERT INTO container_configuration (inference_engine, runtime, extra_args)
            VALUES (?, ?, ?)
            """,
            ("engine-a", "llama.cpp", "--foo"),
        )
        config_a_id = cur.lastrowid

        cur.execute(
            """
            INSERT INTO container_configuration (inference_engine, runtime, extra_args)
            VALUES (?, ?, ?)
            """,
            ("engine-b", "llama.cpp", "--bar"),
        )
        config_b_id = cur.lastrowid

        # Insert two llama_bench rows with different timestamps and TPS
        # so that ordering (most recent first) and limits can be asserted.
        cur.execute(
            """
            INSERT INTO llama_bench (
                container_configuration_id,
                user_device_id,
                tokens_per_sec,
                created_at
            )
            VALUES (?, ?, ?, ?)
            """,
            (config_a_id, user_device_id, 100.0, "2024-01-01T00:00:00Z"),
        )
        older_id = cur.lastrowid

        cur.execute(
            """
            INSERT INTO llama_bench (
                container_configuration_id,
                user_device_id,
                tokens_per_sec,
                created_at
            )
            VALUES (?, ?, ?, ?)
            """,
            (config_b_id, user_device_id, 200.0, "2024-01-02T00:00:00Z"),
        )
        newer_id = cur.lastrowid

        assert older_id != newer_id
        con.commit()
    finally:
        con.close()

    # Act: fetch benchmarks via the DBManager read path
    all_rows = list(manager.list_benchmarks(db_path))
    limited_rows = list(manager.list_benchmarks(db_path, limit=1))
    offset_rows = list(manager.list_benchmarks(db_path, limit=1, offset=1))

    # Assert: we got both rows, ordered most-recent-first
    assert len(all_rows) == 2
    assert all_rows[0]["tokens_per_sec"] == 200.0
    assert all_rows[1]["tokens_per_sec"] == 100.0

    # Key joined / projected fields should be present and consistent
    first = all_rows[0]
    second = all_rows[1]

    # These fields exercise the join between llama_bench, container_configuration,
    # and user_device; exact field names mirror list_benchmarks' SELECT.
    assert first["inference_engine"] == "engine-b"
    assert second["inference_engine"] == "engine-a"

    # hostname/accel come from user_device; created_at from llama_bench
    assert isinstance(first["hostname"], str)
    assert first["hostname"]
    assert isinstance(first["accel"], str)
    assert first["accel"] == "none"

    assert first["created_at"] > second["created_at"]

    # Limit and offset behaviour:
    # - limit=1 should return only the most recent row
    assert len(limited_rows) == 1
    assert limited_rows[0]["tokens_per_sec"] == 200.0

    # - limit=1, offset=1 should return the second-most-recent row
    assert len(offset_rows) == 1
    assert offset_rows[0]["tokens_per_sec"] == 100.0

```

This test assumes the following about the existing schema and API, which you may need to align with the actual code:

1. Table schemas:
   - `container_configuration` has columns `(id INTEGER PRIMARY KEY, inference_engine TEXT, runtime TEXT, extra_args TEXT, ...)`.
   - `llama_bench` has columns `(id INTEGER PRIMARY KEY, container_configuration_id INTEGER, user_device_id INTEGER, tokens_per_sec REAL, created_at TEXT, ...)`.
   If your actual column names differ (e.g. `tps` instead of `tokens_per_sec`, `config_id` instead of `container_configuration_id`, or no `extra_args` column), update the `INSERT` statements accordingly.

2. Foreign keys:
   - `llama_bench.container_configuration_id` must reference `container_configuration.id`.
   - `llama_bench.user_device_id` must reference `user_device.id`.
   If your FK column names differ, adjust both the `INSERT` column lists and the join expectations in `list_benchmarks`.

3. `list_benchmarks` API:
   - This test calls `manager.list_benchmarks(db_path, limit=None, offset=None)` and expects an iterable of dict-like rows keyed by:
     `tokens_per_sec`, `inference_engine`, `hostname`, `accel`, and `created_at`.
   If the function is a method (e.g. `DBManager(db_path).list_benchmarks(...)`) or returns rows with different keys or types, update the invocation and assertions to match.

4. Ordering:
   - The test assumes `list_benchmarks` orders by `created_at` descending (most recent first).
   If it orders differently (e.g. by `id`), ensure that logic is updated or adjust the assertions to reflect the intended contract.
</issue_to_address>

### Comment 8
<location> `ramalama/benchmarks/manager.py:223` </location>
<code_context>
            cursor.execute(insert_sql, values)
</code_context>

<issue_to_address>
**security (python.sqlalchemy.security.sqlalchemy-execute-raw-query):** Avoiding SQL string concatenation: untrusted input concatenated with raw SQL query can result in SQL Injection. In order to execute raw query safely, prepared statement should be used. SQLAlchemy provides TextualSQL to easily used prepared statement with named parameters. For complex SQL composition, use SQL Expression Language or Schema Definition Language. In most cases, SQLAlchemy ORM will be a better option.

*Source: opengrep*
</issue_to_address>

### Comment 9
<location> `ramalama/transports/base.py:395` </location>
<code_context>
                return subprocess.CompletedProcess(args=cmd, returncode=0, stdout="", stderr="")
</code_context>

<issue_to_address>
**security (python.lang.security.audit.dangerous-subprocess-use-audit):** Detected subprocess function 'CompletedProcess' without a static string. If this data can be controlled by a malicious actor, it may be an instance of command injection. Audit the use of this call to ensure it is not controllable by an external resource. You may consider using 'shlex.escape()'.

*Source: opengrep*
</issue_to_address>

### Comment 10
<location> `ramalama/transports/base.py:401` </location>
<code_context>
                return subprocess.CompletedProcess(args=cmd, returncode=0, stdout="", stderr="")
</code_context>

<issue_to_address>
**security (python.lang.security.audit.dangerous-subprocess-use-audit):** Detected subprocess function 'CompletedProcess' without a static string. If this data can be controlled by a malicious actor, it may be an instance of command injection. Audit the use of this call to ensure it is not controllable by an external resource. You may consider using 'shlex.escape()'.

*Source: opengrep*
</issue_to_address>


Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant new feature: persisting benchmark results to a local SQLite database. It includes a versioned database schema with migration support, data models for benchmark results, and new CLI commands (ramalama benchmarks list) to view the stored data. The existing ramalama bench command is enhanced to parse and store its output.

The implementation is well-structured and robust, with good separation of concerns between data models, database management, and CLI. The use of a context manager for database connections and a dedicated migration manager are excellent practices.

I've identified a few areas for improvement, mainly related to making exception handling more specific and improving code clarity in a couple of places. Overall, this is a great addition to the project.

Contributor

@sourcery-ai sourcery-ai bot left a comment


New security issues found

Contributor

@sourcery-ai sourcery-ai bot left a comment


Hey - I've found 1 security issue, 5 other issues, and left some high level feedback:

Security issues:

  • Detected subprocess function 'CompletedProcess' without a static string. If this data can be controlled by a malicious actor, it may be an instance of command injection. Audit the use of this call to ensure it is not controllable by an external resource. You may consider using 'shlex.escape()'. (link)

General comments:

  • In BaseTransport.bench the broad outer except Exception followed by a fallback to self.execute_command(cmd, args) means a benchmark command that already ran (but failed parsing/persistence) will be executed a second time; consider distinguishing parse/persistence failures from command-execution failures and avoid re-running the benchmark in the error path.
  • The migration path in run_next_migration/DBMigrationManager.run_migrations only explicitly handles version 0 → 1 and otherwise raises, which will make adding schema version 2+ fragile; it may be worth wiring migrations by filename or a mapping so future upgrades can be chained without modifying this function’s control flow.

## Individual Comments

### Comment 1
<location> `ramalama/transports/base.py:425-429` </location>
<code_context>
+            try:
+                bench_results = parse_json(result.stdout)
+            except (json.JSONDecodeError, ValueError):
+                message = ("Could not parse benchmark output. Expected json but got: \n", result.stdout)
+                raise ValueError(message)
+
+            if output_format == "json":
</code_context>

<issue_to_address>
**suggestion:** The parse error message is built as a tuple, which produces an unhelpful exception string.

Building `message` as a tuple makes the `ValueError` look like `("Could not parse ...", "<stdout>")` instead of a clear string, which hurts readability in logs and debugging. Please build a single formatted string instead, for example:

```python
message = f"Could not parse benchmark output. Expected JSON but got:\n{result.stdout}"
raise ValueError(message)
```

```suggestion
            try:
                bench_results = parse_json(result.stdout)
            except (json.JSONDecodeError, ValueError):
                message = f"Could not parse benchmark output. Expected JSON but got:\n{result.stdout}"
                raise ValueError(message)
```
</issue_to_address>

### Comment 2
<location> `ramalama/benchmarks/manager.py:284-286` </location>
<code_context>
+            LIMIT ? OFFSET ?
+        """
+
+        effective_limit = limit if limit is not None else -1
+        effective_offset = offset if limit is not None else 0
+        params = [effective_limit, effective_offset]
+
+        with get_conn(self.db_path) as connection:
</code_context>

<issue_to_address>
**issue:** Offset is silently ignored when no limit is provided, which may surprise CLI users.

In `list_benchmarks`, when `limit` is `None` you force `effective_offset = 0`, so a user-specified `offset` is ignored. From the CLI, `--offset` appears usable on its own. Either honor `offset` regardless of `limit`, or explicitly reject `offset` when `limit` is not set so the behavior is clear to users.
</issue_to_address>
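If the intent is to honor `offset` on its own, SQLite's convention that a negative LIMIT means "no limit" makes that straightforward. A sketch against the parameterized `LIMIT ? OFFSET ?` query shown above (variable names follow the excerpt; this is an assumption about intent, not the PR's final behavior):

```python
# Negative LIMIT == unlimited in SQLite, so OFFSET can apply even without a limit.
effective_limit = limit if limit is not None else -1
effective_offset = offset or 0  # no longer tied to whether limit was set
params = [effective_limit, effective_offset]  # bound to the trailing "LIMIT ? OFFSET ?"
```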

### Comment 3
<location> `test/unit/test_benchmarks_manager.py:15-24` </location>
<code_context>
+    monkeypatch.setattr(manager, "get_accel", lambda: "none")
+
+
+def test_migrate_creates_schema(tmp_path, monkeypatch):
+    _patch_config(monkeypatch, tmp_path)
+    db_path = tmp_path / "bench.sqlite"
+
+    manager.migrate(db_path)
+
+    con = sqlite3.connect(db_path)
+    cur = con.cursor()
+    cur.execute("SELECT name FROM sqlite_master WHERE type='table'")
+    table_names = {row[0] for row in cur.fetchall()}
+    assert {"meta", "user_device", "container_configuration", "llama_bench"} <= table_names
+
+    cur.execute("SELECT value FROM meta WHERE key='schema_version'")
+    assert cur.fetchone()[0] == "1"
+    con.close()
+
</code_context>

<issue_to_address>
**suggestion (testing):** Add a test for DBManager initialization with a missing DB path

We currently test `migrate`, `update_user_device`, and the happy-path `DBManager` constructor, but we never verify that `DBManager(None)` raises `MissingDBPathError` as designed. Please add a small test that constructs `DBManager` with `None` (or no config path) and asserts that this exception is raised, so this behaviour is covered and guarded against regressions.

Suggested implementation:

```python
import pytest
from ramalama.benchmarks import llama_bench, manager

```

```python
def _patch_config(monkeypatch, tmp_path):
    # Avoid external engine calls and keep paths isolated per test.
    monkeypatch.setattr(manager.CONFIG, "engine", None)
    monkeypatch.setattr(manager.CONFIG, "runtime", "llama.cpp")
    monkeypatch.setattr(manager.CONFIG, "store", str(tmp_path / "store"))
    monkeypatch.setattr(manager.CONFIG, "container", False)
    monkeypatch.setattr(manager, "get_accel", lambda: "none")


def test_dbmanager_missing_db_path_raises(tmp_path, monkeypatch):
    _patch_config(monkeypatch, tmp_path)

    with pytest.raises(manager.MissingDBPathError):
        manager.DBManager(None)

```
</issue_to_address>

### Comment 4
<location> `test/unit/test_benchmarks_manager.py:107-116` </location>
<code_context>
+def test_list_benchmarks_orders_and_limits(tmp_path, monkeypatch):
</code_context>

<issue_to_address>
**suggestion (testing):** Consider adding coverage for the empty-results case in list_benchmarks

Right now we only verify ordering and pagination. Since `DBManager.list_benchmarks` is also expected to return an empty `LlamaBenchResultCollection` when there are no rows, please add a test that calls `list_benchmarks` on a fresh DB (no inserts) and asserts that the collection is empty, so downstream code can rely on that behavior.

Suggested implementation:

```python
    assert bench_row[1:] == ("abc123", "model.gguf", 2, 8, 16, 1.5, 0.1)
    con.close()


def test_list_benchmarks_empty_returns_empty_collection(tmp_path, monkeypatch):
    _patch_config(monkeypatch, tmp_path)
    db_path = tmp_path / "bench.sqlite"
    db = manager.DBManager(db_path)

    collection = db.list_benchmarks()

    # list_benchmarks on a fresh DB should return an empty LlamaBenchResultCollection
    assert isinstance(collection, llama_bench.LlamaBenchResultCollection)
    assert not collection


def test_list_benchmarks_orders_and_limits(tmp_path, monkeypatch):
    _patch_config(monkeypatch, tmp_path)
    db_path = tmp_path / "bench.sqlite"
    db = manager.DBManager(db_path)

    cfg_a = llama_bench.TestConfiguration(
        container_image="image-a",
        container_runtime="docker",
        inference_engine="engine-a",
        runtime_args={"threads": 1},
    )

```

If `LlamaBenchResultCollection` is not currently imported into this test module (e.g. via `import llama_bench` or `from ... import LlamaBenchResultCollection`), make sure the existing imports already provide access to `llama_bench.LlamaBenchResultCollection`. If they do not, add the appropriate import at the top of the file to ensure `llama_bench` is available.
</issue_to_address>

### Comment 5
<location> `ramalama/transports/base.py:405` </location>
<code_context>

     def bench(self, args, cmd: list[str]):
         set_accel_env_vars()
-        self.execute_command(cmd, args)
</code_context>

<issue_to_address>
**issue (complexity):** Consider refactoring the new bench logic into small helper methods for running, parsing, rendering, and saving results so bench stays a simple orchestrator with narrower try/except scopes.

You can keep all the new behavior but lower complexity by pulling the major branches out of `bench` and tightening exception scopes.

### 1. Extract process execution + parsing

Move the dryrun/container/local logic and JSON parsing into small helpers. This keeps `bench` as an orchestrator and isolates the catch‑all fallback to just the execution/parsing step.

```python
def _run_bench_process(self, args, cmd: list[str]) -> subprocess.CompletedProcess:
    if args.dryrun:
        if args.container:
            self.engine.dryrun()
        else:
            dry_run(cmd)
        return subprocess.CompletedProcess(args=cmd, returncode=0, stdout="", stderr="")

    if args.container:
        self.setup_container(args)
        self.setup_mounts(args)
        self.engine.add([args.image] + cmd)
        return self.engine.run_process()

    return run_cmd(cmd, encoding="utf-8")


def _parse_bench_results(self, stdout: str):
    try:
        return parse_json(stdout)
    except (json.JSONDecodeError, ValueError):
        msg = ("Could not parse benchmark output. Expected json but got: \n", stdout)
        raise ValueError(msg)
```

Then `bench` becomes simpler and the outer `try/except` only guards the execution/parsing part that falls back to `execute_command`:

```python
def bench(self, args, cmd: list[str]):
    set_accel_env_vars()
    output_format = getattr(args, "format", "table")

    try:
        result = self._run_bench_process(args, cmd)
        bench_results = self._parse_bench_results(result.stdout)
    except Exception as e:
        logger.debug(f"Failed to capture benchmark output: {e}")
        self.execute_command(cmd, args)
        return

    self._render_bench_results(output_format, result.stdout, bench_results)
    self._save_bench_results(args, bench_results)
```

### 2. Extract rendering logic

Avoid branching for formatting directly in `bench`:

```python
def _render_bench_results(self, output_format: str, raw_stdout: str, bench_results):
    if output_format == "json":
        print(raw_stdout)
    else:
        print_bench_results(bench_results)
```

### 3. Isolate DB side effects and narrow exception scope

Keep DB exceptions local to the save helper, log and continue:

```python
def _save_bench_results(self, args, bench_results):
    db_path = CONFIG.benchmarks.db_path
    if not (db_path and not CONFIG.benchmarks.disable):
        return

    try:
        test_config = TestConfiguration.from_args(args)
        db = DBManager(db_path)
        inserted = 0
        if bench_results:
            configuration_id = db.save_llama_bench_configuration(test_config)
            inserted = db.save_llama_bench_results(configuration_id, bench_results)
        logger.debug(f"Saved {inserted} benchmark result(s) to database")
    except Exception as e:
        logger.warning(f"Failed to save benchmark results to database: {e}")
```

This keeps all existing behavior (dry-run/container/local, parse, print, DB save, fallback to `execute_command`) but flattens the control flow and isolates responsibilities into testable units.
</issue_to_address>

### Comment 6
<location> `ramalama/transports/base.py:416` </location>
<code_context>
                result = subprocess.CompletedProcess(args=cmd, returncode=0, stdout="", stderr="")
</code_context>

<issue_to_address>
**security (python.lang.security.audit.dangerous-subprocess-use-audit):** Detected subprocess function 'CompletedProcess' without a static string. If this data can be controlled by a malicious actor, it may be an instance of command injection. Audit the use of this call to ensure it is not controllable by an external resource. You may consider using 'shlex.escape()'.

*Source: opengrep*
</issue_to_address>


Collaborator Author

ieaves commented Jan 6, 2026

@olliewalsh @rhatdan This adds a persistence mechanism for benchmark results so that users can track performance metrics over time and against different image / accelerator configurations. Would you be willing to take a look?

Collaborator

@olliewalsh olliewalsh left a comment


@ieaves hey, good idea. I would be concerned about the maintenance cost of this though.
I would be inclined to stop at just adding support for json output. The user could then feed the results into $whatever they want to use to store and query performance data, e.g. grafana or elastic/kibana.

for a value and set the variable only if it is set on the host.

#### **--format**
Set the output format of the benchmark results (default: table).
Collaborator


should list the options

Collaborator Author

ieaves commented Jan 6, 2026

@ieaves hey, good idea. I would be concerned about the maintenance cost of this though. I would be inclined to stop at just adding support for json output. The user could then feed the results into $whatever they want to use to store and query performance data, e.g. grafana or elastic/kibana.

@olliewalsh I have to admit I'm squeamish about adding sqlite as well (particularly without dependencies to help manage schema) but I had in mind adding tracking for real-time chat performance and potentially other test results beyond those from llama-bench next. Since different benchmarking tools and the real-time results would all follow different schemas we'd end up having to mix them together in a single json file and then manually manage loading / parsing / filtering them informally throughout the rest of the benchmarking code (not to mention the potential performance costs of deserializing the entire file on every read). It's not a slam dunk case for going the db route either but all roads felt a little icky in different ways.

Bundling all of the system / image / benchmark data into a json record and dumping to a jsonl file is definitely easier to manage though and it'd be pretty easy for me to switch the implementation over. I think json is unambiguously the better option if we only care about discrete benchmark test results from llama-bench.
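For what that switch could look like, here is a minimal JSONL sketch with a record-type discriminator so llama-bench results, chat metrics, and other schemas can share one file. The function names and the `kind` field are hypothetical, not the refactored implementation:

```python
import json
import time
from pathlib import Path
from typing import Optional


def append_benchmark_record(path: Path, kind: str, payload: dict) -> None:
    """Append one benchmark record; a `kind` field lets different schemas share the file."""
    record = {"kind": kind, "created_at": time.time(), **payload}
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def load_benchmark_records(path: Path, kind: Optional[str] = None) -> list[dict]:
    """Read the JSONL file line by line and keep only the requested record kind."""
    records = []
    if not path.exists():
        return records
    with path.open("r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            if kind is None or record.get("kind") == kind:
                records.append(record)
    return records


# e.g. append_benchmark_record(Path("benchmarks.jsonl"), "llama-bench", {"tokens_per_sec": 812.4})
```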

@ieaves ieaves force-pushed the metrics branch 2 times, most recently from 5dc2c43 to e3671da on January 22, 2026 at 06:37
Collaborator Author

ieaves commented Jan 22, 2026

@olliewalsh I've refactored this to use a jsonl storage solution and updated the docs to reflect your comments. Please take another look if you get a chance.
