
Conversation

@christinaexyou
Collaborator

@christinaexyou christinaexyou commented Sep 16, 2025

Summary by Sourcery

Add a new LM-Eval inline provider alongside the existing remote provider structure by relocating remote code to a dedicated subpackage, updating provider specs and configurations, and introducing a full suite of unit tests for inline evaluation functionality.

New Features:

  • Introduce LMEval inline provider implementation (LMEvalInline) and corresponding InlineProviderSpec for on-the-fly evaluations
  • Add comprehensive unit tests for the inline provider covering spec registration, initialization, benchmark/job management, argument building, command execution, and result parsing

Enhancements:

  • Refactor the remote provider by moving the existing lmeval and provider modules into a 'remote' subpackage, and update import paths and provider specs accordingly
  • Update providers.d YAML configs to reference the new remote and inline module paths
  • Add run-inline.yaml configuration for the inline provider deployment

Chores:

  • Remove stray whitespace and add __init__.py files for the newly created inline and remote packages

@sourcery-ai
Contributor

sourcery-ai bot commented Sep 16, 2025

Reviewer's Guide

This PR adds a new inline LM-Eval provider: it introduces LMEvalInline with full job and benchmark management, command building/execution, and result parsing; refactors the existing remote provider into a dedicated subpackage with updated imports and specs; relaxes the use_k8s configuration validation to support inline mode; and adds extensive unit tests and a sample YAML for inline usage.

ER diagram for provider configuration YAMLs (inline and remote)

```mermaid
erDiagram
    PROVIDER {
        string provider_id
        string provider_type
        string config_class
        string module
        list pip_packages
    }
    CONFIG {
        string base_url
        bool use_k8s
    }
    PROVIDER ||--o| CONFIG : has
    REMOTE_PROVIDER {
        string provider_id
        string provider_type
        string config_class
        string module
        list pip_packages
    }
    REMOTE_CONFIG {
        string url
        int max_tokens
        string api_token
        bool tls_verify
        bool use_k8s
    }
    REMOTE_PROVIDER ||--o| REMOTE_CONFIG : has
```

Class diagram for new LMEvalInline provider (inline mode)

```mermaid
classDiagram
    class LMEvalInline {
        +LMEvalEvalProviderConfig config
        +dict[str, Benchmark] benchmarks
        +list[Job] _jobs
        +dict[str, dict[str, str]] _job_metadata
        +async initialize()
        +async list_benchmarks() ListBenchmarksResponse
        +async get_benchmark(benchmark_id: str) Benchmark | None
        +async register_benchmark(benchmark: Benchmark)
        +async run_eval(benchmark_id: str, benchmark_config: BenchmarkConfig, limit="2") Job
        +async build_command(task_config: BenchmarkConfig, benchmark_id: str, limit: str, stored_benchmark: Benchmark | None) list[str]
        +async _run_command(cmd: list[str], job_id: str)
        +_parse_lmeval_results(stdout: str) dict
        +_read_results_from_json(output_path: str) dict
        +async job_cancel(benchmark_id: str, job_id: str)
        +async job_result(benchmark_id: str, job_id: str) EvaluateResponse
        +async job_status(benchmark_id: str, job_id: str) Job | None
        +async shutdown()
    }
    LMEvalInline --|> Eval
    LMEvalInline --|> BenchmarksProtocolPrivate
    LMEvalInline o-- LMEvalEvalProviderConfig
    LMEvalInline o-- Benchmark
    LMEvalInline o-- Job
    LMEvalInline o-- EvaluateResponse
```

Class diagram for provider spec registration (inline and remote)

```mermaid
classDiagram
    class InlineProviderSpec {
        +api: Api
        +provider_type: str
        +pip_packages: list[str]
        +config_class: str
        +module: str
    }
    class RemoteProviderSpec {
        +api: Api
        +provider_type: str
        +pip_packages: list[str]
        +config_class: str
        +module: str
    }
    class ProviderSpec
    InlineProviderSpec --|> ProviderSpec
    RemoteProviderSpec --|> ProviderSpec
```

File-Level Changes

Introduce inline LMEval provider implementation
  Details:
  • Add LMEvalInline class with initialize, benchmark/job methods, command creation, execution, and parsing logic
  • Create InlineProviderSpec in inline/provider.py and an __init__.py helper
  • Add run-inline.yaml and a providers.d inline YAML for sample configuration
  • Add comprehensive unit tests in tests/test_lmeval_inline.py covering initialization, command building, argument handling, error cases, run_eval, and _run_command
  Files:
  • src/llama_stack_provider_lmeval/inline/lmeval.py
  • src/llama_stack_provider_lmeval/inline/provider.py
  • src/llama_stack_provider_lmeval/inline/__init__.py
  • providers.d/inline/eval/trustyai_lmeval.yaml
  • run-inline.yaml
  • tests/test_lmeval_inline.py

Refactor the remote provider into its own subpackage
  Details:
  • Move the existing lmeval and provider modules into a remote/ subdirectory
  • Update import paths in code and tests to point to src/llama_stack_provider_lmeval/remote
  • Adjust the remote provider spec module path in the providers.d/remote/eval YAML
  Files:
  • src/llama_stack_provider_lmeval/remote/lmeval.py
  • src/llama_stack_provider_lmeval/remote/provider.py
  • src/llama_stack_provider_lmeval/remote/__init__.py
  • providers.d/remote/eval/trustyai_lmeval.yaml
  • tests/test_lmeval.py

Relax configuration validation for inline mode
  Details:
  • Remove the strict use_k8s boolean validation in LMEvalEvalProviderConfig
  • Allow use_k8s to be false without error
  Files:
  • src/llama_stack_provider_lmeval/config.py

Expand and update test coverage
  Details:
  • Add inline provider tests for model args, lmeval args, task name extraction, benchmarking, job lifecycle, and error scenarios
  • Update the existing test_lmeval suite imports for the remote code relocation
  Files:
  • tests/test_lmeval_inline.py
  • tests/test_lmeval.py


@christinaexyou christinaexyou force-pushed the RHOAIENG-33971-lmeval-inline branch from 28f34e8 to 87e3113 on September 17, 2025 18:54
Contributor

@sourcery-ai sourcery-ai bot left a comment


New security issues found

@christinaexyou christinaexyou force-pushed the RHOAIENG-33971-lmeval-inline branch 2 times, most recently from 5e41354 to 80ebed6 on September 18, 2025 15:10

@github-advanced-security github-advanced-security bot left a comment


Bandit found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.

@christinaexyou christinaexyou force-pushed the RHOAIENG-33971-lmeval-inline branch from 80ebed6 to 09f0264 on September 18, 2025 15:14
@christinaexyou christinaexyou marked this pull request as ready for review September 18, 2025 15:16
Contributor

@sourcery-ai sourcery-ai bot left a comment


Hey there - I've reviewed your changes - here's some feedback:

Blocking issues:

  • Detected subprocess function 'run' without a static string. If this data can be controlled by a malicious actor, it may be an instance of command injection. Audit the use of this call to ensure it is not controllable by an external resource. You may consider using 'shlex.escape()'. (link)

General comments:

  • Avoid using synchronous subprocess.run inside your async _run_command method, as it will block the event loop—consider using asyncio.create_subprocess_exec or run_in_executor instead (see the sketch after this list).
  • You’ve marked _create_model_args, _collect_lmeval_args, and build_command as async even though they don’t await anything; converting them to regular methods will simplify the code and avoid confusion.
  • The build_command method is quite large—extract sections like flag/string assembly and metadata handling into smaller helper functions to improve readability and testability.
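
A minimal sketch of the non-blocking shape suggested above, which also stores the child process's PID rather than os.getpid() (see Comment 1 below); the wrapper class and names here are illustrative, not the provider's actual implementation:

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


class _RunnerSketch:
    """Illustrative stand-in for LMEvalInline, holding only what _run_command touches."""

    def __init__(self) -> None:
        self._job_metadata: dict[str, dict[str, str]] = {}

    async def _run_command(self, cmd: list[str], job_id: str) -> tuple[str, str]:
        """Run lm_eval in a child process without blocking the event loop."""
        self._job_metadata.setdefault(job_id, {})
        logger.info("Running lm_eval command: %s", " ".join(cmd))
        process = await asyncio.create_subprocess_exec(
            *cmd,
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.PIPE,
        )
        # Store the child's PID (not os.getpid()) so job_cancel can signal the right process.
        self._job_metadata[job_id]["process_id"] = str(process.pid)
        stdout, stderr = await process.communicate()
        if process.returncode != 0:
            logger.error("lm_eval exited with %s; stderr: %s", process.returncode, stderr.decode())
        return stdout.decode(), stderr.decode()
```
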
## Individual Comments

### Comment 1
<location> `src/llama_stack_provider_lmeval/inline/lmeval.py:287` </location>
<code_context>
+        if os.path.exists(venv_path):
+            cmd[0] = venv_path
+
+        process_id = os.getpid()
+        self._job_metadata[job_id]["process_id"] = process_id
+        logger.info("Running lm_eval command: %s", " ".join(cmd))
</code_context>

<issue_to_address>
**issue (bug_risk):** Using os.getpid() for process tracking may not work for subprocesses.

Since evaluation runs in a subprocess, consider storing the child process's PID to ensure job cancellation works as intended.
</issue_to_address>

### Comment 2
<location> `src/llama_stack_provider_lmeval/inline/lmeval.py:291-298` </location>
<code_context>
+        logger.info("Running lm_eval command: %s", " ".join(cmd))
+
+        # Capture output for result parsing
+        result = subprocess.run(
+            cmd,
+            check=True,
+            capture_output=True,
+            text=True
+        )
+        return result.stdout, result.stderr
</code_context>

<issue_to_address>
**suggestion (bug_risk):** subprocess.run with check=True will raise on non-zero exit codes.

Handle CalledProcessError to capture stdout and stderr on failure, enabling better debugging and job metadata collection.

```suggestion
        # Capture output for result parsing
        try:
            result = subprocess.run(
                cmd,
                check=True,
                capture_output=True,
                text=True
            )
            return result.stdout, result.stderr
        except subprocess.CalledProcessError as e:
            logger.error(
                "lm_eval command failed with exit code %d: %s",
                e.returncode,
                " ".join(cmd)
            )
            logger.error("stdout: %s", e.stdout)
            logger.error("stderr: %s", e.stderr)
            return e.stdout, e.stderr
```
</issue_to_address>

### Comment 3
<location> `src/llama_stack_provider_lmeval/inline/lmeval.py:414` </location>
<code_context>
+            benchmark_id: The ID of the benchmark to run the evaluation on.
+            job_id: The ID of the job to cancel.
+        """
+        job = next((j for j in self._jobs if j.job_id == job_id))
+        if not job:
+            logger.warning("Job %s not found", job_id)
</code_context>

<issue_to_address>
**issue (bug_risk):** next() without a default will raise if job is not found.

Provide a default value to next(), such as None, to prevent StopIteration and ensure the warning is logged.
</issue_to_address>
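
For illustration, with a hypothetical Job dataclass standing in for the real job type:

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: str
    status: str = "scheduled"


def find_job(jobs: list[Job], job_id: str) -> Job | None:
    # The second argument to next() is returned when nothing matches,
    # so callers get None instead of an uncaught StopIteration.
    return next((j for j in jobs if j.job_id == job_id), None)


assert find_job([Job("abc")], "abc") is not None
assert find_job([Job("abc")], "missing") is None
```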

### Comment 4
<location> `src/llama_stack_provider_lmeval/inline/lmeval.py:419-420` </location>
<code_context>
+            logger.warning("Job %s not found", job_id)
+            return
+
+        if job.status in [JobStatus.completed, JobStatus.failed, JobStatus.cancelled]:
+            logger.warning("Job %s is not running", job_id)
+
</code_context>

<issue_to_address>
**suggestion (bug_risk):** Job status check does not return after logging warning.

Return immediately after logging the warning to prevent attempts to cancel jobs that are already completed or failed.

```suggestion
        if job.status in [JobStatus.completed, JobStatus.failed, JobStatus.cancelled]:
            logger.warning("Job %s is not running", job_id)
            return
```
</issue_to_address>

### Comment 5
<location> `src/llama_stack_provider_lmeval/inline/__init__.py:32-41` </location>
<code_context>
+
+        # Extract base_url from config if available
+        base_url = None
+        if hasattr(config, "model_args") and config.model_args:
+            for arg in config.model_args:
+                if arg.get("name") == "base_url":
+                    base_url = arg.get("value")
+                    logger.debug(f"Using base_url from config: {base_url}")
+                    break
+
+        return LMEvalInline(config=config)
</code_context>

<issue_to_address>
**suggestion:** Extracted base_url is not used in provider instantiation.

Since base_url is only logged and not used in LMEvalInline initialization, consider removing this extraction or passing base_url if needed.

```suggestion
        return LMEvalInline(config=config)
```
</issue_to_address>

### Comment 6
<location> `tests/test_lmeval_inline.py:544-545` </location>
<code_context>
+            await self.provider._run_command(cmd)
+
+
+if __name__ == "__main__":
+    unittest.main()
</code_context>

<issue_to_address>
**suggestion (testing):** Suggestion to add test coverage for job_cancel and job_result methods.

Please add unit tests for job_cancel and job_result, covering scenarios like cancelling completed jobs and retrieving results from failed jobs.

Suggested implementation:

```python
class TestJobMethods(unittest.IsolatedAsyncioTestCase):
    async def asyncSetUp(self):
        # Setup provider and mock jobs
        self.provider = ...  # Replace with actual provider initialization
        self.job_id = "test_job"
        self.completed_job_id = "completed_job"
        self.failed_job_id = "failed_job"

        # Mock job states
        self.provider.jobs = {
            self.job_id: {"status": "running"},
            self.completed_job_id: {"status": "completed"},
            self.failed_job_id: {"status": "failed", "result": Exception("Job failed")}
        }

    async def test_job_cancel_running(self):
        # Cancel a running job
        result = await self.provider.job_cancel(self.job_id)
        self.assertTrue(result)
        self.assertEqual(self.provider.jobs[self.job_id]["status"], "cancelled")

    async def test_job_cancel_completed(self):
        # Try to cancel a completed job
        result = await self.provider.job_cancel(self.completed_job_id)
        self.assertFalse(result)
        self.assertEqual(self.provider.jobs[self.completed_job_id]["status"], "completed")

    async def test_job_result_success(self):
        # Simulate a successful job result
        self.provider.jobs[self.completed_job_id]["result"] = "success"
        result = await self.provider.job_result(self.completed_job_id)
        self.assertEqual(result, "success")

    async def test_job_result_failed(self):
        # Simulate a failed job result
        with self.assertRaises(Exception) as context:
            await self.provider.job_result(self.failed_job_id)
        self.assertIn("Job failed", str(context.exception))


if __name__ == "__main__":
    unittest.main()

```

- Replace `self.provider = ...` with the actual provider initialization as per your codebase.
- Ensure that `job_cancel` and `job_result` methods exist and follow the expected interface.
- Adjust job dictionary structure if your implementation differs.
</issue_to_address>

### Comment 7
<location> `src/llama_stack_provider_lmeval/inline/lmeval.py:292-297` </location>
<code_context>
        result = subprocess.run(
            cmd,
            check=True,
            capture_output=True,
            text=True
        )
</code_context>

<issue_to_address>
**security (python.lang.security.audit.dangerous-subprocess-use-audit):** Detected subprocess function 'run' without a static string. If this data can be controlled by a malicious actor, it may be an instance of command injection. Audit the use of this call to ensure it is not controllable by an external resource. You may consider using 'shlex.escape()'.

*Source: opengrep*
</issue_to_address>
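
This is an audit-level finding rather than a confirmed injection: build_command assembles cmd as a list and it is run with shell=False, so the classic shell-injection vector does not apply, although untrusted values could still become arbitrary lm_eval flags. A small standalone illustration of the distinction:

```python
import shlex
import subprocess
import sys

cmd = [sys.executable, "--version"]

# A list argv with shell=False (the default) never passes through a shell,
# so its elements cannot be reinterpreted as shell syntax.
result = subprocess.run(cmd, check=True, capture_output=True, text=True)
print(result.stdout.strip() or result.stderr.strip())

# Quoting only matters if a single shell string is unavoidable (POSIX shells);
# the stdlib helper for that case is shlex.quote -- shlex has no escape function.
shell_line = " ".join(shlex.quote(part) for part in cmd)
print(shell_line)
```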

### Comment 8
<location> `src/llama_stack_provider_lmeval/inline/__init__.py:43` </location>
<code_context>

</code_context>

<issue_to_address>
**issue (code-quality):** Raise a specific error instead of the general `Exception` or `BaseException` ([`raise-specific-error`](https://docs.sourcery.ai/Reference/Rules-and-In-Line-Suggestions/Python/Default-Rules/raise-specific-error))

<details><summary>Explanation</summary>If a piece of code raises a specific exception type
rather than the generic
[`BaseException`](https://docs.python.org/3/library/exceptions.html#BaseException)
or [`Exception`](https://docs.python.org/3/library/exceptions.html#Exception),
the calling code can:

- get more information about what type of error it is
- define specific exception handling for it

This way, callers of the code can handle the error appropriately.

How can you solve this?

- Use one of the [built-in exceptions](https://docs.python.org/3/library/exceptions.html) of the standard library.
- [Define your own error class](https://docs.python.org/3/tutorial/errors.html#tut-userexceptions) that subclasses `Exception`.

So instead of having code raising `Exception` or `BaseException` like

```python
if incorrect_input(value):
    raise Exception("The input is incorrect")
```

you can have code raising a specific error like

```python
if incorrect_input(value):
    raise ValueError("The input is incorrect")
```

or

```python
class IncorrectInputError(Exception):
    pass


if incorrect_input(value):
    raise IncorrectInputError("The input is incorrect")
```
</details>
</issue_to_address>

### Comment 9
<location> `src/llama_stack_provider_lmeval/inline/lmeval.py:48-49` </location>
<code_context>
    async def get_benchmark(self, benchmark_id: str) -> Benchmark | None:
        """Get a specific benchmark by ID."""
        benchmark = self.benchmarks.get(benchmark_id)
        return benchmark

</code_context>

<issue_to_address>
**suggestion (code-quality):** Inline variable that is immediately returned ([`inline-immediately-returned-variable`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/inline-immediately-returned-variable/))

```suggestion
        return self.benchmarks.get(benchmark_id)
```
</issue_to_address>

### Comment 10
<location> `src/llama_stack_provider_lmeval/inline/lmeval.py:168-175` </location>
<code_context>
    async def _collect_lmeval_args(self, task_config: BenchmarkConfig, stored_benchmark: Benchmark | None):
        lmeval_args = {}
        if hasattr(task_config, "lmeval_args") and task_config.lmeval_args:
            lmeval_args = task_config.lmeval_args

        if hasattr(task_config, "metadata") and task_config.metadata:
            metadata_lmeval_args = task_config.metadata.get("lmeval_args")
            if metadata_lmeval_args:
                for key, value in metadata_lmeval_args.items():
                    lmeval_args[key] = value

        # Check stored benchmark for additional lmeval args
        if stored_benchmark and hasattr(stored_benchmark, "metadata") and stored_benchmark.metadata:
            benchmark_lmeval_args = stored_benchmark.metadata.get("lmeval_args")
            if benchmark_lmeval_args:
                for key, value in benchmark_lmeval_args.items():
                    lmeval_args[key] = value

        return lmeval_args

</code_context>

<issue_to_address>
**issue (code-quality):** Use named expression to simplify assignment and conditional [×2] ([`use-named-expression`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/use-named-expression/))
</issue_to_address>
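
A sketch of the shape Sourcery is suggesting, using plain dicts in place of the real BenchmarkConfig and Benchmark objects:

```python
def collect_lmeval_args(task_metadata: dict | None, benchmark_metadata: dict | None) -> dict:
    """Merge lmeval_args from task metadata and from the stored benchmark's metadata."""
    lmeval_args: dict = {}
    # The walrus operator folds "fetch, then test" into a single condition.
    if task_metadata and (task_args := task_metadata.get("lmeval_args")):
        lmeval_args.update(task_args)
    if benchmark_metadata and (benchmark_args := benchmark_metadata.get("lmeval_args")):
        lmeval_args.update(benchmark_args)
    return lmeval_args


assert collect_lmeval_args({"lmeval_args": {"num_fewshot": 5}}, None) == {"num_fewshot": 5}
assert collect_lmeval_args(None, {"lmeval_args": {"batch_size": 8}}) == {"batch_size": 8}
```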

### Comment 11
<location> `src/llama_stack_provider_lmeval/inline/lmeval.py:193` </location>
<code_context>
    async def build_command(
        self,
        task_config: BenchmarkConfig,
        benchmark_id: str,
        limit: str,
        stored_benchmark: Benchmark | None
    ) -> list[str]:
        """Build lm_eval command with default args and user overrides."""

        logger.info("BUILD_COMMAND: Starting to build command for benchmark: %s", benchmark_id)
        logger.info("BUILD_COMMAND: Task config type: %s", type(task_config))
        logger.info(
            "BUILD_COMMAND: Task config has metadata: %s", hasattr(task_config, "metadata")
        )
        if hasattr(task_config, "metadata"):
            logger.info(
                "BUILD_COMMAND: Task config metadata content: %s", task_config.metadata
            )

        eval_candidate = task_config.eval_candidate
        if not eval_candidate.type == "model":
            raise LMEvalConfigError("LMEval only supports model candidates for now")

        # Create model args - use VLLM_URL environment variable for inference provider
        inference_url = os.environ.get("VLLM_URL", "http://localhost:8080/v1")
        openai_url = inference_url.replace("/v1", "/v1/completions")
        model_args = await self._create_model_args(openai_url, task_config)

        if (
            stored_benchmark is not None
            and hasattr(stored_benchmark, "metadata")
            and stored_benchmark.metadata
            and "tokenizer" in stored_benchmark.metadata
        ):
            tokenizer_value = stored_benchmark.metadata.get("tokenizer")
            if isinstance(tokenizer_value, str) and tokenizer_value:
                logger.info("Using custom tokenizer from metadata: %s", tokenizer_value)

            tokenized_requests = stored_benchmark.metadata.get("tokenized_requests")
            if isinstance(tokenized_requests, bool) and tokenized_requests:
                logger.info("Using custom tokenized_requests from metadata: %s", tokenized_requests)

            model_args["tokenizer"] = tokenizer_value
            model_args["tokenized_requests"] = tokenized_requests

        task_name = self._extract_task_name(benchmark_id)

        lmeval_args = await self._collect_lmeval_args(task_config, stored_benchmark)

        # Start building the command
        cmd = ["lm_eval"]

        # Add model type (default to local-completions for OpenAI-compatible APIs)
        model_type = model_args.get("model_type", "local-completions")
        cmd.extend(["--model", model_type])

        # Build model_args string
        if model_args:
            model_args_list = []
            for key, value in model_args.items():
                if key != "model_type" and value is not None:
                    model_args_list.append(f"{key}={value}")

            if model_args_list:
                cmd.extend(["--model_args", ",".join(model_args_list)])

        # Add tasks
        cmd.extend(["--tasks", task_name])

        limit = limit or "2"
        if limit:
            cmd.extend(["--limit", str(limit)])

        # Add output file for JSON results
        output_path = f"/tmp/lmeval_results_{uuid.uuid4().hex}.json"
        cmd.extend(["--output_path", output_path])

        # Add lmeval_args
        if lmeval_args:
            for key, value in lmeval_args.items():
                if value is not None:
                    flag = f"--{key.replace('_', '-')}"
                    cmd.extend([flag, str(value)])

        logger.info("Generated command for benchmark %s: %s", benchmark_id, " ".join(cmd))
        return cmd, output_path

</code_context>

<issue_to_address>
**issue (code-quality):** We've found these issues:

- Use named expression to simplify assignment and conditional [×2] ([`use-named-expression`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/use-named-expression/))
- Simplify logical expression using De Morgan identities ([`de-morgan`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/de-morgan/))
- Convert for loop into list comprehension ([`list-comprehension`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/list-comprehension/))
- Low code quality found in LMEvalInline.build\_command - 25% ([`low-code-quality`](https://docs.sourcery.ai/Reference/Default-Rules/comments/low-code-quality/))

<br/><details><summary>Explanation</summary>



The quality score for this function is below the quality threshold of 25%.
This score is a combination of the method length, cognitive complexity and working memory.

How can you solve this?

It might be worth refactoring this function to make it shorter and more readable.

- Reduce the function length by extracting pieces of functionality out into
  their own functions. This is the most important thing you can do - ideally a
  function should be less than 10 lines.
- Reduce nesting, perhaps by introducing guard clauses to return early.
- Ensure that variables are tightly scoped, so that code using related concepts
  sits together within the function rather than being scattered.</details>
</issue_to_address>
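
As a sketch of the extraction being suggested, the flag assembly could move into small pure helpers that build_command simply concatenates; the helper names and sample values below are illustrative:

```python
def format_model_args(model_args: dict) -> list[str]:
    """Render everything except model_type as a single --model_args key=value,... flag."""
    pairs = [
        f"{key}={value}"
        for key, value in model_args.items()
        if key != "model_type" and value is not None
    ]
    return ["--model_args", ",".join(pairs)] if pairs else []


def format_extra_flags(lmeval_args: dict) -> list[str]:
    """Render remaining lm_eval options as --some-flag value pairs."""
    flags: list[str] = []
    for key, value in lmeval_args.items():
        if value is not None:
            flags.extend([f"--{key.replace('_', '-')}", str(value)])
    return flags


cmd = ["lm_eval", "--model", "local-completions"]
cmd += format_model_args({"model": "granite", "base_url": "http://localhost:8080/v1/completions"})
cmd += ["--tasks", "arc_easy", "--limit", "2"]
cmd += format_extra_flags({"num_fewshot": 5})
print(" ".join(cmd))
```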

### Comment 12
<location> `src/llama_stack_provider_lmeval/inline/lmeval.py:300` </location>
<code_context>
    def _parse_lmeval_results(self, stdout: str) -> dict:
        """Parse evaluation results from stdout."""
        # Try to find JSON results in the output
        json_pattern = r'\{[^{}]*"results"[^{}]*\}'
        json_matches = re.findall(json_pattern, stdout, re.DOTALL)

        scores = {}

        if json_matches:
            try:
                # Parse JSON results
                for match in json_matches:
                    result_data = json.loads(match)
                    if "results" in result_data:
                        for task, metrics in result_data["results"].items():
                            for metric, value in metrics.items():
                                if isinstance(value, int | float):
                                    score_key = f"{task}:{metric}"

                                    scores[score_key] = ScoringResult(
                                        aggregated_results={metric: value},
                                        score_rows=[{"score": value}]
                                    )
            except json.JSONDecodeError:
                logger.warning("Failed to parse JSON results from lm_eval output")

        # If no JSON found, try to parse table format
        if not scores:
            lines = stdout.split('\n')
            for i, line in enumerate(lines):
                if 'Metric' in line and '|' in line:
                    # Found table header, parse subsequent lines
                    for j in range(i + 2, len(lines)):  # Skip header separator
                        if not lines[j].strip() or '|' not in lines[j]:
                            break
                        parts = [p.strip() for p in lines[j].split('|') if p.strip()]
                        if len(parts) >= 5:
                            task = parts[0] or "arc_easy"  # Default task name
                            metric = parts[4]
                            try:
                                value = float(parts[6]) if len(parts) > 6 else 0.0
                                score_key = f"{task}:{metric}"
                                scores[score_key] = ScoringResult(
                                    aggregated_results={metric: value},
                                    score_rows=[{"score": value}]
                                )
                            except (ValueError, IndexError):
                                continue
                    break

        return EvaluateResponse(generations=[], scores=scores)

</code_context>

<issue_to_address>
**issue (code-quality):** Low code quality found in LMEvalInline.\_parse\_lmeval\_results - 18% ([`low-code-quality`](https://docs.sourcery.ai/Reference/Default-Rules/comments/low-code-quality/))

<br/><details><summary>Explanation</summary>The quality score for this function is below the quality threshold of 25%.
This score is a combination of the method length, cognitive complexity and working memory.

How can you solve this?

It might be worth refactoring this function to make it shorter and more readable.

- Reduce the function length by extracting pieces of functionality out into
  their own functions. This is the most important thing you can do - ideally a
  function should be less than 10 lines.
- Reduce nesting, perhaps by introducing guard clauses to return early.
- Ensure that variables are tightly scoped, so that code using related concepts
  sits together within the function rather than being scattered.</details>
</issue_to_address>
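
One way to start shrinking this method is to pull out a pure helper that flattens an already-parsed results dict into scores, leaving the JSON discovery and the table fallback as separate small steps; the metric key below mimics lm_eval's usual output shape but is only illustrative:

```python
def scores_from_results(results: dict) -> dict[str, float]:
    """Flatten an lm_eval results payload into {"task:metric": value} numeric scores."""
    scores: dict[str, float] = {}
    for task, metrics in results.get("results", {}).items():
        for metric, value in metrics.items():
            if isinstance(value, (int, float)):
                scores[f"{task}:{metric}"] = value
    return scores


sample = {"results": {"arc_easy": {"acc,none": 0.81, "alias": "arc_easy"}}}
assert scores_from_results(sample) == {"arc_easy:acc,none": 0.81}
```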

### Comment 13
<location> `src/llama_stack_provider_lmeval/inline/lmeval.py:423-424` </location>
<code_context>
    async def job_cancel(self, benchmark_id: str, job_id: str) -> None:
        """Cancel a running evaluation job.

        Args:
            benchmark_id: The ID of the benchmark to run the evaluation on.
            job_id: The ID of the job to cancel.
        """
        job = next((j for j in self._jobs if j.job_id == job_id))
        if not job:
            logger.warning("Job %s not found", job_id)
            return

        if job.status in [JobStatus.completed, JobStatus.failed, JobStatus.cancelled]:
            logger.warning("Job %s is not running", job_id)

        elif job.status in [JobStatus.in_progress, JobStatus.scheduled]:
            process_id = self._job_metadata.get(job_id, {}).get("process_id")
            if process_id:
                process_id = int(process_id)
                logger.info("Killing process %s", process_id)
                try:
                    os.kill(process_id, signal.SIGTERM)
                    await asyncio.sleep(2)
                except ProcessLookupError:
                    logger.warning("Process %s not found", process_id)
                try:
                    os.kill(process_id, signal.SIGKILL)
                except ProcessLookupError:
                    logger.warning("Process %s not found", process_id)
            job.status = JobStatus.cancelled
            logger.info("Successfully cancelled job %s", job_id)

</code_context>

<issue_to_address>
**suggestion (code-quality):** Use named expression to simplify assignment and conditional ([`use-named-expression`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/use-named-expression/))

```suggestion
            if process_id := self._job_metadata.get(job_id, {}).get("process_id"):
```
</issue_to_address>

### Comment 14
<location> `src/llama_stack_provider_lmeval/inline/lmeval.py:453` </location>
<code_context>
    async def job_result(self, benchmark_id: str, job_id: str) -> EvaluateResponse:
        """Get the results of a completed evaluation job.

        Args:
            benchmark_id: The ID of the benchmark to run the evaluation on.
            job_id: The ID of the job to get the results of.
        """
        job = await self.job_status(benchmark_id, job_id)

        if job is None:
            logger.warning("Job %s not found", job_id)
            return EvaluateResponse(generations=[], scores={})

        if job.status == JobStatus.completed:
            # Get results from job metadata
            job_metadata = self._job_metadata.get(job_id, {})
            results = job_metadata.get("results")
            if results:
                return results
            else:
                logger.warning("No results found for completed job %s", job_id)
                return EvaluateResponse(generations=[], scores={})
        elif job.status == JobStatus.failed:
            logger.warning("Job %s failed", job_id)
            error_msg = self._job_metadata.get(job_id, {}).get("error", "Unknown error")
            return EvaluateResponse(generations=[], scores={}, metadata={"error": error_msg})
        else:
            logger.warning("Job %s is still running", job_id)
            return EvaluateResponse(generations=[], scores={}, metadata={"status": "running"})

</code_context>

<issue_to_address>
**issue (code-quality):** We've found these issues:

- Use named expression to simplify assignment and conditional ([`use-named-expression`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/use-named-expression/))
- Remove unnecessary else after guard condition ([`remove-unnecessary-else`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/remove-unnecessary-else/))
</issue_to_address>

@christinaexyou christinaexyou force-pushed the RHOAIENG-33971-lmeval-inline branch 2 times, most recently from eb8a389 to 867da4d on September 19, 2025 15:13
Contributor

@sourcery-ai sourcery-ai bot left a comment


New security issues found

@christinaexyou christinaexyou force-pushed the RHOAIENG-33971-lmeval-inline branch from 867da4d to 0a2e7f0 on September 19, 2025 15:44
@christinaexyou christinaexyou force-pushed the RHOAIENG-33971-lmeval-inline branch from 60644a9 to 4596c27 on September 26, 2025 13:15
@ruivieira ruivieira changed the title feat(RHOAIENG-33971): Add LM-Eval inline provider and unit tests [Do not merge] feat(RHOAIENG-33971): Add LM-Eval inline provider and unit tests Sep 26, 2025

@saichandrapandraju saichandrapandraju left a comment


LGTM! Thank you @christinaexyou, just a couple of nits from me.


self._job_metadata[job_id]["process_id"] = str(process.pid)

stdout, stderr = await asyncio.wait_for(process.communicate(), timeout=300)


shall we optionally parameterize the timeout?
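
A sketch of how the timeout could become a keyword argument with the current value as the default, shown standalone with a harmless command:

```python
import asyncio
import sys


async def run_with_timeout(cmd: list[str], timeout: float = 300.0) -> tuple[str, str]:
    """Run a command, letting callers override how long to wait for it."""
    process = await asyncio.create_subprocess_exec(
        *cmd, stdout=asyncio.subprocess.PIPE, stderr=asyncio.subprocess.PIPE
    )
    try:
        stdout, stderr = await asyncio.wait_for(process.communicate(), timeout=timeout)
    except asyncio.TimeoutError:
        process.kill()  # don't leave the child running after giving up on it
        raise
    return stdout.decode(), stderr.decode()


print(asyncio.run(run_with_timeout([sys.executable, "--version"], timeout=30.0)))
```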

results_data, job_id
)
# Store the parsed results in job metadata
self._job_metadata[job_id]["results"] = parsed_results


nit: if result size is not small, maybe we can save it in the files store and just track the results_file id.


# Create model args - use VLLM_URL environment variable for inference provider
inference_url = os.environ.get("VLLM_URL", "http://localhost:8080/v1")
openai_url = inference_url.replace("/v1", "/v1/completions")


I think we need some checks here to make sure we're not polluting the URL.

IIRC, if we need the OpenAI-compatible URL, it should end with openai/v1/completions.
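
A sketch of the kind of guard this could be; the exact expected suffix (with or without an openai/ path segment) depends on the deployment, so it is an assumption here:

```python
def to_completions_url(base_url: str) -> str:
    """Turn a VLLM_URL-style base into a completions endpoint without double-appending."""
    url = base_url.rstrip("/")
    if url.endswith("/v1/completions"):
        return url
    if url.endswith("/v1"):
        return f"{url}/completions"
    # Anything else is rejected rather than silently producing a malformed URL.
    raise ValueError(f"Unexpected inference URL: {base_url!r}")


assert to_completions_url("http://localhost:8080/v1") == "http://localhost:8080/v1/completions"
assert to_completions_url("http://localhost:8080/v1/") == "http://localhost:8080/v1/completions"
```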

max_tokens: ${env.VLLM_MAX_TOKENS:=4096}
api_token: ${env.VLLM_API_TOKEN:=fake}
tls_verify: ${env.VLLM_TLS_VERIFY:=false}
eval:

@saichandrapandraju saichandrapandraju Sep 29, 2025


nit: shall we start using the module: way of specifying the provider, since lls-core is moving away from external_providers_dir?

