
Conversation

@christinaexyou
Collaborator

@christinaexyou christinaexyou commented Sep 16, 2025

Summary by Sourcery

Add a new LM-Eval inline provider alongside the existing remote provider structure by relocating remote code to a dedicated subpackage, updating provider specs and configurations, and introducing a full suite of unit tests for inline evaluation functionality.

New Features:

  • Introduce LMEval inline provider implementation (LMEvalInline) and corresponding InlineProviderSpec for on-the-fly evaluations
  • Add comprehensive unit tests for the inline provider covering spec registration, initialization, benchmark/job management, argument building, command execution, and result parsing

Enhancements:

  • Refactor the remote provider by moving the existing lmeval and provider modules into a 'remote' subpackage, and update import paths and provider specs accordingly
  • Update providers.d YAML configs to reference the new remote and inline module paths
  • Add run-inline.yaml configuration for the inline provider deployment

Chores:

  • Remove stray whitespace and add __init__.py files for the newly created inline and remote packages

@sourcery-ai
Contributor

sourcery-ai bot commented Sep 16, 2025

Reviewer's Guide

This PR adds a new inline LM-Eval provider: it introduces LMEvalInline with full job and benchmark management, command building/execution, and result parsing; refactors the existing remote provider into a dedicated subpackage with updated imports and specs; relaxes the use_k8s configuration validation to support inline mode; and adds extensive unit tests and a sample YAML for inline usage.

ER diagram for provider configuration YAMLs (inline and remote)

```mermaid
erDiagram
    PROVIDER {
        string provider_id
        string provider_type
        string config_class
        string module
        list pip_packages
    }
    CONFIG {
        string base_url
        bool use_k8s
    }
    PROVIDER ||--o| CONFIG : has
    REMOTE_PROVIDER {
        string provider_id
        string provider_type
        string config_class
        string module
        list pip_packages
    }
    REMOTE_CONFIG {
        string url
        int max_tokens
        string api_token
        bool tls_verify
        bool use_k8s
    }
    REMOTE_PROVIDER ||--o| REMOTE_CONFIG : has
```

Class diagram for new LMEvalInline provider (inline mode)

```mermaid
classDiagram
    class LMEvalInline {
        +LMEvalEvalProviderConfig config
        +dict[str, Benchmark] benchmarks
        +list[Job] _jobs
        +dict[str, dict[str, str]] _job_metadata
        +async initialize()
        +async list_benchmarks() ListBenchmarksResponse
        +async get_benchmark(benchmark_id: str) Benchmark | None
        +async register_benchmark(benchmark: Benchmark)
        +async run_eval(benchmark_id: str, benchmark_config: BenchmarkConfig, limit="2") Job
        +async build_command(task_config: BenchmarkConfig, benchmark_id: str, limit: str, stored_benchmark: Benchmark | None) list[str]
        +async _run_command(cmd: list[str], job_id: str)
        +_parse_lmeval_results(stdout: str) dict
        +_read_results_from_json(output_path: str) dict
        +async job_cancel(benchmark_id: str, job_id: str)
        +async job_result(benchmark_id: str, job_id: str) EvaluateResponse
        +async job_status(benchmark_id: str, job_id: str) Job | None
        +async shutdown()
    }
    LMEvalInline --|> Eval
    LMEvalInline --|> BenchmarksProtocolPrivate
    LMEvalInline o-- LMEvalEvalProviderConfig
    LMEvalInline o-- Benchmark
    LMEvalInline o-- Job
    LMEvalInline o-- EvaluateResponse
```

Class diagram for provider spec registration (inline and remote)

```mermaid
classDiagram
    class InlineProviderSpec {
        +api: Api
        +provider_type: str
        +pip_packages: list[str]
        +config_class: str
        +module: str
    }
    class RemoteProviderSpec {
        +api: Api
        +provider_type: str
        +pip_packages: list[str]
        +config_class: str
        +module: str
    }
    class ProviderSpec
    InlineProviderSpec --|> ProviderSpec
    RemoteProviderSpec --|> ProviderSpec
```

File-Level Changes

Introduce inline LMEval provider implementation
  Details:
  • Add LMEvalInline class with initialize, benchmark/job methods, command creation, execution, and parsing logic
  • Create InlineProviderSpec in inline/provider.py and an __init__.py helper
  • Add run-inline.yaml and a providers.d inline YAML for sample configuration
  • Add comprehensive unit tests in tests/test_lmeval_inline.py covering initialization, command building, argument handling, error cases, run_eval, and _run_command
  Files:
  • src/llama_stack_provider_lmeval/inline/lmeval.py
  • src/llama_stack_provider_lmeval/inline/provider.py
  • src/llama_stack_provider_lmeval/inline/__init__.py
  • providers.d/inline/eval/trustyai_lmeval.yaml
  • run-inline.yaml
  • tests/test_lmeval_inline.py

Refactor the remote provider into its own subpackage
  Details:
  • Move the existing lmeval and provider modules into a remote/ subdirectory
  • Update import paths in code and tests to point to src/llama_stack_provider_lmeval/remote
  • Adjust the remote provider spec module path in the providers.d/remote/eval YAML
  Files:
  • src/llama_stack_provider_lmeval/remote/lmeval.py
  • src/llama_stack_provider_lmeval/remote/provider.py
  • src/llama_stack_provider_lmeval/remote/__init__.py
  • providers.d/remote/eval/trustyai_lmeval.yaml
  • tests/test_lmeval.py

Relax configuration validation for inline mode
  Details:
  • Remove the strict use_k8s boolean validation in LMEvalEvalProviderConfig
  • Allow use_k8s to be false without error
  Files:
  • src/llama_stack_provider_lmeval/config.py

Expand and update test coverage
  Details:
  • Add inline provider tests for model args, lmeval args, task name extraction, benchmarking, job lifecycle, and error scenarios
  • Update the existing test_lmeval suite imports for the remote code relocation
  Files:
  • tests/test_lmeval_inline.py
  • tests/test_lmeval.py


@christinaexyou christinaexyou force-pushed the RHOAIENG-33971-lmeval-inline branch from 28f34e8 to 87e3113 on September 17, 2025 18:54
Contributor

@sourcery-ai sourcery-ai bot left a comment


New security issues found

@christinaexyou christinaexyou force-pushed the RHOAIENG-33971-lmeval-inline branch 2 times, most recently from 5e41354 to 80ebed6 on September 18, 2025 15:10

@github-advanced-security github-advanced-security bot left a comment


Bandit found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.

@christinaexyou christinaexyou force-pushed the RHOAIENG-33971-lmeval-inline branch from 80ebed6 to 09f0264 on September 18, 2025 15:14
@christinaexyou christinaexyou marked this pull request as ready for review September 18, 2025 15:16
Contributor

@sourcery-ai sourcery-ai bot left a comment


Hey there - I've reviewed your changes - here's some feedback:

Blocking issues:

  • Detected subprocess function 'run' without a static string. If this data can be controlled by a malicious actor, it may be an instance of command injection. Audit the use of this call to ensure it is not controllable by an external resource. You may consider using 'shlex.escape()'. (link)

General comments:

  • Avoid using synchronous subprocess.run inside your async _run_command method, as it will block the event loop—consider using asyncio.create_subprocess_exec or run_in_executor instead (see the sketch after this list).
  • You’ve marked _create_model_args, _collect_lmeval_args, and build_command as async even though they don’t await anything; converting them to regular methods will simplify the code and avoid confusion.
  • The build_command method is quite large—extract sections like flag/string assembly and metadata handling into smaller helper functions to improve readability and testability.
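
A minimal sketch of the non-blocking shape suggested above, which also stores the child process's PID rather than os.getpid() (see Comment 1 below); the wrapper class and names here are illustrative, not the provider's actual implementation:

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


class _RunnerSketch:
    """Illustrative stand-in for LMEvalInline, holding only what _run_command touches."""

    def __init__(self) -> None:
        self._job_metadata: dict[str, dict[str, str]] = {}

    async def _run_command(self, cmd: list[str], job_id: str) -> tuple[str, str]:
        """Run lm_eval in a child process without blocking the event loop."""
        self._job_metadata.setdefault(job_id, {})
        logger.info("Running lm_eval command: %s", " ".join(cmd))
        process = await asyncio.create_subprocess_exec(
            *cmd,
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.PIPE,
        )
        # Store the child's PID (not os.getpid()) so job_cancel can signal the right process.
        self._job_metadata[job_id]["process_id"] = str(process.pid)
        stdout, stderr = await process.communicate()
        if process.returncode != 0:
            logger.error("lm_eval exited with %s; stderr: %s", process.returncode, stderr.decode())
        return stdout.decode(), stderr.decode()
```
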
## Individual Comments

### Comment 1
<location> `src/llama_stack_provider_lmeval/inline/lmeval.py:287` </location>
<code_context>
+        if os.path.exists(venv_path):
+            cmd[0] = venv_path
+
+        process_id = os.getpid()
+        self._job_metadata[job_id]["process_id"] = process_id
+        logger.info("Running lm_eval command: %s", " ".join(cmd))
</code_context>

<issue_to_address>
**issue (bug_risk):** Using os.getpid() for process tracking may not work for subprocesses.

Since evaluation runs in a subprocess, consider storing the child process's PID to ensure job cancellation works as intended.
</issue_to_address>

### Comment 2
<location> `src/llama_stack_provider_lmeval/inline/lmeval.py:291-298` </location>
<code_context>
+        logger.info("Running lm_eval command: %s", " ".join(cmd))
+
+        # Capture output for result parsing
+        result = subprocess.run(
+            cmd,
+            check=True,
+            capture_output=True,
+            text=True
+        )
+        return result.stdout, result.stderr
</code_context>

<issue_to_address>
**suggestion (bug_risk):** subprocess.run with check=True will raise on non-zero exit codes.

Handle CalledProcessError to capture stdout and stderr on failure, enabling better debugging and job metadata collection.

```suggestion
        # Capture output for result parsing
        try:
            result = subprocess.run(
                cmd,
                check=True,
                capture_output=True,
                text=True
            )
            return result.stdout, result.stderr
        except subprocess.CalledProcessError as e:
            logger.error(
                "lm_eval command failed with exit code %d: %s",
                e.returncode,
                " ".join(cmd)
            )
            logger.error("stdout: %s", e.stdout)
            logger.error("stderr: %s", e.stderr)
            return e.stdout, e.stderr
```
</issue_to_address>

### Comment 3
<location> `src/llama_stack_provider_lmeval/inline/lmeval.py:414` </location>
<code_context>
+            benchmark_id: The ID of the benchmark to run the evaluation on.
+            job_id: The ID of the job to cancel.
+        """
+        job = next((j for j in self._jobs if j.job_id == job_id))
+        if not job:
+            logger.warning("Job %s not found", job_id)
</code_context>

<issue_to_address>
**issue (bug_risk):** next() without a default will raise if job is not found.

Provide a default value to next(), such as None, to prevent StopIteration and ensure the warning is logged.
</issue_to_address>
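
For illustration, with a hypothetical Job dataclass standing in for the real job type:

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: str
    status: str = "scheduled"


def find_job(jobs: list[Job], job_id: str) -> Job | None:
    # The second argument to next() is returned when nothing matches,
    # so callers get None instead of an uncaught StopIteration.
    return next((j for j in jobs if j.job_id == job_id), None)


assert find_job([Job("abc")], "abc") is not None
assert find_job([Job("abc")], "missing") is None
```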

### Comment 4
<location> `src/llama_stack_provider_lmeval/inline/lmeval.py:419-420` </location>
<code_context>
+            logger.warning("Job %s not found", job_id)
+            return
+
+        if job.status in [JobStatus.completed, JobStatus.failed, JobStatus.cancelled]:
+            logger.warning("Job %s is not running", job_id)
+
</code_context>

<issue_to_address>
**suggestion (bug_risk):** Job status check does not return after logging warning.

Return immediately after logging the warning to prevent attempts to cancel jobs that are already completed or failed.

```suggestion
        if job.status in [JobStatus.completed, JobStatus.failed, JobStatus.cancelled]:
            logger.warning("Job %s is not running", job_id)
            return
```
</issue_to_address>

### Comment 5
<location> `src/llama_stack_provider_lmeval/inline/__init__.py:32-41` </location>
<code_context>
+
+        # Extract base_url from config if available
+        base_url = None
+        if hasattr(config, "model_args") and config.model_args:
+            for arg in config.model_args:
+                if arg.get("name") == "base_url":
+                    base_url = arg.get("value")
+                    logger.debug(f"Using base_url from config: {base_url}")
+                    break
+
+        return LMEvalInline(config=config)
</code_context>

<issue_to_address>
**suggestion:** Extracted base_url is not used in provider instantiation.

Since base_url is only logged and not used in LMEvalInline initialization, consider removing this extraction or passing base_url if needed.

```suggestion
        return LMEvalInline(config=config)
```
</issue_to_address>

### Comment 6
<location> `tests/test_lmeval_inline.py:544-545` </location>
<code_context>
+            await self.provider._run_command(cmd)
+
+
+if __name__ == "__main__":
+    unittest.main()
</code_context>

<issue_to_address>
**suggestion (testing):** Suggestion to add test coverage for job_cancel and job_result methods.

Please add unit tests for job_cancel and job_result, covering scenarios like cancelling completed jobs and retrieving results from failed jobs.

Suggested implementation:

```python
class TestJobMethods(unittest.IsolatedAsyncioTestCase):
    async def asyncSetUp(self):
        # Setup provider and mock jobs
        self.provider = ...  # Replace with actual provider initialization
        self.job_id = "test_job"
        self.completed_job_id = "completed_job"
        self.failed_job_id = "failed_job"

        # Mock job states
        self.provider.jobs = {
            self.job_id: {"status": "running"},
            self.completed_job_id: {"status": "completed"},
            self.failed_job_id: {"status": "failed", "result": Exception("Job failed")}
        }

    async def test_job_cancel_running(self):
        # Cancel a running job
        result = await self.provider.job_cancel(self.job_id)
        self.assertTrue(result)
        self.assertEqual(self.provider.jobs[self.job_id]["status"], "cancelled")

    async def test_job_cancel_completed(self):
        # Try to cancel a completed job
        result = await self.provider.job_cancel(self.completed_job_id)
        self.assertFalse(result)
        self.assertEqual(self.provider.jobs[self.completed_job_id]["status"], "completed")

    async def test_job_result_success(self):
        # Simulate a successful job result
        self.provider.jobs[self.completed_job_id]["result"] = "success"
        result = await self.provider.job_result(self.completed_job_id)
        self.assertEqual(result, "success")

    async def test_job_result_failed(self):
        # Simulate a failed job result
        with self.assertRaises(Exception) as context:
            await self.provider.job_result(self.failed_job_id)
        self.assertIn("Job failed", str(context.exception))


if __name__ == "__main__":
    unittest.main()

```

- Replace `self.provider = ...` with the actual provider initialization as per your codebase.
- Ensure that `job_cancel` and `job_result` methods exist and follow the expected interface.
- Adjust job dictionary structure if your implementation differs.
</issue_to_address>

### Comment 7
<location> `src/llama_stack_provider_lmeval/inline/lmeval.py:292-297` </location>
<code_context>
        result = subprocess.run(
            cmd,
            check=True,
            capture_output=True,
            text=True
        )
</code_context>

<issue_to_address>
**security (python.lang.security.audit.dangerous-subprocess-use-audit):** Detected subprocess function 'run' without a static string. If this data can be controlled by a malicious actor, it may be an instance of command injection. Audit the use of this call to ensure it is not controllable by an external resource. You may consider using 'shlex.escape()'.

*Source: opengrep*
</issue_to_address>
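
This is an audit-level finding rather than a confirmed injection: build_command assembles cmd as a list and it is run with shell=False, so the classic shell-injection vector does not apply, although untrusted values could still become arbitrary lm_eval flags. A small standalone illustration of the distinction:

```python
import shlex
import subprocess
import sys

cmd = [sys.executable, "--version"]

# A list argv with shell=False (the default) never passes through a shell,
# so its elements cannot be reinterpreted as shell syntax.
result = subprocess.run(cmd, check=True, capture_output=True, text=True)
print(result.stdout.strip() or result.stderr.strip())

# Quoting only matters if a single shell string is unavoidable (POSIX shells);
# the stdlib helper for that case is shlex.quote -- shlex has no escape function.
shell_line = " ".join(shlex.quote(part) for part in cmd)
print(shell_line)
```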

### Comment 8
<location> `src/llama_stack_provider_lmeval/inline/__init__.py:43` </location>
<code_context>

</code_context>

<issue_to_address>
**issue (code-quality):** Raise a specific error instead of the general `Exception` or `BaseException` ([`raise-specific-error`](https://docs.sourcery.ai/Reference/Rules-and-In-Line-Suggestions/Python/Default-Rules/raise-specific-error))

<details><summary>Explanation</summary>If a piece of code raises a specific exception type
rather than the generic
[`BaseException`](https://docs.python.org/3/library/exceptions.html#BaseException)
or [`Exception`](https://docs.python.org/3/library/exceptions.html#Exception),
the calling code can:

- get more information about what type of error it is
- define specific exception handling for it

This way, callers of the code can handle the error appropriately.

How can you solve this?

- Use one of the [built-in exceptions](https://docs.python.org/3/library/exceptions.html) of the standard library.
- [Define your own error class](https://docs.python.org/3/tutorial/errors.html#tut-userexceptions) that subclasses `Exception`.

So instead of having code raising `Exception` or `BaseException` like

```python
if incorrect_input(value):
    raise Exception("The input is incorrect")
```

you can have code raising a specific error like

```python
if incorrect_input(value):
    raise ValueError("The input is incorrect")
```

or

```python
class IncorrectInputError(Exception):
    pass


if incorrect_input(value):
    raise IncorrectInputError("The input is incorrect")
```
</details>
</issue_to_address>

### Comment 9
<location> `src/llama_stack_provider_lmeval/inline/lmeval.py:48-49` </location>
<code_context>
    async def get_benchmark(self, benchmark_id: str) -> Benchmark | None:
        """Get a specific benchmark by ID."""
        benchmark = self.benchmarks.get(benchmark_id)
        return benchmark

</code_context>

<issue_to_address>
**suggestion (code-quality):** Inline variable that is immediately returned ([`inline-immediately-returned-variable`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/inline-immediately-returned-variable/))

```suggestion
        return self.benchmarks.get(benchmark_id)
```
</issue_to_address>

### Comment 10
<location> `src/llama_stack_provider_lmeval/inline/lmeval.py:168-175` </location>
<code_context>
    async def _collect_lmeval_args(self, task_config: BenchmarkConfig, stored_benchmark: Benchmark | None):
        lmeval_args = {}
        if hasattr(task_config, "lmeval_args") and task_config.lmeval_args:
            lmeval_args = task_config.lmeval_args

        if hasattr(task_config, "metadata") and task_config.metadata:
            metadata_lmeval_args = task_config.metadata.get("lmeval_args")
            if metadata_lmeval_args:
                for key, value in metadata_lmeval_args.items():
                    lmeval_args[key] = value

        # Check stored benchmark for additional lmeval args
        if stored_benchmark and hasattr(stored_benchmark, "metadata") and stored_benchmark.metadata:
            benchmark_lmeval_args = stored_benchmark.metadata.get("lmeval_args")
            if benchmark_lmeval_args:
                for key, value in benchmark_lmeval_args.items():
                    lmeval_args[key] = value

        return lmeval_args

</code_context>

<issue_to_address>
**issue (code-quality):** Use named expression to simplify assignment and conditional [×2] ([`use-named-expression`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/use-named-expression/))
</issue_to_address>
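
A sketch of the shape Sourcery is suggesting, using plain dicts in place of the real BenchmarkConfig and Benchmark objects:

```python
def collect_lmeval_args(task_metadata: dict | None, benchmark_metadata: dict | None) -> dict:
    """Merge lmeval_args from task metadata and from the stored benchmark's metadata."""
    lmeval_args: dict = {}
    # The walrus operator folds "fetch, then test" into a single condition.
    if task_metadata and (task_args := task_metadata.get("lmeval_args")):
        lmeval_args.update(task_args)
    if benchmark_metadata and (benchmark_args := benchmark_metadata.get("lmeval_args")):
        lmeval_args.update(benchmark_args)
    return lmeval_args


assert collect_lmeval_args({"lmeval_args": {"num_fewshot": 5}}, None) == {"num_fewshot": 5}
assert collect_lmeval_args(None, {"lmeval_args": {"batch_size": 8}}) == {"batch_size": 8}
```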

### Comment 11
<location> `src/llama_stack_provider_lmeval/inline/lmeval.py:193` </location>
<code_context>
    async def build_command(
        self,
        task_config: BenchmarkConfig,
        benchmark_id: str,
        limit: str,
        stored_benchmark: Benchmark | None
    ) -> list[str]:
        """Build lm_eval command with default args and user overrides."""

        logger.info("BUILD_COMMAND: Starting to build command for benchmark: %s", benchmark_id)
        logger.info("BUILD_COMMAND: Task config type: %s", type(task_config))
        logger.info(
            "BUILD_COMMAND: Task config has metadata: %s", hasattr(task_config, "metadata")
        )
        if hasattr(task_config, "metadata"):
            logger.info(
                "BUILD_COMMAND: Task config metadata content: %s", task_config.metadata
            )

        eval_candidate = task_config.eval_candidate
        if not eval_candidate.type == "model":
            raise LMEvalConfigError("LMEval only supports model candidates for now")

        # Create model args - use VLLM_URL environment variable for inference provider
        inference_url = os.environ.get("VLLM_URL", "http://localhost:8080/v1")
        openai_url = inference_url.replace("/v1", "/v1/completions")
        model_args = await self._create_model_args(openai_url, task_config)

        if (
            stored_benchmark is not None
            and hasattr(stored_benchmark, "metadata")
            and stored_benchmark.metadata
            and "tokenizer" in stored_benchmark.metadata
        ):
            tokenizer_value = stored_benchmark.metadata.get("tokenizer")
            if isinstance(tokenizer_value, str) and tokenizer_value:
                logger.info("Using custom tokenizer from metadata: %s", tokenizer_value)

            tokenized_requests = stored_benchmark.metadata.get("tokenized_requests")
            if isinstance(tokenized_requests, bool) and tokenized_requests:
                logger.info("Using custom tokenized_requests from metadata: %s", tokenized_requests)

            model_args["tokenizer"] = tokenizer_value
            model_args["tokenized_requests"] = tokenized_requests

        task_name = self._extract_task_name(benchmark_id)

        lmeval_args = await self._collect_lmeval_args(task_config, stored_benchmark)

        # Start building the command
        cmd = ["lm_eval"]

        # Add model type (default to local-completions for OpenAI-compatible APIs)
        model_type = model_args.get("model_type", "local-completions")
        cmd.extend(["--model", model_type])

        # Build model_args string
        if model_args:
            model_args_list = []
            for key, value in model_args.items():
                if key != "model_type" and value is not None:
                    model_args_list.append(f"{key}={value}")

            if model_args_list:
                cmd.extend(["--model_args", ",".join(model_args_list)])

        # Add tasks
        cmd.extend(["--tasks", task_name])

        limit = limit or "2"
        if limit:
            cmd.extend(["--limit", str(limit)])

        # Add output file for JSON results
        output_path = f"/tmp/lmeval_results_{uuid.uuid4().hex}.json"
        cmd.extend(["--output_path", output_path])

        # Add lmeval_args
        if lmeval_args:
            for key, value in lmeval_args.items():
                if value is not None:
                    flag = f"--{key.replace('_', '-')}"
                    cmd.extend([flag, str(value)])

        logger.info("Generated command for benchmark %s: %s", benchmark_id, " ".join(cmd))
        return cmd, output_path

</code_context>

<issue_to_address>
**issue (code-quality):** We've found these issues:

- Use named expression to simplify assignment and conditional [×2] ([`use-named-expression`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/use-named-expression/))
- Simplify logical expression using De Morgan identities ([`de-morgan`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/de-morgan/))
- Convert for loop into list comprehension ([`list-comprehension`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/list-comprehension/))
- Low code quality found in LMEvalInline.build\_command - 25% ([`low-code-quality`](https://docs.sourcery.ai/Reference/Default-Rules/comments/low-code-quality/))

<br/><details><summary>Explanation</summary>



The quality score for this function is below the quality threshold of 25%.
This score is a combination of the method length, cognitive complexity and working memory.

How can you solve this?

It might be worth refactoring this function to make it shorter and more readable.

- Reduce the function length by extracting pieces of functionality out into
  their own functions. This is the most important thing you can do - ideally a
  function should be less than 10 lines.
- Reduce nesting, perhaps by introducing guard clauses to return early.
- Ensure that variables are tightly scoped, so that code using related concepts
  sits together within the function rather than being scattered.</details>
</issue_to_address>
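
As a sketch of the extraction being suggested, the flag assembly could move into small pure helpers that build_command simply concatenates; the helper names and sample values below are illustrative:

```python
def format_model_args(model_args: dict) -> list[str]:
    """Render everything except model_type as a single --model_args key=value,... flag."""
    pairs = [
        f"{key}={value}"
        for key, value in model_args.items()
        if key != "model_type" and value is not None
    ]
    return ["--model_args", ",".join(pairs)] if pairs else []


def format_extra_flags(lmeval_args: dict) -> list[str]:
    """Render remaining lm_eval options as --some-flag value pairs."""
    flags: list[str] = []
    for key, value in lmeval_args.items():
        if value is not None:
            flags.extend([f"--{key.replace('_', '-')}", str(value)])
    return flags


cmd = ["lm_eval", "--model", "local-completions"]
cmd += format_model_args({"model": "granite", "base_url": "http://localhost:8080/v1/completions"})
cmd += ["--tasks", "arc_easy", "--limit", "2"]
cmd += format_extra_flags({"num_fewshot": 5})
print(" ".join(cmd))
```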

### Comment 12
<location> `src/llama_stack_provider_lmeval/inline/lmeval.py:300` </location>
<code_context>
    def _parse_lmeval_results(self, stdout: str) -> dict:
        """Parse evaluation results from stdout."""
        # Try to find JSON results in the output
        json_pattern = r'\{[^{}]*"results"[^{}]*\}'
        json_matches = re.findall(json_pattern, stdout, re.DOTALL)

        scores = {}

        if json_matches:
            try:
                # Parse JSON results
                for match in json_matches:
                    result_data = json.loads(match)
                    if "results" in result_data:
                        for task, metrics in result_data["results"].items():
                            for metric, value in metrics.items():
                                if isinstance(value, int | float):
                                    score_key = f"{task}:{metric}"

                                    scores[score_key] = ScoringResult(
                                        aggregated_results={metric: value},
                                        score_rows=[{"score": value}]
                                    )
            except json.JSONDecodeError:
                logger.warning("Failed to parse JSON results from lm_eval output")

        # If no JSON found, try to parse table format
        if not scores:
            lines = stdout.split('\n')
            for i, line in enumerate(lines):
                if 'Metric' in line and '|' in line:
                    # Found table header, parse subsequent lines
                    for j in range(i + 2, len(lines)):  # Skip header separator
                        if not lines[j].strip() or '|' not in lines[j]:
                            break
                        parts = [p.strip() for p in lines[j].split('|') if p.strip()]
                        if len(parts) >= 5:
                            task = parts[0] or "arc_easy"  # Default task name
                            metric = parts[4]
                            try:
                                value = float(parts[6]) if len(parts) > 6 else 0.0
                                score_key = f"{task}:{metric}"
                                scores[score_key] = ScoringResult(
                                    aggregated_results={metric: value},
                                    score_rows=[{"score": value}]
                                )
                            except (ValueError, IndexError):
                                continue
                    break

        return EvaluateResponse(generations=[], scores=scores)

</code_context>

<issue_to_address>
**issue (code-quality):** Low code quality found in LMEvalInline.\_parse\_lmeval\_results - 18% ([`low-code-quality`](https://docs.sourcery.ai/Reference/Default-Rules/comments/low-code-quality/))

<br/><details><summary>Explanation</summary>The quality score for this function is below the quality threshold of 25%.
This score is a combination of the method length, cognitive complexity and working memory.

How can you solve this?

It might be worth refactoring this function to make it shorter and more readable.

- Reduce the function length by extracting pieces of functionality out into
  their own functions. This is the most important thing you can do - ideally a
  function should be less than 10 lines.
- Reduce nesting, perhaps by introducing guard clauses to return early.
- Ensure that variables are tightly scoped, so that code using related concepts
  sits together within the function rather than being scattered.</details>
</issue_to_address>
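
One way to start shrinking this method is to pull out a pure helper that flattens an already-parsed results dict into scores, leaving the JSON discovery and the table fallback as separate small steps; the metric key below mimics lm_eval's usual output shape but is only illustrative:

```python
def scores_from_results(results: dict) -> dict[str, float]:
    """Flatten an lm_eval results payload into {"task:metric": value} numeric scores."""
    scores: dict[str, float] = {}
    for task, metrics in results.get("results", {}).items():
        for metric, value in metrics.items():
            if isinstance(value, (int, float)):
                scores[f"{task}:{metric}"] = value
    return scores


sample = {"results": {"arc_easy": {"acc,none": 0.81, "alias": "arc_easy"}}}
assert scores_from_results(sample) == {"arc_easy:acc,none": 0.81}
```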

### Comment 13
<location> `src/llama_stack_provider_lmeval/inline/lmeval.py:423-424` </location>
<code_context>
    async def job_cancel(self, benchmark_id: str, job_id: str) -> None:
        """Cancel a running evaluation job.

        Args:
            benchmark_id: The ID of the benchmark to run the evaluation on.
            job_id: The ID of the job to cancel.
        """
        job = next((j for j in self._jobs if j.job_id == job_id))
        if not job:
            logger.warning("Job %s not found", job_id)
            return

        if job.status in [JobStatus.completed, JobStatus.failed, JobStatus.cancelled]:
            logger.warning("Job %s is not running", job_id)

        elif job.status in [JobStatus.in_progress, JobStatus.scheduled]:
            process_id = self._job_metadata.get(job_id, {}).get("process_id")
            if process_id:
                process_id = int(process_id)
                logger.info("Killing process %s", process_id)
                try:
                    os.kill(process_id, signal.SIGTERM)
                    await asyncio.sleep(2)
                except ProcessLookupError:
                    logger.warning("Process %s not found", process_id)
                try:
                    os.kill(process_id, signal.SIGKILL)
                except ProcessLookupError:
                    logger.warning("Process %s not found", process_id)
            job.status = JobStatus.cancelled
            logger.info("Successfully cancelled job %s", job_id)

</code_context>

<issue_to_address>
**suggestion (code-quality):** Use named expression to simplify assignment and conditional ([`use-named-expression`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/use-named-expression/))

```suggestion
            if process_id := self._job_metadata.get(job_id, {}).get("process_id"):
```
</issue_to_address>

### Comment 14
<location> `src/llama_stack_provider_lmeval/inline/lmeval.py:453` </location>
<code_context>
    async def job_result(self, benchmark_id: str, job_id: str) -> EvaluateResponse:
        """Get the results of a completed evaluation job.

        Args:
            benchmark_id: The ID of the benchmark to run the evaluation on.
            job_id: The ID of the job to get the results of.
        """
        job = await self.job_status(benchmark_id, job_id)

        if job is None:
            logger.warning("Job %s not found", job_id)
            return EvaluateResponse(generations=[], scores={})

        if job.status == JobStatus.completed:
            # Get results from job metadata
            job_metadata = self._job_metadata.get(job_id, {})
            results = job_metadata.get("results")
            if results:
                return results
            else:
                logger.warning("No results found for completed job %s", job_id)
                return EvaluateResponse(generations=[], scores={})
        elif job.status == JobStatus.failed:
            logger.warning("Job %s failed", job_id)
            error_msg = self._job_metadata.get(job_id, {}).get("error", "Unknown error")
            return EvaluateResponse(generations=[], scores={}, metadata={"error": error_msg})
        else:
            logger.warning("Job %s is still running", job_id)
            return EvaluateResponse(generations=[], scores={}, metadata={"status": "running"})

</code_context>

<issue_to_address>
**issue (code-quality):** We've found these issues:

- Use named expression to simplify assignment and conditional ([`use-named-expression`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/use-named-expression/))
- Remove unnecessary else after guard condition ([`remove-unnecessary-else`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/remove-unnecessary-else/))
</issue_to_address>

@christinaexyou christinaexyou force-pushed the RHOAIENG-33971-lmeval-inline branch 2 times, most recently from eb8a389 to 867da4d on September 19, 2025 15:13
Contributor

@sourcery-ai sourcery-ai bot left a comment


New security issues found

@christinaexyou christinaexyou force-pushed the RHOAIENG-33971-lmeval-inline branch from 867da4d to 0a2e7f0 on September 19, 2025 15:44
@christinaexyou christinaexyou force-pushed the RHOAIENG-33971-lmeval-inline branch from 60644a9 to 4596c27 on September 26, 2025 13:15
@ruivieira ruivieira changed the title feat(RHOAIENG-33971): Add LM-Eval inline provider and unit tests [Do not merge] feat(RHOAIENG-33971): Add LM-Eval inline provider and unit tests Sep 26, 2025

@saichandrapandraju saichandrapandraju left a comment


LGTM! Thank you @christinaexyou, just a couple of nits from me.


self._job_metadata[job_id]["process_id"] = str(process.pid)

stdout, stderr = await asyncio.wait_for(process.communicate(), timeout=300)


shall we optionally parameterize the timeout?
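
A sketch of how the timeout could become a keyword argument with the current value as the default, shown standalone with a harmless command:

```python
import asyncio
import sys


async def run_with_timeout(cmd: list[str], timeout: float = 300.0) -> tuple[str, str]:
    """Run a command, letting callers override how long to wait for it."""
    process = await asyncio.create_subprocess_exec(
        *cmd, stdout=asyncio.subprocess.PIPE, stderr=asyncio.subprocess.PIPE
    )
    try:
        stdout, stderr = await asyncio.wait_for(process.communicate(), timeout=timeout)
    except asyncio.TimeoutError:
        process.kill()  # don't leave the child running after giving up on it
        raise
    return stdout.decode(), stderr.decode()


print(asyncio.run(run_with_timeout([sys.executable, "--version"], timeout=30.0)))
```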

results_data, job_id
)
# Store the parsed results in job metadata
self._job_metadata[job_id]["results"] = parsed_results


nit: if result size is not small, maybe we can save it in the files store and just track the results_file id.


# Create model args - use VLLM_URL environment variable for inference provider
inference_url = os.environ.get("VLLM_URL", "http://localhost:8080/v1")
openai_url = inference_url.replace("/v1", "/v1/completions")


I think we need some checks here to make sure we're not polluting the URL.

IIRC, if we need the OpenAI-compatible URL, it should end with openai/v1/completions.
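
A sketch of the kind of guard this could be; the exact expected suffix (with or without an openai/ path segment) depends on the deployment, so it is an assumption here:

```python
def to_completions_url(base_url: str) -> str:
    """Turn a VLLM_URL-style base into a completions endpoint without double-appending."""
    url = base_url.rstrip("/")
    if url.endswith("/v1/completions"):
        return url
    if url.endswith("/v1"):
        return f"{url}/completions"
    # Anything else is rejected rather than silently producing a malformed URL.
    raise ValueError(f"Unexpected inference URL: {base_url!r}")


assert to_completions_url("http://localhost:8080/v1") == "http://localhost:8080/v1/completions"
assert to_completions_url("http://localhost:8080/v1/") == "http://localhost:8080/v1/completions"
```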

max_tokens: ${env.VLLM_MAX_TOKENS:=4096}
api_token: ${env.VLLM_API_TOKEN:=fake}
tls_verify: ${env.VLLM_TLS_VERIFY:=false}
eval:

@saichandrapandraju saichandrapandraju Sep 29, 2025


nit: shall we start using the module: way of specifying the provider, since lls-core is moving away from external_providers_dir?

