feat: Add Python code verifiers for Mellea-generated code #171
Conversation
Uses an LLM to generate test cases from docstring specifications, then validates generated code against those tests. Supports temperature-based sampling strategies for test generation.
I just added a design rationale.

Design Rationale

I evaluated three approaches for LLM-based code verification. Direct LLM judgment (querying the model with "does this code satisfy the specification?") is trivially inconsistent and incurs a per-validation LLM cost during rejection sampling, so it would be prohibitively expensive. I then came across Stanford's Clover and its multi-consistency verification approach (http://ai.stanford.edu/blog/clover/), which generates N independent implementations and validates through consensus on test inputs. The main problem I see with this approach is that it multiplies generation cost by N per validation, which doesn't suit our purposes. I also looked at Python's doctest framework (https://docs.python.org/3/library/doctest.html) as a model for specification-driven testing. The core idea of embedding executable test cases in specifications is sound, but doctest's rigid format doesn't mesh well with flexible LLM outputs, so I scrapped that idea.

So what's here is a hybrid of sorts: LLM-based test generation with deterministic validation. This architecture amortizes the expensive LLM call (one-time test case generation) across multiple cheap validation passes (deterministic code execution during rejection sampling).

Implementation

Implemented two-phase verification. Phase one generates a test suite from the docstring specification (the one-time LLM call); phase two validates candidate code by executing it against those tests deterministically. A minimal sketch of how the two phases fit together follows at the end of this comment.

The temperature parameter controls test-generation stochasticity. I went with 0.3 as a good balance, but these were my general findings: low temperature (0.2) produces deterministic, consistent test suites, but unadventurous ones; medium (0.5) balances coverage and diversity; high (0.8) increases exploration of edge cases and the parameter space but can be too much. Validated with Granite 3.3 8B; YMMV with different models.

Usage

verifier = PythonMatchesDocstring("Calculate factorial of n", num_tests=5)
verifier.generate_tests(temperature=0.3)  # note the one-time LLM call
result = session.instruct(
    "Write a factorial function",
    requirements=[verifier],
)

Problems

The LLM still occasionally doesn't use the correct debug params. This might be a larger issue, but it happened only once for me in the final iteration. Worth testing, though.
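To make the two-phase flow concrete, here is a minimal sketch under the assumptions above. The class name comes from the usage snippet, but call_llm, the prompt wording, and the validate method are illustrative placeholders rather than the actual Mellea API.

import subprocess
import sys
import tempfile
from pathlib import Path


def call_llm(prompt: str, temperature: float) -> list[str]:
    # Hypothetical helper standing in for the model call; swap in a real backend.
    raise NotImplementedError


class PythonMatchesDocstring:
    """Sketch: generate tests from the docstring once, then validate candidates cheaply."""

    def __init__(self, docstring: str, num_tests: int = 5):
        self.docstring = docstring
        self.num_tests = num_tests
        self.tests: list[str] = []  # assert statements produced in phase one

    def generate_tests(self, temperature: float = 0.3) -> None:
        # Phase one: the one-time LLM call, which should return executable
        # assert statements such as "assert factorial(3) == 6".
        prompt = (
            f"Write {self.num_tests} Python assert statements for a function "
            f"described by: {self.docstring}"
        )
        self.tests = call_llm(prompt, temperature=temperature)

    def validate(self, code: str, timeout: int = 10) -> bool:
        # Phase two: deterministic execution of the candidate code plus the cached tests.
        program = code + "\n\n" + "\n".join(self.tests)
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(program)
            path = Path(f.name)
        try:
            proc = subprocess.run(
                [sys.executable, str(path)],
                capture_output=True,
                text=True,
                timeout=timeout,
            )
            return proc.returncode == 0
        finally:
            path.unlink(missing_ok=True)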
LGTM other than the unsafe code execution.
mellea/stdlib/reqlib/python.py
Outdated
# Excerpt from the subprocess execution path. Assumes `code` and `timeout` are
# parameters of the enclosing function and that ValidationResult is in scope.
# Create temporary file and execute
with tempfile.NamedTemporaryFile(mode="w", suffix=".py", delete=False) as f:
    f.write(code)
    temp_file = f.name

try:
    result = subprocess.run(
        [sys.executable, temp_file], capture_output=True, text=True, timeout=timeout
    )

    if result.returncode == 0:
        return ValidationResult(result=True, reason="Code executed successfully")
    else:
        return ValidationResult(
            result=False,
            reason=f"Execution failed with error: {result.stderr[:200]}",
        )
except subprocess.TimeoutExpired:
    return ValidationResult(
        result=False, reason=f"Execution timed out after {timeout} seconds"
    )
except Exception as e:
    return ValidationResult(result=False, reason=f"Execution error: {e!s}")
finally:
    # Clean up temp file
    try:
        Path(temp_file).unlink()
    except Exception:
        pass
We should use ibm/guardx or something similar here. There are now good WASM-based sandboxes, which might prove a better long-term option than containers.
@0xCUB3 Thanks for the contribution! LGTM other than sandboxing code execution. Can you please make that update, then comment here indicating it's ready for review?
That's a good point. It is unsafe, so I've done a little bit of research on some options.
What direction do you think we should follow?
Introduces ExecutionBackend abstractions for validating or executing Python code, with SafeBackend (syntax check only) as the default and UnsafeBackend (subprocess execution) as an option. Adds import restrictions for unsafe execution and updates PythonExecutesWithoutError to support these modes.
…3/mellea into feat/python-code-verifiers
I added the unsafe execution backend. A full implementation would include something like GuardX for a sandboxed option. Another alternative could be llm-sandbox, though an in-house solution is probably the best idea.
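For reference, here is a minimal sketch of what the SafeBackend / UnsafeBackend split could look like. The class names follow the commit description above, but the run signature, the blocked-import list, and the details are assumptions, not the PR's actual implementation.

import ast
import subprocess
import sys
import tempfile
from pathlib import Path

BLOCKED_IMPORTS = {"os", "subprocess", "shutil", "socket"}  # illustrative list only


class SafeBackend:
    """Default: validate syntax only, never execute the code."""

    def run(self, code: str, timeout: int = 5) -> tuple[bool, str]:
        try:
            ast.parse(code)
            return True, "Syntax OK (code not executed)"
        except SyntaxError as e:
            return False, f"Syntax error: {e}"


class UnsafeBackend:
    """Opt-in: execute in a subprocess, after a crude import-blocklist check."""

    def run(self, code: str, timeout: int = 5) -> tuple[bool, str]:
        try:
            tree = ast.parse(code)
        except SyntaxError as e:
            return False, f"Syntax error: {e}"
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                names = {alias.name.split(".")[0] for alias in node.names}
            elif isinstance(node, ast.ImportFrom):
                names = {(node.module or "").split(".")[0]}
            else:
                continue
            blocked = names & BLOCKED_IMPORTS
            if blocked:
                return False, f"Blocked imports: {sorted(blocked)}"
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
            path = Path(f.name)
        try:
            proc = subprocess.run(
                [sys.executable, str(path)],
                capture_output=True,
                text=True,
                timeout=timeout,
            )
            return proc.returncode == 0, proc.stderr[:200]
        except subprocess.TimeoutExpired:
            return False, f"Timed out after {timeout}s"
        finally:
            path.unlink(missing_ok=True)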
Added Verifiers
Since the "good" code isn't always at the beginning or the end of a response, I handle common LLM output patterns (function + tests, bad-to-good examples, code evolution) with a basic scoring algorithm based on length, structure, test indicators, and context cues, as sketched below. This worked well for about 50 code samples of varying complexity.
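The sketch below illustrates the flavor of that scoring; the weights, cue words, and function names are made up for illustration and are not the PR's actual values.

import re


def score_code_block(block: str, preceding_text: str = "") -> float:
    # Hypothetical heuristic: reward structure and favorable context cues,
    # penalize blocks that look like tests or explicitly "bad" examples.
    score = 0.0
    lines = [ln for ln in block.splitlines() if ln.strip()]
    score += min(len(lines), 30) * 0.5                    # length, capped
    if re.search(r"^\s*def \w+\(", block, re.M):          # structure: defines a function
        score += 5.0
    if re.search(r"^\s*(assert |def test_)", block, re.M):
        score -= 3.0                                      # test indicators
    ctx = preceding_text.lower()
    if any(cue in ctx for cue in ("final", "corrected", "improved")):
        score += 2.0                                      # context cues for the "good" block
    if any(cue in ctx for cue in ("wrong", "incorrect", "buggy", "bad example")):
        score -= 4.0
    return score


def pick_best_block(blocks: list[tuple[str, str]]) -> str:
    # `blocks` holds (preceding_text, code) pairs extracted from the LLM response.
    return max(blocks, key=lambda b: score_code_block(b[1], b[0]))[1]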
Validation
Tested with Ollama + Granite 3.3 8B. Tests covered extraction edge cases, syntax validation, execution, and integration scenarios; one representative extraction case is sketched after the note below.
Note: Some test cases were generated with AI assistance and reviewed/validated by me.
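As one example of the extraction edge cases referred to above (a hypothetical test, not taken from the suite; extract_code_block is an assumed helper name):

def test_extraction_prefers_corrected_function():
    # Response contains a bad example followed by the corrected version;
    # extraction should pick the latter.
    response = (
        "Here is a buggy attempt:\n"
        "```python\n"
        "def factorial(n):\n"
        "    return n\n"
        "```\n"
        "And here is the corrected version:\n"
        "```python\n"
        "def factorial(n):\n"
        "    return 1 if n <= 1 else n * factorial(n - 1)\n"
        "```\n"
    )
    code = extract_code_block(response)  # hypothetical helper
    assert "n * factorial(n - 1)" in code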
Review Before Merging