
Conversation


@0xCUB3 0xCUB3 commented Oct 1, 2025

Added Verifiers

  • HasPythonCodeListing: Smart extraction from markdown/plain text with multi-block scoring
  • PythonCodeParses: AST-based syntax validation
  • PythonValidImports: Import availability checking (supports venv paths as well)
  • PythonExecutesWithoutError: Isolated execution with timeout protection (could potentially be made more robust)
  • PythonHasFunctionDef: Function definition validation
  • PythonHasClassDef: Class definition validation
  • PythonMatchesExamples: Functional correctness testing with examples provided by the user

Since the "good" code isn't always at the beginning or the end, I handle common LLM patterns (function + tests, bad -> good examples, code evolution) using a basic scoring algorithm based on length, structure, test indicators, and context cues. It worked well for about 50 code samples of varying complexity.
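
For illustration, here's a minimal sketch of that kind of scoring heuristic (the weights, feature checks, and score_block itself are hypothetical, not the actual python.py logic):

import ast

def score_block(code: str) -> float:
    # Length: longer blocks tend to be the "real" implementation (capped)
    score = float(min(len(code.splitlines()), 50))
    try:
        # Structure: reward parseable code and definitions
        tree = ast.parse(code)
        score += 5.0
        score += 2.0 * sum(
            isinstance(node, (ast.FunctionDef, ast.ClassDef))
            for node in ast.walk(tree)
        )
    except SyntaxError:
        score -= 10.0
    # Test indicators: asserts or a main guard suggest a complete example
    if "assert" in code or "if __name__" in code:
        score += 3.0
    return score

# Pick the highest-scoring candidate block extracted from the LLM output
blocks = ["x = ", "def add(a, b):\n    return a + b\n\nassert add(1, 2) == 3"]
best = max(blocks, key=score_block)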

Validation

Tested with Ollama + Granite 3.3 8B. Tests covered extraction edge cases, syntax validation, execution, and integration scenarios.

Note: Some test cases were generated with AI assistance and reviewed/validated by me.

Review Before Merging

  • Smart extraction logic (python.py:19-112) on more complicated code samples


mergify bot commented Oct 1, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert|release)(?:\(.+\))?:

@0xCUB3 0xCUB3 changed the title Add Python code verifiers for Mellea-generated code feat: Add Python code verifiers for Mellea-generated code Oct 1, 2025
@nrfulton nrfulton requested a review from avinash2692 October 2, 2025 15:14
0xCUB3 and others added 3 commits October 2, 2025 13:09
Uses LLM to generate test cases from docstring specifications, then
validates generated code against those tests. Supports temperature-based
sampling strategies for test generation.
@0xCUB3
Author

0xCUB3 commented Oct 5, 2025

I just added a PythonMatchesDocstring verifier for validating generated code against natural language specifications using LLM-generated test cases.

Design Rationale

Evaluated three approaches for LLM-based code verification. Direct LLM judgment (querying the model with "does this code satisfy the specification?") is inconsistent across runs and incurs a per-validation LLM call during rejection sampling, so it would be prohibitively expensive.

I then perused the web and found an interesting write-up on Stanford's Clover and its multi-consistency verification approach (http://ai.stanford.edu/blog/clover/), which generates N independent implementations and validates through consensus on test inputs. The main problem I see is that it multiplies generation cost by N per validation, which isn't great for our purposes.

I also read into Python's doctest framework (https://docs.python.org/3/library/doctest.html) as a model for specification-driven testing. The core idea of embedding executable test cases in specifications is sound, but doctest's rigid format doesn't mesh well with flexible LLM outputs, so I had to scrap this idea.

So herein is a sort of hybrid mish-mash, if you will: LLM-based test generation with deterministic validation. This architecture amortizes the expensive LLM call (one-time test case generation) across multiple cheap validation passes (deterministic code execution during rejection sampling).

Implementation

I implemented two-phase verification. Phase one (generate_tests(temperature)) uses the LLM to parse the docstring specification and generate test cases as structured JSON, mapping input dictionaries to expected outputs. Phase two (_validate()) executes the generated code against these test cases deterministically.
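
For illustration, the two phases might fit together like this (the exact JSON schema and the run_tests helper are assumptions for the sketch, not the merged API):

# Hypothetical shape of the structured JSON produced in phase one
test_cases = [
    {"inputs": {"n": 0}, "expected": 1},
    {"inputs": {"n": 5}, "expected": 120},
]

# Roughly the phase-two idea: deterministic execution against the cases
def run_tests(func, cases) -> bool:
    return all(func(**case["inputs"]) == case["expected"] for case in cases)

def factorial(n: int) -> int:  # stand-in for the generated code
    return 1 if n <= 1 else n * factorial(n - 1)

assert run_tests(factorial, test_cases)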

The temperature parameter controls test-generation stochasticity. I went with 0.3 as a good balance, but here are my general findings: low temperature (0.2) produces deterministic, consistent but unadventurous test suites; medium (0.5) balances coverage and diversity; high (0.8) increases exploration of edge cases and the parameter space but can be too much. Validated with Granite 3.3 8B; YMMV with different models.

Usage

# Assumes an active Mellea session
verifier = PythonMatchesDocstring("Calculate factorial of n", num_tests=5)
verifier.generate_tests(temperature=0.3)  # Note the one-time LLM call

result = session.instruct(
    "Write a factorial function",
    requirements=[verifier],
)

Problems

The LLM still occasionally doesn't use the correct debug params. This might be a larger issue, but it only happened once for me in the final iteration. Worth testing, though.

@nrfulton nrfulton self-requested a review October 9, 2025 22:13
Contributor

@nrfulton nrfulton left a comment


LGTM other than the unsafe code execution.

Comment on lines 322 to 350
# Create temporary file and execute
with tempfile.NamedTemporaryFile(mode="w", suffix=".py", delete=False) as f:
    f.write(code)
    temp_file = f.name

try:
    result = subprocess.run(
        [sys.executable, temp_file], capture_output=True, text=True, timeout=timeout
    )

    if result.returncode == 0:
        return ValidationResult(result=True, reason="Code executed successfully")
    else:
        return ValidationResult(
            result=False,
            reason=f"Execution failed with error: {result.stderr[:200]}",
        )
except subprocess.TimeoutExpired:
    return ValidationResult(
        result=False, reason=f"Execution timed out after {timeout} seconds"
    )
except Exception as e:
    return ValidationResult(result=False, reason=f"Execution error: {e!s}")
finally:
    # Clean up temp file
    try:
        Path(temp_file).unlink()
    except Exception:
        pass
Contributor


We should use ibm/guardx or something similar here. There are now good WASM solutions which might prove better than containers long term.

@nrfulton nrfulton removed the request for review from avinash2692 October 9, 2025 22:22
@nrfulton
Contributor

@0xCUB3 Thanks for the contribution!

LGTM other than sandboxing code execution. Can you please make that update, then comment here indicating it's ready for review?

@0xCUB3
Author

0xCUB3 commented Oct 10, 2025

That's a good point. It is unsafe, so I've done a bit of research on some options.

  1. IBM GuardX (your suggestion). Full Python stdlib support, Docker-based isolation with syscall filtering, and it's an in-house solution. The downside is that it requires Docker, and users need to run guardx init, which can take a long time (not to mention resources) to build the containers. Definitely secure, but heavy for a verification library. Or is there a simpler approach I'm missing?

  2. WebAssembly sandboxing. I looked at wasmtime-py with VMware's Python WASM build and wasm-py-sandbox from PyPI. The wasmtime approach has no Docker requirement and good resource limiting via its "fuel" mechanism, but you need to bundle a 50MB+ Python WASM binary, and stdlib support is limited (no numpy, pandas, etc.). The PyPI package is alpha quality from 2023 and uses RustPython instead of CPython (probably not a huge deal, though much less mature). All WASM solutions seem to share the same general issue: you need a Python interpreter compiled to WASM, which is large and has compatibility issues if your code relies on a bunch of third-party dependencies.

  3. Ignore this issue and just make execution opt-in with an explicit flag. Change PythonExecutesWithoutError() to skip execution by default and require allow_unsafe=True for subprocess execution, or use_sandbox=True for GuardX/WASM or whatever we choose (sketched after this list). This forces users to think about security and prevents accidental unsafe execution. The downside is that it's a breaking change and doesn't solve sandboxing, just makes the risk explicit.
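
To make option 3 concrete, the call sites would differ only in the flag (allow_unsafe and use_sandbox come from the proposal above; treat the exact signatures as illustrative):

PythonExecutesWithoutError()                   # default: skip execution entirely
PythonExecutesWithoutError(allow_unsafe=True)  # explicit opt-in: subprocess execution
PythonExecutesWithoutError(use_sandbox=True)   # GuardX/WASM, once we pick one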

What direction do you think we should follow?

0xCUB3 and others added 5 commits October 15, 2025 15:36
Introduces ExecutionBackend abstractions for validating or executing Python code, with SafeBackend (syntax check only) as the default and UnsafeBackend (subprocess execution) as an opt-in. Adds import restrictions for unsafe execution and updates PythonExecutesWithoutError to support these modes.
@0xCUB3
Author

0xCUB3 commented Oct 23, 2025

I added the unsafe execution backend. A full implementation would include something like GuardX for a sandboxed option. Another alternative could be llm-sandbox, though an in-house solution is probably the best idea.
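
For reference, a minimal sketch of what the backend split looks like (only the ExecutionBackend/SafeBackend/UnsafeBackend names come from the commit message; every signature and detail here is an assumption, and the import restrictions are omitted):

from abc import ABC, abstractmethod
import ast
import subprocess
import sys
import tempfile
from pathlib import Path

class ExecutionBackend(ABC):
    @abstractmethod
    def run(self, code: str, timeout: int = 10) -> tuple[bool, str]:
        """Return (success, detail) for the given code."""

class SafeBackend(ExecutionBackend):
    """Default backend: syntax check only, never executes the code."""
    def run(self, code: str, timeout: int = 10) -> tuple[bool, str]:
        try:
            ast.parse(code)
            return True, "Code parses"
        except SyntaxError as e:
            return False, f"Syntax error: {e}"

class UnsafeBackend(ExecutionBackend):
    """Opt-in backend: subprocess execution, no sandbox."""
    def run(self, code: str, timeout: int = 10) -> tuple[bool, str]:
        with tempfile.NamedTemporaryFile(mode="w", suffix=".py", delete=False) as f:
            f.write(code)
            temp_file = f.name
        try:
            result = subprocess.run(
                [sys.executable, temp_file],
                capture_output=True, text=True, timeout=timeout,
            )
            return result.returncode == 0, result.stderr[:200]
        except subprocess.TimeoutExpired:
            return False, f"Timed out after {timeout} seconds"
        finally:
            Path(temp_file).unlink(missing_ok=True)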
