feat: Add Python code verifiers for Mellea-generated code #171
Conversation
Uses an LLM to generate test cases from docstring specifications, then validates generated code against those tests. Supports temperature-based sampling strategies for test generation.
I just added a design rationale.

Design Rationale

I evaluated three approaches for LLM-based code verification. Direct LLM judgment (querying the model with "does this code satisfy the specification?") is trivially inconsistent and incurs a per-validation LLM cost during rejection sampling, so it would be prohibitively expensive. I then came across Stanford's Clover and its multi-consistency verification approach (http://ai.stanford.edu/blog/clover/), which generates N independent implementations and validates through consensus on test inputs. The main problem I see with this approach is that it multiplies generation cost by N per validation, which doesn't suit our purposes. I also looked at Python's doctest framework (https://docs.python.org/3/library/doctest.html) as a model for specification-driven testing. The core idea of embedding executable test cases in specifications is sound, but doctest's rigid format doesn't mesh well with flexible LLM outputs, so I scrapped that idea.

So what's here is a hybrid of sorts: LLM-based test generation with deterministic validation. This architecture amortizes the expensive LLM call (one-time test case generation) across multiple cheap validation passes (deterministic code execution during rejection sampling).

Implementation

Implemented two-phase verification. Phase one generates a test suite from the docstring specification (the one-time LLM call); phase two validates candidate code by executing it against those tests deterministically. A minimal sketch of how the two phases fit together follows at the end of this comment.

The temperature parameter controls test-generation stochasticity. I went with 0.3 as a good balance, but these were my general findings: low temperature (0.2) produces deterministic, consistent test suites, but unadventurous ones; medium (0.5) balances coverage and diversity; high (0.8) increases exploration of edge cases and the parameter space but can be too much. Validated with Granite 3.3 8B; YMMV with different models.

Usage

verifier = PythonMatchesDocstring("Calculate factorial of n", num_tests=5)
verifier.generate_tests(temperature=0.3)  # note the one-time LLM call
result = session.instruct(
    "Write a factorial function",
    requirements=[verifier],
)

Problems

The LLM still occasionally doesn't use the correct debug params. This might be a larger issue, but it happened only once for me in the final iteration. Worth testing, though.
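To make the two-phase flow concrete, here is a minimal sketch under the assumptions above. The class name comes from the usage snippet, but call_llm, the prompt wording, and the validate method are illustrative placeholders rather than the actual Mellea API.

import subprocess
import sys
import tempfile
from pathlib import Path


def call_llm(prompt: str, temperature: float) -> list[str]:
    # Hypothetical helper standing in for the model call; swap in a real backend.
    raise NotImplementedError


class PythonMatchesDocstring:
    """Sketch: generate tests from the docstring once, then validate candidates cheaply."""

    def __init__(self, docstring: str, num_tests: int = 5):
        self.docstring = docstring
        self.num_tests = num_tests
        self.tests: list[str] = []  # assert statements produced in phase one

    def generate_tests(self, temperature: float = 0.3) -> None:
        # Phase one: the one-time LLM call, which should return executable
        # assert statements such as "assert factorial(3) == 6".
        prompt = (
            f"Write {self.num_tests} Python assert statements for a function "
            f"described by: {self.docstring}"
        )
        self.tests = call_llm(prompt, temperature=temperature)

    def validate(self, code: str, timeout: int = 10) -> bool:
        # Phase two: deterministic execution of the candidate code plus the cached tests.
        program = code + "\n\n" + "\n".join(self.tests)
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(program)
            path = Path(f.name)
        try:
            proc = subprocess.run(
                [sys.executable, str(path)],
                capture_output=True,
                text=True,
                timeout=timeout,
            )
            return proc.returncode == 0
        finally:
            path.unlink(missing_ok=True)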
LGTM other than the unsafe code execution.
mellea/stdlib/reqlib/python.py
Outdated
# Excerpt from the subprocess execution path. Assumes `code` and `timeout` are
# parameters of the enclosing function and that ValidationResult is in scope.
# Create temporary file and execute
with tempfile.NamedTemporaryFile(mode="w", suffix=".py", delete=False) as f:
    f.write(code)
    temp_file = f.name

try:
    result = subprocess.run(
        [sys.executable, temp_file], capture_output=True, text=True, timeout=timeout
    )

    if result.returncode == 0:
        return ValidationResult(result=True, reason="Code executed successfully")
    else:
        return ValidationResult(
            result=False,
            reason=f"Execution failed with error: {result.stderr[:200]}",
        )
except subprocess.TimeoutExpired:
    return ValidationResult(
        result=False, reason=f"Execution timed out after {timeout} seconds"
    )
except Exception as e:
    return ValidationResult(result=False, reason=f"Execution error: {e!s}")
finally:
    # Clean up temp file
    try:
        Path(temp_file).unlink()
    except Exception:
        pass
We should use ibm/guardx or something similar here. There are now good WASM-based sandboxes, which might prove a better long-term option than containers.
@0xCUB3 Thanks for the contribution! LGTM other than sandboxing code execution. Can you please make that update, then comment here indicating it's ready for review?
That's a good point. It is unsafe, so I've done a little bit of research on some options.
What direction do you think we should follow?
Introduces ExecutionBackend abstractions for validating or executing Python code, with SafeBackend (syntax check only) as the default and UnsafeBackend (subprocess execution) as an option. Adds import restrictions for unsafe execution and updates PythonExecutesWithoutError to support these modes.
…3/mellea into feat/python-code-verifiers
I added the unsafe execution backend. A full implementation would include something like GuardX for a sandboxed option. Another alternative could be llm-sandbox, though an in-house solution is probably the best idea.
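For reference, here is a minimal sketch of what the SafeBackend / UnsafeBackend split could look like. The class names follow the commit description above, but the run signature, the blocked-import list, and the details are assumptions, not the PR's actual implementation.

import ast
import subprocess
import sys
import tempfile
from pathlib import Path

BLOCKED_IMPORTS = {"os", "subprocess", "shutil", "socket"}  # illustrative list only


class SafeBackend:
    """Default: validate syntax only, never execute the code."""

    def run(self, code: str, timeout: int = 5) -> tuple[bool, str]:
        try:
            ast.parse(code)
            return True, "Syntax OK (code not executed)"
        except SyntaxError as e:
            return False, f"Syntax error: {e}"


class UnsafeBackend:
    """Opt-in: execute in a subprocess, after a crude import-blocklist check."""

    def run(self, code: str, timeout: int = 5) -> tuple[bool, str]:
        try:
            tree = ast.parse(code)
        except SyntaxError as e:
            return False, f"Syntax error: {e}"
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                names = {alias.name.split(".")[0] for alias in node.names}
            elif isinstance(node, ast.ImportFrom):
                names = {(node.module or "").split(".")[0]}
            else:
                continue
            blocked = names & BLOCKED_IMPORTS
            if blocked:
                return False, f"Blocked imports: {sorted(blocked)}"
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
            path = Path(f.name)
        try:
            proc = subprocess.run(
                [sys.executable, str(path)],
                capture_output=True,
                text=True,
                timeout=timeout,
            )
            return proc.returncode == 0, proc.stderr[:200]
        except subprocess.TimeoutExpired:
            return False, f"Timed out after {timeout}s"
        finally:
            path.unlink(missing_ok=True)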
Added Verifiers
Since the "good" code isn't always at the beginning or the end of a response, I handle common LLM output patterns (function + tests, bad-to-good examples, code evolution) with a basic scoring algorithm based on length, structure, test indicators, and context cues, as sketched below. This worked well for about 50 code samples of varying complexity.
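The sketch below illustrates the flavor of that scoring; the weights, cue words, and function names are made up for illustration and are not the PR's actual values.

import re


def score_code_block(block: str, preceding_text: str = "") -> float:
    # Hypothetical heuristic: reward structure and favorable context cues,
    # penalize blocks that look like tests or explicitly "bad" examples.
    score = 0.0
    lines = [ln for ln in block.splitlines() if ln.strip()]
    score += min(len(lines), 30) * 0.5                    # length, capped
    if re.search(r"^\s*def \w+\(", block, re.M):          # structure: defines a function
        score += 5.0
    if re.search(r"^\s*(assert |def test_)", block, re.M):
        score -= 3.0                                      # test indicators
    ctx = preceding_text.lower()
    if any(cue in ctx for cue in ("final", "corrected", "improved")):
        score += 2.0                                      # context cues for the "good" block
    if any(cue in ctx for cue in ("wrong", "incorrect", "buggy", "bad example")):
        score -= 4.0
    return score


def pick_best_block(blocks: list[tuple[str, str]]) -> str:
    # `blocks` holds (preceding_text, code) pairs extracted from the LLM response.
    return max(blocks, key=lambda b: score_code_block(b[1], b[0]))[1]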
Validation
Tested with Ollama + Granite 3.3 8B. Tests covered extraction edge cases, syntax validation, execution, and integration scenarios; one representative extraction case is sketched after the note below.
Note: Some test cases were generated with AI assistance and reviewed/validated by me.
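As one example of the extraction edge cases referred to above (a hypothetical test, not taken from the suite; extract_code_block is an assumed helper name):

def test_extraction_prefers_corrected_function():
    # Response contains a bad example followed by the corrected version;
    # extraction should pick the latter.
    response = (
        "Here is a buggy attempt:\n"
        "```python\n"
        "def factorial(n):\n"
        "    return n\n"
        "```\n"
        "And here is the corrected version:\n"
        "```python\n"
        "def factorial(n):\n"
        "    return 1 if n <= 1 else n * factorial(n - 1)\n"
        "```\n"
    )
    code = extract_code_block(response)  # hypothetical helper
    assert "n * factorial(n - 1)" in code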
Review Before Merging