Commit 9d12458

feat: Add secure Python code execution with llm-sandbox support (#217)
* feat: add secure Python code execution with llm-sandbox support

  - Add PythonExecutesWithoutError requirement with three execution backends:
    - SafeBackend: Validates syntax and imports without execution (default)
    - UnsafeBackend: Direct subprocess execution with warnings
    - LLMSandboxBackend: Docker-based execution using llm-sandbox
  - Implement allow_unsafe_execution flag with explicit opt-in and warnings
  - Add import restriction support for defense-in-depth security
  - Support use_sandbox flag for secure Docker-based execution
  - Include comprehensive test suite with 21 test cases
  - Maintain backward compatibility while defaulting to safe mode
  - Add llm-sandbox[docker] dependency for optional sandbox functionality

* Refactor Python execution backends and formatting

  Improves code formatting and readability in python.py by splitting long lines, adding whitespace, and updating argument formatting. Also updates test import order in test_reqlib_python.py for consistency.

* refactor: rename PythonExecutesWithoutError to PythonExecutionReq

  As suggested in PR review, the class name now includes 'Req' to better align with naming conventions for Requirement classes.

* refactor: move duplicate __init__ to ExecutionBackend base class

  All subclasses had identical __init__ methods, so this reduces code duplication by implementing it once in the base class.

* refactor: rename Backend to Environment for execution classes

  - Renamed ExecutionBackend to ExecutionEnvironment and all subclasses
  - Updated documentation to clarify that allowed_imports=None means any import is allowed
  - This avoids confusion with the main Backend classes used for LLM generation

* docs: add explanatory comment for code scoring logic

  Clarified why blocks with less than 2 non-trivial lines are penalized and added TODO for future improvements using comment-to-code ratio

* docs: improve error handling and documentation

  - Enhanced docstring to explain code extraction and 'best block' selection
  - Preserve underlying extraction failure reasons for better debugging
  - Error messages now include specific details about why extraction failed

* refactor: remove unused context_text parameter

  Removed the unused context_text parameter from _score_code_block function and updated all call sites to match the simplified signature.

* feat: include specific unauthorized imports in error messages

  - Added _get_unauthorized_imports function to return specific unauthorized import names
  - Enhanced all import restriction error messages to show which imports are unauthorized
  - Improved debugging experience by providing actionable error details
  - Maintained backward compatibility with existing tests

* docs: clarify sys.executable usage in subprocess execution

  Added comment explaining that sys.executable uses the same Python interpreter and environment as the current process, ensuring access to all installed packages and dependencies.

* feat: include stdout in successful execution results

  - Added stdout output to success messages for both subprocess and sandbox execution
  - Provides valuable debugging information and execution feedback
  - Only includes output when present, keeps messages clean when no output
  - Helps users understand what their code actually did during execution

* feat: improve sandbox error handling and logging

  - Added detailed logging for unknown sandbox errors
  - Include exit code and available attributes for better debugging
  - Provide more informative error messages when stderr is not available
  - Helps diagnose sandbox execution issues more effectively

* docs: add detailed scoring metrics to _score_code_block docstring

  Documented all scoring criteria including length bonus, function/class detection, control flow analysis, and non-trivial content filtering to help developers understand code block prioritization logic.

* test: enable sandbox tests with Docker availability check

  - Replaced hard-coded skip with runtime Docker availability detection
  - Added llm_sandbox import check and Docker connectivity test
  - Sandbox tests now run when Docker is available, skip gracefully when not
  - All 21 tests now pass when Docker is running

* fix: address pre-commit issues

  - Fixed ruff formatting across all files
  - Added explicit type annotation for unauthorized list in _get_unauthorized_imports
  - Resolved MyPy type annotation error in python.py

* Add mypy ignores.

* Linting
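The default "safe" mode described in the commit message (validate syntax and imports without executing anything) could be sketched roughly as follows. This is an illustrative sketch only, not the code from this commit: the function names `find_unauthorized_imports` and `validate_without_running` and the message strings are hypothetical. The only behavior taken from the commit message is that `allowed_imports=None` means any import is allowed and that the error names the specific unauthorized imports.

```python
import ast
from typing import Optional


def find_unauthorized_imports(
    tree: ast.AST, allowed_imports: Optional[list[str]]
) -> list[str]:
    """Return top-level module names imported in `tree` that are not allowed.

    Hypothetical helper; `allowed_imports=None` means any import is allowed.
    """
    if allowed_imports is None:
        return []
    unauthorized: list[str] = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # Relative imports such as `from . import x` have module=None.
            names = [node.module] if node.module else []
        else:
            continue
        for name in names:
            top_level = name.split(".")[0]
            if top_level not in allowed_imports and top_level not in unauthorized:
                unauthorized.append(top_level)
    return unauthorized


def validate_without_running(
    code: str, allowed_imports: Optional[list[str]] = None
) -> tuple[bool, str]:
    """Check syntax and imports statically; the code is never executed."""
    try:
        tree = ast.parse(code)
    except SyntaxError as exc:
        return False, f"Syntax error: {exc.msg} (line {exc.lineno})"
    unauthorized = find_unauthorized_imports(tree, allowed_imports)
    if unauthorized:
        return False, f"Unauthorized imports: {', '.join(unauthorized)}"
    return True, "Code is syntactically valid and uses only allowed imports"
```

A static check like this only sees `import` statements; it does not catch dynamic imports such as `__import__("os")`, which is presumably why the commit message describes import restriction as defense in depth rather than as a sandbox.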
1 parent a2e29e6 commit 9d12458

35 files changed: 4181 additions, 2508 deletions

mellea/helpers/event_loop_helper.py

Lines changed: 2 additions & 1 deletion
@@ -20,7 +20,8 @@ def __init__(self):
         """
         self._event_loop = asyncio.new_event_loop()
         self._thread: threading.Thread = threading.Thread(
-            target=self._event_loop.run_forever, daemon=True
+            target=self._event_loop.run_forever,
+            daemon=True,  # type: ignore
         )
         self._thread.start()
 
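For context, the pattern touched by this hunk (an asyncio event loop running forever on a daemon thread) is typically used to run coroutines from synchronous code. The following is a minimal sketch of that pattern, not the helper from this repository: the class name `BackgroundLoop`, its `run` and `close` methods, and the `add` coroutine are hypothetical and exist only for illustration.

```python
import asyncio
import threading


class BackgroundLoop:
    """Hypothetical wrapper: an event loop running on a daemon thread."""

    def __init__(self) -> None:
        self._event_loop = asyncio.new_event_loop()
        self._thread = threading.Thread(
            target=self._event_loop.run_forever,
            daemon=True,  # do not keep the interpreter alive at exit
        )
        self._thread.start()

    def run(self, coro):
        """Schedule `coro` on the background loop and block for its result."""
        future = asyncio.run_coroutine_threadsafe(coro, self._event_loop)
        return future.result()

    def close(self) -> None:
        """Stop the loop from the calling thread, then release its resources."""
        self._event_loop.call_soon_threadsafe(self._event_loop.stop)
        self._thread.join()
        self._event_loop.close()


async def add(a: int, b: int) -> int:
    await asyncio.sleep(0)  # yield to the loop once
    return a + b
```

With this sketch, `BackgroundLoop().run(add(2, 3))` blocks the calling thread until the coroutine finishes on the background loop. The diff itself changes only formatting and adds a `# type: ignore` comment, so runtime behavior is unaffected.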

mellea/stdlib/reqlib/md.py

Lines changed: 6 additions & 6 deletions
@@ -14,11 +14,11 @@ def as_markdown_list(ctx: Context) -> list[str] | None:
     raw_output = ctx.last_output()
     assert raw_output is not None
     try:
-        parsed = mistletoe.Document(raw_output.value)
-        for child in parsed.children:
+        parsed = mistletoe.Document(raw_output.value)  # type: ignore
+        for child in parsed.children:  # type: ignore
             if type(child) is not mistletoe.block_token.List:
                 return None
-            for item in child.children:
+            for item in child.children:  # type: ignore
                 xs.append(mistletoe.base_renderer.BaseRenderer().render(item))
         return xs
     except Exception:
@@ -44,10 +44,10 @@ def _md_table(ctx: Context):
     raw_output = ctx.last_output()
     assert raw_output is not None
     try:
-        parsed = mistletoe.Document(raw_output.value)
-        if len(parsed.children) != 1:
+        parsed = mistletoe.Document(raw_output.value)  # type: ignore
+        if len(parsed.children) != 1:  # type: ignore
             return False
-        return type(parsed.children[0]) is mistletoe.block_token.Table
+        return type(parsed.children[0]) is mistletoe.block_token.Table  # type: ignore
     except Exception:
         return False
 
