Skip to content

Conversation

@devpatelio
Copy link
Collaborator

Working draft and design for RLM (see blog) for SkyRL Gym.

  • REPL execution: Executes Python code from <repl> blocks in model responses
  • Recursive LLM calls: Supports llm_query() for nested LLM calls from code
  • Context management: Loads and exposes context data (strings, dicts, lists) to the REPL
  • Multi-turn loop: Iterates up to max_turns until a final answer is found
  • Final answer extraction: Detects FINAL(...) or FINAL_VAR(name) patterns

Reference Code: rlm/rlm repo

@devpatelio
Copy link
Collaborator Author

/gemini review

gemini-code-assist[bot]

This comment was marked as outdated.

gemini-code-assist[bot]

This comment was marked as outdated.

devpatelio and others added 4 commits January 6, 2026 22:24
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@devpatelio
Copy link
Collaborator Author

/gemini review

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a promising RLM (Recursive Language Model) environment for SkyRL Gym, featuring a sandboxed REPL for executing Python code from model responses and supporting recursive LLM calls. The implementation is well-structured across new files for the environment logic, utilities, and testing. My review focuses on enhancing security, correctness, and maintainability. I've identified a critical sandbox escape vulnerability in the REPL execution that needs immediate attention. Additionally, I've provided suggestions to improve correctness by handling potential null values and unsafe configuration access, and to increase maintainability by adhering to Python best practices for imports and resource management. Overall, this is a strong initial draft, and addressing these points will significantly improve its robustness and security.

Comment on lines +303 to +308
combined = {**self.globals_dict, **self.locals_dict}
exec(code, combined, combined)

for key, value in combined.items():
if key not in self.globals_dict and not key.startswith("_"):
self.locals_dict[key] = value
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

There is a critical security vulnerability here. The exec function is given a combined dictionary for both globals and locals which contains a reference to self.globals_dict['__builtins__']. Malicious code can modify this __builtins__ dictionary in-place (e.g., __builtins__['open'] = some_function), bypassing the sandbox for subsequent code executions within the same step. To fix this, you should provide exec with a copy of the globals, especially the mutable __builtins__ dictionary, to prevent such modifications.

            exec_globals = self.globals_dict.copy()
            exec_globals["__builtins__"] = self.globals_dict["__builtins__"].copy()
            combined = {**exec_globals, **self.locals_dict}
            exec(code, combined, combined)

            for key, value in combined.items():
                if key not in exec_globals and not key.startswith("_"):
                    self.locals_dict[key] = value

Comment on lines +73 to +81
openai_api_key = env_cfg.openai_api_key or os.getenv("OPENAI_API_KEY")
if openai_api_key is None:
raise ValueError("`OPENAI_API_KEY` must be set (as parameter, in env_cfg, or as environment variable)")

base_url = env_cfg.base_url
model = env_cfg.model
init_prompt = env_cfg.init_prompt
if not base_url or not model or not init_prompt:
raise ValueError("env_cfg must include base_url, model, and init_prompt")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Accessing keys from an omegaconf.DictConfig using attribute access (e.g., env_cfg.openai_api_key) will raise an exception if the key is not present, which can lead to unexpected crashes. It is safer to use the .get() method, which allows you to handle cases where a key might be missing.

Suggested change
openai_api_key = env_cfg.openai_api_key or os.getenv("OPENAI_API_KEY")
if openai_api_key is None:
raise ValueError("`OPENAI_API_KEY` must be set (as parameter, in env_cfg, or as environment variable)")
base_url = env_cfg.base_url
model = env_cfg.model
init_prompt = env_cfg.init_prompt
if not base_url or not model or not init_prompt:
raise ValueError("env_cfg must include base_url, model, and init_prompt")
openai_api_key = env_cfg.get("openai_api_key") or os.getenv("OPENAI_API_KEY")
if openai_api_key is None:
raise ValueError("`OPENAI_API_KEY` must be set (as parameter, in env_cfg, or as environment variable)")
base_url = env_cfg.get("base_url")
model = env_cfg.get("model")
init_prompt = env_cfg.get("init_prompt")
if not base_url or not model or not init_prompt:
raise ValueError("env_cfg must include base_url, model, and init_prompt")

Comment on lines +106 to +113
return (
self.lm_client.chat.completions.create(
model=used_model,
messages=msgs,
)
.choices[0]
.message.content
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The llm_query function is type-hinted to return a str, but it can return None if the OpenAI API response has message.content as None. This violates the type hint and can cause errors in the code executed in the REPL. You should ensure it always returns a string, for example by using or "".

        response = self.lm_client.chat.completions.create(
            model=used_model,
            messages=msgs,
        )
        return response.choices[0].message.content or ""

"compile": None,
"globals": None,
"locals": None,
"open": None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The open function is disabled in SAFE_BUILTINS for security, which is a good practice for a REPL. However, the RLMEnvironment.load_context method in env.py creates a file and exposes its path via the context_path variable in the REPL's scope. This creates a design contradiction: a file path is provided, but there's no way to use it from within the sandboxed code. If file access is not intended, consider removing the context_path variable to avoid confusion. If it is intended, a custom, sandboxed file-access utility should be provided instead of the disabled open.

Comment on lines +58 to +60
elif isinstance(ctx, dict):
total = sum(len(str(k)) + len(str(v)) for k, v in itertools.islice(ctx.items(), 1000))
preview = [len(ctx)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The preview generated for a dictionary in _context_metadata_prompt is just its length, which is inconsistent with the more detailed previews for strings and lists (which show lengths of first few items). For better introspection and consistency, consider providing a preview of the lengths of the first few items in the dictionary, similar to how it's done for lists.

Suggested change
elif isinstance(ctx, dict):
total = sum(len(str(k)) + len(str(v)) for k, v in itertools.islice(ctx.items(), 1000))
preview = [len(ctx)]
elif isinstance(ctx, dict):
total = sum(len(str(k)) + len(str(v)) for k, v in itertools.islice(ctx.items(), 1000))
preview = [len(str(k)) + len(str(v)) for k, v in itertools.islice(ctx.items(), 10)]

Comment on lines +172 to +173
stdout = execution_result.get("stdout", "") or ""
stderr = execution_result.get("stderr", "") or ""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The or "" is redundant in these lines. execution_result.get("stdout", "") already returns an empty string if the key is missing or its value is None. The _execute_code method also ensures that stdout and stderr are always strings.

                stdout = execution_result.get("stdout", "")
                stderr = execution_result.get("stderr", "")

Comment on lines +291 to +293
import io
import sys
import time
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Imports should generally be at the top of the file, as recommended by PEP 8. Moving io, sys, and time imports to the top level will improve code readability and consistency.

Comment on lines +342 to +343
def __del__(self):
self.close()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using __del__ for resource cleanup (like removing the temporary directory) is unreliable in Python. It's not guaranteed to be called, and the environment at the time of its execution can be unpredictable. Since you have correctly implemented a context manager (__enter__/__exit__), you should rely on that for cleanup. Relying on __del__ as a fallback is risky and should be avoided.

@ray.remote(num_cpus=1)
def run_simple_test(prompt: str):

import dotenv
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The import dotenv statement is inside a function. According to PEP 8, imports should be at the top of the file. This improves readability and makes dependencies clear. Please move this import to the top level of the module.

}


def safe_import(name: str, globals=None, locals=None, fromlist=(), level: int = 0):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The parameter names globals and locals shadow Python's built-in functions. It's a best practice to avoid this to prevent confusion and potential bugs. Consider renaming them by appending an underscore (e.g., globals_, locals_) as recommended by PEP 8.

Suggested change
def safe_import(name: str, globals=None, locals=None, fromlist=(), level: int = 0):
def safe_import(name: str, globals_=None, locals_=None, fromlist=(), level: int = 0):

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant