# Solve a Sliding Puzzle Using GRPO

This guide explains how to use NeMo RL to train a model to solve the classic **n×n sliding puzzle** through multi-turn reinforcement learning. In this environment, numbered tiles must be arranged in sequential order by sliding them into an empty space.

The sliding puzzle task serves as a simple yet effective example of how multi-turn RL and tool calling are implemented within NeMo RL. It provides a minimal setup for understanding the core components of Group Relative Policy Optimization (GRPO) and sequential decision-making.

## Quick Start Guide

### 1. Install and Set Up NeMo RL with Megatron Backend (Optional)

To get started, clone the NeMo RL repository, initialize its submodules, install CUDA dependencies, and configure the environment with `uv`. Refer to [Prerequisites](https://github.com/NVIDIA-NeMo/RL/tree/main?tab=readme-ov-file#prerequisites) for detailed installation instructions.

### 2. Train a Model

Train a model to solve the sliding puzzle using GRPO with the default 2×2 configuration:

```bash
uv run python examples/run_grpo_sliding_puzzle.py
```

### 3. Customize the Puzzle Configuration

By default, the training script uses the configuration in [grpo_sliding_puzzle.yaml](../../examples/configs/grpo_sliding_puzzle.yaml). You can override parameters on the command line to experiment with different puzzle sizes or difficulty levels:

```bash
# Train on a 3×3 puzzle with 10 random moves to scramble the board
uv run python examples/run_grpo_sliding_puzzle.py \
    env.sliding_puzzle_game.cfg.game_config.size=3 \
    env.sliding_puzzle_game.cfg.game_config.shuffle_moves=10
```

### 4. Monitor Progress

You can enable logging via Weights & Biases and TensorBoard to monitor training metrics such as rewards, success rate, and loss curves.

```bash
# Enable logging (optional)
uv run examples/run_grpo_sliding_puzzle.py \
    --config examples/configs/grpo_sliding_puzzle.yaml \
    logger.wandb_enabled=true \
    logger.tensorboard_enabled=true
```

## Game Mechanics

### Puzzle Structure

The sliding puzzle consists of:
- **Grid**: An `n×n` grid with numbered tiles and one empty space
- **Tiles**: Numbered from `1` to `n²-1`, placed in scrambled order
- **Empty Space**: Represented by `0`, located at the bottom-right corner in the solved state
- **Goal State**: Sequential arrangement `1, 2, 3, ..., n²-1` with `0` at the bottom right
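
As a point of reference, the goal state above can be sketched in a few lines of Python (the `goal_state` helper is illustrative, not part of the NeMo RL codebase):

```python
def goal_state(n: int) -> list[list[int]]:
    """Solved n×n grid: tiles 1..n²-1 in row-major order, 0 (empty) at bottom-right."""
    tiles = list(range(1, n * n)) + [0]
    return [tiles[r * n : (r + 1) * n] for r in range(n)]

print(goal_state(3))  # [[1, 2, 3], [4, 5, 6], [7, 8, 0]]
```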

### Example Data Sample

```
===== SLIDING PUZZLE =====
Arrange the 3x3 grid by sliding tiles into the empty space.
- The goal is to arrange numbers from 1 to 8 in order
- Use 'up', 'down', 'left', 'right' to slide in that direction
- Use 'view' to see the current state of the board

Current Board State:

  +---------+
1 | 1   3 |
2 | 4 2 5 |
3 | 7 8 6 |
  +---------+
    1 2 3

Reach the goal state where numbers are ordered 1 through 8 with the empty space (0) at the bottom right.
Valid actions: 'up', 'down', 'left', 'right', or 'slide row col' (e.g., 'slide 1 2').
After thinking, output your chosen action on a new line starting with '<action></action>' like this:
<action>your_action</action>
If you just want to see the board, output <action>view</action>
Think carefully step-by-step before acting.
```

### Movement Rules

1. **Valid Moves**: Only tiles adjacent to the empty space `0` can be moved.
2. **Movement Direction**: Tiles slide into the empty space, not the other way around.
3. **Grid Boundaries**: Moves that would go beyond the grid are invalid.
4. **Single Tile Movement**: Each action affects only one tile at a time.

All actions must be wrapped in XML-style tags and follow one of the formats below:

```xml
<action>up</action>         <!-- Slide a tile up into the empty space -->
<action>slide 2 1</action>  <!-- Slide the tile at row 2, column 1 -->
<action>view</action>       <!-- View the current board state -->
```

## Data Generation

### Configuration Parameters

Sliding puzzle instances are generated using the following parameters, which can be customized via the configuration file:

```yaml
env:
  sliding_puzzle_game:
    cfg:
      game_config:
        size: 5            # Maximum grid size; puzzles sample sizes from 2 up to this value
        shuffle_moves: 4   # Number of random moves used to scramble the puzzle
        max_moves: 40      # Maximum number of moves allowed per episode
```

#### Description

- **`size`**: Determines the dimensions of the puzzle board (`n×n`).
- **`shuffle_moves`**: Controls the initial difficulty by randomly moving tiles to scramble the puzzle.
- **`max_moves`**: Sets an upper limit on the number of actions the agent can take in one episode.

Grids are generated with sizes ranging from 2 to `game_config.size`. Each grid starts in the solved state and is scrambled by sliding random tiles into the empty space `n` times, where `n` is a random number between 1 and `shuffle_moves`. Because only valid moves are used, every generated puzzle is solvable.
The `generate_puzzle_datum()` function in [run_grpo_sliding_puzzle.py](../../examples/run_grpo_sliding_puzzle.py) generates the dataset. [sliding_puzzle.py](../../nemo_rl/environments/games/sliding_puzzle.py) contains the `SlidingPuzzleGameLogic` class, which implements puzzle generation and initialization. Together, the grid size and the number of shuffle moves control puzzle difficulty.

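The scrambling step described above can be pictured with a short sketch (a hypothetical stand-in for the library's logic, not its actual code): each shuffle move swaps the empty cell with a randomly chosen orthogonal neighbour, so the board always stays reachable from the solved state.

```python
import random

def shuffle_grid(grid: list[list[int]], moves: int) -> None:
    """Scramble `grid` in place by applying `moves` random valid moves."""
    n = len(grid)
    # Locate the empty cell (0).
    er, ec = next((r, c) for r in range(n) for c in range(n) if grid[r][c] == 0)
    for _ in range(moves):
        # Collect in-bounds orthogonal neighbours of the empty cell.
        neighbours = [
            (er + dr, ec + dc)
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
            if 0 <= er + dr < n and 0 <= ec + dc < n
        ]
        tr, tc = random.choice(neighbours)
        # Slide the chosen tile into the empty space.
        grid[er][ec], grid[tr][tc] = grid[tr][tc], grid[er][ec]
        er, ec = tr, tc
```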
#### Generation Algorithm

The puzzle configuration is randomly generated by sampling the grid size and the number of shuffle moves within the configured maximums:

```python
import random
from typing import Any


def generate_random_config(max_config: dict[str, Any]) -> dict[str, Any]:
    """Generate a random config for the sliding puzzle game."""
    shuffle_moves = random.randint(1, max_config.get("shuffle_moves"))
    if shuffle_moves % 2 == 0:
        shuffle_moves += 1  # Ensure an odd number of shuffle moves
    return {
        "size": random.randint(2, max_config.get("size", 3)),
        "shuffle_moves": shuffle_moves,
    }


game_config = generate_random_config(game_config)
initial_game_state = SlidingPuzzleGameLogic.generate(game_config)
initial_render = SlidingPuzzleGameLogic.render(initial_game_state)
welcome_message = SlidingPuzzleGameLogic.init(initial_game_state)
```

### Dataset Size Calculation

Dataset size is determined by parameters in `grpo_sliding_puzzle.yaml`:

```
Training Size   = num_prompts_per_step × num_generations_per_prompt × max_num_steps
Validation Size = max_val_samples
```
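
For example, with the illustrative values below (made-up numbers, not the defaults in `grpo_sliding_puzzle.yaml`), the formula gives:

```python
# Hypothetical configuration values for illustration only.
num_prompts_per_step = 32
num_generations_per_prompt = 16
max_num_steps = 120

training_size = num_prompts_per_step * num_generations_per_prompt * max_num_steps
print(training_size)  # 61440 samples seen over the whole run
```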

### Data Structure

Each training sample is returned as a `DatumSpec` dictionary with the following structure:

```python
datum: DatumSpec = {
    "message_log": message_log,        # Conversation history
    "length": len(tokenized_prompt),   # Token count
    "extra_env_info": metadata,        # Game state metadata
    "loss_multiplier": 1.0,            # Training weight
    "idx": idx,                        # Sample index
    "task_name": task_name,            # Task identifier
    "stop_strings": ["</action>"],     # Generation stop strings
}
```

## Environment Interface

<!-- ### Architecture Flow

```
GRPO Training Pipeline:
run_grpo_sliding_puzzle.grpo_train → nemo_rl.experience.rollouts.run_multi_turn_rollouts → generate_response + calculate_reward → environments.games.sliding_puzzle.SlidingPuzzleEnv.step
``` -->

### Core Classes

[sliding_puzzle.py](../../nemo_rl/environments/games/sliding_puzzle.py) defines the environment and the logic for interacting with it. The core classes are outlined below:

#### SlidingPuzzleEnv

The `SlidingPuzzleEnv` class is the main environment. It is implemented as a Ray remote actor for distributed processing and uses functions from the `SlidingPuzzleGameLogic` and `SlidingPuzzleRunner` classes to drive the game.

```python
@ray.remote
class SlidingPuzzleEnv(EnvironmentInterface):
    def __init__(self, cfg: Optional[SlidingPuzzleConfig] = None):
        """Initialize environment with configuration."""

    def step(
        self,
        message_log_batch: list[LLMMessageLogType],
        metadata_batch: list[SlidingPuzzleMetadata],
    ) -> EnvironmentReturn:
        """Process batch of interactions."""
```

#### SlidingPuzzleGameLogic

The `SlidingPuzzleGameLogic` class defines the core game mechanics through static methods for puzzle operations, including reward calculation.

```python
class SlidingPuzzleGameLogic:
    @staticmethod
    def generate(config: dict[str, Any]) -> dict[str, Any]:
        """Generate new puzzle with specified configuration."""

    @staticmethod
    def init(game_state: dict[str, Any]) -> str:
        """Create welcome message with game rules."""

    @staticmethod
    def step(action: str, game_state: dict[str, Any]) -> tuple[str, float, bool, dict[str, Any]]:
        """Execute action and return (response, reward, terminated, new_state)."""

    @staticmethod
    def render(game_state: dict[str, Any]) -> str:
        """Render current puzzle state as visual grid."""
```

#### SlidingPuzzleRunner

The `SlidingPuzzleRunner` class handles turn processing and action management.

```python
class SlidingPuzzleRunner:
    def __init__(self):
        """Initialize runner with no persistent state."""

    def _parse_action(self, text: str) -> Optional[str]:
        """Extract action from model response using XML tag parsing."""

    def process_turn(
        self,
        message_log: LLMMessageLogType,
        metadata: SlidingPuzzleMetadata,
    ) -> tuple[dict[str, str], float, bool, Optional[list[str]], Optional[SlidingPuzzleMetadata]]:
        """Process single turn and return (response_dict, reward, terminated, stop_strings, updated_metadata)."""
```

### Processing Pipeline

The `step` function forms a processing pipeline in which each class handles a specific responsibility:

1. **Parse Action** (`SlidingPuzzleRunner`): Extracts the action from the model response using XML tag parsing in the `process_turn` method.
2. **Validate Move** (`SlidingPuzzleGameLogic`): Checks whether the action is valid for the current game state.
3. **Execute Action** (`SlidingPuzzleGameLogic`): Applies valid moves to the game state in the `SlidingPuzzleGameLogic.step` method.
4. **Calculate Reward** (`SlidingPuzzleGameLogic`): Assigns a reward based on whether the puzzle is solved, also in the `step` method.
5. **Return Results** (`SlidingPuzzleEnv`): Returns the updated interaction state as an `EnvironmentReturn` object.
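
The validate-and-execute steps can be illustrated with a minimal sketch (assumed semantics, not the library's implementation): a direction names where the tile slides, so `up` moves the tile below the empty space upward into it.

```python
# Offset of the tile that slides, relative to the empty cell, per direction.
DIRS = {"up": (1, 0), "down": (-1, 0), "left": (0, 1), "right": (0, -1)}

def apply_move(grid: list[list[int]], action: str) -> bool:
    """Apply a directional move in place; return False if the move is invalid."""
    n = len(grid)
    er, ec = next((r, c) for r in range(n) for c in range(n) if grid[r][c] == 0)
    if action not in DIRS:
        return False  # Unrecognized action format
    dr, dc = DIRS[action]
    tr, tc = er + dr, ec + dc  # Position of the tile that would slide
    if not (0 <= tr < n and 0 <= tc < n):
        return False  # Move would go beyond the grid boundary
    grid[er][ec], grid[tr][tc] = grid[tr][tc], grid[er][ec]
    return True

board = [[1, 2], [0, 3]]
print(apply_move(board, "left"), board)  # True [[1, 2], [3, 0]]
```

Invalid moves leave the board untouched and return `False`, mirroring the zero-reward, non-terminating handling described below.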

## Reward System

### Reward Structure

The environment uses a sparse reward scheme designed to encourage complete solution strategies rather than incremental progress or reward hacking.

| Condition | Reward | Terminates Episode |
|-----------|--------|--------------------|
| Valid move (non-solving) | 0.0 | False |
| Invalid move | 0.0 | False |
| Puzzle solved | 1.0 | True |
| Max moves reached | 0.0 | True |
| Invalid action format | 0.0 | False |

> **Goal**: The agent receives a reward only upon successfully solving the puzzle, promoting long-horizon planning.

### Reward Calculation Logic

The snippet below is a simplified view of the game logic's `step` method; `move_made`, `new_state`, and `response` are produced by the surrounding move-execution code:

```python
def step(action: str, game_state: dict[str, Any]) -> tuple[str, float, bool, dict[str, Any]]:
    """Process action and calculate reward."""
    reward = 0.0
    is_terminated = False

    if move_made:
        # Check if puzzle is solved
        if new_state["grid"] == new_state["solution"]:
            reward = 1.0
            is_terminated = True
        else:
            reward = 0.0  # No reward for non-solving moves

    return response, reward, is_terminated, new_state
```

## Results

We fine-tuned [`Qwen/Qwen2.5-1.5B-Instruct`](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) on synthetic data for 120 steps using the following configuration settings:

```yaml
game_config:
  size: 5            # Size of the puzzle (e.g., 2 for 2x2, 3 for 3x3)
  shuffle_moves: 10  # Number of random moves used to shuffle the solved state
max_moves: 30
```

The figures below show training reward versus steps, along with validation accuracy.


