
Commit 63439ac

docs: guide for sliding puzzle example (#961)
Signed-off-by: slikhite-1 <slikhite@nvidia.com>
1 parent a9ff45c commit 63439ac

File tree: 5 files changed, +299 −4 lines
Two binary image assets added (29.6 KB and 14.1 KB): docs/assets/train-reward-sliding-puzzle.png and docs/assets/valid_acc-sliding-puzzle.png (previews not rendered here).

docs/guides/grpo-sliding-puzzle.md

Lines changed: 294 additions & 0 deletions
@@ -0,0 +1,294 @@

# Solve a Sliding Puzzle Using GRPO

This guide explains how to use NeMo RL to train a model to solve the classic **n×n sliding puzzle** through multi-turn reinforcement learning. In this environment, numbered tiles must be arranged in sequential order by sliding them into an empty space.

The sliding puzzle task serves as a simple yet effective example of how multi-turn RL and tool calling are implemented in NeMo RL. It provides a minimal setup for understanding the core components of Group Relative Policy Optimization (GRPO) and sequential decision-making.

## Quick Start Guide

### 1. Install and Set Up NeMo RL with Megatron Backend (Optional)

To get started, clone and set up the NeMo RL repository by initializing submodules, installing CUDA dependencies, and configuring the environment with uv. Refer to [Prerequisites](https://github.com/NVIDIA-NeMo/RL/tree/main?tab=readme-ov-file#prerequisites) for detailed instructions on installation.

### 2. Train a Model

Train a model to solve the sliding puzzle using GRPO with the default 2×2 configuration.

```bash
uv run python examples/run_grpo_sliding_puzzle.py
```

### 3. Customize Puzzle Configuration

By default, this training script uses the configuration in [grpo_sliding_puzzle.yaml](../../examples/configs/grpo_sliding_puzzle.yaml). You can customize parameters with command-line overrides to experiment with different puzzle sizes or levels of difficulty.

```bash
# Train on a 3×3 puzzle with 10 random moves to scramble the board
uv run python examples/run_grpo_sliding_puzzle.py \
    env.sliding_puzzle_game.cfg.game_config.size=3 \
    env.sliding_puzzle_game.cfg.game_config.shuffle_moves=10
```

### 4. Monitor Progress

You can enable logging via Weights & Biases and TensorBoard to monitor training metrics such as rewards, success rate, and loss curves.

```bash
# Enable logging (optional)
uv run examples/run_grpo_sliding_puzzle.py \
    --config examples/configs/grpo_sliding_puzzle.yaml \
    logger.wandb_enabled=true \
    logger.tensorboard_enabled=true
```

## Game Mechanics

### Puzzle Structure

The sliding puzzle consists of the following components (a short sketch follows the list):

- **Grid**: An `n×n` grid with numbered tiles and one empty space
- **Tiles**: Numbered from `1` to `n²-1`, placed in random order
- **Empty Space**: Represented by `0`, typically starting at the bottom-right corner
- **Goal State**: Sequential arrangement `1, 2, 3, ..., n²-1` with `0` at the bottom-right corner
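
To make the state convention concrete, here is a minimal, self-contained sketch of a solved grid and its goal check. It is illustrative only and assumes a plain list-of-lists representation, not the exact internal one used by NeMo RL:

```python
# Illustrative sketch of the n×n state convention described above.
n = 3
solved = [[r * n + c + 1 for c in range(n)] for r in range(n)]
solved[n - 1][n - 1] = 0  # empty space (0) at the bottom-right corner

def is_solved(grid: list[list[int]]) -> bool:
    """Return True when the grid matches the sequential goal state."""
    return grid == solved

print(solved)  # [[1, 2, 3], [4, 5, 6], [7, 8, 0]]
```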

### Example Data Sample

```
===== SLIDING PUZZLE =====
Arrange the 3x3 grid by sliding tiles into the empty space.
- The goal is to arrange numbers from 1 to 8 in order
- Use 'up', 'down', 'left', 'right' to slide in that direction
- Use 'view' to see the current state of the board

Current Board State:

  +---------+
1 | 1     3 |
2 | 4  2  5 |
3 | 7  8  6 |
  +---------+
    1  2  3

Reach the goal state where numbers are ordered 1 through 8 with the empty space (0) at the bottom right.
Valid actions: 'up', 'down', 'left', 'right', or 'slide row col' (e.g., 'slide 1 2').
After thinking, output your chosen action on a new line starting with '<action></action>' like this:
<action>your_action</action>
If you just want to see the board, output <action>view</action>
Think carefully step-by-step before acting.
```

### Movement Rules

1. **Valid Moves**: Only tiles adjacent to the empty space `0` can be moved.
2. **Movement Direction**: Tiles slide into the empty space, not the other way around.
3. **Grid Boundaries**: Moves that would go beyond the grid are invalid.
4. **Single Tile Movement**: Each action affects only one tile at a time.

All actions must be wrapped in XML-style tags and follow one of the formats below:

```xml
<action>up</action>          <!-- Slide a tile up into the empty space -->
<action>slide 2 1</action>   <!-- Slide tile at row 2, column 1 -->
<action>view</action>        <!-- View the current board state -->
```
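
A minimal sketch of how such tags can be parsed (the actual `_parse_action` implementation in `SlidingPuzzleRunner` may differ in its details):

```python
import re
from typing import Optional

def parse_action(text: str) -> Optional[str]:
    """Extract the payload of the last <action>...</action> tag, if any."""
    matches = re.findall(r"<action>(.*?)</action>", text, re.DOTALL)
    return matches[-1].strip().lower() if matches else None

assert parse_action("I should move up.\n<action>up</action>") == "up"
assert parse_action("no tags here") is None
```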

## Data Generation

### Configuration Parameters

Sliding puzzle instances are generated using the following parameters, which can be customized via the configuration file:

```yaml
env:
  sliding_puzzle_game:
    cfg:
      game_config:
        size: 5           # Size of the puzzle grid (e.g., 3x3, 4x4, 5x5)
        shuffle_moves: 4  # Number of random moves to scramble the puzzle
        max_moves: 40     # Maximum number of moves allowed per episode
```

#### Description

- **`size`**: Determines the dimensions of the puzzle board (`n×n`).
- **`shuffle_moves`**: Controls the initial difficulty by randomly moving tiles to scramble the puzzle.
- **`max_moves`**: Sets an upper limit on the number of actions the agent can take in one episode.

Grids are generated with sizes ranging from 2 to `game_config.size`. Each grid starts from the solved state and is shuffled by moving random tiles into the empty space n times, where n is a random number between 1 and `shuffle_moves`. Only valid moves are used during shuffling, which guarantees that every generated puzzle is solvable; the sketch below illustrates the idea.
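
This is a self-contained illustration of shuffling via valid moves only; the helper below is hypothetical, and the real logic lives in `SlidingPuzzleGameLogic.generate`:

```python
import random

def shuffle_grid(grid: list[list[int]], moves: int) -> None:
    """Scramble a solved grid in place using only legal slides."""
    n = len(grid)
    for _ in range(moves):
        # Locate the empty space (0).
        r, c = next((i, j) for i in range(n) for j in range(n) if grid[i][j] == 0)
        # Pick a random in-bounds neighbor and slide it into the empty space.
        neighbors = [
            (r + dr, c + dc)
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))
            if 0 <= r + dr < n and 0 <= c + dc < n
        ]
        nr, nc = random.choice(neighbors)
        grid[r][c], grid[nr][nc] = grid[nr][nc], grid[r][c]
```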

The `generate_puzzle_datum()` function in [run_grpo_sliding_puzzle.py](../../examples/run_grpo_sliding_puzzle.py) generates the dataset. [sliding_puzzle.py](../../nemo_rl/environments/games/sliding_puzzle.py) contains the `SlidingPuzzleGameLogic` class, which is responsible for puzzle generation and initialization logic. The grid size and the number of shuffle moves together control puzzle difficulty.

#### Generation Algorithm

The puzzle configuration is randomly generated by sampling the grid size and the number of shuffle moves within the configured maximums:

```python
import random
from typing import Any

def generate_random_config(max_config: dict[str, Any]) -> dict[str, Any]:
    """Generate a random config for the sliding puzzle game."""
    shuffle_moves = random.randint(1, max_config.get("shuffle_moves"))
    if shuffle_moves % 2 == 0:
        shuffle_moves += 1  # Ensure odd number for proper scrambling
    return {
        "size": random.randint(2, max_config.get("size", 3)),
        "shuffle_moves": shuffle_moves,
    }

game_config = generate_random_config(game_config)
initial_game_state = SlidingPuzzleGameLogic.generate(game_config)
initial_render = SlidingPuzzleGameLogic.render(initial_game_state)
welcome_message = SlidingPuzzleGameLogic.init(initial_game_state)
```

### Dataset Size Calculation

Dataset size is defined by parameters in `grpo_sliding_puzzle.yaml`:

```
Training Size   = num_prompts_per_step × num_generations_per_prompt × max_num_steps
Validation Size = max_val_samples
```
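
For example, with hypothetical values `num_prompts_per_step=32`, `num_generations_per_prompt=16`, and `max_num_steps=120` (check `grpo_sliding_puzzle.yaml` for the actual defaults), a full run would consume 32 × 16 × 120 = 61,440 training samples.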

### Data Structure

Each training sample is returned as a `DatumSpec` dictionary with the following structure:

```python
datum: DatumSpec = {
    "message_log": message_log,        # Conversation history
    "length": len(tokenized_prompt),   # Token count
    "extra_env_info": metadata,        # Game state metadata
    "loss_multiplier": 1.0,            # Training weight
    "idx": idx,                        # Sample index
    "task_name": task_name,            # Task identifier
    "stop_strings": ["</action>"],     # Termination tokens
}
```

## Environment Interface

<!-- ### Architecture Flow

```
GRPO Training Pipeline:
run_grpo_sliding_puzzle.grpo_train → nemo_rl.experience.rollouts.run_multi_turn_rollouts → generate_response + calculate_reward → environments.games.sliding_puzzle.SlidingPuzzleEnv.step
``` -->

### Core Classes

The [sliding_puzzle.py](../../nemo_rl/environments/games/sliding_puzzle.py) module defines the environment and the logic for interacting with it. The core classes are outlined below.

#### SlidingPuzzleEnv

The `SlidingPuzzleEnv` class is the main environment. It is implemented as a Ray remote actor for distributed processing, and it uses functions from the `SlidingPuzzleGameLogic` and `SlidingPuzzleRunner` classes to drive the game.

```python
@ray.remote
class SlidingPuzzleEnv(EnvironmentInterface):
    def __init__(self, cfg: Optional[SlidingPuzzleConfig] = None):
        """Initialize environment with configuration."""

    def step(
        self,
        message_log_batch: list[LLMMessageLogType],
        metadata_batch: list[SlidingPuzzleMetadata],
    ) -> EnvironmentReturn:
        """Process batch of interactions."""
```
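
For orientation, constructing the actor directly might look like the following hedged sketch (the training script normally handles this wiring, and the exact `SlidingPuzzleConfig` schema may differ):

```python
import ray
from nemo_rl.environments.games.sliding_puzzle import SlidingPuzzleEnv

ray.init()

# Game settings mirror the YAML config shown earlier (assumed cfg layout).
env = SlidingPuzzleEnv.remote(
    cfg={"game_config": {"size": 3, "shuffle_moves": 4, "max_moves": 40}}
)
```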

#### SlidingPuzzleGameLogic

The `SlidingPuzzleGameLogic` class defines the core game mechanics through static methods for puzzle operations and includes the reward calculation.

```python
class SlidingPuzzleGameLogic:
    @staticmethod
    def generate(config: dict[str, Any]) -> dict[str, Any]:
        """Generate new puzzle with specified configuration."""

    @staticmethod
    def init(game_state: dict[str, Any]) -> str:
        """Create welcome message with game rules."""

    @staticmethod
    def step(action: str, game_state: dict[str, Any]) -> tuple[str, float, bool, dict[str, Any]]:
        """Execute action and return (response, reward, terminated, new_state)."""

    @staticmethod
    def render(game_state: dict[str, Any]) -> str:
        """Render current puzzle state as visual grid."""
```

#### SlidingPuzzleRunner

The `SlidingPuzzleRunner` class handles turn processing and action management.

```python
class SlidingPuzzleRunner:
    def __init__(self):
        """Initialize runner with no persistent state."""

    def _parse_action(self, text: str) -> Optional[str]:
        """Extract action from model response using XML tag parsing."""

    def process_turn(
        self,
        message_log: LLMMessageLogType,
        metadata: SlidingPuzzleMetadata,
    ) -> tuple[dict[str, str], float, bool, Optional[list[str]], Optional[SlidingPuzzleMetadata]]:
        """Process single turn and return (response_dict, reward, terminated, stop_strings, updated_metadata)."""
```

### Processing Pipeline

The `step` function creates a processing pipeline in which each class handles a specific responsibility (a simplified, self-contained sketch follows the list):

1. **Parse Action** (`SlidingPuzzleRunner`): Extracts the action from the model response using XML tag parsing via the `process_turn` method.
2. **Validate Move** (`SlidingPuzzleGameLogic`): Checks whether the action is valid for the current game state.
3. **Execute Action** (`SlidingPuzzleGameLogic`): Applies the move to the game state using the `SlidingPuzzleGameLogic.step` method.
4. **Calculate Reward** (`SlidingPuzzleGameLogic`): Assigns a reward based on whether the puzzle is solved (also in the `step` method).
5. **Return Results** (`SlidingPuzzleEnv`): Returns the updated interaction state as an `EnvironmentReturn` object.
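
The following toy walk-through mirrors these five stages in one function; every name here is an illustrative stand-in, not the NeMo RL implementation:

```python
import re
from typing import Any

def toy_step(model_response: str, state: dict[str, Any]) -> tuple[str, float, bool]:
    """Miniature version of the parse → validate → execute → reward flow."""
    # 1. Parse the action out of the XML tags.
    match = re.search(r"<action>(.*?)</action>", model_response, re.DOTALL)
    if match is None:
        return "Invalid action format.", 0.0, False  # no reward, keep playing
    action = match.group(1).strip().lower()
    # 2-3. Validate and execute (toy: only 'view' is modeled here; the real
    # logic mutates the grid for 'up'/'down'/'left'/'right'/'slide r c').
    if action == "view":
        response = f"Current grid: {state['grid']}"
    else:
        response = f"Applied move: {action}"
    # 4. Sparse reward: 1.0 only when the grid matches the solution.
    solved = state["grid"] == state["solution"]
    # 5. Return what the environment would pack into EnvironmentReturn.
    return response, (1.0 if solved else 0.0), solved

print(toy_step("<action>view</action>", {"grid": [1, 0, 2, 3], "solution": [1, 2, 3, 0]}))
```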

## Reward System

### Reward Structure

The environment uses a sparse reward scheme designed to encourage complete solution strategies rather than incremental progress or reward hacking.

| Condition | Reward | Termination |
|-----------|--------|-------------|
| Valid move (non-solving) | 0.0 | False |
| Invalid move | 0.0 | False |
| Puzzle solved | 1.0 | True |
| Max moves reached | 0.0 | True |
| Invalid action format | 0.0 | False |

> **Goal**: The agent receives a reward only upon successfully solving the puzzle, promoting long-horizon planning.

### Reward Calculation Logic

```python
def step(action: str, game_state: dict[str, Any]) -> tuple[str, float, bool, dict[str, Any]]:
    """Process action and calculate reward (simplified excerpt)."""
    # `move_made`, `new_state`, and `response` are produced by the
    # move-validation and execution logic, omitted here for brevity.
    reward = 0.0
    is_terminated = False

    if move_made:
        # Check if puzzle is solved
        if new_state["grid"] == new_state["solution"]:
            reward = 1.0
            is_terminated = True
        else:
            reward = 0.0  # No reward for non-solving moves

    return response, reward, is_terminated, new_state
```

## Results

We fine-tuned [`Qwen/Qwen2.5-1.5B-Instruct`](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) on synthetic data for 120 steps using the following configuration settings:

```yaml
game_config:
  size: 5           # Size of the puzzle (e.g., 2 for 2x2, 3 for 3x3)
  shuffle_moves: 10 # Number of random moves to shuffle the solved state
  max_moves: 30     # Maximum moves allowed per episode
```

The figures below show training reward versus training step, along with validation accuracy.

![Training Curve](../assets/train-reward-sliding-puzzle.png)

![Validation Accuracy](../assets/valid_acc-sliding-puzzle.png)

docs/index.md

Lines changed: 1 addition & 0 deletions

```diff
@@ -28,6 +28,7 @@ guides/sft.md
 guides/dpo.md
 guides/grpo.md
 guides/grpo-deepscaler.md
+guides/grpo-sliding-puzzle.md
 guides/rm.md
 guides/environments.md
 guides/eval.md
```

examples/configs/grpo_sliding_puzzle.yaml

Lines changed: 4 additions & 4 deletions

```diff
@@ -19,7 +19,7 @@ checkpointing:

 policy:
   model_name: "Qwen/Qwen2.5-1.5B-Instruct"
-  max_total_sequence_length: 3072
+  max_total_sequence_length: 1024

   dtensor_cfg:
     enabled: true
@@ -54,8 +54,8 @@ env:
   cfg:
     game_config:
       size: 5 # Size of the puzzle (e.g., 2 for 2x2, 3 for 3x3)
-      shuffle_moves: 15 # Number of random moves to shuffle the solved state
-      max_moves: 50 # Maximum moves allowed per episode
+      shuffle_moves: 10 # Number of random moves to shuffle the solved state
+      max_moves: 30 # Maximum moves allowed per episode

 logger:
   log_dir: "logs" # Base directory for all logs
@@ -73,4 +73,4 @@ logger:
   run_name: "grpo-dev-sliding_puzzle"
 gpu_monitoring:
   collection_interval: 10 # How often to collect GPU usage metrics (in seconds)
-  flush_interval: 10 # How often to flush GPU usage metrics to the loggers (in seconds)
+  flush_interval: 10 # How often to flush GPU usage metrics to the loggers (in seconds)
```
