sparisi/gym_gridworlds


Overview

Minimalistic implementation of gridworlds based on Gymnasium, useful for quickly testing and prototyping reinforcement learning algorithms (both tabular and with function approximation).
The default class Gridworld implements a "go-to-goal" task where the agent has five actions (left, right, up, down, stay) and a default transition function (e.g., doing "stay" in a goal state ends the episode).
You can change the actions and the transition function by implementing new classes. For example, RiverSwim has only two actions and no terminal state, while in Taxi the agent can pick up passengers and drive them to the goal.
Basic gridworlds are defined in gridworld.py and are presented below. Harder gridworlds are defined in separate files in gym_gridworlds and are not discussed here (but are fully documented).

You can find a list of all environments here.

Install and Examples

To install the environments run

pip install -e .

Start a Python interpreter and run

import gymnasium
import gym_gridworlds
env = gymnasium.make("Gym-Gridworlds/Penalty-3x3-v0", render_mode="human")
env.reset()
env.step(2) # DOWN
env.step(4) # STAY

to render the Penalty-3x3-v0 gridworld (left figure),

import gymnasium
import gym_gridworlds
env = gymnasium.make("Gym-Gridworlds/Full-4x5-v0", render_mode="human")
env.reset()
env.step(2) # DOWN

to render the Full-4x5-v0 gridworld (middle figure), and

import gymnasium
import gym_gridworlds
env = gymnasium.make("Gym-Gridworlds/DangerMaze-5x6-v0", render_mode="human")
env.reset()
env.step(2) # DOWN

to render the DangerMaze-5x6-v0 gridworld (right figure).

[Figures: Penalty-3x3-v0 (left), Full-4x5-v0 (middle), DangerMaze-5x6-v0 (right)]

  • Black tiles are empty,
  • White tiles are pits (walking on them yields a large negative reward and the episode ends),
  • Gray tiles are walls (the agent cannot step on them),
  • Black tiles with purple arrows are tiles where the agent can move only in one direction (other actions will fail),
  • Red tiles give negative rewards,
  • Green tiles give positive rewards (the brighter, the higher),
  • Yellow tiles are quicksand, where all actions will fail with 90% probability,
  • The agent is the blue circle,
  • The orange arrow denotes the agent's last action,
  • The orange dot denotes that the agent did not try to move with its last action.

The smallest pre-built environment is Gym-Gridworlds/Empty-RandomStart-2x2-v0: there are only 4 states and 5 actions, and the initial position is random. It is the simplest environment you can use to debug your algorithm.
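
For instance, you can quickly inspect its spaces (a minimal sketch; the printed sizes follow from the 4 states and 5 actions above):

import gymnasium
import gym_gridworlds
env = gymnasium.make("Gym-Gridworlds/Empty-RandomStart-2x2-v0")
print(env.observation_space)  # Discrete(4), one integer per tile
print(env.action_space)       # Discrete(5), LEFT, RIGHT, DOWN, UP, STAY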

Optional Features

Noisy Transition and Reward Functions

import gymnasium
import gym_gridworlds
env = gymnasium.make("Gym-Gridworlds/Full-4x5-v0", random_action_prob=0.1, reward_noise_std=0.05)

This makes the environment execute a random action (instead of the one passed by the agent) with 10% probability, and adds zero-mean Gaussian noise with standard deviation 0.05 to the reward.

POMDP
To turn the MDP into a POMDP and learn from partially-observable pixels, make the environment with view_radius=1 (or any integer). This way, only the tiles close to the agent (within the view radius) will be visible, while far away tiles will be masked by white noise. For example, this is the partially-observable version of the Full-4x5-v0 gridworld above.

import gymnasium
import gym_gridworlds
env = gymnasium.make("Gym-Gridworlds/Full-4x5-v0", render_mode="human", view_radius=1)
env.reset()
env.step(2) # DOWN

[Figure: partially-observable Full-4x5-v0]

Noisy Observations
Make the environment with observation_noise=0.2 (or any float between 0 and 1). With default observations, the float represents the probability that the position observed by the agent is random. With RGB observations, it represents the probability that a pixel is white noise, as shown below.

import gymnasium
import gym_gridworlds
env = gymnasium.make("Gym-Gridworlds/Full-4x5-v0", render_mode="human", observation_noise=0.2)
env.reset()
env.step(2) # DOWN

[Figure: Full-4x5-v0 with noisy observations]

Random Goals
Make the environment with random_goals=True to randomize the position of positive rewards (positive only!) at every reset. To learn in this setting, you need to add the positions of the rewards to the observation (MatrixWithGoalWrapper), or to learn from pixels; see the sketch below.
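
A minimal sketch, assuming MatrixWithGoalWrapper can be imported from gym_gridworlds.observation_wrappers (the import path is an assumption):

import gymnasium
import gym_gridworlds
from gym_gridworlds.observation_wrappers import MatrixWithGoalWrapper  # assumed import path

env = gymnasium.make("Gym-Gridworlds/Full-4x5-v0", random_goals=True)
env = MatrixWithGoalWrapper(env)  # adds the goal positions to the observation
obs, _ = env.reset()  # goal positions are re-sampled at every reset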

Make Your Own Gridworld

.          empty tile
□          wall
_          quicksand
           pit (space character)
O          positive reward
o          smaller positive reward
X          negative reward
x          smaller negative reward
←↖↑↗→↘↓↙  one-directional tiles
  1. Encode your grid following the above mapping, and save it as a txt file in gym_gridworlds/grids. For example, save the grid below as 5x5_wall.txt.
.....
.□□□.
.□O..
.□□□.
.....

(IN PROGRESS) You can use map_editor.py to draw customized grids and save/load them to txt files. The current version supports only TravelField grids.

  2. Register the environment in gym_gridworlds/__init__.py, for example
register(
    id="Gym-Gridworlds/Wall-RandomStart-5x5-v0",
    entry_point="gym_gridworlds.gridworld:Gridworld",
    max_episode_steps=50,
    kwargs={
        "grid": "5x5_wall",
        "start_pos": None,  # random
    },
)
  3. Try it
import gymnasium
import gym_gridworlds
env = gymnasium.make("Gym-Gridworlds/Wall-RandomStart-5x5-v0", render_mode="human")
env.reset(seed=42)

[Figure: Wall-RandomStart-5x5-v0 rendering]

Playground

You can use playground.py to test an environment. For example, run

python playground.py Gym-Gridworlds/Taxi-6x7-v0 --record
python playground.py Gym-Gridworlds/FourRooms-Original-13x13-v0 --env-arg slippery_prob=0.5 max_resolution=[512,512] --record
python playground.py Gym-Gridworlds/TravelField-28x28-v1 --env-arg distance_reward=True no_stay=True observation_noise=0.2 --record

You can move the agent around the environment with the arrow keys, see the rewards received by the agent, and record GIFs of the episodes.

Default MDP (Gridworld Class)

Action Space

The action is discrete in the range {0, 4} for {LEFT, RIGHT, DOWN, UP, STAY}. It is possible to remove the STAY action by making the environment with no_stay=True.
Diagonal actions {5, 8} for {UP_LEFT, DOWN_LEFT, DOWN_RIGHT, UP_RIGHT} are also supported but not used in the default MDP.
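
For example, no_stay shrinks the action space accordingly (a small sketch):

import gymnasium
import gym_gridworlds
env = gymnasium.make("Gym-Gridworlds/Full-4x5-v0")
print(env.action_space)  # Discrete(5): LEFT, RIGHT, DOWN, UP, STAY
env = gymnasium.make("Gym-Gridworlds/Full-4x5-v0", no_stay=True)
print(env.action_space)  # Discrete(4): STAY removed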

Observation Space

Default (True State)
The observation is discrete in the range {0, n_rows * n_cols - 1}. Each integer denotes the current location of the agent. For example, in a 3x3 grid the observations are

 0 1 2
 3 4 5
 6 7 8
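
The mapping between the discrete observation and matrix coordinates is obs = row * n_cols + col, as in this small sketch:

n_cols = 3  # columns of the 3x3 grid above
row, col = divmod(5, n_cols)
print(row, col)            # 1 2
print(row * n_cols + col)  # 5, back to the discrete observation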

The true state is always included in the info dictionary, so it can be retrieved even when wrappers are used. This makes debugging easier (e.g., it is possible to count state visits even when RGB wrappers are used).
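
For example (the exact key under which the state is stored is not documented here, so this sketch just prints the dictionary):

import gymnasium
import gym_gridworlds
env = gymnasium.make("Gym-Gridworlds/Full-4x5-v0")
obs, info = env.reset()
print(info)  # inspect the dictionary to find the entry holding the true state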

The observation can be transformed to better fit function approximation (e.g., if you use DQN) using wrappers from observation_wrappers.py. For example

  • CoordinateWrapper returns matrix coordinates (row, col). In the above example, obs = 3 becomes obs = (1, 0).
  • MatrixWrapper returns a map of the environment with one 1 at the agent's position. In the above example, obs = 3 becomes
 0 0 0
 1 0 0
 0 0 0
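
The wrappers follow the standard Gymnasium pattern; a minimal sketch, assuming they can be imported from gym_gridworlds.observation_wrappers (the import path is an assumption):

import gymnasium
import gym_gridworlds
from gym_gridworlds.observation_wrappers import CoordinateWrapper  # assumed import path

env = gymnasium.make("Gym-Gridworlds/Penalty-3x3-v0")
env = CoordinateWrapper(env)
obs, _ = env.reset()
print(obs)  # (row, col) coordinates instead of a single integer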

RGB
To use classic RGB pixel observations, make the environment with render_mode="rgb_array" and then wrap it with gymnasium.wrappers.AddRenderObservation.
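
A minimal sketch (render_only=True keeps only the rendered frame as the observation):

import gymnasium
import gym_gridworlds
env = gymnasium.make("Gym-Gridworlds/Full-4x5-v0", render_mode="rgb_array")
env = gymnasium.wrappers.AddRenderObservation(env, render_only=True)
obs, _ = env.reset()
print(obs.shape)  # (height, width, 3) RGB frame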

Partial RGB
Pixel observations can be made partial by making the environment with view_radius. For example, if view_radius=1 the rendering will show the content of only the tiles around the agent, while all other tiles will be filled with white noise.

Noisy Observations
Make the environment with observation_noise=0.2 (or any float between 0 and 1). With default observations, the float represents the probability that the position observed by the agent is random. With RGB observations, it represents the probability that a pixel is white noise.

Starting State

By default, the episode starts with the agent at the top-left tile (0, 0). You can manually select the starting position by making the environment with the argument start_pos, e.g., start_pos=[(3, 4)]. You can use the key "max" to automatically select the end of the grid, e.g., start_pos=[("max", 0)] will place the agent at the bottom-left corner (last row, first column). If you make the environment with start_pos=None, the starting position will be random. In both cases (fixed and random), the starting position cannot be a wall or a pit tile.
Note that the starting position must be passed as a list of tuples. If more than one tuple is passed, the starting position will be randomly sampled from the list at every reset.
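
For example:

import gymnasium
import gym_gridworlds
# fixed start in the last row, first column ("max" resolves to the last index)
env = gymnasium.make("Gym-Gridworlds/Full-4x5-v0", start_pos=[("max", 0)])
env.reset()
# random start at every reset
env = gymnasium.make("Gym-Gridworlds/Full-4x5-v0", start_pos=None)
env.reset()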

More Control Over The Starting State
If you want some starting states to be more likely to be sampled, repeat them within the list. For example, with start_pos=[(3, 4), (1, 0), (1, 0)] the agent has a 66% chance of starting in (1, 0) and a 33% chance of starting in (3, 4).
If you make the environment with loop_through_start_pos=True, the starting state will change at every reset, following the order you passed with start_pos. This can be useful for testing environments that have multiple starting states using only a few episodes. For example,

env = gymnasium.make("Gym-Gridworlds/Empty-10x10-v0", start_pos=[(3, 4), (1, 0), (2, 0)], loop_through_start_pos=True)
obs, _ = env.reset()
print(obs)  # 34 -> (3, 4) in matrix coordinates
obs, _ = env.reset()
print(obs)  # 10 -> (1, 0)
obs, _ = env.reset()
print(obs)  # 20 -> (2, 0)
obs, _ = env.reset()
print(obs)  # 34 -> (3, 4)
obs, _ = env.reset()
print(obs)  # 10 -> (1, 0)
...

Transition

By default, the transition is deterministic except in quicksand tiles, where any action fails with 90% probability (the agent does not move).
The transition can be made stochastic everywhere by passing random_action_prob, the probability that the executed action is random. For example, if random_action_prob=0.1 there is a 10% chance that the agent performs a random action instead of the one passed to self.step(action).
Another way to add stochasticity is with slippery_prob, which is the probability that the agent slips and moves twice (similar to "sticky actions" in other environments).
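
Both sources of stochasticity can be combined, for example:

import gymnasium
import gym_gridworlds
# 10% chance of a random action, 20% chance of slipping (moving twice)
env = gymnasium.make("Gym-Gridworlds/Full-4x5-v0", random_action_prob=0.1, slippery_prob=0.2)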

Random Resets
You can pass random_reset_prob to have a chance that the environment self-resets at any step. This doesn't change the terminal and truncated flags, but simply transitions the agent to an initial state (i.e., the next state will be the one returned by env.reset()).
This is useful to mimic episodic tasks in the infinite-horizon setting (it should not be used when there are terminal states).
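
For example, combining it with infinite_horizon=True (described under Episode End):

import gymnasium
import gym_gridworlds
# no terminal states, but a 1% chance per step of jumping back to an initial state
env = gymnasium.make("Gym-Gridworlds/Full-4x5-v0", infinite_horizon=True, random_reset_prob=0.01)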

Rewards

  • Doing STAY at the goal: +1
  • Doing STAY at a distracting goal: +0.1
  • Any action in penalty tiles: -10
  • Any action in small penalty tiles: -0.1
  • Walking on a pit tile: -100
  • Otherwise: 0

If the environment is made with no_stay=True, then the agent receives positive rewards for any action done in a goal state. Note that the reward still depends on the current state and not on the next state.

The position of positive rewards can be randomized at every reset by making the environment with random_goals=True.

Noisy Rewards
White noise can be added to all rewards by passing reward_noise_std, or only to nonzero rewards with nonzero_reward_noise_std.
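
For example:

import gymnasium
import gym_gridworlds
env = gymnasium.make("Gym-Gridworlds/Full-4x5-v0", reward_noise_std=0.05)          # noise on all rewards
env = gymnasium.make("Gym-Gridworlds/Full-4x5-v0", nonzero_reward_noise_std=0.05)  # noise on nonzero rewards only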

Auxiliary Rewards
Auxiliary rewards based on the Manhattan distance to the closest goal can be added by passing distance_reward=True or distance_difference_reward=True. The former is -distance_at_current_state / max_distance, i.e., the negated distance from the current state, scaled by the size of the grid to lie in the range [-1, 0]. The latter is distance_at_current_state - distance_at_next_state, thus it can be +1 (if the agent moves closer to the goal), 0 (if it does STAY), or -1 (if it moves further from the goal).
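
For example:

import gymnasium
import gym_gridworlds
env = gymnasium.make("Gym-Gridworlds/Full-4x5-v0", distance_reward=True)             # scaled -distance added to the reward
env = gymnasium.make("Gym-Gridworlds/Full-4x5-v0", distance_difference_reward=True)  # +1/0/-1 shaping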

Episode End

By default, an episode ends if any of the following happens:

  • A positive reward is collected (termination),
  • The agent walks on a pit tile (termination),
  • The episode reaches max_episode_steps (truncation).

It is possible to remove termination altogether by making the environment with infinite_horizon=True.
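
In code, termination and truncation are exposed through the standard Gymnasium step API:

import gymnasium
import gym_gridworlds
env = gymnasium.make("Gym-Gridworlds/Penalty-3x3-v0")
obs, _ = env.reset(seed=0)
done = False
while not done:
    obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
    done = terminated or truncated  # goal/pit -> terminated; step limit -> truncated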
