diff --git a/README.md b/README.md
index 1da552d7..450528d8 100644
--- a/README.md
+++ b/README.md
@@ -217,6 +217,11 @@ Because JAX installation is different depending on your CUDA version, Haiku does
 First, follow these instructions to install JAX with the relevant accelerator support.
 
+```
+pip install -r requirements.txt
+```
+
+
 ## General Information
 
 The project entrypoint is `pax/experiment.py`. The simplest command to run a game would be:
diff --git a/docs/getting-started/agents.md b/docs/getting-started/agents.md
index fb5aae31..d7c9711f 100644
--- a/docs/getting-started/agents.md
+++ b/docs/getting-started/agents.md
@@ -1,9 +1,92 @@
 # Agents
-## Agent 1
+## Overview
+
+Pax provides a number of learning agents to train and fixed opponents to train against.
+
+## Specifying an Agent
+
+Pax comes installed with an `Agent` class and several predefined agents. To specify an agent, import the `Agent` class and specify the agent parameters.
+
+```
+import jax
+import jax.numpy as jnp
+
+from pax import Agent  # adjust the import path to your installation
+
+args = {"hidden": 16, "observation_spec": 5}
+rng = jax.random.PRNGKey(0)
+bs = 1
+init_hidden = jnp.zeros((bs, args["hidden"]))
+obs = jnp.ones((bs, 5))
+
+agent = Agent(args)
+state, memory = agent.make_initial_state(rng, init_hidden)
+action, state, memory = agent.policy(rng, obs, memory)
+
+# traj_batch is a batch of trajectories collected from rollouts
+state, memory, stats = agent.update(
+    traj_batch, obs, state, memory
+)
+
+memory = agent.reset_memory(memory, False)
+```
+
+To run an experiment with a specific agent, use a pre-made `.yaml` file located in `conf/...` or create your own, and specify the agent. In the below example, `agent1` is a learning agent that learns via PPO and `agent2` is an agent that only chooses the Cooperate action.
+
+```
+# Agents
+agent1: 'PPO'
+agent2: 'Altruistic'
+
+...
+```
+
+## List of Agents
+
+```{note}
+Fixed agents are game-specific, while learning agents like PPO can be used in both games.
+```
+
+### agent1, agent2
+
+#### Fixed
+
+Matrix games
+
+| Agent | Description |
+| ----------- | ----------- |
+| **`Altruistic`** | Always chooses the Cooperate (C) action. |
+| **`Defect`** | Always chooses the Defect (D) action. |
+| **`GrimTrigger`** | Chooses the C action on the first turn and reciprocates with the C action until the opponent chooses D, at which point Grim switches to only choosing D.|
+| **`HyperAltruistic`** | Infinite matrix game variant of `Altruistic`. Always chooses the Cooperate (C) action.|
+| **`HyperDefect`** | Infinite matrix game variant of `Defect`. Always chooses the Defect (D) action.|
+| **`HyperTFT`** | Infinite matrix game variant of `TitForTat`. Chooses the C action on the first turn and reciprocates the opponent's last action.|
+| **`Random`** | Randomly chooses the C or D action. |
+| **`TitForTat`** | Chooses the C action on the first turn and reciprocates the opponent's last action.|
+
+
+Coin Game
+
+| Agent | Description|
+| ----------- | ----------- |
+| **`EvilGreedy`** | Attempts to pick up the closest coin. If equidistant to two colored coins, it chooses its opponent's color coin.|
+| **`GoodGreedy`** | Attempts to pick up the closest coin. If equidistant to two colored coins, it chooses its own color coin. |
+| **`RandomGreedy`** | Attempts to pick up the closest coin. If equidistant to two colored coins, it randomly chooses a color coin. |
+| **`Stay`** | Agent does not move.|
+
+#### Learning
+
+| Agent | Description |
+| ----------- | ----------- |
+| **`Naive`** | Simple learning agent that learns via REINFORCE. |
+| **`NaiveEx`** | Infinite matrix game variant of `Naive`. Simple learning agent that learns via REINFORCE. |
+| **`MFOS`** | Meta-learning algorithm for opponent shaping. |
+| **`PPO`** | Learning agent parameterised by a multilayer perceptron that learns via PPO. |
+| **`PPO_memory`** | Learning agent parameterised by a multilayer perceptron with a memory component that learns via PPO. |
+| **`Tabular`** | Learning agent parameterised by a single layer perceptron that learns via PPO. |
+
+```{note}
+`PPO_memory` serves as the core learning algorithm for both **Good Shepherd (GS)** and **Context and History Aware Other Shaping (CHAOS)** when training with meta-learning.
+```
-Lorem ipsum.
-## Agent 2
-Lorem ipsum.
diff --git a/docs/getting-started/environments.md b/docs/getting-started/environments.md
index 69b807b7..3f720a9e 100644
--- a/docs/getting-started/environments.md
+++ b/docs/getting-started/environments.md
@@ -1,9 +1,94 @@
 # Environments
-## Environment 1
+## Overview
+Pax supports two environments for learning agents to train within: matrix games and grid-world games.
+
+## Specifying the Environment
+
+Pax environments are similar to gymnax. To specify an environment, import the environment and specify the environment parameters.
+
+```
+import jax
+import jax.numpy as jnp
+
+from pax.envs.iterated_matrix_game import (
+    IteratedMatrixGame,
+    EnvParams,
+)
+
+rng = jax.random.PRNGKey(0)
+# payoff matrix for the iterated prisoner's dilemma
+payoff = [[-1, -1], [-3, 0], [0, -3], [-2, -2]]
+
+env = IteratedMatrixGame(num_inner_steps=5)
+env_params = EnvParams(payoff_matrix=payoff)
+
+# 0 = Defect, 1 = Cooperate
+actions = (jnp.ones(()), jnp.ones(()))
+obs, env_state = env.reset(rng, env_params)
+done = False
+
+while not done:
+    obs, env_state, rewards, done, info = env.step(
+        rng, env_state, actions, env_params
+    )
+```
+
+To specify the parameters for the environment:
+
+```
+...
+# Environment
+env_id: coin_game
+env_type: meta
+egocentric: True
+env_discount: 0.96
+payoff: [[1, 1, -2], [1, 1, -2]]
+...
+```
+
+## List of Environment Parameters
+
+### env_id
+| Name | Description |
+| :----------- | :----------- |
+|`iterated_matrix_game`| Classic normal form game with a 2x2 payoff matrix repeatedly played over `n` steps. |
+|`infinite_matrix_game` | Special case of the classic normal form game that calculates an exact value, simulating an infinite game. |
+|`coin_game` | Classic grid-world social dilemma environment. |
+
+### env_type
+
+| Name | Description |
+| :----------- | :----------- |
+|`sequential`| Standard sequential regime, where an agent learns within repeated episodes without meta-learning. |
+|`meta`| Meta-learning regime, where an agent learns via meta-learning. |
+
+### egocentric
+| Name | Description |
+| :----------- | :----------- |
+|*bool*| If `True`, sets an agent in the Coin Game environment to an egocentric view, empirically found to be more appropriate for other shaping. Otherwise, sets the agent to a non-egocentric view, in line with the original version of the game. |
+
+### env_discount
+
+| Name | Description |
+| :----------- | :----------- |
+|*Numeric*| Meta-learning discount factor. Between 0 and 1. |
+
+### payoff
+| Name | Description |
+| :----------- | :----------- |
+|*Array*| Custom payoff matrix for the game. |
+
+Example:
+
+```
+# if playing Coin Game
+payoff: [[1, 1, -2], [1, 1, -2]]
+```
+
+```
+# if playing Matrix Games
+payoff: [[-1, -1], [-3, 0], [0, -3], [-2, -2]]
+```
+
+```{note}
+Docstrings are under construction. Please check back later.
+```
+
+
-Lorem ipsum.
-## Environment 2
-Lorem ipsum.
diff --git a/docs/getting-started/evaluation.md b/docs/getting-started/evaluation.md
new file mode 100644
index 00000000..d4e2145d
--- /dev/null
+++ b/docs/getting-started/evaluation.md
@@ -0,0 +1,81 @@
+# Saving & Loading
+
+Pax provides an easy way to save and load your models.
+
+## Overview
+
+Saving and loading allows users to save or load models locally or from Weights and Biases. Users can configure the experiment `.yaml` file to set up the save and load file path, either locally or online.
+
+## List of Saving Parameters
+
+### save
+| Name | Description |
+| :----------- | :----------- |
+|*bool* | If `True`, the model is saved to the filepath specified by `save_dir`. |
+
+
+### save_dir
+| Name | Description |
+| :----------- | :----------- |
+|*String* | Filepath used to save a model. |
+
+### save_interval
+
+| Name | Description |
+| :----------- | :----------- |
+|*Int* | Number of iterations between saving a model. |
+
+Example
+```
+# config.yaml
+save: True
+save_interval: 10
+save_dir: "./exp/${wandb.group}/${wandb.name}"
+```
+
+## List of Loading Parameters
+
+### model_path
+| Name | Description |
+| :----------- | :----------- |
+|*String* | Filepath to load the model. |
+
+### run_path
+| Name | Description |
+| :----------- | :----------- |
+|*String* | If using Weights and Biases (i.e. `wandb.log=True`), this is the Weights and Biases run path used to locate the model. |
+
+Example
+```
+# config.yaml
+run_path: ucl-dark/cg/3mpgbfm2
+model_path: exp/coin_game-EARL-PPO_memory-vs-Random/run-seed-0/2022-09-08_20.41.03.643377/generation_30
+```
+
+### wandb
+
+```{note}
+The following parameters are used for Weights and Biases specific features.
+```
+
+```
+wandb:
+  entity: "ucl-dark"
+  project: cg
+  group: 'EARL-${agent1}-vs-${agent2}'
+  name: run-seed-${seed}
+  log: False
+```
+| Name | Description |
+| :----------- | :----------- |
+|`entity` | Weights and Biases entity. |
+|`project` | Weights and Biases project name. |
+|`group` | Weights and Biases group name. |
+|`name` | Weights and Biases run name. |
+|`log` | Whether to log the run to Weights and Biases. |
+
+
+
+
+
diff --git a/docs/getting-started/installation.md b/docs/getting-started/installation.md
index 58232513..4653f9cf 100644
--- a/docs/getting-started/installation.md
+++ b/docs/getting-started/installation.md
@@ -1,7 +1,3 @@
 # Installation
-Pax is written in pure Python, but depends on C++ code via JAX.
-
-Because JAX installation is different depending on your CUDA version, Haiku does not list JAX as a dependency in requirements.txt.
-
-First, follow these instructions to install JAX with the relevant accelerator support.
+Pax will soon be available to install via the [Python Package Index](https://github.com/akbir/pax).
+For full installation instructions, please refer to the [Install Guide](https://github.com/akbir/pax) in the project README.
diff --git a/docs/getting-started/runners.md b/docs/getting-started/runners.md
index 899f24fb..2d385db3 100644
--- a/docs/getting-started/runners.md
+++ b/docs/getting-started/runners.md
@@ -1,9 +1,95 @@
 # Runner
-## Runner 1
+## Overview
-Lorem ipsum.
+
+Pax provides a number of experiment runners useful for different use cases of training and evaluating reinforcement learning agents.
-## Runner 2
+## Specifying a Runner
-Lorem ipsum.
+Pax centers around its runners: pieces of custom experiment logic that leverage the speed of JAX. After specifying the environment and agents, a runner carries out the experiment. The code below shows a portion of a runner that carries out a rollout and updates the agent:
+
+```
+def _rollout(carry, unused):
+    """Runner for inner episode"""
+    (
+        rngs,
+        obs,
+        a1_state,
+        a1_mem,
+        env_state,
+        env_params,
+    ) = carry
+
+    # unpack rngs
+    rngs = self.split(rngs, 4)
+    action, a1_state, new_a1_mem = agent1.batch_policy(
+        a1_state,
+        obs[0],
+        a1_mem,
+    )
+
+    next_obs, env_state, rewards, done, info = env.step(
+        rngs,
+        env_state,
+        (action, action),
+        env_params,
+    )
+
+    traj = Sample(
+        obs[0],
+        action,
+        rewards[0],
+        new_a1_mem.extras["log_probs"],
+        new_a1_mem.extras["values"],
+        done,
+        a1_mem.hidden,
+    )
+
+    return (
+        rngs,
+        next_obs,
+        a1_state,
+        new_a1_mem,
+        env_state,
+        env_params,
+    ), traj
+
+
+agent = Agent(args)
+a1_state, a1_mem = agent.make_initial_state(rng, init_hidden)
+
+for _ in range(num_updates):
+    final_timestep, batch_trajectory = jax.lax.scan(
+        _rollout,
+        (rngs, obs, a1_state, a1_mem, env_state, env_params),
+        None,
+        rollout_length,
+    )
+
+    (_, obs, a1_state, a1_mem, _, _) = final_timestep
+
+    a1_state, a1_mem, stats = agent.update(
+        batch_trajectory, obs[0], a1_state, a1_mem
+    )
+```
+
+To specify the runner in an experiment, use a pre-made `.yaml` file located in `conf/...` or create your own, and 
specify the runner with `runner`. In the below example, the `evo` flag selects the `EvoRunner`.
+
+```
+...
+# Runner
+runner: evo
+...
+```
+
+## List of Runners
+
+### runner
+| Runner | Description|
+| ----------- | ----------- |
+| **`eval`** | Evaluation runner, where a single, pre-trained agent is evaluated. |
+| **`evo`** | Evolution runner, where two independent agents are trained via Evolutionary Strategies (ES). |
+| **`rl`** | Multi-agent runner, where two independent agents are trained via reinforcement learning. |
+| **`sarl`** | Single-agent runner, where a single agent is trained via reinforcement learning. |
\ No newline at end of file
diff --git a/docs/getting-started/training.md b/docs/getting-started/training.md
new file mode 100644
index 00000000..f4adbbe3
--- /dev/null
+++ b/docs/getting-started/training.md
@@ -0,0 +1,178 @@
+# Training
+
+Pax provides fully configurable training parameters for experiments.
+
+## Overview
+
+Training parameters allow users to fully specify the training protocol of their experiment. Users can configure the experiment `.yaml` file to specify details such as episode length, number of environments, and much more.
+
+## List of Training Parameters
+
+### ppo
+
+| Name | Type | Description |
+| :----------- | :----------- | :----------- |
+| `num_minibatches` | *int*| Number of minibatches. |
+| `num_epochs` | *int* | Number of epochs. |
+| `gamma` | *Numeric*| Discount factor $\gamma$. |
+| `gae_lambda` | *Numeric*| Generalized advantage estimation $\lambda$ factor. |
+| `ppo_clipping_epsilon` | *Numeric*| Clipping factor $\epsilon$. |
+| `value_coeff` | *Numeric*| Value loss coefficient. |
+| `clip_value` | *bool*| Whether to clip the value function. |
+| `max_gradient_norm` | *Numeric*| Maximum gradient norm. |
+| `anneal_entropy` | *bool* | Whether to anneal the entropy term. |
+| `entropy_coeff_start` | *Numeric*| Starting entropy annealing coefficient. |
+| `entropy_coeff_horizon` |*Numeric*| Number of iterations before the entropy coefficient reaches `entropy_coeff_end`. |
+| `entropy_coeff_end` | *Numeric*| Ending entropy annealing coefficient. |
+| `lr_scheduling` | *bool* | Whether to anneal the learning rate. |
+| `learning_rate` | *Numeric*| Learning rate. |
+| `adam_epsilon` | *Numeric*| Adam epsilon. |
+| `with_memory` | *bool*| Whether to use memory. |
+| `with_cnn` |*bool* | Whether to use a CNN in Coin Game. |
+| `output_channels` | *int*| Number of output channels. |
+| `kernel_shape` | *Array*| Kernel shape. |
+| `separate` | *bool*| Whether to use separate networks in the CNN. |
+| `hidden_size` | *Numeric*| Hidden size of the memory layer. |
+
+Example
+```
+# config.yaml
+ppo:
+  num_minibatches: 8
+  num_epochs: 2
+  gamma: 0.96
+  gae_lambda: 0.95
+  ppo_clipping_epsilon: 0.2
+  value_coeff: 0.5
+  clip_value: True
+  ...
+```
+
+### es
+
+| Name | Type | Description |
+| :----------- | :----------- | :----------- |
+| `algo` | *String*| Algorithm to use. Currently supports `[OpenES, CMA_ES, SimpleGA]`. |
+| `sigma_init` | *Numeric* | Initial scale of isotropic Gaussian noise. |
+| `sigma_decay` | *Numeric*| Multiplicative decay factor. |
+| `sigma_limit` | *Numeric*| Smallest possible scale. |
+| `init_min` | *Numeric*| Range of parameter mean initialization - min. |
+| `init_max` | *Numeric*| Range of parameter mean initialization - max. |
+| `clip_min` | *Numeric*| Range of parameter proposals - min. |
+| `clip_max` | *Numeric*| Range of parameter proposals - max. |
+| `lrate_init` | *Numeric* | Initial learning rate. |
+| `lrate_decay` | *Numeric*| Multiplicative learning rate decay factor. |
+| `lrate_limit` |*Numeric*| Smallest possible learning rate. |
+| `beta_1` | *Numeric*| Adam beta_1. |
+| `beta_2` | *Numeric* | Adam beta_2. |
+| `eps` | *Numeric*| Adam epsilon constant. |
+| `elite_ratio` | *Numeric*| Percentage of elites to keep. |
+
+Example
+```
+# config.yaml
+es:
+  algo: OpenES
+  sigma_init: 0.04
+  sigma_decay: 0.999
+  sigma_limit: 0.01
+  init_min: 0.0
+  init_max: 0.0
+  clip_min: -1e10
+  clip_max: 1e10
+  lrate_init: 0.1
+  lrate_decay: 0.9999
+  lrate_limit: 0.001
+  beta_1: 0.99
+  beta_2: 0.999
+  eps: 1e-8
+  elite_ratio: 0.1
+```
+
+## List of Training Hyperparameters
+
+### num_devices
+| Name | Description |
+| :----------- | :----------- |
+|*Numeric* | Number of devices used to train the agent. Values greater than `1` require multiple GPUs.|
+
+```{note}
+The following piece of code can be used to debug multi-device setups on CPU if run at the top of `experiment.py`.
+```
+
+```
+import os
+from jax.config import config
+os.environ["XLA_FLAGS"] = "--xla_force_host_platform_device_count=2"
+config.update('jax_disable_jit', True)
+```
+
+### num_envs
+| Name | Description |
+| :----------- | :----------- |
+|*Numeric* | Number of environments used to train the agent.|
+
+### num_generations
+
+| Name | Description |
+| :----------- | :----------- |
+|*Numeric* | Number of generations to train the agent when training with evolution.|
+
+### num_inner_steps
+| Name | Description |
+| :----------- | :----------- |
+|*Numeric* | Number of inner steps within an episode. Set equal to `num_steps` when running `env_type: sequential`.|
+
+### num_opps
+| Name | Description |
+| :----------- | :----------- |
+|*Numeric* | Number of opponents in each environment. Typically set to `1`. |
+
+### num_steps
+| Name | Description |
+| :----------- | :----------- |
+|*Numeric* | Number of steps in a meta episode. |
+
+Example:
+```
+num_inner_steps: 16 # Episode length
+num_steps: 9600 # Steps in a meta-episode
+```
+
+Following the formula `number of episodes = num_steps / num_inner_steps`, we can calculate the number of episodes. In this example, each rollout will contain 600 episodes of length 16 (`600 episodes = 9600 steps / 16 steps per episode`).
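As a quick sanity check, the episode arithmetic above can be reproduced in plain Python (the variable names below are illustrative, not part of the Pax API):

```python
# Values taken from the example config above.
num_inner_steps = 16  # episode length
num_steps = 9600      # steps in a meta-episode

# number of episodes = num_steps / num_inner_steps
num_episodes = num_steps // num_inner_steps
print(num_episodes)  # → 600
```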
+
+### popsize
+| Name | Description |
+| :----------- | :----------- |
+|*Numeric* | Size of the population when training with evolution. |
+
+### top_k
+| Name | Description |
+| :----------- | :----------- |
+| *Numeric* | Number of top agents to log when training with evolution. |
+
+Example
+```
+# config.yaml
+top_k: 5
+popsize: 128
+num_envs: 50
+num_opps: 1
+num_devices: 2
+num_steps: 9600
+num_inner_steps: 16
+num_generations: 2000
+```
\ No newline at end of file
diff --git a/docs/index.md b/docs/index.md
index 1bb44563..e5d0352f 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -1,10 +1,14 @@
-# Pax - Multi-Agent Learning in JAX
+# Pax
 
-Pax is an experiment runner for multi-agent research built on top of JAX. It supports "other agent shaping", "multi agent RL" and "single agent RL" experiments. It supports regular and meta agents, and evolutionary and RL-based optimisation.
+````{note}
+This documentation is under construction. Please check back later.
+````
+
+Pax is an experiment platform for multi-agent shaping research built on top of JAX. It provides support for other-agent shaping and single/multi-agent reinforcement learning experiments with matrix/2D **environments**, regular/meta-learning **agents**, and evolutionary/RL-based optimisation **runners**.
 
 > *Pax (noun) - a period of peace that has been forced on a large area, such as an empire or even the whole world*
 
-Pax is composed of 3 components: Environments, Agents and Runners.
+
 
 ```{toctree}
@@ -17,4 +21,6 @@ getting-started/installation
 getting-started/environments
 getting-started/agents
 getting-started/runners
+getting-started/training
+getting-started/evaluation
 ```