[Ready to merge] Add Parallel Q Network (PQN) cleanrl example and single-discrete action support #227

Merged
Ivan-267 merged 13 commits into main from AddPQNSupport
Mar 14, 2025

Conversation


@Ivan-267 Ivan-267 commented Mar 5, 2025

Adds basic Parallel Q Network (PQN) support, based on the CleanRL script, with changes like those applied to the PPO example.

https://docs.cleanrl.dev/rl-algorithms/pqn/

PQN is a parallelized version of the Deep Q-learning algorithm. It is designed to be more efficient than DQN by using multiple agents to interact with the environment in parallel. PQN can be thought of as DQN (1) without replay buffer and target networks, and (2) with layer normalizations and parallel environments.

This is well suited for GDRL as we usually use multiple parallel agents.

Modifies our action preprocessor to support a single discrete action (this part needs more testing).

The algorithm will support a single obs space and single action space for now.
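The key architectural change PQN makes, LayerNorm after each hidden layer in place of a target network, can be sketched as follows. This is a minimal illustration in PyTorch, not the network from this PR; the dimensions and names are made up:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Sketch of a PQN-style Q-network: an MLP with LayerNorm after each
    hidden linear layer. Sizes are illustrative only."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 120):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.LayerNorm(hidden),  # normalization stabilizes training without target networks
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.LayerNorm(hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),  # one Q-value per discrete action
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)
```

With a single discrete action space, acting greedily is just an argmax over the output, which is why the action preprocessor change above is needed.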

TODO:

  • Fix failing tests
  • Add most of the standard arguments (to start with, only timesteps, env path, n_parallel, speedup, and viz)
  • Test SB3 to confirm it's not affected (automated tests should cover the basic SF/Rllib cases).

TODO (optional, might also be in a future PR)

  • Add rewards being reported as in PPO script
  • Add onnx export


Ivan-267 commented Mar 5, 2025

PQN.mp4

Quick test - realtime video, training in editor.

Params used: mostly the CleanRL defaults, except the timesteps:

    # Algorithm specific arguments
    env_id: str = "CartPole-v1"
    """the id of the environment"""
    total_timesteps: int = 1_000_000
    """total timesteps of the experiments"""
    learning_rate: float = 2.5e-4
    """the learning rate of the optimizer"""
    num_envs: int = 4
    """the number of parallel game environments"""
    num_steps: int = 128
    """the number of steps to run for each environment per update"""
    num_minibatches: int = 4
    """the number of mini-batches"""
    update_epochs: int = 4
    """the K epochs to update the policy"""
    anneal_lr: bool = True
    """Toggle learning rate annealing"""
    gamma: float = 0.99
    """the discount factor gamma"""
    start_e: float = 1
    """the starting epsilon for exploration"""
    end_e: float = 0.05
    """the ending epsilon for exploration"""
    exploration_fraction: float = 0.5
    """the fraction of `total_timesteps` it takes from start_e to end_e"""
    max_grad_norm: float = 10.0
    """the maximum norm for the gradient clipping"""
    q_lambda: float = 0.65
    """the lambda for Q(lambda)"""

The env was modified to use discrete actions but might also have other modifications as it's a local test version.
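For reference, the exploration schedule implied by `start_e`, `end_e`, and `exploration_fraction` above is a simple linear anneal of epsilon over a fraction of total timesteps. A sketch along the lines of the CleanRL helper (function name and wiring here are illustrative):

```python
def linear_schedule(start_e: float, end_e: float, duration: int, t: int) -> float:
    """Linearly anneal epsilon from start_e down to end_e over `duration`
    steps, then hold it at end_e."""
    slope = (end_e - start_e) / duration
    return max(slope * t + start_e, end_e)

# With the defaults above, epsilon decays from 1.0 to 0.05 over the
# first half of training:
#   duration = exploration_fraction * total_timesteps = 0.5 * 1_000_000
```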


Ivan-267 commented Mar 6, 2025

Added onnx export support. The agent can move in 4 directions only and can sometimes get stuck, but this can happen with other algorithms too, so it's not PQN-related; something like frame stacking, recurrence, or other techniques might help there.

onnx.mp4

The exported onnx shown in Netron: (image: exported_onnx)


Ivan-267 commented Mar 6, 2025

I've done some quick tests now with SB3 discrete, multi-discrete, and continuous versions of BallChase. I briefly started training on Sample Factory (WSL) as well.

Notes: `pip install tyro` will be needed, as tyro is used by the CleanRL script to manage arguments. If we add a tutorial/doc for this algorithm in the future, we can mention it there.

@Ivan-267 Ivan-267 changed the title [WIP] Add Parallel Q Network (PQN) cleanrl example and single-discrete action support Add Parallel Q Network (PQN) cleanrl example and single-discrete action support Mar 6, 2025
@Ivan-267 Ivan-267 requested a review from edbeeching March 6, 2025 16:42
@Ivan-267 Ivan-267 changed the title Add Parallel Q Network (PQN) cleanrl example and single-discrete action support [Ready to merge] Add Parallel Q Network (PQN) cleanrl example and single-discrete action support Mar 6, 2025

Ivan-267 commented Mar 11, 2025

Another test with n_parallel = 4 and:

    """the id of the environment"""
    total_timesteps: int = 5_000_000
    """total timesteps of the experiments"""
    learning_rate: float = 2.5e-4
    """the learning rate of the optimizer [note: automatically set]"""
    num_envs: int = 4
    """the number of parallel game environments"""
    num_steps: int = 32
    """the number of steps to run for each environment per update"""
    num_minibatches: int = 1
    """the number of mini-batches"""
    update_epochs: int = 32
    """the K epochs to update the policy"""
    anneal_lr: bool = False
    """Toggle learning rate annealing"""
    gamma: float = 0.99
    """the discount factor gamma"""
    start_e: float = 1.0
    """the starting epsilon for exploration"""
    end_e: float = 0.05
    """the ending epsilon for exploration"""
    exploration_fraction: float = 0.75
    """the fraction of `total_timesteps` it takes from start_e to end_e"""
    max_grad_norm: float = 10.0
    """the maximum norm for the gradient clipping"""
    q_lambda: float = 0.65
    """the lambda for Q(lambda)"""

The test env is not released, but it is a modification of the completed version of the tutorial env, with a discrete action, frame stacking of 8 for the raycast and vector-to-goal observations, and a one-hot encoded previous action (not stacked) included in the obs.
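The `q_lambda` parameter above controls how the Q(lambda) targets blend the one-step greedy bootstrap with the multi-step return. A simplified numpy sketch of the backward pass over a rollout, in the spirit of the CleanRL PQN update (function name and shapes are illustrative, not the script's actual code):

```python
import numpy as np

def q_lambda_returns(rewards, next_q_max, dones, gamma=0.99, q_lambda=0.65):
    """Sketch of Q(lambda) targets computed backwards over a rollout.

    rewards, next_q_max, dones: arrays of shape (num_steps,), where
    next_q_max[t] = max_a Q(s_{t+1}, a) under the online network.
    """
    T = len(rewards)
    returns = np.zeros(T)
    for t in reversed(range(T)):
        if t == T - 1:
            # last step: plain one-step bootstrap from the greedy Q-value
            returns[t] = rewards[t] + gamma * (1.0 - dones[t]) * next_q_max[t]
        else:
            # blend the one-step bootstrap with the lambda-return from t+1
            blended = (1.0 - q_lambda) * next_q_max[t] + q_lambda * returns[t + 1]
            returns[t] = rewards[t] + gamma * (1.0 - dones[t]) * blended
    return returns
```

With `q_lambda = 0` this reduces to the one-step DQN target; with `q_lambda = 1` it approaches a Monte Carlo return with a final bootstrap.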

RandomizedPositions.mp4

@edbeeching edbeeching (Owner) left a comment


LGTM

@Ivan-267 Ivan-267 merged commit 12ab0d4 into main Mar 14, 2025
13 checks passed
@Ivan-267 Ivan-267 deleted the AddPQNSupport branch March 14, 2025 04:41