
Conversation

@MatthewCWeston
Contributor

@MatthewCWeston MatthewCWeston commented Sep 7, 2025

Why are these changes needed?

At present, pettingzoo_shared_value_function.py is a placeholder file that returns an error message when run. This PR implements it in a way that permits direct comparison with pettingzoo_parameter_sharing.py and pettingzoo_independent_learning.py.

My motivation for this is that, when I first saw the placeholder file name, I hoped to use it as a reference when working on my current project, which requires a shared critic. Having completed that part of my project, I cleaned up and generalized the code so that I could share it with other RLlib developers.

In brief, this PR implements the placeholder example file by including an implementation of MAPPO in the examples section, in a similar vein to how other examples implement VPG, AlphaStar-style league-based training, and MobileNet. This provides a working example of how a shared value function, used in a wide variety of multi-agent reinforcement learning papers, can be implemented in RLlib using the new API stack.
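
For a flavor of the wiring, here is a minimal sketch distilled from the full example script shared later in this thread (module and constant names come from the new examples/algorithms/mappo/ package this PR adds; the observation/action spaces and model config that the real script passes to the critic spec are omitted here for brevity):

    from ray.rllib.core.rl_module.multi_rl_module import MultiRLModuleSpec
    from ray.rllib.core.rl_module.rl_module import RLModuleSpec
    from ray.rllib.examples.algorithms.mappo.connectors.general_advantage_estimation import (
        SHARED_CRITIC_ID,
    )
    from ray.rllib.examples.algorithms.mappo.torch.shared_critic_torch_rl_module import (
        SharedCriticTorchRLModule,
    )

    # One policy module per agent, plus a single shared critic module that is
    # built only on the learner side (it never acts in the environment).
    policies = [f"pursuer_{i}" for i in range(2)]
    specs = {p: RLModuleSpec() for p in policies}
    specs[SHARED_CRITIC_ID] = RLModuleSpec(
        module_class=SharedCriticTorchRLModule,
        learner_only=True,
    )
    spec = MultiRLModuleSpec(rl_module_specs=specs)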

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Release tests

Note

Introduces a Torch MAPPO with a shared critic and implements the PettingZoo Waterworld shared-value-function example, activating its CI test.

  • Examples (Multi-agent):
    • Implement examples/multi_agent/pettingzoo_shared_value_function.py using a shared critic (MAPPO) for PettingZoo Waterworld; wires MultiRLModuleSpec with per-agent policies and a shared_critic.
    • Minor doc tweak in examples/multi_agent/shared_encoder_cartpole.py (expected reward text).
  • Algorithm Scaffolding (MAPPO, Torch):
    • Add MAPPO core under examples/algorithms/mappo/:
      • mappo.py + MAPPOConfig (training params, defaults, validators).
      • mappo_learner.py base learner (entropy/KL schedulers, GAE connector hookup).
      • Torch impl: torch/mappo_torch_learner.py (policy loss, critic loss, KL/entropy metrics).
      • Default policy module: default_mappo_rl_module.py + torch/default_mappo_torch_rl_module.py.
      • Shared critic: shared_critic_rl_module.py + torch/shared_critic_torch_rl_module.py with shared_critic_catalog.py.
      • Catalogs: mappo_catalog.py; GAE connector: connectors/general_advantage_estimation.py.
  • Build/Test:
    • Activate py_test for examples/multi_agent/pettingzoo_shared_value_function in rllib/BUILD.bazel.

Written by Cursor Bugbot for commit 2ec32d4.

@MatthewCWeston MatthewCWeston requested a review from a team as a code owner September 7, 2025 00:57
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a working implementation of pettingzoo_shared_value_function.py, providing a valuable example of a multi-agent setup with a shared critic using a MAPPO-style algorithm. The implementation is well-structured across several new files, adhering to RLlib's new API stack conventions. My review focuses on improving maintainability, consistency with best practices, and documentation accuracy. Key suggestions include inheriting MAPPOConfig from PPOConfig to reduce redundancy, using the logging module for error handling instead of print statements, and correcting an inaccurate docstring.
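
For readers unfamiliar with that last suggestion, it amounts to something like the following generic sketch (not the PR's actual code; the validation helper and config keys are hypothetical):

    import logging

    logger = logging.getLogger(__name__)

    def validate_num_agents(config: dict) -> None:
        # Hypothetical validation helper: report problems via the logging module
        # (with a severity level and logger name) instead of a bare print().
        if config.get("num_agents", 0) <= 0:
            logger.error("Invalid config: num_agents=%s", config.get("num_agents"))
            raise ValueError("num_agents must be > 0")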

@ray-gardener ray-gardener bot added the rllib, docs, and community-contribution labels Sep 7, 2025
@MatthewCWeston MatthewCWeston changed the title from "Working implementation of pettingzoo_shared_value_function.py" to "[RLlib] Working implementation of pettingzoo_shared_value_function.py" Sep 10, 2025
@github-actions

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions bot added the stale label Sep 29, 2025
@MatthewCWeston
Contributor Author

Ran a manual merge on BUILD, since it was updated elsewhere and this PR un-comments the relevant test. Checked everything again and found it working as expected.


@github-actions github-actions bot added the unstale label and removed the stale label Sep 29, 2025
@HassamSheikh HassamSheikh self-assigned this Oct 9, 2025
@pseudo-rnd-thoughts pseudo-rnd-thoughts added the go label Nov 10, 2025
@juanm4morales

This will be very helpful. A working PettingZoo shared value function example (MAPPO on the new RLlib API) is exactly the guide many of us need for implementing centralized critics.

@MatthewCWeston
Contributor Author

@pseudo-rnd-thoughts Just checking in - is there anything further that I need to do to prepare this PR for a merge? It's had the "go" label for a while.

@pseudo-rnd-thoughts pseudo-rnd-thoughts removed the go label Dec 11, 2025
@pseudo-rnd-thoughts
Member

pseudo-rnd-thoughts commented Jan 7, 2026

My primary worry, @MatthewCWeston, is the number of files, as I believe that single-file examples are best.

For me, I don't want to have to read 5+ different files to understand an example implementation.
For example, the original shared_critic implementation was a single-file example.

(I know that not all RLlib examples follow this philosophy, though it is something I'm interested in changing in the future.)

@MatthewCWeston
Contributor Author

MatthewCWeston commented Jan 7, 2026

@pseudo-rnd-thoughts Looking through the files:

# Analogous to examples/_old_api_stack/models/centralized_critic_models.py.
# I should note that the original example didn't actually centralize the critic; it just
# concatenated the agents' observations before training two separate critics.
default_mappo_rl_module.py; shared_critic_rl_module.py

# Catalogs for the above modules, in keeping with the style of existing algorithm implementations.
shared_critic_catalog.py; mappo_catalog.py

# Analogous to CentralizedCritic(PPO), which is defined directly in ray/rllib/examples/_old_api_stack/centralized_critic.py
mappo.py

# Analogous to CentralizedValueMixin, which is defined directly in ray/rllib/examples/_old_api_stack/centralized_critic.py
connectors/general_advantage_estimation.py

# A Learner, which is new to the new API stack. Functionally, it covers some of what the classes defined in ray/rllib/examples/_old_api_stack/centralized_critic.py handle.
mappo_learner.py

To summarize, much of it is a combination of following the precedent of other algorithm implementations in terms of file structure and accounting for the new API stack. Some of it, however, is the fact that the old example doesn't appear to actually employ a centralized critic in the sense used in reinforcement learning research - rather, it augments each agent's independent critic with the other agent's observations.

Given that the other two examples in the PettingZoo 'series' are about sharing weights versus independent learning, and the title of the placeholder was pettingzoo_shared_value_function, I think the intent was for the critic's weights to be shared between agents, which isn't what the old API stack's example ended up doing.
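
To make the distinction concrete, here is an illustrative PyTorch sketch (made-up dimensions, not code from either example):

    import torch
    import torch.nn as nn

    n_agents, obs_dim, batch = 2, 8, 32
    global_obs = torch.randn(batch, n_agents * obs_dim)  # concatenated observations

    # Old-API-stack example, as described above: each agent keeps its own critic
    # (separate weights), merely fed with the concatenated global observation.
    per_agent_critics = {
        f"pursuer_{i}": nn.Linear(n_agents * obs_dim, 1) for i in range(n_agents)
    }
    per_agent_values = {k: c(global_obs) for k, c in per_agent_critics.items()}

    # Shared value function, as in this PR: one critic with one set of weights,
    # producing a value estimate for every agent from the same global observation.
    shared_critic = nn.Linear(n_agents * obs_dim, n_agents)
    shared_values = shared_critic(global_obs)  # shape: [batch, n_agents]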

If you like, I could merge the torch/general versions of some classes into the same files, and aggregate the models likewise. This matches the style of some other examples, and would cut down on the file count.

@JiangpengLI86

JiangpengLI86 commented Jan 8, 2026

My primary worry, @MatthewCWeston, is the number of files, as I believe that single-file examples are best.

That said, from my perspective, aligning the structure with other algorithm implementations can give readers—especially beginners like myself—a clearer mental model of what each component is responsible for. In contrast, the existing examples/_old_api_stack/models/centralized_critic_models.py is somewhat difficult to follow.

@JiangpengLI86 JiangpengLI86 left a comment

Issue Summary

There is a minor bug in the current implementation of MAPPOGAEConnector: it does not correctly handle scenarios where other agents use stateful encoders (e.g., LSTMs). This leads to incorrect tensor shapes and indexing during advantage and value computation.


Reproduction:

The issue can be reproduced by enabling an LSTM-based model in the PettingZoo MAPPO example.
Replace

specs = {p: RLModuleSpec() for p in policies}

with:

    from ray.rllib.core.rl_module.default_model_config import DefaultModelConfig
    model_config = DefaultModelConfig(
        use_lstm=True,
    )
    specs = {p: RLModuleSpec(model_config=model_config) for p in policies}

Proposed Fix:

The root cause is inconsistent tensor construction and indexing in
rllib/examples/algorithms/mappo/connectors/general_advantage_estimation.py
when aggregating per-agent observations, value predictions, and postprocessed targets.

Fix critic observation concatenation

Replace

        critic_batch[Columns.OBS] = torch.cat(
            [batch[k][Columns.OBS] for k in obs_mids], dim=1
        )

with:

        critic_batch[Columns.OBS] = torch.cat(
            [batch[k][Columns.OBS] for k in obs_mids], dim=-1
        )

Fix critic-function prediction unpacking

Replace

        vf_preds = {mid: vf_preds[:, i] for i, mid in enumerate(obs_mids)}

with:

        vf_preds = {mid: vf_preds[..., i] for i, mid in enumerate(obs_mids)}

Fix stacking of value targets and advantages

Replace

        critic_batch[Postprocessing.VALUE_TARGETS] = np.stack(
            [batch[mid][Postprocessing.VALUE_TARGETS] for mid in obs_mids], axis=1
        )
        critic_batch[Postprocessing.ADVANTAGES] = np.stack(
            [batch[mid][Postprocessing.ADVANTAGES] for mid in obs_mids], axis=1
        )

with:

        critic_batch[Postprocessing.VALUE_TARGETS] = np.stack(
            [batch[mid][Postprocessing.VALUE_TARGETS] for mid in obs_mids], axis=-1,
        )
        critic_batch[Postprocessing.ADVANTAGES] = np.stack(
            [batch[mid][Postprocessing.ADVANTAGES] for mid in obs_mids], axis=-1,
        )

These changes make the connector compatible with stateful (e.g., LSTM-based) encoders in the agent modules without affecting the stateless case, and restore correct multi-agent value and advantage computation.
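
A quick standalone shape check (illustrative sizes) shows why the axis matters once a stateful encoder adds a time dimension to the batch:

    import torch

    B, T, obs_dim, n_agents = 4, 5, 8, 2

    # Without an LSTM, per-agent observations are [B, obs_dim]; with one, [B, T, obs_dim].
    obs_flat = [torch.randn(B, obs_dim) for _ in range(n_agents)]
    obs_time = [torch.randn(B, T, obs_dim) for _ in range(n_agents)]

    # dim=1 hits the time axis once a time dimension is present ...
    print(torch.cat(obs_time, dim=1).shape)   # torch.Size([4, 10, 8])  (wrong axis)
    # ... whereas dim=-1 always targets the feature axis, in both cases.
    print(torch.cat(obs_time, dim=-1).shape)  # torch.Size([4, 5, 16])
    print(torch.cat(obs_flat, dim=-1).shape)  # torch.Size([4, 16])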

@MatthewCWeston
Contributor Author

@JiangpengLI86 Happy to add support for stateful modules. Have you tested the changes, and can you add an LSTM option flag to the example so that I can run it?

@JiangpengLI86

JiangpengLI86 commented Jan 10, 2026

@JiangpengLI86 Happy to add support for stateful modules. Have you tested the changes, and can you add an LSTM option flag to the example so that I can run it?

Sure thing. I tried it on my side and slightly modified your original example script so you can test it out as well:

Example Script:
from pettingzoo.sisl import waterworld_v4

from ray.rllib.core.rl_module.multi_rl_module import MultiRLModuleSpec
from ray.rllib.core.rl_module.rl_module import RLModuleSpec
from ray.rllib.env.wrappers.pettingzoo_env import ParallelPettingZooEnv
from ray.rllib.examples.algorithms.mappo.connectors.general_advantage_estimation import (
    SHARED_CRITIC_ID,
)
from ray.rllib.examples.algorithms.mappo.mappo import MAPPOConfig
from ray.rllib.examples.algorithms.mappo.torch.shared_critic_torch_rl_module import (
    SharedCriticTorchRLModule,
)
from ray.rllib.utils.test_utils import (
    add_rllib_example_script_args,
    run_rllib_example_script_experiment,
)
from ray.tune.registry import register_env
from ray.rllib.core.rl_module.default_model_config import DefaultModelConfig
parser = add_rllib_example_script_args(
    default_iters=200,
    default_timesteps=1000000,
    default_reward=0.0,
)
parser.add_argument(
    "--use-lstm",
    action="store_true",
    help="Whether to use LSTM encoders for the agents' policies.",
)


if __name__ == "__main__":
    args = parser.parse_args()

    assert args.num_agents > 0, "Must set --num-agents > 0 when running this script!"

    # Here, we use the "Parallel" PettingZoo environment type.
    # This allows MAPPO's global observations to be constructed more neatly.
    def get_env(_):
        return ParallelPettingZooEnv(waterworld_v4.parallel_env())

    register_env("env", get_env)

    # Policies are called just like the agents (exact 1:1 mapping).
    policies = [f"pursuer_{i}" for i in range(args.num_agents)]

    # An agent for each of our policies, and a single shared critic
    env_instantiated = get_env({})  # necessary for non-agent modules
    if args.use_lstm:
        model_config = DefaultModelConfig(
            use_lstm=True,
        )
    else:
        model_config = {}

    specs = {p: RLModuleSpec(model_config=model_config) for p in policies}
    # specs = {p: RLModuleSpec() for p in policies}
    specs[SHARED_CRITIC_ID] = RLModuleSpec(
        module_class=SharedCriticTorchRLModule,
        observation_space=env_instantiated.observation_space[policies[0]],
        action_space=env_instantiated.action_space[policies[0]],
        learner_only=True,  # Only build on learner
        model_config={
            "observation_spaces": env_instantiated.observation_space,
        },
    )

    base_config = (
        MAPPOConfig()
        .environment("env")
        .multi_agent(
            policies=policies + [SHARED_CRITIC_ID],
            # Exact 1:1 mapping from AgentID to ModuleID.
            policy_mapping_fn=(lambda aid, *args, **kwargs: aid),
        )
        .rl_module(
            rl_module_spec=MultiRLModuleSpec(
                rl_module_specs=specs,
            ),
        )
    )

    run_rllib_example_script_experiment(base_config, args)
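
With those changes in place, the modified script can be run like the original example, for instance as follows (path and flags assumed from the script above; --num-agents is presumably provided by add_rllib_example_script_args):

    python pettingzoo_shared_value_function.py --num-agents=2 --use-lstm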

The modification to the original connector is minor and easy to understand.

But enabling a stateful centralized critic seems more complicated and often unnecessary, since the critic can observe the whole environment and is therefore Markovian.

@MatthewCWeston
Contributor Author

@JiangpengLI86 Added.

@MatthewCWeston
Contributor Author

Come to think of it, given the degree to which MAPPO has been developed here, it could serve as a general-use algorithm, akin to APPO. It's a pretty ubiquitous MARL benchmark algorithm, after all. I could then solve the issue with the file count by porting the algorithm implementation over from examples into algorithms, which would make pettingzoo_shared_value_function a single-file, self-contained example script in line with the others.

Would this be a desirable change?

return MAPPOConfig()


class MAPPOConfig(AlgorithmConfig): # AlgorithmConfig -> PPOConfig -> MAPPO
Contributor

Is it possible to inherit MAPPOConfig from PPOConfig? I see you have made a comment about it, but you are not inheriting from PPOConfig. There is significant overlap between the MAPPO and PPO implementations (60-70% according to Copilot).

Contributor Author

I tried it out, but unfortunately, in practice, every method implemented in MAPPOConfig needs to avoid things done in its PPOConfig equivalent, so it doesn't end up saving any space. The natural inheritance hierarchy would go the other way around (which can be done if we want to add MAPPO's implementation to rllib/algorithms rather than rllib/examples).

@MatthewCWeston
Contributor Author

@HassamSheikh Do the changes made above seem about right? Let me know if there are any other changes I should make.
