[RLlib] Working implementation of pettingzoo_shared_value_function.py #56309

Conversation
Code Review
This pull request introduces a working implementation of pettingzoo_shared_value_function.py, providing a valuable example of a multi-agent setup with a shared critic using a MAPPO-style algorithm. The implementation is well-structured across several new files, adhering to RLlib's new API stack conventions. My review focuses on improving maintainability, consistency with best practices, and documentation accuracy. Key suggestions include inheriting MAPPOConfig from PPOConfig to reduce redundancy, using the logging module for error handling instead of print statements, and correcting an inaccurate docstring.
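As a side note on the "use the logging module instead of print statements" suggestion, the following is a minimal, hypothetical sketch of the pattern the review points at (the helper name and message are illustrative, not code from this PR):

    import logging

    logger = logging.getLogger(__name__)

    def validate_batch(batch):
        """Hypothetical helper illustrating logging-based error reporting."""
        if not batch:
            # Instead of: print("Received an empty batch")
            logger.warning("Received an empty batch; skipping this update step.")
            return False
        return True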
This pull request has been automatically marked as stale because it has not had recent activity. You can always ask for help on our discussion forum or Ray's public slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed.
Ran a manual merge on BUILD, since it was updated elsewhere and this PR un-comments the relevant test. Checked everything again and found it working as expected.
This will be very helpful. A working PettingZoo shared value function example (MAPPO on the new RLlib API) is exactly what many of us need as a guide for implementing centralized critics.
@pseudo-rnd-thoughts Just checking in - is there anything further that I need to do to prepare this PR for a merge? It's had the "go" label for a while.
My primary worry, @MatthewCWeston, is the number of files, as I believe that single-file examples are best. For me, I don't want to have to read 5+ different files to understand an example implementation. (I know that not all RLlib examples follow this philosophy, though it is something I'm interested in changing in the future.)
@pseudo-rnd-thoughts Looking through the files: to summarize, much of it is a combination of following the precedent of other algorithm implementations in terms of file structure and accounting for the new API stack. Some of it, however, is the fact that the old example doesn't appear to actually employ a centralized critic in the sense used in reinforcement learning research - rather, it augments each agent's independent critic with the other agents' observations. Given that the other two examples in the PettingZoo 'series' are about sharing weights versus independent learning, and that the title of the placeholder was pettingzoo_shared_value_function.py, a genuinely shared critic seemed to be the intended design. If you like, I could merge the torch/general versions of some classes into the same files, and aggregate the models likewise. This matches the style of some other examples, and would cut down on the file count.
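For readers following along, here is a minimal, framework-free sketch of the distinction discussed above (plain PyTorch with illustrative names, not code from this PR): observation-augmented independent critics keep one value network per agent, while a shared critic maps the joint observation to one value per agent with a single set of weights.

    import torch
    import torch.nn as nn

    n_agents, obs_dim = 3, 8

    # Old-example style: each agent keeps its OWN critic, merely fed with the
    # other agents' observations as extra input features.
    augmented_critics = [nn.Linear(n_agents * obs_dim, 1) for _ in range(n_agents)]

    # Shared-critic (MAPPO-style): a SINGLE network maps the joint observation
    # to one value estimate per agent; every agent's loss updates the same weights.
    shared_critic = nn.Linear(n_agents * obs_dim, n_agents)

    joint_obs = torch.cat([torch.randn(4, obs_dim) for _ in range(n_agents)], dim=-1)
    v_augmented = [critic(joint_obs) for critic in augmented_critics]  # n_agents tensors of shape [4, 1]
    v_shared = shared_critic(joint_obs)                                # one tensor of shape [4, n_agents]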
That said, from my perspective, aligning the structure with other algorithm implementations can give readers, especially beginners like myself, a clearer mental model of what each component is responsible for. In contrast, the existing
Issue Summary
There is a minor bug in the current implementation of MAPPOGAEConnector: it does not correctly handle scenarios where other agents use stateful encoders (e.g., LSTMs). This leads to incorrect tensor shapes and indexing during advantage and value computation.
Reproduction:
The issue can be reproduced by enabling an LSTM-based model in the PettingZoo MAPPO example.
Replace

    specs = {p: RLModuleSpec() for p in policies}

with:

    from ray.rllib.core.rl_module.default_model_config import DefaultModelConfig

    model_config = DefaultModelConfig(
        use_lstm=True,
    )
    specs = {p: RLModuleSpec(model_config=model_config) for p in policies}

Proposed Fix:
The root cause is inconsistent tensor construction and indexing in rllib/examples/algorithms/mappo/connectors/general_advantage_estimation.py when aggregating per-agent observations, value predictions, and postprocessed targets.

Fix critic observation concatenation
Replace (ray/rllib/examples/algorithms/mappo/connectors/general_advantage_estimation.py, lines 66 to 68 in e7b02c4)

    critic_batch[Columns.OBS] = torch.cat(
        [batch[k][Columns.OBS] for k in obs_mids], dim=1
    )

with:

    critic_batch[Columns.OBS] = torch.cat(
        [batch[k][Columns.OBS] for k in obs_mids], dim=-1
    )

Fix critic-function prediction unpacking
Replace

    vf_preds = {mid: vf_preds[:, i] for i, mid in enumerate(obs_mids)}

with:

    vf_preds = {mid: vf_preds[..., i] for i, mid in enumerate(obs_mids)}

Fix stacking of value targets and advantages
Replace (ray/rllib/examples/algorithms/mappo/connectors/general_advantage_estimation.py, lines 138 to 143 in e7b02c4)

    critic_batch[Postprocessing.VALUE_TARGETS] = np.stack(
        [batch[mid][Postprocessing.VALUE_TARGETS] for mid in obs_mids], axis=1
    )
    critic_batch[Postprocessing.ADVANTAGES] = np.stack(
        [batch[mid][Postprocessing.ADVANTAGES] for mid in obs_mids], axis=1
    )

with:

    critic_batch[Postprocessing.VALUE_TARGETS] = np.stack(
        [batch[mid][Postprocessing.VALUE_TARGETS] for mid in obs_mids], axis=-1,
    )
    critic_batch[Postprocessing.ADVANTAGES] = np.stack(
        [batch[mid][Postprocessing.ADVANTAGES] for mid in obs_mids], axis=-1,
    )

These changes make MAPPOGAEConnector compatible with stateful (e.g., LSTM-based) encoders for the other agents without affecting the stateless case, and restore correct multi-agent value and advantage computation.
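To make the shape issue concrete, here is a minimal, standalone sketch (plain PyTorch, not code from this PR), under the assumption that enabling the LSTM turns per-agent OBS tensors from [batch, obs_dim] into [batch, time, obs_dim]:

    import torch

    num_agents, B, T, obs_dim = 3, 4, 5, 8
    per_agent_obs = [torch.randn(B, T, obs_dim) for _ in range(num_agents)]

    # dim=1 concatenates along the *time* axis once a time dimension exists:
    wrong = torch.cat(per_agent_obs, dim=1)
    print(wrong.shape)   # torch.Size([4, 15, 8]) -> the time axis grew, not the features

    # dim=-1 always targets the trailing feature axis, so it behaves the same
    # for [B, obs_dim] and [B, T, obs_dim] inputs:
    right = torch.cat(per_agent_obs, dim=-1)
    print(right.shape)   # torch.Size([4, 5, 24]) -> per-agent features stacked

The same reasoning applies to the [:, i] versus [..., i] indexing and the axis=1 versus axis=-1 stacking above: the trailing axis remains the per-agent axis regardless of whether a time dimension is present.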
@JiangpengLI86 Happy to add support for stateful modules. Have you tested the changes, and can you add an lstm option flag to the example so I can run it?
Sure thing. I tried it on my side, and I slightly modified your original example script so you can test it out as well:

    from pettingzoo.sisl import waterworld_v4

    from ray.rllib.core.rl_module.default_model_config import DefaultModelConfig
    from ray.rllib.core.rl_module.multi_rl_module import MultiRLModuleSpec
    from ray.rllib.core.rl_module.rl_module import RLModuleSpec
    from ray.rllib.env.wrappers.pettingzoo_env import ParallelPettingZooEnv
    from ray.rllib.examples.algorithms.mappo.connectors.general_advantage_estimation import (
        SHARED_CRITIC_ID,
    )
    from ray.rllib.examples.algorithms.mappo.mappo import MAPPOConfig
    from ray.rllib.examples.algorithms.mappo.torch.shared_critic_torch_rl_module import (
        SharedCriticTorchRLModule,
    )
    from ray.rllib.utils.test_utils import (
        add_rllib_example_script_args,
        run_rllib_example_script_experiment,
    )
    from ray.tune.registry import register_env

    parser = add_rllib_example_script_args(
        default_iters=200,
        default_timesteps=1000000,
        default_reward=0.0,
    )
    parser.add_argument(
        "--use-lstm",
        action="store_true",
        help="Whether to use LSTM encoders for the agents' policies.",
    )

    if __name__ == "__main__":
        args = parser.parse_args()

        assert args.num_agents > 0, "Must set --num-agents > 0 when running this script!"

        # Here, we use the "Parallel" PettingZoo environment type.
        # This allows MAPPO's global observations to be constructed more neatly.
        def get_env(_):
            return ParallelPettingZooEnv(waterworld_v4.parallel_env())

        register_env("env", get_env)

        # Policies are called just like the agents (exact 1:1 mapping).
        policies = [f"pursuer_{i}" for i in range(args.num_agents)]

        # An agent for each of our policies, and a single shared critic.
        env_instantiated = get_env({})  # necessary for non-agent modules

        # Use LSTM encoders for the agents' policies if requested.
        if args.use_lstm:
            model_config = DefaultModelConfig(
                use_lstm=True,
            )
        else:
            model_config = {}

        specs = {p: RLModuleSpec(model_config=model_config) for p in policies}
        specs[SHARED_CRITIC_ID] = RLModuleSpec(
            module_class=SharedCriticTorchRLModule,
            observation_space=env_instantiated.observation_space[policies[0]],
            action_space=env_instantiated.action_space[policies[0]],
            learner_only=True,  # Only build on the learner.
            model_config={
                "observation_spaces": env_instantiated.observation_space,
            },
        )

        base_config = (
            MAPPOConfig()
            .environment("env")
            .multi_agent(
                policies=policies + [SHARED_CRITIC_ID],
                # Exact 1:1 mapping from AgentID to ModuleID.
                policy_mapping_fn=(lambda aid, *args, **kwargs: aid),
            )
            .rl_module(
                rl_module_spec=MultiRLModuleSpec(
                    rl_module_specs=specs,
                ),
            )
        )

        run_rllib_example_script_experiment(base_config, args)

The modification to the original connector is minor and easy to understand. But enabling a stateful centralized critic seems more complicated and often unnecessary, since it can observe the whole environment and should therefore be Markovian.
@JiangpengLI86 Added.
(Force-pushed from cbaae89 to e7b02c4.)
Come to think of it, given the degree to which MAPPO has been developed here, it could serve as a general-use algorithm, akin to APPO. It's a pretty ubiquitous MARL benchmark algorithm, after all. I could then solve the issue with file count by porting the algorithm implementation over from rllib/examples to rllib/algorithms. Would this be a desirable change?
Reviewed diff excerpt:

    return MAPPOConfig()
    ...
    class MAPPOConfig(AlgorithmConfig):  # AlgorithmConfig -> PPOConfig -> MAPPO
Is it possible to inherit MAPPOConfig from PPOConfig? I see you have made a comment to that effect but are not inheriting from PPOConfig. There is significant overlap between the MAPPO and PPO implementations (60%-70% according to Copilot).
I tried it out, but unfortunately, in practice, every method implemented in MAPPOConfig needs to sidestep what its PPOConfig equivalent does, so it doesn't end up saving any space. The natural inheritance hierarchy would go the other way around (which can be done, if we want to add MAPPO's implementation to rllib/algorithms rather than rllib/examples).
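A minimal, hypothetical sketch of the problem described here (the class and method names are illustrative, not the actual RLlib or PR API):

    class PPOConfigSketch:
        def build_learner_connector(self):
            # PPO wires up its per-module GAE connector here.
            return ["ppo_gae_connector"]

    class MAPPOConfigSketch(PPOConfigSketch):
        def build_learner_connector(self):
            # MAPPO must *replace* PPO's per-agent GAE with the shared-critic
            # GAE connector, so calling super() would add pieces that then have
            # to be removed again; the override re-implements the method instead.
            return ["mappo_shared_critic_gae_connector"]

Under that constraint, inheriting mainly buys the type relationship rather than code reuse, which matches the comment above.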
@HassamSheikh Do the changes made above seem about right? Let me know if there are any other changes I should make.
Why are these changes needed?
At present, pettingzoo_shared_value_function.py is a placeholder file that returns an error message when run. This PR implements it in a way that permits direct comparison between itself, pettingzoo_parameter_sharing.py, and pettingzoo_independent_learning.py.

My motivation for this is that, when I first saw the placeholder file name, I hoped to use it as a reference when working on my current project, which requires a shared critic. Having completed that part of my project, I cleaned up and generalized the code so that I could share it with other RLlib developers.
In brief, this PR implements the placeholder example file by including an implementation of MAPPO in the examples section, in a similar vein to how other examples implement VPG, AlphaStar-style league-based training, and MobileNet. This provides a working example of how a shared value function, used in a wide variety of multi-agent reinforcement learning papers, can be implemented in RLlib using the new API stack.
Checks

- I've signed off every commit (git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I've added a new method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.

Note
Introduces a Torch MAPPO with a shared critic and implements the PettingZoo Waterworld shared-value-function example, activating its CI test.
- Implements examples/multi_agent/pettingzoo_shared_value_function.py using a shared critic (MAPPO) for PettingZoo Waterworld; wires MultiRLModuleSpec with per-agent policies and a shared_critic.
- Updates examples/multi_agent/shared_encoder_cartpole.py (expected reward text).
- Adds examples/algorithms/mappo/: mappo.py + MAPPOConfig (training params, defaults, validators); mappo_learner.py base learner (entropy/KL schedulers, GAE connector hookup); torch/mappo_torch_learner.py (policy loss, critic loss, KL/entropy metrics); default_mappo_rl_module.py + torch/default_mappo_torch_rl_module.py; shared_critic_rl_module.py + torch/shared_critic_torch_rl_module.py with shared_critic_catalog.py; mappo_catalog.py; GAE connector: connectors/general_advantage_estimation.py.
- Enables the py_test for examples/multi_agent/pettingzoo_shared_value_function in rllib/BUILD.bazel.

Written by Cursor Bugbot for commit 2ec32d4.