[RLlib] Fix failing env step in `MultiAgentEnvRunner`. #55567

kamil-kaczmarek · 2025-08-13T06:01:04Z

Why are these changes needed?

Fix the failing release test `learning_tests_multi_agent_cartpole_appo_multi_gpu`.

Related issue number

Checks

I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've added any new APIs to the API Reference. For example, if I added a
  method in Tune, I've added it in doc/source/tune/api/ under the
  corresponding .rst file.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

Signed-off-by: Kamil Kaczmarek <[email protected]>

gemini-code-assist · 2025-08-13T06:01:08Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

simonsays1980

Great PR @kamil-kaczmarek! Thanks a lot for working through this complex setup. I left some comments and would like to request another debug iteration to better understand why this happens now and where the intended process gets interrupted.

rllib/env/env_runner.py

rllib/env/multi_agent_env_runner.py

simonsays1980 · 2025-08-13T09:22:39Z

rllib/env/multi_agent_env_runner.py

@@ -361,6 +363,9 @@ def _sample(
            # Try stepping the environment.
            results = self._try_env_step(actions_for_env)
            if results == ENV_STEP_FAILURE:
+                logging.warning(


Nice message. Can we unify all message patterns?

I didn't notice a single pattern established across RLlib. If you like this direction, I can unify the messages in this PR first, then expand and unify component by component. WDYT?
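To make the "unified message pattern" idea concrete, here is a minimal sketch of what a shared helper could look like. Everything in it is an assumption for illustration: `format_warning`, `step_or_warn`, the logger name, the message wording, and the stand-in `ENV_STEP_FAILURE` sentinel are hypothetical and are not RLlib's actual code.

```python
import logging

logger = logging.getLogger("env_runner_sketch")

# Stand-in for the sentinel that the diff above compares against (assumption:
# the real constant lives in RLlib; this object only mimics its role).
ENV_STEP_FAILURE = object()


def format_warning(component: str, event: str, action: str) -> str:
    """Builds a message of the form '[component] event; action.'"""
    return f"[{component}] {event}; {action}."


def step_or_warn(try_env_step, actions):
    """Calls `try_env_step` and logs a uniform warning if it reports failure.

    Returns the step results, or None if the step failed.
    """
    results = try_env_step(actions)
    if results is ENV_STEP_FAILURE:
        # Hypothetical wording; only the shared format matters here.
        logger.warning(
            format_warning(
                "MultiAgentEnvRunner",
                "env step failed",
                "skipping this step",
            )
        )
        return None
    return results
```

With one formatter per message kind, each component would only supply its name, the event, and the follow-up action, so the wording stays consistent as more components adopt it.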

simonsays1980 · 2025-08-13T09:27:38Z

rllib/env/multi_agent_env_runner.py

@@ -346,7 +348,7 @@ def _sample(
                        metrics_prefix_key=(MODULE_TO_ENV_CONNECTOR,),
                    )
                # In case all environments had been terminated `to_module` will be
-                # empty and no actions are needed b/c we reset all environemnts.
+                # empty and no actions are needed b/c we reset all environments.


I wonder why `to_module` is now `None` here. Can we run another round of debugging to see where this happens? Then check the autoreset and the connector run (I know this is complex).

What should happen is: the env resets automatically; the initial obs goes through the to_module connector pipeline and produces `to_module`, which can in turn be passed through the module and the to_env pipeline to produce an action.
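The expected flow described above can be sketched as a toy function. This is only an illustration under stated assumptions: `act_after_autoreset` and its three callable parameters are hypothetical stand-ins for the connector pipelines and the module, not RLlib APIs, and returning no actions for an empty `to_module` simply mirrors the code comment in the diff.

```python
from typing import Any, Callable, Dict

Batch = Dict[str, Any]


def act_after_autoreset(
    initial_obs: Batch,
    env_to_module: Callable[[Batch], Batch],
    module_forward: Callable[[Batch], Batch],
    module_to_env: Callable[[Batch], Batch],
) -> Batch:
    """Turns the initial observations of a reset env into actions.

    All three callables are hypothetical stand-ins for the real pipelines.
    """
    # Step 1: initial obs -> env-to-module connector pipeline -> `to_module`.
    to_module = env_to_module(initial_obs)
    if not to_module:
        # Nothing to feed the module, so no actions are produced.
        return {}
    # Step 2: `to_module` -> module forward pass.
    module_out = module_forward(to_module)
    # Step 3: module output -> module-to-env connector pipeline -> actions.
    return module_to_env(module_out)


# Toy usage: every agent that has an observation gets action 0.
actions = act_after_autoreset(
    {"agent_0": [0.1], "agent_1": [0.2]},
    env_to_module=lambda obs: {"obs": obs},
    module_forward=lambda batch: {aid: 0 for aid in batch["obs"]},
    module_to_env=lambda out: out,
)
```

If the first step unexpectedly yields nothing after an autoreset, the later steps never run, which is the situation the debugging round is meant to locate.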

Let me investigate this more. This first happened last Thursday in the release tests. Will look into the code diff.


github-actions · 2025-09-04T00:35:17Z

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.


kamil-kaczmarek · 2025-09-04T06:58:41Z

unstale

kamil-kaczmarek added 2 commits August 12, 2025 22:54

handle failing env step gracefully

20993f3

Signed-off-by: Kamil Kaczmarek <[email protected]>

lint

ca8d68d

Signed-off-by: Kamil Kaczmarek <[email protected]>

kamil-kaczmarek requested a review from simonsays1980 August 13, 2025 06:01

Merge branch 'master' into kk/fix-failing-env-step

d08aaed

kamil-kaczmarek self-assigned this Aug 13, 2025

kamil-kaczmarek marked this pull request as ready for review August 13, 2025 06:01

kamil-kaczmarek requested a review from a team as a code owner August 13, 2025 06:01

simonsays1980 requested changes Aug 13, 2025


sven1977 changed the title ~~Fix failing env step in MultiAgentEnvRunner~~ [RLlib] Fix failing env step in MultiAgentEnvRunner. Aug 13, 2025

kamil-kaczmarek and others added 3 commits August 14, 2025 00:43

Fix bool assignment for cases where terminated and truncated are both…

11dd9a3

… true after env.step(). Signed-off-by: Kamil Kaczmarek <[email protected]>

typos

f77c13d

Signed-off-by: Kamil Kaczmarek <[email protected]>

Merge branch 'master' into kk/fix-failing-env-step

4b9b8df

kamil-kaczmarek requested a review from simonsays1980 August 14, 2025 08:02

ray-gardener bot added the rllib (RLlib related issues) and release-test (release test) labels Aug 14, 2025

kamil-kaczmarek added 2 commits August 15, 2025 11:51

Merge branch 'master' into kk/fix-failing-env-step

c57b2f7

Merge branch 'master' into kk/fix-failing-env-step

b2ce4b0

github-actions bot added the stale label (The issue is stale. It will be closed within 7 days unless there are further conversation) Sep 4, 2025

Merge branch 'master' into kk/fix-failing-env-step

ffc9f90

Signed-off-by: Kamil Kaczmarek <[email protected]>

github-actions bot added the unstale label (A PR that has been marked unstale. It will not get marked stale again if this label is on it.) and removed the stale label Sep 4, 2025
