Add OpenSpiel interface #87

Closed
jhtschultz wants to merge 42 commits into google-deepmind:master from jhtschultz:open_spiel

Conversation

@jhtschultz
Member

This pull request is a follow-up to #80 and adds the ability to run Acme's DQN on OpenSpiel games. All changes and additions are contained in the acme/open_spiel directory. The example run_dqn.py runs Acme's DQN on OpenSpiel's tic-tac-toe. It's set up to print out a full playthrough (policies included) every 1000 episodes, making it easy to verify that the agent is in fact learning a good policy. The agents start playing sensibly after roughly 10k episodes, and by about 50k episodes pretty much every game in which no exploratory action is taken ends in a draw.

Though I've endeavored to keep the code clean and production-ready, this PR is only intended to be a starting point. There are a number of TODOs: some are only meant to call attention to important considerations, while others are tasks that seemed better addressed after there has been a chance for feedback and discussion.

Perhaps the most important design feature, and the one that stands to improve most from further thought, is the environment wrapper. In an effort to keep as much existing structure intact as possible, the wrapper is built around OpenSpiel's python rl_environment, which notably is not a dm_env. An alternative approach is to wrap the OpenSpiel game directly as a dm_env. However, dm_env doesn't appear to have been explicitly designed with multiplayer games in mind. Perhaps someone who has worked on dm_env, or the related (and very useful) tree repo, might have some insight into how best to do this.
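For discussion, here is a minimal sketch of that alternative: wrapping OpenSpiel's rl_environment directly behind the dm_env interface, exposing only the current player's view. The class name and conversion details are illustrative only (not the PR's wrapper), and it deliberately glosses over the multiplayer question raised above.

```python
# Illustrative sketch only (not the PR's wrapper): adapt OpenSpiel's
# rl_environment to the dm_env interface for a single, current-player view.
import dm_env
import numpy as np
from dm_env import specs
from open_spiel.python import rl_environment


class OpenSpielDmEnvAdapter(dm_env.Environment):
  """Exposes an OpenSpiel rl_environment through the dm_env API."""

  def __init__(self, game_name: str = "tic_tac_toe"):
    self._env = rl_environment.Environment(game_name)

  def reset(self) -> dm_env.TimeStep:
    return dm_env.restart(self._observation(self._env.reset()))

  def step(self, action: int) -> dm_env.TimeStep:
    time_step = self._env.step([action])
    observation = self._observation(time_step)
    # OpenSpiel rewards are per-player lists; a real multiplayer wrapper
    # would need to route the right entry to the right agent.
    reward = time_step.rewards[0] if time_step.rewards else 0.0
    if time_step.last():
      return dm_env.termination(reward=reward, observation=observation)
    return dm_env.transition(reward=reward, observation=observation)

  def _observation(self, time_step) -> np.ndarray:
    # current_player is a negative sentinel at terminal states; clamp to 0.
    player = max(time_step.observations["current_player"], 0)
    return np.asarray(
        time_step.observations["info_state"][player], dtype=np.float32)

  def observation_spec(self) -> specs.Array:
    size = self._env.observation_spec()["info_state"][0]
    return specs.Array(shape=(size,), dtype=np.float32, name="info_state")

  def action_spec(self) -> specs.DiscreteArray:
    return specs.DiscreteArray(
        num_values=self._env.action_spec()["num_actions"], name="action")
```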

I look forward to feedback and I’ll be happy to implement any suggested changes. You’ll notice that very few modifications to the DQN agent/learner were necessary. With this interface it should be straightforward to adapt other Acme algorithms to run on OpenSpiel games :-)

@google-cla bot added the cla: yes label on Nov 17, 2020
@jhtschultz
Member Author

And to import OpenSpiel, add these two lines to Acme's virtual environment:

export PYTHONPATH=$PYTHONPATH:/<path_to_open_spiel>
export PYTHONPATH=$PYTHONPATH:/<path_to_open_spiel>/build/python

with tf.GradientTape() as tape:
  # Evaluate our networks.
  q_tm1 = self._network(o_tm1)
  q_t_value = self._target_network(o_t)
Collaborator

Is this the main change for DQN in this case? I.e., I wonder (as I think @fastturtle was alluding to) whether, if the legal_actions mask is moved into the observations directly, this logic can then be moved into the policy (perhaps via a wrapper around a "raw q-function network"). This would then allow a broader array of potential algorithms to be applied in this setting.
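For concreteness, a rough sketch of the "Q-network wrapper" idea (class and argument names are hypothetical, not anything in this PR): the legal-actions mask travels with the observation and the wrapper pushes illegal actions' Q-values to a large negative value, so the greedy action is always legal. Uniform exploration would still need a masked epsilon-greedy head, which comes up further down the thread.

```python
# Hypothetical sketch of a wrapper around a "raw" Q-network that applies the
# legal-actions mask carried alongside the observation.
import sonnet as snt
import tensorflow as tf


class MaskedQNetwork(snt.Module):
  """Wraps a raw Q-network and masks out illegal actions."""

  def __init__(self, q_network: snt.Module, name: str = "masked_q_network"):
    super().__init__(name=name)
    self._q_network = q_network

  def __call__(self, observation: tf.Tensor,
               legal_actions_mask: tf.Tensor) -> tf.Tensor:
    q_values = self._q_network(observation)
    # Illegal actions get a large but finite negative value, so an argmax
    # never picks them and no infs leak into downstream losses.
    return tf.where(legal_actions_mask > 0., q_values,
                    tf.constant(-1e9, dtype=q_values.dtype))
```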

@mwhoffman
Collaborator

Take this all with a grain of salt, because I'm thinking out loud a little bit at the moment, but:

I don't know how much of a pain it would be to retroactively mark some of these files as copies (i.e. git cp). I'm thinking mostly with respect to the learner, but as I think this might be painful, could you highlight some of the changes made here just to center the discussion?

But from looking I think the main changes in the agent/actor/learner (please correct me if I'm wrong!) revolve around the legal actions. I've noted this above, and I wonder if we can push these directly as observations and let the network handle the legal actions. I think this will allow the learner to remain unchanged. Ideally the same could be done for actor/agent (putting aside the changes to epsilon greedy that are probably needed). But if this only really helps the learner that'd be great.

Ideally we could have something where we have an openspiel-specific Environment Loop, wrapper, and potentially a "Q-network wrapper" but the agent itself isn't changed (and we can peel back as much of that as is necessary if/when this ideal isn't possible 😄).

@jhtschultz
Member Author

Ahh I see, yeah moving the legal actions mask into the observation and letting the network handle everything sounds like the right approach. Should streamline things quite a bit.

Tried to mark the new files as copies but couldn't get the git diff to work; it kept marking everything as new additions. Either way, I think you already identified the core issue. I'll implement those changes with the goal of confining everything to the network, loop, and wrapper. Thanks for the quick feedback!

@jhtschultz
Member Author

Pushed a big update that's pretty close to the ideal @mwhoffman described. Moving legal actions into the observation really was the way to go; it simplified everything dramatically. All that's needed now to run the OpenSpiel example is the wrapper, environment loop, custom networks, and the slightest of modifications to the DQN agent.

The issue with the DQN agent is the policy network. Currently the DQN agent creates the policy network by appending trfl's epsilon_greedy() layer to the learning network. As far as I can tell, no matter what our custom network outputs, this layer will select illegal actions. My workaround was to add the policy network as an optional argument. It offers more flexibility while not changing the default behavior of the DQN agent.
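A minimal sketch of what such a custom exploration head could look like (hypothetical names and shapes, not the PR's actual networks, which read the legal-actions mask out of the observation structure): with probability epsilon it samples uniformly over legal actions only, otherwise it takes the argmax of the masked Q-values.

```python
# Hypothetical masked epsilon-greedy head.
import sonnet as snt
import tensorflow as tf
import tensorflow_probability as tfp


class MaskedEpsilonGreedy(snt.Module):
  """Epsilon-greedy action selection restricted to legal actions."""

  def __init__(self, epsilon: tf.Variable, name: str = "masked_eps_greedy"):
    super().__init__(name=name)
    self._epsilon = epsilon

  def __call__(self, q_values: tf.Tensor,
               legal_actions_mask: tf.Tensor) -> tf.Tensor:
    # Greedy action over Q-values with illegal entries pushed way down.
    masked_q = tf.where(legal_actions_mask > 0., q_values,
                        tf.constant(-1e9, dtype=q_values.dtype))
    greedy = tf.argmax(masked_q, axis=-1, output_type=tf.int32)
    # Uniform sample over legal actions only (log(0) = -inf => probability 0).
    uniform_legal = tfp.distributions.Categorical(
        logits=tf.math.log(legal_actions_mask)).sample()
    explore = tf.random.uniform(tf.shape(greedy)) < self._epsilon
    return tf.where(explore, uniform_legal, greedy)
```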

Now that OpenSpiel’s footprint is much smaller, I moved it out of its own directory and placed the files in the appropriate folders. The wrapper is < 150 lines, env loop < 250, and the networks are pretty simple, so this update gets us a lot closer to the finish line.

Collaborator

@fastturtle left a comment

Thanks for the PR, left some comments :)

discount=timestep.discount[player],
step_type=timestep.step_type)

# TODO Remove? Currently used for debugging.
Collaborator

I'm not very familiar with TODOs in OS code, but should we have an issue to track them?

Member Author

I'll clear out all the TODOs before it gets merged, there aren't many. Left these here just to make sure they got addressed.

step_type=timestep.step_type)

# TODO Remove? Currently used for debugging.
def _print_policy(self, timestep: dm_env.TimeStep, player: int):
Collaborator

Is this generally useful, or was it while you were debugging a particular issue?

Member Author

This is here so that simply by running the environment loop one can easily verify the agents are learning correctly.

I have mixed feelings as to whether it should make the final cut. It's nice for the user to be able to run an example and see exactly what's going on without any additional work on their part. On the other hand, it's of questionable usefulness in larger environments and makes the environment loop slightly less elegant. If anyone has a preference I'm happy to go with that.

return outputs


# TODO Function to update epsilon
Collaborator

What does this TODO refer to?

Member Author

Ah, good question, I meant to ask: is there currently an easy way to decay DQN's epsilon parameter? There doesn't appear to be; it seems fixed once the policy network is created. This is an important feature for some use cases, and since I was building an epsilon-greedy Sonnet module anyway, it seemed like something we'd want to include.

Collaborator

AFAIK there isn't a standard way of doing parameter decay. I'm guessing you've already thought of this, but you'll have to do it using your tf.Variable approach. I'm fine keeping the TODO, but could you flesh it out a bit (not too much) so it has more context?
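To make the TODO's context concrete, a small sketch of the tf.Variable approach mentioned here (names and schedule are illustrative): the exploration module holds epsilon in a non-trainable variable, and the training loop assigns a decayed value to it.

```python
# Illustrative epsilon decay via a shared tf.Variable; any policy module that
# reads this variable in its __call__ picks up the new value automatically.
import tensorflow as tf

epsilon = tf.Variable(1.0, trainable=False, name="epsilon")

def decay_epsilon(step: int,
                  initial: float = 1.0,
                  final: float = 0.05,
                  decay_steps: int = 50_000) -> None:
  """Linearly anneals epsilon towards `final` over `decay_steps` steps."""
  fraction = min(float(step) / decay_steps, 1.0)
  epsilon.assign(initial + fraction * (final - initial))

# e.g. call decay_epsilon(episode) once per episode in the environment loop.
```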

Member Author

Fixed.

@fastturtle
Collaborator

Also, there's an issue with pytest failing; it looks like it isn't installing OpenSpiel.

@jhtschultz
Member Author

@fastturtle thanks for the comments! Regarding pytest failing, OpenSpiel does not currently have a pip package for direct installation (though it's in the works). Given that, any advice on how to handle this? And tagging @lanctot, who knows all things about installing OpenSpiel :)

@lanctot

lanctot commented Nov 25, 2020

@fastturtle @jhtschultz I am planning for an OpenSpiel release in a few weeks and will have it on PyPI, but only as a source distribution for now (which means a 1-2 minute install, and requires that the machines have CMake, clang++, and python3-dev). Do you think that would solve it?

Edit: I hope to have a bdist wheel at some point next year, but it requires quite a bit of planning and preparation to get right.

@jhtschultz
Member Author

Hi! Wanted to check in on the status of this PR.

I saw commit 996c63d followed the approach taken here and added the ability to provide DQN's policy network directly. That's great because now this PR requires zero changes to Acme’s DQN.

Commit 996c63d also added NetworkWithMaskedEpsilonGreedy, which uses trfl.epsilon_greedy(). As documented in google-deepmind/trfl#27 and google-deepmind/trfl#28, there is a bug in trfl's action masking that currently renders this network unusable. Once that's fixed, I could refactor the networks used in this PR to use NetworkWithMaskedEpsilonGreedy instead.
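For reference, this is roughly how the trfl-based head is meant to be used once the masking fix lands (a sketch assuming trfl.epsilon_greedy's legal_actions_mask argument behaves as documented; the exact behavior depends on the fix to trfl#27/#28):

```python
# Sketch of masked action selection via trfl, assuming the legal_actions_mask
# handling works as documented once trfl#27/#28 are fixed.
import tensorflow as tf
import trfl

q_values = tf.constant([[1.0, 5.0, 2.0]])            # batch of one state
legal_actions_mask = tf.constant([[1.0, 0.0, 1.0]])  # action 1 is illegal

policy = trfl.epsilon_greedy(
    q_values, 0.1, legal_actions_mask=legal_actions_mask)
action = policy.sample()  # with correct masking, this should never be 1
```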

I’ve been implementing NFSP, a deep-RL algorithm for imperfect information games, using this Acme-OpenSpiel interface and the interface works quite well. If we can identify any remaining obstacles to importing this PR I'll be happy to work on them. Thanks!

@lanctot

lanctot commented Jan 7, 2021

> @fastturtle thanks for the comments! Regarding pytest failing, OpenSpiel does not currently have a pip package for direct installation (though it's in the works). Given that, any advice on how to handle this? And tagging @lanctot, who knows all things about installing OpenSpiel :)

I spoke to @fastturtle yesterday. OpenSpiel does have a pip package now, but it still requires a build from source, which requires specific things from your environment and can take a long time. So, until we have a binary wheel (at least a few months away), we've agreed to just disable tests that require/depend on OpenSpiel.

Please keep the code for these tests around, though. When it's merged, I will enable this test internally to ensure that it passes on our side. Thanks!

@jhtschultz force-pushed the open_spiel branch 3 times, most recently from 9e67acd to ec2e247 on January 9, 2021
@jhtschultz
Member Author

@fastturtle thanks for looking it over! Most of the changes are already in. I still need to do some further digging to fully explain the -np.inf threshold causing the priority error. Good call on your part to highlight that. I'll address the comments individually and let you know when everything is ready to go on my end.

@fastturtle
Collaborator

@jhtschultz sounds good, I'll keep an eye out for your comments.

@jhtschultz
Member Author

@fastturtle Done! Not sure why some pytests get stuck. Recent commits, including the previous one ff9b7ad, have been consistently passing so I think everything is good to go. Let me know if there's anything else I can do.

@jhtschultz
Member Author

We exposed _use_observation as a property in OpenSpiel's rl_environment, and the wrapper here has been updated to use the property.

Collaborator

@fastturtle left a comment

Looks really good, some last comments from me 👍

@jhtschultz
Member Author

Thanks! Good catches, fixed 👍

Collaborator

@fastturtle left a comment

This looks good to me, and has my approval. @mwhoffman PTAL.

# pytype: enable=import-error


class OLT(NamedTuple):
Collaborator

It might make more sense to define the OLT class outside this wrapper (or in the legal_actions policy definitions) and import here. That way we can skip one level of trying and potentially use the legal actions policies outside of openspiel envs. But we can play around with that after getting this change in...
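For reference, a minimal self-contained sketch of the OLT container referred to here, holding the observation, legal-actions mask, and terminal flag (field types are illustrative):

```python
from typing import NamedTuple

import numpy as np


class OLT(NamedTuple):
  """Observation, legal-actions mask, and terminal flag for one player."""
  observation: np.ndarray
  legal_actions: np.ndarray
  terminal: np.ndarray
```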

# Run an episode.
while not timestep.last():
  # Generate an action from the agent's policy and step the environment.
  if self._environment.is_turn_based:
Collaborator

Hi, I was thinking about the possibility of making this environment loop more generic by covering multi-agent/multi-team environments that are not necessarily turn-based or fully simultaneous.
Let's consider a multi-agent, turn-based environment (for example a soccer environment like https://github.com/google-research/football/) where each team consists of 11 players, but teams execute actions in turns. To cover OpenSpiel and this use case with one implementation we could:

  • eliminate _environment.is_turn_based
  • replace _environment.get_state.current_player() with a set of current players

That would cover "# TODO Support simultaneous move games." in a generic way. What do you think about that?

@lanctot Feb 2, 2021

Hi @qstanczyk,

Can you elaborate on your motivation for this? I worry that if we make this too general, it gets harder to interface with the specific API this was designed to be used with, and supporting multiple APIs could introduce some awkward bugs for use cases we didn't anticipate.

Also, I don't understand the football env example. If that environment is already turn-based, then it should already fit the OpenSpiel API as well (which, as you say, is turn-based: you just have 11 consecutive actions from the same player before switching the current player to the next one)?

WDYT about a wrapper of the football env as an OpenSpiel python game (living in this repo or in OpenSpiel)? Sounds like they are compatible from your description. Let me know if I'm missing something.

Collaborator

If the turn-based soccer environment were integrated the way you suggest (11 actions for each team), then what use case do you anticipate for "!is_turn_based"? Any multi-agent environment can be implemented by passing N actions, so maybe in that case there is no need to have the "is_turn_based" check around?
My motivation was to try to generalize the API by supporting any is_turn_based / multi-agent mix. I'm fine with not taking this thought into account, but at this point I just wonder whether "is_turn_based" is needed.

@lanctot Feb 2, 2021

> If the turn-based soccer environment were integrated the way you suggest (11 actions for each team), then what use case do you anticipate for "!is_turn_based"?

Wait, is that not how it's implemented? I only assumed that based on what you said. The !is_turn_based is necessary here because of the way OpenSpiel categorizes its dynamics. Some games require actions from all agents (simultaneous games) and apply them all at once, while others are turn-based, so you only get an action for the current player and apply that action alone.

> Any multi-agent environment can be implemented by passing N actions, so maybe in that case there is no need to have the "is_turn_based" check around?

Yes, I agree, but that's not how OpenSpiel implements turn-based games.

> My motivation was to try to generalize the API by supporting any is_turn_based / multi-agent mix.

Ok, yes, I think I understand. I think this would make more sense if we had a use case for this generalization, and I'm not seeing it (yet). So my hesitation is due mainly to this wrapper being designed for OpenSpiel specifically.

OpenSpiel treats these two cases separately because the algorithms built on top of the different types of games depend on these choices. Also, in OpenSpiel a simultaneous game doesn't need to be fully simultaneous at all states, and it's fine for a subset of agents to have only one legal pass-like move, hence my suggestion of wrapping the ones you mention as OpenSpiel simultaneous-move games, because then this interface would work on them. (Would that work, e.g., for the football env?)

Collaborator

Ack, I was just thinking out loud about generalization of the API. We're not looking at hooking up GRF at this point, but I was wondering if this OpenSpiel environment loop could be more generic.

if self._should_update:
  self._actors[player_id].update()
self._observed_first = [False] * len(self._actors)
self._prev_actions = [pyspiel.INVALID_ACTION] * len(self._actors)
Collaborator

Maybe initialize these two once at the beginning of run_episode? That would eliminate code duplication.

if not self._observed_first[player]:
  player_timestep = dm_env.TimeStep(
      observation=timestep.observation[player],
      reward=None,
Collaborator

Doesn't OpenSpiel return a reward in this case? In the generic case I think we should use the reward provided by the environment. Can't we call _get_player_timestep instead (setting dm_env.StepType.FIRST)?


# Make the first observation.
self._send_observation(timestep, self._environment.current_player)

Collaborator

Can't _send_observation be called as the first thing in the loop (which would eliminate duplicated code)?

@mwhoffman
Collaborator

A few minor changes were made (mostly for linter reasons!) and this was submitted in 7fa9b80. So I'll close this down for now, and any further discussion can carry on outside this PR. Thanks for all the hard work!

@mwhoffman closed this Feb 4, 2021

Development

Successfully merging this pull request may close these issues: [Feature] OpenSpiel integration/support

5 participants