Add OpenSpiel interface #87

Closed
jhtschultz wants to merge 42 commits into google-deepmind:master from jhtschultz:open_spiel

Conversation

@jhtschultz
Member

This pull request is a follow-up to #80 and adds the ability to run Acme's DQN on OpenSpiel games. All changes and additions are contained in the acme/open_spiel directory. The example run_dqn.py runs Acme's DQN on OpenSpiel's tic-tac-toe. It's set up to print out a full playthrough (policies included) every 1000 episodes, making it easy to verify that the agent is in fact learning a good policy. The agents start playing sensibly after roughly 10k episodes, and by about 50k episodes pretty much every game in which no exploratory action is taken ends in a draw.

Though I've endeavored to keep the code clean and production-ready, this PR is only intended to be a starting point. There are a number of TODOs: some are only meant to call attention to important considerations, while others are tasks that seemed better addressed after there has been a chance for feedback and discussion.

Perhaps the most important design feature, and the one that stands to improve most from further thought, is the environment wrapper. In an effort to keep as much existing structure intact as possible, the wrapper is built around OpenSpiel's python rl_environment, which notably is not a dm_env. An alternative approach is to wrap the OpenSpiel game directly as a dm_env. However, dm_env doesn't appear to have been explicitly designed with multiplayer games in mind. Perhaps someone who has worked on dm_env, or the related (and very useful) tree repo, might have some insight into how best to do this.
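For discussion, here is a minimal sketch of that alternative: wrapping OpenSpiel's rl_environment directly behind the dm_env interface, exposing only the current player's view. The class name and conversion details are illustrative only (not the PR's wrapper), and it deliberately glosses over the multiplayer question raised above.

```python
# Illustrative sketch only (not the PR's wrapper): adapt OpenSpiel's
# rl_environment to the dm_env interface for a single, current-player view.
import dm_env
import numpy as np
from dm_env import specs
from open_spiel.python import rl_environment


class OpenSpielDmEnvAdapter(dm_env.Environment):
  """Exposes an OpenSpiel rl_environment through the dm_env API."""

  def __init__(self, game_name: str = "tic_tac_toe"):
    self._env = rl_environment.Environment(game_name)

  def reset(self) -> dm_env.TimeStep:
    return dm_env.restart(self._observation(self._env.reset()))

  def step(self, action: int) -> dm_env.TimeStep:
    time_step = self._env.step([action])
    observation = self._observation(time_step)
    # OpenSpiel rewards are per-player lists; a real multiplayer wrapper
    # would need to route the right entry to the right agent.
    reward = time_step.rewards[0] if time_step.rewards else 0.0
    if time_step.last():
      return dm_env.termination(reward=reward, observation=observation)
    return dm_env.transition(reward=reward, observation=observation)

  def _observation(self, time_step) -> np.ndarray:
    # current_player is a negative sentinel at terminal states; clamp to 0.
    player = max(time_step.observations["current_player"], 0)
    return np.asarray(
        time_step.observations["info_state"][player], dtype=np.float32)

  def observation_spec(self) -> specs.Array:
    size = self._env.observation_spec()["info_state"][0]
    return specs.Array(shape=(size,), dtype=np.float32, name="info_state")

  def action_spec(self) -> specs.DiscreteArray:
    return specs.DiscreteArray(
        num_values=self._env.action_spec()["num_actions"], name="action")
```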

I look forward to feedback and I’ll be happy to implement any suggested changes. You’ll notice that very few modifications to the DQN agent/learner were necessary. With this interface it should be straightforward to adapt other Acme algorithms to run on OpenSpiel games :-)

@google-cla bot added the cla: yes label on Nov 17, 2020
@jhtschultz
Member Author

And to import OpenSpiel, add these two lines to Acme's virtual environment:

export PYTHONPATH=$PYTHONPATH:/<path_to_open_spiel>
export PYTHONPATH=$PYTHONPATH:/<path_to_open_spiel>/build/python

with tf.GradientTape() as tape:
  # Evaluate our networks.
  q_tm1 = self._network(o_tm1)
  q_t_value = self._target_network(o_t)
Collaborator

Is this the main change for DQN in this case? I.e., I wonder (as I think @fastturtle was alluding to) whether, if the legal_actions mask is moved into the observations directly, this logic can then be moved into the policy (perhaps via a wrapper around a "raw q-function network"). This would then allow a broader array of potential algorithms to be applied in this setting.
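For concreteness, a rough sketch of the "Q-network wrapper" idea (class and argument names are hypothetical, not anything in this PR): the legal-actions mask travels with the observation and the wrapper pushes illegal actions' Q-values to a large negative value, so the greedy action is always legal. Uniform exploration would still need a masked epsilon-greedy head, which comes up further down the thread.

```python
# Hypothetical sketch of a wrapper around a "raw" Q-network that applies the
# legal-actions mask carried alongside the observation.
import sonnet as snt
import tensorflow as tf


class MaskedQNetwork(snt.Module):
  """Wraps a raw Q-network and masks out illegal actions."""

  def __init__(self, q_network: snt.Module, name: str = "masked_q_network"):
    super().__init__(name=name)
    self._q_network = q_network

  def __call__(self, observation: tf.Tensor,
               legal_actions_mask: tf.Tensor) -> tf.Tensor:
    q_values = self._q_network(observation)
    # Illegal actions get a large but finite negative value, so an argmax
    # never picks them and no infs leak into downstream losses.
    return tf.where(legal_actions_mask > 0., q_values,
                    tf.constant(-1e9, dtype=q_values.dtype))
```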

@mwhoffman
Collaborator

Take this all with a grain of salt, because I'm thinking out loud a little bit at the moment, but:

I don't know how much of a pain it would be to retroactively mark some of these files as copies (i.e. git cp). I'm thinking mostly with respect to the learner, but as I think this might be painful, could you highlight some of the changes made here just to center the discussion?

But from looking I think the main changes in the agent/actor/learner (please correct me if I'm wrong!) revolve around the legal actions. I've noted this above, and I wonder if we can push these directly as observations and let the network handle the legal actions. I think this will allow the learner to remain unchanged. Ideally the same could be done for actor/agent (putting aside the changes to epsilon greedy that are probably needed). But if this only really helps the learner that'd be great.

Ideally we could have something where we have an openspiel-specific Environment Loop, wrapper, and potentially a "Q-network wrapper" but the agent itself isn't changed (and we can peel back as much of that as is necessary if/when this ideal isn't possible 😄).

@jhtschultz
Member Author

Ahh I see, yeah moving the legal actions mask into the observation and letting the network handle everything sounds like the right approach. Should streamline things quite a bit.

Tried to mark the new files as copies but couldn't get the git diff to work; it kept marking everything as new additions. Either way, I think you already identified the core issue. I'll implement those changes with the goal of confining everything to the network, loop, and wrapper. Thanks for the quick feedback!

@jhtschultz
Member Author

Pushed a big update that's pretty close to the ideal @mwhoffman described. Moving legal actions into the observation really was the way to go; it simplified everything dramatically. All that's needed now to run the OpenSpiel example is the wrapper, environment loop, custom networks, and the slightest of modifications to the DQN agent.

The issue with the DQN agent is the policy network. Currently the DQN agent creates the policy network by appending trfl's epsilon_greedy() layer to the learning network. As far as I can tell, no matter what our custom network outputs, this layer will select illegal actions. My workaround was to add the policy network as an optional argument. It offers more flexibility while not changing the default behavior of the DQN agent.
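A minimal sketch of what such a custom exploration head could look like (hypothetical names and shapes, not the PR's actual networks, which read the legal-actions mask out of the observation structure): with probability epsilon it samples uniformly over legal actions only, otherwise it takes the argmax of the masked Q-values.

```python
# Hypothetical masked epsilon-greedy head.
import sonnet as snt
import tensorflow as tf
import tensorflow_probability as tfp


class MaskedEpsilonGreedy(snt.Module):
  """Epsilon-greedy action selection restricted to legal actions."""

  def __init__(self, epsilon: tf.Variable, name: str = "masked_eps_greedy"):
    super().__init__(name=name)
    self._epsilon = epsilon

  def __call__(self, q_values: tf.Tensor,
               legal_actions_mask: tf.Tensor) -> tf.Tensor:
    # Greedy action over Q-values with illegal entries pushed way down.
    masked_q = tf.where(legal_actions_mask > 0., q_values,
                        tf.constant(-1e9, dtype=q_values.dtype))
    greedy = tf.argmax(masked_q, axis=-1, output_type=tf.int32)
    # Uniform sample over legal actions only (log(0) = -inf => probability 0).
    uniform_legal = tfp.distributions.Categorical(
        logits=tf.math.log(legal_actions_mask)).sample()
    explore = tf.random.uniform(tf.shape(greedy)) < self._epsilon
    return tf.where(explore, uniform_legal, greedy)
```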

Now that OpenSpiel’s footprint is much smaller, I moved it out of its own directory and placed the files in the appropriate folders. The wrapper is < 150 lines, env loop < 250, and the networks are pretty simple, so this update gets us a lot closer to the finish line.

Collaborator

@fastturtle left a comment

Thanks for the PR, left some comments :)

discount=timestep.discount[player],
step_type=timestep.step_type)

# TODO Remove? Currently used for debugging.
Collaborator

I'm not very familiar with TODOs in OS code, but should we have an issue to track them?

Member Author

I'll clear out all the TODOs before it gets merged, there aren't many. Left these here just to make sure they got addressed.

step_type=timestep.step_type)

# TODO Remove? Currently used for debugging.
def _print_policy(self, timestep: dm_env.TimeStep, player: int):
Collaborator

Is this generally useful, or was it while you were debugging a particular issue?

Member Author

This is here so that simply by running the environment loop one can easily verify the agents are learning correctly.

I have mixed feelings as to whether it should make the final cut. It's nice for the user to be able to run an example and see exactly what's going on without any additional work on their part. On the other hand, it's of questionable usefulness in larger environments and makes the environment loop slightly less elegant. If anyone has a preference I'm happy to go with that.

return outputs


# TODO Function to update epsilon
Collaborator

What does this TODO refer to?

Member Author

Ah, good question, I meant to ask: is there currently an easy way to decay DQN's epsilon parameter? There doesn't appear to be; it seems fixed once the policy network is created. This is an important feature for some use cases, and since I was building an epsilon-greedy Sonnet module anyway, it seemed like something we'd want to include.

Collaborator

AFAIK there isn't a standard way of doing parameter decay. I'm guessing you've already thought of this, but you'll have to do it using your tf.Variable approach. I'm fine keeping the TODO, but could you flesh it out a bit (not too much) so it has more context?
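To make the TODO's context concrete, a small sketch of the tf.Variable approach mentioned here (names and schedule are illustrative): the exploration module holds epsilon in a non-trainable variable, and the training loop assigns a decayed value to it.

```python
# Illustrative epsilon decay via a shared tf.Variable; any policy module that
# reads this variable in its __call__ picks up the new value automatically.
import tensorflow as tf

epsilon = tf.Variable(1.0, trainable=False, name="epsilon")

def decay_epsilon(step: int,
                  initial: float = 1.0,
                  final: float = 0.05,
                  decay_steps: int = 50_000) -> None:
  """Linearly anneals epsilon towards `final` over `decay_steps` steps."""
  fraction = min(float(step) / decay_steps, 1.0)
  epsilon.assign(initial + fraction * (final - initial))

# e.g. call decay_epsilon(episode) once per episode in the environment loop.
```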

Member Author

Fixed.

@fastturtle
Collaborator

Also, there's an issue with pytest failing; it looks like it isn't installing OpenSpiel.

@jhtschultz
Member Author

@fastturtle thanks for the comments! Regarding pytest failing, OpenSpiel does not currently have a pip package for direct installation (though it's in the works). Given that, any advice on how to handle this? And tagging @lanctot, who knows all things about installing OpenSpiel :)

@lanctot

lanctot commented Nov 25, 2020

@fastturtle @jhtschultz I am planning for an OpenSpiel release in a few weeks and will have it on PyPI, but only as a source distribution for now (which means a 1-2 minute install, and requires that the machines have CMake, clang++, and python3-dev). Do you think that would solve it?

Edit: I hope to have a bdist wheel at some point next year, but it requires quite a bit of planning and preparation to get right.

@jhtschultz
Member Author

Hi! Wanted to check in on the status of this PR.

I saw commit 996c63d followed the approach taken here and added the ability to provide DQN's policy network directly. That's great because now this PR requires zero changes to Acme’s DQN.

Commit 996c63d also added NetworkWithMaskedEpsilonGreedy, which uses trfl.epsilon_greedy(). As documented in google-deepmind/trfl#27 and google-deepmind/trfl#28, there is a bug in trfl's action masking that currently renders this network unusable. Once that's fixed, I could refactor the networks used in this PR to use NetworkWithMaskedEpsilonGreedy instead.
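For reference, this is roughly how the trfl-based head is meant to be used once the masking fix lands (a sketch assuming trfl.epsilon_greedy's legal_actions_mask argument behaves as documented; the exact behavior depends on the fix to trfl#27/#28):

```python
# Sketch of masked action selection via trfl, assuming the legal_actions_mask
# handling works as documented once trfl#27/#28 are fixed.
import tensorflow as tf
import trfl

q_values = tf.constant([[1.0, 5.0, 2.0]])            # batch of one state
legal_actions_mask = tf.constant([[1.0, 0.0, 1.0]])  # action 1 is illegal

policy = trfl.epsilon_greedy(
    q_values, 0.1, legal_actions_mask=legal_actions_mask)
action = policy.sample()  # with correct masking, this should never be 1
```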

I’ve been implementing NFSP, a deep-RL algorithm for imperfect information games, using this Acme-OpenSpiel interface and the interface works quite well. If we can identify any remaining obstacles to importing this PR I'll be happy to work on them. Thanks!

@lanctot

lanctot commented Jan 7, 2021

> @fastturtle thanks for the comments! Regarding pytest failing, OpenSpiel does not currently have a pip package for direct installation (though it's in the works). Given that, any advice on how to handle this? And tagging @lanctot, who knows all things about installing OpenSpiel :)

I spoke to @fastturtle yesterday. OpenSpiel does have a pip package now, but it still requires a build from source, which requires specific things from your environment and can take a long time. So, until we have a binary wheel (at least a few months away), we've agreed to just disable tests that require/depend on OpenSpiel.

Please keep the code for these tests around, though. When it's merged, I will enable this test internally to ensure that it passes on our side. Thanks!

@jhtschultz force-pushed the open_spiel branch 3 times, most recently from 9e67acd to ec2e247 on January 9, 2021
@jhtschultz
Member Author

@fastturtle thanks for looking it over! Most of the changes are already in. I still need to do some further digging to fully explain the -np.inf threshold causing the priority error. Good call on your part to highlight that. I'll address the comments individually and let you know when everything is ready to go on my end.

@fastturtle
Collaborator

@jhtschultz sounds good, I'll keep an eye out for your comments.

@jhtschultz
Member Author

@fastturtle Done! Not sure why some pytests get stuck. Recent commits, including the previous one ff9b7ad, have been consistently passing so I think everything is good to go. Let me know if there's anything else I can do.

@jhtschultz
Member Author

We exposed _use_observation as a property in OpenSpiel's rl_environment, and the wrapper here has been updated to use the property.

Collaborator

@fastturtle left a comment

Looks really good, some last comments from me 👍

@jhtschultz
Member Author

Thanks! Good catches, fixed 👍

Collaborator

@fastturtle left a comment

This looks good to me, and has my approval. @mwhoffman PTAL.

# pytype: enable=import-error


class OLT(NamedTuple):
Collaborator

It might make more sense to define the OLT class outside this wrapper (or in the legal_actions policy definitions) and import here. That way we can skip one level of trying and potentially use the legal actions policies outside of openspiel envs. But we can play around with that after getting this change in...
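For reference, a minimal self-contained sketch of the OLT container referred to here, holding the observation, legal-actions mask, and terminal flag (field types are illustrative):

```python
from typing import NamedTuple

import numpy as np


class OLT(NamedTuple):
  """Observation, legal-actions mask, and terminal flag for one player."""
  observation: np.ndarray
  legal_actions: np.ndarray
  terminal: np.ndarray
```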

# Run an episode.
while not timestep.last():
  # Generate an action from the agent's policy and step the environment.
  if self._environment.is_turn_based:
Collaborator

Hi, I was thinking about the possibility of making this environment loop more generic by covering multi-agent/multi-team environments that are not necessarily turn-based or fully simultaneous.
Let's consider a multi-agent, turn-based environment (for example a soccer environment like https://github.com/google-research/football/) where each team consists of 11 players, but teams execute actions in turns. To cover OpenSpiel and this use case with one implementation we could:

  • eliminate _environment.is_turn_based
  • replace _environment.get_state.current_player() with a set of current players

That would cover "# TODO Support simultaneous move games." in a generic way. What do you think about that?

@lanctot Feb 2, 2021

Hi @qstanczyk,

Can you elaborate on your motivation for this? I worry that if we make this too general, it gets harder to interface with the specific API this was designed to be used with, and supporting multiple APIs could introduce some awkward bugs for use cases we didn't anticipate.

Also, I don't understand the football env example. If that environment is already turn-based, then it should already fit the OpenSpiel API as well (which, as you say, is turn-based: you just have 11 consecutive actions from the same player before switching the current player to the next one)?

WDYT about a wrapper of the football env as an OpenSpiel python game (living in this repo or in OpenSpiel)? Sounds like they are compatible from your description. Let me know if I'm missing something.

Collaborator

If the turn-based soccer environment were integrated the way you suggest (11 actions for each team), then what use case do you anticipate for "!is_turn_based"? Any multi-agent environment can be implemented by passing N actions, so maybe in that case there is no need to have the "is_turn_based" check around?
My motivation was to try to generalize the API by supporting any is_turn_based / multi-agent mix. I'm fine with not taking this thought into account, but at this point I just wonder whether "is_turn_based" is needed.

@lanctot Feb 2, 2021

> If the turn-based soccer environment were integrated the way you suggest (11 actions for each team), then what use case do you anticipate for "!is_turn_based"?

Wait, is that not how it's implemented? I only assumed that based on what you said. The !is_turn_based is necessary here because of the way OpenSpiel categorizes its dynamics. Some games require actions from all agents (simultaneous games) and apply them all at once, while others are turn-based, so you only get an action for the current player and apply that action alone.

> Any multi-agent environment can be implemented by passing N actions, so maybe in that case there is no need to have the "is_turn_based" check around?

Yes, I agree, but that's not how OpenSpiel implements turn-based games.

> My motivation was to try to generalize the API by supporting any is_turn_based / multi-agent mix.

Ok, yes, I think I understand. I think this would make more sense if we had a use case for this generalization, and I'm not seeing it (yet). So my hesitation is due mainly to this wrapper being designed for OpenSpiel specifically.

OpenSpiel treats these two cases separately because the algorithms built on top of the different types of games depend on these choices. Also, in OpenSpiel a simultaneous game doesn't need to be fully simultaneous at all states, and it's fine for a subset of agents to have only one legal pass-like move, hence my suggestion of wrapping the ones you mention as OpenSpiel simultaneous-move games, because then this interface would work on them. (Would that work, e.g., for the football env?)

Collaborator

Ack, I was just thinking out loud about generalization of the API. We're not looking at hooking up GRF at this point, but I was wondering if this OpenSpiel environment loop could be more generic.

if self._should_update:
  self._actors[player_id].update()
self._observed_first = [False] * len(self._actors)
self._prev_actions = [pyspiel.INVALID_ACTION] * len(self._actors)
Collaborator

Maybe initialize these two once at the beginning of run_episode? That would eliminate code duplication.

if not self._observed_first[player]:
  player_timestep = dm_env.TimeStep(
      observation=timestep.observation[player],
      reward=None,
Collaborator

Doesn't OpenSpiel return a reward in this case? In the generic case I think we should use the reward provided by the environment. Can't we call _get_player_timestep instead (setting dm_env.StepType.FIRST)?


# Make the first observation.
self._send_observation(timestep, self._environment.current_player)

Collaborator

Can't _send_observation be called as the first thing in the loop (which would eliminate duplicated code)?

@mwhoffman
Collaborator

A few minor changes were made (mostly for linter reasons!) and this was submitted in 7fa9b80. So I'll close this down for now, and any further discussion can carry on outside this PR. Thanks for all the hard work!

@mwhoffman closed this Feb 4, 2021

Development

Successfully merging this pull request may close these issues: [Feature] OpenSpiel integration/support

5 participants