And to import OpenSpiel, add these two lines to Acme's virtual environment.
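(The two lines themselves didn't survive the thread. Purely as a hedged sketch, assuming the usual pre-pip OpenSpiel setup of exposing a local checkout on the Python path, they might look something like the following; the paths are placeholders.)

```python
# Hypothetical paths: adjust to wherever OpenSpiel is cloned and built.
import sys
sys.path.append("/path/to/open_spiel")               # open_spiel Python modules
sys.path.append("/path/to/open_spiel/build/python")  # compiled pyspiel extension
```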
```python
with tf.GradientTape() as tape:
  # Evaluate our networks.
  q_tm1 = self._network(o_tm1)
  q_t_value = self._target_network(o_t)
```
Is this the main change for DQN in this case? I.e. I wonder (as I think @fastturtle was alluding to) whether, if the legal_actions mask is moved into the observations directly, this logic can then be moved into the policy (perhaps via a wrapper around a "raw q-function network"). This would then allow a broader array of potential algorithms to be applied in this setting.
Take this all with a grain of salt, because I'm thinking out loud a little bit at the moment, but: I don't know how much of a pain it would be to retroactively mark some of these files as copies (i.e. so the diff only shows the actual changes). But from looking, I think the main changes in the agent/actor/learner (please correct me if I'm wrong!) revolve around the legal actions. I've noted this above, and I wonder if we can push these directly as observations and let the network handle the legal actions. I think this will allow the learner to remain unchanged. Ideally the same could be done for the actor/agent (putting aside the changes to epsilon-greedy that are probably needed), but even if this only really helps the learner that'd be great. Ideally we could have an OpenSpiel-specific environment loop, wrapper, and potentially a "Q-network wrapper", but the agent itself isn't changed (and we can peel back as much of that as is necessary if/when this ideal isn't possible 😄).
Ahh I see, yeah moving the legal actions mask into the observation and letting the network handle everything sounds like the right approach. It should streamline things quite a bit. I tried to mark the new files as copies but couldn't get the git diff to work; it kept marking everything as new additions. Either way I think you already identified the core issue. I'll implement those changes with the goal of confining everything to the network, loop and wrapper. Thanks for the quick feedback!
Pushed a big update that's pretty close to the ideal @mwhoffman described. Moving legal actions into the observation really was the way to go; it simplified everything dramatically. All that's needed now to run the OpenSpiel example is the wrapper, environment loop, custom networks, and the slightest of modifications to the DQN agent. The remaining issue with the DQN agent is the policy network: currently the DQN agent creates the policy network by appending trfl's epsilon_greedy to the Q-network, which doesn't know about the legal actions mask.

Now that OpenSpiel's footprint is much smaller, I moved it out of its own directory and placed the files in the appropriate folders. The wrapper is < 150 lines, the env loop < 250, and the networks are pretty simple, so this update gets us a lot closer to the finish line.
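(For context, here is a minimal sketch of the kind of legal-actions network wrapper being discussed. The class name, OLT-style observation fields, and masking constant are illustrative assumptions, not the actual contents of legal_actions.py.)

```python
import sonnet as snt
import tensorflow as tf


class MaskedQNetwork(snt.Module):
  """Sketch: wraps a raw Q-network and masks out illegal actions."""

  def __init__(self, raw_q_network: snt.Module, name: str = "masked_q_network"):
    super().__init__(name=name)
    self._raw_q_network = raw_q_network

  def __call__(self, observation) -> tf.Tensor:
    # `observation` is assumed to bundle the board features and a 0/1
    # legal-actions mask (e.g. an OLT-style namedtuple).
    q_values = self._raw_q_network(observation.observation)
    # Push illegal actions to a large negative value so neither the greedy
    # policy nor the max in the TD target can select them.
    return q_values - (1.0 - observation.legal_actions) * 1e9
```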
fastturtle left a comment:
Thanks for the PR, left some comments :)
```python
      discount=timestep.discount[player],
      step_type=timestep.step_type)

  # TODO Remove? Currently used for debugging.
```
I'm not very familiar with TODOs in OS code, but should we have an issue to track them?
I'll clear out all the TODOs before it gets merged, there aren't many. Left these here just to make sure they got addressed.
```python
      step_type=timestep.step_type)

  # TODO Remove? Currently used for debugging.
  def _print_policy(self, timestep: dm_env.TimeStep, player: int):
```
Is this generally useful, or was it while you were debugging a particular issue?
This is here so that simply by running the environment loop one can easily verify the agents are learning correctly.
I have mixed feelings as to whether it should make the final cut. It's nice for the user to be able to run an example and see exactly what's going on without any additional work on their part. On the other hand, it's of questionable usefulness in larger environments and makes the environment loop slightly less elegant. If anyone has a preference I'm happy to go with that.
acme/tf/networks/legal_actions.py
```python
return outputs


# TODO Function to update epsilon
```
What does this TODO refer to?
Ah good question, I meant to ask: is there currently an easy way to decay DQN's epsilon parameter? There doesn't appear to be; it seems fixed once the policy network is created. This is an important feature for some use cases, and since I was building an epsilon-greedy Sonnet module anyway, it seemed like something we'd want to include.
AFAIK there isn't a standard way of doing parameter decay. I'm guessing you've already thought of this, but you'll have to do it using your tf.Variable approach. I'm fine keeping the TODO, but could you flesh it out a bit (not too much) so it has more context.
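(For what it's worth, a minimal sketch of that tf.Variable approach, with assumed names rather than the module in this PR: keeping epsilon in a variable lets callers decay it without rebuilding the policy network.)

```python
import sonnet as snt
import tensorflow as tf
import trfl


class EpsilonGreedy(snt.Module):
  """Sketch: epsilon-greedy policy head whose epsilon can be decayed externally."""

  def __init__(self, q_network: snt.Module, epsilon_init: float = 1.0):
    super().__init__(name="epsilon_greedy")
    self._q_network = q_network
    # Stored as a non-trainable variable so it can be reassigned between episodes.
    self.epsilon = tf.Variable(epsilon_init, trainable=False, dtype=tf.float32)

  def __call__(self, observation) -> tf.Tensor:
    q_values = self._q_network(observation)
    return trfl.epsilon_greedy(q_values, epsilon=self.epsilon).sample()


# Usage sketch: decay epsilon from the training loop, e.g. once per episode.
# policy.epsilon.assign(tf.maximum(0.01, policy.epsilon * 0.999))
```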
Also, there's an issue with pytest failing; it looks like it isn't installing OpenSpiel.
@fastturtle thanks for the comments! Regarding the pytest failure: OpenSpiel does not currently have a pip package for direct installation (though it's in the works). Given that, any advice on how to handle this? Also tagging @lanctot, who knows all things about installing OpenSpiel :)
@fastturtle @jhtschultz I am planning an OpenSpiel release in a few weeks and will have it on PyPI, but only as a source distribution for now (which means a 1-2 minute install, and requires the machines to have CMake, clang++, and python3-dev). Do you think that would solve it? Edit: I hope to have a bdist wheel at some point next year, but it requires quite a bit of planning and preparation to get right.
Hi! Wanted to check in on the status of this PR. I saw commit 996c63d followed the approach taken here and added the ability to provide DQN's policy network directly. That's great, because now this PR requires zero changes to Acme's DQN. Commit 996c63d also added some related functionality.

I've been implementing NFSP, a deep-RL algorithm for imperfect-information games, on top of this Acme-OpenSpiel interface, and the interface works quite well. If we can identify any remaining obstacles to importing this PR, I'll be happy to work on them. Thanks!
I spoke to @fastturtle yesterday. OpenSpiel does have a pip package now, but it still requires a build from source, which requires specific things from your environment and could take a long time. So, until we have a binary wheel (at least a few months away), we've agreed to just disable the tests that depend on OpenSpiel. Please keep the code for these tests around, though. When it's merged, I will enable this test internally to ensure that it passes on our side. Thanks!
Force-pushed from 9e67acd to ec2e247.
@fastturtle thanks for looking it over! Most of the changes are already in. I still need to do some further digging to fully explain the remaining pytest behavior.
@jhtschultz sounds good, I'll keep an eye out for your comments.
@fastturtle Done! Not sure why some pytests get stuck. Recent commits, including the previous one ff9b7ad, have been consistently passing, so I think everything is good to go. Let me know if there's anything else I can do.
We exposed
fastturtle left a comment:
Looks really good, some last comments from me 👍
Thanks! Good catches, fixed 👍
fastturtle left a comment:
This looks good to me, and has my approval. @mwhoffman PTAL.
```python
# pytype: enable=import-error


class OLT(NamedTuple):
```
It might make more sense to define the OLT class outside this wrapper (or in the legal_actions policy definitions) and import it here. That way we can skip one level of trying and potentially use the legal-actions policies outside of OpenSpiel envs. But we can play around with that after getting this change in...
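(For readers outside the diff, a sketch of what an OLT container along these lines looks like; the exact field names and dtypes in the wrapper may differ, so treat these as assumptions.)

```python
from typing import NamedTuple

import numpy as np


class OLT(NamedTuple):
  """Observation, Legal actions, Terminal: per-player timestep contents."""
  observation: np.ndarray    # raw information-state / board features
  legal_actions: np.ndarray  # 0/1 mask over the game's action space
  terminal: np.ndarray       # whether the episode has ended for this player
```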
```python
# Run an episode.
while not timestep.last():
  # Generate an action from the agent's policy and step the environment.
  if self._environment.is_turn_based:
```
Hi, I was thinking about the possibility of making this environment loop more generic by covering multi-agent/multi-team environments that are not necessarily turn-based or fully simultaneous.
Let's consider a multi-agent, turn-based environment (for example a soccer environment like https://github.com/google-research/football/) where each team consists of 11 players, but teams execute actions in turns. To cover OpenSpiel and this use case with one implementation we could:
- eliminate _environment.is_turn_based
- replace _environment.get_state.current_player() with a set of current players
That would cover "# TODO Support simultaneous move games." in a generic way.
What do you think about that?
Hi @qstanczyk,
Can you elaborate on your motivation for this? I worry that if we make this too general, it gets harder to interface with the specific API it was designed to be used with, and supporting multiple APIs could introduce some awkward bugs for use cases we didn't anticipate.
Also, I don't understand the football env example. If that environment is already turn-based, then it should fit the OpenSpiel API as-is (which, as you say, is already turn-based; you just have 11 consecutive actions from the same player before switching the current player to the next one)?
WDYT about a wrapper of the football env as an OpenSpiel python game (living in this repo or in OpenSpiel)? It sounds like they are compatible from your description. Let me know if I'm missing something.
If the turn-based soccer environment were integrated the way you suggest (11 actions for each team), then what use case do you anticipate for "!is_turn_based"? Any multi-agent environment can be implemented by passing N actions, so maybe in that case there is no need to have the "is_turn_based" check around?
My motivation was to try and generalize the API by supporting any is_turn_based / multi-agent mix. I'm fine with not taking this thought into account, but I just wonder at this point if "is_turn_based" is needed.
> If the turn-based soccer environment were integrated the way you suggest (11 actions for each team), then what use case do you anticipate for "!is_turn_based"?

Wait, is that not how it's implemented? I only assumed that based on what you said. The !is_turn_based is necessary here because of the way OpenSpiel categorizes its dynamics: some games require actions from all agents (simultaneous games) and apply them all at once, while others are turn-based, so you only get an action for the current player and apply just that one.
> Any multi-agent environment can be implemented by passing N actions, so maybe in that case there is no need to have the "is_turn_based" check around?

Yes, I agree, but that's not how OpenSpiel implements turn-based games.
> My motivation was to try and generalize the API by supporting any is_turn_based / multi-agent mix.

Ok yes, I think I understand. I think this would make more sense if we had a use case for this generalization, and I'm not seeing it (yet). So my hesitation is due mainly to this wrapper being designed for OpenSpiel specifically.
OpenSpiel treats these two cases separately because the algorithms built on top of the different types of games depend on these choices. Also, in OpenSpiel a simultaneous game doesn't need to be fully simultaneous at all states, and it's fine for a subset of agents to only have one legal pass-like move; hence my suggestion of wrapping the environments you mention as OpenSpiel simultaneous-move games, because then this interface would work on them. (Would that work, e.g., for the football env?)
Ack, I was just thinking out loud about generalization of the API. We aren't looking at hooking up GRF at this point, but I was wondering if this OpenSpiel environment loop could be more generic.
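(To summarize the distinction being debated, a rough sketch of the two branches; `environment`, `actors`, and `timestep` here are stand-ins, not the loop's exact attributes. Turn-based games step on a single action from the current player, while simultaneous-move games step on a joint action from every player.)

```python
if environment.is_turn_based:
  # Turn-based: only the current player acts this step.
  player = environment.current_player
  action = actors[player].select_action(timestep.observation[player])
  timestep = environment.step([action])
else:
  # Simultaneous-move: every player submits an action at once.
  actions = [actor.select_action(obs)
             for actor, obs in zip(actors, timestep.observation)]
  timestep = environment.step(actions)
```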
```python
if self._should_update:
  self._actors[player_id].update()
self._observed_first = [False] * len(self._actors)
self._prev_actions = [pyspiel.INVALID_ACTION] * len(self._actors)
```
Maybe initialize these two once at the beginning of run_episode? That would eliminate code duplication.
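(A small sketch of that suggestion, assuming run_episode owns this bookkeeping; names are taken from the snippets above.)

```python
import pyspiel


def run_episode(self):
  # Initialize the per-player bookkeeping once, up front, instead of
  # re-creating these lists after the per-player updates at the end.
  self._observed_first = [False] * len(self._actors)
  self._prev_actions = [pyspiel.INVALID_ACTION] * len(self._actors)

  timestep = self._environment.reset()
  while not timestep.last():
    ...  # Select actions, step the environment, and observe, as before.
```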
```python
if not self._observed_first[player]:
  player_timestep = dm_env.TimeStep(
      observation=timestep.observation[player],
      reward=None,
```
Doesn't OpenSpiel return a reward in this case? In the generic case I think we should use the reward provided by the environment. Can't we call _get_player_timestep instead (setting dm_env.StepType.FIRST)?
```python
# Make the first observation.
self._send_observation(timestep, self._environment.current_player)
```
Can't _send_observation be called as the first thing in the loop (which would eliminate duplicated code)?
A few minor changes were made (mostly for linter reasons!) and this was submitted in 7fa9b80. So I'll close this down for now, and any further discussion can carry on outside this PR. Thanks for all the hard work!
This pull request is a follow-up to #80 and adds the ability to run Acme's DQN on OpenSpiel games. All changes and additions are contained in the acme/open_spiel directory. The example run_dqn.py runs Acme's DQN on OpenSpiel's tic-tac-toe. It's set up to print out a full playthrough (policies included) every 1000 episodes, making it easy to verify that the agent is in fact learning a good policy. The agents start playing sensibly after roughly 10k episodes, and by about 50k episodes pretty much every game in which an exploratory action is not taken results in a draw.

Though I've endeavored to keep the code clean and production ready, this PR is only intended to be a starting point. There are a number of TODOs. Some are only meant to call attention to important considerations; others are tasks which seemed better addressed after there's been a chance for feedback and discussion.

Perhaps the most important design feature, and the one that stands to improve most from further thought, is the environment wrapper. In an effort to keep as much existing structure intact as possible, the wrapper is built around OpenSpiel's python rl_environment, which notably is not a dm_env. An alternative approach is to wrap the OpenSpiel game directly as a dm_env. However, dm_env doesn't appear to have been explicitly designed with multiplayer games in mind. Perhaps someone who has worked on dm_env, or the related (and very useful) tree repo, might have some insight into how best to do this.

I look forward to feedback and I'll be happy to implement any suggested changes. You'll notice that very few modifications to the DQN agent/learner were necessary. With this interface it should be straightforward to adapt other Acme algorithms to run on OpenSpiel games :-)