Action Space Factorization #55

Bam4d · 2021-11-29T11:51:53Z

Bam4d
Nov 29, 2021
Collaborator

At the moment I believe the API design covers two-layer actions ("select an entity and then perform an action"), which I think is covered by the SelectEntityActionSpace.

However in some more complicated games there may be an arbitrary number of action "factors" or "parts", where each action "part" is dependent on the selection of the previous part.

An example of this would be in AlphaStar, where the action might contain a sequence of smaller sub-actions: "select units" -> "select some action (move, attack, fly, etc.)" -> "select destination xy".

Each "part" of the action can also change the mask for the next part, since that mask depends on the previously selected "part".

I think the Dota paper dealt with this by effectively quantizing all the possible action combinations (yielding a huge action space).
In my own work I've dealt with this by building a "conditional action tree", which effectively contains all the valid actions and masks for each part, and the actions are constructed by traversing this tree.
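To make the tree idea concrete, here is a minimal sketch of traversing such a tree while producing a mask for each part. It is not the implementation from the conditional-action-trees repository: the tree contents, `VOCAB`, `mask_for`, and `sample_action` are all hypothetical names chosen for illustration, and sampling is uniform rather than driven by a policy.

```python
import random

# Hypothetical tree: each key is a choice, each value is the subtree of
# choices that remain valid after it. An empty dict ends the action.
ACTION_TREE = {
    "select_marine": {
        "move": {"north": {}, "south": {}},
        "attack": {"enemy_0": {}, "enemy_1": {}},
    },
    "select_medic": {
        "move": {"north": {}, "south": {}},
        "heal": {"ally_0": {}},
    },
}

# Fixed vocabulary per action part, so each part has a fixed-size mask.
VOCAB = [
    ["select_marine", "select_medic"],
    ["move", "attack", "heal"],
    ["north", "south", "enemy_0", "enemy_1", "ally_0"],
]


def mask_for(node, depth):
    """Return a 0/1 mask over VOCAB[depth] marking the children of `node`."""
    return [1 if choice in node else 0 for choice in VOCAB[depth]]


def sample_action(tree, rng):
    """Walk from the root, sampling uniformly among the unmasked choices."""
    node, path, masks, depth = tree, [], [], 0
    while node:  # an empty dict means the action is complete
        mask = mask_for(node, depth)
        valid = [c for c, m in zip(VOCAB[depth], mask) if m]
        choice = rng.choice(valid)
        path.append(choice)
        masks.append(mask)
        node = node[choice]
        depth += 1
    return path, masks


path, masks = sample_action(ACTION_TREE, random.Random(0))
```

In a real agent the uniform `rng.choice` would be replaced by sampling from policy logits with the mask applied, but the traversal and the per-part masks would have the same shape.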

What do we think is the right method to pursue for these kinds of action spaces?

Theomat · 2021-11-29T13:30:38Z

Theomat
Nov 29, 2021
Collaborator

I believe that conditional action trees are much more elegant, scalable, and human-understandable.

0 replies

cswinter · 2021-11-29T15:00:18Z

cswinter
Nov 29, 2021
Maintainer

Awesome stuff, I enjoyed the talk you gave on this a while back! I'll have to read the paper and think about this for a bit before I can form a strong opinion so I just have some high-level comments for now.

1. I think the correct approach is to design the best possible API without limiting ourselves by thinking about implementation constraints, and then figure out how to make it efficient. Still, it could be useful to come up with a "minimal" version that is easy to implement and which we can start experimenting with and then extend later.
2. Currently, actions are implemented as follows:
   - For each action, the environment specifies the IDs of the entities that can execute this action. So we have a Dict[str, [u64]].
   - These actor IDs from multiple environments are combined into ragged buffers: Dict[str, RaggedBufferI64].
   - On the rollouts, we select the embeddings of actors, calculate logprobs, and sample an action. This gives us corresponding logprobs: Dict[str, RaggedBufferF32] and actions: Dict[str, RaggedBufferI64] batches with the same keys and ragged buffer shape as the actor_ids.
3. Rough idea for how we might extend 2. to allow for some version of conditional action trees: For conditional actions, instead of supplying a set of actor indices, define a function that takes as input the ragged actors: Dict[str, RaggedBufferI64] and actions: Dict[str, RaggedBufferI64] dicts of all previously executed actions, and then outputs a new RaggedBufferI64 of actor IDs on the fly.
4. Another way in which we can already support complex autoregressive action spaces is to push them fully into the environment by splitting them over multiple steps. On each step, the environment only makes one of the actions available, and can then choose what actions are available on the next step depending on the agent's previous choice.
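The "function that outputs actor IDs on the fly" idea might look roughly like the sketch below. This is only an illustration under stated assumptions: plain nested lists stand in for RaggedBufferI64 (one inner list per environment), and the `ActorFn` signature, the `"verb"` action name, and the `ATTACK` index are hypothetical, not part of any existing API.

```python
from typing import Callable, Dict, List

# Stand-in for a ragged buffer: one variable-length list per environment.
Ragged = List[List[int]]

# Hypothetical signature: given the actors and sampled choices of all
# previously executed actions, return actor IDs for a conditional action.
ActorFn = Callable[[Dict[str, Ragged], Dict[str, Ragged]], Ragged]

ATTACK = 1  # assumed index of "attack" within the hypothetical "verb" action


def attackers(actors: Dict[str, Ragged], actions: Dict[str, Ragged]) -> Ragged:
    """Actors for a follow-up "target" action: only those that chose ATTACK."""
    return [
        [actor for actor, choice in zip(env_actors, env_choices) if choice == ATTACK]
        for env_actors, env_choices in zip(actors["verb"], actions["verb"])
    ]


# Three environments with 2, 0, and 3 actors for the "verb" action.
actors = {"verb": [[3, 7], [], [2, 4, 9]]}
actions = {"verb": [[1, 0], [], [0, 1, 1]]}
target_actors = attackers(actors, actions)
print(target_actors)  # [[3], [], [4, 9]]
```

The result has one (possibly empty) list per environment, so it could be fed to the next action head in the same way as environment-supplied actor IDs.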

9 replies

cswinter Nov 30, 2021
Maintainer

I have a very strong prior that (4.) is an excellent choice that we should take extremely seriously. Let me try to convey my intuition.

The key variable we are in disagreement over is how much of the forward pass to repeat between consecutive actions. At one extreme end of this continuum, we do one shared forward pass and evaluate all actions in parallel (let's call this independent). At the other end, we repeat the entire forward pass after every action (autoregressive). The optimal choice will depend on the environment/action space and we should ultimately support both.

As you've noted, independent requires fewer forward passes. However, somewhat counter-intuitively, this does not imply higher efficiency. This is because (a) you have to do more work in each forward pass (to calculate not just one, but several correct actions) and (b) the network has access to less information since it won't know for sure what other action choices it will make (particularly in the lower layers where it won't even have accurate probabilities yet). Now, if the correct action choices don't depend much on the other actions, and most of the computations for the correct action can be shared, independent comes out ahead. But particularly for any tasks with non-trivial action spaces and stronger dependencies, I would expect higher capability and better efficiency from autoregressive. The key question is how much computation you need to expend to reach a given level of performance. Autoregressive can achieve the same level of performance with a smaller model/cheaper forward passes (since each forward pass has to perform less work) and has a higher maximum possible level of capability (since it has access to additional information).
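A toy calculation can illustrate point (b). The sketch below assumes a hypothetical two-part action where both binary parts must agree, and compares the best distribution that independent heads can express (the product of the marginals) with an autoregressive factorization that conditions the second part on the first. It says nothing about compute cost; it only shows the difference in what each factorization can represent.

```python
from itertools import product

# Hypothetical target policy: two binary action parts that must agree.
target = {(0, 0): 0.5, (1, 1): 0.5}

# Independent heads can only match the marginal of each part.
p_first = {a: sum(p for (x, _), p in target.items() if x == a) for a in (0, 1)}
p_second = {b: sum(p for (_, y), p in target.items() if y == b) for b in (0, 1)}
independent = {
    (a, b): p_first[a] * p_second[b] for a, b in product((0, 1), repeat=2)
}


def p_second_given_first(b, a):
    """Conditional probability of the second part given the first."""
    return target.get((a, b), 0.0) / p_first[a]


# An autoregressive policy conditions the second part on the first choice.
autoregressive = {
    (a, b): p_first[a] * p_second_given_first(b, a)
    for a, b in product((0, 1), repeat=2)
}

# Probability mass the independent policy puts on mismatched combinations.
invalid_mass = sum(p for ab, p in independent.items() if ab not in target)
print(invalid_mass)  # 0.5
```

Here the independent heads place half of their probability on combinations the target never takes, while the autoregressive factorization reproduces the target exactly.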

Let me give some concrete examples of where autoregressive is absolutely necessary and independent will entirely fail to achieve good performance:

- Language modeling: Language modeling is an extremely well-studied problem. All the best models are large autoregressive transformers that perform a full forward pass to generate a single token of just a few characters. If you try to predict multiple tokens ahead, performance rapidly degrades. There has been a lot of research into generating more text in one shot, and so far nothing has come close to the capabilities and efficiency of fully autoregressive models.
- For any RL task, you could output actions for not just the current but a few subsequent timesteps as well. Clearly, this is not going to be a good idea in the majority of cases.

cswinter Nov 30, 2021
Maintainer

I do agree that in the case of many mostly independent actors that all take one action, a single forward pass with a smaller (possibly autoregressive) action head is likely to be a good choice.

jeremysalwen Nov 30, 2021
Collaborator

I want to note that my implementation of autoregressive action selection (in Stone Ground Hearth Battles) chooses a multi-part action using a single (batched) pass of a neural network, without recomputing the entire network, and without choosing the different components independently.

We take the representation used to select the first component of the action, concatenate the information about which first component was chosen, pass it through an additional transformer layer, and use that output to define the second component policy. I think this is the best of both worlds. This works naturally with an explicit CAT, but I think we would need something "extra" to get this working if we were representing the choices as spread out over multiple timesteps of the environment.

https://github.com/JDBumgardner/stone_ground_hearth_battles/blob/05dba5d0c2c58b1c2f6a6011f5896e8d1b8f3617/hearthstone/training/pytorch/networks/transformer_net.py#L219-L266
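The following is a minimal, dependency-free sketch of the pattern described above, not the linked implementation. It assumes a single shared representation vector `hidden`, uses a dense layer with tanh as a stand-in for the additional transformer layer, and all names (`sample_two_part_action`, `params`, the sizes `D`, `N1`, `N2`) are hypothetical.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, x):
    """Multiply a (rows x len(x)) weight matrix by the vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]


def sample_two_part_action(hidden, params, rng):
    """Sample (first, second) from one shared representation `hidden`."""
    # First component: policy computed directly from the shared representation.
    probs1 = softmax(linear(params["head1"], hidden))
    first = rng.choices(range(len(probs1)), weights=probs1)[0]

    # Concatenate a one-hot encoding of the chosen first component.
    one_hot = [1.0 if i == first else 0.0 for i in range(len(probs1))]
    conditioned = hidden + one_hot

    # Extra layer (dense + tanh here, standing in for a transformer layer).
    mixed = [math.tanh(v) for v in linear(params["mix"], conditioned)]

    # Second component: policy conditioned on the first choice.
    probs2 = softmax(linear(params["head2"], mixed))
    second = rng.choices(range(len(probs2)), weights=probs2)[0]
    logprob = math.log(probs1[first]) + math.log(probs2[second])
    return first, second, logprob


rng = random.Random(0)
D, N1, N2 = 4, 3, 5  # hidden size, number of first and second choices
params = {
    "head1": random_matrix(N1, D, rng),
    "mix": random_matrix(D, D + N1, rng),
    "head2": random_matrix(N2, D, rng),
}
hidden = [rng.gauss(0.0, 1.0) for _ in range(D)]
first, second, logprob = sample_two_part_action(hidden, params, rng)
```

Only the small conditioned head is evaluated a second time; the computation that produced `hidden` is shared by both components, which is the trade-off the comment describes.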

Theomat Nov 30, 2021
Collaborator

When I said that I was not convinced by 4., I did not mean I was against autoregressive actions. I agree with you that we need an autoregressive model. I just imagined implementing it in a way similar to what @jeremysalwen just described above. The multiple subsets managed by the environment seem like they add new complexity and impose constraints on the rest of the implementation, whereas with @jeremysalwen's implementation, for example, there would be little to no constraints with respect to design choices on the actor.

cswinter Nov 30, 2021
Maintainer

Perhaps I was being a little too forceful here; I just really don't want us to prematurely discard any approaches before we've gathered solid data on their effectiveness. In ML it can be difficult to predict what will actually work, and the best solution is often surprising. I think we're mostly in agreement here, and y'all should follow your own intuitions on what to try first.

jeremysalwen · 2021-11-29T16:38:00Z

jeremysalwen
Nov 29, 2021
Collaborator

I think that conditional action trees are the way to go. If we have the action space represented as a CAT, we can always go back and enumerate/quantize it later.
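As a rough sketch of why enumeration stays available later: if the tree is stored as nested dicts (an assumption made here for illustration, along with the names `enumerate_actions`, `tree`, and `action_index`), every root-to-leaf path is one entry of the flat action space.

```python
def enumerate_actions(tree, prefix=()):
    """Yield every root-to-leaf path of a nested-dict action tree."""
    if not tree:  # an empty dict marks a complete action
        yield prefix
        return
    for choice, subtree in tree.items():
        yield from enumerate_actions(subtree, prefix + (choice,))


# Hypothetical tree used only for this example.
tree = {
    "unit_a": {
        "move": {"north": {}, "south": {}},
        "attack": {"enemy_0": {}},
    },
    "unit_b": {
        "move": {"north": {}},
    },
}

flat_actions = list(enumerate_actions(tree))
action_index = {action: i for i, action in enumerate(flat_actions)}
print(len(flat_actions))  # 4
```

The flat list grows with the product of the branching factors along each path, which is the "huge action space" cost that the tree representation avoids paying up front.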

As a guiding principle, I think we would want to express our action spaces in a compositional way. E.g., I would think about how someone else could create a parallel implementation of conditional action trees that would work the same as our implementation and would allow mutual nesting.

0 replies

Bam4d · 2021-11-29T16:41:08Z

Bam4d
Nov 29, 2021
Collaborator Author

Here is some CAT code showing how I was doing exploration and masking pretty efficiently (batched inference over the tree structure):
https://github.com/Bam4d/conditional-action-trees/blob/main/conditional_action_trees/conditional_action_exploration.py#L9

1 reply

Bam4d Nov 29, 2021
Collaborator Author

This impl isn't doing autoregression (just conditional masking), but if we wanted to add that and re-calculate logits for each tree branch, it shouldn't be too complicated.
