[Question] What is the point of having a separate .rewards dictionary? #1300

@daxmavy

Question

As far as I can tell, rewards are managed in the following manner:

  1. _cumulative_rewards[agent] is returned by env.last() (along with observation, termination, truncation, info)
  2. Policy chooses an action, which is then executed by env.step(action).

I know that rewards are more complicated here than in a typical RL environment because, for example, the reward for agent_0 may in some circumstances need to be adjusted during agent_1's turn. IIUC, _cumulative_rewards is used to account for this.
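To make the cross-turn accumulation concrete, here is a toy two-agent environment. The attribute names mirror PettingZoo's, but the mechanics are a simplified sketch of my understanding, not the library's actual implementation:

```python
class ToyAECEnv:
    """Illustrative sketch of per-step vs. cumulative reward bookkeeping."""

    def __init__(self):
        self.agents = ["agent_0", "agent_1"]
        self.agent_selection = "agent_0"
        # Per-step rewards from the most recent step() call.
        self.rewards = {a: 0 for a in self.agents}
        # Rewards accumulated since each agent last acted.
        self._cumulative_rewards = {a: 0 for a in self.agents}

    def _accumulate_rewards(self):
        for agent, r in self.rewards.items():
            self._cumulative_rewards[agent] += r

    def last(self):
        # A real env also returns observation, termination, truncation, info.
        return self._cumulative_rewards[self.agent_selection]

    def step(self, action):
        agent = self.agent_selection
        # The acting agent's policy already consumed this value via last().
        self._cumulative_rewards[agent] = 0
        # Illustrative dynamics: agent_1's action also rewards agent_0.
        self.rewards = {a: 0 for a in self.agents}
        if agent == "agent_1":
            self.rewards["agent_0"] = 1
        self._accumulate_rewards()
        # Clear per-step rewards so they are not accumulated again.
        self.rewards = {a: 0 for a in self.agents}
        self.agent_selection = "agent_1" if agent == "agent_0" else "agent_0"


env = ToyAECEnv()
assert env.last() == 0   # agent_0's turn: nothing accumulated yet
env.step(0)              # agent_0 acts
assert env.last() == 0   # agent_1's turn: no reward for it
env.step(0)              # agent_1 acts; its action rewards agent_0
assert env.last() == 1   # agent_0's turn again: the cross-turn reward arrives
```

The point of the sketch: the reward generated for agent_0 during agent_1's turn survives in `_cumulative_rewards` until agent_0's next call to `last()`.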

Within step (in 2. above), the following occurs:

  1. We set self._cumulative_rewards[agent] = 0, because the policy for agent has already received this reward from env.last() and processed it while choosing its next action.
  2. The dictionary self.rewards is updated according to the consequences of action.
  3. We then call self._accumulate_rewards() to update the self._cumulative_rewards dictionary. From the code, this simply increments self._cumulative_rewards by the values in self.rewards.

I have two questions:

  1. Why is self.rewards even needed? Why not just directly adjust self._cumulative_rewards to incorporate the consequences of action?
  2. Am I correct in thinking that self.rewards should be reset to 0 for all agents after every call to self._accumulate_rewards()? My reasoning is that otherwise the values in self.rewards will be added to _cumulative_rewards multiple times, which is undesirable. If this is the case, why isn't this reset built into self._accumulate_rewards()?
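The double-counting concern in question 2 can be demonstrated directly with plain dictionaries (illustrative, not PettingZoo internals):

```python
def accumulate(rewards, cumulative):
    """Mimics _accumulate_rewards(): add per-step rewards to the totals."""
    for agent, r in rewards.items():
        cumulative[agent] += r

rewards = {"agent_0": 1, "agent_1": 0}
cumulative = {"agent_0": 0, "agent_1": 0}

accumulate(rewards, cumulative)    # intended effect: agent_0 has earned 1
assert cumulative["agent_0"] == 1

accumulate(rewards, cumulative)    # rewards was never cleared...
assert cumulative["agent_0"] == 2  # ...so the same 1 is counted twice

rewards = {a: 0 for a in rewards}  # clearing after each accumulation
accumulate(rewards, cumulative)    # prevents the duplicate
assert cumulative["agent_0"] == 2  # unchanged
```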
