Status: Open
Labels: question (further information is requested)
Description
Question
As far as I can tell, rewards are managed in the following manner:
1. `_cumulative_rewards[agent]` is returned by `env.last()` (along with `observation`, `termination`, `truncation`, and `info`).
2. The policy chooses an `action`, which is then executed by `env.step(action)`.
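Concretely, the two steps above form the standard AEC driver loop. Here is a runnable sketch of that cycle; `ToyEnv` and `policy` are hypothetical stand-ins I wrote just to make the `env.last()`/`env.step()` shape executable, not PettingZoo's actual classes:

```python
class ToyEnv:
    """Hypothetical two-agent stand-in exposing a PettingZoo-style AEC API."""

    def __init__(self, max_steps=4):
        self.agents = ["agent_0", "agent_1"]
        self._cumulative_rewards = {a: 0.0 for a in self.agents}
        self._steps, self._max_steps = 0, max_steps

    def agent_iter(self):
        # Yield the acting agent until the episode's step budget runs out.
        while self._steps < self._max_steps:
            yield self.agents[self._steps % len(self.agents)]

    def last(self):
        # Returns (observation, cumulative reward, termination, truncation, info)
        agent = self.agents[self._steps % len(self.agents)]
        terminated = self._steps >= self._max_steps - len(self.agents)
        return None, self._cumulative_rewards[agent], terminated, False, {}

    def step(self, action):
        agent = self.agents[self._steps % len(self.agents)]
        self._cumulative_rewards[agent] = 0.0  # reward was consumed via last()
        self._steps += 1


def policy(observation):
    return 0  # hypothetical placeholder policy


env = ToyEnv()
for agent in env.agent_iter():
    observation, reward, termination, truncation, info = env.last()
    action = None if termination or truncation else policy(observation)
    env.step(action)
```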
I know that rewards are more complicated than in a typical RL environment, because the reward for `agent_0` should in some circumstances be adjusted during the turn of `agent_1`, for example. IIUC, `_cumulative_rewards` is used to account for this.
Within `step` (in 2. above), the following occurs:
1. We set `self._cumulative_rewards[agent] = 0`, because the policy for `agent` has already received this reward from `env.last()` and processed it while choosing its next `action`.
2. The dictionary `self.rewards` is updated according to the consequences of `action`.
3. We then call `self._accumulate_rewards()` to update the `self._cumulative_rewards` dictionary. From the code, this straightforwardly increments `self._cumulative_rewards` with the values in `self.rewards`.
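The accumulation step can be sketched in a few lines. This is my own minimal version of what `_accumulate_rewards()` appears to do, with the dictionary names taken from the issue:

```python
def accumulate_rewards(cumulative_rewards, rewards):
    """Minimal sketch of _accumulate_rewards(): fold the per-step
    rewards dict into the running cumulative totals."""
    for agent, reward in rewards.items():
        cumulative_rewards[agent] += reward


cumulative = {"agent_0": 0.0, "agent_1": 0.0}
accumulate_rewards(cumulative, {"agent_0": 1.0, "agent_1": -1.0})
# cumulative is now {"agent_0": 1.0, "agent_1": -1.0}
```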
I have two questions:
1. Why is `self.rewards` even needed? Why not just directly adjust `self._cumulative_rewards` to incorporate the consequences of `action`?
2. Am I correct in thinking that `self.rewards` should be set to 0 for all agents after every call to `self._accumulate_rewards()`? My reasoning is that otherwise, the values in `rewards` will be added to `_cumulative_rewards` multiple times, which is undesirable. If this is the case, why isn't this functionality built into `self._accumulate_rewards()`?
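To make the double-counting concern in question 2 concrete, here is a toy run (my own sketch, reusing the dict names from the issue): if the per-step rewards are not zeroed between accumulation calls, the same values are added again.

```python
def accumulate(cumulative, rewards):
    # Same folding step as described for _accumulate_rewards().
    for agent, reward in rewards.items():
        cumulative[agent] += reward


def clear(rewards):
    # Zero the per-step rewards, analogous to clearing self.rewards.
    for agent in rewards:
        rewards[agent] = 0.0


# Without clearing: the same per-step reward is counted twice.
cumulative = {"agent_0": 0.0}
rewards = {"agent_0": 1.0}
accumulate(cumulative, rewards)  # cumulative["agent_0"] == 1.0
accumulate(cumulative, rewards)  # double-counted: now 2.0

# With clearing: the reward is counted exactly once.
cumulative = {"agent_0": 0.0}
rewards = {"agent_0": 1.0}
accumulate(cumulative, rewards)
clear(rewards)                   # prevents re-adding the same reward
accumulate(cumulative, rewards)  # cumulative["agent_0"] stays 1.0
```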