I am an undergraduate student attempting to replicate the LEAP training environment from mujoco_playground to implement a dexterous hand RL task. I have been stuck on this for a month, so any insights would be greatly appreciated!
1. The Problem:
Initially, I designed a simple “reaching” reward defined as the negative of the distance to the target (R=−d), without using the reward.tolerance utility.
However, as training progressed, the agent learned to move away from the target. Strangely, the reported reward values were increasing during this process.
2. The Fix:
After mimicking the LEAP environment implementation, I switched to using the reward.tolerance function (which seems to create a bounded/capped reward). With this change, the agent behaved correctly and learned to reach the target successfully.
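For reference, here is my rough understanding of what the linear tolerance shaping computes with my parameters; this is my own approximation for illustration, not the actual mujoco_playground reward.tolerance implementation:

```python
# My own approximation of the bounded "linear" tolerance shaping, for
# illustration only -- not the actual mujoco_playground implementation.
import numpy as np

def linear_tolerance(x, lower=-0.05, upper=0.0, margin=0.5, value_at_margin=0.0):
    """1.0 inside [lower, upper]; outside, the value falls off linearly over
    `margin` from 1.0 at the bound down to `value_at_margin`, then clips at 0."""
    # Normalized distance outside the bounds (0 when x is inside them).
    d = np.maximum(np.maximum(lower - x, x - upper), 0.0) / margin
    return np.clip(1.0 - (1.0 - value_at_margin) * d, 0.0, 1.0)
```

If this reading is right, the term stays in [0, 1] no matter how large palm_dist gets, whereas -palm_dist is unbounded below.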
3. My Question:
Does using reward.tolerance fundamentally change the gradient flow or the optimization landscape compared to a raw linear negative distance? I suspect this might be related to how gradients are calculated or the handling of unbounded values, but I have hit a wall trying to understand the root cause.
Below is my Python code; palm_dist is always less than 0.5. The two assignments are the two variants I compared:

```python
# Variant 1: bounded tolerance reward (the fix, following the LEAP environment).
reward_reach_palm = reward.tolerance(
    palm_dist,
    bounds=(-0.05, 0.0),
    margin=0.5,
    sigmoid='linear',
    value_at_margin=0.0,
)

# Variant 2: raw negative distance (the original reward that misbehaved).
reward_reach_palm = -palm_dist
```
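For concreteness, here are the rough values the two formulations give at a few palm distances (the tolerance column uses my linear approximation from above, not the real reward.tolerance output):

```python
# Hypothetical side-by-side of the two reward signals at a few palm distances,
# using the linear approximation sketched earlier (bounds=(-0.05, 0.0),
# margin=0.5, value_at_margin=0.0).
for d in (0.05, 0.2, 0.45):
    bounded = max(0.0, 1.0 - d / 0.5)  # stays in [0, 1]
    print(f"palm_dist={d:.2f}: tolerance ~ {bounded:.2f}, negative distance = {-d:.2f}")
```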