Best way to "mask" gradients in varying length sequences #10563
valentinmace asked this question in Q&A (Unanswered)
I have a batch of predictions from my transformer model and a batch of labels.
Say both predictions and labels have the shape (256, 1000, 32).
I am using a standard L2 loss to fit predictions to labels. However, in a given trajectory the agent might die before the end of the episode, say at the 600th timestep, leaving 400 timesteps on that trajectory that are irrelevant.
So I have 600 "good" predictions and labels and 400 "irrelevant" ones that I don't want to affect my model during loss computation and backpropagation. What is the best way to ignore these in JAX so that only the good data is considered?
Am I wrong in thinking that setting both the labels and the data of the last 400 timesteps to 0 will produce an error of 0 and not affect my training?
Thanks in advance.

Replies: 1 comment

IIUC, in addition to setting the labels to 0, you should set the model outputs of the last 400 timesteps to 0 as well.
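To make this concrete, here is a minimal sketch of masking the per-timestep L2 error explicitly rather than relying on zeroed labels alone. The function name, the boolean `mask` array, and the normalization by the number of valid timesteps are illustrative assumptions, not something stated in the discussion above.

```python
import jax
import jax.numpy as jnp

def masked_l2_loss(predictions, labels, mask):
    """L2 loss over valid timesteps only.

    predictions, labels: (batch, time, features)
    mask: (batch, time) boolean, True where the timestep is valid
          (e.g. False after the agent dies).
    """
    # Per-timestep squared error, summed over the feature dimension.
    per_step = jnp.sum((predictions - labels) ** 2, axis=-1)  # (batch, time)
    # Zero out the irrelevant timesteps so they contribute nothing
    # to the loss or to its gradients.
    per_step = jnp.where(mask, per_step, 0.0)
    # Normalize by the number of valid timesteps, not by batch * time,
    # so padded trajectories do not dilute the loss.
    return jnp.sum(per_step) / jnp.maximum(jnp.sum(mask), 1)

# Hypothetical example: 256 trajectories of length 1000 with 32 features,
# the first trajectory ending at timestep 600.
key = jax.random.PRNGKey(0)
preds = jax.random.normal(key, (256, 1000, 32))
labels = jnp.zeros((256, 1000, 32))
lengths = jnp.full((256,), 1000).at[0].set(600)        # per-trajectory episode lengths
mask = jnp.arange(1000)[None, :] < lengths[:, None]    # (256, 1000) bool

loss, grads = jax.value_and_grad(masked_l2_loss)(preds, labels, mask)
# grads is zero at every masked position.
```

Masking the per-step error this way has the same effect as zeroing both labels and model outputs, as suggested in the reply, but it also keeps the loss magnitude comparable across batches with different amounts of padding: with a plain mean over all 256 × 1000 positions, the dead timesteps would still inflate the denominator even though their gradients are zero.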