RelEnt uses trajectories of varying length

Our implementation of RelEnt currently works with trajectories of varying length. This is because we rely on our collect_trajs util, which returns as soon as an episode ends, so each trajectory is as long as the episode that produced it.

By contrast, the RelEnt paper does all calculations under the assumption of a fixed trajectory length.

I'm not sure whether this is problematic, but I'm opening this issue so we don't forget to look into it.
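
A possible first step would be to measure how much the lengths actually vary, and to see what a fixed-horizon version of the data would look like. The snippet below is only a sketch: it assumes a trajectory is a sequence of (state, action) pairs, which may not match what collect_trajs returns, and the helper names `length_summary` and `truncate_to_horizon` are hypothetical, not part of our codebase.

```python
from typing import List, Sequence, Tuple

# Assumed representation (hypothetical): a trajectory is a sequence of
# (state, action) pairs. Adapt this to whatever collect_trajs returns.
Trajectory = Sequence[Tuple[int, int]]


def length_summary(trajs: List[Trajectory]) -> Tuple[int, int, bool]:
    """Return (min length, max length, whether all lengths are equal)."""
    lengths = [len(t) for t in trajs]
    return min(lengths), max(lengths), len(set(lengths)) == 1


def truncate_to_horizon(trajs: List[Trajectory], horizon: int) -> List[Trajectory]:
    """Drop trajectories shorter than `horizon` and cut the rest to exactly `horizon` steps."""
    return [t[:horizon] for t in trajs if len(t) >= horizon]
```

Truncating like this discards short episodes entirely, which may bias the feature expectations, so it is meant as a diagnostic rather than a proposed fix.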