Our implementation of RelEnt currently works with trajectories of varying length. (This is because we rely on our collect_trajs util, which stops a trajectory as soon as its episode ends.)
By contrast, the RelEnt paper does all calculations under the assumption of a fixed trajectory length.
I'm not sure whether this is a problem, but I'm opening this issue so we don't forget to look into it.
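A minimal sketch of why the mismatch might matter, assuming trajectory feature counts are (optionally discounted) sums of per-step feature vectors: a trajectory that simply lasts longer accumulates larger counts, so comparisons between trajectories of different lengths are not like-for-like. `feature_counts` and `truncate_or_pad` are hypothetical helpers, not part of our codebase, and zero-padding assumes that steps after termination behave like an absorbing state with all-zero features.

```python
import numpy as np


def feature_counts(features, gamma=1.0):
    """Discounted sum of per-step feature vectors for one trajectory.

    features: array of shape (T, d), one row per timestep.
    Returns an array of shape (d,).
    """
    features = np.asarray(features, dtype=float)
    discounts = gamma ** np.arange(len(features))
    return discounts @ features


def truncate_or_pad(features, horizon):
    """Hypothetical way to force a fixed horizon: truncate long
    trajectories and zero-pad short ones (assumes zero features
    after the episode ends)."""
    features = np.asarray(features, dtype=float)[:horizon]
    pad = horizon - len(features)
    return np.pad(features, ((0, pad), (0, 0)))


# Constant feature of 1 per step: counts grow with trajectory length.
short_traj = np.ones((3, 1))
long_traj = np.ones((5, 1))
print(feature_counts(short_traj), feature_counts(long_traj))
print(feature_counts(truncate_or_pad(short_traj, 4)),
      feature_counts(truncate_or_pad(long_traj, 4)))
```

Whether truncating or padding is the right fix (versus, say, collecting only fixed-length rollouts) is exactly the open question here; the sketch only shows that the length dependence is real.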