When training on hardware, at some random point in training (not the same point every run), episodes suddenly start terminating after a single step. The batch of 2048 steps then becomes 2048 individual episodes of 1 step each, which naturally always yield 0 reward (for the QubeSwingupEnv).
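
To pin down the exact batch where this starts, a check like the one below could be run on every collected batch. It is only a minimal sketch: it assumes the training loop exposes the per-step done flags of a batch as a flat list, and the names `episode_lengths` and `batch_has_collapsed` are hypothetical helpers, not part of any library.

```python
from typing import List, Sequence


def episode_lengths(dones: Sequence[bool]) -> List[int]:
    """Return the lengths of all episodes that finished inside the batch."""
    lengths: List[int] = []
    current = 0
    for done in dones:
        current += 1
        if done:
            # A done flag closes the episode that is currently being counted.
            lengths.append(current)
            current = 0
    return lengths


def batch_has_collapsed(dones: Sequence[bool], max_len: int = 1) -> bool:
    """True when the whole batch consists of episodes of at most max_len steps."""
    lengths = episode_lengths(dones)
    if not lengths:
        return False
    # sum(lengths) == len(dones) means no unfinished episode is left at the end.
    return sum(lengths) == len(dones) and max(lengths) <= max_len


if __name__ == "__main__":
    degenerate = [True] * 2048            # every step ends an episode
    healthy = [False] * 2047 + [True]     # one long episode
    print(batch_has_collapsed(degenerate), batch_has_collapsed(healthy))
```

Logging the iteration number the first time `batch_has_collapsed` returns true would show whether the switch happens abruptly between two batches or partway through one.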