gpt padding.py verifies that forward() runs, using a small part of the dataset. The shape of the output still needs to be checked. See the newest version in the code.
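
As a rough illustration of what checking the output shape could look like, the sketch below pads a small batch of token sequences to a common length and asserts the shape of a stand-in forward pass. Every name in it (pad_batch, fake_forward, shape, PAD_ID, VOCAB_SIZE) is hypothetical and not taken from gpt padding.py, and the expected shape (batch, seq_len, vocab_size) is an assumption about what the real forward() returns.

```python
# Hypothetical sketch: none of these names come from gpt padding.py.
PAD_ID = 0         # assumed padding token id
VOCAB_SIZE = 10    # assumed toy vocabulary size


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]


def fake_forward(batch, vocab_size=VOCAB_SIZE):
    """Stand-in for a model's forward(): one score per vocabulary entry per token."""
    return [[[0.0] * vocab_size for _ in seq] for seq in batch]


def shape(nested):
    """Return the shape of a non-empty rectangular nested list as a tuple."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


# A small part of a dataset: three sequences of different lengths.
small_batch = pad_batch([[3, 1, 4], [1, 5], [9, 2, 6, 5]])
logits = fake_forward(small_batch)

# Assumed output shape: (batch, seq_len, vocab_size).
assert shape(small_batch) == (3, 4)
assert shape(logits) == (3, 4, VOCAB_SIZE)
```

With a real model, the same two assertions would be made on the tensor returned by forward() instead of on nested lists.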