scripts/preprocess.py should sort tokens lexicographically

Currently, indexes are assigned to tokens in order of first occurrence. If the text changes (for example, a typo is fixed, or a pre-trained model is trained on a different corpus), the indexes may be reassigned, which will break subsequent training initialized from a checkpoint.
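
To illustrate the difference, here is a minimal sketch. It is not the actual code in scripts/preprocess.py; the function names `build_vocab_first_occurrence` and `build_vocab_sorted` and the sample sentences are hypothetical, made up for this example only.

```python
# Hypothetical helpers for illustration; not taken from scripts/preprocess.py.

def build_vocab_first_occurrence(tokens):
    """Assign each token the next free index the first time it is seen."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


def build_vocab_sorted(tokens):
    """Assign indexes by lexicographic order of the distinct tokens."""
    return {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}


# Same set of tokens, different order in the text.
original = "the cat sat on the mat".split()
reordered = "on the mat the cat sat".split()

# First-occurrence indexing depends on token order in the text.
print(build_vocab_first_occurrence(original))
print(build_vocab_first_occurrence(reordered))

# Sorted indexing depends only on which distinct tokens are present.
print(build_vocab_sorted(original))
print(build_vocab_sorted(reordered))
```

With first-occurrence indexing the two texts produce different mappings, while the sorted variant produces the same mapping for both. Sorting only keeps indexes stable as long as the set of distinct tokens is unchanged; if tokens are added or removed, later indexes would still shift.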