What is the maximum number of tokens allowed per caption for the text encoder? Is it the usual 77, or can it be more? From what I can see, it uses the CLIP tokenizer.