The code `tokval = " ".join(tokval.split())` has the effect of normalizing whitespace. Is this normal for a benchmark intended to measure next-token prediction? Shouldn't whitespace be kept as-is, to measure the model's ability to predict the next token in "normal" code? Here is the location in the pre-processing script:
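A minimal illustration of what that line does (the string value is a made-up example; only the join/split pattern comes from the pre-processing script):

```python
# Hypothetical token value containing structural whitespace.
tokval = 'if x:\n    y = 1   # indented body'

# " ".join(tokval.split()) collapses ALL runs of whitespace
# (spaces, tabs, newlines) into single spaces, so indentation
# and line breaks are lost.
normalized = " ".join(tokval.split())
print(normalized)  # -> 'if x: y = 1 # indented body'
```

A model trained and evaluated on the normalized form never has to predict indentation or line breaks, which real code completion requires.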
Since the "line level code completion task shares the train/dev dataset with token level completion", the normalization might have even more impact there, giving overly optimistic results.
Maybe a dedicated token should be used in the pre-processing to distinguish between spaces that merely separate tokens and spaces that are part of the structure of the code?
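One possible direction (a sketch, assuming the pre-processing is built on Python's standard `tokenize` module): the tokenizer already exposes structural whitespace as explicit `INDENT`, `DEDENT`, and `NEWLINE` tokens, which could be kept as dedicated tokens instead of being collapsed.

```python
import io
import tokenize

# Small example program whose structure depends on whitespace.
src = "if x:\n    y = 1\n"

# tokenize emits explicit tokens for structural whitespace,
# so indentation and line breaks survive tokenization.
names = [tokenize.tok_name[tok.type]
         for tok in tokenize.generate_tokens(io.StringIO(src).readline)]
print(names)
# INDENT/DEDENT/NEWLINE appear alongside NAME/OP/NUMBER tokens.
```

Keeping these tokens in the benchmark data would let the model be scored on predicting code structure, not just the identifier/operator stream.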