Code-Code CodeCompletion-token not tokenizing all spaces #190

@markNZed

Description

The line `tokval = " ".join(tokval.split())` has the effect of normalizing whitespace. Is this appropriate for a benchmark intended to measure next-token prediction? Shouldn't whitespace be preserved so that the model's ability to predict the next token in "normal" code is measured? Here is the location in the pre-processing script:

https://github.com/microsoft/CodeXGLUE/blob/ac74a62802a0dd159b3258c78a2df8ad36cdf2b9/Code-Code/CodeCompletion-token/dataset/py150/preprocess.py#L53C17-L53C50
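For illustration, here is a minimal sketch of what that line does to a token value (the example string is hypothetical): `str.split()` with no argument splits on any run of whitespace, so rejoining with a single space collapses tabs, newlines, and repeated spaces.

```python
# str.split() with no argument splits on ANY run of whitespace
# (spaces, tabs, newlines), so the join collapses all of it to
# single spaces. Multi-line string literals are visibly affected.
docstring = '"""Example\n    with indentation\tand tabs."""'
normalized = " ".join(docstring.split())
print(normalized)
# '"""Example with indentation and tabs."""'
```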

"Line level code completion task shares the train/dev dataset with token level completion" so it might have more impact there - giving overly optimistic results..

Maybe an explicit token should be used in the pre-processing to distinguish between spaces that separate tokens and spaces that are part of the code's structure? A sketch of that idea follows.
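As a minimal sketch of that idea, not the CodeXGLUE script itself: Python's `tokenize` module already emits `INDENT`/`DEDENT` tokens, so the pre-processing could map them to explicit markers instead of dropping them. The `<INDENT>`/`<DEDENT>` marker names and the `tokens_with_structure` helper below are hypothetical; `<EOL>` mirrors the marker the existing script emits for newlines.

```python
from io import BytesIO
from tokenize import tokenize, INDENT, DEDENT, NEWLINE, NL, ENCODING, ENDMARKER

def tokens_with_structure(code: str):
    # Hypothetical variant of the pre-processing loop: structural
    # whitespace is kept as explicit <INDENT>/<DEDENT> markers
    # rather than being discarded.
    out = []
    for toknum, tokval, _, _, _ in tokenize(BytesIO(code.encode("utf8")).readline):
        if toknum == INDENT:
            out.append("<INDENT>")      # structural whitespace kept
        elif toknum == DEDENT:
            out.append("<DEDENT>")
        elif toknum in (NEWLINE, NL):
            out.append("<EOL>")         # mirrors the existing <EOL> marker
        elif toknum in (ENCODING, ENDMARKER):
            continue
        else:
            out.append(tokval)
    return out

print(tokens_with_structure("def f():\n    return 1\n"))
# ['def', 'f', '(', ')', ':', '<EOL>', '<INDENT>', 'return', '1', '<EOL>', '<DEDENT>']
```

This would let a model be scored on predicting indentation changes as ordinary next tokens, rather than having that structure erased before evaluation.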
