Description
I noticed, when calculating CodeBLEU on a snippet that contains no language keywords, that the results returned by the BLEU and weighted n-gram match components are not equal to each other. Looking into the code, the weighted n-gram match appears to use n-gram recall in its calculation, while regular BLEU uses precision. This does not match the description of weighted n-gram match in the CodeBLEU paper, where equation (2) states that precision is used. Similarly, the brevity penalty only makes sense if the intended calculation is precision: a generation can trivially achieve high precision by dropping reference tokens, whereas recall tends to reward longer generations.

My question is therefore whether recall is intended for the weighted n-gram match calculation, and whether we should expect BLEU and weighted n-gram match to return the same result on snippets without keywords.
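For concreteness, here is a minimal sketch (not the repository's implementation) of how the two denominators diverge on a keyword-free snippet with uniform token weights. The `weighted_ngram_precision` and `weighted_ngram_recall` helpers below are hypothetical names for illustration only; the point is that a shorter candidate gets full credit under precision but not under recall, which is also why a brevity penalty is only meaningful in the precision case.

```python
# Illustrative sketch only, assuming uniform weights (no keywords) and unigrams.
from collections import Counter

def clipped_matches(candidate, reference):
    # Count candidate tokens that also appear in the reference, clipped by
    # the reference counts (standard BLEU-style clipping).
    cand_counts = Counter(candidate)
    ref_counts = Counter(reference)
    return sum(min(c, ref_counts[tok]) for tok, c in cand_counts.items())

def weighted_ngram_precision(candidate, reference):
    # Precision-style denominator: number of candidate tokens,
    # as in equation (2) of the CodeBLEU paper (with uniform weights).
    return clipped_matches(candidate, reference) / max(len(candidate), 1)

def weighted_ngram_recall(candidate, reference):
    # Recall-style denominator: number of reference tokens.
    return clipped_matches(candidate, reference) / max(len(reference), 1)

reference = "x = a + b + c".split()   # 7 tokens, no language keywords
candidate = "x = a + b".split()       # shorter candidate, all tokens match

print(weighted_ngram_precision(candidate, reference))  # 1.0
print(weighted_ngram_recall(candidate, reference))     # 5/7 ≈ 0.714
```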