Description
I noticed, when calculating CodeBLEU on a snippet that contains no language keywords, that the results returned by the BLEU and weighted n-gram match components are not equal to each other. Looking into the code, the weighted n-gram match appears to use n-gram recall in its calculation, while regular BLEU uses precision. This does not match the description of weighted n-gram match in the CodeBLEU paper, where equation (2) states that precision is used. Similarly, the brevity penalty only makes sense if the intended calculation is precision: a generation can trivially achieve high precision by dropping reference tokens, whereas recall tends to reward longer generations.

My question is therefore whether recall is intended for the weighted n-gram match calculation, and whether we should expect BLEU and weighted n-gram match to return the same result on snippets without keywords.
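For concreteness, here is a minimal sketch (not the repository's implementation) of how the two denominators diverge on a keyword-free snippet with uniform token weights. The `weighted_ngram_precision` and `weighted_ngram_recall` helpers below are hypothetical names for illustration only; the point is that a shorter candidate gets full credit under precision but not under recall, which is also why a brevity penalty is only meaningful in the precision case.

```python
# Illustrative sketch only, assuming uniform weights (no keywords) and unigrams.
from collections import Counter

def clipped_matches(candidate, reference):
    # Count candidate tokens that also appear in the reference, clipped by
    # the reference counts (standard BLEU-style clipping).
    cand_counts = Counter(candidate)
    ref_counts = Counter(reference)
    return sum(min(c, ref_counts[tok]) for tok, c in cand_counts.items())

def weighted_ngram_precision(candidate, reference):
    # Precision-style denominator: number of candidate tokens,
    # as in equation (2) of the CodeBLEU paper (with uniform weights).
    return clipped_matches(candidate, reference) / max(len(candidate), 1)

def weighted_ngram_recall(candidate, reference):
    # Recall-style denominator: number of reference tokens.
    return clipped_matches(candidate, reference) / max(len(reference), 1)

reference = "x = a + b + c".split()   # 7 tokens, no language keywords
candidate = "x = a + b".split()       # shorter candidate, all tokens match

print(weighted_ngram_precision(candidate, reference))  # 1.0
print(weighted_ngram_recall(candidate, reference))     # 5/7 ≈ 0.714
```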