
Question about CodeBLEU inconsistencies #191

@qinlanshen

Description


I noticed that when calculating CodeBLEU on a snippet without language keywords, the results returned by the BLEU and weighted n-gram match components are not equal to each other. Looking into the code, it appears that weighted n-gram match uses n-gram recall in its calculation, while regular BLEU uses precision. This doesn't match the description of weighted n-gram match in the CodeBLEU paper, where equation (2) indicates that precision is used. Similarly, the brevity penalty only makes sense if the intended calculation is precision: one can trivially get high precision by dropping reference tokens from the generation, whereas recall tends to reward longer generations. My question is whether recall is actually intended for the weighted n-gram match calculation, and whether we should expect BLEU and weighted n-gram match to return the same result on snippets without keywords.
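For concreteness, here is a minimal sketch of the precision/recall difference I mean, with uniform weights and hypothetical helper names (this is not the repository's code, just an illustration of why the two denominators diverge when the candidate drops tokens):

```python
from collections import Counter

def ngram_counts(tokens, n):
    # Count n-grams in a token sequence.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_precision(candidate, reference, n=1):
    # Clipped matches divided by the number of candidate n-grams (BLEU-style).
    cand, ref = ngram_counts(candidate, n), ngram_counts(reference, n)
    matched = sum(min(c, ref[g]) for g, c in cand.items())
    return matched / max(sum(cand.values()), 1)

def ngram_recall(candidate, reference, n=1):
    # Clipped matches divided by the number of reference n-grams instead.
    cand, ref = ngram_counts(candidate, n), ngram_counts(reference, n)
    matched = sum(min(c, ref[g]) for g, c in cand.items())
    return matched / max(sum(ref.values()), 1)

reference = "return a + b".split()
candidate = "return a".split()  # tokens dropped from the reference

print(ngram_precision(candidate, reference))  # 1.0 -> dropping tokens still scores perfectly
print(ngram_recall(candidate, reference))     # 0.5 -> recall penalises the short candidate
```

With uniform weights (no keywords), a precision-based weighted n-gram match should reduce to the plain BLEU component, which is why I expected the two numbers to agree.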
