Description
I am a PhD student using CodeBLEU in my research on code generation evaluation. I noticed a discrepancy between the numbers reported in the paper and the results obtained using the official GitHub implementation.
In Example 1 of the paper, it is stated:
“The number of all sub-trees of the reference AST generated by tree-sitter is 21 and the hit number for the candidate is 13, so the syntactic AST match score is 13/21 ∗100 = 61.90(%)”
However, when I run the official CodeBLEU implementation on the same candidate and reference code, I obtain:
match_count = 11
total_count = 19
AST Syntax Match Score ≈ 0.5789
candidate_code = """
public static int Sign(double d){
return (float)((d==0)?0:(c<0.0)?-1:1);
"""
reference_code = """
public static int Sign(double d){
return (int)((d==0)?0:(d<0)?-1:1);
}
"""
This leads to a lower AST match score than what is reported in the paper.
Could you please clarify:
- Was Example 1 in the paper tested with the official implementation, or was it a simplified toy example for illustration?
- Is the official GitHub implementation considered the authoritative version, even when some example numbers differ from the paper?
Understanding this will help ensure that I interpret CodeBLEU results correctly in my research.
Thank you very much for your guidance!