Evaluation
The HumanEval dataset, introduced in Evaluating Large Language Models Trained on Code, is used as the evaluation set. It comprises 164 hand-written programming problems; each problem includes a function signature, docstring, body, and several unit tests, with an average of 7.7 tests per problem. Unlike metrics such as CodeBLEU, which score generated code by its syntactic and semantic similarity to a reference solution, HumanEval evaluates functional correctness: a completion counts as correct only if it passes the problem's unit tests. Benchmark results for our models are in the table further below.
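As a rough illustration of what functional correctness means here, the sketch below assembles a program from a HumanEval problem record (the dataset's `prompt`, `test`, and `entry_point` fields) and a model completion, then runs the unit tests. This is only a sketch: the official openai/human-eval harness runs candidates in isolated subprocesses with timeouts rather than a bare `exec`.

```python
def passes_unit_tests(problem: dict, completion: str) -> bool:
    """Return True if the model completion passes the problem's unit tests.

    `problem` is a HumanEval-style record with "prompt" (signature + docstring),
    "test" (defines check(candidate)), and "entry_point" (the function name).
    """
    # Assemble a full program: signature/docstring, generated body, unit tests,
    # and a call to the checker defined inside the test string.
    program = (
        problem["prompt"]
        + completion
        + "\n"
        + problem["test"]
        + f"\ncheck({problem['entry_point']})\n"
    )
    try:
        exec(program, {"__name__": "__main__"})  # no sandboxing; illustration only
        return True
    except Exception:
        return False
```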
| Model | pass@1 | pass@2 | pass@5 | pass@10 |
|---|---|---|---|---|
| EleutherAI/gpt-neo | 0.12% | 0.24% | 0.61% | 1.22% |
| gpt-neo-125M-apps | 0.06% | 0.12% | 0.30% | 0.61% |
| dedup-filtered-no-resize-2048bs | 0.00% | 0.00% | 0.00% | 0.00% |
| 1024-filtered | 0.00% | 0.00% | 0.00% | 0.00% |
| dedup-2048 | 0.00% | 0.00% | 0.00% | 0.00% |
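The pass@k numbers above are estimated as defined in the Codex paper: for each problem, n completions are sampled, c of them pass the unit tests, and pass@k = 1 − C(n−c, k)/C(n, k), averaged over all 164 problems. A minimal numpy sketch of that estimator (the sample values in the last line are illustrative only):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k),
    where n completions were sampled and c of them passed the tests."""
    if n - c < k:
        return 1.0  # every possible k-subset contains at least one passing sample
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Illustrative values only: 200 samples for one problem, 3 of which passed.
print(pass_at_k(n=200, c=3, k=10))  # chance that at least one of 10 drawn samples passes
```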
- APPS eval description
- Our evaluation script and how to run it
- Issues we had with getting the APPS eval to work