
Evaluation

  • HumanEval Dataset

The HumanEval dataset is the evaluation set used in the work Evaluating Large Language Models Trained on Code. It comprises 164 human-written programming problems. Each problem includes a function signature, docstring, body, and several unit tests, with an average of 7.7 tests per problem. It aims to evaluate functional correctness, unlike the standard CodeBLEU metric, which compares the syntactic and semantic structure of the generated code against a reference solution. We benchmark our models on it below.

| Model | pass@1 | pass@2 | pass@5 | pass@10 |
| --- | --- | --- | --- | --- |
| EleutherAI/gpt-neo | 0.12% | 0.24% | 0.61% | 1.22% |
| gpt-neo-125M-apps | 0.06% | 0.12% | 0.30% | 0.61% |
| dedup-filtered-no-resize-2048bs | 0.00% | 0.00% | 0.00% | 0.00% |
| 1024-filtered | 0.00% | 0.00% | 0.00% | 0.00% |
| dedup-2048 | 0.00% | 0.00% | 0.00% | 0.00% |
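
The pass@k numbers above follow the unbiased estimator introduced in the HumanEval paper: sample n ≥ k completions per problem, count the number c that pass all unit tests, and compute 1 - C(n-c, k)/C(n, k), averaged over the 164 problems. A minimal sketch of that estimator is below (the function name and the example counts are illustrative, not the exact code used to produce the table):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).

    n: total completions sampled for a problem
    c: completions that pass all unit tests
    k: the k in pass@k
    """
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Illustrative aggregation: (n, c) pairs per problem, averaged to get pass@10.
results = [(200, 3), (200, 0), (200, 11)]
print(np.mean([pass_at_k(n, c, k=10) for n, c in results]))
```

OpenAI's human-eval repository ships an equivalent estimator together with an evaluate_functional_correctness entry point that executes the unit tests against a file of generated samples.
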
  • APPS eval description
  • Our evaluation script and how to run it
  • Issues we had with getting APPS eval to work
