
Evaluation

  • HumanEval Dataset

The HumanEval dataset is the evaluation set used in the work Evaluating Large Language Models Trained on Code. It comprises 164 human-written programming problems. Each problem includes a function signature, docstring, body, and several unit tests, with an average of 7.7 tests per problem. It aims to evaluate functional correctness, unlike the standard CodeBLEU metric, which compares the syntactic and semantic structure of the generated code against a reference solution. We benchmark our models on it below.

| Model | pass@1 | pass@2 | pass@5 | pass@10 |
| --- | --- | --- | --- | --- |
| EleutherAI/gpt-neo | 0.12% | 0.24% | 0.61% | 1.22% |
| gpt-neo-125M-apps | 0.06% | 0.12% | 0.30% | 0.61% |
| dedup-filtered-no-resize-2048bs | 0.00% | 0.00% | 0.00% | 0.00% |
| 1024-filtered | 0.00% | 0.00% | 0.00% | 0.00% |
| dedup-2048 | 0.00% | 0.00% | 0.00% | 0.00% |
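
The pass@k numbers above follow the unbiased estimator introduced in the HumanEval paper: sample n ≥ k completions per problem, count the number c that pass all unit tests, and compute 1 - C(n-c, k)/C(n, k), averaged over the 164 problems. A minimal sketch of that estimator is below (the function name and the example counts are illustrative, not the exact code used to produce the table):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).

    n: total completions sampled for a problem
    c: completions that pass all unit tests
    k: the k in pass@k
    """
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Illustrative aggregation: (n, c) pairs per problem, averaged to get pass@10.
results = [(200, 3), (200, 0), (200, 11)]
print(np.mean([pass_at_k(n, c, k=10) for n, c in results]))
```

OpenAI's human-eval repository ships an equivalent estimator together with an evaluate_functional_correctness entry point that executes the unit tests against a file of generated samples.
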
  • APPS eval description
  • Our evaluation script and how to run it
  • Issues we had with getting APPS eval to work
