Performance
HumanEval Benchmark
A dataset of 164 hand-written Python programming problems used to evaluate code generation models. HumanEval measures functional correctness by running generated code against unit tests.
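HumanEval results are typically reported as pass@k: the probability that at least one of k sampled completions passes the unit tests. A sketch of the standard unbiased estimator, assuming n completions were generated per problem and c of them passed:

```python
def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn without replacement from n generations is correct,
    given that c of the n generations passed the unit tests."""
    if n - c < k:
        # Too few failures to fill a sample of size k: some draw must succeed.
        return 1.0
    # 1 - C(n-c, k) / C(n, k), computed as a numerically stable product.
    prob_all_fail = 1.0
    for i in range(n - c + 1, n + 1):
        prob_all_fail *= 1.0 - k / i
    return 1.0 - prob_all_fail
```

For example, with 2 generations of which 1 passed, pass@1 is 0.5; computing the estimator as a running product avoids the overflow that direct binomial coefficients can cause at large n.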