Performance

HumanEval Benchmark

A dataset of 164 hand-written Python programming problems used to evaluate code generation models. HumanEval measures functional correctness by running generated code against unit tests.
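Conceptually, evaluation boils down to two steps: execute each model-generated completion against the problem's unit tests, then aggregate the per-sample results into a pass@k score. The sketch below illustrates this under stated assumptions; the function names are illustrative, the real harness sandboxes execution with timeouts, and it assumes each problem's test code defines a check(candidate) function and names an entry point, as in the released dataset. The pass@k estimator is the unbiased form 1 - C(n-c, k) / C(n, k) from the original HumanEval paper, computed as a stable product.

```python
import numpy as np

def check_correctness(candidate_code: str, test_code: str, entry_point: str) -> bool:
    """Illustrative sketch: run a generated completion against a problem's unit tests.
    The real harness executes this in an isolated process with a timeout."""
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)             # define the generated function
        exec(test_code, namespace)                  # define check(candidate)
        namespace["check"](namespace[entry_point])  # run the asserts
        return True
    except Exception:
        return False

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k).
    n: samples generated per problem, c: samples that passed, k: budget."""
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))
```

For example, if a model produces n = 20 samples for a problem and c = 5 of them pass the tests, pass_at_k(20, 5, 1) gives the expected success rate when only one sample is drawn (0.25), while pass_at_k(20, 5, 10) estimates the chance that at least one of 10 drawn samples passes.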