Performance

SWE-bench

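A benchmark built from real GitHub issues in popular open-source Python repositories, used to evaluate AI models and agents on software engineering tasks. Given a repository and an issue description, the model must produce a patch that resolves the issue; the patch is judged by running the repository's tests against the patched code.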
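To make "a correct patch" concrete, here is a minimal sketch of a test-based pass/fail check, assuming each task lists tests that should go from failing to passing and tests that should keep passing. The function names, the shape of the `results` dictionary, and the test names are hypothetical illustrations, not the benchmark's actual harness.

```python
from typing import Dict, Iterable


def is_resolved(
    fail_to_pass: Iterable[str],
    pass_to_pass: Iterable[str],
    results: Dict[str, bool],
) -> bool:
    """Return True if every required test passed after the patch was applied.

    fail_to_pass: tests expected to fail before the fix and pass after it.
    pass_to_pass: tests that passed before and must not regress.
    results: hypothetical mapping of test name -> passed (True/False).
    A test missing from `results` is treated as a failure.
    """
    required = list(fail_to_pass) + list(pass_to_pass)
    return all(results.get(test, False) for test in required)


def resolve_rate(outcomes: Iterable[bool]) -> float:
    """Fraction of tasks resolved; 0.0 when there are no tasks."""
    outcomes = list(outcomes)
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


# Hypothetical example: the patch fixes the bug but breaks an existing test.
results = {"test_bug_fixed": True, "test_existing_a": True, "test_existing_b": False}
resolved = is_resolved(["test_bug_fixed"], ["test_existing_a", "test_existing_b"], results)
```

In this example `resolved` is `False`: fixing the reported bug is not enough if the patch breaks behavior that previously worked.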
