Performance
SWE-bench
A benchmark built from real GitHub issues in popular Python repositories, used to evaluate AI agents on software engineering tasks. SWE-bench tests whether a model can understand a bug report and produce a correct patch.
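As a rough illustration of what "a correct patch" can mean in practice, the sketch below scores a patch by test outcomes. It assumes each task comes with two groups of tests: ones that should start passing once the issue is fixed, and ones that should keep passing. The names `TaskResult`, `is_resolved`, and `resolved_rate` are hypothetical and chosen for this example; this is not the benchmark's own harness.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Test outcomes after applying a model's patch (hypothetical structure)."""
    task_id: str
    # Tests expected to go from failing to passing; value is True if the test passed.
    fail_to_pass: dict[str, bool]
    # Tests expected to keep passing; value is True if the test still passed.
    pass_to_pass: dict[str, bool]


def is_resolved(result: TaskResult) -> bool:
    """A task counts as resolved only if every target test now passes
    and no previously passing test was broken."""
    # Assumption: a task with no target tests cannot be counted as resolved.
    if not result.fail_to_pass:
        return False
    return all(result.fail_to_pass.values()) and all(result.pass_to_pass.values())


def resolved_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks resolved; 0.0 for an empty list."""
    if not results:
        return 0.0
    return sum(is_resolved(r) for r in results) / len(results)
```

Under this scoring, a patch that fixes the reported bug but breaks an unrelated test does not count as resolved, which is why the second group of tests is checked as well.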