There is already a benchmark designed for fixing issues across a full repository: SWE-bench, https://www.swebench.com/
It is already used to evaluate other tools such as OpenDevin, AutoCodeRover, Aider, and SWE-Agent.
Reference for testing: https://github.com/aorwall/SWE-bench-docker