Software Engineering & AI
PerfBench: Can Agents Resolve Real-World Performance Bugs?
Performance bugs are inefficiencies in software that waste computational resources without causing functional failures, making them particularly challenging to detect and fix. This paper introduces PerfBench, a benchmark comprising 81 real-world performance bug-fixing tasks, revealing that state-of-the-art coding agents struggle significantly but can be improved with performance-aware tooling and instructions.
Quantifiable Impact & Key Findings
Our research uncovers critical insights into AI agent capabilities for performance optimization, highlighting both challenges and significant potential for improvement.
Deep Analysis & Enterprise Applications
Background and Related Work
Performance bugs are a distinct class of software defects that degrade efficiency without causing functional failures, and they tend to be harder to detect and fix than functional bugs. Recent software engineering agents have shown promise in automated bug fixing, but existing benchmarks focus primarily on functional correctness. This leaves a significant gap in understanding how well these agents handle non-functional bugs such as performance or security issues.
PerfBench Construction
PerfBench is a benchmark specifically designed to evaluate software engineering agents on performance bug fixing tasks in .NET applications. It comprises 81 carefully curated and manually verified tasks from popular open-source .NET repositories on GitHub, each representing a real performance issue fixed by developers. The benchmark features a novel evaluation harness for agent-generated performance benchmarks and validates fixes by comparing execution metrics.
Experimental Setup
The evaluation harness automates the entire testing process using agent-generated BenchmarkDotNet tests. We execute the agent-written benchmarks before and after the agent's changes, along with the existing unit tests. Metrics include Success Rate, Performance Improvement (%, KB, ms), Token Usage, Steps Taken, and Dollar Cost. We evaluated OpenHands agents in baseline and performance-aware configurations with GPT-4.1 and Claude Sonnet 4.
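The before-and-after comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the PerfBench harness itself: the `BenchmarkRun` structure, the function names, and the 10% improvement threshold are all assumptions made for the example, and the source does not state what threshold the real harness applies.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkRun:
    """Metrics from one benchmark execution (hypothetical structure)."""
    mean_time_ms: float      # mean execution time, in milliseconds
    allocated_kb: float      # memory allocated per operation, in kilobytes
    unit_tests_passed: bool  # whether the existing unit tests still pass


def improvement_pct(before: float, after: float) -> float:
    """Relative reduction of a lower-is-better metric, in percent."""
    if before <= 0:
        return 0.0  # no meaningful baseline to compare against
    return (before - after) / before * 100.0


def is_successful_fix(before: BenchmarkRun, after: BenchmarkRun,
                      min_improvement_pct: float = 10.0) -> bool:
    """Accept a fix only if tests pass and time or memory improves enough.

    The default threshold is an assumption for illustration only.
    """
    if not after.unit_tests_passed:
        return False  # a faster but functionally broken patch does not count
    time_gain = improvement_pct(before.mean_time_ms, after.mean_time_ms)
    memory_gain = improvement_pct(before.allocated_kb, after.allocated_kb)
    return max(time_gain, memory_gain) >= min_improvement_pct


if __name__ == "__main__":
    before = BenchmarkRun(mean_time_ms=120.0, allocated_kb=2048.0, unit_tests_passed=True)
    after = BenchmarkRun(mean_time_ms=90.0, allocated_kb=2048.0, unit_tests_passed=True)
    print(f"time improvement: {improvement_pct(before.mean_time_ms, after.mean_time_ms):.1f}%")
    print(f"accepted: {is_successful_fix(before, after)}")
```

Gating on the unit tests first reflects the point made in the text: a patch that speeds things up but breaks functionality should not be counted as a resolved performance bug.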
Benchmark Results
Our evaluation reveals that current state-of-the-art coding agents struggle with performance optimization tasks. The baseline OpenHands agent achieves only a ~3% success rate. However, by developing OpenHands-Perf-Agent, which incorporates performance-aware tooling and instructions, we achieved a ~20% success rate, demonstrating a substantial improvement and the potential for targeted approaches.
Enterprise AI Agent Workflow for Performance Optimization
| Feature | Baseline OpenHands (GPT-4.1) | OpenHands-Perf-Agent (GPT-4.1) |
|---|---|---|
| Success Rate | 1.2% (1/81) | 14.8% (12/81) |
| Avg Steps | 47.2 | 84.3 |
| Avg Tokens | 1.3M | 1.9M |
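To make the trade-off in the GPT-4.1 comparison concrete, the short Python sketch below recomputes the success rates from the raw counts and derives a rough "tokens spent per resolved task" figure. It assumes that the reported average steps and tokens are per-task averages over all 81 tasks, which the source does not state explicitly, so the derived figures are indicative only.

```python
TOTAL_TASKS = 81

# Values copied from the comparison table (GPT-4.1 configurations).
configs = {
    "Baseline OpenHands": {"resolved": 1, "avg_steps": 47.2, "avg_tokens_m": 1.3},
    "OpenHands-Perf-Agent": {"resolved": 12, "avg_steps": 84.3, "avg_tokens_m": 1.9},
}


def success_rate_pct(resolved: int, total: int = TOTAL_TASKS) -> float:
    """Share of tasks resolved, in percent."""
    return resolved / total * 100.0


def tokens_per_resolved_m(resolved: int, avg_tokens_m: float,
                          total: int = TOTAL_TASKS) -> float:
    """Total tokens (millions) spent across all tasks per resolved task.

    Assumes avg_tokens_m is a per-task average over every task attempted.
    """
    return avg_tokens_m * total / resolved


if __name__ == "__main__":
    for name, c in configs.items():
        rate = success_rate_pct(c["resolved"])
        cost = tokens_per_resolved_m(c["resolved"], c["avg_tokens_m"])
        print(f"{name}: {rate:.1f}% resolved, ~{cost:.1f}M tokens per resolved task")

    base, perf = configs["Baseline OpenHands"], configs["OpenHands-Perf-Agent"]
    print(f"step overhead: {perf['avg_steps'] / base['avg_steps']:.2f}x")
    print(f"token overhead: {perf['avg_tokens_m'] / base['avg_tokens_m']:.2f}x")
```

Under that assumption, the performance-aware configuration uses more steps and tokens per task but far fewer tokens per resolved task, because it resolves twelve tasks rather than one.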
Case Study: Optimizing Memory Allocation in .NET
One critical finding from PerfBench is the prevalence of memory management issues, which account for over 40% of all performance bugs. Our OpenHands-Perf-Agent achieved an 18.2% success rate in this category, compared to 6.0% for the baseline agent.
This improvement was achieved by explicitly guiding the agent to use BenchmarkDotNet with MemoryDiagnoser, allowing it to identify and resolve excessive allocations. For instance, in one task, the agent successfully refactored a collection initialization to prevent an OutOfMemoryException, leading to significant memory savings.