Programming with Pixels

Towards Generalist Software Engineering Agents

Pranjal Aggarwal, Sean Welleck

Carnegie Mellon University



Goal of PwP

Our motivating hypothesis is that achieving general-purpose Software Engineering (SWE) agents requires a shift to computer-use agents that interact with computers as humans do: by observing the screen, typing, and clicking.

  • We recast software engineering as interacting directly with an IDE through visual perception and basic actions
  • This allows agents to perform any task possible in an IDE and leverage all tools without requiring specialized APIs
  • To achieve this goal, we introduce:
    1. Programming with Pixels (PwP): A software engineering agent environment for computer-use agents 🖥️
    2. PwP-Bench: A benchmark spanning 15 diverse SWE tasks across 8 programming languages 📊
    3. Experimental results showing that general-purpose computer-use agents can approach or even surpass specialized tool-based agents 📈

The Limits of the Tool-Based Paradigm

Recent SWE agents have largely followed a tool-based paradigm, where agents interact with hand-engineered tool APIs to perform specific tasks. While effective for specialized tasks, these methods fundamentally lack generalization, as they:

  • Require predefined tools for each task
  • Do not scale across programming languages and domains
  • Need significant human effort to implement and may contain bugs
  • Are limited to the predefined actions provided by the tool API

For example, an agent designed to manage GitHub pull requests lacks debugging abilities unless specifically programmed into the agent's API.

The PwP Paradigm

[Figure: PwP framework diagram]

PwP is a VSCode-based IDE environment where agents perceive the screen and use primitive actions such as typing, pointing, and clicking (sketched below). This provides two key advantages:

  • Expressiveness: Allows agents to complete any software engineering task achievable in an IDE, without language or domain-specific modifications
  • Natural Tool Access: Agents can interact with any tools available in the IDE—including debuggers, linters, and code suggestions—through basic actions. No need to reinvent the wheel for AI agents!
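As a concrete illustration, primitive actions can be thought of as low-level input events. The schema below is a hypothetical sketch, not PwP's actual action format:

# Hypothetical action representations for a screen-based agent.
# Field names ("type", "x", "keys", ...) are illustrative assumptions;
# PwP's actual action schema may differ.
click_run_button = {"type": "click", "x": 412, "y": 305}
open_palette     = {"type": "key",   "keys": "ctrl+shift+p"}
type_code        = {"type": "type",  "text": "def parse_config(path):"}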

Key Findings

  • Promising Performance: General-purpose computer-use agents can approach or exceed state-of-the-art tool-based agents without task-specific modifications
  • Visual Grounding Issues: Even state-of-the-art computer-use agents struggle with visual grounding in complex IDE interfaces
  • Limited Tool Use: Current agents lack the ability to use many IDE tools that could make their tasks trivial
  • Performance Gap: Only one model, Anthropic's Claude, performs well, highlighting the need for further research
  • Untapped Potential: When agents can directly access IDE tools without visual interaction, they show significant performance improvements

Core Contributions

PwP Environment

[Figure: PwP environment architecture]
The PwP environment provides a VSCode-based IDE within a sandboxed container, enabling agents to interact through screen observations and primitive actions.

We introduce the first software engineering-focused environment for evaluating computer-use agents, using a modified VSCode IDE with:

  • Full multimodality support (text, image, video, audio)
  • General observation space (screenshots) and action space (typing, clicking)
  • Complete state checkpointing for reinforcement learning (sketched below)
  • Access to all IDE tools through visual interaction
  • Execution-based reward calculation
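
For reinforcement learning, state checkpointing lets an agent snapshot the environment and branch multiple rollouts from the same point. Below is a minimal sketch of that pattern, assuming hypothetical save_checkpoint/restore_checkpoint methods (not confirmed PwP API) and an env handle like the one created in the example that follows:

# Minimal sketch of checkpoint-and-branch rollouts for RL.
# NOTE: save_checkpoint/restore_checkpoint are hypothetical method
# names, used only to illustrate the pattern.
ckpt = env.save_checkpoint()             # snapshot the full IDE state
for action in candidate_actions:         # e.g., actions sampled from a policy
    env.restore_checkpoint(ckpt)         # rewind to the snapshot
    env.step(action)                     # explore one branch
    reward = bench.get_reward(env, row)  # execution-based score for the branch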

How to interact with PwP

from pwp import PwPBench

# Load the benchmark for a specific task (here: design2code)
bench = PwPBench('design2code')

# Fetch the task's dataset and pick one example
dataset = bench.get_dataset()
row = dataset[0]

# Spin up a sandboxed IDE environment configured for this example
env = bench.get_env(row)

# Agent loop: observe the screen, choose an action, execute it
# (`agent` is any policy exposing get_action(obs); see the sketch below)
for step in range(20):
    obs = env.get_observation()     # e.g., a screenshot of the IDE
    action = agent.get_action(obs)
    env.step(action)                # primitive action: typing, clicking, ...

# Execution-based reward for the final environment state
score = bench.get_reward(env, row)
Simple API for Computer-Use Agents

PwP provides a straightforward API for evaluating computer-use agents:

  • Initialize: Create a PwPBench instance for a specific task
  • Load data: Get dataset and select specific examples
  • Set up environment: Configure the task-specific environment
  • Agent loop: Observe the screen, decide actions, and execute them
  • Evaluation: Calculate execution-based rewards for the agent's performance

This simplified interface works across all 15 tasks and 8 programming languages without any task-specific modifications!
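
The agent in the loop above is any object that exposes get_action(observation). As a minimal sketch, here is a toy baseline; the class and the action format are illustrative assumptions, not part of PwP:

import random

class RandomClickAgent:
    """Toy policy that clicks at random screen coordinates.
    Illustrative only: a real agent would pass the screenshot to a
    vision-language model and decode its output into an action."""

    def __init__(self, width=1280, height=800):
        self.width, self.height = width, height

    def get_action(self, obs):
        # obs is the screen observation (e.g., a screenshot); ignored here.
        return {"type": "click",
                "x": random.randrange(self.width),
                "y": random.randrange(self.height)}

agent = RandomClickAgent()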

PwP-Bench

Key Features

  • Comprehensive Coverage: Unification of 15 SWE tasks spanning 8 programming languages, multiple modalities, and domains
  • Multimodal Evaluation: Tasks across text, images, and combined modalities
  • Realistic & Execution-Based: Based on actual development scenarios, with evaluation that tests functionality rather than text similarity
  • Easily Extensible: Adding a new task requires just two scripts (see the sketch below), making future adaptations simple

Want your dataset included? We welcome contributions! Contact us to discuss integrating your tasks into PwP-Bench.
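
As a rough sketch, the two scripts for a new task might look like the following; the function names, signatures, and the env.run_command helper are hypothetical illustrations, not the actual PwP-Bench interface:

# Hypothetical sketch of the two pieces a new task contributes.
# All names and signatures below are illustrative assumptions.

# 1) Dataset script: yield one dict per task instance
def get_dataset():
    return [{
        "task_id": "toy-1",
        "setup_files": {"main.py": "print('hi')"},
        "instruction": "Edit main.py so it prints 'hello'",
    }]

# 2) Reward script: execution-based check of the final environment state
def get_reward(env, row):
    output = env.run_command("python main.py")  # hypothetical helper
    return float("hello" in output)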

Explore Datasets

PwP-Bench tasks fall into four categories: Code Generation & Editing, Multimodal Code Generation, Domain-Specific Code Generation, and General SWE Tasks.

Agent Analysis

[Figure: Comparison of different agent models on PwP-Bench tasks]

Our extensive evaluation of computer-use agents reveals:

  • Computer-use agents can match or exceed specialized tool-based agents on many tasks
  • Current models suffer from poor visual grounding in complex IDE interfaces
  • Agents fail to edit files effectively due to visual perception limitations
  • The Claude Computer-Use agent demonstrates basic IDE tool proficiency but struggles with advanced tools
  • Current agents are largely incapable of recovering from errors
  • Training models to better utilize IDE tools would substantially improve performance

Common Agent Limitations

  • Poor Visual Grounding: Agents struggle to identify the correct UI elements for interaction
  • File Editing Failures: Agents incorrectly position text when editing files

PwP in Action

Code Editing Demo

The agent fixes a bug in a Python function by analyzing the code context, locating the error, and implementing the correction through natural IDE interactions.

UI Development Demo

The agent is tasked with creating a simple HTML page from an image. It opens the image and a live preview of the code it writes side by side, using the preview to iteratively improve the code.

IDE Configuration Demo

The agent configures development environment settings, installing necessary extensions and adjusting workspace preferences through visual interaction with VSCode.

Code Refactoring Demo

The agent is tasked with renaming a variable in a complex Python repository. It successfully uses VSCode's rename functionality to rename the variable correctly across the project.

Leaderboard

Model                                                % Resolved   Date
🏆 Computer-Use Agent (Claude-3.5 Sonnet, 20241022)    46.8       2025-02-24
🥈 Computer-Use Agent (GPT-4o)                         32.3       2025-02-24
🥉 Computer-Use Agent (Gemini-1.5 Pro)                 18.1       2025-02-24
   Computer-Use Agent (GPT-4o-mini)                    17.8       2025-02-24
   Computer-Use Agent (Gemini-1.5 Flash)                8.9       2025-02-24

Citation

@misc{aggarwal2025programmingpixelscomputerusemeets,
      title={Programming with Pixels: Computer-Use Meets Software Engineering}, 
      author={Pranjal Aggarwal and Sean Welleck},
      year={2025},
      eprint={2502.18525},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2502.18525}, 
}