Programming with Pixels

Towards Generalist Software Engineering Agents

Pranjal Aggarwal, Sean Welleck

Carnegie Mellon University



Goal of PwP

Our motivating hypothesis is that achieving general-purpose Software Engineering (SWE) agents requires a shift to computer-use agents that interact with computers as humans do: by observing the screen, typing, and clicking.

  • We recast software engineering as interacting directly with an IDE through visual perception and basic actions
  • This allows agents to perform any task possible in an IDE and leverage all tools without requiring specialized APIs
  • To achieve this goal, we introduce:
    1. Programming with Pixels (PwP): A software engineering agent environment for computer-use agents 🖥️
    2. PwP-Bench: A benchmark spanning 15 diverse SWE tasks across 8 programming languages 📊
    3. Experimental results showing that general-purpose computer-use agents can approach or even surpass specialized tool-based agents 📈

The Limits of the Tool-Based Paradigm

Recent SWE agents have largely followed a tool-based paradigm, where agents interact with hand-engineered tool APIs to perform specific tasks. While effective for specialized tasks, these methods fundamentally lack generalization, as they:

  • Require predefined tools for each task
  • Do not scale across programming languages and domains
  • Need significant human effort to implement and may contain bugs
  • Are limited to the predefined actions provided by the tool API

For example, an agent designed to manage GitHub pull requests lacks debugging abilities unless specifically programmed into the agent's API.

The PwP Paradigm

[Figure: PwP framework diagram]

PwP is a VSCode-based IDE environment where agents perceive the screen and use primitive actions such as typing, pointing, and clicking (sketched below). This provides two key advantages:

  • Expressiveness: Allows agents to complete any software engineering task achievable in an IDE, without language or domain-specific modifications
  • Natural Tool Access: Agents can interact with any tools available in the IDE—including debuggers, linters, and code suggestions—through basic actions. No need to reinvent the wheel for AI agents!
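As a concrete illustration, primitive actions can be thought of as low-level input events. The schema below is a hypothetical sketch, not PwP's actual action format:

# Hypothetical action representations for a screen-based agent.
# Field names ("type", "x", "keys", ...) are illustrative assumptions;
# PwP's actual action schema may differ.
click_run_button = {"type": "click", "x": 412, "y": 305}
open_palette     = {"type": "key",   "keys": "ctrl+shift+p"}
type_code        = {"type": "type",  "text": "def parse_config(path):"}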

Key Findings

  • Promising Performance: General-purpose computer-use agents can approach or exceed state-of-the-art tool-based agents without task-specific modifications
  • Visual Grounding Issues: Even state-of-the-art computer-use agents struggle with visual grounding in complex IDE interfaces
  • Limited Tool Use: Current agents lack the ability to use many IDE tools that could make their tasks trivial
  • Performance Gap: Only one model, Anthropic's Claude, performs well, highlighting the need for further research
  • Untapped Potential: When agents can directly access IDE tools without visual interaction, they show significant performance improvements

Core Contributions

PwP Environment

[Figure: PwP environment architecture]
The PwP environment provides a VSCode-based IDE within a sandboxed container, enabling agents to interact through screen observations and primitive actions.

We introduce the first software engineering-focused environment for evaluating computer-use agents, using a modified VSCode IDE with:

  • Full multimodality support (text, image, video, audio)
  • General observation space (screenshots) and action space (typing, clicking)
  • Complete state checkpointing for reinforcement learning (sketched below)
  • Access to all IDE tools through visual interaction
  • Execution-based reward calculation
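
For reinforcement learning, state checkpointing lets an agent snapshot the environment and branch multiple rollouts from the same point. Below is a minimal sketch of that pattern, assuming hypothetical save_checkpoint/restore_checkpoint methods (not confirmed PwP API) and an env handle like the one created in the example that follows:

# Minimal sketch of checkpoint-and-branch rollouts for RL.
# NOTE: save_checkpoint/restore_checkpoint are hypothetical method
# names, used only to illustrate the pattern.
ckpt = env.save_checkpoint()             # snapshot the full IDE state
for action in candidate_actions:         # e.g., actions sampled from a policy
    env.restore_checkpoint(ckpt)         # rewind to the snapshot
    env.step(action)                     # explore one branch
    reward = bench.get_reward(env, row)  # execution-based score for the branch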

How to interact with PwP

from pwp import PwPBench

# Load the benchmark for a specific task (here: design2code)
bench = PwPBench('design2code')

# Fetch the task's dataset and pick one example
dataset = bench.get_dataset()
row = dataset[0]

# Spin up a sandboxed IDE environment configured for this example
env = bench.get_env(row)

# Agent loop: observe the screen, choose an action, execute it
# (`agent` is any policy exposing get_action(obs); see the sketch below)
for step in range(20):
    obs = env.get_observation()     # e.g., a screenshot of the IDE
    action = agent.get_action(obs)
    env.step(action)                # primitive action: typing, clicking, ...

# Execution-based reward for the final environment state
score = bench.get_reward(env, row)
Simple API for Computer-Use Agents

PwP provides a straightforward API for evaluating computer-use agents:

  • Initialize: Create a PwPBench instance for a specific task
  • Load data: Get dataset and select specific examples
  • Set up environment: Configure the task-specific environment
  • Agent loop: Observe the screen, decide actions, and execute them
  • Evaluation: Calculate execution-based rewards for the agent's performance

This simplified interface works across all 15 tasks and 8 programming languages without any task-specific modifications!
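
The agent in the loop above is any object that exposes get_action(observation). As a minimal sketch, here is a toy baseline; the class and the action format are illustrative assumptions, not part of PwP:

import random

class RandomClickAgent:
    """Toy policy that clicks at random screen coordinates.
    Illustrative only: a real agent would pass the screenshot to a
    vision-language model and decode its output into an action."""

    def __init__(self, width=1280, height=800):
        self.width, self.height = width, height

    def get_action(self, obs):
        # obs is the screen observation (e.g., a screenshot); ignored here.
        return {"type": "click",
                "x": random.randrange(self.width),
                "y": random.randrange(self.height)}

agent = RandomClickAgent()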

PwP-Bench

Key Features

  • Comprehensive Coverage: Unification of 15 SWE tasks spanning 8 programming languages, multiple modalities, and domains
  • Multimodal Evaluation: Tasks across text, images, and combined modalities
  • Realistic & Execution-Based: Based on actual development scenarios, with evaluation that tests functionality rather than text similarity
  • Easily Extensible: Adding a new task requires just two scripts (see the sketch below), making future adaptations simple

Want your dataset included? We welcome contributions! Contact us to discuss integrating your tasks into PwP-Bench.
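
As a rough sketch, the two scripts for a new task might look like the following; the function names, signatures, and the env.run_command helper are hypothetical illustrations, not the actual PwP-Bench interface:

# Hypothetical sketch of the two pieces a new task contributes.
# All names and signatures below are illustrative assumptions.

# 1) Dataset script: yield one dict per task instance
def get_dataset():
    return [{
        "task_id": "toy-1",
        "setup_files": {"main.py": "print('hi')"},
        "instruction": "Edit main.py so it prints 'hello'",
    }]

# 2) Reward script: execution-based check of the final environment state
def get_reward(env, row):
    output = env.run_command("python main.py")  # hypothetical helper
    return float("hello" in output)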

Explore Datasets

PwP-Bench tasks fall into four categories: Code Generation & Editing, Multimodal Code Generation, Domain-Specific Code Generation, and General SWE Tasks.

Agent Analysis

[Figure: Comparison of different agent models on PwP-Bench tasks]

Our extensive evaluation of computer-use agents reveals:

  • Computer-use agents can match or exceed specialized tool-based agents on many tasks
  • Current models suffer from poor visual grounding in complex IDE interfaces
  • Agents fail to edit files effectively due to visual perception limitations
  • The Claude Computer-Use agent demonstrates basic IDE tool proficiency but struggles with advanced tools
  • Current agents are largely incapable of recovering from errors
  • Training models to better utilize IDE tools would substantially improve performance

Common Agent Limitations

  • Poor Visual Grounding: Agents struggle to identify the correct UI elements for interaction
  • File Editing Failures: Agents incorrectly position text when editing files

PwP in Action

Code Editing Demo

The agent fixes a bug in a Python function by analyzing the code context, locating the error, and implementing the correction through natural IDE interactions.

UI Development Demo

The agent is tasked with creating a simple HTML page from an image. It opens the image and a live preview of the code it writes side by side, using the preview to iteratively improve the code.

IDE Configuration Demo

The agent configures development environment settings, installing necessary extensions and adjusting workspace preferences through visual interaction with VSCode.

Code Refactoring Demo

The agent is tasked with renaming a variable in a complex Python repository. It successfully uses VSCode's rename functionality to rename the variable correctly across the project.

Leaderboard

Model                                                % Resolved   Date
🏆 Computer-Use Agent (Claude-3.5 Sonnet, 20241022)    46.8       2025-02-24
🥈 Computer-Use Agent (GPT-4o)                         32.3       2025-02-24
🥉 Computer-Use Agent (Gemini-1.5 Pro)                 18.1       2025-02-24
   Computer-Use Agent (GPT-4o-mini)                    17.8       2025-02-24
   Computer-Use Agent (Gemini-1.5 Flash)                8.9       2025-02-24

Citation

@misc{aggarwal2025programmingpixelscomputerusemeets,
      title={Programming with Pixels: Computer-Use Meets Software Engineering}, 
      author={Pranjal Aggarwal and Sean Welleck},
      year={2025},
      eprint={2502.18525},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2502.18525}, 
}