
Don't Bet on One Sensor

Summary

GUI-only benchmarks produce GUI-only agents. MCP-only benchmarks produce MCP-only agents. Zhang et al. posed the question directly: API agents vs. GUI agents. Their finding: hybrid approaches combine the strengths of both. After a year of building mixed-modality training environments, our data agrees. Tasks that tightly couple both modalities create a vector for improvement that neither unlocks alone.

If the future agent is multi-modal, the training environment has to be too.

Zhang et al. present the first comprehensive comparative study of API-based and GUI-based LLM agents. Their conclusion: these paradigms "diverge significantly in architectural complexity, development workflows, and user interaction models," but continuing innovations are "poised to blur the lines between API and GUI-driven agents, paving the way for more flexible, adaptive solutions."

At Chakra, we build multi-modal environments for frontier lab RL training. We've been operating at this intersection for the past year. Here's what we see.

Precision vs. Universality

There are two primary modalities for how agents interact with software. Each comes with a fundamental tradeoff.

API agents (models that interact with software through structured endpoints like MCP) are fast, precise, and deterministic. A Notion MCP call to create a page takes milliseconds. No screenshots, no pixel interpretation, no ambiguity.
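The shape of that interaction can be sketched in a few lines. This is a hypothetical stub, not a real MCP client: `create_page` stands in for something like a Notion "create page" tool, and the point is only the contract, one structured request in, one structured result out, no pixels in the loop.

```python
# Hypothetical sketch of the API modality's contract. `create_page` is an
# illustrative stand-in for an MCP tool call, not a real client library.
def create_page(title: str, parent_id: str) -> dict:
    # A real MCP client would send a JSON-RPC tools/call request here.
    # The stub returns the kind of structured, unambiguous result the agent
    # gets back: success or a typed error, never a screenshot to interpret.
    return {"ok": True, "page": {"title": title, "parent_id": parent_id}}

result = create_page("Q3 Roadmap", "projects-db")
assert result["ok"]             # deterministic: the agent branches on data,
print(result["page"]["title"])  # not on interpreting rendered pixels
```

The agent checks a field, not a rendered screen, which is where the speed and determinism come from.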

GUI agents (models that interact through visual interfaces) are universal. They work on any software with a screen, no integration required. But they're slow, compute-heavy, and brittle. Before 2026, the best foundation models scored around 35-45% on OSWorld, the standard desktop agent benchmark. Agentic frameworks pushed this toward 61%. Even systems approaching the human baseline of 72% struggled with precise interactions: calendar date picking, form inputs, multi-step workflows requiring exact click targets.

The tempting conclusion is to default to API agents. But that assumes the API coverage is sufficient, and that there isn’t a better path forward.

The API coverage myth

Most enterprise software doesn't have comprehensive API coverage. MCP adoption has accelerated: 5,800+ servers, 300+ clients, major deployments across Fortune 500 companies. The "Is MCP dead?" debate resurfaces every month, but the broader point stands: programmatic surfaces for tool-calling have taken off. Coverage, though, is uneven. Figma's MCP server is design-to-code: read-heavy, with limited write operations. Canva's covers a fraction of the product surface. Even Notion's - arguably the most complete MCP implementation on the market - can't do everything the GUI can. There is no MCP tool for configuring database views or changing access permissions.

On long enough timelines, maybe every application ships a comprehensive API. Google is clearly aware of the gap: WebMCP, launched last week in Chrome Canary, aims to let any website expose structured tools to agents. But a standard is only as good as its adoption, and there's no forcing function for developers to implement it. Building agents for a world that doesn't exist yet is a bad training strategy. Today, the enterprise software landscape is a patchwork: some surfaces are instrumented, most aren't. You need both.

Convergence in practice

The paper's key insight is convergence: hybrid approaches combine the strengths of both paradigms. This matches what we see in the wild.

OpenClaw hit 180,000 GitHub stars in weeks. It uses terminal commands, browser control, APIs, native GUI, voice, whatever the task requires. Users don't specify modality. They say "do this" and the agent picks the path. It took off not because of any single capability, but because it doesn't artificially limit itself to one.

This is directionally where enterprise agents are headed too. A real workflow in Notion isn't purely GUI or purely MCP. You query a database via API, click through the interface to configure a view, then use MCP to update properties programmatically. The modality is invisible to the user. It should be invisible to the model.
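One way to make the modality invisible is a dispatcher: the planner emits steps, each step declares which surface it needs, and a single executor routes them. The sketch below is hypothetical; `Step`, `run_mcp_tool`, and `run_gui_action` are illustrative names, not a real agent framework, and the handlers are stubs.

```python
# Hypothetical sketch of a modality-agnostic step executor. All names here
# (Step, run_mcp_tool, run_gui_action) are illustrative, not a real API.
from dataclasses import dataclass

@dataclass
class Step:
    modality: str   # "mcp" or "gui"
    action: str
    params: dict

def run_mcp_tool(action: str, params: dict) -> str:
    return f"mcp:{action}"    # stand-in for a structured tool call

def run_gui_action(action: str, params: dict) -> str:
    return f"gui:{action}"    # stand-in for clicking/typing on a screen

def execute(plan: list[Step]) -> list[str]:
    """Dispatch each step to the right surface; the caller never sees which."""
    handlers = {"mcp": run_mcp_tool, "gui": run_gui_action}
    return [handlers[step.modality](step.action, step.params) for step in plan]

# The Notion workflow from the text: query via API, configure a view through
# the interface, update properties via MCP.
plan = [
    Step("mcp", "query_database", {"db": "Tasks"}),
    Step("gui", "configure_view", {"view": "Board"}),
    Step("mcp", "update_properties", {"page": "Roadmap"}),
]
print(execute(plan))
```

The user-facing surface is just `execute(plan)`; whether a step rides an endpoint or a screenshot is an implementation detail of the step, which is the property the paragraph above describes.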

Breaking the plateaus

Our data shows that API-only and GUI-only performance each plateau independently. Tasks that tightly couple both modalities create a vector for improvement that neither modality unlocks alone.

This is why we build mixed-modality environments. Every environment we ship - Notion, Figma, Canva, Slack, etc. - supports both GUI and MCP interaction. Our tasks force the model to plan across tool surfaces: query via API, act via GUI, verify the result. These mixed tasks are where frontier models still fail at sub-30% pass rates (pass@5), and where the training signal is richest.
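A minimal sketch of how such a task could be scored, under loud assumptions: `MixedTask` and `verify` are hypothetical names, not Chakra's actual task format. The idea is that the verifier checks two things - the trajectory actually touched every required modality, and the environment's end state matches the goal - so a model can't pass by staying on one surface.

```python
# Hypothetical sketch of scoring a mixed-modality task. MixedTask and verify
# are illustrative names, not a real training-environment API.
from dataclasses import dataclass

@dataclass
class MixedTask:
    instruction: str
    required_modalities: frozenset = frozenset({"api", "gui"})

def verify(task: MixedTask,
           trajectory: list[tuple[str, str]],
           final_state: dict) -> bool:
    """Pass only if the trajectory used every required modality AND the
    environment's end state matches the goal: outcome-based, not step-based."""
    used = {modality for modality, _action in trajectory}
    return task.required_modalities <= used and final_state.get("view_configured", False)

task = MixedTask("Query the Tasks DB via API, then configure a board view in the GUI.")
print(verify(task, [("api", "query_database"), ("gui", "configure_view")],
             {"view_configured": True}))   # both surfaces used, goal reached
print(verify(task, [("api", "query_database")],
             {"view_configured": True}))   # API-only trajectory fails
```

Coupling the modality check to an end-state check is what makes the reward signal "mixed": a single-surface shortcut scores zero even when it stumbles into the right state.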

Don't bet on one sensor

Self-driving offers a useful parallel. Camera-only systems work most of the time, but struggle on edge cases that structured data would solve. LiDAR-only systems are precise where the world is mapped, blind where it isn't. The systems in production use multiple sensors together, not because any single one is bad, but because the real world doesn't reduce to a single modality.

The same logic applies to agents. Don't handicap the model. Train it on everything.

We obsess over task design at Chakra Labs.

If you're interested in pushing the edges of model capacity with frontier lab researchers, please reach out.

Nirmal Krishnan

Chakra Team

Nirmal Krishnan is a co-founder at Chakra Labs. Prior to starting the company, he spent time in data, markets, and early-stage startups. He studied computer science and machine learning at Johns Hopkins, where he pursued a bachelor’s and master's degree in computational genomics, publishing papers on prostate cancer and induced pluripotent stem cells.

We build infrastructure for frontier-defining problems.

Access research-grade infrastructure for agent development. Deterministic environments with frame-accurate state control, high-fidelity trajectory datasets, and mixed-modality training capability.

Frontier Data Laboratory

Contact

Social Channels

Company Resources

Newsletter

Copyright ©2026 Chakra Labs. Unauthorized duplication or use of the content of this website is prohibited.
