I have been watching the new wave of computer-use agents and honestly, most of what they do is incredible. The mouse moves like a real user. Forms get filled. Sites get navigated. Tabs get juggled. Tasks that needed a person sitting in front of a screen are now running on their own, end to end, with surprisingly little hand-holding. Five years ago this was science fiction. Today it is a demo you can run on your own laptop, and the future of how we use software clearly runs through here.
Then you watch long enough and the same shape keeps showing up. The agent looks great for the first half of the task, sometimes the first ninety percent, and then the workflow hits a login wall, a captcha, a 3-D Secure popup, a card form that wants a human phone tap, and everything stalls. The agent was good enough to make the stall feel absurd. That is new. A year or two ago the impressive part was that it could move through the task at all.
That is the actual story of computer use right now. Agents got useful faster than the systems around them got ready to trust, verify, and bill them.
the cursor is the new interface
For a few years we assumed AI would live in a chat box. Type, answer, repeat. The interface was words.
That assumption is dead. The interface is your operating system now. The agent gets the mouse, the keyboard, the browser tabs, sometimes your card. It does not describe how to do the thing. It does the thing.
Every serious lab is on this. Anthropic shipped computer use in late 2024, and Claude has been driving desktops since. OpenAI followed with Operator. Google has Gemini computer use plus Project Mariner inside Chrome. Perplexity built Comet, an entire browser shaped around an agent. Manus came out of China with a general-purpose agent that books trips, files paperwork, and scrapes whatever you point it at. Microsoft and Apple are wiring this into the OS itself, quieter, slower, much closer to where the real leverage is.
Pick any of them. The message is the same. The agent is the new user.
how far we actually got
Two years ago, asking a model to complete a real multi-step task on a real desktop was a coin flip at best. OSWorld, the most honest early benchmark for desktop agents, was sitting at around 14% when it launched in 2024. The progress since has been fast enough that any single number starts going stale almost immediately.
On “can the agent see and click the right thing”, the field is getting very good. Pixel-grounding benchmarks like ScreenSpot are above 90% for the top models. The eyes work. The hands mostly work. The old joke that agents cannot use computers is aging badly.
On “can the agent finish a real workflow”, the answer is messier. The best systems are now brushing against human baselines on OSWorld-style tasks, which is a ridiculous improvement from 2024. But benchmark parity is not the same thing as operational trust. Newer cross-application evals still show ugly failure rates once the task requires several apps, conditional judgment, cleanup, and persistence. Even a 75% task success rate means one in four things still falls over somewhere.
And that “somewhere” is almost never “I could not read the button”. It is the page reloaded, the login screen came back, a captcha appeared, a popup blocked the click, the cart wants your card, or the agent did the right local action inside the wrong global state.
The cursor is not the bottleneck. The world around the cursor is.
the five shapes people are shipping
There is a tendency to talk about “computer use” like it is one product. It is at least five, and the shapes have very different trade-offs.
The desktop agent runs on your machine and can touch anything you can touch. Anthropic’s computer use sits here. Max power, max blast radius. If it gets confused, it can delete things you cared about. Trust bar is high. Best for power users and developers who can sandbox it.
The browser agent lives inside a tab. Perplexity Comet, Arc’s agent, the various Chromium-based experiments. Safer because the world it can break is small, but it hits a wall every time a workflow leaves the browser. Try to download a PDF, sign it, attach it back somewhere, and the seams show up immediately.
The managed agent runs in the cloud. OpenAI Operator is the cleanest example. Convenient because you do not give it your laptop, but the moment the site you care about decides cloud IPs look like bots, you are stuck in captcha purgatory.
The OS-level agent is the long game. Apple, Microsoft, and Google’s mobile push are all trying to wire agents into the system itself: real accessibility trees, real intents, real content types. When the OS exposes structure, the agent stops guessing. This is also the lane with the most patience, which is why Apple in particular looks slow until they suddenly are not.
The generalist agent is the demo champion. Manus, Mariner-style long-horizon planners. They will plan your trip, file your form, scrape your competitor. They will also confidently do the wrong thing for ninety steps before failing in a way you cannot easily unwind.
Each shape is real. None of them is finished.
where the workflow actually breaks
If you sit and watch a serious computer-use demo end to end, the failure mode is almost always the same. The agent looks brilliant for the first half of the task and then loses to something boring.
It is not glamorous. Finding the product is easy. Adding it to the cart is easy. Then auth shows up, fraud signals trip, a phone-based 3-D Secure challenge fires off, and the agent stalls because the verification was designed to assume there is a human holding a phone, not a long-running automation.
Most demos quietly stop right before this cliff. The honest ones hand control back to the user at checkout, which kind of defeats the point of an agent.

the protocols are the actual story
The reason buying is so painful is structural. The stack assumes a human in a browser, with cookies, a card on file, a phone for 3-D Secure, and a fraud score built from device fingerprints and behavior. An agent breaks almost every one of those assumptions, and the merchant has no clean way to know “this purchase was authorized by Edi, with these limits, through this agent, for this purpose”. So they either block it, or they let it through and hope.
This is what protocols are quietly fixing. Three layers matter.
MCP is the floor. It is the boring, beautiful plumbing that lets an agent talk to a tool, a database, a file system, an API in one standard way. It is already widely adopted across Anthropic, OpenAI, Cursor, and most serious IDEs. If you are building anything agentic, MCP is the layer you build on, not around.
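To make “one standard way” concrete: MCP speaks JSON-RPC 2.0, and a tool invocation is a `tools/call` request. The sketch below shows that wire shape. The `get_weather` tool, its arguments, and the toy dispatch are hypothetical stand-ins; a real server would be built on an MCP SDK rather than hand-rolled JSON.

```python
import json

# Minimal sketch of the JSON-RPC 2.0 shape MCP uses for tool calls.
# The method name ("tools/call") matches the protocol; the tool itself
# is a made-up example.
def make_tool_call(request_id: int, tool: str, arguments: dict) -> str:
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    })

def handle(raw: str) -> dict:
    # Toy server-side dispatch: look the tool up, run it, wrap the result
    # in MCP's text-content envelope.
    req = json.loads(raw)
    tools = {"get_weather": lambda args: f"sunny in {args['city']}"}
    result = tools[req["params"]["name"]](req["params"]["arguments"])
    return {"jsonrpc": "2.0", "id": req["id"],
            "result": {"content": [{"type": "text", "text": result}]}}

response = handle(make_tool_call(1, "get_weather", {"city": "Lisbon"}))
```

The point is the uniformity: whether the tool is a database query or a file read, the request and response envelopes look the same, which is exactly what an agent needs.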
A2A is the layer above. Google and a coalition of partners pushed it out in 2025 as an open protocol for agents from different vendors to talk to each other. My booking agent calls your inventory agent, they negotiate, work happens, result comes back. If A2A actually catches on, the web stops being a pile of HTML that an agent has to squint at, and starts being a network of cooperating services with structured handshakes.
AP2 is the one I find most concrete. The Agent Payments Protocol, announced by Google with Mastercard, American Express, Coinbase, PayPal and a long list of issuers in late 2025, is an attempt to do for agent commerce what 3-D Secure tried to do for online cards. The agent presents a verifiable mandate signed by the human. The merchant verifies it. The issuer verifies it. Payment flows, with cryptographic evidence at every step. “This agent bought this item for this user with this budget” finally has a real signature behind it instead of a session cookie and a prayer.
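To show what “a real signature instead of a session cookie” buys you, here is a toy mandate check. This is not AP2’s actual wire format, which is built on verifiable credentials; the shared-secret HMAC, the key, and the field names below are simplifying assumptions, just to show the issue-then-verify shape.

```python
import hashlib
import hmac
import json

# Stand-in for the user's signing key. In a real mandate scheme this
# would be an asymmetric key, not a shared secret.
SECRET = b"user-device-key"

def sign_mandate(mandate: dict) -> dict:
    """The human's device signs the mandate once, up front."""
    payload = json.dumps(mandate, sort_keys=True).encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return {"mandate": mandate, "signature": sig}

def verify(signed: dict) -> bool:
    """The merchant checks the signature covers exactly these terms."""
    payload = json.dumps(signed["mandate"], sort_keys=True).encode()
    expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signed["signature"])

def within_budget(signed: dict, price_cents: int) -> bool:
    """Only honor the purchase if the mandate is intact and covers it."""
    return verify(signed) and price_cents <= signed["mandate"]["budget_cents"]

token = sign_mandate({"user": "edi", "agent": "shopper-1",
                      "item": "noise-cancelling headphones",
                      "budget_cents": 25_000})
```

Note what tampering costs: if the agent inflates `budget_cents` after signing, verification fails, because the signature covers the exact terms the human approved. That property, not the crypto details, is the whole point of a mandate.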
If you are a merchant, this is the upgrade path you cannot ignore for long. Either you wire up agent-friendly payments with verifiable mandates, or you start quietly blocking a non-trivial slice of your traffic, because that slice is no longer human.
the messy middle is where we live
The bad news is that none of these protocols are live across the long tail. MCP is the closest. A2A is real but mostly between cooperative parties. AP2 has the right names on the press release and nowhere near enough merchant support to run your week through it yet.
That is why the moment feels so strange. Agents are good enough to use every day, and not good enough to forget about while they hold your card, your calendar, your inbox, or your company account. So agents in 2026 are stuck doing two jobs at once.
One job is serving the long tail of legacy sites and apps that will never speak A2A, by pretending to be a human. This is a perception and resilience problem. Better screen understanding, better recovery when a popup appears, better memory of “I already tried this and it failed”. The flashy demo lives here, and so does most of the brittleness. Every UI redesign quietly breaks an agent somewhere.
The other job is helping the leading edge of services adopt MCP, A2A, AP2, so the next round can be cleaner. This is mostly a trust and identity problem. Who signed the mandate, who is liable, how do we revoke, what audit trail do we keep. Less glamorous. Much more durable.
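The first of those jobs, resilience with memory, reduces to something like the sketch below: an agent that never re-tries an action it has already watched fail. The action names, the flaky-popup scenario, and the set-based memory are all hypothetical; real agents persist richer state, but the core loop is the same.

```python
def choose_action(alternatives, failed):
    """Prefer the first alternative we have not already seen fail."""
    for action in alternatives:
        if action not in failed:
            return action
    return None

def run_step(alternatives, attempt, failed):
    """Try alternatives in order, recording every failure so a re-run
    of the workflow never repeats a known-bad action."""
    while (action := choose_action(alternatives, failed)) is not None:
        if attempt(action):
            return action
        failed.add(action)  # the "I already tried this" memory
    return None

# Simulate a popup blocking the direct click; only the popup-aware
# alternative succeeds. Both action names are made up.
failed_actions: set = set()
outcome = run_step(
    ["click_buy", "dismiss_popup_then_click"],
    lambda action: action == "dismiss_popup_then_click",
    failed_actions,
)
```

The useful property is that `failed_actions` survives the step: on the next run through the same page, the agent goes straight to the alternative that worked instead of losing to the same popup twice.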
The winners over the next few years will not be the labs with the flashiest demo. They will be the ones that solve the perception problem and the trust problem at the same time, so the agent is both capable of running a task on a legacy site and credible enough to do it on a modern one. Most of the current crop is good at one and pretending about the other.
what to actually do
If you build software, four things are worth doing this year. Design your UI so an agent can also use it: real accessibility trees, structured labels, predictable flows, not just whatever the design system spat out. Ship an MCP server if you have any tool surface worth exposing. Start thinking about an A2A integration even before you build it, because the shape of your data and identity model matters more than the wire format. If you take payments, track AP2 and similar mandates, even if you do not implement them yet, because the fraud and identity decisions you make in 2026 will determine whether you can adopt cleanly in 2027.
If you do not build software, the practical version is shorter. For the next few apps you install, ask whether an agent could run them for you. If yes, that is your new productivity ceiling. If no, that is a future migration waiting to happen.
The cursor is moving. Whether it is moved by you or by something acting on your behalf is going to become a question you ask several times a day. I am not nostalgic about clicking through forms. I am, however, nervous about the gap between what these agents can already do and what the systems on the other side are ready to verify, trust and bill. That gap is the work of the next two years, and the companies that close it first are going to look very different from the ones currently winning the demo cycle.
The demo is the easy part. The plumbing under it is the rest of the decade.