Judging If an Agent Is Reliable: The Only Metric That Matters — Did It Actually Finish the Job?
To judge whether an Agent is reliable, there is only one core metric: did it actually finish the job?
How the Industry Evaluates Agents
The industry's common approach is simple: equip the Agent with a virtual machine preloaded with applications or simulated web pages, then score its actions. This logic has given rise to benchmarks such as OSWorld (for evaluating computer skills) and Tau2 (for assessing customer service workflows).
When GPT-5.5 was released, these rankings were cited as well. Every time a new model launches, these charts are trotted out for display — but there's an unspoken flaw: evaluations using simulators measure actions, not outcomes.
The main purpose of benchmarks is to identify flaws in existing models. In the field of computer use, the biggest challenge is the prevalence of performative Agents: many are skilled at putting on a show — they can complete simple tasks like copying files and deliver polished final reports.
But in real-world office scenarios, what we care about is whether complex, cross-software tasks that span hundreds of steps are actually completed.
Introducing SaaS-Bench
To address this issue, the team at UniPat Lab developed a new tool: SaaS-Bench, designed to call out Agents that talk a good game but deliver little.
They packaged a suite of well-known open-source SaaS tools — such as Mattermost, OnlyOffice, and ownCloud — into Docker containers, creating a genuine office environment. The goal is to observe how Agents operate and verify whether the database actually changes after their actions.
The test results show that Opus and GPT are clearly ahead of the pack. Yet even the top-performing model scores less than 50% under this real-world validation.
Note: DeepSeek, GLM, and MiniMax do not accept image input, so they were evaluated only on text-only tasks, using accessibility trees without screenshots.
Why Real-World Validation Matters
What we call "real" must be verifiable.
In the past, testing GUI ability usually meant setting up a static webpage environment to see if the Agent could click buttons correctly — much like a driving test: checking whether you can parallel park, stay within the lines, and so on.
But driving on real roads is a different story. Real office work is business-driven and the environment is complex. For example, the Agent may click successfully and the page may even redirect, yet the backend records nothing because the link it clicked was fake.
Real-world computer environments are always full of weird edge cases.
Going back to first principles: an Agent's words can lie, but a database cannot. We only need to check changes in the database — and this is exactly how SaaS-Bench was born.
Only completing the full chain counts:
Task Input → Agent → SaaS Apps (Docker) → Browser Interaction → Verification (State Check) → Score
Benchmark Design
The UniPat team containerized 23 open-source SaaS applications in Docker. The test scenarios span six domains: software development, business finance, healthcare management, team collaboration, agricultural supply chains, and independent media. Each scenario uses real-world business data.
Twenty-three apps across six industries — chances are your company uses several of them. Notably, out of all 106 tasks, 93.4% involve two or more apps, and half (53 tasks) require collaboration across three apps. There are 74 text-only tasks and 32 tasks requiring multimodal understanding.
This mirrors our typical work habits: constantly switching between apps to copy and paste. By contrast, previous GUI benchmarks mostly tested single-app tasks within 50 steps.
Take healthcare management as an example: a doctor first writes a SOAP note in OpenEMR, then fills out reporting fields in OpnForm, and finally generates an official document in OnlyOffice — juggling three systems back and forth.
Where earlier benchmarks stopped at around 50 steps, SaaS-Bench consists almost entirely of long-horizon tasks of over 100 steps. Any attempt to cut corners or fake progress will fail the final database validation.
How Tasks Are Created
Tasks are built with a human-in-the-loop process of roughly four stages. First, large models generate initial data based on professional roles and task seeds. Experts then manually filter the candidates, execute the tasks themselves, and align the validators, so that every task is representative and verifiable.
Whether an operation is correct is verified by checking the database. A validator sits behind the scenes: each task has a verify.py script that automatically runs SQL queries on the database and calls APIs to fetch statuses. As soon as a task finishes, the verifier directly checks whether the fields in the database are correct.
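As a rough illustration of what such a validator might look like, here is a minimal sketch. It is not the actual SaaS-Bench verify.py: the `verify` function, the `invoices` table, its columns, and the client name are all hypothetical, and an in-memory SQLite database stands in for the real application backend.

```python
import sqlite3


def verify(conn, checks):
    """Run each (name, sql, params, expected) check against the database.

    Returns a dict mapping check name to True/False. A check passes only
    if the first column of the first returned row equals `expected`.
    """
    results = {}
    for name, sql, params, expected in checks:
        row = conn.execute(sql, params).fetchone()
        actual = row[0] if row else None
        results[name] = (actual == expected)
    return results


# Stand-in backend with a hypothetical schema (assumption, not SaaS-Bench's).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoices (client TEXT, billing_date TEXT)")
conn.execute("INSERT INTO invoices VALUES ('Example Co', '2026-03-19')")

checks = [
    ("invoice_exists",
     "SELECT COUNT(*) FROM invoices WHERE client = ?", ("Example Co",), 1),
    ("billing_date",
     "SELECT billing_date FROM invoices WHERE client = ?", ("Example Co",),
     "2026-03-20"),
]

results = verify(conn, checks)
print(results)
```

In this sketch the invoice exists but carries the wrong date, so the first check passes and the second fails, regardless of what the Agent wrote in its final report.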
SaaS-Bench Leaderboard
Note: DeepSeek/GLM/MiniMax are unimodal.
Model testing falls into two main categories: text-only tasks and multimodal tasks. Both interact with SaaS interfaces via a browser, but with a key difference: multimodal models receive screenshots + accessibility trees, while non-multimodal models get only accessibility trees.
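The difference in inputs can be pictured with a small sketch. The `Observation` class and `build_observation` function below are hypothetical names used only for illustration; the source does not describe the harness's actual data structures.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Observation:
    """What the model sees at each step (hypothetical structure)."""
    accessibility_tree: str
    screenshot_png: Optional[bytes] = None


def build_observation(a11y_tree: str, screenshot: bytes, multimodal: bool) -> Observation:
    # Multimodal models get the screenshot plus the accessibility tree;
    # text-only models get the accessibility tree alone.
    return Observation(a11y_tree, screenshot if multimodal else None)


text_only = build_observation("<button name='Save'/>", b"\x89PNG", multimodal=False)
with_image = build_observation("<button name='Save'/>", b"\x89PNG", multimodal=True)
```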
For multimodal models, Claude Opus 4.7 takes first place with a checkpoint score of 43.9% and a resolved score of 3.8%. GPT-5.5 High is nearly tied at 43.8% checkpoint score, but only 1.9% resolved.
To clarify: resolved means the task was completed perfectly; checkpoint awards partial credit for progress. Clearly, even top-tier Opus struggles with real office software operations — this matches real-world experience.
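The two metrics can be sketched as follows. This assumes the checkpoint score is the unweighted fraction of checkpoints passed, averaged over tasks; the source does not give the exact formula, so treat the weighting as an assumption. The function names are illustrative.

```python
def score_task(checkpoints):
    """checkpoints: list of bools, one per verified checkpoint in a task.

    Returns (checkpoint_score, resolved): partial credit as a fraction,
    and whether every checkpoint passed.
    """
    passed = sum(checkpoints)
    return passed / len(checkpoints), passed == len(checkpoints)


def aggregate(tasks):
    """Average both metrics over a list of tasks."""
    scores = [score_task(t) for t in tasks]
    avg_checkpoint = sum(s for s, _ in scores) / len(scores)
    resolved_rate = sum(1 for _, r in scores if r) / len(scores)
    return avg_checkpoint, resolved_rate


# Three made-up tasks: half done, fully done, barely started.
tasks = [
    [True, True, False, False],
    [True, True, True, True],
    [True, False, False, False],
]
avg_checkpoint, resolved_rate = aggregate(tasks)
```

This shows why the two numbers can diverge so sharply: a model can collect partial credit on most tasks while fully resolving very few of them.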
Among Chinese multimodal models, Kimi K2.6 stands out as the strongest. K2.5 is widely regarded as a turning point for Kimi, and K2.6 is open-sourced.
For non-multimodal models (DeepSeek/GLM/MiniMax), focusing solely on text-only tasks, the newly released DeepSeek V4 outperforms GLM and MiniMax — consistent with the common belief that newer models are stronger.
Two interesting trends emerge:
- Nearly all multimodal models score higher on the theoretically harder multimodal tasks than on text-only tasks.
- Multimodal models also perform better on text-only computer-use tasks.
For the second point, since unimodal models rely solely on accessibility trees while multimodal models also get screenshots, this suggests that even for Agents, combining text and visuals improves information understanding.
The Longer the Task, the Lower the Success Rate
Longer tasks are more prone to errors. The data illustrates this clearly:
- Single-app tasks average a score of 53%, while tasks spanning four apps drop to around 20%.
- Tasks completed in under 50 steps have an average success rate above 50%, but this falls to roughly 20% for 400-step tasks.
- Tasks with six or fewer checkpoints score 65%, while those with 18 or more drop to 27%.
In short: the more complex the task, the lower the score. Mathematically, this makes sense — even with a 95% pass rate per checkpoint, 12 consecutive checkpoints yield only a 54% overall success rate.
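The arithmetic behind that figure is easy to reproduce. The sketch below assumes every checkpoint passes independently with the same probability, which is a simplification: real error rates vary from step to step. The function name is illustrative.

```python
def chain_success(p: float, n: int) -> float:
    """Probability that n independent checkpoints all pass,
    each with pass rate p."""
    return p ** n


# 12 consecutive checkpoints at a 95% pass rate each.
overall = chain_success(0.95, 12)
print(f"{overall:.2%}")  # about 54%

# How the overall rate decays as the chain grows.
for n in (1, 6, 12, 18):
    print(n, round(chain_success(0.95, n), 3))
```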
97.3% of tasks exceed 100 steps, with the longest reaching over 300. This is exactly how real office workflows work. The longer the task, the higher the chance of a mistake at any step, and the harder it is to recover later. Breaking tasks into early, mid, and late stages, all models follow the same pattern: gaining points early, losing points late.
All models decline steadily — no exceptions. At the same time, the single-step error rate is not constant. A mistake in an early step can drag down the success rate of many subsequent steps, and self-correction is hard.
One small misstep at step seven brings down the rest. In this task, the goal was to create a company client named Arcturus Digital. The Agent entered the contact name along with the company name, which triggered the logic for a personal client instead — resulting in the creation of a person named Elena Vasquez. Consequently, subsequent processes such as invoicing, payment recording, and account reconciliation all failed because they were linked to the wrong entity.
Clearly, even a minor early mistake can lead to significant downstream consequences.
Databases Don't Lie
Large models often have a bad habit of "promising first, apologizing later." This is where verifying against the database changes the picture. Ask an Agent to self-check and it will confidently say, "Don't worry, the restaurant booking is 100% confirmed." Check the database, though, and the problem shows up quickly: many self-reported successes are pure hallucinations.
If you only read the Agent's report, you will often be convinced the job was done. That is when you need an independent source of truth: make the Agent's claims answer to the actual backend data.
For example, Opus 4.6 noticed a wrong date in a task and said, "I'll fix it right away, no problem," then reported, "Billing date updated to 2026-03-20." But if you check via the API, the backend still shows: billing date 03-19.
The intent says done; the backend state says not done. The Agent judges success by its own intent: its self-reflection says "I will fix it," but the fix does not always land. The verifier exists precisely to expose this gap between what the Agent reports and what the system actually recorded.
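A check of this kind can be sketched in a few lines. The function and field names are hypothetical, and the year on the backend date is assumed to be 2026 (the source quotes it only as 03-19); in practice the `actual` values would come from a database query or an API call rather than a literal dict.

```python
def find_false_claims(claimed: dict, actual: dict) -> dict:
    """Return every field where the Agent's reported value
    differs from what the backend actually holds."""
    return {
        field: {"claimed": value, "actual": actual.get(field)}
        for field, value in claimed.items()
        if actual.get(field) != value
    }


# What the Agent said it did vs. what the backend shows.
claimed = {"billing_date": "2026-03-20"}
actual = {"billing_date": "2026-03-19"}  # year assumed for illustration

mismatches = find_false_claims(claimed, actual)
print(mismatches)
```

An empty result means the report and the backend agree; anything else is a self-reported success that did not actually happen.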
From Leaderboards to Training Data
Over the past two years, Computer-Use Agents (CUA) have faced one core challenge: a severe shortage of training data. Recent papers like WebSTAR, GUI-360, and Video2GUI all open with the same observation: high-quality trajectory data is scarce.
Most CUA training data comes from manual labeling — expensive and not scalable. The rest is synthetic data from simplified environments — cheap, but unrealistic.
The real value of SaaS-Bench lies in its environment: it can reliably generate long-horizon, cross-app trajectories with real backend validation. For Agents aiming to master office workflows, this environment is extremely valuable.
Summary
If we truly want Agents to be adopted across industries, we need better ways to evaluate their behavior — ensuring they deliver real results, not just empty promises.
When evaluating Agents, we shouldn't only look at polished final reports and neat formatting. What matters most is whether the job actually got done.
This is exactly why SaaS-Bench matters: it provides a way to "detect lies" and an environment to generate real data — essentially, a performance scorecard for future Agents.