What they did
The authors built KnowU-Bench, an online evaluation platform running on reproducible Android emulators, comprising 42 general GUI tasks, 86 personalized tasks, and 64 proactive tasks. Critically, user profiles are hidden from the agent — only behavioral logs are exposed — forcing agents to genuinely infer preferences rather than simply looking them up from provided context. An LLM-driven user simulator, grounded in structured profiles, enables multi-turn clarification dialogues and realistic consent handling.
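The paper doesn't publish its data schema, but the separation it describes can be pictured with a minimal sketch: the simulator holds the structured profile, while the agent's observation carries only behavioral logs and the current screen. All names here (`UserProfile`, `AgentObservation`, `build_observation`) are hypothetical illustrations, not the benchmark's actual API.

```python
from dataclasses import dataclass

@dataclass
class UserProfile:                      # held by the user simulator only
    preferences: dict[str, str]         # e.g. {"cuisine": "vegetarian"}
    consent_style: str                  # e.g. "ask_before_purchases"

@dataclass
class BehaviorLog:                      # one entry in the exposed history
    app: str
    action: str
    timestamp: str

@dataclass
class AgentObservation:                 # everything the agent is allowed to see
    instruction: str                    # possibly vague, e.g. "order my usual"
    behavior_logs: list[BehaviorLog]    # history to infer preferences from
    screen_state: str                   # current GUI state of the emulator

def build_observation(profile: UserProfile, logs: list[BehaviorLog],
                      instruction: str, screen: str) -> AgentObservation:
    # The profile is deliberately dropped here: the agent must infer
    # preferences from the logs rather than read them from context.
    return AgentObservation(instruction=instruction,
                            behavior_logs=logs,
                            screen_state=screen)
```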
The benchmark evaluates the full proactive decision chain: whether an agent can detect when intervention is appropriate, seek user consent, execute the correct GUI actions, and — importantly — back off gracefully when the user declines. Evaluation uses a hybrid protocol combining rule-based verification with LLM-as-a-Judge scoring. The authors tested several frontier models including Claude Sonnet 4.6.
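The paper describes this chain only at that level of detail; one way to picture it is as a small state machine in which consent gates execution and a decline routes to back-off. The sketch below, including the idea that rule checks gate the judge score, is an illustrative assumption rather than the benchmark's actual harness.

```python
from enum import Enum, auto

class Phase(Enum):
    DETECT = auto()      # should the agent intervene at all?
    CONSENT = auto()     # ask the user before acting
    EXECUTE = auto()     # carry out the GUI actions
    BACK_OFF = auto()    # user declined: stop cleanly
    DONE = auto()

def step(phase: Phase, wants_to_act: bool, user_consents: bool) -> Phase:
    # Hypothetical transition function for the detect -> consent ->
    # execute / back-off chain the benchmark scores.
    if phase is Phase.DETECT:
        return Phase.CONSENT if wants_to_act else Phase.DONE
    if phase is Phase.CONSENT:
        return Phase.EXECUTE if user_consents else Phase.BACK_OFF
    return Phase.DONE    # EXECUTE and BACK_OFF both end the episode

def hybrid_score(rule_checks_passed: bool, judge_score: float) -> float:
    # One plausible hybrid: hard rule-based checks (correct final app
    # state, no action after a refusal) gate the LLM-as-a-Judge score.
    return judge_score if rule_checks_passed else 0.0
```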
Key findings
- Agents that perform well on explicit task execution degrade dramatically, with scores falling below 50%, when given vague instructions that require preference inference or intervention calibration.
- The primary bottlenecks are not in GUI navigation itself but in preference acquisition (eliciting missing information through dialogue) and intervention calibration (knowing when to act, ask, or stay silent); see the policy sketch after this list.
- Even Claude Sonnet 4.6, a frontier model, exhibited this degradation pattern, suggesting the gap is not simply a matter of model scale.
- The benchmark exposes that post-rejection restraint — knowing when to stop after a user declines — is a distinct and poorly handled challenge for current agents.
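Intervention calibration and post-rejection restraint are easier to see in a concrete policy. This is a minimal sketch, assuming a scalar confidence in the inferred preference; the thresholds and the `calibrate` name are hypothetical, not from the paper.

```python
def calibrate(confidence: float, user_declined_before: bool) -> str:
    """Choose among act / ask / stay silent for a candidate intervention.

    Illustrative policy for the failure modes the benchmark probes;
    the thresholds are made up for the example.
    """
    if user_declined_before:
        return "stay_silent"    # post-rejection restraint: do not re-offer
    if confidence >= 0.8:       # preference confidently inferred from logs
        return "act"
    if confidence >= 0.4:       # plausible but uncertain: clarify first
        return "ask"
    return "stay_silent"        # too uncertain: intervening would be noise
```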
Why it matters
As mobile AI assistants move toward proactive, personalized behavior — anticipating needs rather than just following orders — the ability to infer preferences and calibrate interventions becomes essential. KnowU-Bench demonstrates that current agents, even top-performing ones, lack these capabilities in realistic interactive settings. This shifts the evaluation focus from "can the agent click the right button" to "does the agent know when and how to help," providing a concrete framework for measuring progress toward trustworthy personal digital assistants.
Caveats
The user simulator, though LLM-driven and grounded in structured profiles, may not capture the full complexity and inconsistency of real human behavior during clarification dialogues. The benchmark covers 192 tasks in specific app categories on Android, and generalization to other platforms or task domains is untested. The LLM-as-a-Judge component of the evaluation introduces its own biases and reliability concerns. Finally, the proactive tasks assume a particular model of consent and intervention that may not match all cultural or individual expectations.