Mobile AI Agents Fail When They Must Infer User Preferences on Their Own

A new benchmark reveals that even frontier language models struggle to elicit preferences through interaction and calibrate when to proactively intervene, despite excelling at explicit GUI tasks.

Tongbo Chen · Zhengxi Lu · Zhan Xu · Guocheng Shao · Shaohan Zhao · Fei Tang · +10 more
Research Digest · 3 min read
Chen et al. · AI-generated illustration · Zotpaper
Chen et al. introduce KnowU-Bench, an interactive benchmark for evaluating personalized mobile agents across 192 tasks in a live Android emulation environment. Their key finding: agents that perform well on explicit instructions see performance drop below 50% when they must infer hidden user preferences or decide when to proactively assist, exposing a fundamental gap between interface competence and genuine personal assistance.

What they did

The authors built KnowU-Bench, an online evaluation platform running on reproducible Android emulators, comprising 42 general GUI tasks, 86 personalized tasks, and 64 proactive tasks. Critically, user profiles are hidden from the agent — only behavioral logs are exposed — forcing agents to genuinely infer preferences rather than simply looking them up from provided context. An LLM-driven user simulator, grounded in structured profiles, enables multi-turn clarification dialogues and realistic consent handling.
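The hidden-profile design can be sketched as a simple data separation: the simulator is grounded in a structured profile, while the agent's observation carries only the screen state and behavioral logs. This is an illustrative sketch, not the paper's actual schema; the class and field names (`UserProfile`, `AgentObservation`, `build_observation`) are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class UserProfile:
    """Structured profile grounding the user simulator (hypothetical schema)."""
    preferences: dict   # e.g. {"cuisine": "vegetarian", "app_theme": "dark"}
    consent_policy: str # how readily the simulated user accepts proactive help

@dataclass
class AgentObservation:
    """What the agent actually receives: no profile, only observable behavior."""
    screen_state: str                                     # current GUI state from the emulator
    behavioral_logs: list = field(default_factory=list)   # past interaction traces

def build_observation(profile: UserProfile, screen: str, logs: list) -> AgentObservation:
    # The profile is deliberately withheld from the observation, so the
    # agent must infer preferences from logs rather than read them from context.
    return AgentObservation(screen_state=screen, behavioral_logs=list(logs))
```

The point of the separation is that any preference the agent acts on must be inferred or elicited through dialogue, never looked up.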

The benchmark evaluates the full proactive decision chain: whether an agent can detect when intervention is appropriate, seek user consent, execute the correct GUI actions, and — importantly — back off gracefully when the user declines. Evaluation uses a hybrid protocol combining rule-based verification with LLM-as-a-Judge scoring. The authors tested several frontier models including Claude Sonnet 4.6.
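A hybrid protocol of this kind typically blends deterministic checks with a judge model's score. The weighting below is a minimal sketch under assumed conventions, not the paper's exact formula; `hybrid_score` and its parameters are hypothetical.

```python
def hybrid_score(rule_checks: list[bool], judge_score: float, weight: float = 0.5) -> float:
    """Combine rule-based verification with an LLM-as-a-Judge score.

    rule_checks: pass/fail outcomes of deterministic verifiers
                 (e.g. "correct screen reached", "consent was sought").
    judge_score: normalized [0, 1] rating from the judge model.
    weight:      share of the final score given to the rule-based part
                 (illustrative; the actual protocol may weight differently).
    """
    rule_score = sum(rule_checks) / len(rule_checks) if rule_checks else 0.0
    return weight * rule_score + (1.0 - weight) * judge_score
```

Rule-based checks anchor the score in verifiable facts, while the judge covers qualities rules cannot express, such as whether a clarification question was appropriate.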

Key findings

  • Agents that perform well on explicit task execution see performance degrade dramatically — falling below 50% — when given vague instructions requiring preference inference or intervention calibration.
  • The primary bottlenecks are not GUI navigation itself, but preference acquisition (eliciting missing information through dialogue) and intervention calibration (knowing when to act, ask, or stay silent).
  • Even Claude Sonnet 4.6, a frontier model, exhibited this degradation pattern, suggesting the gap is not simply a matter of model scale.
  • The benchmark exposes that post-rejection restraint — knowing when to stop after a user declines — is a distinct and poorly handled challenge for current agents.
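The proactive decision chain described above, including the post-rejection restraint the last finding highlights, can be sketched as a tiny state machine. This is an interpretation of the digest's description, not the benchmark's implementation; the `Phase` states and `step` function are hypothetical names.

```python
from enum import Enum, auto

class Phase(Enum):
    MONITOR = auto()   # watch for situations where intervention is appropriate
    ASK = auto()       # seek user consent before acting
    EXECUTE = auto()   # perform the GUI actions
    BACK_OFF = auto()  # user declined: stay silent from here on

def step(phase: Phase, trigger_detected: bool = False, consented: "bool | None" = None) -> Phase:
    """One transition of the proactive decision chain (illustrative sketch)."""
    if phase is Phase.MONITOR:
        return Phase.ASK if trigger_detected else Phase.MONITOR
    if phase is Phase.ASK:
        if consented is True:
            return Phase.EXECUTE
        if consented is False:
            return Phase.BACK_OFF   # post-rejection restraint: no retries
        return Phase.ASK            # still waiting on the user
    if phase is Phase.BACK_OFF:
        return Phase.BACK_OFF       # declined once: do not re-intervene
    return Phase.MONITOR            # after executing, resume monitoring
```

The finding on post-rejection restraint corresponds to the `BACK_OFF` self-loop: current agents often fail precisely there, re-approaching the user after a refusal.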

Why it matters

As mobile AI assistants move toward proactive, personalized behavior — anticipating needs rather than just following orders — the ability to infer preferences and calibrate interventions becomes essential. KnowU-Bench demonstrates that current agents, even top-performing ones, lack these capabilities in realistic interactive settings. This shifts the evaluation focus from "can the agent click the right button" to "does the agent know when and how to help," providing a concrete framework for measuring progress toward trustworthy personal digital assistants.

Caveats

The user simulator, while LLM-grounded in structured profiles, may not capture the full complexity and inconsistency of real human behavior during clarification dialogues. The benchmark covers 192 tasks across specific app categories on Android, and generalization to other platforms or task domains is untested. The LLM-as-a-Judge component of evaluation introduces its own biases and reliability concerns. Additionally, the proactive tasks assume a particular model of consent and intervention that may not align with all cultural or individual expectations.

§

Analysis

This work sits at the intersection of two active research threads: GUI agent benchmarks (like AndroidWorld and OSWorld) and personalization evaluation (like LaMP and LongLaMP). KnowU-Bench's contribution is specifically in combining these — requiring agents to operate in a live environment while handling the messy, interactive process of preference discovery. The finding that preference inference, not GUI manipulation, is the bottleneck resonates with broader observations in the agent community that tool use has advanced faster than judgment and social reasoning. The benchmark also raises an important design question: how much proactive behavior should an agent exhibit, and who defines the threshold? Future work will likely need to address not just whether agents can calibrate interventions, but whether the calibration targets themselves are well-defined.


Research Digest

Articles published under the Zotpaper byline are synthesized from multiple source publications by our AI editor and reviewed by our editorial process. Each story combines reporting from credible outlets to give readers a balanced, comprehensive view.