Three independent research papers published simultaneously on arXiv this week converge on a striking conclusion: much of what the artificial intelligence research community has accepted as evidence of sophisticated emergent behaviour in large language model (LLM) agent systems may not withstand rigorous scrutiny.
The papers, covering distinct but related areas of AI agent research, each identify structural blind spots in how experiments are designed and results are interpreted — raising questions about the robustness of a growing body of published work.
Emergent Consensus May Be a Modelling Artefact
In the first paper, researcher Dongxu Yang introduces a measurement called "coupling gain" (gamma) to quantify how much one LLM agent's stated opinion shifts in response to a neighbour's view. Testing five frontier models, Yang finds that coupling gain is stable and distinguishes between models, ranging from 0.15 to 0.43 across systems. Crucially, though, it reflects general responsiveness to evidence rather than anything uniquely social.
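To make the idea concrete, here is a minimal sketch of one way a coupling gain could be estimated. It assumes a simple linear update rule, in which an agent's new opinion equals its old opinion plus gamma times the gap to its neighbour's opinion, and fits gamma by least squares. That rule, the function name and the sample numbers are illustrative assumptions, not Yang's published definition or data.

```python
def estimate_coupling_gain(own_before, neighbour, own_after):
    """Estimate gamma under the ASSUMED update rule:
        own_after = own_before + gamma * (neighbour - own_before)
    using a least-squares slope through the origin.
    Opinions are numeric scores on any common scale (hypothetical setup).
    """
    gaps = [n - b for b, n in zip(own_before, neighbour)]
    shifts = [a - b for b, a in zip(own_before, own_after)]
    denom = sum(g * g for g in gaps)
    if denom == 0:
        raise ValueError("agents already agree; gamma cannot be estimated")
    return sum(g * s for g, s in zip(gaps, shifts)) / denom


# Made-up example: each agent moves a quarter of the way towards its neighbour.
before = [0.2, 0.8, 0.5]
neighbour_view = [0.6, 0.4, 0.9]
after = [0.3, 0.7, 0.6]
gamma = estimate_coupling_gain(before, neighbour_view, after)
```

Under this reading, a gamma near 0 would mean the agent ignores its neighbour, and a gamma near 1 would mean it adopts the neighbour's view outright.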
More provocatively, Yang applies a diagnostic test to a widely cited 2023 paper by Chuang and colleagues that claimed to demonstrate emergent consensus among LLM agents. The diagnostic reveals that apparent consensus on settled factual questions was driven by each model's pre-existing knowledge biases, not by genuine inter-agent influence. Real averaging behaviour occurred only on genuinely debatable claims. "Emergent consensus must be read from coupling in the target interaction," Yang concludes, warning researchers against conflating convergence driven by model priors with genuine inter-agent influence.
Simulated Customers Are Too Willing to Buy
The second paper, authored by Liang Chen, examines a different pillar of AI agent research: the use of LLMs to simulate human users for training and evaluation purposes. Using a dataset of 2,790 real conversations between an LLM sales agent and actual customers — including 793 with verified payment records — Chen identifies what he terms a "disengagement deficit."
Simulators accurately reproduced the conversational behaviour of customers who ultimately made a purchase, but dramatically misrepresented those who walked away. Real non-buyers expressed resistance and disengaged; simulated non-buyers continued asking about pricing and remained engaged in the conversation. Among non-buyers, simulation roughly halved expressed resistance (from 25.1% to 13.5%) and nearly doubled deliberation behaviours. The pattern held across multiple model families, including DeepSeek, and proved stubbornly resistant to simple prompting fixes. The practical consequence, Chen argues, is that AI sales agents trained or evaluated on such simulators will appear to perform far better than they actually do with real customers.
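The comparison behind figures such as 25.1% versus 13.5% can be sketched as a difference in behaviour rates between real and simulated non-buyers. The snippet below is an illustrative assumption about how such a check might be set up, not Chen's actual pipeline: the label names, the per-conversation label sets and the toy data are all hypothetical.

```python
def behaviour_rate(conversations, behaviour):
    """Share of conversations in which the behaviour label appears at least once.
    Each conversation is represented as a set of behaviour labels (hypothetical)."""
    if not conversations:
        raise ValueError("need at least one conversation")
    hits = sum(1 for labels in conversations if behaviour in labels)
    return hits / len(conversations)


def rate_gap(real, simulated, behaviour):
    """Real-minus-simulated rate; positive means the simulator under-produces it."""
    return behaviour_rate(real, behaviour) - behaviour_rate(simulated, behaviour)


# Toy, made-up non-buyer conversations.
real_non_buyers = [{"resistance"}, {"resistance", "pricing"}, {"pricing"}, set()]
sim_non_buyers = [{"pricing"}, {"pricing", "deliberation"}, {"resistance"}, {"deliberation"}]

resistance_gap = rate_gap(real_non_buyers, sim_non_buyers, "resistance")

# Ratio implied by the percentages reported in the article.
reported_ratio = 13.5 / 25.1
```

Applied to the reported figures, the ratio comes to about 0.54, so simulated non-buyers expressed resistance a little more than half as often as real ones.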
Coordination Gains May Fall Within the Noise Floor
The third paper, by Alibek Kaliyev and Artem Maryanskyy, targets multi-agent coordination benchmarks — experiments that compare different architectures by measuring how much better agents perform when they can communicate and coordinate. The researchers establish a "noise floor" by running configuration-equivalent protocols that should produce identical results, finding paired performance gaps of up to 18 percentage points between runs of nominally identical setups.
Comparing this noise floor with ten recently published multi-agent coordination papers, they find that seven report headline improvements smaller than the variability observed between equivalent runs, and one more falls within the margin. None of the ten papers tested whether its gains would survive a same-model paired replication.
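A noise-floor check of this kind can be sketched in a few lines. The version below assumes the floor is taken as the largest absolute gap between paired runs of nominally identical setups, measured in percentage points; the authors may define it differently, and the function names and scores here are made up for illustration.

```python
def noise_floor(paired_scores):
    """Largest absolute gap, in percentage points, between paired runs of
    nominally identical configurations (ASSUMED definition)."""
    if not paired_scores:
        raise ValueError("need at least one pair of runs")
    return max(abs(a - b) for a, b in paired_scores)


def exceeds_noise_floor(reported_gain, floor):
    """True only if a reported improvement is larger than the noise floor."""
    return reported_gain > floor


# Made-up accuracy scores (%) from paired, configuration-equivalent runs.
equivalent_runs = [(62.0, 58.0), (70.0, 52.0), (55.0, 57.0)]
floor = noise_floor(equivalent_runs)

# Hypothetical headline gains from two coordination architectures.
small_gain_survives = exceeds_noise_floor(12.0, floor)
large_gain_survives = exceeds_noise_floor(21.5, floor)
```

In this toy case the floor is 18 points, so a 12-point headline gain would not be distinguishable from run-to-run variation, while a 21.5-point gain would be.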
Taken together, the three papers do not argue that LLM agent research is without merit. Rather, they call for standardised measurement protocols — including noise-floor reporting, decision-fidelity validation against real outcomes, and diagnostic checks for model-prior artefacts — before empirical claims about agent behaviour are accepted as evidence of genuine social or coordination dynamics.