Two papers from academic researchers, both published on arXiv this week, shed light on the structural limitations of large language models (LLMs) when tasked with generating argumentative text and managing extended dialogue.
Argument Collapse: When AI Flattens the Debate
A study led by Yekyung Kim and colleagues at the University of Massachusetts Amherst examined whether LLMs, when used to draft public-facing arguments, systematically reduce the diversity of viewpoints in public discourse — a phenomenon they term "argument collapse."
The researchers compared 1,039 human responses drawn from 195 New York Times debate prompts and 448 responses from 61 longer Boston Review forums against a dataset of 23,384 LLM-generated essays on the same topics.
The results were striking. In the NYT corpus, 65.3% of human-written main arguments were unique within a given debate, meaning contributors consistently brought fresh perspectives. By contrast, only 3.4% of LLM-generated main arguments were unique — a nearly twentyfold difference.
The homogeneity extended beyond top-level claims. Among essays sharing the same main argument, 41% of human sub-arguments were unique, compared to just 9.1% from LLMs. Qualitatively, the researchers found LLMs tended toward "generalized and hedged" supporting points, while humans favoured concrete, topic-specific reasoning.
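To make the uniqueness figures concrete, here is a minimal sketch of how such a rate could be computed. It assumes each response has already been assigned a main-argument label and counts a response as unique when no other response in the same debate shares its label. The function name, the label scheme, and the toy data are all hypothetical; the article does not describe how the researchers actually grouped or counted arguments.

```python
from collections import Counter


def unique_argument_rate(debates):
    """Share of responses whose main-argument label appears exactly once in its debate.

    `debates` maps a debate id to the list of argument labels given by its responses.
    This definition of "unique" is an assumption made for illustration.
    """
    unique = 0
    total = 0
    for labels in debates.values():
        counts = Counter(labels)
        unique += sum(1 for label in labels if counts[label] == 1)
        total += len(labels)
    return unique / total if total else 0.0


# Toy, made-up labels: not data from the study.
human = {"debate_1": ["a", "b", "c", "a"], "debate_2": ["x", "y"]}
llm = {"debate_1": ["a", "a", "a", "b"], "debate_2": ["x", "x"]}

human_rate = unique_argument_rate(human)  # 4 of 6 responses are unique
llm_rate = unique_argument_rate(llm)      # 1 of 6 responses is unique

# Ratio of the two percentages reported for the NYT corpus (65.3% vs 3.4%).
reported_ratio = 65.3 / 3.4
```

On the reported NYT figures, the ratio works out to a little over 19, which is the basis for the "nearly twentyfold" comparison.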
Structurally, LLM essays also followed a noticeably rigid arc: opening with a direct claim, then moving quickly to policy proposals — a pattern that held across both short NYT responses and longer Boston Review forum contributions.
Asking LLMs to deliberately generate diverse responses improved variety somewhat, but the researchers found a typical model recovered only about half the distinct human arguments, with much of the added variation falling outside the range of arguments humans actually make.
DyCP: A Technical Fix for Long Conversations
Separately, researchers Nayoung Choi, Jonathan Zhang, and Jinho D. Choi proposed a practical solution to a different LLM limitation: the difficulty of efficiently handling long conversations with frequent topic shifts.
While modern LLMs support increasingly large "context windows" (the volume of text they can process at once), filling those windows with an entire conversation history is computationally expensive and slow. Their system, called DyCP (Dynamic Context Pruning), operates outside the LLM itself and dynamically selects only the most relevant portions of a conversation history based on the current message, without requiring any pre-processing or predefined topic boundaries.
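The general idea of query-time context selection can be illustrated with a small sketch. This is not DyCP's algorithm, which the article does not detail. It is a simplified stand-in that scores each past turn against the current message using bag-of-words cosine similarity and keeps the top few in their original order. The names `prune_context` and `STOPWORDS`, the scoring method, and the sample conversation are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Small hypothetical stopword list so common words do not dominate the score.
STOPWORDS = {"the", "a", "to", "for", "of", "and", "i", "my", "you",
             "was", "by", "did", "what", "can", "is"}


def bag_of_words(text):
    """Lowercase word counts with stopwords removed."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def prune_context(history, query, top_k=2):
    """Keep the top_k turns most similar to the query, in conversation order."""
    q = bag_of_words(query)
    scored = [(cosine(bag_of_words(turn), q), i) for i, turn in enumerate(history)]
    best = sorted(scored, key=lambda s: (-s[0], s[1]))[:top_k]
    return [history[i] for i in sorted(i for _, i in best)]


history = [
    "My flight to Boston was delayed by three hours.",
    "Can you recommend a pasta recipe for dinner?",
    "The airline offered a voucher for the delayed flight.",
    "I added garlic and basil to the pasta sauce.",
]
selected = prune_context(history, "What did the airline offer for the flight delay?")
```

In this toy conversation, the two flight-related turns are kept and the cooking turns are dropped, with no topic boundaries marked in advance. A production system would likely use a stronger relevance signal than word overlap.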
Tested across three long-form dialogue benchmarks — LoCoMo, MT-Bench+, and SCM4LLMs — the researchers found DyCP maintained competitive answer quality while using context more selectively and reducing inference time, making it potentially useful for customer service bots, virtual assistants, and other applications requiring sustained multi-turn conversations.
Together, the two papers point to a maturing field grappling with real-world deployment challenges: not just whether LLMs can produce fluent text, but whether that text is genuinely diverse, contextually appropriate, and computationally sustainable at scale.