As artificial intelligence developers increasingly rely on large language models (LLMs) to evaluate the quality of other AI-generated text, a troubling pattern has attracted scrutiny: models appear to rate their own outputs more favourably than those produced by rival systems. The phenomenon, dubbed 'self-preference bias' or colloquially 'LLM narcissism,' has raised questions about the integrity of automated evaluation pipelines used in everything from chatbot training to model selection.
A pair of papers published this week on arXiv by researchers including Dani Roytburg, Matthew Bozoukov, Matthew Nguyen, and colleagues offers a more nuanced picture of the problem, along with a promising, if imperfect, remedy.
Questioning the Narcissism Framing
The first study, 'Are LLM Evaluators Really Narcissists?', urges caution before attributing self-preferring verdicts to bias alone. The researchers argue that prior experiments failed to adequately account for a simpler explanation: a model acting as a judge may favour its own responses not because of identity loyalty, but because it genuinely performs better on questions it also handles well as a respondent.
To test this, the team constructed an 'evaluator quality baseline' by directly comparing how a judge votes when assessing itself versus another model on the same questions. Their analysis found that only 51% of examples cited in previous self-preference research retained statistical significance once this confound was controlled for — though those examples still account for 89.6% of the total measured self-preference probability mass. The researchers also point to uncertainty-driven overlap in voting distributions as a further complicating factor.
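The logic of such a control can be illustrated with a minimal Python sketch. This is not the authors' statistic: the two-proportion z-test, the function name and the vote counts below are all assumptions chosen for illustration. The sketch asks whether a judge picks its own answers significantly more often than it picks a rival model's answers on the same questions.

```python
import math


def two_proportion_z(k1, n1, k2, n2):
    """Return (z, two-sided p-value) for H0: both groups share one vote rate.

    k1/n1: votes for the judge's own answers (hypothetical counts).
    k2/n2: votes for a rival's answers on the same questions (the baseline).
    """
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 0.0, 1.0  # identical, degenerate rates: no evidence of a gap
    z = (p1 - p2) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Invented counts: the judge favours its own answer in 70 of 100 pairs,
# but also favours a rival's answer in 60 of 100 pairs on the same questions.
z_mild, p_mild = two_proportion_z(70, 100, 60, 100)

# A larger gap: 90 of 100 self-votes against the same 60 of 100 baseline.
z_strong, p_strong = two_proportion_z(90, 100, 60, 100)

print(f"mild gap:   z={z_mild:.2f}, p={p_mild:.3f}")
print(f"strong gap: z={z_strong:.2f}, p={p_strong:.2g}")
```

With these invented counts, a raw 70% self-vote rate looks like bias in isolation, yet it is not significant at the 5% level once it is set against the 60% baseline. The 90% rate remains significant.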
The upshot is that while self-preference bias is real, its prevalence and severity in earlier studies may have been overstated.
A Technique to Reduce the Bias
The companion paper, 'Breaking the Mirror,' takes the existence of genuine self-preference bias as its starting point and asks how it can be corrected at inference time — without the expense of retraining a model from scratch.
The researchers introduce a curated dataset that separates 'justified' self-preference (where the model's own output is genuinely better) from 'unjustified' self-preference (where the model favours itself despite producing inferior results). They then construct 'steering vectors' — compact mathematical representations derived from the model's internal activations — using two methods: Contrastive Activation Addition (CAA) and an optimisation-based approach.
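The general idea behind Contrastive Activation Addition can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the paper's implementation: the three-dimensional "activations", the function names and the scaling factor `alpha` are all hypothetical, and a real system would read and modify hidden states inside a chosen layer of the model. The steering vector is taken as the difference between the mean activation on unbiased examples and the mean activation on examples of unjustified self-preference, and is then added to a hidden state at inference.

```python
def mean_vector(rows):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def caa_steering_vector(target_acts, undesired_acts):
    """Mean-difference vector pointing from undesired towards target behaviour."""
    mu_target = mean_vector(target_acts)
    mu_undesired = mean_vector(undesired_acts)
    return [t - u for t, u in zip(mu_target, mu_undesired)]


def steer(hidden, vector, alpha=1.0):
    """Add the scaled steering vector to one hidden state (alpha is assumed)."""
    return [h + alpha * v for h, v in zip(hidden, vector)]


# Toy activations (invented): in this example only the second coordinate
# differs between unbiased verdicts and unjustified self-preference.
unbiased_acts = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
biased_acts = [[1.0, 2.0, 2.0], [3.0, 4.0, 2.0]]

vector = caa_steering_vector(unbiased_acts, biased_acts)
steered = steer([1.0, 3.0, 1.0], vector, alpha=1.0)

print(vector)
print(steered)
```

In this toy case the contrast isolates a single coordinate, so adding the vector removes the biased component and leaves the rest of the hidden state untouched. Real activations are rarely so tidy.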
Applied during inference, these steering vectors reduced unjustified self-preference bias by up to 97%, substantially outperforming both prompting-based interventions and direct preference optimisation baselines.
However, the technique carries a caveat: steering vectors proved unstable when applied to cases of legitimate self-preference and unbiased agreement, suggesting that the underlying bias does not occupy a single, cleanly separable direction in the model's activation space. The authors acknowledge this limits the approach's reliability as a standalone safeguard.
Implications for AI Evaluation
Together, the two papers highlight the complexity of building trustworthy automated evaluation systems. The use of LLMs as judges has become widespread because it is cheap and scalable, but if those judges carry systematic biases — even modest ones — the downstream effects on model training and deployment decisions could be significant.
The researchers call for more robust interventions and more careful experimental design in future bias research, noting that both the diagnosis and the cure remain works in progress.