Steering Vectors in Language Models Work Mainly Through Attention's OV Circuit

A mechanistic analysis of refusal steering reveals that steering vectors interact with attention value computations, not key-query matching, and can be sparsified by up to 99% with little performance loss.

Stephen Cheng · Sarah Wiegreffe · Dinesh Manocha
Research Digest · 3 min read
AI-generated illustration · Zotpaper
Cheng, Wiegreffe, and Manocha investigate why steering vectors — a lightweight technique for controlling language model behavior — actually work at a mechanistic level. Using refusal as a case study, they find that steering vectors primarily operate through the output-value (OV) circuit of the attention mechanism while leaving the query-key (QK) circuit largely untouched, and that the vast majority of steering vector dimensions are unnecessary.

What they did

The authors developed a multi-token activation patching framework to trace how steering vectors propagate through transformer internals. They focused on refusal steering — making models refuse or comply with requests — and applied their analysis across two model families and multiple steering methodologies (e.g., contrastive activation addition, representation engineering). The framework allowed them to isolate which components of the transformer architecture are causally responsible for the behavioral changes induced by steering.
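The core idea of activation patching can be illustrated with a toy stack of layers (a minimal sketch with hypothetical linear "layers", not the authors' implementation): run the model under the steered input, substitute a cached clean activation at one component, and check whether the output reverts. Components whose patch restores the clean output are causally responsible for the steering effect.

```python
import numpy as np

def run_with_patch(layers, x, patch_layer=None, patch_value=None):
    """Run a stack of layer functions, optionally overwriting one
    layer's output with a cached activation (the 'patch')."""
    acts = []
    for i, layer in enumerate(layers):
        x = layer(x)
        if i == patch_layer:
            x = patch_value  # causal intervention: substitute cached activation
        acts.append(x)
    return x, acts

# Toy 3-layer "model": each layer is a fixed random linear map (hypothetical).
rng = np.random.default_rng(0)
layers = [lambda x, W=rng.normal(size=(4, 4)): x @ W for _ in range(3)]

clean, clean_acts = run_with_patch(layers, np.ones(4))
steered_in = np.ones(4) + 0.5   # input shifted as if a steering vector were added
steered, _ = run_with_patch(layers, steered_in)

# Patch the clean layer-1 activation into the steered run:
patched, _ = run_with_patch(layers, steered_in,
                            patch_layer=1, patch_value=clean_acts[1])
# If patching restores the clean output, layer 1 mediates the steering effect.
print(np.allclose(patched, clean))  # True: all layers after the patch see clean activations
```

The multi-token framework in the paper extends this idea across token positions and attention subcomponents rather than whole layers.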

They decomposed the attention mechanism into its constituent parts: the QK circuit (which determines which tokens attend to which positions) and the OV circuit (which determines what information is read from attended positions and written to the residual stream). They also performed a mathematical decomposition of the steered OV circuit to identify interpretable semantic concepts, and tested how aggressively steering vectors could be sparsified while maintaining effectiveness.
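The QK/OV split can be made concrete in a single attention head. Below is a minimal numpy sketch (hypothetical weights and dimensions, not the paper's models): the QK circuit produces the attention pattern, the OV circuit produces the content moved. "Freezing" the QK circuit means computing attention weights from unsteered activations while letting the steered activations flow through the value path.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_head(X_for_qk, X_for_ov, W_Q, W_K, W_V, W_O):
    """One attention head, with the two circuits fed separately so the
    QK circuit can be 'frozen' on clean activations while the OV circuit
    sees steered ones."""
    # QK circuit: token-to-token matching scores -> attention pattern
    scores = (X_for_qk @ W_Q) @ (X_for_qk @ W_K).T / np.sqrt(W_Q.shape[1])
    A = softmax(scores)
    # OV circuit: what each attended token contributes to the output
    return A @ (X_for_ov @ W_V @ W_O)

rng = np.random.default_rng(0)
d, d_h, n = 8, 4, 3
X = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d_h)) for _ in range(3))
W_O = rng.normal(size=(d_h, d))

v = rng.normal(size=d)  # hypothetical steering vector added to the residual stream

out_steered = attention_head(X + v, X + v, W_Q, W_K, W_V, W_O)   # both circuits steered
out_frozen = attention_head(X, X + v, W_Q, W_K, W_V, W_O)        # QK frozen, OV steered
# The paper's finding: outputs like out_frozen preserve most of the steering effect.
```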

Key findings

  • Steering vectors act through the OV circuit, not the QK circuit. Freezing all attention scores (effectively disabling the QK circuit's role during steering) reduced steering performance by only 8.75% across two model families, indicating the QK circuit plays a minimal role.
  • Different steering methodologies use functionally interchangeable circuits. When applied at the same layer, distinct steering methods (e.g., contrastive activation addition vs. representation engineering) activate overlapping internal pathways, suggesting a shared underlying mechanism.
  • Steering vectors can be sparsified by 90–99% while retaining most of their effectiveness, meaning only a tiny fraction of dimensions carry the behaviorally relevant signal.
  • Mathematical decomposition of the steered OV circuit yields semantically interpretable concepts, even in cases where the full steering vector itself resists straightforward interpretation.
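The sparsification finding amounts to keeping only the largest-magnitude coordinates of the steering vector. A minimal sketch (hypothetical 4096-dimensional vector; the paper's selection criterion may differ):

```python
import numpy as np

def sparsify(v, keep_frac=0.01):
    """Zero all but the largest-magnitude coordinates of a steering vector."""
    k = max(1, int(round(keep_frac * v.size)))
    keep = np.argsort(np.abs(v))[-k:]   # indices of the top-k components
    sparse = np.zeros_like(v)
    sparse[keep] = v[keep]
    return sparse

rng = np.random.default_rng(0)
v = rng.normal(size=4096)               # hypothetical steering vector
v_sparse = sparsify(v, keep_frac=0.01)  # 99% of dimensions zeroed

print(np.count_nonzero(v_sparse))       # 41 of 4096 dimensions survive
print(np.dot(v, v_sparse) > 0)          # still aligned with the original direction
```

The paper's result is that vectors sparsified this aggressively retain most of their steering effectiveness when added back into the model.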

Why it matters

Steering vectors have emerged as a practical tool for model alignment, but their inner workings have been opaque. This work provides the first detailed mechanistic account of how they alter model behavior, grounding the explanation in specific transformer subcircuits. The finding that steering is largely an OV-circuit phenomenon constrains future theoretical accounts and could inform more efficient or targeted steering methods. The extreme sparsifiability result suggests that current steering vectors are highly redundant, opening paths toward cheaper and more precise interventions.

Caveats

The study is a case study on refusal — one specific behavioral dimension — and it remains to be seen how well these findings generalize to other steering targets like style, factuality, or persona. The analysis was conducted on two model families, and architectural differences in other models could yield different mechanistic stories. The 8.75% performance drop from freezing attention scores, while small, is not zero, leaving room for QK-mediated effects in edge cases. The semantic interpretability of decomposed OV circuits is assessed qualitatively, and more rigorous evaluation methods for interpretability claims would strengthen the conclusions.

§

Analysis

This work sits at the intersection of mechanistic interpretability and practical alignment techniques. Previous work has shown that steering vectors are effective but has offered limited explanation beyond geometric intuitions in activation space. By connecting steering to the OV/QK decomposition framework from the mechanistic interpretability literature, the authors bridge two active research threads. The sparsifiability finding echoes results in pruning and lottery ticket research — that neural networks often use far fewer effective dimensions than their nominal size suggests — but applies it specifically to the intervention vector rather than model weights. An open question is whether the shared circuits across steering methods imply a universal 'refusal direction' in these models, or whether different behavioral targets would reveal more method-dependent pathways.
