What they did
The authors developed a multi-token activation patching framework to trace how steering vectors propagate through transformer internals. They focused on refusal steering — making models refuse or comply with requests — and applied their analysis across two model families and multiple steering methodologies (e.g., contrastive activation addition, representation engineering). The framework allowed them to isolate which components of the transformer architecture are causally responsible for the behavioral changes induced by steering.
They decomposed the attention mechanism into its constituent circuits: the QK circuit, which determines where each position attends, and the OV circuit, which determines what information is read from the attended positions and written back to the residual stream. They also performed a mathematical decomposition of the steered OV circuit to surface interpretable semantic concepts, and tested how aggressively steering vectors could be sparsified while remaining effective.
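The QK/OV split and the frozen-attention ablation can be illustrated with a toy single-head sketch. Everything below (weights, dimensions, the steering vector itself) is a random placeholder rather than anything from the paper; the point is the data flow, not the numbers.

```python
# Toy single-head illustration of the QK/OV decomposition under steering.
# Weights and the steering vector are random stand-ins, not paper values.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, n_tokens = 64, 16, 5

W_Q = rng.normal(size=(d_model, d_head)) / np.sqrt(d_model)
W_K = rng.normal(size=(d_model, d_head)) / np.sqrt(d_model)
W_V = rng.normal(size=(d_model, d_head)) / np.sqrt(d_model)
W_O = rng.normal(size=(d_head, d_model)) / np.sqrt(d_head)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def head(resid, resid_for_qk=None):
    """Attention head output; optionally compute the attention pattern
    (QK circuit) from a different residual stream than the one the
    OV circuit reads values from."""
    qk_in = resid if resid_for_qk is None else resid_for_qk
    scores = (qk_in @ W_Q) @ (qk_in @ W_K).T / np.sqrt(d_head)  # QK circuit
    pattern = softmax(scores, axis=-1)
    return pattern @ (resid @ W_V) @ W_O                         # OV circuit

resid = rng.normal(size=(n_tokens, d_model))
steer = rng.normal(size=(d_model,))       # hypothetical steering vector
steered = resid + steer                   # added at every token position

full = head(steered)                            # steering flows through QK and OV
frozen_qk = head(steered, resid_for_qk=resid)   # attention pattern held fixed
```

The paper's ablation corresponds to comparing model behavior in the `full` versus `frozen_qk` conditions: if behavior barely changes when the attention pattern is frozen, the OV circuit must carry most of the steering effect.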
Key findings
- Steering vectors act through the OV circuit, not the QK circuit. Freezing all attention scores (so the attention pattern cannot respond to the steering vector) reduced steering performance by only 8.75% across two model families, indicating that the QK circuit plays at most a minimal role.
- Different steering methodologies use functionally interchangeable circuits. When applied at the same layer, distinct steering methods (e.g., contrastive activation addition vs. representation engineering) activate overlapping internal pathways, suggesting a shared underlying mechanism.
- Steering vectors can be sparsified by 90–99% while retaining most of their effectiveness, meaning only a tiny fraction of dimensions carry the behaviorally relevant signal.
- Mathematical decomposition of the steered OV circuit yields semantically interpretable concepts, even in cases where the full steering vector itself resists straightforward interpretation.
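One natural way to implement the sparsification finding is to zero all but the largest-magnitude dimensions of the vector; the paper's exact procedure may differ, and the vector below is a random stand-in rather than a learned refusal direction. This toy version only checks directional overlap, not the behavioral effectiveness the paper actually measures.

```python
# Minimal sketch of magnitude-based sparsification of a steering vector.
# `v` is random; real steering vectors are learned from model activations.
import numpy as np

rng = np.random.default_rng(0)
d_model = 4096
v = rng.normal(size=(d_model,))        # stand-in steering vector

def sparsify(vec, keep_frac):
    """Zero all but the largest-magnitude `keep_frac` of dimensions."""
    k = max(1, int(round(keep_frac * vec.size)))
    keep = np.argsort(np.abs(vec))[-k:]
    out = np.zeros_like(vec)
    out[keep] = vec[keep]
    return out

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

v90 = sparsify(v, 0.10)   # 90% of dimensions zeroed
v99 = sparsify(v, 0.01)   # 99% of dimensions zeroed
```

Even at these sparsity levels the pruned vector retains a substantial component along the original direction; the paper's stronger claim is that the behavioral effect survives, implying the relevant signal lives in a small subset of dimensions.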
Why it matters
Steering vectors have emerged as a practical tool for model alignment, but their inner workings have been opaque. This work provides the first detailed mechanistic account of how they alter model behavior, grounding the explanation in specific transformer subcircuits. The finding that steering is largely an OV-circuit phenomenon constrains future theoretical accounts and could inform more efficient or targeted steering methods. The extreme sparsifiability result suggests that current steering vectors are highly redundant, opening paths toward cheaper and more precise interventions.
Caveats
The study is a case study on refusal — one specific behavioral dimension — and it remains to be seen how well these findings generalize to other steering targets like style, factuality, or persona. The analysis was conducted on two model families, and architectural differences in other models could yield different mechanistic stories. The 8.75% performance drop from freezing attention scores, while small, is not zero, leaving room for QK-mediated effects in edge cases. The semantic interpretability of decomposed OV circuits is assessed qualitatively, and more rigorous evaluation methods for interpretability claims would strengthen the conclusions.