Two research papers published this week show both the promise and the significant limitations of large language models (LLMs) when deployed in cybersecurity defence roles. Structured frameworks can meaningfully improve AI performance in threat hunting, the papers find, but fundamental vulnerabilities in how LLMs reason about cyber threat intelligence remain a serious concern for security teams.
Researchers from multiple institutions have released two complementary studies examining how well large language models perform when tasked with real-world cybersecurity defence work — a question with growing urgency as organisations increasingly look to AI tools to help overwhelmed security analysts manage an expanding threat landscape.
The first paper introduces CyberTeam, a benchmark framework designed to evaluate and guide LLMs through blue team threat-hunting tasks — the proactive work security defenders do to detect threats before they cause damage. Rather than allowing models to reason freely, CyberTeam breaks threat hunting into a structured sequence of 30 discrete tasks and 9 operational modules, covering everything from threat attribution through to incident response.
The researchers found that this standardised, modular approach produced measurable improvements over open-ended reasoning strategies, where LLMs are simply given a problem and asked to work through it without imposed structure. The results suggest that how you deploy an LLM in a security context matters as much as which model you choose.
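To make the contrast between structured and open-ended prompting concrete, here is a minimal Python sketch. It is not CyberTeam's implementation: the stage names, the instructions, the `ModelFn` type and the `echo_model` stub are all hypothetical, chosen only to show the general shape of a workflow in which each step receives a fixed instruction plus the output of the steps before it.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical type: any function that maps a prompt string to a model reply.
ModelFn = Callable[[str], str]


@dataclass
class Stage:
    name: str
    instruction: str


@dataclass
class HuntResult:
    outputs: Dict[str, str] = field(default_factory=dict)


# Illustrative stages only (assumption); not the benchmark's actual task list.
STAGES: List[Stage] = [
    Stage("attribution", "Identify the likely threat actor from the report."),
    Stage("detection", "List indicators a defender should search for."),
    Stage("response", "Propose containment steps for the incident."),
]


def run_structured(report: str, model: ModelFn,
                   stages: List[Stage] = STAGES) -> HuntResult:
    """Run each stage in order, feeding earlier answers into later prompts."""
    result = HuntResult()
    context = report
    for stage in stages:
        prompt = f"[{stage.name}] {stage.instruction}\n\nContext:\n{context}"
        answer = model(prompt)
        result.outputs[stage.name] = answer
        # Later stages see the report plus every earlier stage's answer.
        context = f"{context}\n\n{stage.name}: {answer}"
    return result


def run_open_ended(report: str, model: ModelFn) -> str:
    """Baseline: a single free-form prompt with no imposed structure."""
    return model(f"Analyse this threat report:\n{report}")


def echo_model(prompt: str) -> str:
    """Stand-in for a real LLM call; echoes the first line of the prompt."""
    return f"stub answer for {prompt.splitlines()[0]}"


if __name__ == "__main__":
    hunt = run_structured("Suspicious login activity observed.", echo_model)
    for name, answer in hunt.outputs.items():
        print(name, "->", answer)
```

Swapping `echo_model` for a real model call would let the two functions be compared on the same report. The point of the structured version is that each intermediate answer is recorded and can be inspected separately.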
However, the second paper — from an overlapping group of researchers — offers a sobering counterpoint. In a comprehensive empirical study of LLM vulnerabilities in cyber threat intelligence (CTI) reasoning, the authors argue that the biggest obstacle to reliable AI-assisted security work is not generic model flaws like hallucination, but rather the nature of the threat data itself.
CTI information is, by its nature, heterogeneous, rapidly changing, and fragmented across many sources — properties that standard AI benchmarks rarely capture. The researchers identified three specific cognitive failure modes that emerge under these conditions: spurious correlations drawn from superficial metadata; contradictory knowledge absorbed from conflicting sources; and poor generalisation to emerging or novel threats that fall outside a model's training data.
Using a human-in-the-loop categorisation framework — deliberately avoiding automated 'LLM-as-a-judge' evaluation, which the authors argue introduces its own brittleness — the team validated these failure mechanisms through causal interventions. They reported that targeted defences addressing each failure mode reduced error rates significantly, though they stopped short of claiming the problem is solved.
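The second failure mode, contradictory knowledge from conflicting sources, can be illustrated with a small pre-check. The Python sketch below is not one of the defences evaluated in the paper. It assumes a hypothetical `Claim` record and made-up feed names, and simply flags indicators that different sources attribute to different actors, so that an analyst can review them before they reach a model.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Set


class Claim(NamedTuple):
    """Hypothetical record: one source's attribution for one indicator."""
    source: str
    indicator: str
    actor: str


def find_conflicts(claims: Iterable[Claim]) -> Dict[str, Set[str]]:
    """Return indicators that sources attribute to more than one actor."""
    actors_by_indicator: Dict[str, Set[str]] = defaultdict(set)
    for claim in claims:
        # Normalise case and whitespace so "Group A" and "group a " match.
        actors_by_indicator[claim.indicator].add(claim.actor.strip().lower())
    return {
        indicator: actors
        for indicator, actors in actors_by_indicator.items()
        if len(actors) > 1
    }


if __name__ == "__main__":
    # Made-up example data; the IP addresses are from documentation ranges.
    feed = [
        Claim("feed-1", "198.51.100.7", "Group A"),
        Claim("feed-2", "198.51.100.7", "Group B"),
        Claim("feed-3", "203.0.113.9", "Group A"),
    ]
    for indicator, actors in find_conflicts(feed).items():
        print(indicator, "has conflicting attributions:", sorted(actors))
```

A check like this only catches disagreements that are explicit in structured fields. Contradictions buried in free-text reports would need a different approach.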
Taken together, the two studies paint a nuanced picture of where AI stands in cybersecurity defence. Structured deployment frameworks like CyberTeam can extract more reliable performance from existing models, but the volatile and fragmented nature of real-world threat intelligence continues to expose fundamental gaps in LLM reasoning that generic improvements alone will not address.
For security teams considering AI-assisted tools, the research suggests a cautious approach: structured workflows and human oversight remain essential, and organisations should be wary of deploying LLMs in high-stakes CTI contexts without robust mechanisms to catch domain-specific reasoning failures.
Analysis
Why This Matters
- Security teams worldwide are under increasing pressure to adopt AI tools to manage surging threat volumes, and these studies offer some of the first structured evidence of where those tools reliably succeed and where they dangerously fail.
- Organisations that deploy LLMs in threat intelligence roles without understanding these failure modes risk false confidence, potentially missing genuine threats or acting on spurious correlations.
- The research points toward a roadmap for building more resilient AI security tools, with implications for vendors, enterprises, and government agencies that rely on CTI pipelines.
Background
The use of large language models in cybersecurity has grown rapidly over the past two years, driven by the success of general-purpose models like GPT-4 and Claude and the chronic shortage of skilled security analysts. Vendors have rushed to integrate LLMs into security information and event management (SIEM) platforms, threat intelligence feeds, and incident response tools.
Early deployments showed promise in automating routine tasks such as log summarisation and vulnerability triage, but practitioners quickly noted reliability problems — models would confidently assert incorrect threat attributions, miss novel attack patterns, or produce inconsistent analyses of the same data. Until now, however, systematic benchmarks for blue team and CTI-specific LLM performance have been limited.
The broader academic field of AI safety in security contexts has largely focused on offensive uses of LLMs — helping attackers write malware or phishing content — with the defensive side receiving comparatively less rigorous attention. These two papers represent part of a growing effort to close that gap with empirical, domain-specific evaluation.
Key Perspectives
Security researchers: The authors argue that the field has been too quick to attribute LLM failures in security contexts to generic model limitations. Their work identifies domain-specific failure modes that require domain-specific solutions — a finding that pushes back against the assumption that frontier model improvements alone will resolve the problem.
Security practitioners and blue teams: Defenders who rely on AI-assisted tools need to understand that free-form LLM reasoning is measurably less reliable than structured, modular approaches. The CyberTeam results support a workflow-first philosophy, where AI is a guided component rather than an autonomous analyst.
Critics and sceptics: Some in the security community will note that benchmarks, however well-designed, rarely capture the full complexity of live operational environments. The human-in-the-loop evaluation method used in the CTI vulnerability study is more robust than automated judging, but scaling such evaluation to production environments remains an open challenge. Others may question whether the 'targeted defences' that reduced failure rates are practical to implement at enterprise scale.
What to Watch
- Whether major security vendors such as CrowdStrike, Microsoft and Palo Alto Networks respond to this research by adopting structured, modular LLM deployment frameworks in their AI-assisted products.
- Academic and industry uptake of the CyberTeam benchmark as a standard evaluation tool — its adoption would signal growing consensus around structured LLM deployment in security.
- Emerging regulatory guidance on AI use in critical security infrastructure, particularly from bodies like CISA in the US and the NCSC in the UK, which may reference or build on this kind of empirical work.