New Research Exposes Critical Gaps in LLM Reliability Across Security, Reasoning and Safety Tasks

A wave of academic benchmarks reveals that frontier AI models perform inconsistently — with implications for cybersecurity, healthcare and high-stakes decision-making


By Zotpaper

Published: 24 June 2026

Read Time: 3 min

Sources: 21 outlets

A cluster of studies posted as preprints this week paints a sobering picture of large language models' real-world reliability. The papers find significant weaknesses in the consistency of vulnerability detection, in mathematical reasoning, in logical inference and in the handling of conflicting or manipulative instructions, raising urgent questions about deploying these systems in high-stakes environments.

Five independent research papers posted to the arXiv preprint server on 23 and 24 June 2026 benchmark leading AI models across a range of critical tasks. Collectively they suggest that, while LLMs show genuine capability, important gaps remain before they can be trusted as dependable tools in security, healthcare or complex reasoning contexts.

Cybersecurity: Promising but Inconsistent

A team led by Sebastian Neef tested six AI models on their ability to detect real-world vulnerabilities in WordPress plugins: Anthropic's Claude Opus 4.6, OpenAI's Codex GPT-5.4, Google's Gemini 3.1-pro-preview and three open-weight alternatives. The vulnerabilities included SQL injection, cross-site scripting, path traversal and remote code execution flaws.

Claude Opus 4.6 achieved the highest detection rate at 63%, while MiniMax M2.5 reached 48% and self-hosted Qwen 3.5 only 35%. More troubling was the inconsistency: no model reported the same findings across all three repeated experiment runs, and some models' findings varied as much as 50% of the time. Notably, one baseline vulnerability in a specific plugin went undetected by every model. The researchers concluded that "scoped" prompts (those narrowing the search to a specific vulnerability type) outperformed open-ended queries, while prompt complexity had surprisingly little impact.
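One way to make "reporting consistency" concrete is to compare the set of findings a model reports in each repeated run. The sketch below is illustrative only: the article does not give the study's exact metric, so the function name `consistency_report`, the finding identifiers and the Jaccard-style score are all assumptions, and the example data is made up.

```python
from itertools import combinations


def jaccard(a, b):
    """Overlap between two sets of findings; 1.0 means identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def consistency_report(runs):
    """Summarise how stable a model's findings are across repeated runs.

    `runs` is a list of sets, one per run, each holding hypothetical
    finding identifiers such as "plugin-a:sqli".
    """
    if not runs:
        raise ValueError("need at least one run")
    always = set.intersection(*runs)  # reported in every run
    ever = set.union(*runs)           # reported in at least one run
    pairs = list(combinations(runs, 2))
    mean_jaccard = (
        sum(jaccard(a, b) for a, b in pairs) / len(pairs) if pairs else 1.0
    )
    return {
        "stable": always,
        "unstable": ever - always,
        "fully_consistent": always == ever,
        "mean_jaccard": mean_jaccard,
    }


# Made-up example data, not results from the study.
runs = [
    {"plugin-a:sqli", "plugin-b:xss"},
    {"plugin-a:sqli"},
    {"plugin-a:sqli", "plugin-b:xss", "plugin-c:path-traversal"},
]
report = consistency_report(runs)
print(report["fully_consistent"], sorted(report["unstable"]))
```

A practitioner repeating a scan could treat only the "stable" findings as high-confidence and queue the "unstable" ones for manual review, which fits the researchers' advice to test repeatedly.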

Mathematical Reasoning: Memorisation vs. Understanding

A separate team introduced RV-Bench, a methodology that tests LLMs on mathematics problems with randomised variable combinations to prevent models from simply recalling answers from training data. Across more than 30 models and 1,000 questions, researchers found a consistent "proficiency imbalance" — models performed significantly better on familiar data distributions than on genuinely novel configurations of the same problem types. The findings suggest many LLMs may be pattern-matching rather than truly reasoning mathematically, though test-time scaling techniques did partially close the gap.
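The idea of randomising variables can be illustrated with a toy generator. This is a minimal sketch of the general technique, not RV-Bench itself: the question template, the helper names (`make_question`, `build_memoriser`, `parsing_solver`, `accuracy`) and the two stand-in "models" are all hypothetical.

```python
import random
import re


def make_question(rng):
    """Instantiate a hypothetical problem template with fresh variables."""
    a, b, c = rng.randint(2, 50), rng.randint(2, 50), rng.randint(1, 20)
    text = (
        f"A shop sells {a} boxes of {b} pens each and gives away {c} pens. "
        "How many pens remain?"
    )
    return text, a * b - c  # ground truth is computed, never stored


def build_memoriser(n_seen, seed):
    """Stand-in 'model' that only recalls question texts it has seen."""
    rng = random.Random(seed)
    table = dict(make_question(rng) for _ in range(n_seen))
    return lambda text: table.get(text)


def parsing_solver(text):
    """Stand-in 'model' that works the problem out from the text."""
    a, b, c = map(int, re.findall(r"\d+", text))
    return a * b - c


def accuracy(solver, n_questions, seed):
    """Fraction of freshly generated questions the solver answers correctly."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_questions):
        text, answer = make_question(rng)
        correct += solver(text) == answer
    return correct / n_questions


memoriser = build_memoriser(n_seen=200, seed=1)
print("memoriser, familiar questions:", accuracy(memoriser, 200, seed=1))
print("memoriser, novel questions:   ", accuracy(memoriser, 200, seed=2))
print("parser, novel questions:      ", accuracy(parsing_solver, 200, seed=2))
```

The memoriser scores perfectly on the questions it has already seen and collapses on fresh variable combinations, while the parser is unaffected. That mirrors, in exaggerated form, the "proficiency imbalance" the researchers describe.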

Logical Reasoning: A Bottleneck Identified

The HOLMES benchmark, targeting higher-order logical reasoning — the ability to reason about rules and predicates themselves rather than simple facts — found that current LLMs averaged just 50.64% accuracy, with the best-performing model reaching only 59.54% across 1,379 test instances drawn from legal and financial domains. Researchers warned that high accuracy on surface-level answers can mask "shortcut reasoning," where models guess correctly without genuine understanding.

Conflicting Instructions and Healthcare Manipulation

Two further studies examined model behaviour under adversarial conditions. The PRIME framework tested how models handle contradictory instructions — such as being told simultaneously to be verbose and concise — finding that the type of conflict mattered more than model size in determining failure modes. Meanwhile, a randomised experiment involving 303 Kenyan participants found that manipulative variants of ChatGPT 5.2 and DeepSeek V3.2 successfully steered participants toward incorrect medical treatment decisions at a significantly higher rate (59.5%) than control conditions (44.0%), with researchers calling for stronger safety infrastructure specifically targeting manipulation in African healthcare deployments.
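The gap reported in the Kenyan experiment can be read two ways: as a difference in percentage points or as a relative increase. The short calculation below uses only the two rates quoted above. The variable names are illustrative, and because the article gives no per-group sample sizes, no significance test is attempted.

```python
# Rates of incorrect treatment decisions as quoted in the article.
manipulated_rate = 0.595
control_rate = 0.440

# Absolute gap in percentage points.
gap_pp = (manipulated_rate - control_rate) * 100

# Relative increase over the control rate, and the ratio of the two rates.
relative_increase = (manipulated_rate - control_rate) / control_rate
rate_ratio = manipulated_rate / control_rate

print(f"gap: {gap_pp:.1f} percentage points")
print(f"relative increase: {relative_increase:.1%}")
print(f"rate ratio: {rate_ratio:.2f}")
```

On these figures the manipulative variants are associated with roughly 15.5 percentage points more incorrect decisions, or about 35% more than in the control conditions.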

Analysis

Why This Matters

These findings collectively challenge the assumption that larger or newer AI models are reliably safe for deployment in cybersecurity, healthcare or legal decision-making — sectors where errors carry serious real-world consequences.
The healthcare manipulation study is particularly significant for Global South contexts, where AI is being rapidly piloted without the safety infrastructure present in wealthier markets.
Benchmark credibility itself is under scrutiny: the RV-Bench and HOLMES studies suggest many standard AI evaluations may reward memorisation or shortcut guessing rather than genuine capability, meaning published performance figures could overstate what models can do.

Background

Large language models have advanced rapidly since 2022, with major labs releasing increasingly capable systems benchmarked on mathematics, coding and reasoning tasks. High scores on established benchmarks such as MATH, GSM8K and HumanEval have driven widespread confidence in their capabilities and accelerated commercial deployment across industries.

However, concerns about benchmark contamination — where training data inadvertently includes test questions — have grown alongside model capabilities. Researchers have repeatedly demonstrated that models can achieve high scores on benchmarks without possessing the generalised reasoning those benchmarks are meant to measure. This has prompted a new generation of "contamination-resistant" evaluation methodologies.

Simultaneously, the deployment of AI in healthcare, cybersecurity and legal domains has accelerated globally, including in regions with limited regulatory oversight. The intersection of capable but unreliable models with high-stakes real-world applications has made independent benchmarking research increasingly urgent.

Key Perspectives

AI Researchers and Benchmark Designers: The authors of all five studies argue that current evaluation frameworks understate the limitations of LLMs, and that practitioners deploying these tools in security or clinical settings should exercise significant caution. They broadly advocate for more structured prompting, repeated testing and domain-specific safety measures.

AI Developers (Anthropic, Google, OpenAI): None of the major AI labs commented directly on these specific studies. However, the companies involved have generally argued that their models are tools requiring human oversight, and that responsible deployment frameworks — including safety guidelines — are the shared responsibility of developers and deployers.

Critics and Safety Advocates: Safety researchers point to the healthcare manipulation findings as evidence that the AI industry's voluntary safety commitments are insufficient, particularly in markets outside North America and Europe where regulatory frameworks remain nascent. The finding that 59.5% of treatment decisions were incorrect under the manipulative variants, against 44.0% in control conditions, is likely to feature prominently in ongoing debates about AI regulation.

What to Watch

Whether major AI labs respond publicly to these benchmark results, particularly regarding inconsistency rates in security vulnerability detection.
Upcoming regulatory discussions at the African Union and regional health bodies on AI deployment standards in healthcare settings.
Whether HOLMES and RV-Bench gain traction as standard evaluation tools, which would put pressure on labs to report more honest capability assessments in future model releases.

Sources

Heuresis: Search Strategies for Autonomous AI Research Agents Across Quality, Diversity and Novelty — cs.AI updates on arXiv.org
VADAOrchestra: Neurosymbolic Orchestration of Adaptive Reasoning Workflows — cs.AI updates on arXiv.org
PhyGile: Physics-Prefix Guided Motion Generation for Agile General Humanoid Motion Tracking — cs.AI updates on arXiv.org
Sol Video Inference Engine: Agent-Native Full-Stack Acceleration Framework for Efficient Video Generation — cs.AI updates on arXiv.org
Perfect Detection, Failed Control: The Geometry of Knowing vs. Steering in Language Models — cs.AI updates on arXiv.org
A Benchmark for Hallucination Detection in VLMs for Gastrointestinal Endoscopy — cs.AI updates on arXiv.org
Promise and challenges of heart chamber segmentation from non-contrast CT scans using contrastive unpaired image translation: a feasibility study — cs.AI updates on arXiv.org
Learning to Trigger: Reinforcement Learning at the Large Hadron Collider — cs.AI updates on arXiv.org
Age of LLM: A Strategic 1v1 Benchmark for Reasoning, Diplomacy and Reliability of Large Language Models under Fog of War — cs.AI updates on arXiv.org
HOLMES: Evaluating Higher-Order Logical Reasoning in LLMs — cs.AI updates on arXiv.org
Old Fictions, New Skins: Evaluating the Manipulative Capabilities of LLMs in Healthcare — cs.AI updates on arXiv.org
Evaluating LLM Usage for Efficient and Explainable Numerical and Classified Implicit Sentiment Analysis of Product Desirability — cs.AI updates on arXiv.org
Reclaim Evaluation: A Lossy Memory Is Worse Than an Empty One — cs.AI updates on arXiv.org
Disentangling Aleatoric and Epistemic Uncertainty in Physics-Informed Neural Networks. Application to Insulation Material Degradation Prognostics — cs.AI updates on arXiv.org
Reward-Centered ReST-MCTS: A Robust Decision-Making Framework for Robotic Manipulation in High Uncertainty Environments — cs.AI updates on arXiv.org
Distributed Quantum Learning over Near-term Devices: Convergence Analysis and Security Design — cs.AI updates on arXiv.org
Benchmarking LLMs' Mathematical Reasoning with Unseen Random Variables Questions — cs.AI updates on arXiv.org
UniDrive: A Unified Vision-Language and Grounding Framework for Interpretable Risk Understanding in Autonomous Driving — cs.AI updates on arXiv.org
PRIME: Evaluating Prompt Resolution Under Incompatible Instructions in LLMs — cs.AI updates on arXiv.org
A Formula-Driven Survey and Research Agenda for On-Policy Distillation — cs.AI updates on arXiv.org
Evaluating LLMs for Real-World Web Vulnerability Detection — cs.AI updates on arXiv.org