Offline Reinforcement Learning Gets a Theoretical Shake-Up
One of the more foundational contributions comes from researchers Haolin Liu, Braham Snyder, and Chen-Yu Wei, who settle a long-open question in offline reinforcement learning (RL). Their paper proves that two conditions widely assumed to be sufficient, Q*-realizability and Bellman completeness, are not enough on their own to guarantee sample-efficient learning under partial data coverage. The finding has direct implications for algorithms such as Conservative Q-Learning (CQL), which is commonly used in practice. The team introduces a new theoretical framework that decomposes offline RL complexity into two sub-problems, improving known sample complexity bounds and providing the first analysis of CQL under function approximation.
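For readers unfamiliar with the two conditions named above, their usual textbook forms are sketched below. The notation is an assumption made here for illustration (a discounted MDP with reward r, transition kernel P, discount factor \gamma, and a function class \mathcal{F}) and may differ from the paper's own.

```latex
% Q*-realizability: the optimal action-value function lies in the function class.
Q^\star \in \mathcal{F}

% Bellman completeness: the class is closed under the Bellman optimality operator.
\mathcal{T} f \in \mathcal{F} \quad \text{for all } f \in \mathcal{F},
\qquad
(\mathcal{T} f)(s,a) = r(s,a) + \gamma \, \mathbb{E}_{s' \sim P(\cdot \mid s,a)} \Big[ \max_{a'} f(s',a') \Big]
```

Realizability asks only that one particular function be representable, while completeness constrains the whole class, which is why the second is generally regarded as the stronger requirement.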
The Hidden Cost of Instruction-Tuned Coding Assistants
For software developers, a paper from Chang et al. raises a cautionary note about popular AI coding tools. The study introduces the concept of an "Instruction-Tuning Tax": the trade-off that arises when large language models are fine-tuned to follow natural-language instructions. While instruction-tuned models excel at understanding developer intent expressed in prose (what the authors call "Command mode"), they tend to perform worse at inline code completion and infilling ("Flow mode"). The authors argue that tool builders must balance these two capabilities more carefully rather than treating instruction tuning as universally beneficial.
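To make the two modes concrete, here is a minimal sketch of how the inputs might differ. Everything in it is an illustrative assumption rather than the paper's setup: the sentinel strings `<PRE>`, `<SUF>`, and `<MID>` are hypothetical placeholders (real models each define their own special tokens), and the function names and instruction template are invented for this example.

```python
# Hypothetical sentinel strings; real models each define their own special tokens.
FIM_PREFIX = "<PRE>"
FIM_SUFFIX = "<SUF>"
FIM_MIDDLE = "<MID>"


def build_flow_prompt(prefix: str, suffix: str) -> str:
    """Infilling ("Flow mode"): the model is asked for the code between prefix and suffix."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


def build_command_prompt(instruction: str, code: str) -> str:
    """Instruction following ("Command mode"): the request is stated in prose."""
    return f"Instruction: {instruction}\n\nCode:\n{code}\n\nResponse:\n"


if __name__ == "__main__":
    before = "def area(radius):\n    return "
    after = "\n\nprint(area(2.0))\n"
    # Flow mode: no prose at all, only the code surrounding the cursor.
    print(build_flow_prompt(before, after))
    # Command mode: the developer's intent is written out as an instruction.
    print(build_command_prompt("Add a docstring to this function.",
                               before + "3.14159 * radius ** 2"))
```

The point of the sketch is only that the two inputs look quite different to the model, which is one plausible reason training heavily on one format could cost performance on the other.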
New Benchmarks for Science and Tabular Data
Two new benchmarks aim to better measure AI capabilities in specialist domains. MatSciBench, from researchers at multiple Chinese and US universities, presents 1,340 college-level materials science problems spanning 6 primary fields and 31 subfields. Top models, including DeepSeek-R1 (75.22% on text-only questions) and GPT-5 (53.02% on image questions), leave considerable room for improvement; the most common failures are calculation errors and gaps in domain knowledge.
Separately, TRL-Bench targets tabular data — a workhorse format in enterprise AI — offering a standardised way to compare encoders across different training paradigms. The benchmark's central finding is that no single encoder dominates; performance is capability-specific, suggesting that the best real-world pipelines combine specialist models rather than relying on one general solution.
Scaling Computer-Use Agents and Recommendation Systems
CUA-Gym, from a team including researchers affiliated with Alibaba's Qwen project, addresses a practical bottleneck in training AI agents that operate computers directly. The pipeline automatically generates verified training tasks across 110 simulated environments, producing 32,112 training examples. Models trained on this data achieved 62.1% and 72.6% on a standard computer-use benchmark — outperforming prior open-source systems at comparable scale.
In the recommendation systems space, the Generative Reasoning Reranker (GR2) framework applies reinforcement learning to the under-studied "reranking" phase of content recommendation, outperforming a leading baseline by 2.4% in recall. Notably, the researchers found that language models tend to "reward hack" by preserving the existing item order, underscoring the importance of careful reward design.
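The recall metric and the "preserve the existing order" shortcut can be illustrated with a small sketch. This is not the GR2 reward or evaluation code; the item IDs, the relevance set, and the function names are hypothetical, chosen only to show why an unchanged ordering is a useful baseline to compare against.

```python
from typing import List, Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant items that appear in the top-k positions."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def identity_rerank(candidates: Sequence[str]) -> List[str]:
    """A 'reranker' that returns the incoming order unchanged."""
    return list(candidates)


if __name__ == "__main__":
    candidates = ["a", "b", "c", "d", "e"]  # order from the upstream ranker
    relevant = {"b", "e"}                   # items the user actually engaged with
    reranked = ["e", "b", "a", "c", "d"]    # an order a useful reranker might produce

    base = recall_at_k(identity_rerank(candidates), relevant, k=2)
    new = recall_at_k(reranked, relevant, k=2)
    print(base, new)  # prints: 0.5 1.0
```

Because the upstream order already earns a non-trivial score, a reward that does not measure improvement over that order gives a model little incentive to move anything, which is one way to read the reward-hacking behaviour described above.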
AI-Assisted Motor Engineering
Finally, a paper from KAIST demonstrates a multi-agent AI system for optimising the design of interior permanent magnet synchronous motors (IPMSMs) — components central to electric vehicles and industrial machinery. The system combines retrieval-augmented generation (RAG) with finite element analysis (FEA) simulations, autonomously setting up, running, and refining design experiments. The hybrid approach outperformed both purely simulation-driven and purely AI-driven optimisation under equivalent computational budgets.