Faster Attention for Long-Context Models
Researchers from several Chinese institutions introduced MISA (Mixture of Indexer Sparse Attention), a technique designed to relieve a computational bottleneck in DeepSeek's state-of-the-art sparse attention mechanism. The current approach scores every preceding token with dozens of attention heads, a process that becomes increasingly expensive as documents grow longer.
MISA reframes those attention heads as a mixture-of-experts pool, using a lightweight router to activate only a handful of heads per query rather than all of them. The team reports that using just eight active heads — compared to the original 64 — delivers roughly the same accuracy on the LongBench evaluation suite across DeepSeek-V3.2 and GLM-5 models, while running approximately 3.82 times faster on an NVIDIA H200 GPU. The method requires no additional model training.
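In spirit this is classic top-k expert routing, applied to indexer heads instead of feed-forward experts. The sketch below shows the shape of the idea; the function name `route_indexer_heads`, the tensor layout, and the single-linear-layer router are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn.functional as F

def route_indexer_heads(query, router_weight, head_scores, k=8):
    """Illustrative sketch of top-k routing over indexer heads.

    query:         (batch, d_model)             per-query hidden state
    router_weight: (d_model, num_heads)         lightweight linear router
    head_scores:   (batch, num_heads, seq_len)  each head's relevance score
                                                for every preceding token
    All names and shapes are assumptions, not the paper's implementation.
    """
    logits = query @ router_weight                # (batch, num_heads)
    gate, idx = torch.topk(logits, k, dim=-1)     # keep k of num_heads heads
    gate = F.softmax(gate, dim=-1)                # renormalize over the k

    # Gather the selected heads' token scores and mix them. In a real
    # kernel the other (num_heads - k) heads would never be computed at
    # all, which is where the reported speedup would come from.
    idx = idx.unsqueeze(-1).expand(-1, -1, head_scores.size(-1))
    picked = torch.gather(head_scores, 1, idx)    # (batch, k, seq_len)
    return (gate.unsqueeze(-1) * picked).sum(1)   # (batch, seq_len)
```

The resulting per-token scores would then drive token selection for sparse attention just as the full 64-head mixture would, at a fraction of the cost.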
Mimicking Human Gaze in Vision-Language Models
A separate team proposed GazeVLM, a 4-billion-parameter vision-language model that attempts to replicate the way humans focus attention when examining an image. Standard vision-language models process entire images simultaneously, which the authors argue dilutes spatial reasoning and contributes to hallucinations — instances where models confidently describe things that are not present.
GazeVLM introduces special "gaze tokens" that allow the model to dynamically suppress irrelevant regions and concentrate on task-relevant areas, without cropping images or adding extra visual tokens to the context. The researchers report that GazeVLM outperforms comparable models by nearly 4% and surpasses more complex "agentic" pipelines by more than 5% on the high-resolution benchmarks HRBench-4k and HRBench-8k.
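One plausible realization, sketched below, scores each image patch against a gaze token and converts the result into an additive attention bias. The function `gaze_bias`, the sigmoid gating, and the temperature are illustrative assumptions, not the paper's design.

```python
import torch

def gaze_bias(gaze_token, visual_tokens, temperature=0.1):
    """Hypothetical sketch: turn a learned gaze token into an additive
    attention bias that suppresses task-irrelevant image patches.

    gaze_token:    (batch, d)     learned query conditioned on the task
    visual_tokens: (batch, n, d)  patch embeddings of the full image
    Names, shapes, and the gating scheme are assumptions.
    """
    # Relevance of each patch to the current question or task.
    rel = torch.einsum("bd,bnd->bn", gaze_token, visual_tokens)
    rel = torch.sigmoid(rel / temperature)        # (batch, n) in (0, 1)

    # log(rel) is near 0 for relevant patches and strongly negative for
    # irrelevant ones. Added to attention logits, it dims unwanted
    # regions while keeping the token count and resolution unchanged.
    return torch.log(rel.clamp_min(1e-6))
```

A soft logarithmic bias rather than a hard mask keeps the whole operation differentiable, which matters if the gaze behavior is to be learned end to end.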
Smarter Reinforcement Learning for AI Agents
A team from Baidu and associated institutions presented AEM (Adaptive Entropy Modulation), a method for improving how AI agents learn from trial-and-error interactions with complex environments such as web browsers and software repositories. A longstanding challenge in this domain is that rewards are often sparse — the agent only learns whether it succeeded or failed at the end of a long sequence of actions, making it difficult to identify which steps were helpful.
Rather than introducing additional supervisory signals, AEM monitors the entropy of the model's outputs, a measure of its statistical uncertainty, and uses that signal to balance exploration and exploitation during training. Experiments on ALFWorld, WebShop, and SWE-bench-Verified showed consistent improvements over strong baselines, including a 1.4% gain on the software-engineering benchmark SWE-bench-Verified.
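The mechanism is reminiscent of the automatic entropy tuning used in soft actor-critic, where a coefficient on an entropy bonus is steered toward a target. The sketch below follows that generic recipe; the update rule, names, and hyperparameters are assumptions, and AEM's actual modulation may differ.

```python
import math

def update_entropy_coef(coef, measured_entropy, target_entropy,
                        step_size=1e-2, min_coef=1e-4, max_coef=1.0):
    """Hypothetical modulation rule in the spirit of automatic entropy
    tuning (as in soft actor-critic); AEM's exact update may differ.

    measured_entropy: mean entropy of the policy's recent action
                      distributions, i.e. how uncertain the agent is.
    """
    # Policy more certain than the target allows -> strengthen the
    # entropy bonus (explore more); too uncertain -> weaken it (exploit).
    error = target_entropy - measured_entropy
    coef *= math.exp(step_size * error)
    return min(max(coef, min_coef), max_coef)

# In the training loop, the coefficient would scale an entropy bonus:
#   loss = policy_loss - coef * entropy
```

The appeal of this style of control is that it needs no extra reward signal: the policy's own uncertainty, which is already available at every step, does the steering.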
Genomics Meets Language Models
OmicsLM, developed by researchers at Synexa Life Sciences, combines quantitative gene expression data with natural-language reasoning in a single model. The system was trained on more than 5.5 million examples spanning over 70 biological task types — from predicting cell types to answering open-ended questions about experimental results.
The paper also introduces GEO-OmicsQA, a new benchmark derived from real studies in the Gene Expression Omnibus database, intended to fill a gap in existing evaluations that typically test either numeric prediction or text reasoning, but not both simultaneously.
Clinical AI: Depression and Brain Disorders
Two medically focused papers address uncertainty and reliability in clinical prediction. EviDep applies evidential deep learning to depression severity estimation from audio and video data, producing not just a score but a calibrated measure of how confident that score should be — a property researchers argue is essential before such tools could safely inform clinical decisions.
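A common recipe for calibrated uncertainty in regression is deep evidential regression (Amini et al., 2020), in which the network predicts the parameters of a Normal-Inverse-Gamma distribution instead of a single number. The head below sketches that generic recipe; EviDep's actual architecture and loss are likely to differ in detail.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EvidentialHead(nn.Module):
    """Generic deep-evidential-regression head (after Amini et al., 2020);
    EviDep's formulation may differ. Predicts Normal-Inverse-Gamma
    parameters so the model outputs a severity score plus an estimate
    of how much that score should be trusted."""

    def __init__(self, d_in):
        super().__init__()
        self.proj = nn.Linear(d_in, 4)

    def forward(self, h):
        gamma, log_nu, log_alpha, log_beta = self.proj(h).unbind(-1)
        nu = F.softplus(log_nu)              # evidence for the mean
        alpha = F.softplus(log_alpha) + 1.0  # keep alpha > 1
        beta = F.softplus(log_beta)
        score = gamma                            # predicted severity
        aleatoric = beta / (alpha - 1.0)         # noise inherent in the data
        epistemic = beta / (nu * (alpha - 1.0))  # model's own uncertainty
        return score, aleatoric, epistemic
```

The split between aleatoric and epistemic uncertainty is what makes such outputs clinically interpretable: the former says the signal itself is noisy, the latter says the model has not seen enough similar cases.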
Separately, the MADCLE framework tackles an inconsistency in brain-disorder classification from fMRI scans: results depend heavily on which brain atlas (essentially a map dividing the brain into regions) is used for analysis. MADCLE trains on multiple atlases simultaneously and uses distributional alignment to extract disease-related patterns that are consistent across them.
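One simple instance of distributional alignment is moment matching between the embeddings the same subjects receive under different atlases, as sketched below. The loss form and names are illustrative assumptions, not MADCLE's exact objective.

```python
import torch

def alignment_loss(emb_a, emb_b):
    """Hypothetical moment-matching loss between embeddings of the same
    subjects parcellated with two different atlases; MADCLE's actual
    alignment objective may differ. Pulling the two atlas-conditioned
    distributions together pushes the classifier to rely on
    atlas-invariant, disease-related structure.

    emb_a, emb_b: (batch, d) subject embeddings under atlas A and B.
    """
    mean_gap = (emb_a.mean(0) - emb_b.mean(0)).pow(2).sum()
    cov_gap = (torch.cov(emb_a.T) - torch.cov(emb_b.T)).pow(2).sum()
    return mean_gap + cov_gap
```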
Bridging Vision and Language Representations
Finally, a team from multiple institutions addressed the "modality gap" — a known geometric phenomenon in which text and image embeddings occupy systematically different regions of a shared representation space even when they describe the same content. Their ReAlign strategy uses statistical properties of large unpaired datasets to correct this misalignment without additional training, and their ReVision paradigm integrates this into the pretraining stage of multimodal models, potentially reducing the need for expensive paired image-text datasets.
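Much of the gap is known to show up as a roughly constant offset between the text and image embedding clouds, so the most basic statistic, the per-modality mean estimated from unpaired data, already goes a long way. The sketch below shows that simplest training-free correction; ReAlign's actual procedure may use richer statistics, and the function name here is an assumption.

```python
import torch
import torch.nn.functional as F

def close_modality_gap(text_emb, image_emb):
    """Sketch of a statistics-based, training-free gap correction;
    ReAlign's actual method may be more involved. Each modality's mean
    is estimated from large *unpaired* collections, and the offset
    between the means is removed so the two clouds share a center.

    text_emb:  (n_text, d)  embeddings of unpaired texts
    image_emb: (n_img, d)   embeddings of unpaired images
    """
    gap = text_emb.mean(0) - image_emb.mean(0)  # per-dimension offset
    text_shifted = text_emb - gap / 2           # move each modality
    image_shifted = image_emb + gap / 2         # halfway toward the other
    # Re-normalize if the space is a unit hypersphere (as in CLIP).
    return F.normalize(text_shifted, dim=-1), F.normalize(image_shifted, dim=-1)
```

Because the correction needs only modality-level statistics, not matched image-text pairs, it fits naturally into the pretraining setting the ReVision paradigm describes.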