Research

Digests of notable academic papers, primarily from arXiv

10 articles

Research

Physics-grounded simulation matches real training data for cloth manipulation tasks

Zhou et al. present SIM1, a data engine that converts a small number of real demonstrations into large-scale synthetic training data for robotic manipulation of deformable objects such as cloth. The system digitizes real scenes into metric-accurate virtual twins, calibrates soft-body physics via elastic modeling, and generates diverse trajectories through a diffusion model with quality filtering. Policies trained exclusively on this synthetic data match those trained on real data at a 1:15 equivalence ratio: roughly fifteen synthetic demonstrations substitute for one real one.
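For intuition, here is a minimal, hypothetical sketch of that expand-and-filter loop in Python. The function names, the jerk-based quality filter, and the toy trajectories are illustrative assumptions, not the SIM1 pipeline itself; the point is only the shape of the loop: one calibrated twin, many sampled trajectories, keep those that pass a quality check.

```python
# Hypothetical sketch of a demo-expansion loop; names and the quality
# metric are illustrative assumptions, not the paper's system.
import numpy as np

rng = np.random.default_rng(0)

def digitize_scene(real_demo):
    """Stand-in for building a metric-accurate virtual twin of the scene."""
    return {"cloth_rest_shape": real_demo["cloth_shape"], "elastic_modulus": 1.0}

def sample_trajectory(twin, seed_traj):
    """Stand-in for the trajectory generator: perturb a real demonstration."""
    return seed_traj + rng.normal(scale=0.01, size=seed_traj.shape)

def quality_ok(traj, max_jerk=1.0):
    """Toy quality filter: reject trajectories with implausibly large jerk."""
    return np.abs(np.diff(traj, n=3, axis=0)).max() < max_jerk

def expand_demo(real_demo, n_synthetic=15):
    """One real demo -> roughly fifteen filtered synthetic ones (the 1:15 ratio)."""
    twin = digitize_scene(real_demo)
    out = []
    while len(out) < n_synthetic:
        traj = sample_trajectory(twin, real_demo["trajectory"])
        if quality_ok(traj):
            out.append(traj)
    return out

t = np.linspace(0.0, 1.0, 100)
demo = {
    "cloth_shape": np.zeros((32, 32, 3)),
    "trajectory": np.stack([np.sin(2 * np.pi * (k + 1) * t) for k in range(7)], axis=1),
}
print(len(expand_demo(demo)), "synthetic trajectories from one real demo")
```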

13 Apr·3 min
Computer Vision

Training-free framework improves object counting in text-to-video generation

Text-to-video diffusion models frequently generate the wrong number of objects when given a numeric prompt, rendering two or four dogs when asked for 'three dogs'. The authors introduce NUMINA, a training-free framework that detects count mismatches during generation and corrects them by refining the spatial layout derived from attention maps. Applied to Wan2.1 models of three different sizes, NUMINA improves counting accuracy by up to 7.4 percentage points with no additional training.
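The count check at the heart of such a method can be illustrated in a few lines of Python. This is a hedged sketch under assumptions: the attention map below is synthetic, and counting instances by thresholding the map and labeling connected blobs is one plausible reading of "refining the spatial layout derived from attention maps", not NUMINA's actual implementation.

```python
# Toy count check: threshold a cross-attention map, count connected blobs,
# and flag a mismatch with the prompt's number. Illustrative only.
import numpy as np
from scipy.ndimage import label

def count_instances(attn_map, threshold=0.5):
    """Count connected high-attention regions as object instances."""
    mask = attn_map > threshold * attn_map.max()
    _, n_blobs = label(mask)
    return n_blobs

def needs_correction(attn_map, target_count):
    """True if the generated layout disagrees with the prompted count."""
    return count_instances(attn_map) != target_count

# Synthetic attention map with two hot spots while the prompt asked for three.
attn = np.zeros((64, 64))
attn[10:20, 10:20] = 1.0
attn[40:50, 40:50] = 1.0
print(count_instances(attn))       # 2
print(needs_correction(attn, 3))   # True: trigger layout refinement
```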

13 Apr·3 min
Research

Neural Network Decoder Unlocks a Steep Error-Suppression Regime in Quantum LDPC Codes

Gu et al. introduce a structure-aware convolutional neural network decoder for quantum error-correcting codes that matches the geometric layout of the code. Applied to the [[144, 12, 12]] Gross code, it reveals a previously hidden "waterfall" regime of steep error suppression, reaching logical error rates of ~10⁻¹⁰ at a 0.1% physical error rate, with latencies compatible with real-time operation on current quantum hardware platforms.
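As a rough illustration of what "structure-aware" means here, the sketch below applies a small 2D convolutional network to syndrome measurements arranged on a grid. The 12×12 layout, channel counts, and layer sizes are assumptions chosen for the example, not the decoder architecture from the paper.

```python
# Illustrative PyTorch sketch of a CNN syndrome decoder: map a 2D layout of
# stabilizer measurement outcomes to one logit per logical qubit.
import torch
import torch.nn as nn

class SyndromeCNN(nn.Module):
    def __init__(self, n_logical=12):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, 32, kernel_size=3, padding=1),  # X/Z syndrome channels
            nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 12 * 12, n_logical),  # one logit per logical qubit
        )

    def forward(self, syndromes):
        # syndromes: (batch, 2, 12, 12) binary measurement outcomes
        return self.net(syndromes)

decoder = SyndromeCNN()
fake_syndromes = torch.randint(0, 2, (8, 2, 12, 12)).float()
print(decoder(fake_syndromes).shape)  # torch.Size([8, 12])
```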

13 Apr·3 min
AI & ML

Mobile AI Agents Fail When They Must Infer User Preferences on Their Own

Chen et al. introduce KnowU-Bench, an interactive benchmark for evaluating personalized mobile agents across 192 tasks in a live Android emulation environment. Their key finding: agents that perform well on explicit instructions see performance drop below 50% when they must infer hidden user preferences or decide when to proactively assist, exposing a fundamental gap between interface competence and genuine personal assistance.

13 Apr·3 min
AI & ML

Steering Vectors in Language Models Work Mainly Through Attention's OV Circuit

Cheng, Wiegreffe, and Manocha investigate why steering vectors — a lightweight technique for controlling language model behavior — actually work at a mechanistic level. Using refusal as a case study, they find that steering vectors primarily operate through the output-value (OV) circuit of the attention mechanism while leaving the query-key (QK) circuit largely untouched, and that the vast majority of steering vector dimensions are unnecessary.
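A small NumPy sketch makes the QK/OV distinction concrete. Adding a steering vector v to the residual stream can change a head's output in two ways: by reweighting attention (the QK path) or by changing what each attended token writes back (the OV path). The weights and shapes below are random and illustrative; the comparison simply shows how an "OV-only" effect of v would be isolated, which is the kind of decomposition the paper's analysis rests on.

```python
# Toy single-head attention: compare full steering against an OV-only path.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 64, 16
W_Q, W_K, W_V, W_O = (rng.normal(scale=0.1, size=s) for s in
                      [(d_model, d_head), (d_model, d_head),
                       (d_model, d_head), (d_head, d_model)])

x = rng.normal(size=(8, d_model))   # residual stream for 8 tokens
v = rng.normal(size=d_model)        # steering vector added to every token

def head_output(resid):
    q, k, val = resid @ W_Q, resid @ W_K, resid @ W_V
    attn = np.exp(q @ k.T / np.sqrt(d_head))
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ val @ W_O

baseline = head_output(x)
steered = head_output(x + v)

# OV-only effect: v reaches W_V/W_O but the attention pattern is unchanged.
# Because attention weights sum to 1, this adds v @ W_V @ W_O to every token.
ov_only = baseline + v @ W_V @ W_O
print(np.linalg.norm(steered - baseline))  # total effect of steering
print(np.linalg.norm(steered - ov_only))   # residual attributable to the QK path
```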

13 Apr·3 min·8 sources
NLP

LLMs Simulating Human Behavior Converge Toward an Unrealistically Positive "Average Person"

Chen et al. introduce OmniBehavior, the first user simulation benchmark constructed entirely from real-world data spanning multiple scenarios and long time horizons. Their evaluation of state-of-the-art LLMs reveals a fundamental structural bias: models consistently simulate an overly active, homogenized, and optimistic version of human behavior, losing the diversity and idiosyncrasies present in authentic behavioral data.

12 Apr·3 min
NLP

Language Models Learn Skills in a Predictable, Compositional Order During Training

Liu et al. propose and test the "Implicit Curriculum Hypothesis" — that pretraining follows a predictable, compositional curriculum rather than acquiring skills in an arbitrary order. By tracking when specific capabilities emerge across four model families (410M–13B parameters), they find highly consistent orderings (Spearman ρ = 0.81 across 45 model pairs) and show that composite tasks reliably emerge after their constituent subtasks.
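The ordering comparison itself is simple to reproduce in spirit: record the training step at which each skill first emerges in two model families, then correlate the two orderings. The skill names and emergence steps below are made up for illustration; only the use of Spearman's ρ mirrors the paper.

```python
# Compare the skill-acquisition order of two hypothetical model families.
from scipy.stats import spearmanr

skills = ["copying", "addition", "negation", "coreference",
          "two-hop reasoning", "multi-step arithmetic"]

# Hypothetical first-emergence steps (in thousands) for two model families.
emergence_a = [2, 5, 9, 14, 30, 55]
emergence_b = [3, 9, 4, 12, 60, 35]

rho, p = spearmanr(emergence_a, emergence_b)
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")
# A rho near 1 means both families acquire the skills in nearly the same
# order, the kind of consistency the paper reports (ρ = 0.81 across 45 pairs).
```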

12 Apr·3 min
Computer Vision

Open-Source Web Agents Match or Beat Proprietary Models on Browser Tasks

Researchers at the Allen Institute for AI introduce MolmoWeb, a pair of open multimodal web agents (4B and 8B parameters) that navigate websites using only screenshots and task instructions—no HTML or accessibility tree access required. Trained on MolmoWebMix, a dataset combining over 100K synthetic task trajectories with 30K+ human demonstrations, the 8B model achieves state-of-the-art results among open models and surpasses set-of-marks agents built on GPT-4o. All model weights, training data, code, and evaluation infrastructure will be publicly released.
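The screenshot-only control loop the summary describes can be sketched as below. The `Action` schema, the `policy` interface, and the browser stubs are hypothetical stand-ins, not the released MolmoWeb API; the sketch just shows that the agent acts from pixels and task text alone, with no HTML or accessibility tree.

```python
# Hypothetical screenshot-only agent loop with toy stubs so it runs end to end.
from dataclasses import dataclass

@dataclass
class Action:
    kind: str           # "click", "type", or "stop"
    x: int = 0
    y: int = 0
    text: str = ""

def run_episode(policy, browser, task, max_steps=20):
    """Drive the browser from screenshots only: no HTML or accessibility tree."""
    for _ in range(max_steps):
        screenshot = browser.screenshot()     # raw pixels of the current page
        action = policy(screenshot, task)     # model proposes the next action
        if action.kind == "stop":
            return True
        if action.kind == "click":
            browser.click(action.x, action.y)
        elif action.kind == "type":
            browser.type_text(action.text)
    return False

# Tiny stubs standing in for a real browser and the agent model.
class FakeBrowser:
    def screenshot(self): return b"png-bytes"
    def click(self, x, y): print(f"click({x}, {y})")
    def type_text(self, text): print(f"type({text!r})")

def fake_policy(screenshot, task):
    return Action("stop")

print(run_episode(fake_policy, FakeBrowser(), "find the cheapest flight"))
```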

12 Apr·3 min