Training-free framework improves object counting in text-to-video generation
Text-to-video diffusion models frequently generate the wrong number of objects when given a numeric prompt; for example, a prompt specifying 'three dogs' may yield two or four. The authors introduce NUMINA, a training-free framework that detects count mismatches during generation and corrects them by refining the spatial layout derived from attention maps. Applied to Wan2.1 models at three different sizes, NUMINA improves counting accuracy by up to 7.4 percentage points with no additional training.
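The summary does not spell out how mismatches are detected, but the core idea (estimate an object count from cross-attention maps mid-generation, then trigger a layout correction when it disagrees with the prompt) can be sketched. The snippet below is a minimal, hypothetical illustration: it counts connected high-attention regions in a 2D attention map as a proxy for the object count, and flags a mismatch. The function names and the thresholding-plus-connected-components heuristic are assumptions for illustration, not NUMINA's actual algorithm.

```python
import numpy as np
from collections import deque

def count_objects(attn_map, thresh=0.5):
    """Count connected high-attention regions in a 2D cross-attention map.

    A crude proxy for the per-token object count: threshold the map and
    count 4-connected components. This is a hypothetical stand-in for
    NUMINA's layout extraction, which the summary does not detail.
    """
    mask = attn_map > thresh
    seen = np.zeros_like(mask, dtype=bool)
    h, w = mask.shape
    count = 0
    for i in range(h):
        for j in range(w):
            if mask[i, j] and not seen[i, j]:
                count += 1  # found a new unvisited region
                q = deque([(i, j)])
                seen[i, j] = True
                while q:  # flood-fill the whole region
                    y, x = q.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w \
                                and mask[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            q.append((ny, nx))
    return count

def needs_correction(attn_map, target_count, thresh=0.5):
    """Flag a count mismatch mid-generation. In a training-free pipeline,
    the caller would then refine the spatial layout (e.g. add or suppress
    an attention region) before continuing denoising."""
    return count_objects(attn_map, thresh) != target_count

# Toy usage: an attention map with two high-attention blobs.
attn = np.zeros((16, 16))
attn[2:5, 2:5] = 1.0
attn[10:13, 10:13] = 1.0
print(count_objects(attn))          # counts the two blobs
print(needs_correction(attn, 3))    # prompt asked for three -> mismatch
```

In a real detect-and-correct loop, this check would run at an intermediate denoising step, when attention maps are already spatially informative but the layout can still be adjusted cheaply.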