Training-free framework improves object counting in text-to-video generation
Text-to-video diffusion models frequently generate the wrong number of objects when given a numeric prompt; for example, a prompt specifying 'three dogs' may yield two or four. The authors introduce NUMINA, a training-free framework that detects count mismatches during generation and corrects them by refining the spatial layout derived from attention maps. Applied to Wan2.1 models at three different sizes, NUMINA improves counting accuracy by up to 7.4 percentage points with no additional training.
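The summary does not spell out how mismatches are detected, but the core idea (estimate an object count from cross-attention maps mid-generation, then trigger a layout correction when it disagrees with the prompt) can be sketched. The snippet below is a minimal, hypothetical illustration: it counts connected high-attention regions in a 2D attention map as a proxy for the object count, and flags a mismatch. The function names and the thresholding-plus-connected-components heuristic are assumptions for illustration, not NUMINA's actual algorithm.

```python
import numpy as np
from collections import deque

def count_objects(attn_map, thresh=0.5):
    """Count connected high-attention regions in a 2D cross-attention map.

    A crude proxy for the per-token object count: threshold the map and
    count 4-connected components. This is a hypothetical stand-in for
    NUMINA's layout extraction, which the summary does not detail.
    """
    mask = attn_map > thresh
    seen = np.zeros_like(mask, dtype=bool)
    h, w = mask.shape
    count = 0
    for i in range(h):
        for j in range(w):
            if mask[i, j] and not seen[i, j]:
                count += 1  # found a new unvisited region
                q = deque([(i, j)])
                seen[i, j] = True
                while q:  # flood-fill the whole region
                    y, x = q.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w \
                                and mask[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            q.append((ny, nx))
    return count

def needs_correction(attn_map, target_count, thresh=0.5):
    """Flag a count mismatch mid-generation. In a training-free pipeline,
    the caller would then refine the spatial layout (e.g. add or suppress
    an attention region) before continuing denoising."""
    return count_objects(attn_map, thresh) != target_count

# Toy usage: an attention map with two high-attention blobs.
attn = np.zeros((16, 16))
attn[2:5, 2:5] = 1.0
attn[10:13, 10:13] = 1.0
print(count_objects(attn))          # counts the two blobs
print(needs_correction(attn, 3))    # prompt asked for three -> mismatch
```

In a real detect-and-correct loop, this check would run at an intermediate denoising step, when attention maps are already spatially informative but the layout can still be adjusted cheaply.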