Training-free framework improves object counting in text-to-video generation

NUMINA uses attention-head selection and layout refinement to align generated video content with numeric specifications in prompts, without retraining models.

Zhengyang Sun · Yu Chen · Xin Zhou · Xiaofan Li · Xiwu Chen · Dingkang Liang · +1 more
Research Digest · 3 min read
Text-to-video diffusion models frequently generate the wrong number of objects when given a numeric prompt, such as 'three dogs' producing two or four · AI-generated illustration · Zotpaper
Text-to-video diffusion models frequently generate the wrong number of objects when given a numeric prompt, such as 'three dogs' producing two or four. The authors introduce NUMINA, a training-free framework that detects count mismatches during generation and corrects them by refining the spatial layout derived from attention maps. Applied to Wan2.1 models of three different sizes, NUMINA improves counting accuracy by up to 7.4 percentage points with no additional training.

What they did

The authors developed NUMINA, a plug-in framework for existing text-to-video diffusion models that requires no parameter updates. The system works in two stages: first, it selects discriminative self-attention and cross-attention heads to construct a 'countable latent layout' — a spatial map representing how many discrete object instances are present in the current generation. It then compares this layout against the numeric specification in the prompt to detect inconsistencies.
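
To make the first stage concrete, here is a minimal sketch in PyTorch, assuming per-head cross-attention maps of shape [num_heads, H, W] for the counted noun token. The entropy-based head scoring, the top-k of 4, and the 0.5 binarization threshold are illustrative assumptions, not the paper's exact criteria.

```python
import torch
from scipy import ndimage

def build_countable_layout(attn_maps: torch.Tensor, top_k: int = 4) -> torch.Tensor:
    """Fuse the most discriminative attention heads into a spatial layout.

    attn_maps: [num_heads, H, W] cross-attention for the counted noun token.
    """
    flat = attn_maps.flatten(1)
    # Min-max normalize each head so their magnitudes are comparable.
    lo = flat.min(dim=1, keepdim=True).values
    hi = flat.max(dim=1, keepdim=True).values
    norm = ((flat - lo) / (hi - lo + 1e-8)).view_as(attn_maps)
    # Score heads by spatial entropy: peaky, low-entropy maps tend to
    # separate object instances cleanly (an assumed selection criterion).
    probs = flat.softmax(dim=1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1)
    keep = entropy.topk(top_k, largest=False).indices
    return norm[keep].mean(dim=0)  # the 'countable latent layout', [H, W]

def count_mismatch(layout: torch.Tensor, target: int, thresh: float = 0.5) -> int:
    """Binarize the layout, count connected blobs, return surplus (+) or deficit (-)."""
    mask = (layout > thresh).cpu().numpy()
    _, n_blobs = ndimage.label(mask)
    return n_blobs - target
```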

When a mismatch is found, NUMINA conservatively refines the latent layout and modulates cross-attention maps to steer subsequent denoising steps toward the correct count. The authors evaluated the method on a newly introduced benchmark, CountBench, using Wan2.1 models at 1.3B, 5B, and 14B parameter scales.
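
A similarly hedged sketch of the correction step, assuming access to the denoiser's cross-attention map for that token: surplus blobs with the least attention mass are pruned from the layout, and the map is reweighted so subsequent denoising steps stop reinforcing them. The blob-mass ranking and the gain of 2.0 are illustrative choices, not NUMINA's exact rule.

```python
import numpy as np
import torch
from scipy import ndimage

def refine_layout(layout: torch.Tensor, target: int, thresh: float = 0.5) -> torch.Tensor:
    """Conservatively prune surplus blobs; never invent new instances here."""
    mask = (layout > thresh).cpu().numpy()
    labels, n = ndimage.label(mask)
    if n <= target:
        return layout  # a deficit would need a different (additive) strategy
    # Keep the `target` blobs carrying the greatest total attention mass.
    mass = ndimage.sum(layout.cpu().numpy(), labels, index=range(1, n + 1))
    keep_ids = np.argsort(mass)[-target:] + 1
    keep = torch.from_numpy(np.isin(labels, keep_ids)).to(layout)
    return layout * keep

def modulate_cross_attention(attn_map: torch.Tensor, refined: torch.Tensor,
                             gain: float = 2.0) -> torch.Tensor:
    """Reweight the token's attention map toward the refined layout.

    Interpolates between attenuation (x 1/gain) where the refined layout is
    empty and amplification (x gain) where it is active, so later denoising
    steps are steered toward the corrected count.
    """
    weight = 1.0 / gain + (gain - 1.0 / gain) * refined
    return attn_map * weight
```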

Key findings

  • Counting accuracy improved by 7.4 percentage points on Wan2.1-1.3B, 4.9 points on the 5B model, and 5.5 points on the 14B model relative to unmodified baselines.
  • CLIP alignment scores, which measure overall prompt–video semantic correspondence, improved alongside counting accuracy, suggesting the intervention does not sacrifice general prompt fidelity (a scoring sketch follows this list).
  • Temporal consistency was maintained, meaning the corrections did not introduce visible flickering or frame-to-frame incoherence.
  • The authors find that structural layout guidance is complementary to, not a replacement for, seed search and prompt engineering strategies.
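
For reference, a CLIP alignment score of the kind cited is typically computed by embedding the prompt and sampled frames and averaging the cosine similarities. This sketch uses the Hugging Face openai/clip-vit-base-patch32 checkpoint as an assumed stand-in; the digest does not specify the exact CLIP variant or frame-sampling protocol.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

def clip_alignment(frames: list[Image.Image], prompt: str) -> float:
    """Mean cosine similarity between the prompt and sampled video frames."""
    name = "openai/clip-vit-base-patch32"  # assumed checkpoint, not the paper's
    model = CLIPModel.from_pretrained(name)
    proc = CLIPProcessor.from_pretrained(name)
    inputs = proc(text=[prompt], images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # Normalize projected embeddings, then average frame-prompt cosines.
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img @ txt.T).mean().item()
```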

Why it matters

Numerical alignment is a persistent and underexplored failure mode in generative video models. A training-free approach is practically significant because it can be applied to commercial or large-scale models whose weights are not freely modifiable. The attention-head selection methodology also offers a potentially transferable technique for diagnosing other structural failures in diffusion generation.

Caveats

All experiments are conducted on a single model family (Wan2.1), so generalizability to architecturally distinct video diffusion models is untested. CountBench is introduced by the same authors, raising questions about benchmark design independence. The evaluation focuses on relatively small object counts; performance on larger or more complex count targets is not characterized. The conservative layout refinement strategy may limit improvements in cases where the initial layout is severely incorrect.

Analysis

Counting accuracy has been a known weakness of diffusion models since their earliest image-generation incarnations; structured attention manipulation and layout conditioning have both been explored as remedies in the image domain. NUMINA extends this line of work to video, where temporal coherence adds a non-trivial constraint: correcting the spatial count mid-generation must not destabilize frame-to-frame consistency. The training-free framing distinguishes this from approaches that fine-tune models on counting-specific data, which tend to degrade general capability.

The paper also raises an implicit question about scaling: the 1.3B model benefits more than the 5B and 14B models, suggesting that larger models may be better at counting to begin with, narrowing the margin available for improvement. Whether attention-based layout supervision remains the right abstraction as model architectures evolve — particularly if future architectures reduce interpretable spatial structure in attention maps — remains to be seen.
