Researchers Unveil Three Advances in Speculative Decoding to Speed Up AI Language Models

New techniques target a core bottleneck in LLM inference, with reported throughput gains of up to 88%

By Zotpaper
Read time: 4 min
Sources: 27 outlets
Three independent research teams have published papers proposing distinct approaches to accelerating large language model inference through improved speculative decoding — a technique in which a smaller, faster model generates candidate text that a larger model then verifies. The work, released in June 2026, addresses persistent inefficiencies in how AI systems generate text, with potential implications for deploying powerful models on resource-constrained devices and at scale.

Large language models generate text one token at a time, and each token requires a full forward pass through the model. That sequential dependency is a speed bottleneck that faster hardware alone does not remove. Speculative decoding, in which a lightweight 'draft' model proposes several tokens ahead and the larger model then checks them together in one parallel pass, has emerged as one of the most promising strategies for overcoming this constraint. This week, three research groups released papers pushing the technique in different directions.
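
As a rough illustration of the draft-then-verify loop described above, here is a minimal sketch in Python. It assumes greedy decoding and uses two toy stand-in functions, draft_model and target_model, which are hypothetical and not taken from any of the three papers. Real systems verify all drafted positions in one batched forward pass and usually use a probabilistic acceptance rule rather than exact matching.

```python
from typing import Callable, List

# A "model" here is any function that maps a token sequence to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft: NextToken, target: NextToken, k: int) -> List[int]:
    """Run one draft-then-verify round and return the newly committed tokens."""
    # 1. The cheap draft model proposes k tokens, one after another.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2. The target model checks every drafted position. A real system does this
    #    in a single batched forward pass; a plain loop is used here for clarity.
    committed: List[int] = []
    for i in range(k):
        expected = target(prefix + proposed[:i])
        if proposed[i] != expected:
            # First mismatch: keep the target's own token and discard the rest.
            committed.append(expected)
            return committed
        committed.append(proposed[i])

    # 3. Every draft was accepted, so the target's pass also yields one extra token.
    committed.append(target(prefix + proposed))
    return committed


def generate(prefix: List[int], draft: NextToken, target: NextToken, k: int, max_new: int) -> List[int]:
    """Repeat draft-then-verify rounds until max_new tokens have been produced."""
    out = list(prefix)
    while len(out) - len(prefix) < max_new:
        out.extend(speculative_step(out, draft, target, k))
    return out[: len(prefix) + max_new]


# Toy stand-ins (hypothetical): the target counts upward modulo 10,
# and the draft agrees with it except right after the token 4.
def target_model(seq: List[int]) -> int:
    return (seq[-1] + 1) % 10


def draft_model(seq: List[int]) -> int:
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10
```

Under greedy decoding the committed output is identical to what the target model would have produced alone; the draft model only changes how many target passes are needed to get there.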

Distributed inference at the network edge

A team including Haotian Zheng, Zhanwei Wang, and colleagues proposed Multi-SPIN, an architecture designed to distribute speculative decoding across a multiuser edge computing environment. In their framework, users' own devices run small language models that generate token drafts locally, while a shared edge server runs the full LLM to verify those drafts in parallel batches.

The team identified draft length — how many candidate tokens each device submits — as a critical control variable affecting both computation load and network latency. They developed joint optimisation algorithms for draft-length control and bandwidth allocation, testing two scenarios: one where all users submit drafts of the same length to simplify server-side batching, and one where draft lengths vary across users. Experiments using Llama-2 and Qwen3.5 model pairs showed Multi-SPIN improving token throughput by up to 88% compared to baselines that ignore hardware heterogeneity among users.
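
To show why draft length matters when devices differ in speed, the sketch below compares the two scenarios under a deliberately simplified cost model. It is not the Multi-SPIN algorithm: it ignores bandwidth allocation entirely, and the User fields, the fixed verify_ms cost, and the brute-force search are all assumptions made for illustration. The only borrowed result is the standard speculative decoding estimate that a draft of length g with independent per-token acceptance rate a commits (1 - a^(g+1)) / (1 - a) tokens per round on average.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class User:
    draft_ms_per_token: int  # hypothetical on-device drafting cost, in milliseconds
    accept_rate: float       # hypothetical per-token acceptance probability


def expected_tokens(accept_rate: float, draft_len: int) -> float:
    """Expected tokens committed per round, assuming independent acceptances."""
    if accept_rate >= 1.0:
        return draft_len + 1.0
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


def throughput(users: List[User], lengths: List[int], verify_ms: int) -> float:
    """Tokens per millisecond when the server waits for the slowest draft, then verifies."""
    round_ms = max(u.draft_ms_per_token * g for u, g in zip(users, lengths)) + verify_ms
    tokens = sum(expected_tokens(u.accept_rate, g) for u, g in zip(users, lengths))
    return tokens / round_ms


def best_uniform(users: List[User], verify_ms: int, max_len: int = 16) -> List[int]:
    """Scenario 1: every user submits a draft of the same length."""
    candidates = [[g] * len(users) for g in range(1, max_len + 1)]
    return max(candidates, key=lambda ls: throughput(users, ls, verify_ms))


def best_per_user(users: List[User], verify_ms: int, max_len: int = 16) -> List[int]:
    """Scenario 2: each user drafts as many tokens as fit before a shared deadline."""
    deadlines = sorted({u.draft_ms_per_token * g for u in users for g in range(1, max_len + 1)})
    candidates = []
    for deadline in deadlines:
        lengths = [min(max_len, deadline // u.draft_ms_per_token) for u in users]
        if min(lengths) >= 1:  # skip deadlines the slowest device cannot meet
            candidates.append(lengths)
    return max(candidates, key=lambda ls: throughput(users, ls, verify_ms))


if __name__ == "__main__":
    users = [User(draft_ms_per_token=5, accept_rate=0.8),
             User(draft_ms_per_token=20, accept_rate=0.8)]
    for name, pick in (("uniform", best_uniform), ("per-user", best_per_user)):
        lengths = pick(users, verify_ms=30)
        print(name, lengths, round(throughput(users, lengths, 30), 4))
```

In this toy model the per-user schedule can never do worse than the uniform one, because a fast device can always fit at least as many draft tokens into the time the slowest device needs. The uniform scenario in the paper trades that headroom for simpler server-side batching.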

Mixing drafting strategies within a single sequence

A team led by Young D. Kwon at Samsung AI proposed WhiFlash, which takes a different approach: rather than committing to one drafting method for an entire generation task, the system switches dynamically between autoregressive draft models and diffusion-based parallel drafting models at the token level.

The researchers found empirically that the accuracy of either drafting paradigm fluctuates substantially within a single sequence, meaning a static choice of method leaves performance on the table. WhiFlash introduces a routing controller that selects the better drafting approach for each token, using either an entropy-based heuristic or a learned neural policy. To make the rapid switching computationally practical, the team developed two cache-management techniques — Lazy Catch-up and KV-only Prefill — that hold switching overhead below 7% of per-round latency. Against state-of-the-art baselines, WhiFlash reported throughput gains of up to 69.6% over the autoregressive EAGLE-3 system and 37.3% over the diffusion-based DFlash.
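
An entropy-based routing heuristic of the kind mentioned above could look something like the following sketch. The article does not give WhiFlash's actual rule, so the direction of the test (low entropy routes to the autoregressive drafter), the threshold value, and the function names are all assumptions made for illustration.

```python
import math
from typing import Sequence


def shannon_entropy(probs: Sequence[float]) -> float:
    """Entropy in nats of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def choose_drafter(probs: Sequence[float], threshold_nats: float = 1.0) -> str:
    """Pick a drafting paradigm from the draft model's uncertainty.

    Assumed rule (hypothetical): a confident, low-entropy distribution stays
    with the autoregressive drafter; an uncertain one switches to the
    diffusion-based parallel drafter.
    """
    if shannon_entropy(probs) <= threshold_nats:
        return "autoregressive"
    return "diffusion"


if __name__ == "__main__":
    confident = [0.97, 0.01, 0.01, 0.01]
    uncertain = [0.125] * 8
    print(choose_drafter(confident), choose_drafter(uncertain))
```

A learned neural policy would replace the fixed threshold with a small model trained to predict which drafter will have more tokens accepted.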

Rethinking how draft models are trained

A third team, led by Xiandong Zou, attacked a more fundamental issue: the mismatch between how draft models are trained and how they are actually used. Current training methods optimise a draft model to predict the single most likely next token, but at inference time the model must generate multiple candidate paths that will be ranked and accepted or rejected by the larger model.

Their Variational Speculative Decoding (VSD) framework reframes draft training as a variational inference problem, maximising the probability that the larger target model will accept a draft sequence. The method uses an Expectation-Maximization procedure with Adaptive Rejection Weighting and Confidence-Aware Regularization to improve draft quality and reduce training variance. Across both standard LLMs and multimodal models, VSD reported up to 9.6% additional speedup over the existing EAGLE-3 approach.
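
For context on what "probability of acceptance" means here, the sketch below shows the standard speculative sampling rule at a single position, not VSD's training objective, which the article does not spell out. A token x drawn from the draft distribution q is accepted with probability min(1, p(x)/q(x)), where p is the target distribution, so the overall chance of acceptance is the sum over tokens of min(p(x), q(x)). The example distributions and helper names are hypothetical.

```python
import random
from typing import Sequence


def acceptance_probability(p_target: Sequence[float], q_draft: Sequence[float]) -> float:
    """Chance that a token sampled from q_draft is accepted by the target."""
    return sum(min(p, q) for p, q in zip(p_target, q_draft))


def accept_draft_token(token: int, p_target: Sequence[float],
                       q_draft: Sequence[float], rng: random.Random) -> bool:
    """Standard accept/reject test; token is assumed to have been sampled from q_draft."""
    ratio = p_target[token] / q_draft[token]
    return rng.random() < min(1.0, ratio)


def empirical_acceptance(p_target: Sequence[float], q_draft: Sequence[float],
                         trials: int, seed: int = 0) -> float:
    """Monte Carlo estimate of the acceptance rate, as a sanity check on the formula."""
    rng = random.Random(seed)
    vocab = range(len(q_draft))
    accepted = 0
    for _ in range(trials):
        token = rng.choices(vocab, weights=q_draft)[0]
        accepted += accept_draft_token(token, p_target, q_draft, rng)
    return accepted / trials


if __name__ == "__main__":
    p = [0.7, 0.2, 0.1]  # hypothetical target distribution
    q = [0.4, 0.4, 0.2]  # hypothetical draft distribution
    print(acceptance_probability(p, q), empirical_acceptance(p, q, trials=20000))
```

A draft model trained only to match the target's single most likely token can still have a low overlap with the full target distribution, which is the gap between training and use that the VSD authors describe.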

Taken together, the three papers represent distinct but complementary angles on the same problem: how to extract faster, more efficient inference from large language models without sacrificing the quality that larger models provide.

§

Analysis

Why This Matters

  • Inference speed is now one of the primary cost and user-experience constraints in deploying large language models commercially; advances here translate directly into lower API costs and faster response times for end users.
  • The Multi-SPIN work specifically addresses edge deployment, which is relevant for privacy-sensitive or low-connectivity use cases where sending all data to a centralised cloud is impractical.
  • Collectively, these papers suggest the field is moving beyond single-model optimisation toward system-level and training-level redesigns of the speculative decoding pipeline.

Background

Speculative decoding was formalised in a series of papers around 2022–2023 and quickly became a standard inference acceleration technique. The core idea is to use a cheap draft model to propose tokens that an expensive model then verifies. It exploits the fact that the large model can check a whole run of drafted tokens in a single parallel forward pass, typically at a cost close to that of generating one token on its own. Early implementations used static, autoregressive draft models and assumed relatively homogeneous compute environments.

As LLMs grew larger and deployment contexts more varied — including edge devices, multi-tenant servers, and multimodal applications — the original framework's limitations became more apparent. Draft models trained on standard next-token prediction objectives do not necessarily produce the drafts most likely to be accepted by the target model, and no single drafting paradigm excels across all types of content or reasoning tasks.

The past two years have seen a proliferation of speculative decoding variants, including EAGLE, EAGLE-3, ViSpec, and DFlash, each targeting different aspects of the pipeline. The three papers released this week continue that trajectory, with an increasing emphasis on adaptive, learned, and distributed approaches.

Key Perspectives

Academic researchers: The three teams frame their contributions primarily in terms of throughput metrics (tokens per second, acceptance length), which are the standard benchmarks for evaluating speculative decoding. Each claims measurable, reproducible gains over recent state-of-the-art baselines on well-known model families including Llama-2 and Qwen.

Industry practitioners: For teams deploying LLMs at scale, inference cost is a dominant operational concern. Techniques that improve throughput without degrading output quality are directly valuable, though practitioners will want to see results reproduced on their specific model sizes, hardware configurations, and workload distributions before committing to adoption.

Critics and skeptics: Benchmark gains in controlled research settings do not always translate cleanly to production environments. The heterogeneity of real-world request streams, model sizes, and hardware configurations can erode reported improvements. VSD's reported speedup of up to 9.6% over EAGLE-3 is modest in absolute terms. WhiFlash's complexity, which comes from managing two distinct drafting paradigms with a routing controller, introduces new engineering overhead that may complicate deployment and debugging.

What to Watch

  • Whether any of these techniques are adopted or reproduced by major LLM serving frameworks such as vLLM, TensorRT-LLM, or SGLang, which would signal practical industry uptake.
  • Publication of ablation studies and third-party reproductions that test these gains across a wider range of model sizes, hardware tiers, and real-world workloads.
  • Whether the Multi-SPIN edge-deployment framework is validated in actual wireless network conditions, where channel variability and device heterogeneity may behave differently from how they do in simulated environments.

Sources

Zotpaper

Articles published under the Zotpaper byline are synthesized from multiple source publications by our AI editor and reviewed by our editorial process. Each story combines reporting from credible outlets to give readers a balanced, comprehensive view.