StarSD: One-for-Many Speculative Decoding

Junhao He, Feiran You, Hongyang Du

Year: 2026
Access: Open access

Abstract

Speculative decoding accelerates autoregressive generation by separating token proposal from verification, but most existing approaches are designed for single-node execution and do not scale well to multi-accelerator clusters used for serving modern Large Language Models (LLMs). We present StarSD, a one-for-many speculative decoding framework that uses a single draft model to serve multiple target models across distributed nodes via a star topology. StarSD decouples drafting and verification, enabling effective sharing of draft computation, and preventing distributed accelerators from remaining idle under bursty workloads. We provide a system-level analysis that characterizes when and why a single draft model can remain fully utilized by multiple verifiers, yielding predictable latency and utilization gains. Extensive experiments in real-world distributed inference settings demonstrate that StarSD simplifies deployment and supports flexible resource allocation across heterogeneous accelerators, while maintaining output quality. These results indicate that StarSD is a practical and scalable framework for bringing speculative decoding to modern cloud and edge inference infrastructures.

Keywords

eess.SY

StarSD: One-for-Many Speculative Decoding

Abstract

Keywords

Related papers

Statistical Learning Theory

Fractional Differential Equations

Applied Nonlinear Control

Genetic Programming: On the Programming of Computers by Means of Natural Selection