Speculative decoding tradeoffs on GPU

Now I’ve crunched the numbers on GPU as well. The test rig is a 3-GPU system: 1x RTX 4080 + 2x RTX 3090. The baseline was set with the following llama.cpp generation command (the same generation config and prompts were used in the speculative decoding tests).

./llama-simple -m /<path>/Qwen2.5-7B-Instruct-Q4_K_M.gguf  -p "# Write tutorial about the awk bash app\n\n" -e -ngl 29 -t 4 -n 512 -c 4096 -s 20 --top_k 1
main: decoded 503 tokens in 4.82 s, speed: 104.28 t/s

llama_perf_sampler_print:    sampling time =      36.37 ms /   504 runs   (    0.07 ms per token, 13857.19 tokens per second)
llama_perf_context_print:        load time =    1638.70 ms
llama_perf_context_print: prompt eval time =     133.62 ms /     9 tokens (   14.85 ms per token,    67.35 tokens per second)
llama_perf_context_print:        eval time =    4346.14 ms /   503 runs   (    8.64 ms per token,   115.73 tokens per second)
llama_perf_context_print:       total time =    6461.80 ms /   512 tokens
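
For the speculative runs, the target model, prompt, and sampling settings stay the same and a small draft model is added on top. A sketch of such a command is below; the draft model filename and the -md / -ngld / --draft flags are my assumptions here, and exact flag names vary between llama.cpp versions, so check ./llama-speculative --help on your build:

./llama-speculative -m /<path>/Qwen2.5-7B-Instruct-Q4_K_M.gguf \
    -md /<path>/Qwen2.5-0.5B-Instruct-Q8_0.gguf \
    -p "# Write tutorial about the awk bash app\n\n" \
    -e -ngl 29 -ngld 99 -t 4 -n 512 -c 4096 -s 20 --top_k 1 \
    --draft 2   # number of tokens drafted per step, swept across the experiments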

Let’s see how much benefit…

Acceptance rate

Draft size 1 and draft models at the highest quants provided the best acceptance rate, which is not surprising. The sweet spot at Q3 quantization, which appears with both the 1.5B and the 0.5B draft model, is surprising, though the difference is minuscule and could be measurement error.

The other interesting thing is the steeper decline in acceptance rate as the draft size increases to 4-6 with the smaller 0.5B draft model. We can also see a small plateau in the 6-8 draft range.
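
This matters for speed because, under the usual i.i.d.-acceptance approximation from the speculative decoding literature, a draft of size K with per-token acceptance rate a yields about (1 - a^(K+1)) / (1 - a) emitted tokens per verification pass, so once a starts dropping as K grows, the extra draft tokens are mostly wasted compute. A quick way to play with the numbers (a = 0.7 and K = 4 are made-up illustrative values, not measurements from these runs):

awk -v a=0.7 -v K=4 'BEGIN { printf "expected tokens per verification pass: %.2f\n", (1 - a^(K+1)) / (1 - a) }'
# prints: expected tokens per verification pass: 2.77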

Generation speed

I achieved the best performance with the 0.5B Q8 draft model at draft size 2: 120.35 t/s, which is a 15.37% improvement over the baseline.
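
As a quick sanity check, using the rounded throughput figures reported above:

awk 'BEGIN { printf "%.1f%% over baseline\n", (120.35 / 104.28 - 1) * 100 }'
# prints: 15.4% over baseline, matching the reported 15.37% up to rounding of the t/s values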

We can identify a local maximum on both plots in the same 6-8 draft range where we saw the plateau in the acceptance rate plot.

We can also see that the top performance results appear at the highest quants for the smaller 0.5B draft model, while with the bigger 1.5B draft model there is a tradeoff between the draft model’s generation speed and the quality loss from quantization. Here the best results appear in the mid quants: Q4, Q5, Q6.

My Conclusion

Speculative decoding is clearly able to provide performance improvements without sacrificing generation quality, at the cost of only a little VRAM overhead. The smaller draft model performed better; a model size between the two, like a 1B model, could probably achieve slightly better generation performance.

Why do others claim way higher improvements?

I expected more:

  • A bigger improvement in generation speed
  • The peak speed at some larger draft size

The claims I’ve read about performance improvements from speculative decoding mention ~200% gains. Why haven’t I gotten those results? Maybe there’s room to configure the inference better; I can imagine SD gaining more from some particular configuration.

My theory, though, is that the result depends heavily on the difficulty of the prompt. Low-quant small models with a higher draft size might perform very well on simplistic prompts, like “What is the capital of Italy?”, but in this test I tasked the LLM with more complex jobs:

"# Write tutorial about the awk bash app\n\n"
"# Dijkstra's shortest path algorithm in Python (4 spaces indentation) + complexity analysis:\n\n"

I can imagine these tasks decrease the acceptance rate significantly at higher draft sizes, so we’re “burning” the compute budget on tokens we eventually throw away.
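
One way to test this theory with the setup above (same caveats as before: the -md / -ngld / --draft flag names and the draft model filename are my assumptions) is to run the same draft configuration on a trivial and on a complex prompt and compare the reported acceptance statistics and t/s:

for prompt in "What is the capital of Italy?\n\n" \
              "# Dijkstra's shortest path algorithm in Python (4 spaces indentation) + complexity analysis:\n\n"; do
    ./llama-speculative -m /<path>/Qwen2.5-7B-Instruct-Q4_K_M.gguf \
        -md /<path>/Qwen2.5-0.5B-Instruct-Q8_0.gguf \
        -p "$prompt" -e -ngl 29 -ngld 99 -t 4 -n 512 -c 4096 -s 20 --top_k 1 \
        --draft 6   # a larger draft size, where the acceptance-rate gap should matter most
done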

Next: test with very simple prompts

Note:
All experiments used llama.cpp.

Check out my SD tests on CPU: CPU, CPU2