I’ve rerun the Speculative Decoding experiment with larger models, pairing a Llama 3 70B primary model with an 8B draft model, to see whether a larger primary model can benefit more from a draft model than the previously used Qwen 7B + 1.5B / 0.5B combo.
For the experiment I’ve used llama.cpp on Nvidia GPUs (1 x RTX 4080 + 2 x RTX 3090). I’ve extended the previous experiment with more draft sizes, used 4 prompts (2 complex and 2 simpler, see below), and took the average of the generation speeds.
Primary model: Meta-Llama-3-70B-Instruct-Q4_K_M.gguf
Draft models: Meta-Llama-3-8B-Instruct [IQ3, Q4, Q5, Q6, Q8]
Prompts
“# Write tutorial about the awk bash app\n\n”
“# Dijkstra’s shortest path algorithm in Python (4 spaces indentation) + complexity analysis:\n\n”
“# Write a short story about a cat\n\n”
“# Compare capitalism and communism\n\n”
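For reference, the primary model and the prompts can be kept in shell variables matching the $primary_model and ${prompts[...]} references in the commands below. This is only a minimal setup sketch, not the original script:

primary_model="Meta-Llama-3-70B-Instruct-Q4_K_M.gguf"   # the 70B primary model listed above
prompts=(
  "# Write tutorial about the awk bash app\n\n"
  "# Dijkstra's shortest path algorithm in Python (4 spaces indentation) + complexity analysis:\n\n"
  "# Write a short story about a cat\n\n"
  "# Compare capitalism and communism\n\n"
)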
Llama.cpp configuration: Speculative Decoding
./llama-speculative -m "$primary_model" \
-md "$model_file" -p "$prompt" \
-e -ngl 100 -t 4 -n 512 -c 4096 -s 20 --top_k 1 --draft "$draft" -ngld 100
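The full sweep is then just a set of nested loops over the draft-model quants, the draft sizes and the prompts. A minimal sketch, assuming illustrative quant file names and an illustrative draft-size grid (only drafts around 8-12 show up explicitly in the results below):

quants=(IQ3_M Q4_K_M Q5_K_M Q6_K Q8_0)   # exact file suffixes are illustrative
drafts=(4 6 8 10 12 16)                  # draft-size grid is illustrative
for quant in "${quants[@]}"; do
  model_file="Meta-Llama-3-8B-Instruct.${quant}.gguf"
  for draft in "${drafts[@]}"; do
    for prompt in "${prompts[@]}"; do
      ./llama-speculative -m "$primary_model" -md "$model_file" -p "$prompt" \
        -e -ngl 100 -t 4 -n 512 -c 4096 -s 20 --top_k 1 --draft "$draft" -ngld 100
    done
    # the reported tokens/second of the 4 runs are then averaged per (quant, draft) pair
  done
done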
Llama.cpp configuration: Reference (no SD, only the 70B model)
./llama-simple -m "$primary_model" -p "${prompts[0]}" -e -ngl 100 -t 4 -n 512 -c 4096 -s 20 --top_k 1
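The generation speeds reported by llama.cpp were then averaged per configuration. If the tok/s values are collected into a file with one value per line (speeds.txt here is purely a convenience name, not part of the original setup), a short awk one-liner does the averaging:

awk '{ sum += $1; n++ } END { if (n) printf "average: %.2f tok/s\n", sum / n }' speeds.txt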
Results
Accepted rate
No surprise: the smaller the draft size and the higher the quantization quality, the better the accept rate of the draft model’s generations.

Generation
Again we can identify a sweet spot with a mid-quality quantized draft model. While in the previous experiment Q5 was the clear winner, here the Q4 quantization provided the best value across multiple draft sizes. Regarding the draft sizes, this time we can see a clearly defined performance peak between 8 and 12: the higher accept rate of draft=8 gave the best performance, with a small margin over draft sizes 10 and 12.

Top 3 configurations (Value = average generation speed in tok/s):
Value: 32.42, Draft: 8, Quantization: Q4_K_M, Accepted Rate: 58.68%
Value: 32.32, Draft: 10, Quantization: Q4_K_M, Accepted Rate: 54.58%
Value: 32.23, Draft: 12, Quantization: Q4_K_M, Accepted Rate: 50.56%
Overall performance boost
Finally we see the ~2x boost in generation speed: the primary model alone achieved around 16 tok/s, which we were able to double to ~32 tok/s by using Speculative Decoding with a draft model.
Key takeaways
- Larger primary models seem to benefit more from Speculative Decoding
- Improvement on a CPU-only configuration is non-existent or minuscule
- Q4-Q5 quantized draft models are the sweet spot
- Draft sizes of 8-12 gave the most gain
- 2x generation speed improvement compared to the primary model only setup