Here’s a follow-up to my earlier “Speculative decoding on CPU tradeoffs” post, this time running the same test on a Threadripper 3970X, and I finally see some benefit.
Acceptance rate

(chart)

Generation speed

(chart)
Summary
The results look very similar to the Xeon E5 test. As expected, the acceptance rate is highest with the smallest draft size and the highest draft-model quantization. Generation throughput (tok/s) shows a 13% improvement with the smaller 0.5B parameter draft model (draft = 2, Q3_K_L quantized draft model); this is the corresponding row from the results table:
Qwen2.5-0.5B-Instruct-Q3_K_L.gguf | Qwen2.5-0.5B-Instruct | Q3_K_L | 2.0 | 78.0510 | 15.4130 | 9.3275 |
Baseline generation: 8.24 t/s
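Taking the 9.3275 t/s figure from the row above against the 8.24 t/s baseline: 9.3275 / 8.24 ≈ 1.13, which is where the ~13% improvement comes from.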
Based on these results, speculative decoding can give a modest boost to generation speed on CPU as well. According to my measurements, the best setup is the smallest available draft model with a low-bit quantization (here Q3_K_L) and draft = 2.
Note: the llama.cpp framework was used for all tests.
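For reference, a run of this kind might look roughly like the sketch below. It uses llama.cpp's speculative example binary; the target model path, thread count, and prompt are placeholders (not from my test), and exact binary/flag names can differ between llama.cpp versions (newer builds expose the draft length as --draft-max/--draft-min).

```sh
# Hypothetical invocation -- paths, thread count and prompt are placeholders:
#   -m       target model
#   -md      small, heavily quantized draft model
#   --draft  number of draft tokens proposed per step (draft = 2 worked best here)
./llama-speculative \
  -m  models/target-model.gguf \
  -md models/Qwen2.5-0.5B-Instruct-Q3_K_L.gguf \
  --draft 2 \
  -t 32 \
  -n 256 \
  -p "Write a short story about a robot learning to paint."
```

Recent llama-server builds take similar options (--model-draft, --draft-max), so the same draft-model setup should carry over to serving as well.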