Speculative decoding on CPU tradeoffs

I’ve played around with llama.cpp’s speculative decoding on CPU (Mac Pro 2013 – Xeon E5, 12 cores, 2.7 GHz) and wanted to share my experience.

First of all, I struggled to find model pairs where the vocab size difference is less than 100; this caused the following error:

main: error: draft model vocab must closely match target model to use speculation but target vocab size 152064 does not match draft vocab size 151936 - difference 128, max allowed 100
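If you want to check the vocab sizes up front, they are printed in llama.cpp’s model load log. A quick way to grab them (the model path below is a placeholder, and the exact log prefix may differ between llama.cpp versions):

# n_vocab shows up in the model metadata printed at load time
./llama-cli -m path/to/model.gguf -p "hi" -n 1 2>&1 | grep n_vocab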

The workaround was to change #define SPEC_VOCAB_MAX_SIZE_DIFFERENCE from 100 to 130 (or whatever number accommodates your diff) in speculative.cpp, then rebuild the project.
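A minimal sketch of that edit from the shell, assuming the define still lives in examples/speculative/speculative.cpp in your llama.cpp checkout (the path and build setup may differ between versions):

# macOS/BSD sed needs the empty '' argument after -i
sed -i '' 's/SPEC_VOCAB_MAX_SIZE_DIFFERENCE 100/SPEC_VOCAB_MAX_SIZE_DIFFERENCE 130/' examples/speculative/speculative.cpp
# then rebuild llama-speculative with your usual make / cmake invocation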

In the test I used Qwen2.5-7B-Instruct-Q4_K_M.gguf as the primary (target) model and tested qwen2.5-1.5b-instruct and qwen2.5-0.5b-instruct as draft models.

I iterated through all the quantizations of the draft models and the draft sizes [1, 2, 4, 6, 8, 10, 12]. I used two prompts and averaged the two runs in the results. The prediction ran on CPU only, so I used -ngl 0 and -ngld 0 to avoid offloading any layers to the super-slow AMD D500s.

The full command looks like this:

./llama-speculative -m ~/.cache/huggingface/hub/models--lmstudio-community--Qwen2.5-7B-Instruct-GGUF/snapshots/a8bb3906b78b3009770d7ae7d116be2ea892802d/Qwen2.5-7B-Instruct-Q4_K_M.gguf -md ~/.cache/huggingface/hub/models--Qwen--Qwen2.5-1.5B-Instruct-GGUF/snapshots/91cad51170dc346986eccefdc2dd33a9da36ead9/qwen2.5-1.5b-instruct-q4_k_m.gguf -p "# Write tutorial about the awk bash app\n\n" \
-e -ngl 0 -t 4 -n 512 -c 4096 -s 20 --top_k 1 --draft 2 -ngld 0
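The sweep itself was just a loop around this command. Here is a minimal sketch of how it could be scripted (the model paths, quant list and results directory are placeholders for whatever you have downloaded; the flags are the same as above):

#!/bin/bash
# Sweep draft-model quantizations and draft sizes; everything stays on the CPU (-ngl 0 -ngld 0).
TARGET=path/to/Qwen2.5-7B-Instruct-Q4_K_M.gguf   # set to your local GGUF paths
mkdir -p results
for quant in q2_k q3_k_m q4_k_m q5_0 q5_k_m q6_k q8_0; do
  for draft in 1 2 4 6 8 10 12; do
    ./llama-speculative -m "$TARGET" -md "path/to/qwen2.5-1.5b-instruct-$quant.gguf" \
      -p "# Write tutorial about the awk bash app\n\n" \
      -e -ngl 0 -ngld 0 -t 4 -n 512 -c 4096 -s 20 --top_k 1 --draft "$draft" \
      2>&1 | tee "results/1.5b-$quant-draft$draft.log"
  done
done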

Results

In the results I first looked at the acceptance rate.

Not much surprise here: the higher quants are more precise, and the fewer tokens we draft, the higher the chance that they do not diverge from the primary model’s answer. Though with the 0.5B draft model the Q3 quant performed a little better than the Q4 (maybe just random fluctuation).
No combination benefited from a higher draft size.

Let’s see the data for the output tokens/sec.

With the 0.5B draft model the highest quants still performed best (Q3 maintained slightly better performance than Q4).

With the 1.5B draft model, the output tokens/sec showed a tradeoff between draft-model generation speed and acceptance rate. The Q4_K_M quant showed the best performance, and interestingly Q5_0 had degraded performance compared to its neighbors.
And again, no combination benefited from a higher draft size.

Summary

The reality is that none of the combinations was faster than plain single-model generation, though the best results with SD (speculative decoding) came very close in tok/s.
Baseline: 5.29 tok/s

Probably more tuning is needed for the CPU, like testing different numbers of threads.
I also assume this old hardware has other bottlenecks, like the slow DDR3 memory. All in all, I think these results may not be representative enough to judge SD in general, but they are probably a fair representation of the tradeoff between the higher tok/s of lower quantizations vs. the higher acceptance rate of higher quantizations.
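For a rough feel of that tradeoff, here is a back-of-the-envelope model (my own simplification, not something measured here): assume an independent per-token acceptance rate a, a draft-to-target per-token cost ratio c and a draft size g; one verification pass then yields on average (1 - a^(g+1)) / (1 - a) tokens at a relative cost of g*c + 1. A quick awk one-liner with made-up numbers:

# Estimated speedup of speculative decoding under a simplified i.i.d. acceptance model.
# a = acceptance rate, c = draft/target per-token cost ratio, g = draft size (all values made up).
awk -v a=0.7 -v c=0.2 -v g=4 'BEGIN {
  tokens  = (1 - a^(g+1)) / (1 - a);   # expected tokens per verification pass
  speedup = tokens / (g*c + 1);        # relative to target-only decoding
  printf "tokens/pass = %.2f  est. speedup = %.2fx\n", tokens, speedup
}'

With a relatively slow draft model (high c) or a low acceptance rate the estimate quickly drops below 1x, which matches what I saw on this machine.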

I might also repeat the test on a more capable CPU (Threadripper 3970X), and I will definitely repeat it with GPU acceleration.

To get more stable results I might need to repeat the test and average more runs of the same combination.