GPU GGUF inference comparison

The following is the leaderboard of all GPUs I’ve tested with llama-bench. For the model I’m using the Q8_0 and Q4_K_M quantizations of Llama-3.1-8B-Instruct from the following repo: DevQuasar/Meta-Llama-3.1-8B-Instruct-GGUF. The reasoning behind the model selection is to use a generic, widely used, popular model whose memory footprint is small enough that a wide range of GPUs are capable of running it.

Please note: the following results were not generated on identical systems.
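
For reference, a minimal sketch of the kind of llama-bench invocation that produces these columns; pp512 means prompt processing of a 512-token prompt and tg128 means generation of 128 tokens, both reported in tokens per second. The exact model paths and per-system build options are assumptions:

```sh
# Benchmark both quantizations; -p 512 / -n 128 match the pp512/tg128
# columns below (these are also llama-bench's defaults).
# Model filenames are assumed; adjust to your local paths.
./llama-bench -m Meta-Llama-3.1-8B-Instruct.Q4_K_M.gguf -p 512 -n 128
./llama-bench -m Meta-Llama-3.1-8B-Instruct.Q8_0.gguf -p 512 -n 128
```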

| GPU | Backend | Q4_K_M pp512 (t/s) | Q4_K_M tg128 (t/s) | Q8_0 pp512 (t/s) | Q8_0 tg128 (t/s) |
|---|---|---|---|---|---|
| MI100 | ROCm | 3003.69 | 93.22 | 3278.71 | 78.24 |
| RTX 3090 | CUDA | 4656.10 | 120.67 | 4931.30 | 83.80 |
| RTX 4080 | CUDA | 6925.90 | 109.97 | 7362.13 | 70.77 |
| MI50 | ROCm | 395.73 | 76.73 | 416.68 | 60.38 |
| L40S | CUDA | 8788.69 | 85.76 | 8968.31 | 59.93 |
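
If you want to reproduce a table like the one above, llama-bench can emit its results as Markdown directly. A sketch, assuming a reasonably recent llama.cpp build (check `llama-bench --help` on yours for the supported output formats):

```sh
# Emit the results table as Markdown, ready to paste into a post.
./llama-bench -m Meta-Llama-3.1-8B-Instruct.Q4_K_M.gguf -o md
```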