The following is the leaderboard of all GPUs I've tested with llama-bench. For the model I'm using the Q8_0 and Q4_K_M quantizations of Llama-3.1-8B-Instruct from the following repo: DevQuasar/Meta-Llama-3.1-8B-Instruct-GGUF. The reasoning behind the model selection is to use a generic, widely used, popular model whose memory footprint is small enough that a wide range of GPUs are capable of running it. In the table, pp512 is prompt-processing throughput over a 512-token prompt and tg128 is text-generation throughput over 128 generated tokens; both are reported in tokens per second (t/s).

Please note: the following results were not generated on identical systems, so treat them as a rough comparison rather than a controlled benchmark.
GPU | Backend | Q4_K_M pp512 (t/s) | Q4_K_M tg128 (t/s) | Q8_0 pp512 (t/s) | Q8_0 tg128 (t/s) |
---|---|---|---|---|---|
MI100 | ROCm | 3003.69 | 93.22 | 3278.71 | 78.24 |
RTX 3090 | CUDA | 4656.10 | 120.67 | 4931.30 | 83.80 |
RTX 4080 | CUDA | 6925.90 | 109.97 | 7362.13 | 70.77 |
MI50 | ROCm | 395.73 | 76.73 | 416.68 | 60.38 |
L40S | CUDA | 8788.69 | 85.76 | 8968.31 | 59.93 |
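For reference, runs like the ones above can be reproduced with llama-bench roughly as follows. This is a minimal sketch: the GGUF filenames are assumptions based on the repo's naming and may differ from the actual files in DevQuasar/Meta-Llama-3.1-8B-Instruct-GGUF. pp512 and tg128 happen to be llama-bench's defaults, so the `-p`/`-n` flags are shown only for explicitness.

```bash
# Hypothetical filenames based on the repo's naming convention; adjust to your local copies.
# llama-bench defaults to pp512 (-p 512) and tg128 (-n 128); flags given here for clarity.
./llama-bench -m Meta-Llama-3.1-8B-Instruct.Q4_K_M.gguf -p 512 -n 128
./llama-bench -m Meta-Llama-3.1-8B-Instruct.Q8_0.gguf -p 512 -n 128
```

The same binary works across backends: a CUDA build produces the CUDA rows and a ROCm build produces the MI50/MI100 rows, with no changes to the command line.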