
The following is the leaderboard of all GPUs I’ve tested with llama-bench. For the model I’m using the Q8_0 and Q4_K_M quantizations of Llama-3.1-8B-Instruct from the following repo: DevQuasar/Meta-Llama-3.1-8B-Instruct-GGUF. The reasoning behind the model selection is to use a generic, widely used, popular model whose memory footprint is small enough that a wide range of GPUs are capable of running it.
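For context, pp512 reports prompt-processing throughput over a 512-token prompt and tg128 reports text-generation throughput over 128 tokens, both in tokens per second. Below is a minimal sketch of the kind of invocation that produces these numbers; the model filenames are assumptions based on the repo’s naming, and exact flags can vary between llama.cpp builds.

```sh
# Sketch of a llama-bench run (filenames assumed, flags per recent llama.cpp builds):
# -p 512 -> the pp512 prompt-processing test
# -n 128 -> the tg128 text-generation test
# -ngl 99 -> offload all model layers to the GPU
# llama-bench iterates over every -m given, so both quants run in one call.
./llama-bench \
  -m Meta-Llama-3.1-8B-Instruct.Q4_K_M.gguf \
  -m Meta-Llama-3.1-8B-Instruct.Q8_0.gguf \
  -p 512 -n 128 -ngl 99
```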

Please note: the following results were not generated on identical systems.

| GPU | Backend | Q4_K_M pp512 (t/s) | Q4_K_M tg128 (t/s) | Q8_0 pp512 (t/s) | Q8_0 tg128 (t/s) |
|---|---|---|---|---|---|
| RTX 5090 | CUDA 12.8 | 12031.59 | 204.19 | 12207.44 | 143.18 |
| MI100 | ROCm | 3003.69 | 93.22 | 3278.71 | 78.24 |
| RTX 3090 | CUDA | 4656.10 | 120.67 | 4931.30 | 83.80 |
| RTX 4080 | CUDA | 6925.90 | 109.97 | 7362.13 | 70.77 |
| MI50 | ROCm | 395.73 | 76.73 | 416.68 | 60.38 |
| L40S | CUDA | 8788.69 | 85.76 | 8968.31 | 59.93 |
| V100 | CUDA 12.6 | 2844.55 | 98.76 | 3368.09 | 74.16 |