All about AMD and ROCm

Hardware Setup

If you’re like me and using server-grade, passively cooled AMD Instinct AI accelerators, you should make sure you can provide adequate cooling. For this, I rely on a 120mm PWM blower fan.

Specifically, I’m using this one: https://www.amazon.com/dp/B089Y3QPYF?ref=ppx_yo2ov_dt_b_fed_asin_title

I’ve also made printable 3D models for the MI50/MI60 and MI100 that you can use with the fan above.

All of the following setup was done on Ubuntu 24.

The MI50 was a drop-in: the machine booted and the card showed up in lspci. With the MI100, however, the machine didn’t even POST. To solve that, I removed the 8-pin power connectors so I could get into the BIOS, and made the following changes:

  • CSM support – disable
  • Above 4G Decoding – enable (not sure if this is strictly needed)
  • Resizable BAR – enable/auto

After these changes the machine posted and loaded Ubuntu without issue.

Drivers

I’ve followed AMD’s instructions from here:

https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/quick-start.html
https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/post-install.html

They’ve worked without modification.

NOTE: The MI50 and MI60 are based on the older GCN 5.1 architecture, and their ROCm support is deprecated. They still worked for me on ROCm 6.3, but the architecture is EOL.

Details can be found here: https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html

To validate that the driver installed successfully, you can use rocminfo, which prints very detailed information about the ROCm system and devices, or rocm-smi, which gives a similar but less detailed overview of the GPUs, much like nvidia-smi.

For a more detailed view of the GPU status I use nvtop, which you can install with sudo apt install nvtop. FYI: nvitop does not work with AMD (at least in my experience).

Inference

Note: If you have multiple AMD GPUs (in my case a Ryzen with an integrated GPU plus the Instinct card), you might want to control which device is used. Similarly to NVIDIA, we can control this with an environment variable: ROCR_VISIBLE_DEVICES=0
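
A minimal sketch of the same thing from inside Python (assuming the ROCm PyTorch build described later in the Raw inference section); the variable has to be set before the GPU runtime initializes, so prefixing the command in the shell, as in the llama-bench runs below, works just as well:

import os
os.environ["ROCR_VISIBLE_DEVICES"] = "0"  # must be set before any GPU library initializes

import torch  # ROCm builds of PyTorch expose HIP devices through the torch.cuda API
print(torch.cuda.device_count())      # only the selected device should show up
print(torch.cuda.get_device_name(0))  # e.g. the Instinct card rather than the Ryzen iGPU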

Quantized (GGUF)

Llama.cpp supports ROCm, and I managed to get both the MI50 (gfx906) and the MI100 (gfx908) working. You should build llama.cpp for the architecture of your card by passing the AMDGPU_TARGETS parameter. You can get this value from rocminfo: rocminfo | grep gfx | head -1 | awk '{print $2}'

And this is the build command:

HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" cmake -S . -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx908 -DCMAKE_BUILD_TYPE=Release && cmake --build build --config Release -- -j 16

For non-GGML quantized inference, see the Training section.

Performance
ROCR_VISIBLE_DEVICES=0 build/bin/llama-bench -m ~/Downloads/DevQuasar-R1-Uncensored-Llama-8B.Q4_K_M.gguf 
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Instinct MI100, gfx908:sramecc+:xnack- (0x908), VMM: no, Wave Size: 64
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | ROCm       |  99 |         pp512 |       2997.80 ± 3.85 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | ROCm       |  99 |         tg128 |         92.47 ± 0.11 |
ROCR_VISIBLE_DEVICES=0 build/bin/llama-bench -m ~/Downloads/DevQuasar-R1-Uncensored-Llama-8B.Q8_0.gguf 
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Instinct MI100, gfx908:sramecc+:xnack- (0x908), VMM: no, Wave Size: 64
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| llama 8B Q8_0                  |   7.95 GiB |     8.03 B | ROCm       |  99 |         pp512 |       3259.75 ± 2.67 |
| llama 8B Q8_0                  |   7.95 GiB |     8.03 B | ROCm       |  99 |         tg128 |         77.16 ± 0.26 |
Raw inference

For non-quantized inference you’ll need PyTorch. The package configurator on https://pytorch.org/ works pretty well: choose your OS and the proper compute platform (ROCm), and you’ll end up with an install command like:

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.2.4

NOTE: match the ROCm version to your drivers! If you’re on 6.3, modify the command above accordingly.

If needed, use the nightly build:

pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.3/
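
Either way, a quick sanity check of the install (a minimal sketch; run it inside the environment you just installed into):

import torch

print(torch.__version__)              # ROCm wheels carry a +rocm suffix in the version string
print(torch.version.hip)              # HIP version the wheel was built against (None on CUDA builds)
print(torch.cuda.is_available())      # ROCm devices are exposed through the torch.cuda API
print(torch.cuda.get_device_name(0))  # e.g. AMD Instinct MI100

x = torch.randn(1024, 1024, device="cuda")
print((x @ x).sum())                  # small matmul that actually exercises the GPU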

Training

I managed to run LoRA training on the fp16 model without any additional configuration, and the training was successful.
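
For reference, a minimal sketch of what such a LoRA run can look like with peft and the transformers Trainer (the dataset, target modules, and hyperparameters here are illustrative, not my exact setup):

import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer, DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to("cuda:0")

# Attach LoRA adapters to the attention projections; rank/alpha are typical starting values.
lora_cfg = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora_cfg)

# A tiny public text dataset, tokenized for causal LM training.
dataset = load_dataset("Abirate/english_quotes", split="train[:1%]")
dataset = dataset.map(lambda b: tokenizer(b["quote"], truncation=True, max_length=256),
                      batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="lora-out", per_device_train_batch_size=1,
                           gradient_accumulation_steps=8, num_train_epochs=1,
                           bf16=True, logging_steps=10),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()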

Quantized model training

Unsloth

Not tested yet

Bits And Bytes

To do QLoRA you have to build and install a ROCm-specific bitsandbytes. There are multiple sources/repos describing how to do this.

ROCm repo

https://github.com/ROCm/bitsandbytes

Huggingface

https://huggingface.co/docs/bitsandbytes/main/en/installation#multi-backend

git clone -b multi-backend-refactor https://github.com/bitsandbytes-foundation/bitsandbytes.git && cd bitsandbytes/

# Compile & install
apt-get install -y build-essential cmake  # install build tools dependencies, unless present
cmake -DCOMPUTE_BACKEND=hip -S .  # Use -DBNB_ROCM_ARCH="gfx90a;gfx942" to target specific gpu arch
make
pip install -e .   # `-e` for "editable" install, when developing BNB (otherwise leave that out)
AMD’s suggestion

https://rocm.blogs.amd.com/artificial-intelligence/bnb-8bit/README.html

Partial success

At the time of writing this blog post (Feb 28, 2025) I had achieved partial success:

  • Building and installing bitsandbytes – works
  • Loading the model into memory – works
  • Inference – NOT WORKING

Here I’ve used the code and instructions from https://github.com/ROCm/bitsandbytes/tree/rocm_enabled_multi_backend

This repo contains a BNB version that properly handles whether Triton is available or not.

Issues

After the build, I tested the BNB package with python -m bitsandbytes and hit the following issues:

GLIBCXX
Could not load bitsandbytes native library: /home/kecso/miniconda3/envs/bnbtest/bin/../lib/libstdc++.so.6: version `GLIBCXX_3.4.32' not found (required by /home/kecso/Documents/workspace/bitsandbytes/bitsandbytes/libbitsandbytes_hip.so)

We need a newer libstdc++ that provides this GLIBCXX version, so update the conda env: conda install -c conda-forge gcc=12

triton.ops
ModuleNotFoundError: No module named 'triton.ops'
So far, no success getting past this one.

Addressed in bnb: https://github.com/bitsandbytes-foundation/bitsandbytes/commit/032beb953e7ddbf001d244489850a79c281e8217

Successful BNB installation
(bnbtest) kecso@gpu-testbench2:~/bitsandbytes$ python -m bitsandbytes
g++ (Ubuntu 14.2.0-4ubuntu2) 14.2.0
Copyright (C) 2024 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++ BUG REPORT INFORMATION ++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++ OTHER +++++++++++++++++++++++++++
ROCm specs: rocm_version_string='63', rocm_version_tuple=(6, 3)
PyTorch settings found: ROCM_VERSION=63
The directory listed in your path is found to be non-existent: local/gpu-testbench2
The directory listed in your path is found to be non-existent: @/tmp/.ICE-unix/2803,unix/gpu-testbench2
The directory listed in your path is found to be non-existent: /etc/xdg/xdg-ubuntu
The directory listed in your path is found to be non-existent: /org/gnome/Terminal/screen/6bd83ab2_fd9f_4990_876a_527ef8117ef6
The directory listed in your path is found to be non-existent: //debuginfod.ubuntu.com 
WARNING! ROCm runtime files not found in any environmental path.
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++ DEBUG INFO END ++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Checking that the library is importable and ROCm is callable...
SUCCESS!
Installation was successful!

The requirements file and the repo branch I used:
https://github.com/csabakecskemeti/all_about_rocm/blob/main/bnb_rocm_requirements.txt
https://github.com/ROCm/bitsandbytes/tree/rocm_enabled_multi_backend

Inference HIP device

export HIP_VISIBLE_DEVICES=0

Load model in 8bit

Test code

import transformers
import torch

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"

device = "cuda:0"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16, "load_in_8bit": True},
    device_map=device,
)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

outputs = pipeline(
    messages,
    max_new_tokens=256,
)
print(outputs[0]["generated_text"][-1])

As seen on the screenshot above, with "load_in_8bit": True the memory footprint is about half, but inference fails with:

Exception: cublasLt ran into an error!

hipBLASLt on an unsupported architecture!

The same error occurs with an explicit BitsAndBytesConfig:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig  
device = "cuda:0"
model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"
quantization_config = BitsAndBytesConfig(load_in_8bit=True)

quantized_model = AutoModelForCausalLM.from_pretrained(
	model_name, device_map=device, torch_dtype=torch.bfloat16, quantization_config=quantization_config)
tokenizer = AutoTokenizer.from_pretrained(model_name)
input_text = "Tell me a joke."
input_ids = tokenizer(input_text, return_tensors="pt").to(device)


output = quantized_model.generate(**input_ids, max_new_tokens=10)

print(tokenizer.decode(output[0], skip_special_tokens=True))

So far, no solution for inference or training with the model loaded in 8-bit 🙁

It turns out that the MI100 (gfx908) has no hipBLASLt support:
https://github.com/ROCm/hipBLASLt/blob/develop/README.md

UPDATE 03/02/2025

4-bit inference is working!

import transformers
import torch

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"

device = "cuda:0"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16, "load_in_4bit": True},
    device_map=device,
)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

outputs = pipeline(
    messages,
    max_new_tokens=256,
)
print(outputs[0]["generated_text"][-1])

The first memory bump is the 4-bit load and the second, bigger bump is 16-bit precision.
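
For completeness, the same 4-bit load expressed with an explicit BitsAndBytesConfig, mirroring the 8-bit variant above (the nf4 / bf16 compute settings are the commonly used ones, not something I’ve benchmarked separately):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # 4-bit NormalFloat weights
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16
)

quantized_model = AutoModelForCausalLM.from_pretrained(
    model_name, device_map="cuda:0", quantization_config=quantization_config)
tokenizer = AutoTokenizer.from_pretrained(model_name)

input_ids = tokenizer("Tell me a joke.", return_tensors="pt").to("cuda:0")
output = quantized_model.generate(**input_ids, max_new_tokens=20)
print(tokenizer.decode(output[0], skip_special_tokens=True))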

Latest requirements.txt: https://github.com/csabakecskemeti/all_about_rocm/blob/main/bnb_rocm_requirements2.txt

Interesting findings

  • ROCm keeps working after hibernation or sleep, unlike CUDA.