

Hardware Setup
If you’re like me and using server-grade, passively cooled AMD Instinct AI accelerators, you should make sure you can provide adequate cooling. For this, I rely on a 120 mm PWM blower fan.

Specifically, I’m using this one: https://www.amazon.com/dp/B089Y3QPYF?ref=ppx_yo2ov_dt_b_fed_asin_title
I’ve also made printable 3D models for the MI50 (MI60) and MI100 that you can use with the fan above.
All of the following setup was done on Ubuntu 24.
The MI50 was a drop-in: the machine booted and the card showed up in lspci. With the MI100, however, the machine didn’t even POST. To get around that, I removed the card’s 8-pin power connectors so I could reach the BIOS, and made the following changes:
- CSM support – disable
- Above 4G decoding – enable (not sure if this is needed or not)
- Resizable BAR – enable/auto
After these changes the machine posted and loaded Ubuntu without issue.
Drivers
I’ve followed AMD’s instructions from here:
https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/quick-start.html
https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/post-install.html
They’ve worked without modification.
NOTE: The MI50 and MI60 are based on the older GCN 5.1 architecture, and their ROCm support is deprecated. It still worked for me on ROCm 6.3, but the cards are EOL.
Details can be found here: https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html
To validate that the driver installed successfully, you can use rocminfo, which provides very detailed information about the ROCm system and devices, or rocm-smi, which provides a similar but less detailed overview of the GPUs, much like nvidia-smi.

For a more detailed view of the GPU status I’m using nvtop, which you can install with sudo apt install nvtop. FYI: nvitop does not work with AMD (at least in my experience).
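If you prefer a programmatic check over a terminal UI, and you already have the ROCm build of PyTorch installed (covered later in the Raw inference section), a minimal sketch like the following reports per-device memory usage:

# Minimal sketch: report GPU memory usage through PyTorch's HIP backend.
# Assumes the ROCm build of PyTorch (see the Raw inference section below).
import torch

if torch.cuda.is_available():  # ROCm devices are exposed via the CUDA API
    for i in range(torch.cuda.device_count()):
        free, total = torch.cuda.mem_get_info(i)  # bytes
        used_gib = (total - free) / 2**30
        print(f"[{i}] {torch.cuda.get_device_name(i)}: {used_gib:.1f} / {total / 2**30:.1f} GiB used")
else:
    print("No ROCm/HIP device visible to PyTorch")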

Inference
Note: If you have multiple AMD GPUs (in my case a Ryzen with an integrated GPU alongside the Instinct card), you might want to control which device is used. Similarly to NVIDIA, we can control this with an environment variable: ROCR_VISIBLE_DEVICES=0
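A quick way to confirm which device is actually visible is to set the variable before the HIP runtime initializes and then list devices from PyTorch. This is a minimal sketch, assuming the ROCm build of PyTorch from the Raw inference section is installed:

# Minimal sketch: restrict the process to GPU 0 and check what PyTorch sees.
# ROCR_VISIBLE_DEVICES must be set before the HIP runtime starts,
# i.e. before importing torch.
import os
os.environ["ROCR_VISIBLE_DEVICES"] = "0"

import torch
print(torch.cuda.device_count())      # expect 1
print(torch.cuda.get_device_name(0))  # e.g. the Instinct card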
Quantized (GGUF)
Llama.cpp supports ROCm, and I managed to make both the MI50 (gfx906) and the MI100 (gfx908) work. You should build llama.cpp for your card’s architecture by passing the AMDGPU_TARGETS parameter. You can get its value from rocminfo: rocminfo | grep gfx | head -1 | awk '{print $2}'
And this is the build command:
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" cmake -S . -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx908 -DCMAKE_BUILD_TYPE=Release && cmake --build build --config Release -- -j 16
For non-GGML quantized inference, look at the Training section.
Performance
ROCR_VISIBLE_DEVICES=0 build/bin/llama-bench -m ~/Downloads/DevQuasar-R1-Uncensored-Llama-8B.Q4_K_M.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Instinct MI100, gfx908:sramecc+:xnack- (0x908), VMM: no, Wave Size: 64
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | ROCm | 99 | pp512 | 2997.80 ± 3.85 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | ROCm | 99 | tg128 | 92.47 ± 0.11 |
ROCR_VISIBLE_DEVICES=0 build/bin/llama-bench -m ~/Downloads/DevQuasar-R1-Uncensored-Llama-8B.Q8_0.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Instinct MI100, gfx908:sramecc+:xnack- (0x908), VMM: no, Wave Size: 64
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | ROCm | 99 | pp512 | 3259.75 ± 2.67 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | ROCm | 99 | tg128 | 77.16 ± 0.26 |
Raw inference
For non-quantized inference you’ll need PyTorch. The package configurator on https://pytorch.org/ works pretty well: choose your OS and the proper compute platform (ROCm) and you’ll end up with an install command like:
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.2.4
NOTE: Match the ROCm version with your drivers! If you use 6.3, modify the command above accordingly.
If needed, use the nightly build:
pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.3/
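Once PyTorch is installed, it’s worth a quick sanity check that the ROCm build actually sees the card and can run a kernel. A minimal sketch (the gcnArchName property may not exist on older PyTorch versions, hence the getattr fallback):

# Minimal sanity check for a ROCm build of PyTorch.
import torch

print(torch.__version__)          # ROCm wheels carry a "+rocmX.Y" suffix
print(torch.version.hip)          # HIP version the wheel was built against
print(torch.cuda.is_available())  # True if the Instinct card is visible

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(props.name, getattr(props, "gcnArchName", "n/a"))
    x = torch.randn(1024, 1024, device="cuda")
    print((x @ x).sum().item())   # confirms kernels actually run on the GPU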
Training
I’ve managed to run a LoRA training on the fp16 model, and the training completed successfully without any additional configuration.
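The exact training script doesn’t matter much here; the point is that the standard Hugging Face PEFT flow works as-is on the ROCm build of PyTorch. Below is a minimal sketch of attaching a LoRA adapter to an fp16 model; the model ID, rank, and target modules are assumptions, adjust them to your setup:

# Minimal sketch: wrap an fp16 model with a LoRA adapter using PEFT.
# Model ID and target_modules are assumptions (typical Llama attention projections).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="cuda:0"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# From here, train with your usual Trainer / SFTTrainer setup.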
Quantized model training
Unsloth
Not tested yet
Bits And Bytes
To do QLoRA you have to build and install a ROCm-specific bitsandbytes. There are multiple sources/repos describing how to do this.
ROCm repo
https://github.com/ROCm/bitsandbytes
Huggingface
https://huggingface.co/docs/bitsandbytes/main/en/installation#multi-backend
git clone -b multi-backend-refactor https://github.com/bitsandbytes-foundation/bitsandbytes.git && cd bitsandbytes/
# Compile & install
apt-get install -y build-essential cmake # install build tools dependencies, unless present
cmake -DCOMPUTE_BACKEND=hip -S . # Use -DBNB_ROCM_ARCH="gfx90a;gfx942" to target specific gpu arch
make
pip install -e . # `-e` for "editable" install, when developing BNB (otherwise leave that out)
AMD’s suggestion
https://rocm.blogs.amd.com/artificial-intelligence/bnb-8bit/README.html
Partial success
At the time of writing this blog post (Feb 28, 2025) I’ve achieved partial success:
- Built and installed bitsandbytes
- Loaded the model into memory
- Inference: NOT working
Here I’ve used the code and instructions from https://github.com/ROCm/bitsandbytes/tree/rocm_enabled_multi_backend
This repo contains a BNB version that properly handles whether Triton is available or not.
Issues
After the build, I tested the BNB package with python -m bitsandbytes and ran into the following issues.
GLIBC
Could not load bitsandbytes native library: /home/kecso/miniconda3/envs/bnbtest/bin/../lib/libstdc++.so.6: version `GLIBCXX_3.4.32' not found (required by /home/kecso/Documents/workspace/bitsandbytes/bitsandbytes/libbitsandbytes_hip.so)
The conda environment’s libstdc++ is too old (missing GLIBCXX_3.4.32), so update the environment: conda install -c conda-forge gcc=12
triton.ops
ModuleNotFoundError: No module named 'triton.ops'
No success with this one so far. It is addressed in BNB here: https://github.com/bitsandbytes-foundation/bitsandbytes/commit/032beb953e7ddbf001d244489850a79c281e8217
Successful BNB installation
(bnbtest) kecso@gpu-testbench2:~/bitsandbytes$ python -m bitsandbytes
g++ (Ubuntu 14.2.0-4ubuntu2) 14.2.0
Copyright (C) 2024 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++ BUG REPORT INFORMATION ++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++ OTHER +++++++++++++++++++++++++++
ROCm specs: rocm_version_string='63', rocm_version_tuple=(6, 3)
PyTorch settings found: ROCM_VERSION=63
The directory listed in your path is found to be non-existent: local/gpu-testbench2
The directory listed in your path is found to be non-existent: @/tmp/.ICE-unix/2803,unix/gpu-testbench2
The directory listed in your path is found to be non-existent: /etc/xdg/xdg-ubuntu
The directory listed in your path is found to be non-existent: /org/gnome/Terminal/screen/6bd83ab2_fd9f_4990_876a_527ef8117ef6
The directory listed in your path is found to be non-existent: //debuginfod.ubuntu.com
WARNING! ROCm runtime files not found in any environmental path.
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++ DEBUG INFO END ++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Checking that the library is importable and ROCm is callable...
SUCCESS!
Installation was successful!
Inference
HIP device:
export HIP_VISIBLE_DEVICES=0
Load the model in 8-bit
Test code:
import transformers
import torch

# "cuda:0" maps to the first ROCm/HIP device in PyTorch's ROCm build
model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
device = "cuda:0"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16, "load_in_8bit": True},
    device_map=device,
)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

outputs = pipeline(
    messages,
    max_new_tokens=256,
)
print(outputs[0]["generated_text"][-1])

As seen in the screenshot above, with "load_in_8bit": True the memory footprint is about half, but inference fails with:
Exception: cublasLt ran into an error!
hipBLASLt on an unsupported architecture!
The same error occurs with an explicit BNB config:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

device = "cuda:0"
model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"

quantization_config = BitsAndBytesConfig(load_in_8bit=True)
quantized_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map=device,
    torch_dtype=torch.bfloat16,
    quantization_config=quantization_config,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

input_text = "Tell me a joke."
input_ids = tokenizer(input_text, return_tensors="pt").to(device)

output = quantized_model.generate(**input_ids, max_new_tokens=10)
print(tokenizer.decode(output[0], skip_special_tokens=True))
So far there is no solution for inference or training with the model loaded in 8-bit 🙁
It turned out that the MI100 (gfx908) has no hipBLASLt support:
https://github.com/ROCm/hipBLASLt/blob/develop/README.md
UPDATE 03/02/2025
4bit inference is working!
import transformers
import torch

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
device = "cuda:0"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16, "load_in_4bit": True},
    device_map=device,
)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

outputs = pipeline(
    messages,
    max_new_tokens=256,
)
print(outputs[0]["generated_text"][-1])

The first memory bump is the 4-bit load; the second, bigger bump is the 16-bit precision load.
Latest requirements.txt: https://github.com/csabakecskemeti/all_about_rocm/blob/main/bnb_rocm_requirements2.txt
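If you prefer the 4-bit load to be explicit rather than a model_kwargs flag, the same thing can be expressed with a BitsAndBytesConfig. A minimal sketch; the nf4 quant type, double quantization, and bfloat16 compute dtype are assumptions rather than tested recommendations:

# Minimal sketch: explicit 4-bit load via BitsAndBytesConfig.
# nf4 + double quant + bfloat16 compute are assumptions; adjust as needed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name, device_map="cuda:0", quantization_config=bnb_config
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

inputs = tokenizer("Tell me a joke.", return_tensors="pt").to("cuda:0")
output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0], skip_special_tokens=True))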
Interesting findings
- ROCm keeps working after hibernation or sleep, unlike CUDA.