I’ve really wanted a server that I can use to host a quantized model for personal purposes, such as demoing a model, using it in a Hugging Face Space, etc. There are a huge variety of options to do this.
What to choose?
- Obvious – use a CLOUD resource
Pros: Easy to set up, scalable
Cons: I really don’t want another monthly cost 🙂
- Hugging Face Free tier 2 vCPU virtual hardware
Pros: Free, I’ve already set it up
Cons: OK – it’s really, really slow; it’s only just usable for this purpose
- I can use my AI workstation
Pros: Fast inference, trivial setup
Cons: A large PC would have to run all the time for occasional usage; when I use it for something else, the demo inference won’t be available; noise and heat 🙂
- A real AI server
Pros: Performance would be awesome. This would allow me to use it for training, and maybe even make a little $ by renting it out
Cons: You can imagine – cost, space, heat and noise
- Another PC as a server
Pros: Good performance, and a dedicated machine, so I could use my workstation without affecting the service. Multi-use. Maybe experiment with other GPUs and try ROCm
Cons: Space, heat and noise
- An SBC – Single Board Computer
Pros: Small, cheap, quiet
Cons: Performance
I’ve learned there are actually very powerful SBCs out there. Raspberry Pi 5s are very capable, though sometimes availability is limited. Radxa also builds performant SBCs, with both ARM and x86 CPUs; again, the fully loaded models are not always available. Finally, I’ve settled on something that is available and beats the specs of both the Raspberry Pi 5 and most Radxa models: Orange Pi 5 Plus.

The spec is unbelievable:
- Rockchip RK3588 8-core 64-bit processor, up to 2.4 GHz
- Onboard Mali-G610 GPU – OpenCL compatible
- 32 GB LPDDR RAM
- eMMC
- M.2 NVMe for storage – or, since it implements PCIe, maybe add a GPU (OK, later 🙂)
- M.2 E-key slot for Wi-Fi and BT
- And all the connectors you can imagine
This itself sounds like a decent desktop configuration and is available right now for $215.
Setup llama.cpp
After installing a Debian-based OS (link below), the installation was straightforward: clone the llama.cpp repository and run make, which succeeded without any hiccups.
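For reference, the build boils down to a couple of commands (a rough sketch – the clone location is up to you):
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
Now you are ready to start the model and generate: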
./main -m ~/Downloads/llama3_8b_chat_brainstorm.Q2_K.gguf -n -1 --color -r "User:" --in-prefix " " -i -p 'User: Hi
AI: Hi how can I help?
User:'
To start as a server:
./server -m ~/Downloads/llama3_8b_chat_brainstorm.Q2_K.gguf -c 2048 --port 1234
By default the server listens on localhost, so to make it available externally I’ve defined the host:
server -m ~/Downloads/llama3_8b_chat_brainstorm-v2.1.Q3_K_M.gguf -c 2048 --host <lan IP> --port 1234
To test it, hit it with curl:
curl --request POST --url http://<lan ip>:1234/completion --header "Content-Type: application/json" --data '{"prompt": "### HUMAN: hello\n ### ASSISTANT:","n_predict": 24}'
Now the last step is to NAT the port on your router or firewall, and it’s available on the internet.
Performance
I’ve tested and am using a fine-tuned LLaMA3 8B Q3 quantized model. How good is it? It really depends on what you want to compare it with. It’s definitely not CUDA performance, but we’re talking about a $200 all-around SBC.
"timings":{"prompt_n":9,"prompt_ms":2176.702,"prompt_per_token_ms":241.8557777777778,"prompt_per_second":4.134695516428064,"predicted_n":24,"predicted_ms":6784.15,"predicted_per_token_ms":282.67291666666665,"predicted_per_second":3.5376576284427674}
- Prompt processing: 4.13 tok/s
- Generation: 3.5 tok/s
This definitely beats the Free Tier HF virtual hardware :). (Plus I’m also running my son’s Minecraft server on the SBC.)
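Those two figures are just the prompt_per_second and predicted_per_second fields from the timings object the server returns; a tiny (hypothetical) helper to pull them out of a /completion response could look like this:
import json
import requests

def completion_tok_per_sec(url, payload):
    # POST a non-streaming /completion request and return
    # (prompt tok/s, generation tok/s) from the server's timings object
    response = requests.post(url, json=payload)
    timings = response.json()["timings"]
    return timings["prompt_per_second"], timings["predicted_per_second"]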
Connect to 🤗 Spaces
I’ve written my own little streaming llama.cpp client that handles the Alpaca prompt format and maintains the context. I think it’s pretty straightforward:
import json
import requests
import gradio as gr

# llama.cpp server endpoint on the SBC (see the curl example above)
sbc_host_url = "http://<lan ip>:1234/completion"

def chatty(prompt, messages, n_predict=128):
    # Rebuild the conversation history in the ### HUMAN / ### ASSISTANT format
    past_messages = ''
    if len(messages) > 0:
        for idx, message in enumerate(messages):
            past_messages += f'\n### HUMAN: {message[0]}'
            past_messages += f'\n### ASSISTANT: {message[1]}'
    system = "### System: You help to brainstorm ideas.\n"
    # Use the accumulated history, then append the new prompt
    prompt_templated = f'{system} {past_messages}\n ### HUMAN:\n{prompt} \n ### ASSISTANT:'
    headers = {
        "Content-Type": "application/json"
    }
    data = {
        "prompt": prompt_templated,
        "n_predict": n_predict,
        "stop": ["### HUMAN:", "### ASSISTANT:", "HUMAN"],
        "stream": True
    }
    result = ""
    try:
        response = requests.post(sbc_host_url, headers=headers, data=json.dumps(data), stream=True)
        if response.status_code == 200:
            # Stream the generated tokens and yield the growing answer
            for line in response.iter_lines():
                if line:
                    try:
                        result += json.loads(line.decode('utf-8').replace('data: ', ''))['content']
                    except:
                        # LMStudio response has empty token
                        pass
                    yield result
        else:
            response.raise_for_status()
    except requests.exceptions.RequestException as e:
        raise gr.Warning("Apologies for the inconvenience! Our model is currently self-hosted and unavailable at the moment.")
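Wiring this generator into a chat UI is then only a couple of lines. The sketch below is a minimal illustration, not the exact Space code (it assumes Gradio’s tuple-style chat history, which matches the message[0]/message[1] indexing above):
import gradio as gr

# chatty(prompt, messages) matches ChatInterface's (message, history) call
# signature, and the yielded strings stream into the chatbot turn by turn.
demo = gr.ChatInterface(chatty)
demo.launch()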
Example code: gradio_llamacpp_client_and_streaming_client_app.py
You can try it here: DevQuasar/llama3_on_sbc
Hardware-accelerated inference on the Mali-G610 with OpenCL
The Rockchip RK3588 has an integrated ARM Mali-G610 MP4 quad-core GPU that supports OpenCL. So is it possible to run the model on the built-in GPU?
First you need the drivers installed based on this:
https://llm.mlc.ai/docs/install/gpu.html#orange-pi-5-rk3588-based-sbc
If the drivers are installed successfully, you should see the device show up in clinfo:
clinfo -l
Platform #0: ARM Platform
arm_release_ver of this libmali is 'g6p0-01eac0', rk_so_ver is '7'.
`-- Device #0: Mali-LODX r0p0
Platform #1: Clover
Next build llama.cpp with OpenCL.
- You need some additional libs before that:
git clone --recurse-submodules https://github.com/KhronosGroup/OpenCL-SDK.git
cd OpenCL-SDK
sudo apt install cmake
cmake -B build -DBUILD_DOCS=OFF -DBUILD_EXAMPLES=OFF -DBUILD_TESTING=OFF -DOPENCL_SDK_BUILD_SAMPLES=OFF -DOPENCL_SDK_TEST_SAMPLES=OFF
cmake --build build
mkdir ~/opencl_sdk
cmake --install build --prefix /home/orangepi/opencl_sdk/
sudo apt install clblast
- And build llama.cpp:
make LLAMA_CLBLAST=1
From here you just run the server and offload layers to the GPU. For this you should use the -ngl flag, like:
./server -m <model.gguf> -c 2048 --host <IP> --port 1234 -ngl 10
arm_release_ver of this libmali is 'g6p0-01eac0', rk_so_ver is '7'.
ggml_opencl: selecting platform: 'ARM Platform'
ggml_opencl: selecting device: 'Mali-LODX r0p0'
...
llm_load_tensors: offloading 10 repeating layers to GPU
llm_load_tensors: offloaded 10/33 layers to GPU
llm_load_tensors: CPU buffer size = 3825.27 MiB
llm_load_tensors: OpenCL buffer size = 995.00 MiB
...
Or you can offload the whole model, since the integrated GPU shares the main memory with the CPU cores.
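A full offload would be something along the lines of the following (33 is the total layer count reported in the load log above):
./server -m <model.gguf> -c 2048 --host <IP> --port 1234 -ngl 33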
Performance
So it’s doable but do we gain any performance improvements?
Result with full model (33 layers) offload
"timings":{"prompt_n":9,"prompt_ms":15645.028,"prompt_per_token_ms":1738.3364444444444,"prompt_per_second":0.575262632959174,"predicted_n":24,"predicted_ms":26988.215,"predicted_per_token_ms":1124.5089583333333,"predicted_per_second":0.8892770418495629}
Result with 10 layers offload
"timings":{"prompt_n":9,"prompt_ms":7318.864,"prompt_per_token_ms":813.2071111111111,"prompt_per_second":1.229699035260117,"predicted_n":24,"predicted_ms":15943.454,"predicted_per_token_ms":664.3105833333333,"predicted_per_second":1.5053199890061464}
Result with CPU only
"timings":{"prompt_n":9,"prompt_ms":2067.684,"prompt_per_token_ms":229.74266666666668,"prompt_per_second":4.352696059939526,"predicted_n":24,"predicted_ms":6418.636,"predicted_per_token_ms":267.4431666666667,"predicted_per_second":3.739112172741997}
My results show that we don’t gain performance – we lose it. I can theorize a few reasons for this:
- The CPU is simply much more performant than the GPU
- The OpenCL driver cannot exploit the GPU’s full potential
These need more investigation.
Next
There are a few more things I’m looking forward to trying with this board.
A) mlc.ai has tested the previous generation with models compiled to OpenCL:
B) The Rockchip RK3588 also has a built-in NPU (AI accelerator). I’m curious how we can harness its capabilities.
C) The board has an M.2 NVMe connector that implements PCIe… Can I add an “external” GPU? 😀
