Inference on a $200 SBC

I’ve really wanted a server that I can use to host a quantized model for personal purposes, such as demoing a model, using it in a Hugging Face Space, etc. There is a huge variety of options for doing this.

What to choose?

  • Obvious – use a CLOUD resource
    Pros: Easy to set up, scalable
    Cons: I really don’t want another monthly cost 🙂
  • Hugging Face free-tier 2 vCPU virtual hardware
    Pros: Free, and I’ve already set it up
    Cons: OK – it’s really, really slow, and it’s only usable for this purpose
  • I can use my AI workstation
    Pros: Fast inference, trivial setup
    Cons: A large PC would have to run all the time for occasional usage, and while I use it for something else the demo inference wouldn’t be available. Noise and heat 🙂
  • A real AI server
    Pros: Performance would be awesome. It would also let me use it for training, maybe even make a little $ by renting it out
    Cons: You can imagine – cost, space, heat and noise
  • Another PC as a server
    Pros: Good performance, and a dedicated machine, so I could use my workstation without affecting the service. Multi-purpose. Maybe experiment with other GPUs and try ROCm
    Cons: Space, heat and noise
  • An SBC – Single Board Computer
    Pros: Small, cheap, quiet
    Cons: Performance

I’ve learned there are actually very powerful SBCs out there. Raspberry Pi 5s are very capable, though sometimes availability is limited. Radxa also builds performant SBCs, with both ARM and x86 CPUs; again, the fully loaded models are not always available. Finally, I’ve settled on something that is available and beats the specs of both the Raspberry Pi 5 and most Radxa models: Orange Pi 5 Plus.

The specs are unbelievable:

  • Rockchip RK3588 8-core 64-bit processor, up to 2.4GHz
  • Onboard Mali-G610 GPU – OpenCL compatible
  • 32GB LPDDR RAM
  • eMMC
  • M.2 NVMe for storage, or – since it implements PCIe – maybe even a GPU (OK, later 🙂)
  • M.2 E-key slot for WiFi and BT
  • And all the connectors you can imagine

This itself sounds like a decent desktop configuration and is available right now for $215.

Set up llama.cpp

After installing a Debian-based OS (link below), the installation was straightforward: clone the llama.cpp repository and run make. Make succeeded without any hiccups, and now you are ready to load the model and start generating.

./main -m ~/Downloads/llama3_8b_chat_brainstorm.Q2_K.gguf -n -1 --color -r "User:" --in-prefix " " -i -p 'User: Hi
AI: Hi how can I help?
User:' 

To start as a server:

./server -m ~/Downloads/llama3_8b_chat_brainstorm.Q2_K.gguf -c 2048 --port 1234

By default the server listens on localhost, so to make it available externally I’ve defined the host:

./server -m ~/Downloads/llama3_8b_chat_brainstorm-v2.1.Q3_K_M.gguf -c 2048 --host <lan IP> --port 1234

To test it, hit it with curl:

curl --request POST \
    --url http://<lan ip>:1234/completion \
    --header "Content-Type: application/json" \
    --data '{"prompt": "### HUMAN: hello\n ### ASSISTANT:","n_predict": 24}'
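
If you’d rather script the test, the same request from Python looks roughly like this (a minimal sketch assuming the requests library and the same <lan ip> placeholder):

import requests

# same endpoint the curl above hits; replace the placeholder with your LAN IP
url = "http://<lan ip>:1234/completion"

payload = {
    "prompt": "### HUMAN: hello\n ### ASSISTANT:",
    "n_predict": 24,
}

# non-streaming request; the llama.cpp server answers with a single JSON body
response = requests.post(url, json=payload, timeout=120)
response.raise_for_status()

body = response.json()
print(body["content"])   # the generated text
print(body["timings"])   # per-request speed stats (more on these below)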

Now the last step is to NAT the port on your router or firewall, and it’s available on the internet.

Performance

I’ve tested and am using a fine-tuned LLaMA3 8B Q3 quantized model. How good is it? It really depends on what you want to compare it with. It’s definitely not CUDA performance, but we’re talking about a $200 all-around SBC.

"timings":{"prompt_n":9,"prompt_ms":2176.702,"prompt_per_token_ms":241.8557777777778,"prompt_per_second":4.134695516428064,"predicted_n":24,"predicted_ms":6784.15,"predicted_per_token_ms":282.67291666666665,"predicted_per_second":3.5376576284427674}
  • Prompt processing: 4.13 tok/s
  • Generation: 3.54 tok/s
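
The per-second figures in the timings block are just token counts divided by elapsed time; a quick sanity check with the numbers above:

# recompute tok/s from the timings block of the response above
timings = {"prompt_n": 9, "prompt_ms": 2176.702,
           "predicted_n": 24, "predicted_ms": 6784.15}

prompt_tps = timings["prompt_n"] / timings["prompt_ms"] * 1000           # ~4.13 tok/s
predicted_tps = timings["predicted_n"] / timings["predicted_ms"] * 1000  # ~3.54 tok/s
print(f"prompt: {prompt_tps:.2f} tok/s, generation: {predicted_tps:.2f} tok/s")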

This definitely beats the free-tier HF virtual hardware :). (Plus I’m also running my son’s Minecraft server on the SBC.)

Connect to 🤗 Spaces

I’ve written my own little streaming llama.cpp client that handles the Alpaca-style prompt format and maintains the context. I think it’s pretty straightforward:

import json

import requests
import gradio as gr

# assumed placeholder – point this at the llama.cpp server's completion endpoint
sbc_host_url = "http://<lan ip>:1234/completion"


def chatty(prompt, messages, n_predict=128):
    # print(prompt)
    # print(f'messages: {messages}')
    past_messages = ''
    if len(messages) > 0:
        for idx, message in enumerate(messages):
            # print(f'idx: {idx}, message: {message}')
            past_messages += f'\n### HUMAN: {message[0]}'
            past_messages += f'\n### ASSISTANT: {message[1]}'
                
    system = "### System: You help to brainstorm ideas.\n"
    # build the full prompt: system prompt + accumulated history + the new user turn
    prompt_templated = f'{system} {past_messages}\n ### HUMAN:\n{prompt} \n ### ASSISTANT:'
    
    headers = {
        "Content-Type": "application/json"
    }
    data = {
        "prompt": prompt_templated,
        "n_predict": n_predict,
        "stop": ["### HUMAN:", "### ASSISTANT:", "HUMAN"],
        "stream": True
    }

    result = ""
    try:
        response = requests.post(sbc_host_url, headers=headers, data=json.dumps(data), stream=True)
        
        if response.status_code == 200:
            for line in response.iter_lines():
                if line:
                    try:
                        result += json.loads(line.decode('utf-8').replace('data: ', ''))['content']
                    except (json.JSONDecodeError, KeyError):
                        # LMStudio responses can contain empty tokens; skip them
                        pass
                    yield result
        else:
            response.raise_for_status()
    except requests.exceptions.RequestException as e:
        # surface a friendly message in the Gradio UI instead of a raw traceback
        raise gr.Error("Apologies for the inconvenience! Our model is currently self-hosted and unavailable at the moment.")

Example code: gradio_llamacpp_client_and_streaming_client_app.py

You can try it here: DevQuasar/llama3_on_sbc
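
For completeness, here is a minimal sketch of how a streaming generator like chatty can be dropped into a Gradio chat UI. The actual Space uses its own layout; gr.ChatInterface (with the older tuple-style history that chatty iterates over) is just the shortest way to demo it:

import gradio as gr

# chatty(prompt, messages) yields the growing answer as tokens stream in,
# which is exactly the (message, history) generator ChatInterface expects
demo = gr.ChatInterface(fn=chatty, title="llama.cpp on an Orange Pi 5 Plus")

if __name__ == "__main__":
    demo.launch()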

Hardware-accelerated inference on the Mali-G610 with OpenCL

The Rockchip RK3588 has an integrated ARM Mali-G610 MP4 quad-core GPU that supports OpenCL. So is it possible to run the model on the built-in GPU?

First you need the drivers installed, based on this guide:
https://llm.mlc.ai/docs/install/gpu.html#orange-pi-5-rk3588-based-sbc
If the drivers are installed successfully, you should see the device show up in clinfo:

clinfo -l
Platform #0: ARM Platform
arm_release_ver of this libmali is 'g6p0-01eac0', rk_so_ver is '7'.
`-- Device #0: Mali-LODX r0p0
Platform #1: Clover

Next, build llama.cpp with OpenCL support.

  • You need some additional libs before that:
    git clone --recurse-submodules https://github.com/KhronosGroup/OpenCL-SDK.git
    cd OpenCL-SDK
    sudo apt install cmake
    cmake -B build -DBUILD_DOCS=OFF   -DBUILD_EXAMPLES=OFF   -DBUILD_TESTING=OFF   -DOPENCL_SDK_BUILD_SAMPLES=OFF   -DOPENCL_SDK_TEST_SAMPLES=OFF
    cmake --build build
    mkdir ~/opencl_sdk
    cmake --install build --prefix /home/orangepi/opencl_sdk/
    sudo apt install clblast
  • And build llama.cpp
    make LLAMA_CLBLAST=1

From here you just run the server and offload layers to the GPU. For this, use the -ngl flag, like:


./server -m <model.gguf> -c 2048 --host <IP> --port 1234 -ngl 10
arm_release_ver of this libmali is 'g6p0-01eac0', rk_so_ver is '7'.
ggml_opencl: selecting platform: 'ARM Platform'
ggml_opencl: selecting device: 'Mali-LODX r0p0'
...

llm_load_tensors: offloading 10 repeating layers to GPU
llm_load_tensors: offloaded 10/33 layers to GPU
llm_load_tensors: CPU buffer size = 3825.27 MiB
llm_load_tensors: OpenCL buffer size = 995.00 MiB
...

Or you can offload the whole model, since the integrated GPU shares the main memory with the CPU cores.

Performance

So it’s doable, but do we gain any performance improvement?

Result with the full model (33 layers) offloaded

"timings":{"prompt_n":9,"prompt_ms":15645.028,"prompt_per_token_ms":1738.3364444444444,"prompt_per_second":0.575262632959174,"predicted_n":24,"predicted_ms":26988.215,"predicted_per_token_ms":1124.5089583333333,"predicted_per_second":0.8892770418495629} 

Result with 10 layers offloaded

"timings":{"prompt_n":9,"prompt_ms":7318.864,"prompt_per_token_ms":813.2071111111111,"prompt_per_second":1.229699035260117,"predicted_n":24,"predicted_ms":15943.454,"predicted_per_token_ms":664.3105833333333,"predicted_per_second":1.5053199890061464}

Result with CPU only

"timings":{"prompt_n":9,"prompt_ms":2067.684,"prompt_per_token_ms":229.74266666666668,"prompt_per_second":4.352696059939526,"predicted_n":24,"predicted_ms":6418.636,"predicted_per_token_ms":267.4431666666667,"predicted_per_second":3.739112172741997}

My results show we don’t gain performance but lose it: roughly 0.9 tok/s generation with the full model offloaded and 1.5 tok/s with 10 layers, versus 3.7 tok/s on CPU only. I can theorize a few reasons for this:

  • The CPU is simply much more performant than this GPU
  • The OpenCL driver cannot exploit the GPU’s full potential

These need more investigation.

Next

There are a few more things I’m looking forward to trying with this board.

A) mlc.ai has tested the previous generation with models compiled to OpenCL:

B) The Rockchip RK3588 also has a built-in NPU AI accelerator. I’m curious how we can harness its capabilities.

C) The board has an M.2 NVMe connector that implements PCIe… Can I add an “external” GPU? 😀

Useful links