DevQuasar X CIVO
Cloud deployment is not difficult. Let’s create a very simple private LLM service on CIVO cloud, using llama.cpp and a quantized model from https://huggingface.co/DevQuasar .
Create your k8s cluster on CIVO
Navigate to the Kubernetes tab
Choose the flavor that fits your budget. In this tutorial I’ll use small performance nodes for my 2-node cluster. We’ll do CPU inference with a tiny 1B model.
We won’t install anything; everything we need ships in the Docker image (ghcr.io/ggml-org/llama.cpp:server).
Once the cluster is created, download the kubeconfig.
Export the KUBECONFIG
export KUBECONFIG=civo-llama-cpp-cpu-server-kubeconfig
This will allow us to communicate with the cluster.
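As a quick sanity check, both nodes of the cluster should now show up:

kubectl get nodes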
Let’s define the service
For the service we’re using the precompiled llama.cpp server Docker image, which contains everything we need to run quantized LLMs. The image is available at ghcr.io/ggml-org/llama.cpp:server, and you can find detailed information about the various Docker images and how to use them in this readme.
Create a directory, e.g. k8s-llama/
and create these files inside:
deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llama-server
  template:
    metadata:
      labels:
        app: llama-server
    spec:
      containers:
        - name: llama-server
          image: ghcr.io/ggml-org/llama.cpp:server
          ports:
            - containerPort: 8080
          env:
            - name: LLAMA_ARG_MODEL_URL
              value: "https://huggingface.co/DevQuasar/ibm-granite.granite-3.1-1b-a400m-instruct-GGUF/resolve/main/ibm-granite.granite-3.1-1b-a400m-instruct.Q3_K_M.gguf"
The service will listen on port 80, and we’re using a small quantized 1B-parameter model from my Huggingface org: https://huggingface.co/DevQuasar/ibm-granite.granite-3.1-1b-a400m-instruct-GGUF. For security we can also add the LLAMA_API_KEY parameter, as sketched below.
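A minimal sketch of what that could look like in the container’s env section (the key value below is just a placeholder; in a real deployment you’d rather mount it from a Kubernetes Secret):

          env:
            - name: LLAMA_ARG_MODEL_URL
              value: "https://huggingface.co/DevQuasar/ibm-granite.granite-3.1-1b-a400m-instruct-GGUF/resolve/main/ibm-granite.granite-3.1-1b-a400m-instruct.Q3_K_M.gguf"
            - name: LLAMA_API_KEY
              value: "change-me"   # placeholder, prefer a Kubernetes Secret

Clients would then pass the key with each request, e.g. in an Authorization: Bearer header.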
service.yaml
apiVersion: v1
kind: Service
metadata:
  name: llama-service
spec:
  type: LoadBalancer
  selector:
    app: llama-server
  ports:
    - protocol: TCP
      port: 80          # Public-facing port
      targetPort: 8080  # Container port
Let’s deploy the service
(note: you need to have the kubeconfig exported)
kubectl apply -f k8s-llama/deployment.yaml
> deployment.apps/llama-server created
kubectl apply -f k8s-llama/service.yaml
> service/llama-service created
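A quick check that the pod is actually up:

kubectl get pods

The llama-server pod should reach the Running state after a short while.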
Get the service’s external IP
kubectl get svc llama-service
NAME            TYPE           CLUSTER-IP    EXTERNAL-IP    PORT(S)        AGE
llama-service   LoadBalancer   10.##.##.##   212.##.##.##   80:31608/TCP   47s
Now test it
Let’s use a simple curl request
curl --request POST \
--url http://212.##.##.##/completion \
--header "Content-Type: application/json" \
--data '{"prompt": "Tell me a joke","n_predict": 256}'
FYI: the first call might be slower, as the server has to download the model file from Huggingface and load it.
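If you’re curious, you can follow the startup and model download in the pod logs:

kubectl logs deploy/llama-server -f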
{"index":0,"content":"\n\n\nA man walks into an elevator and asks for the highest floor. The elevator doors close behind him. He says, \"How did you know I was coming?\" The elevator operator replies, \"I had to do that to make it work.\"\n\nWhat's the difference between a pen and pencil?\n\nThe pen has a barrel, the pencil has a tip.\n\nWhat do you call someone who only eats fast food?\n\nFast-fied.\n\nWhat do you call someone who only eats fast food?,"tokens":[],"id_slot":0,"stop":true,"model":"gpt-3.5-turbo","tokens_predicted":256,"tokens_evaluated":5,"generation_settings":{"n_predict":256,"seed":4294967295,"temperature":0.800000011920929,"dynatemp_range":0.0,"dynatemp_exponent":1.0,"top_k":40,"top_p":0.949999988079071,"min_p":0.05000000074505806,"xtc_probability":0.0,"xtc_threshold":0.10000000149011612,"typical_p":1.0,"repeat_last_n":64,"repeat_penalty":1.0,"presence_penalty":0.0,"frequency_penalty":0.0,"dry_multiplier":0.0,"dry_base":1.75,"dry_allowed_length":2,"dry_penalty_last_n":4096,"dry_sequence_breakers":["\n",":","\"","*"],"mirostat":0,"mirostat_tau":5.0,"mirostat_eta":0.10000000149011612,"stop":[],"max_tokens":256,"n_keep":0,"n_discard":0,"ignore_eos":false,"stream":false,"logit_bias":[],"n_probs":0,"min_keep":0,"grammar":"","grammar_lazy":false,"grammar_triggers":[],"preserved_tokens":[],"chat_format":"Content-only","samplers":["penalties","dry","top_k","typ_p","top_p","min_p","xtc","temperature"],"speculative.n_max":16,"speculative.n_min":0,"speculative.p_min":0.75,"timings_per_token":false,"post_sampling_probs":false,"lora":[]},"prompt":"Tell me a joke","has_new_line":true,"truncated":false,"stop_type":"limit","stopping_word":"","tokens_cached":260,"timings":{"prompt_n":5,"prompt_ms":129.35,"prompt_per_token_ms":25.869999999999997,"prompt_per_second":38.65481252415926,"predicted_n":256,"predicted_ms":8117.727,"predicted_per_token_ms":31.70987109375,"predicted_per_second":31.53592132378928}}
And voilà, it’s working.
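The llama.cpp server also exposes OpenAI-compatible endpoints, so a chat-style request along these lines should work too (a sketch using the /v1/chat/completions route and OpenAI-style parameters):

curl --request POST \
--url http://212.##.##.##/v1/chat/completions \
--header "Content-Type: application/json" \
--data '{"messages": [{"role": "user", "content": "Tell me a joke"}], "max_tokens": 128}'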
Let’s see some metrics
The cluster comes with a metrics server preinstalled, and we can access usage information with the commands presented in the guide:
How is our CPU use during generation?
kubectl top pod --all-namespaces
NAMESPACE     NAME                              CPU(cores)   MEMORY(bytes)
default       llama-server-79fb8cf879-vdhrp     1658m        491Mi
kube-system   civo-ccm-5474f5869d-vt7qg         3m           17Mi
kube-system   civo-csi-controller-0             4m           61Mi
kube-system   civo-csi-node-897l6               1m           20Mi
kube-system   civo-csi-node-m4898               1m           21Mi
kube-system   coredns-6799fbcd5-zhpxx           3m           22Mi
kube-system   metrics-server-67c658944b-cp4cr   10m          26Mi
kube-system   otel-collector-986kt              6m           69Mi
kube-system   otel-collector-njgv7              15m          69Mi
kube-system   traefik-8vnsp                     2m           28Mi
kube-system   traefik-r89dl                     3m           35Mi
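To check node-level utilization as well:

kubectl top node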
Local UI
OK, using it from the command line is not user friendly. We can build a Gradio UI for our service (see the sketch at the end of this post), or just get the yourchat app: https://yourchat.app/ . There we can create a new LLM provider; let’s call it llamacpp-k8s.
And now you can chat with your private LLM service
Easy, isn’t it?
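If you’d rather build that Gradio UI yourself, here is a minimal sketch (just an illustration: it assumes the gradio and requests Python packages are installed, and the external IP is a placeholder you need to replace with your own):

# Minimal chat UI against the /completion endpoint of our llama.cpp service.
import gradio as gr
import requests

LLAMA_URL = "http://<EXTERNAL-IP>/completion"  # the LoadBalancer IP from `kubectl get svc`

def ask(message, history):
    # Send the user's prompt to the llama.cpp server and return the generated text.
    resp = requests.post(LLAMA_URL, json={"prompt": message, "n_predict": 256}, timeout=120)
    resp.raise_for_status()
    return resp.json().get("content", "")

gr.ChatInterface(ask).launch()

Save it e.g. as app.py, run python app.py and open the local URL Gradio prints.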