Private LLM server on K8S [tutorial]

devquasar X CIVO

Cloud deployment is not difficult. Let's create a very simple private LLM service on CIVO cloud, using llama.cpp and a quantized model from https://huggingface.co/DevQuasar .

Create your k8s cluster on CIVO

Navigate to the Kubernetes tab

Choose a flavor that fits your budget. In this tutorial I'll use the small performance node size for my 2-node cluster, and we'll do CPU inference with a tiny 1B model.
We won't install anything on the nodes; everything we need ships in the Docker image (ghcr.io/ggml-org/llama.cpp:server).

Once the cluster is created, download the kubeconfig.

Export the KUBECONFIG

export KUBECONFIG=civo-llama-cpp-cpu-server-kubeconfig

This will allow us to communicate with the cluster.
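To double check that the config works, list the nodes (standard kubectl, nothing CIVO-specific):

kubectl get nodes

You should see both nodes in the Ready state.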

Let’s define the service

For the service we're using the precompiled llama.cpp server Docker image, which contains everything we need to run quantized LLMs. The image is available at ghcr.io/ggml-org/llama.cpp:server, and you can find detailed information about the various Docker images and how to use them in the llama.cpp readme.
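(Optional) If you have Docker installed locally, you can smoke-test the image before deploying. This is just a sketch; it reuses the same LLAMA_ARG_MODEL_URL environment variable as the deployment below, and LLAMA_ARG_HOST is set explicitly so the server listens on all interfaces:

docker run --rm -p 8080:8080 \
  -e LLAMA_ARG_HOST=0.0.0.0 \
  -e LLAMA_ARG_MODEL_URL="https://huggingface.co/DevQuasar/ibm-granite.granite-3.1-1b-a400m-instruct-GGUF/resolve/main/ibm-granite.granite-3.1-1b-a400m-instruct.Q3_K_M.gguf" \
  ghcr.io/ggml-org/llama.cpp:server

Then curl http://localhost:8080/health to check it's up.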

Create a directory, e.g. k8s-llama/, and add these files inside:

deployment.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llama-server
  template:
    metadata:
      labels:
        app: llama-server
    spec:
      containers:
      - name: llama-server
        image: ghcr.io/ggml-org/llama.cpp:server
        ports:
        - containerPort: 8080
        env:
        - name: LLAMA_ARG_MODEL_URL
          value: "https://huggingface.co/DevQuasar/ibm-granite.granite-3.1-1b-a400m-instruct-GGUF/resolve/main/ibm-granite.granite-3.1-1b-a400m-instruct.Q3_K_M.gguf"

The service will listen on port 80 (the container itself listens on 8080), and we're using a small 1B-parameter quantized model from my Hugging Face org: https://huggingface.co/DevQuasar/ibm-granite.granite-3.1-1b-a400m-instruct-GGUF For security we can also add the LLAMA_API_KEY parameter; a sketch follows below.
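Here's a minimal sketch of how that could look as an extra entry under the container's env: list. The Secret name llama-secret and key api-key are placeholders; create the Secret yourself first, e.g. with kubectl create secret generic llama-secret --from-literal=api-key=<your-key>:

        # hypothetical Secret reference; create "llama-secret" first, e.g.:
        #   kubectl create secret generic llama-secret --from-literal=api-key=<your-key>
        - name: LLAMA_API_KEY
          valueFrom:
            secretKeyRef:
              name: llama-secret
              key: api-key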

service.yaml

apiVersion: v1
kind: Service
metadata:
  name: llama-service
spec:
  type: LoadBalancer
  selector:
    app: llama-server
  ports:
  - protocol: TCP
    port: 80         # Public-facing port
    targetPort: 8080 # Container port

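If you also want to restrict who can reach the public endpoint at the network level, the standard loadBalancerSourceRanges field is worth a look. This is only a sketch; whether it's enforced depends on the cloud provider's load balancer, and 203.0.113.0/24 is just an example range:

  # optional, under the same spec: in service.yaml
  loadBalancerSourceRanges:
  - 203.0.113.0/24   # replace with your own CIDR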
Let’s deploy the service

(note: you need to have the kubeconfig exported)

kubectl apply -f k8s-llama/deployment.yaml

> deployment.apps/llama-server created

kubectl apply -f k8s-llama/service.yaml

> service/llama-service created
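Before testing, check that the pod is actually running:

kubectl rollout status deployment/llama-server
kubectl get pods -l app=llama-server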

Get the service's external IP

kubectl get svc llama-service

NAME            TYPE           CLUSTER-IP    EXTERNAL-IP    PORT(S)        AGE
llama-service   LoadBalancer   10.##.##.##   212.##.##.##   80:31608/TCP   47s

Now test it

Let's use a simple curl request

curl --request POST \
	--url http://212.##.##.##/completion \
	--header "Content-Type: application/json" \
	--data '{"prompt": "Tell me a joke","n_predict": 256}'

FYI: The first call might be slower, as the server has to download the model file from Hugging Face and load it.
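You can watch the download and model load in the pod logs:

kubectl logs -f deployment/llama-server

Once the model is loaded, the response comes back: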

{"index":0,"content":"\n\n\nA man walks into an elevator and asks for the highest floor. The elevator doors close behind him. He says, \"How did you know I was coming?\" The elevator operator replies, \"I had to do that to make it work.\"\n\nWhat's the difference between a pen and pencil?\n\nThe pen has a barrel, the pencil has a tip.\n\nWhat do you call someone who only eats fast food?\n\nFast-fied.\n\nWhat do you call someone who only eats fast food?,"tokens":[],"id_slot":0,"stop":true,"model":"gpt-3.5-turbo","tokens_predicted":256,"tokens_evaluated":5,"generation_settings":{"n_predict":256,"seed":4294967295,"temperature":0.800000011920929,"dynatemp_range":0.0,"dynatemp_exponent":1.0,"top_k":40,"top_p":0.949999988079071,"min_p":0.05000000074505806,"xtc_probability":0.0,"xtc_threshold":0.10000000149011612,"typical_p":1.0,"repeat_last_n":64,"repeat_penalty":1.0,"presence_penalty":0.0,"frequency_penalty":0.0,"dry_multiplier":0.0,"dry_base":1.75,"dry_allowed_length":2,"dry_penalty_last_n":4096,"dry_sequence_breakers":["\n",":","\"","*"],"mirostat":0,"mirostat_tau":5.0,"mirostat_eta":0.10000000149011612,"stop":[],"max_tokens":256,"n_keep":0,"n_discard":0,"ignore_eos":false,"stream":false,"logit_bias":[],"n_probs":0,"min_keep":0,"grammar":"","grammar_lazy":false,"grammar_triggers":[],"preserved_tokens":[],"chat_format":"Content-only","samplers":["penalties","dry","top_k","typ_p","top_p","min_p","xtc","temperature"],"speculative.n_max":16,"speculative.n_min":0,"speculative.p_min":0.75,"timings_per_token":false,"post_sampling_probs":false,"lora":[]},"prompt":"Tell me a joke","has_new_line":true,"truncated":false,"stop_type":"limit","stopping_word":"","tokens_cached":260,"timings":{"prompt_n":5,"prompt_ms":129.35,"prompt_per_token_ms":25.869999999999997,"prompt_per_second":38.65481252415926,"predicted_n":256,"predicted_ms":8117.727,"predicted_per_token_ms":31.70987109375,"predicted_per_second":31.53592132378928}}

And voilà, it's working.
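The llama.cpp server also exposes an OpenAI-compatible API, so you can hit /v1/chat/completions with the usual chat payload (same external IP as above; if you set LLAMA_API_KEY, add an Authorization: Bearer <key> header):

curl --request POST \
	--url http://212.##.##.##/v1/chat/completions \
	--header "Content-Type: application/json" \
	--data '{"messages": [{"role": "user", "content": "Tell me a joke"}], "max_tokens": 256}'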

Let’s see some metrics

The cluster comes with a metrics server preinstalled, so we can check resource usage with the following commands:

What does our CPU usage look like during generation?

kubectl top pod --all-namespaces

NAMESPACE     NAME                              CPU(cores)   MEMORY(bytes)
default       llama-server-79fb8cf879-vdhrp     1658m        491Mi
kube-system   civo-ccm-5474f5869d-vt7qg         3m           17Mi
kube-system   civo-csi-controller-0             4m           61Mi
kube-system   civo-csi-node-897l6               1m           20Mi
kube-system   civo-csi-node-m4898               1m           21Mi
kube-system   coredns-6799fbcd5-zhpxx           3m           22Mi
kube-system   metrics-server-67c658944b-cp4cr   10m          26Mi
kube-system   otel-collector-986kt              6m           69Mi
kube-system   otel-collector-njgv7              15m          69Mi
kube-system   traefik-8vnsp                     2m           28Mi
kube-system   traefik-r89dl                     3m           35Mi
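Node-level usage is available the same way:

kubectl top node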

Local UI

OK, using it from the command line is not very user friendly. We can build a Gradio UI to use our service (see the sketch below), or just grab the YourChat app https://yourchat.app/ and create a new LLM provider there; let's call it llamacpp-k8s.
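If you'd rather go the Gradio route, here's a minimal sketch in Python that posts to the /completion endpoint we already tested. Replace the masked IP with your service's EXTERNAL-IP, and pip install requests and gradio first:

# minimal Gradio front-end for the llama.cpp /completion endpoint
import requests
import gradio as gr

SERVER_URL = "http://212.##.##.##/completion"  # your service's external IP

def generate(prompt):
    # same payload shape as the curl example above
    resp = requests.post(SERVER_URL, json={"prompt": prompt, "n_predict": 256}, timeout=120)
    resp.raise_for_status()
    return resp.json()["content"]

gr.Interface(fn=generate, inputs="text", outputs="text", title="Private LLM on CIVO").launch()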

And now you can chat with your private LLM service.

Easy, isn't it?