Hugging Face Spaces are amazing tools to demo a model you’ve built; the only problem is that the free tier, CPU Basic with 2 vCPUs and 16GB RAM, is not really capable of running any bigger model.

Here I want to show 2 solutions for how you can still use this CPU Basic virtual hardware to serve a Llama 3 8B model.
Local hosting
If you already have a capable workstation you can host your model locally and serve text generation requests from a Hugging Face Space. In my case I’ve used LMStudio to host a model locally with their server solution. By default they support an OpenAI-style client you can access via:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")
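As a quick sanity check you can call it just like the OpenAI API. This is only a minimal sketch; the model name is a placeholder, LM Studio answers with whichever model it currently has loaded:
completion = client.chat.completions.create(
    model="local-model",  # placeholder; LM Studio uses whichever model is loaded
    messages=[{"role": "user", "content": "Say hello!"}],
)
print(completion.choices[0].message.content)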
To make this accessible over the internet you have to find your public IP using something like https://whatismyipaddress.com/. You also have to open the port on your router/firewall to enable communication.
https/http
The LMStudio server (at the time I’m writing this post) only supports *http* connections, so all the prompts and answers would travel unencrypted through the internet. To mitigate this I’ve written a small https/http proxy utility: https://github.com/csabakecskemeti/ai_utils/blob/main/proxy_server.py
which should run on your local hardware. It accepts the *https* traffic and translates it to *http* for the local LMStudio server. One downside is that if you’re using a self-signed certificate, the OpenAI client won’t work out of the box; you have to patch it to not enforce signed certificates and to accept insecure connections.
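One way to work around this on the client side (a sketch of one possible approach, not necessarily the exact patch) is to hand the OpenAI client a custom httpx client with certificate verification turned off; the URL below is a placeholder for your public endpoint:
import httpx
from openai import OpenAI

# Accept the self-signed certificate by disabling TLS verification
# (only do this for your own endpoint, it weakens transport security)
client = OpenAI(
    base_url="https://<your-public-ip>:<port>/v1",  # placeholder endpoint
    api_key="lm-studio",
    http_client=httpx.Client(verify=False),
)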
HF space Gradio code
For the Space SDK I’ve chosen the Gradio chatbot template for its simplicity, and it gave me a base to rewrite my app.py.

I didn’t want to hardcode my public IP address, so I’ve set up a secret for the URL (and the API_KEY).

I’ve modified the requirements.txt and added openai==1.30.1 to it.
From here it was pretty straightforward to replace the HF inference client with the OpenAI client, remove the unused parameters, and read the URL and API_KEY from the environment variables.
import os
from openai import OpenAI

self_host_url = os.environ['URL']
api_key = os.environ['API_KEY']
client = OpenAI(base_url=self_host_url, api_key=api_key)
The only trick I had to apply was to make sure that the LMStudio API doesn’t get an empty string anywhere, for example in the system prompt. For some reason it’s very picky about this, so I’ve just appended a whitespace:
messages = [{"role": "system", "content": system_message+" "}]
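Putting it together, the respond function from the Gradio chatbot template ends up roughly like the sketch below (check the linked app.py for the exact version, e.g. which template parameters I kept; the model name is again just a placeholder for LM Studio):
def respond(message, history, system_message, max_tokens, temperature, top_p):
    # Trailing whitespace so LMStudio never receives an empty system prompt
    messages = [{"role": "system", "content": system_message + " "}]
    for user_msg, assistant_msg in history:
        if user_msg:
            messages.append({"role": "user", "content": user_msg})
        if assistant_msg:
            messages.append({"role": "assistant", "content": assistant_msg})
    messages.append({"role": "user", "content": message})

    response = ""
    # Stream the tokens back to the Gradio chat UI as they arrive
    for chunk in client.chat.completions.create(
        model="local-model",  # placeholder; LM Studio serves the loaded model
        messages=messages,
        max_tokens=max_tokens,
        temperature=temperature,
        top_p=top_p,
        stream=True,
    ):
        token = chunk.choices[0].delta.content
        if token:
            response += token
            yield response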
Examples
Example code: app.py
Example Space: DevQuasar/Brainstorm
CPU inference with llama.cpp
OK, the previous solution was easy and performant, but it requires my workstation to be running to serve demo requests, and I wanted something more independent and still free. So I have that 2 vCPU node with 16GB RAM; let’s utilize the CPU -> llama.cpp.
I went again with the same Gradio chatbot template. We need a compiled llama.cpp; thankfully a precompiled llama-cpp-python package is available thanks to https://github.com/abetlen, so I’ve added this to the requirements.txt:
--extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu
llama-cpp-python
I’ve uploaded my smallest Q2_K GGUF model file to use locally (I guess you could download it in the app init phase, but this was just easier for me). In this case I couldn’t rely on my model’s preset, so I had to add the stop words; otherwise the chat function is pretty basic. I’ve used the Alpaca Instruct format for fine-tuning, so in my case the chat function looks like this.
It applies the f'{messages}\n ### HUMAN:\n{prompt} \n ### ASSISTANT:'
prompt template to the new prompt to concatenate the previous messages from the context and make sure the model is well instructed.
def llama_cpp_chat(gguf_model, prompt: str, messages: str = ''):
    prompt_templated = f'{messages}\n ### HUMAN:\n{prompt} \n ### ASSISTANT:'
    output = gguf_model(
        prompt_templated,  # Prompt
        max_tokens=512,
        stop=["### HUMAN:\n", " ### ASSISTANT:"],  # Stop generating just before the model would generate a new question
        echo=True  # Echo the prompt back in the output
    )  # Generate a completion, can also call create_completion
    print(output)
    return output['choices'][0]['text']
You can load the model simply with:
from llama_cpp import Llama
llm = Llama(model_path="llama3_8b_chat_brainstorm.Q2_K.gguf")
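If you’d rather download the GGUF during the app init phase instead of uploading it to the Space (as I mentioned earlier), a minimal sketch with huggingface_hub looks like this; the repo id below is an assumption, substitute your own:
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

# Fetch the GGUF file from the Hub at startup (repo id is a placeholder)
model_path = hf_hub_download(
    repo_id="your-username/your-gguf-repo",  # placeholder repo id
    filename="llama3_8b_chat_brainstorm.Q2_K.gguf",
)
llm = Llama(model_path=model_path)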
Because of the difference in how llama.cpp handles the context / message history vs. the OpenAI-compatible clients, I had to manually edit the message history. So I’ve added the chatty proxy function between the chat and the Gradio app, which unpacks the array of user-assistant message pairs [[user_msg, assistant_response],[user_msg, assistant_response]]
into a string that looks more like the training format: ### HUMAN: user_msg\n ### ASSISTANT: assistant_response\n ### HUMAN: user_msg\n ### ASSISTANT: assistant_response
def chatty(prompt, messages):
    print(prompt)
    print(f'messages: {messages}')
    past_messages = ''
    if len(messages) > 0:
        for idx, message in enumerate(messages):
            print(f'idx: {idx}, message: {message}')
            past_messages += f'\n### HUMAN: {message[0]}'
            past_messages += f'\n### ASSISTANT: {message[1]}'
        print(f'past_messages: {past_messages}')
    messages = llama_cpp_chat(llm, prompt, past_messages)
    return messages.split('### ASSISTANT:')[-1]
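The chatty function then plugs straight into the Gradio chat template as the callback, roughly like this (the full version is in the linked app.py):
import gradio as gr

# chatty(prompt, messages) matches the (message, history) signature ChatInterface expects
demo = gr.ChatInterface(chatty)

if __name__ == "__main__":
    demo.launch()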
And voilà, it’s working: not fast, but a Llama 3 8B running on the free tier CPU hardware:

As I’ve mentioned, it’s really slow. To improve the processing time you can consider an even more restrictive quantization like IQ1_S. All in all, the CPU Basic free tier for Hugging Face Spaces was probably never meant to run a Llama 3 8B size model, but it’s doable 😀
Examples
Example code: app.py
Example Space: DevQuasar/Brainstorm_CPU