<UPDATE>
Just found out that LM Studio has a CORS setting in its model-serving menu.

Also, Ollama seems to have a setting to allow given or all origins to access the LLM resources: OLLAMA_ORIGINS="*".
That said, the non-standard port is still a blocker to using this as a generic solution for websites. LLmaaS might still play a role in normalizing access to local LLM resources.
</UPDATE>
I want to introduce a new concept and term: LLmaaS – Local LLM as a Service.
Inference costs remain a significant concern, particularly for smaller companies, making them hesitant to integrate LLM capabilities into their websites. At the same time, we are witnessing remarkable advancements in local inference and the efficiency of smaller models. This naturally raises the question: why not leverage edge inference capabilities? WebGPU enables compute-intensive tasks directly in the browser, providing an option for on-device inference. However, for lightweight applications, this approach may be too heavy to implement individually and could compete with local LLM runtimes (such as Ollama, LM Studio, and Llama.cpp) when users are running chat models on their machines.
With LLmaaS, I propose leveraging locally running LLMs as a service, providing a standardized way for websites to access and utilize them for LLM-powered operations directly on the user’s device.
Some of the challenges
- CORS: The first challenge to address is CORS, which prevents HTML scripts from directly calling a local service. To overcome this, a local Flask proxy is introduced, acting as a bridge that enables communication between the HTML script and the locally running LLM service. This proxy ensures seamless access while maintaining flexibility and security (a minimal sketch of such a proxy follows this list).
- Message history: To maintain message history, the demo below stores it in a session variable. This approach provides approximately 5 MB of storage per session, which is more than sufficient for our needs, ensuring smooth conversation continuity without external dependencies.
- The local service may not exist: This is perhaps the most crucial aspect. Since we rely on local services, LLM functionality must remain optional on the website. The demo below includes a simple health check that sends a request to the local Flask proxy. If the first response chunk is received, the local LLM service is considered available; otherwise it is treated as unavailable and the [Send] button is disabled accordingly.
- Security: The current solution is a demonstration prototype. Additional security considerations and mitigation measures are required to ensure robustness and safety.
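
To make the proxy idea concrete, here is a minimal sketch of such a CORS-enabling Flask proxy, not the project's actual code: it assumes Ollama serving llama3.1 at http://localhost:11434/api/chat, a single /chat endpoint on Flask's default port 5000, and the flask-cors package. The real proxy and HTML examples are linked under Further resources.

```python
# Minimal LLmaaS-style proxy sketch (illustrative only; endpoint and model names are assumptions).
from flask import Flask, Response, request
from flask_cors import CORS          # pip install flask-cors
import requests

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"   # assumed local backend

app = Flask(__name__)
CORS(app)  # lift the CORS restriction so browser scripts from other origins can reach the proxy

@app.route("/chat", methods=["POST"])
def chat():
    # Forward the browser's message history to the local LLM and stream the reply back.
    body = request.get_json(silent=True) or {}
    payload = {
        "model": "llama3.1",                  # the model used in the demo
        "messages": body.get("messages", []),
        "stream": True,
    }
    upstream = requests.post(OLLAMA_CHAT_URL, json=payload, stream=True)

    def generate():
        # The first chunk that arrives also serves as the client-side health signal.
        for line in upstream.iter_lines():
            if line:
                yield line + b"\n"

    return Response(generate(), mimetype="application/x-ndjson")

if __name__ == "__main__":
    app.run(port=5000)
```

With CORS(app) in place, any page can POST its message history to http://localhost:5000/chat, and the streamed chunks are exactly what the health check watches for.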

Applications
What types of web solutions could benefit from this approach? Here are a few examples that come to mind:
- Chat with the webpage
- Fixing typos and syntax in user input
- Cold-start content: generate initial content for the user
Need for a standard
To enable a standardized approach for utilizing local LLM resources on websites, a common framework is needed—one that allows a generic website implementation to work seamlessly with a variety of local LLM services. The local Flask proxy serves as an ideal component for establishing such a standard, giving users full control over their locally running proxy. However, we must also account for potential misuse, where websites might attempt to exploit local LLM resources. Addressing these concerns will require additional safeguards and further development.
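
As a purely hypothetical illustration of what such a standard could normalize, the sketch below offers one chat call regardless of which local runtime sits behind it. The ports and payload shapes follow Ollama's and LM Studio's documented defaults, while the function name and backend table are invented for this example.

```python
# Hypothetical normalization layer a standardized proxy could provide (illustrative only).
import requests

BACKENDS = {
    "ollama":   "http://localhost:11434/api/chat",             # Ollama's native chat API
    "lmstudio": "http://localhost:1234/v1/chat/completions",   # LM Studio's OpenAI-compatible API
}

def normalized_chat(backend, model, messages):
    """Send the same message list to whichever local runtime is available
    and return the assistant's reply as plain text."""
    url = BACKENDS[backend]
    resp = requests.post(url, json={"model": model, "messages": messages, "stream": False})
    data = resp.json()
    if backend == "ollama":
        return data["message"]["content"]             # Ollama response shape
    return data["choices"][0]["message"]["content"]   # OpenAI-compatible response shape
```

A website written against the proxy's single interface would then never need to know which runtime the user happens to run.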
Further resources
LLmaaS proxy code, HTML examples, and the required setup to reproduce: LLmaaS
About This Chat
This chat allows you to communicate with a local LLM (Llama 3.1) running on the edge. The website connects to a locally hosted Flask server, which acts as a proxy to route chat requests to the LLM. This means that any website can leverage the capabilities of your local LLM to offer a dynamic chat experience tailored to the website content.
Enable the toggle switch to “Chat with the Website” to provide additional context from the website in your first message. The LLM can then respond with knowledge about the content of the website and provide interactive assistance, all while the model runs locally on your edge device.
Flask Server Code: The Flask server acts as a proxy that facilitates communication between this website and your locally running LLM. Here’s the Python code to set up the Flask server: