Introducing Ori Serverless Kubernetes
Meet Ori Serverless Kubernetes, an AI infrastructure service that brings you the best of Serverless and Kubernetes by blending powerful scalability,...
Learn how to deploy LLMs and scale inference on Ori Serverless Kubernetes, via Ollama and Open WebUI.
For growing AI teams and startups, deploying large language models (LLMs) on GPU-powered virtual machines can feel like a tightrope walk. Inference demand is often unpredictable, similar to sudden spikes in web traffic. Should you over-provision and risk running through your GPU budget, or under-provision and jeopardize user experience? It's a classic infrastructure dilemma! That's where Ori Serverless Kubernetes comes in. It offers powerful scalability, easy management, and cost efficiency, helping you reach the widest possible audience for your LLM applications.
Scale seamlessly: Automatically scale your inference clusters up or down based on demand, without the hassles of managing nodes and node pools. Ori will manage your clusters and provision load balancers so you can focus on serving your models as widely as possible.
High throughput with powerful GPUs: No waiting for GPUs and no approvals needed. Pick a high-performance NVIDIA® GPU (H100, L40S, or L4) and set up a cluster in under a minute. These datacenter-grade GPUs deliver the performance you need for high concurrency and low-latency processing.
Flexibility of vanilla Kubernetes: Get full access to the control plane via kubectl, along with multiple namespaces, unlike proprietary Kubernetes from other providers.
Pay only for what you use: Our pay-per-minute pricing helps you keep your costs predictable as you scale. Ori Serverless Kubernetes helps you right-size your infrastructure based on inference demand so you can make the most of your GPU budget.
Ollama is an open-source platform that enables users to run LLMs such as Llama 3.1, Gemma 2, Mistral, Qwen, DeepSeek, Command R+, and plenty more, on their own infrastructure. Ollama bundles model weights, configuration, and data into a single package defined by a Modelfile. Designed to be user-friendly, it simplifies model management and supports easy switching between a wide range of models.
Open WebUI (formerly Ollama WebUI) is the interface through which you can interact with the models downloaded by Ollama. Easy to set up via Docker or Kubernetes, Open WebUI supports a variety of use cases such as image generation, concurrent model usage, web and RAG integration, chat integration, character customization, and more.
Before you start, ensure you have the following:
To automate the startup of the Ollama service and the pulling of the Llama model, we will create a custom entrypoint script. This script will be stored in a ConfigMap and mounted into the Ollama container.
Create a file named entrypoint-configmap.yaml with the following content:
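A minimal sketch of such a ConfigMap. The ConfigMap name ollama-entrypoint, the key entrypoint.sh, and the model tag llama3.1 are illustrative assumptions; adjust them to your setup.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: ollama-entrypoint        # assumed name, referenced by the deployment
data:
  entrypoint.sh: |
    #!/bin/sh
    # Start the Ollama server in the background and remember its PID.
    ollama serve &
    pid=$!
    # Wait until the server responds before pulling the model.
    until ollama list >/dev/null 2>&1; do
      sleep 1
    done
    # Pull the model (tag assumed to be llama3.1).
    ollama pull llama3.1
    # Block on the server process so the container keeps running.
    wait "$pid"
```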
This script starts the Ollama service, pulls the Llama 3.1 model, and ensures that the service remains running.
Apply the ConfigMap to your Kubernetes cluster:
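Assuming the file was saved as entrypoint-configmap.yaml in the current directory and kubectl is already pointed at your cluster, the standard command is:

```shell
# Create or update the ConfigMap from the manifest file.
kubectl apply -f entrypoint-configmap.yaml
```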
To deploy your application, you'll need to create Kubernetes deployment manifests for both the Ollama and OpenWebUI services. These manifests define the desired state of your application, including the containers to run, the ports to expose, the persistent volume and the entrypoint script.
Create a file named ollama-deployment.yaml with the following content:
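A minimal sketch of this deployment under a few assumptions: the ConfigMap is named ollama-entrypoint with a key entrypoint.sh, the PVC is named ollama-pvc, and GPU nodes carry the label gpu.nvidia.com/class. None of these names are confirmed by this article, so check them against your cluster before applying.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      # Assumed node label for picking an L40S; verify your cluster's labels.
      nodeSelector:
        gpu.nvidia.com/class: L40S
      containers:
        - name: ollama
          image: ollama/ollama:latest
          # Run the mounted script through sh, since ConfigMap files
          # are not executable by default.
          command: ["/bin/sh", "/scripts/entrypoint.sh"]
          ports:
            - containerPort: 11434   # Ollama's default API port
          resources:
            limits:
              nvidia.com/gpu: 1      # a single GPU
          volumeMounts:
            - name: entrypoint
              mountPath: /scripts
            - name: models
              mountPath: /root/.ollama   # where Ollama stores pulled models
      volumes:
        - name: entrypoint
          configMap:
            name: ollama-entrypoint      # assumed ConfigMap name
        - name: models
          persistentVolumeClaim:
            claimName: ollama-pvc        # assumed PVC name; must exist
```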
This manifest specifies that the Ollama service uses the ready-to-use ollama/ollama:latest Docker image, exposes port 11434 on the container, requests a single L40S GPU, and mounts a persistent volume (PV). The image is available on Docker Hub.
Apply the manifest to deploy Ollama to your Kubernetes cluster:
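Assuming the manifest was saved as ollama-deployment.yaml in the current directory:

```shell
# Create or update the Ollama deployment.
kubectl apply -f ollama-deployment.yaml
```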
Next, create a file named openwebui-deployment.yaml:
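A minimal sketch of this manifest. It assumes the Ollama service is named ollama-service and listens on port 80 inside the cluster; the deployment name and labels are illustrative. OLLAMA_BASE_URL is the Open WebUI environment variable that points the UI at an Ollama backend.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: openwebui
spec:
  replicas: 1
  selector:
    matchLabels:
      app: openwebui
  template:
    metadata:
      labels:
        app: openwebui
    spec:
      containers:
        - name: openwebui
          image: ghcr.io/open-webui/open-webui:main
          ports:
            - containerPort: 8080
          env:
            # Reach Ollama through its in-cluster service name (assumed).
            - name: OLLAMA_BASE_URL
              value: "http://ollama-service:80"
```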
This manifest specifies that the OpenWebUI service will use the ghcr.io/open-webui/open-webui:main Docker image, expose port 8080, and connect to the Ollama service via an environment variable. You can find more information about OpenWebUI here.
Now deploy OpenWebUI to your Kubernetes cluster:
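Assuming the manifest was saved as openwebui-deployment.yaml in the current directory:

```shell
# Create or update the OpenWebUI deployment.
kubectl apply -f openwebui-deployment.yaml
```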
To make the Ollama and OpenWebUI deployments accessible within and outside the Kubernetes cluster, you need to create service manifests. These services route traffic to the appropriate pods, allowing communication between different parts of your application and external clients.
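For the Ollama service, a minimal sketch could look like the following. The file name ollama-service.yaml and the selector label app: ollama are assumptions; the selector must match the labels on your Ollama pods.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: ollama-service
spec:
  type: LoadBalancer
  selector:
    app: ollama            # assumed pod label
  ports:
    - protocol: TCP
      port: 80             # port exposed by the load balancer
      targetPort: 11434    # port the Ollama container listens on
```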
This manifest instructs the Ollama service to be exposed via a load balancer. The service listens on port 80 externally and forwards traffic to port 11434 on the Ollama pod.
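Assuming the Ollama service manifest is saved as ollama-service.yaml (a hypothetical file name), it can be applied with:

```shell
# Create or update the Ollama service and its load balancer.
kubectl apply -f ollama-service.yaml
```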
Next, create a file named openwebui-service.yaml with the following content:
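A minimal sketch of this manifest. The service name openwebui-service and the selector label app: openwebui are assumptions; the selector must match the labels on your OpenWebUI pods.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: openwebui-service  # assumed service name
spec:
  type: LoadBalancer
  selector:
    app: openwebui         # assumed pod label
  ports:
    - protocol: TCP
      port: 8080           # port exposed by the load balancer
      targetPort: 8080     # port the OpenWebUI container listens on
```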
The OpenWebUI service is also exposed via a load balancer. The service listens on port 8080 externally and forwards traffic to the same port on the OpenWebUI pod.
Expose the OpenWebUI service:
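Assuming the manifest was saved as openwebui-service.yaml in the current directory:

```shell
# Create or update the OpenWebUI service and its load balancer.
kubectl apply -f openwebui-service.yaml
```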
After deploying the Ollama and OpenWebUI services, it's important to verify that everything is running as expected. This step will guide you through checking the status of your deployments and services within the Kubernetes cluster.
Use the following command to check the status of the pods:
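The standard kubectl command for listing pods in the current namespace is:

```shell
# List pods with their readiness, status, restart count, and age.
kubectl get pods
```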
You should see the Ollama and OpenWebUI pods listed with a status of Running.
With the Ollama service deployed and verified, the next step is to access the service to ensure it's working properly. This involves retrieving the external IP address of the service and interacting with it directly.
To get the external IP address assigned to the Ollama service, use the following command:
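With the service named ollama-service, as in the sample output in this guide, the lookup is:

```shell
# Show the service, including the EXTERNAL-IP column.
kubectl get svc ollama-service
```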
This command will return details about the ollama-service, including its external IP address. You should see output similar to this:
NAME | TYPE | CLUSTER-IP | EXTERNAL-IP | PORT(S) | AGE |
ollama-service | LoadBalancer | 10.0.0.1 | <external-ip> | 80:30000/TCP | 5m |
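Once an external IP is assigned, a quick way to interact with the service directly is to call it with curl. This is a sketch: replace the placeholder with the address from your own output. The root path returns a short status message, and /api/tags lists the models available locally.

```shell
# Placeholder: substitute the EXTERNAL-IP value shown for ollama-service.
EXTERNAL_IP="<external-ip>"

# Basic reachability check against the Ollama server (service port 80).
curl "http://${EXTERNAL_IP}/"

# List the models Ollama has downloaded.
curl "http://${EXTERNAL_IP}/api/tags"
```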
As with the Ollama service, retrieve the external IP of the OpenWebUI service:
Open http://<external-ip>:8080 in your web browser, since the service listens on port 8080. You'll be taken to the OpenWebUI interface.
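Assuming the OpenWebUI service is named openwebui-service (an illustrative name; use whatever name your service manifest declares):

```shell
# Show the OpenWebUI service, including the EXTERNAL-IP column.
kubectl get svc openwebui-service
```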
OpenWebUI provides a ChatGPT-like interface with integrated RAG for interacting with your models. Follow these steps to customize your chatbot:
Use the OpenWebUI interface to create a custom chatbot:
With Ori Serverless Kubernetes, scaling your LLM-based application is straightforward. Adjust the replicas field in your deployment manifests to increase or decrease the number of instances running:
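As an illustration, the relevant fragment of a deployment manifest looks like this; the value 3 is an arbitrary example.

```yaml
# Fragment of a deployment manifest such as ollama-deployment.yaml.
spec:
  replicas: 3   # desired number of pod instances (illustrative value)
```

Re-apply the manifest with kubectl apply -f for the change to take effect. Alternatively, kubectl scale deployment/ollama --replicas=3 changes the count without editing the file, assuming the deployment is named ollama.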
To speed up your inference even further, you can also increase the number of GPUs, or switch to a more powerful GPU type, in the Ollama deployment manifest file.
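A sketch of the GPU-related fragment of the Ollama container spec, with an illustrative count of 2. How a specific GPU type is selected depends on the node labels your cluster exposes, so that part is not shown here.

```yaml
# Fragment of the Ollama container spec in ollama-deployment.yaml.
resources:
  limits:
    nvidia.com/gpu: 2   # number of GPUs requested (illustrative value)
```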
Note: You are not restricted to Llama models; you can pull any model supported by Ollama. The list of available models is in the Ollama model library.
To prevent the model from being pulled every time the pod restarts, we can use a Persistent Volume (PV) and Persistent Volume Claim (PVC) to store the model persistently. This way, the model is only pulled once, and subsequent pod restarts will use the already downloaded model. Define the pv-pvc.yaml file:
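The PV half of this file depends on the storage backend your cluster offers, so the sketch below shows only the claim. It assumes the cluster has a default StorageClass that provisions the volume dynamically; the claim name ollama-pvc and the 50Gi size are illustrative.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-pvc        # assumed name; must match the deployment's claimName
spec:
  accessModes:
    - ReadWriteOnce       # mounted read-write by a single node
  resources:
    requests:
      storage: 50Gi       # illustrative size; allow room for model weights
```

Apply it with kubectl apply -f pv-pvc.yaml before deploying Ollama, and mount the claim at /root/.ollama, the directory where the Ollama container stores its models.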
Congratulations, you have successfully deployed an LLM-based application using Ollama and OpenWebUI on Ori Serverless Kubernetes. This powerful combination allows you to build and scale large language model applications efficiently without needing to manage complex infrastructure. For further customization and model scaling, refer to OGC’s comprehensive documentation and support services.