Deploying generative AI/ML or large language models (LLMs) across projects at scale can be incredibly challenging. While LLMs like Meta's Llama2 and Code Llama have gained popularity for their ability to generate human-like text, the demand for better performance has left many projects with rapidly inflating compute costs.
In this blog post, we will explore how to build and deploy Hugging Face LLMs on Ori Global Cloud (OGC) using FastAPI.
Before we get into the specifics of building and deploying LLM models on OGC, it's essential to understand why we want a platform that can handle scalability in the first place. In an enterprise setting, increasing LLM performance for practical deployments also means increasing training resources—time, data and parameters—which can easily rack up high costs if not done efficiently.
The Ori Global Cloud platform is cloud agnostic and enables seamless deployment across different cloud providers or on-premise infrastructure. With its container orchestration system, we can scale AI by deploying containers with the LLM model to any environment, regardless of infrastructure choices or application architectures.
It also provides automation features that streamline the deployment process and ensures continuity of running deployments through self-healing processes, making it an ideal platform for deploying LLM models.
Before you begin, ensure that you have the following prerequisites in place:
We need inference code for the CodeLlama-7b-instruct model. Create a FastAPI web application in Python. You can name the main script `app.py`.
In the above Python code, we are importing the dependencies and specifying the Tokenizer and the pipeline. There are also various parameters specified in the inference code above with recommended values.
For `device_map`: inference can run on CPU, a single GPU or multiple GPUs by changing `device_map`. Remove the `device_map` parameter to run inference on CPU only. With multiple GPUs, use `device_map="auto"`.
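To make the pieces described above concrete, here is a minimal sketch of what such an `app.py` could look like. It is not the original listing: the model ID `codellama/CodeLlama-7b-Instruct-hf`, the `/generate` route, the `create_app` and `pipeline_kwargs` helper names, and the generation values are all assumptions chosen for illustration. The heavy imports sit inside `create_app` so the module can be imported without loading the model.

```python
# app.py (illustrative sketch, not the original listing)

MODEL_ID = "codellama/CodeLlama-7b-Instruct-hf"  # assumed Hugging Face model ID

# Illustrative generation settings; tune them for your own use case.
GENERATION_PARAMS = {
    "do_sample": True,
    "temperature": 0.2,
    "top_p": 0.9,
    "num_return_sequences": 1,
    "max_length": 256,  # counts prompt tokens plus generated tokens
}


def pipeline_kwargs(device_map="auto"):
    """Return the device-related keyword arguments for the pipeline.

    None     -> no device_map at all, so inference runs on CPU only
    "auto"   -> let the library spread the model over the available GPUs
    """
    return {} if device_map is None else {"device_map": device_map}


def create_app(device_map="auto"):
    """Build the FastAPI app; loading the model happens here, not at import."""
    import torch
    import transformers
    from fastapi import FastAPI
    from transformers import AutoTokenizer

    # Assumption: half precision on GPU, full precision on CPU.
    dtype = torch.float32 if device_map is None else torch.float16

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    generator = transformers.pipeline(
        "text-generation",
        model=MODEL_ID,
        torch_dtype=dtype,
        **pipeline_kwargs(device_map),
    )

    app = FastAPI()

    @app.get("/generate")  # hypothetical route name
    def generate(prompt: str):
        sequences = generator(
            prompt,
            eos_token_id=tokenizer.eos_token_id,
            **GENERATION_PARAMS,
        )
        return {"generated_text": sequences[0]["generated_text"]}

    return app
```

With this layout, the server could be started with `uvicorn app:create_app --factory --host 0.0.0.0 --port 7860`. Calling `create_app(device_map=None)` corresponds to the CPU-only case, where `device_map` is left out entirely.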
Make sure to list the dependencies of your Python application in a `requirements.txt` file.
Create a `Dockerfile` to set up the environment, install the dependencies, and launch the Python app on port 7860.
After creating the Python application and Dockerfile, build a Docker image from the project's root directory using the following command:
Once the image is built, you can run it locally:
Test your FastAPI application to ensure it generates the desired output.
Visit http://localhost:7860/docs in your web browser to access the Swagger UI.
You should see the Swagger docs generated by FastAPI locally.
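The same check can be scripted rather than clicked through. The sketch below assumes the app exposes a GET route named `/generate` that takes a `prompt` query parameter and returns JSON; both names are hypothetical, so adjust them to match your own `app.py`.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:7860"  # the locally running container


def endpoint_url(base_url, path, **params):
    """Join base URL and path, URL-encoding any query parameters."""
    query = f"?{urlencode(params)}" if params else ""
    return f"{base_url.rstrip('/')}/{path.lstrip('/')}{query}"


def call_generate(prompt, base_url=BASE_URL, timeout=120):
    """Send the prompt to the (assumed) /generate route and parse the JSON reply."""
    url = endpoint_url(base_url, "/generate", prompt=prompt)
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

For example, `call_generate("Write a Python function that reverses a string")` would return the parsed JSON body once the server is running. The generous timeout reflects that generation on a 7B model can take a while.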
To make the Docker image available to the OGC platform, push it to a public repository like Docker Hub:
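The build, run and push steps map onto three standard Docker CLI calls. As a rough sketch, they can be wrapped in a small Python helper so the whole sequence is repeatable. The image name `your-dockerhub-user/codellama-fastapi:latest` is a placeholder, and `docker_commands` and `run_step` are hypothetical names.

```python
import subprocess

IMAGE = "your-dockerhub-user/codellama-fastapi:latest"  # placeholder image name
PORT = 7860  # port the app listens on inside the container


def docker_commands(image=IMAGE, port=PORT):
    """Return the Docker CLI invocation for each step as an argument list."""
    return {
        # Build from the Dockerfile in the current (project root) directory.
        "build": ["docker", "build", "-t", image, "."],
        # Run locally, publishing the container port on the same host port.
        "run": ["docker", "run", "--rm", "-p", f"{port}:{port}", image],
        # Upload the image to the registry named in the image tag.
        "push": ["docker", "push", image],
    }


def run_step(step, image=IMAGE, port=PORT):
    """Execute one step and raise CalledProcessError if Docker reports failure."""
    subprocess.run(docker_commands(image, port)[step], check=True)
```

Calling `run_step("build")` from the project root, then `run_step("run")` and `run_step("push")`, would execute the steps in order. Pushing assumes you have already authenticated with `docker login`.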
The CodeLlama model can be deployed on any cloud provider's Kubernetes (K8s) cluster, such as EKS, GKE, AKS, Linode or DigitalOcean.
To start, spin up a GPU cluster with high RAM on your preferred cloud provider; we are using GKE in this case. You can follow the steps here, which guide you through the process of provisioning a GKE cluster.
Once the cluster is up, ensure it's available in your OGC organisation. This typically involves copying a Helm script and adding it to your provider's console.
For best results when using the CodeLlama-7b-Instruct model, it's advisable to choose at least 1 GPU with a minimum of 120 GB memory. Refer to the screenshot below.
Now, configure a package on OGC that includes your application image and other necessary commands/arguments. Make sure to add the cluster to the project where the package will be deployed.
In the application details, provide the Docker Hub image path (or any other repository where you stored the image file). If you didn't specify a tag, OGC will use the latest version of the image file as shown below:
Next, define network policies. Add the port information to the container, then define a policy that allows access to the application port (e.g., 7860) with the traffic source set to anywhere.
Once the networking is defined, specify the placement policy, indicating which type of infrastructure you want the application to be deployed on. For LLM models, select clusters with the highest performance and GPU support.
Finally, it's time to deploy your models on OGC. Deployment is a straightforward process: click the Deploy button in the OGC UI. Once your models are deployed, OGC keeps the deployment running through its self-healing processes.
Once the FastAPI web server is up and running, obtain the Fully Qualified Domain Name (FQDN) from OGC's deployment details page. Append the port and path, `:7860/docs`, to the FQDN to access the Swagger UI and validate that the application works.
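A freshly deployed container may need some time before it answers requests, so it can help to poll the docs URL instead of refreshing the browser. This is a sketch only: the `http` scheme and the retry settings are assumptions, and `docs_url`, `is_up` and `wait_until_up` are hypothetical helper names.

```python
import time
from urllib.request import urlopen


def docs_url(fqdn, port=7860, scheme="http"):
    """Build the Swagger UI address from the FQDN shown by OGC."""
    return f"{scheme}://{fqdn}:{port}/docs"


def is_up(url, timeout=10):
    """Return True if the URL answers with HTTP 200, False on any network error."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:  # covers URLError, HTTPError and socket timeouts
        return False


def wait_until_up(url, attempts=30, delay=10.0, probe=is_up, sleep=time.sleep):
    """Probe the URL up to `attempts` times, pausing `delay` seconds between tries."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

With the FQDN copied from the deployment details page, `wait_until_up(docs_url(fqdn))` returns `True` as soon as the Swagger UI responds, or `False` if it never does within the allotted attempts.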
You can further enhance your deployment by integrating the GET responses with a front-end application. This allows you to visualise and interact with the responses generated by the LLM in real time.
I used Appsmith to build an end-user chat application.
By following these steps, you can effectively run containerised LLMs such as Meta's Code Llama and Llama 2 on Ori Global Cloud.
You can also try other LLMs from Hugging Face, or deploy front-end apps that use the FastAPI endpoints to build applications for your end users. You might also build a pipeline to fine-tune, optimise, test and deploy many such LLMs and make them more suitable for your applications. The possibilities of building and deploying containerised LLM applications on OGC are endless! Give it a try, and let us know how it goes.
References:
Hugging Face - Code Llama Models