Deploying generative AI/ML or large language models (LLMs) across projects at scale can be incredibly challenging. While LLMs like Meta's Llama2 and Code Llama have gained popularity for their ability to generate human-like text, the demand for better performance has left many projects with rapidly inflating compute costs.
In this blog post, we will explore how to build and deploy Hugging Face's LLM models on Ori Global Cloud (OGC) using FastAPI.
Why Deploy LLMs Using Ori?
Before we get into the specifics of building and deploying LLMs on OGC, it's essential to understand why we want a platform that can handle scalability in the first place. In an enterprise setting, increasing LLM performance for practical deployments also means increasing training resources (time, data and parameters), which can easily rack up high costs if not managed efficiently.
Containerised Orchestration
The Ori Global Cloud platform is cloud agnostic and enables seamless deployment across different cloud providers or on-premise infrastructure. With its container orchestration system, we can scale AI by deploying containers with the LLM model to any environment, regardless of infrastructure choices or application architectures.
It also provides automation features that streamline the deployment process and ensure continuity of running deployments through self-healing, making it an ideal platform for deploying LLMs.
Let's go step-by-step and build a FastAPI app for code generation, demonstrating how the CodeLlama-7b-Instruct-hf model generates code when given input text.
Before you begin, ensure that you have the following prerequisites in place:
- A basic understanding of Docker and containerization.
- Access to a public container repository (e.g., Docker Hub).
- A free account on Ori Global Cloud.
- An account on Hugging Face.
- A cloud provider account with GPU support, e.g., Google Kubernetes Engine (GKE).
Create a FastAPI web application
We need inference code for the CodeLlama-7b-Instruct-hf model. Create a FastAPI web application in Python; you can name the main script `app.py`.
from fastapi import FastAPI
from transformers import AutoTokenizer
import transformers
# Create a new FastAPI app instance
app = FastAPI()
# Load model
tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-Instruct-hf")
# Initialize the text generation pipeline
# This function will be able to generate text
# given an input.
pipeline = transformers.pipeline("text-generation", model="codellama/CodeLlama-7b-Instruct-hf", device_map="auto")
# Define a function to handle the GET request at `/generate`.
# The generate() function is defined as a FastAPI route that takes a
# string parameter called text. The function generates text based on the
# input using the pipeline() object, and returns a JSON response
# containing the generated code under the key "output".
@app.get("/generate")
def generate(text: str):
    """
    Using the text-generation pipeline from `transformers`, generate code from the given input text. The model used is `codellama/CodeLlama-7b-Instruct-hf`, which can be found [here](https://huggingface.co/codellama/CodeLlama-7b-Instruct-hf).
    """
    # Use the pipeline to generate code from the given input text
    output = pipeline(
        text,
        do_sample=True,
        top_k=10,
        top_p=0.9,
        temperature=0.95,
        num_return_sequences=1,
        pad_token_id=tokenizer.eos_token_id,
        max_length=256,
    )
    # Return the generated text in a JSON response
    return {"output": output[0]["generated_text"]}
📣 When using a gated model such as Meta’s Llama2, you will need a User Access Token from Hugging Face. Once you are approved by Meta, you can create a read access token in your profile settings.
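As a minimal sketch of how the token can be used (assuming a recent `transformers` release where the `token` argument is supported; older releases used `use_auth_token`, and the `HF_TOKEN` environment variable name below is our own choice), you can either run `huggingface-cli login` once or pass the token explicitly when loading a gated model:
import os
from transformers import AutoTokenizer
import transformers
# Hypothetical convention: read your Hugging Face access token from an environment variable
hf_token = os.environ.get("HF_TOKEN")
# Pass the token when loading a gated model such as Llama2
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf", token=hf_token)
pipeline = transformers.pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf", token=hf_token, device_map="auto")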
In `app.py` above, we import the dependencies and specify the tokenizer and the text-generation pipeline. The generation parameters in the inference code are set to recommended values.
For `device_map`: inference can run on CPU, a single GPU, or multiple GPUs by changing `device_map`. Remove the `device_map` parameter to run inference on CPU only; with one or more GPUs, use `device_map="auto"` so the model is placed on the available devices automatically.
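For illustration, here is a minimal sketch of both variants (same model as in `app.py`; the variable names are only for the example):
import transformers
# CPU-only inference: omit the device_map parameter entirely
cpu_pipeline = transformers.pipeline("text-generation", model="codellama/CodeLlama-7b-Instruct-hf")
# Single- or multi-GPU inference: let accelerate place the model on the available devices
gpu_pipeline = transformers.pipeline("text-generation", model="codellama/CodeLlama-7b-Instruct-hf", device_map="auto")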
List the Python dependencies
Make sure to list the dependencies of your Python application in a `requirements.txt` file.
fastapi==0.74.*
requests==2.27.*
uvicorn[standard]==0.17.*
sentencepiece==0.1.*
torch==1.11.*
Setting up a Docker image
Create a `Dockerfile` to set up the environment, install the dependencies, and launch the Python app on port 7860.
# Use the official Python 3.9 image
FROM python:3.9
# Set the working directory to /code
WORKDIR /code
# Copy the current directory contents into the container at /code
COPY ./requirements.txt /code/requirements.txt
# Install transformers (from main) and accelerate along with the other requirements
RUN python -m pip install git+https://github.com/huggingface/transformers.git@main accelerate --no-cache-dir --upgrade -r /code/requirements.txt
# Set up a new user named "user" with user ID 1000
RUN useradd -m -u 1000 user
# Switch to the "user" user
USER user
# Set home to the user's home directory
ENV HOME=/home/user \
PATH=/home/user/.local/bin:$PATH
# Set the working directory to the user's home directory
WORKDIR $HOME/app
# Copy the current directory contents into the container at $HOME/app setting the owner to the user
COPY --chown=user . $HOME/app
# Start the FastAPI app on port 7860 (the default port used by Hugging Face Docker Spaces)
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860"]
💡 Make sure to install transformers from main until the next version is released:
pip install git+https://github.com/huggingface/transformers.git@main accelerate
Build and test locally
After creating the Python application and Dockerfile, build a Docker image from the project's root directory using the following command:
docker build -t your-app-name .
Once the image is built, you can run it locally:
docker run -p 7860:7860 your-app-name
Test your FastAPI application to ensure it generates the desired output.
Visit http://localhost:7860/docs in your web browser to open the Swagger UI generated by FastAPI, then click 'Try it out' on the /generate endpoint to test your app.
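You can also call the endpoint directly from Python; here is a small sketch using the `requests` library (already listed in `requirements.txt`), with an example prompt of our own choosing:
import requests
# Call the local /generate endpoint with an example prompt
response = requests.get("http://localhost:7860/generate", params={"text": "Write a Python function that reverses a string."})
print(response.json()["output"])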
Push the Docker image to a public repository
To make the Docker image available to the OGC platform, push it to a public repository like Docker Hub:
docker buildx build --platform linux/amd64,linux/arm64 --push -t your-dockerhub-username/your-app-name:latest .
Note: I used `buildx` to build multi-platform images.
Set up a GPU cluster on your Cloud Provider
The CodeLlama model can be deployed on any cloud provider’s Kubernetes cluster, such as EKS, GKE, AKS, Linode, or DigitalOcean.
To start, spin up a GPU cluster with high RAM on your preferred cloud provider; we are using GKE in this case. You can follow the steps here, which guide you through provisioning a GKE cluster.
⚠️ Select `Enable spot VMs` to minimise the cost of your cluster!
Once the cluster is up, ensure it's available in your OGC organisation. This typically involves copying a Helm script from OGC and running it in your cloud provider's console.
For best results with the CodeLlama-7b-Instruct model, it’s advisable to choose at least one GPU and a minimum of 120 GB of memory. Refer to the screenshot below.
Configure a Package in OGC
Now, configure a package on OGC that includes your application image and other necessary commands/arguments. Make sure to add the cluster to the project where the package will be deployed.
In the application details, provide the Docker Hub image path (or the path in whichever repository you stored the image). If you don't specify a tag, OGC will use the latest version of the image, as shown below:
For large scale parallel processing, you can run your models on multiple clusters by setting the minimum and maximum number of clusters with replicas per cluster.
Next, define the network policies. Add the port information to the container and allow access to the application port (e.g., 7860) from any traffic source.
Once the networking is defined, specify the placement policy, indicating which type of infrastructure the application should be deployed on. For LLMs, select clusters with the highest performance and GPU support.
Finally, it's time to deploy your models on OGC. Deployment is straightforward: click the Deploy button in the OGC UI. Once deployed, OGC ensures that your models remain available and perform optimally.
Validate the application
Once the FastAPI web server is up and running, obtain the Fully Qualified Domain Name (FQDN) from OGC's deployment details page. Append the port and path `:7860/docs` to the FQDN to access the Swagger UI and validate the endpoint.
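As a quick check, you can also call the deployed endpoint the same way as in the local test; the hostname below is a placeholder, so substitute the FQDN from your own deployment:
import requests
# Placeholder FQDN: replace with the value shown on OGC's deployment details page
fqdn = "your-deployment.example.com"
response = requests.get(f"http://{fqdn}:7860/generate", params={"text": "Write a SQL query that counts orders per day."})
print(response.json()["output"])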
Integrate with Front-end application
You can further enhance your deployment by integrating the GET responses with a front-end application. This allows you to visualise and interact with the responses generated by the LLM in real time.
I used Appsmith to build an end user chat application.
You're done!
By following these steps, you can effectively run containerised LLMs like Meta's Code Llama and Llama2 on Ori Global Cloud.
You can also try other LLMs from Hugging Face, or deploy front-end apps that use the FastAPI endpoints to build applications for your end users. You might also build a pipeline to fine-tune, optimise, test and deploy many such LLMs to make them more suitable for your applications. The possibilities for building and deploying containerised LLM applications on OGC are endless! Give it a try, and let us know how it goes.
References:
- Hugging Face - Code Llama Models
- Hugging Face’s Docker Spaces