
AI at Scale: Deploy LLMs like Code Llama on Any Cloud

Follow this step-by-step guide to quickly deploy Meta’s Code Llama and other open-source Large Language Models (LLMs), using Python and Hugging Face on the Ori Global Cloud platform.

Deploying generative AI/ML or large language models (LLMs) across projects at scale can be incredibly challenging. While LLMs like Meta's Llama 2 and Code Llama have gained popularity for their ability to generate human-like text, the demand for better performance has left many projects with rapidly inflating compute costs.

In this blog post, we will explore how to build and deploy Hugging Face's LLM models on Ori Global Cloud (OGC) using FastAPI.

Why Deploy LLMs Using Ori?

Before we get into the specifics of building and deploying LLM models on OGC, it's essential to understand why we want a platform that can handle scalability in the first place. In an enterprise setting, increasing LLM performance for practical deployments also means increasing training resources—time, data and parameters—which can easily rack up high costs if not done efficiently.

Container Orchestration

The Ori Global Cloud platform is cloud agnostic and enables seamless deployment across different cloud providers or on-premise infrastructure. With its container orchestration system, we can scale AI by deploying containers with the LLM model to any environment, regardless of infrastructure choices or application architectures.

It also provides automation features that streamline the deployment process and ensure continuity of running deployments through self-healing processes, making it an ideal platform for deploying LLM models.


Let's go step-by-step and build a FastAPI app for code generation, demonstrating how the CodeLlama-7b-Instruct-hf model generates code when given input text.

Before you begin, ensure that you have the following prerequisites in place:

  1. A basic understanding of Docker and containerisation.
  2. Access to a public container repository (e.g., Docker Hub).
  3. A free account on Ori Global Cloud.
  4. An account on Hugging Face.
  5. A cloud provider account with GPU support (e.g., Google Kubernetes Engine (GKE)).

Create a FastAPI web application

We need inference code for the CodeLlama-7b-Instruct model. Create a FastAPI web application in Python. You can name the main script `app.py`.

from fastapi import FastAPI
from transformers import AutoTokenizer
import transformers

# Create a new FastAPI app instance
app = FastAPI()

# Load the model's tokenizer
tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-Instruct-hf")

# Initialize the text-generation pipeline.
# This function will be able to generate text given an input.
pipeline = transformers.pipeline(
    "text-generation",
    model="codellama/CodeLlama-7b-Instruct-hf",
    device_map="auto",
)

# Define a function to handle the GET request at `/generate`.
# The generate() function is defined as a FastAPI route that takes a
# string parameter called text. The function generates text based on the
# input using the pipeline() object, and returns a JSON response
# containing the generated code under the key "output".
@app.get("/generate")
def generate(text: str):
    """
    Using the text-generation pipeline from `transformers`, generate code
    from the given input text. The model used is
    `codellama/CodeLlama-7b-Instruct-hf`, which can be found
    [here](https://huggingface.co/codellama/CodeLlama-7b-Instruct-hf).
    """
    # Use the pipeline to generate code from the given input text
    output = pipeline(
        text,
        do_sample=True,
        top_k=10,
        top_p=0.9,
        temperature=0.95,
        num_return_sequences=1,
        pad_token_id=tokenizer.eos_token_id,
        max_length=256,
    )

    # Return the generated text in a JSON response
    return {"output": output[0]["generated_text"]}

 

📣 When using a gated model such as Meta's Llama 2, you will need a User Access Token from Hugging Face. Once you are approved by Meta, you can create a read access token in your profile settings.
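As an illustration (not part of the app above), here is a minimal sketch of passing that token when loading a gated model. It assumes the token is exported in an environment variable named HF_TOKEN (an assumed name), and uses Llama 2 purely as an example; depending on your transformers version, the parameter may be called `use_auth_token` instead of `token`:

import os
from transformers import AutoTokenizer
import transformers

# Hypothetical example: read the Hugging Face access token from an
# environment variable (HF_TOKEN is an assumed name, not from the original post)
hf_token = os.environ["HF_TOKEN"]

# Pass the token when loading a gated model such as Llama 2
# (older transformers versions use `use_auth_token=` instead of `token=`)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf", token=hf_token)
pipeline = transformers.pipeline(
    "text-generation",
    model="meta-llama/Llama-2-7b-chat-hf",
    token=hf_token,
    device_map="auto",
)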


In the Python code above, we import the dependencies, load the tokenizer, and set up the text-generation pipeline. The inference call also specifies several sampling parameters with recommended values.

For `device_map`: inference can run on CPU, a single GPU, or multiple GPUs by changing the `device_map` parameter. Remove the `device_map` parameter to run inference on CPU only; with multiple GPUs, use `device_map="auto"` to spread the model across the available devices.
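For example, a minimal sketch of the two variants (other parameter values as in `app.py`):

import transformers

# CPU-only inference: omit device_map entirely
cpu_pipeline = transformers.pipeline(
    "text-generation",
    model="codellama/CodeLlama-7b-Instruct-hf",
)

# Single- or multi-GPU inference: let accelerate place the model
# across the available devices
gpu_pipeline = transformers.pipeline(
    "text-generation",
    model="codellama/CodeLlama-7b-Instruct-hf",
    device_map="auto",
)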

List the Python dependencies

Make sure to list the dependencies of your Python application in a `requirements.txt` file.

fastapi==0.74.*
requests==2.27.*
uvicorn[standard]==0.17.*
sentencepiece==0.1.*
torch==1.11.*


Set up a Docker image

Create a `Dockerfile` to set up the environment, install the dependencies, and launch the Python app on port 7860.

# Use the official Python 3.9 image
FROM python:3.9

# Set the working directory to /code
WORKDIR /code

# Copy the requirements file into the container at /code
COPY ./requirements.txt /code/requirements.txt

# Install requirements.txt
# RUN pip install --no-cache-dir --upgrade -r /code/requirements.txt
RUN python -m pip install git+https://github.com/huggingface/transformers.git@main accelerate --no-cache-dir --upgrade -r /code/requirements.txt

# Set up a new user named "user" with user ID 1000
RUN useradd -m -u 1000 user

# Switch to the "user" user
USER user

# Set home to the user's home directory
ENV HOME=/home/user \
    PATH=/home/user/.local/bin:$PATH

# Set the working directory to the user's home directory
WORKDIR $HOME/app

# Copy the current directory contents into the container at $HOME/app, setting the owner to the user
COPY --chown=user . $HOME/app

# Start the FastAPI app on port 7860
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860"]

 

💡 Make sure to install transformers from main until the next version is released:
pip install git+https://github.com/huggingface/transformers.git@main accelerate


Build and test locally

After creating the Python application and Dockerfile, build a Docker image from the project's root directory using the following command:

docker build -t your-app-name .

Once the image is built, you can run it locally:

docker run -p 7860:7860 your-app-name

Test your FastAPI application to ensure it generates the desired output. 

Visit http://localhost:7860/docs in your web browser to access the Swagger UI.

You should see the Swagger docs generated by FastAPI locally.

You should be able to click on 'Try it out' to test your app.
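Alternatively, you can call the endpoint directly from a script. The snippet below is a small illustrative test (not part of the original post) that assumes the container is running locally on port 7860; the prompt text is just an example:

import requests

# Send a prompt to the locally running FastAPI app
# (the /generate route and "output" key come from app.py above)
response = requests.get(
    "http://localhost:7860/generate",
    params={"text": "Write a Python function that reverses a string."},
)
response.raise_for_status()

# Print the generated code returned by the model
print(response.json()["output"])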


 

Push the image to a public repository

To make the Docker image available to the OGC platform, push it to a public repository like Docker Hub:

docker buildx build --platform linux/amd64,linux/arm64 --push -t your-dockerhub-username/your-app-name:latest .
Note: I used `buildx` to build multi-platform images.

 

Set up a GPU cluster on your cloud provider

The Code Llama model can be deployed on any cloud provider's Kubernetes (K8s) cluster, such as EKS, GKE, AKS, Linode or DigitalOcean.

To start with, spin up a GPU cluster with high RAM on your preferred cloud provider; we are using GKE in this case. You can follow the steps here to provision a GKE cluster.

⚠️ Select `Enable spot VMs instance` to minimise the cost of your cluster!


Once the cluster is up, ensure it's available in your OGC organisation. This typically involves copying a Helm script and adding it to your provider's console.

For best results when using the CodeLlama-7b-Instruct model, it's advisable to choose at least 1 GPU with a minimum of 120 GB of memory. Refer to the screenshot below.


 

Configure a Package in OGC

Now, configure a package on OGC that includes your application image and other necessary commands/arguments. Make sure to add the cluster to the project where the package will be deployed.

In the application details, provide the Docker Hub image path (or the path to any other repository where you stored the image). If you don't specify a tag, OGC will use the latest version of the image, as shown below:

For large scale parallel processing, you can run your models on multiple clusters by setting the minimum and maximum number of clusters with replicas per cluster.


Next, define network policies. Add the container port information and allow access to the application from any traffic source to the application port, e.g., 7860.

Once the networking is defined, specify the placement policy, indicating which type of infrastructure you want the application to be deployed on. For LLM models, select clusters with the highest performance and GPU support.

Finally, it's time to deploy your models on OGC. Deployment is a straightforward process that can be performed by clicking on the Deploy button on the OGC UI. Once your models are deployed, OGC ensures that your models are always available and performing optimally. 


 

Validate the application

Once the FastAPI web server is up and running, obtain the Fully Qualified Domain Name (FQDN) from OGC's deployment details page. Append `:7860/docs` to the FQDN to access the Swagger UI and validate the endpoint.
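You can also hit the deployed endpoint directly to confirm it responds. The snippet below is an illustrative check; the FQDN value and prompt are placeholders, not from the original post:

import requests

# Placeholder: replace with the FQDN from OGC's deployment details page
fqdn = "your-deployment-fqdn.example.com"

# Query the deployed /generate endpoint and print the model's output
response = requests.get(
    f"http://{fqdn}:7860/generate",
    params={"text": "Write a Python function that checks if a number is prime."},
)
print(response.status_code)
print(response.json()["output"])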


 

Integrate with a front-end application

You can further enhance your deployment by integrating the GET responses with a front-end application. This allows you to visualise and interact with the responses generated by the LLM models in real-time.

I used Appsmith to build an end-user chat application.



You're done! 

By following these steps, you can effectively run containerised LLMs like Meta's Code Llama and Llama 2 on Ori Global Cloud.

You can also try other LLMs from Hugging Face, or deploy front-end apps that utilise the FastAPI endpoints to build applications for your end users. You might also build a pipeline to fine-tune, optimise, test and deploy many such LLMs to make them more suitable for your applications. The possibilities of building and deploying containerised LLM applications on OGC are endless! Give it a try, and let us know how it goes.


 

References:

Hugging Face - Code Llama Models

Hugging Face’s Docker Spaces
