
Accelerate Llama 3.1 8B Instruct Inference with TensorRT LLM

An end-to-end tutorial using Ori's Virtual Machines, Llama 3.1 8B Instruct, and FastAPI for speedy batch inference with TensorRT LLM.

There’s so much hype around the inference speeds achieved by TensorRT LLM, but it’s tough to know where to get started when optimising your own LLM deployment. Here we provide a complete guide to building a TensorRT LLM engine and deploying an API to batch requests on Ori’s Virtual Machines. 

An Introduction to TensorRT and the LLM API

TensorRT is an SDK developed by NVIDIA for high-performance deep learning inference. It provides optimisations, a runtime, and deployment tools that significantly accelerate AI applications, particularly when running on NVIDIA GPUs. NVIDIA reports that “TensorRT-based applications perform up to 36X faster than their CPU-only platform during inference.” Third-party benchmarks have also verified that TensorRT outperforms other inference engines, such as the last engine we benchmarked with BeFOri: vLLM.

Key Features of TensorRT:

  • Compiled Inferencing Engine: Developers can compile a model into an optimised C++ TensorRT engine through a simple Python library (without interacting with C++) that runs much faster than the raw weights.
  • Optimised Inferencing Performance: Developers can apply optimisation techniques such as quantization, layer and tensor fusion, and kernel tuning through parameters when building the TensorRT engine (sketched below).
  • Automated Dynamic Batching: Developers can rely on the TensorRT engine to efficiently manage memory and handle varying input sizes and batch dimensions, enabling auto scaling.

Together these features deliver high-throughput, low-latency inference, giving you both very fast results and potentially substantial cost savings.
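For instance, quantisation can be requested through build-time parameters of the simple Python library mentioned above (the LLM API, covered in the next section). The sketch below is a rough illustration only: the QuantConfig and QuantAlgo helpers and their import path are taken from the repo's llm-api quantization example and may change while the API is under development.

from tensorrt_llm import LLM
from tensorrt_llm.llmapi import QuantAlgo, QuantConfig  # import path assumed; see the llm-api examples

# Request FP8 quantisation at engine-build time. FP8 needs a Hopper-class GPU
# such as the H100; pick an INT4/INT8 algorithm instead on the A100.
quant_config = QuantConfig(quant_algo=QuantAlgo.FP8)

# Constructing the LLM object compiles the optimised TensorRT engine with these settings
llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct", quant_config=quant_config)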

The TensorRT LLM API

The TensorRT SDK covers text, audio, and video modalities, but here we will focus on optimising text-to-text generation using the LLM class in the tensorrt_llm library. The API is currently under development and the documentation is sparse, but we were able to modify the Generate Text in Streaming example provided in the TensorRT-LLM git repo to successfully deploy an API wrapping the TensorRT engine for Llama 3.1 8B Instruct.
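As a taste of the LLM API before we wrap it in a web service, a minimal offline-generation sketch (adapted from the repo's llm-api examples; the prompts and sampling values are our own) looks roughly like this:

from tensorrt_llm import LLM, SamplingParams

# The first run compiles a TensorRT engine for the model, so expect it to take a while
llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct")

prompts = ["The capital of France is", "TensorRT LLM speeds up inference by"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# generate() batches the prompts and returns one result per prompt, in order
for output in llm.generate(prompts, sampling_params):
    print(output.prompt, "->", output.outputs[0].text)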

Limitations of TensorRT

As mentioned, the TensorRT LLM API is currently under development, and there may be breaking changes in the future. There is no doubt NVIDIA will roll out some exciting enhancements in 2025, but for now, these are the challenges we ran into.

Limited Model support

At the time of writing the supported models include:

  • Llama (including variants Mistral, Mixtral, InternLM)
  • GPT (including variants Starcoder-1/2, Santacoder)
  • Gemma-1/2
  • Phi-1/2/3
  • ChatGLM (including variants glm-10b, chatglm, chatglm2, chatglm3, glm4)
  • QWen-1/1.5/2
  • Falcon
  • Baichuan-1/2
  • GPT-J
  • Mamba-1/2

TensorRT requires significant effort to optimise and validate models for performance and precision across its supported quantisations. Each new model must be tailored to ensure compatibility with TensorRT's kernel operations and its inference engine, which requires substantial time and resources.

Lack of Documentation and Tutorials for TensorRT-LLM

The TensorRT-LLM git repo contains a multitude of examples, primarily organised by model; however, the lack of documentation, dozens of command line arguments, and thousands of lines of messy code make them tough to decipher.

It’s best to start with the directory TensorRT-LLM/examples/llm-api if your model is supported by the LLM API; otherwise you’ll need to navigate to the model’s directory under TensorRT-LLM/examples/ and work through the steps in the Quick Start Guide to:

  1. Compile the Model into a TensorRT Engine
    1. Convert the checkpoint
    2. Build the engine
  2. Run the model
  3. Deploy the model

However if your model is supported by the LLM API you're in luck - read on and follow the tutorial provided below.

Streaming Batch Responses

It appears the streaming and batching functionalities are not compatible at this time. While the Generate Text in Streaming example provided in the TensorRT-LLM git repo claims the results will print out one token at a time, this was not the behaviour we observed when running it ourselves. You can pass the parameter streaming=True to the TensorRT runner.generate() function and successfully generate a response, but there does not appear to be built-in functionality to consume those tokens as they are streamed back.

As you might expect, the TensorRT engine does not accept a streamer parameter such as the TextIteratorStreamer from the transformers library that is commonly used to consume streaming responses. This makes it challenging to consume streamed output, especially when a batch of requests is generated concurrently.

There are currently 16 open issues related to streaming across various parts of the repo, and this functionality is clearly still under development. For the time being, we moved forward with a batch inference tutorial instead.

TensorRT LLM Tutorial:

You will need to sign up to Ori's Public Cloud and request access to the Meta Llama 3.1 models on Hugging Face before completing these steps.

1. Create a VM on Ori's Public Cloud 

Log into the Ori Console, navigate to the Virtual Machines page and create a new instance. When you reach the option to add an init script, copy and paste the appropriate script from below:

Init Script for H100 SXM VM:

#!/bin/bash

sudo apt update && \
sudo apt -y upgrade && \
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb && \
sudo dpkg -i cuda-keyring_1.1-1_all.deb && \
sudo apt-get update && sudo apt-get -y install cuda-toolkit-12-1 && \
sudo apt-get install -y nvidia-driver-555-open && \
sudo apt-get install -y cuda-drivers-555 && \
echo "blacklist nvidia_uvm" | sudo tee /etc/modprobe.d/nvlink-denylist.conf && \
echo "options nvidia NVreg_NvLinkDisable=1" | sudo tee /etc/modprobe.d/disable-nvlink.conf && \
sudo apt install -y nvidia-cuda-toolkit && \
sudo update-initramfs -u && \
sudo apt -y upgrade && \
sudo reboot

Init Script for A100 VM:

#!/bin/bash

sudo apt update && \
sudo apt -y upgrade && \
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb && \
sudo dpkg -i cuda-keyring_1.1-1_all.deb && \
sudo apt-get update && sudo apt-get -y install cuda-toolkit-12-1 && \
sudo apt-get install -y nvidia-driver-555-open && \
sudo apt-get install -y cuda-drivers-555 && \
sudo apt install -y nvidia-cuda-toolkit && \
sudo update-initramfs -u && \
sudo apt -y upgrade && \
sudo reboot

It will take up to 10 minutes for your machine to be provisioned and become available. 

2. Install Dependencies

You can copy the ssh command directly from the Ori Console to connect to your machine, and then run the following commands:


# Verify init script installation
nvidia-smi
nvcc --version
cat /proc/driver/nvidia/version
nvidia-smi -q | grep -A5 Fabric
# Expect NAs in response

# Setup venv and activate
sudo apt install python3.10-venv && python3 -m venv tensorrt
source tensorrt/bin/activate

# Python Package Index Installation
python3 -m pip install --upgrade pip
python3 -m pip install wheel && python3 -m pip install --upgrade tensorrt

# Install TensorRT
wget https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.5.0/local_repo/nv-tensorrt-local-repo-ubuntu2204-10.5.0-cuda-11.8_1.0-1_amd64.deb && \
sudo dpkg -i nv-tensorrt-local-repo-ubuntu2204-10.5.0-cuda-11.8_1.0-1_amd64.deb && \
sudo cp /var/nv-tensorrt-local-repo-ubuntu2204-10.5.0-cuda-11.8/nv-tensorrt-local-EE22FB8A-keyring.gpg /usr/share/keyrings/ && \
sudo cp /var/nv-tensorrt-local-repo-ubuntu2204-10.5.0-cuda-11.8/*-keyring.gpg /usr/share/keyrings/ && \
sudo apt-get install tensorrt

sudo apt install libmpich-dev && sudo apt-get install libopenblas-base libopenmpi-dev libomp-dev

sudo apt install python3-dev

# Enter "Y" when prompted for permission


To install the required python libraries, create a requirements.txt file containing the following:

fastapi==0.115.4
huggingface-hub
ray==2.11.0
ray[serve]==2.11.0
tensorrt_llm==0.15.0.dev2024111200
torch==2.5.1
transformers==4.43.4
wheel==0.43.0

Then install the libraries and log into the Hugging Face CLI using the access token associated with the account you used to request permission for the Llama 3.1 models:

pip install -r requirements.txt

huggingface-cli login --token "<your-access-token>"
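Before moving on, it can be worth sanity-checking that the GPU-enabled libraries installed correctly. The snippet below is an optional check we suggest; run it with python3 inside the tensorrt venv (the file name check_install.py is just our suggestion):

# check_install.py (hypothetical helper, run inside the tensorrt venv)
import torch
import tensorrt_llm

print("tensorrt_llm version:", tensorrt_llm.__version__)
print("CUDA available:", torch.cuda.is_available())  # should print True
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))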

3. Create a FastAPI App to Wrap the TensorRT LLM Engine

Create a Python file called deploy_tensorrt_engine.py that contains:

from tensorrt_llm import LLM, SamplingParams
import logging
from fastapi import FastAPI, HTTPException
from ray import serve
from itertools import islice
from typing import Dict
import uuid
import time
import threading

logger = logging.getLogger("ray.serve")

fastapi_app = FastAPI()


@serve.deployment(ray_actor_options={"num_gpus": 1})
@serve.ingress(fastapi_app)
class DeployTRTEngine:

    def __init__(self, model_id: str, ccr: int, batch_time=1.0):
        self.model = LLM(model=model_id)

        self.queue = {}
        self.statuses = {}
        self.outputs = {}

        self.ccr = ccr
        self.batch_time = batch_time
        self.timer = 0

    @fastapi_app.post("/")
    def handle_request(self, prompt: str):

        # If the queue is empty then (re)start the timer
        queue_len = len(self.queue)
        if queue_len == 0:
            self.timer = time.time()

        # Generate a unique ID for the request and set up tracking
        task_id = str(uuid.uuid4())
        self.queue[task_id] = prompt
        queue_len += 1
        self.statuses[task_id] = "in queue"

        # If we have the desired number of concurrent requests
        # or batch_time seconds have passed then start generating
        if queue_len >= self.ccr or time.time() - self.timer > self.batch_time:
            # make a dictionary of prompts that contain the desired number of
            # concurrent requests or less
            prompt_len = min(self.ccr, queue_len)
            prompts_dict = dict(islice(self.queue.items(), prompt_len))

            # remove them from the queue
            self.queue = dict(islice(self.queue.items(), prompt_len, None))

            # update statuses
            self.statuses = {
                key: ("in progress" if key in prompts_dict else value)
                for key, value in self.statuses.items()
            }

            # Start a background thread to process the task
            threading.Thread(
                target=self.generate_text, kwargs={"prompts": prompts_dict}
            ).start()
        return {"task_id": task_id}

    def generate_text(self, prompts: Dict[str, str]):
        prompt_list = list(prompts.values())

        # Generate Outputs
        raw_outputs = self.model.generate(prompt_list)

        # Process outputs: generate() returns results in the same order as the
        # input prompts, so pair each output with the next task id in the dict
        for _output in raw_outputs:
            _task_id, input_prompt = next(iter(prompts.items()))
            prompts.pop(_task_id)
            self.outputs[_task_id] = {
                "prompt": input_prompt,
                "text": _output.outputs[0].text,
                "token_len": len(_output.outputs[0].token_ids),
            }

            self.statuses[_task_id] = "complete"

    @fastapi_app.get("/response/{task_id}")
    def get_response(self, task_id: str):
        # Get the status of the task, if the task id is not found raise an error
        try:
            status = self.statuses[task_id]
        except KeyError:
            raise HTTPException(status_code=404, detail="Task ID not found")

        # Return 202 if it's not done generating yet
        if status in ["in queue", "in progress"]:
            raise HTTPException(status_code=202, detail=f"Task is {status}.")
        ret = self.outputs.pop(task_id)
        self.statuses.pop(task_id)
        return ret


app = DeployTRTEngine.bind("meta-llama/Meta-Llama-3.1-8B-Instruct", 2, 2.0)

You may need to adjust line 16 `@serve.deployment(ray_actor_options={"num_gpus": 1})` to ensure the correct number of GPUs is made available to the app.

You can also modify the last line to adjust the parameters (an example follows this list):

  • model_id can be changed from "meta-llama/Meta-Llama-3.1-8B-Instruct" to the model id from Hugging Face for any supported model
  • ccr can be changed from 2 to any desired number of concurrent requests; this acts as a maximum limit on the batch size that will be sent to the engine for concurrent generation
  • batch_time can be changed from 2.0 to any number of seconds to wait for additional requests before sending a batch of prompts to the engine for concurrent generation
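For example, to allow batches of up to eight prompts and shorten the wait window to half a second, the last line could be changed as follows (the values are purely illustrative, not tuned recommendations):

app = DeployTRTEngine.bind(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",  # model_id: any supported model from Hugging Face
    8,    # ccr: at most 8 prompts per batch sent to generate()
    0.5,  # batch_time: wait up to 0.5 seconds for additional requests before generating
)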

Deploy the app by running:

serve run deploy_tensorrt_engine:app

4. Call the TensorRT Engine API

Open a new terminal window and open a second connection to the VM by running the same ssh command found in the Ori console at the beginning.

Below is a simple Python script that sends a single prompt to the API, polls every 0.1 seconds until it receives a response, and then prints the response:

import requests
import time

if __name__ == "__main__":
    url = "http://localhost:8000/"
    prompt = "It's finally working! Now "

    post_response = requests.post(url, params={"prompt": prompt})
    post_response.raise_for_status()

    if post_response.status_code == 200:
        data = post_response.json()
        task_id = data["task_id"]
    else:
        raise ("Error:", post_response.text)

    get_url = f"{url}response/{task_id}"
    while True:
        try:
            get_response = requests.get(get_url)
            if get_response.status_code == 200:
                print("Task completed!")
                print("Response:", get_response.json())
                break  # Exit the loop

            elif get_response.status_code == 202:
                print(f"{get_response.json()['detail']}")

                # Wait for 0.1 seconds before retrying
                time.sleep(0.1)

            else:
                get_response.raise_for_status()

        except requests.exceptions.RequestException as e:
            print(f"An error occurred: {e}")
            break  # Exit the loop on request errors

To send multiple prompts to be batched together, simply call requests.post(url, params={"prompt": prompt}) in a loop, once per prompt. Collect the task_ids it returns in a list, then retrieve the responses with a loop over requests.get(f"{url}response/{task_id}").
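Here is a minimal sketch of such a batching client, reusing the polling pattern above; the prompt texts and the 0.1 second interval are just examples:

import time
import requests

url = "http://localhost:8000/"
prompts = ["Tell me a joke about GPUs.", "Summarise TensorRT in one sentence."]

# Submit every prompt first so the server can batch them together
task_ids = [
    requests.post(url, params={"prompt": p}).json()["task_id"] for p in prompts
]

# Poll each task until its response is ready
for task_id in task_ids:
    while True:
        get_response = requests.get(f"{url}response/{task_id}")
        if get_response.status_code == 200:
            print(get_response.json())
            break
        elif get_response.status_code != 202:
            get_response.raise_for_status()
        time.sleep(0.1)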

You can also call the API from outside the VM by replacing localhost in the URL with the VM's IP address and handling any required authentication.

Conclusion

TensorRT is a powerful tool for accelerating large language model inference, and its deployment on Ori's Virtual Machines provides an efficient and cost-effective solution for high-performance AI applications. While the TensorRT LLM API is still evolving, it already offers impressive features for optimizing and managing LLM inference. This tutorial demonstrates how to get started with TensorRT, showcasing its ability to handle batch inference and the potential for real-world applications.

For ML engineers looking to maximise GPU utilisation and minimise latency, Ori's platform combined with TensorRT offers an ideal setup to experiment, deploy, and scale AI models. As NVIDIA continues to refine TensorRT and its LLM capabilities, its adoption will undoubtedly grow, making it a cornerstone for AI inference in production environments.
