An end-to-end tutorial using Ori's Virtual Machines, Llama 3.1 8B Instruct, and FastAPI for speedy batch inference with TensorRT LLM.
There’s so much hype around the inference speeds achieved by TensorRT LLM, but it’s tough to know where to get started when optimising your own LLM deployment. Here we provide a complete guide to building a TensorRT LLM engine and deploying an API to batch requests on Ori’s Virtual Machines.
TensorRT is an SDK developed by NVIDIA for high-performance deep learning inference. It provides optimisations, a runtime, and deployment tools that significantly accelerate AI applications, particularly when running on NVIDIA GPUs. NVIDIA reports that “TensorRT-based applications perform up to 36X faster than their CPU-only platform during inference.” Third-party benchmarks have also verified that TensorRT outperforms other inference engines, such as the last engine we benchmarked with BeFOri: vLLM.
These optimisations result in the highest throughput and lowest latency inference, providing you with both super-fast results and potentially substantial cost savings.
The TensorRT SDK includes text, audio, and video modalities, but here we will focus on optimising text-to-text generation using the LLM class in the tensorrt_llm library. The API is currently under development and the documentation is sparse, but we were able to modify the Generate Text in Streaming example provided in the TensorRT-LLM git repo to successfully deploy an API wrapping the TensorRT engine for Llama 3.1 8B Instruct.
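To give a sense of the LLM class before wrapping it in an API, here is a minimal sketch based on the llm-api examples; the prompt and sampling values are illustrative and may need adjusting for your version of tensorrt_llm:

from tensorrt_llm import LLM, SamplingParams

# Load the model and build (or reuse) a TensorRT engine for it
llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct")
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)  # illustrative values

# Generate completions for a batch of prompts
outputs = llm.generate(["What is the capital of France?"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)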
As mentioned, the TensorRT LLM API is currently under development, and there may be breaking changes in the future. There is no doubt NVIDIA will roll out some exciting enhancements in 2025, but for now, these are the challenges we ran into.
At the time of writing, the supported models include:
TensorRT requires significant effort to optimise and validate models for performance and precision across its supported quantisations. Each new model must be tailored to ensure compatibility with TensorRT's kernel operations and its inference engine, which requires substantial time and resources.
The TensorRT-LLM git repo contains a multitude of examples, primarily organised by model; however, the lack of documentation, dozens of command line arguments, and thousands of lines of messy code make it tough to decipher.
It's best to start with the TensorRT-LLM/examples/llm-api directory if your model is supported by the LLM API; otherwise you'll need to navigate to the model's directory under TensorRT-LLM/examples/ and work through the steps in the Quick Start Guide to:
However, if your model is supported by the LLM API, you're in luck: read on and follow the tutorial provided below.
It appears the streaming and batching functionalities are not compatible at this time. While the Generate Text in Streaming example provided in the TensorRT-LLM git repo claims the results will print out one token at a time, this was not the behaviour we observed when running it ourselves. You can pass the parameter streaming=True to the TensorRT runner.generate() function and successfully generate a response, but there does not appear to be built-in functionality to consume those tokens as they are streamed back.
The TensorRT engine also does not accept a streamer parameter, such as the TextIteratorStreamer from the transformers library that is commonly used to consume streaming responses, as you might expect. This makes streaming responses challenging to consume, especially when a batch of requests is generated concurrently.
There are currently 16 open issues related to streaming across various parts of the repo, and this functionality is clearly still under development. For the time being, we moved forward with a batch inference tutorial instead.
You will need to sign up to Ori's Public Cloud and request access to the Meta Llama 3.1 models on Hugging Face before completing these steps.
Log into the Ori Console, navigate to the Virtual Machines page and create a new instance. When you reach the option to add an init script, copy and paste the appropriate script from below:
Init Script for H100 SXM VM:

#!/bin/bash
sudo apt update && \

Init Script for A100 VM:

#!/bin/bash
sudo apt update && \
It will take up to 10 minutes for your machine to be provisioned and become available.
You can copy the ssh command directly from the Ori Console to connect to your machine, and then run the following commands:
# Verify init script installation

# Setup venv and activate

# Python Package Index Installation

# Install TensorRT
sudo apt install libmpich-dev && sudo apt-get install libopenblas-base libopenmpi-dev libomp-dev
sudo apt install python3-dev
# Enter "Y" when prompted for permission
To install the required Python libraries, create a requirements.txt file containing the following:
fastapi==0.115.4
Then install the libraries and log into the Hugging Face CLI using the access token associated with the account you used to request permission to the Llama 3.1 models:
pip install -r requirements.txt
huggingface-cli login --token "<your-access-token>"
Create a Python file called deploy_tensorrt_engine.py that contains:
import logging
import threading
import uuid
from typing import Dict

from fastapi import FastAPI
from fastapi.responses import JSONResponse
from ray import serve
from tensorrt_llm import LLM, SamplingParams

logger = logging.getLogger("ray.serve")
fastapi_app = FastAPI()

@serve.deployment(ray_actor_options={"num_gpus": 1})
@serve.ingress(fastapi_app)
class DeployTRTEngine:
    def __init__(self, model_id: str, ccr: int, batch_time=1.0):
        self.llm = LLM(model=model_id)  # builds/loads the TensorRT engine for the model
        self.sampling_params = SamplingParams(temperature=0.8, top_p=0.95)  # illustrative values; adjust as needed
        self.queue = {}
        self.statuses = {}
        self.responses = {}
        self.ccr = ccr
        self.batch_time = batch_time
        self.timer = None

    @fastapi_app.post("/")
    def enqueue(self, prompt: str):
        # If the queue is empty then (re)start the timer
        if not self.queue:
            self.timer = threading.Timer(self.batch_time, self.flush_queue)
            self.timer.start()
        # Generate a unique ID for the request and set up tracking
        task_id = str(uuid.uuid4())
        self.queue[task_id] = prompt
        self.statuses[task_id] = "queued"
        # If we have the desired number of concurrent requests, flush the queue immediately
        if len(self.queue) >= self.ccr:
            self.timer.cancel()
            self.flush_queue()
        return {"task_id": task_id}

    def flush_queue(self):
        if not self.queue:
            return
        # Remove the batched prompts from the queue and update statuses
        prompts, self.queue = dict(self.queue), {}
        for _task_id in prompts:
            self.statuses[_task_id] = "generating"
        # Start a background thread to process the task
        threading.Thread(target=self.generate_text, args=(prompts,)).start()

    def generate_text(self, prompts: Dict[str, str]):
        # Generate Outputs
        outputs = self.llm.generate(list(prompts.values()), self.sampling_params)
        # Process Outputs
        for _task_id, output in zip(prompts, outputs):
            self.responses[_task_id] = output.outputs[0].text
            self.statuses[_task_id] = "complete"

    @fastapi_app.get("/response/{task_id}")
    def get_response(self, task_id: str):
        # Return 202 if it's not done generating yet
        if self.statuses.get(task_id) != "complete":
            return JSONResponse(status_code=202, content={"status": self.statuses.get(task_id, "unknown")})
        return {"response": self.responses[task_id]}

app = DeployTRTEngine.bind("meta-llama/Meta-Llama-3.1-8B-Instruct", 2, 2.0)
You may need to adjust the `@serve.deployment(ray_actor_options={"num_gpus": 1})` decorator to ensure the correct number of GPUs is made available to the app.
You can also modify the last line to adjust the parameters passed to DeployTRTEngine.bind(): the Hugging Face model ID, the number of concurrent requests to batch together (ccr), and the batching window in seconds (batch_time).
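For example, to batch up to four concurrent requests with a half-second window (values purely illustrative):

app = DeployTRTEngine.bind("meta-llama/Meta-Llama-3.1-8B-Instruct", 4, 0.5)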
Deploy the app by running:
serve run deploy_tensorrt_engine:app
Open a new terminal window and start a second SSH connection to the VM by running the same ssh command from the Ori Console that you used at the beginning.
Below is a simple Python script that sends a single prompt to the API, polls every 0.1 seconds until it receives a response, and then prints the response:
import time
import requests

if __name__ == "__main__":
    url = "http://localhost:8000/"  # Ray Serve's default HTTP address; adjust if needed
    prompt = "What is the capital of France?"  # example prompt
    try:
        post_response = requests.post(url, params={"prompt": prompt})
        if post_response.status_code == 200:
            task_id = post_response.json()["task_id"]
            get_url = f"{url}response/{task_id}"
            while True:
                get_response = requests.get(get_url)
                if get_response.status_code == 200:
                    print(get_response.json()["response"])
                    break
                elif get_response.status_code == 202:
                    # Wait for 0.1 seconds before retrying
                    time.sleep(0.1)
                else:
                    print(f"Unexpected status code: {get_response.status_code}")
                    break
    except requests.exceptions.RequestException as e:
        print(f"Request failed: {e}")
To send multiple prompts to be batched together, simply loop the requests.post(url, params={"prompt": prompt}) call for each prompt, collect the task_ids it returns in a list, and then retrieve the responses with a for loop over requests.get(f"{url}response/{task_id}").
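Here is a minimal sketch of that pattern, reusing the url, example prompts, and JSON keys assumed in the scripts above:

import time
import requests

url = "http://localhost:8000/"
prompts = ["What is the capital of France?", "Summarise TensorRT in one sentence."]  # example prompts

# Submit every prompt first so the server can batch them together
task_ids = [requests.post(url, params={"prompt": p}).json()["task_id"] for p in prompts]

# Then poll for each response until it is ready
for task_id in task_ids:
    while True:
        get_response = requests.get(f"{url}response/{task_id}")
        if get_response.status_code == 200:
            print(get_response.json()["response"])
            break
        time.sleep(0.1)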
You can also call the API from outside the VM by replacing localhost in the URL with the VM's IP address and handling any required authentication.
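For instance, from your local machine the submission step would look like this (the IP address is a placeholder, and port 8000 assumes Ray Serve's default HTTP port):

import requests

url = "http://<your-vm-ip>:8000/"  # placeholder: substitute your VM's public IP address
post_response = requests.post(url, params={"prompt": "What is the capital of France?"})
print(post_response.json()["task_id"])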
TensorRT is a powerful tool for accelerating large language model inference, and its deployment on Ori's Virtual Machines provides an efficient and cost-effective solution for high-performance AI applications. While the TensorRT LLM API is still evolving, it already offers impressive features for optimising and managing LLM inference. This tutorial demonstrates how to get started with TensorRT, showcasing its ability to handle batch inference and the potential for real-world applications.
For ML engineers looking to maximise GPU utilisation and minimise latency, Ori's platform combined with TensorRT offers an ideal setup to experiment, deploy, and scale AI models. As NVIDIA continues to refine TensorRT and its LLM capabilities, its adoption will undoubtedly grow, making it a cornerstone for AI inference in production environments.