
Unveiling a New Benchmarking Framework from Ori

Access BeFOri for Llama2 and Llama3 Benchmarks on NVIDIA V100S and H100 Chips


Why a Chip Benchmarking Framework for AI Models?

At Ori, we were frustrated that there were no open source projects to benchmark self-hosted AI model performance across different chips. So, we created our own! We named it “Benchmarking Framework from Ori”, or “BeFOri” for short. (You can check out the GitHub repo here.) Unlike standard applications, bringing cutting-edge AI applications to market requires carefully selecting the hardware that will run the underlying models at scale. Hardware impacts both the performance and the economics of your AI application. Unfortunately, there’s no simple heuristic to base this decision on, but a few things are certain: you need enough RAM to hold billions of parameters in memory, enough bandwidth to get your prompts and inferences to and from users, and enough compute to handle parallel requests at scale.

Long before you’re ready to select the number and type of chips to scale your application on, you must first develop or select an off-the-shelf model, create or acquire data for training, and/or fine-tune the model. Fortunately, there’s an abundance of benchmarks for AI model performance, and leaderboards for sharing results, which, through the power of crowdsourcing, can provide guidance on the relevance, coherence, creativity, and safety of an out-of-the-box model, with the ability to track progress as you train.

After investing all this time and these resources in developing an AI application, why would you leave the selection of the optimal hardware to chance? Trial and error without standardized benchmarks is inefficient at best, and extremely unlikely to lead to an optimal solution. To address this gap in the MLOps cycle, Ori developed an easy-to-use open source solution that brings scientific rigor to the chip selection process, ensuring our customers get the most out of their GPU hour spend.

Introducing GPU Benchmarks for Self-Hosted Models with BeFOri

Our newly developed BeFOri framework enables you to measure four key metrics of GPU chip performance for large language model (LLM) inference:

  • Time to First Token (TTFT)
  • Inter-Token Latency (ITL)
  • End-to-End Latency (ETEL)
  • Token Throughput (TT) 

We chose to focus on LLM inference first because it’s one of the most common applications we see in the market today. In the future, we will add capabilities to the framework to measure LLM training, as well as image generation and categorization tasks.

Supported models

Currently, BeFOri supports:

  • Llama2 (self hosted)
  • Llama3 (self hosted)
  • Several APIs and their compatible models, including OpenAI, Anthropic, Together AI, Hugging Face, and LiteLLM (a capability we inherited from forking LLMPerf from Anyscale).

The benchmarking framework leverages the Ray open source project to parallelize Python, enabling as many concurrent requests as the hardware can support.
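
As a rough illustration of that pattern (a minimal sketch, not BeFOri’s actual implementation; send_request is a hypothetical placeholder), Ray remote tasks can fire several requests concurrently and gather the results:

import ray

ray.init()

@ray.remote
def send_request(prompt: str) -> dict:
    # Placeholder: in a real benchmark this would call the model or API client
    # and record the timing metrics for the request
    return {"prompt": prompt, "ok": True}

# e.g. four concurrent requests dispatched in parallel across Ray workers
prompts = ["Why is the sky blue?"] * 4
results = ray.get([send_request.remote(p) for p in prompts])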

Customizations include the ability to set the average prompt length in tokens, as well as a standard deviation, to better replicate realistic scenarios. Currently, BeFOri selects from a library of Shakespeare’s sonnets to build inputs of the desired length. Finally, you can specify the average and standard deviation of the number of output tokens, which is embedded in the prompt sent to the model.
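
Conceptually, the per-request input length can be drawn from a normal distribution with the configured mean and standard deviation (a simplified sketch of the idea, not the exact sampling code in the repo):

import random

def sample_prompt_length(mean_tokens: int, stddev_tokens: int) -> int:
    # Draw a per-request prompt length, with a floor of one token
    return max(1, round(random.gauss(mean_tokens, stddev_tokens)))

# e.g. prompts averaging 64 tokens with a standard deviation of 8 tokens
lengths = [sample_prompt_length(64, 8) for _ in range(10)]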

Metric: Time to First Token

Turning to the metrics, first we have Time to First Token (TTFT), sometimes called the prefill time, which measures the time that elapses between sending a prompt to an LLM and receiving the first word, or token, back. TTFT essentially tells you how long the user must wait before they start to see the model’s response. This is especially important for real-time interactions, where users start reading the response as soon as the first words appear. For reference, users are accustomed to waiting less than a second for responses in modern web applications. TTFT indicates how long the model takes to process the prompt and make its first inference, so it is very sensitive to the length of the prompt.

Metric: Inter-Token Latency

Inter-Token Latency (ITL), sometimes called time per output token, measures the time that elapses between each token that an LLM generates. ITL is an indicator of how quickly the subsequent words in the response appear after the first word, and it shapes the user’s perception of how responsive the model feels in real-time applications. For reference, many popular models available today boast inter-token latencies of 20-50 milliseconds, while people spend on average about 250 milliseconds per word when reading, so inter-token latencies below that value will be perceived as fast.

Metric: End-to-End Latency

End-to-End Latency (ETEL), sometimes just called latency, combines TTFT and ITL, along with the number of tokens in the response, into a single metric for response speed. ETEL measures the time from when a prompt is sent to the model to when the final token is generated and the response is complete. This metric is especially meaningful for offline workloads and for applications where the user will not see the response until the model has generated the last token.

Metric: Token Throughput

Token Throughput (TT) measures how many tokens the model produces per second. For a single request it is simply the number of output tokens divided by the end-to-end latency; with multiple concurrent requests, it gives the best indication of aggregate performance across those requests. This is most meaningful for real-time applications where multiple users are expected to be calling the model at the same time.
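
To make the relationship between these four metrics concrete, here is a small worked example with illustrative numbers (not benchmark results), using the same formulas as the measurement code below:

# A single request: the 128-token response completes 4.5 s after the prompt
# is sent, and the first token arrives after 0.5 s
ttft = 0.5          # seconds
e2e_latency = 4.5   # seconds
output_tokens = 128

itl = (e2e_latency - ttft) / output_tokens  # 0.03125 s, i.e. ~31 ms per token
tt = output_tokens / e2e_latency            # ~28.4 tokens per second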

Metric Measurements

Below you can see a code snippet demonstrating how these measurements are recorded for self-hosted models (adapted from the BeFOri repo):

import os
import torch
import time
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "meta-llama/Meta-Llama-3-8B"
prompt = "Why is the sky blue?"
max_length = 256
metrics = {}

# Read access token from environment variable
access_token = os.environ.get("HF_ACCESS_TOKEN")

# Instantiate the model
model = AutoModelForCausalLM.from_pretrained(
        model_name, 
        token=access_token
    )

# Set model to evaluation mode
model.eval()

# Instantiate the tokenizer
tokenizer = AutoTokenizer.from_pretrained(
        model_name, 
        token=access_token,
    )

# Tokenize the input prompt
input_ids = tokenizer.encode(prompt, return_tensors="pt")
ttft_start_time = time.monotonic()

# Generate the first token
with torch.no_grad():
    outputs = model.generate(
        inputs=input_ids,
        max_new_tokens=1,
        pad_token_id=tokenizer.eos_token_id,
    )

# Decode the first token
first_token = tokenizer.decode(
        outputs[0][-1], 
        skip_special_tokens=True,
    )
ttft = time.monotonic() - ttft_start_time

# Generate the full response
start_time = time.monotonic()
with torch.no_grad():
    outputs = model.generate(
        inputs=input_ids,
        max_length=max_length,
        pad_token_id=tokenizer.eos_token_id,
    )

# Decode the generated text
generated_text = tokenizer.decode(
    outputs[0], 
    skip_special_tokens=True
)

total_request_time = time.monotonic() - start_time

# Calculate metrics
# Note: generate() returns the prompt tokens followed by the new tokens,
# so subtract the prompt length to count only the generated output tokens
prompt_len = input_ids.shape[1]
output_tokens = outputs.shape[1] - prompt_len
tokens_received = output_tokens

metrics["itl"] = (total_request_time - ttft) / tokens_received
metrics["ttft"] = ttft
metrics["e2e_latency"] = total_request_time
metrics["tt"] = tokens_received / total_request_time
metrics["total_tokens"] = output_tokens + prompt_len
metrics["output_tokens"] = output_tokens
metrics["prompt_tokens"] = prompt_len

Benchmarking Llama2 and Llama3 on NVIDIA V100S and H100 Chips

To demonstrate the power of BeFOri, and to kick off the first of many benchmarking studies to come, we will share our results for benchmarking Llama2 7B and Llama3 8B models on NVIDIA V100S and H100 chips. In addition to comparing the two models and the two chips, we have also compared the results of two different input prompt lengths, 64 and 256 tokens, using the framework. Below we have visualized the results and highlighted some key insights.

Llama2 vs. Llama3 Performance

Earlier this year, Mark Zuckerberg reported that Meta has the equivalent of 600,000 NVIDIA H100s worth of compute (350,000 actual H100s, with the balance made up of other chip types). While we don’t know the exact portion of these resources used to create Llama3, or what it cost, it’s safe to say expectations for its improvement over Llama2 have been high. The model performance benchmarks have shown a strong improvement; however, we found Llama3 did not perform as quickly as Llama2.

Inter-Token Latency (ITL)
Lower is better

With the exception of one concurrent request on 2 X V100S chips, Llama3 8B was on average 7.3% slower than Llama2 7B. For both models, ITL increased with more concurrent requests on a given chip type.

Time to First Token (TTFT)
Lower is better

Llama3 8B performed much better than Llama2 7B on TTFT on 2 X V100S, but performance was about the same on the H100 chip. This points to improvements in input prompt processing on less powerful chips, which is notable given the extra billion parameters in the Llama3 model.

End-to-End Latency (ETEL)
Lower is better

Llama3 8B was slower than Llama2 7B Chat for every configuration we tested, by an average of 31.7% for ETEL.

Token Throughput (TT)
Higher is better

The results for TT are mixed, with the performance of Llama2 and Llama3 falling within one standard deviation of each other for each configuration, with the exception of one concurrent request on 2 X V100S, where Llama2 7B Chat achieved nearly double the throughput of Llama3 8B.

NVIDIA H100 vs 2 X V100S Performance

Today you can rent one NVIDIA H100 on Ori Cloud for $3.24/h and two NVIDIA V100S for $1.91/h, which will give you the following:

Chip | Total VRAM (GB) | vCPUs | RAM (GB) | SSD Storage (GB) | NVMe Storage (GB) | Bandwidth (Gbps)
2 x NVIDIA V100S | 64 | 30 | 90 | 500 | N/A | 4
1 x NVIDIA H100 | 80 | 30 | 380 | 50 | 3840 | 8

Below is a detailed breakdown of the performance improvement you can expect by opting for the premium H100 chip.

Inter-Token Latency (ITL)
Lower is better

With the exception of Llama2 7B Chat with one concurrent request, the H100 chip delivered an average 52.0% decrease in ITL over 2 X V100S.

Time to First Token (TTFT)
Lower is better

For all configurations, the H100 chip decreased TTFT by an average of 40.9%.

End-to-End Latency (ETEL)
Lower is better

For all configurations, the H100 chip decreased ETEL by an average of 53.7%.

Token Throughput (TT)
Higher is better

With the exception of Llama2 7B Chat with one concurrent request, the H100 chip increased token throughput by an average of 0.83 tokens per second.

Other Findings

In addition to the results documented above, we found some other interesting results:

  • The H100 chip can handle a maximum of 12 concurrent requests for Llama2 7B Chat and 11 concurrent requests for Llama3 8B.
  • Increasing the input prompt length from the 64 tokens used in the benchmarks above to 256 tokens resulted in very large standard deviations relative to the mean. This indicates that more requests must be run to shrink the standard deviation relative to the mean before the benchmarking results become meaningful, i.e. until the mean +/- 1 standard deviation intervals of two results no longer overlap (see the sketch below).
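
As a quick illustration of that overlap check (with made-up numbers, not results from this study):

def intervals_overlap(mean_a, std_a, mean_b, std_b):
    # True if the mean +/- 1 standard deviation intervals of two results overlap
    return (mean_a - std_a) <= (mean_b + std_b) and (mean_b - std_b) <= (mean_a + std_a)

# e.g. two ITL measurements in seconds: overlapping intervals suggest more
# requests are needed before the two configurations can be compared
print(intervals_overlap(0.045, 0.020, 0.052, 0.018))  # True -> run more requests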

Next Steps: Get Started with BeFOri

If we’ve convinced you of the power of BeFOri and you’re interested in testing it out for yourself, you can get started with our GitHub repo today! Below you can find a video and step-by-step code to:

  • Set up Llama2 and Llama3 on an Ori GPU
  • Clone BeFOri
  • Start running your own custom benchmarks. 

Tip: Make sure you request Llama2 and/or Llama3 access on Hugging Face first, wait for approval, and replace the environment variable below with your Hugging Face access token.

# SSH into your Ori VM, the ip address can be found in the Ori console
ssh root@##.##.###.### -i /path/to/.ssh/key

# Ensure packages are updated and upgraded
sudo apt update
sudo apt upgrade

# Install dependencies
sudo apt install nvidia-cuda-toolkit
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get install -y nvidia-driver-550-open
sudo apt-get install -y cuda-drivers-550
sudo add-apt-repository ppa:graphics-drivers/ppa --yes
sudo apt install python3-pip

sudo reboot
# rebooting will close connection, wait a few mins and reconnect via SSH
ssh root@##.##.###.### -i /path/to/.ssh/key


# Verify your drivers are installed correctly
nvidia-smi
nvcc --version
cat /proc/driver/nvidia/version

# BeFOri Setup
git clone https://github.com/ori-edge/BeFOri.git
cd ./BeFOri
pip install -r requirements.txt

# Add your new repo to your python path
export PYTHONPATH="/PATH/TO/BeFOri/src/"

# Set your Hugging Face access token as an environment variable
export HF_ACCESS_TOKEN="XXXXXXXXXXXXXX"

# Run a benchmark
python3 token_benchmark_ray.py --model "meta-llama/Meta-Llama-3-8B" --mean-input-tokens 64 --stddev-input-tokens 8 --mean-output-tokens 128 --stddev-output-tokens 8 --max-num-completed-requests 10 --timeout 900 --num-concurrent-requests 2 --results-dir "result_outputs" --llm-api transformers-lib

# Check results
cat ./result_outputs/meta-llama-Meta-Llama-3-8B_64_128_individual_responses.json
cat ./result_outputs/meta-llama-Meta-Llama-3-8B_64_128_summary.json
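
The summary file is plain JSON, so you can also pretty-print it with Python's built-in json.tool module for a more readable view:

python3 -m json.tool ./result_outputs/meta-llama-Meta-Llama-3-8B_64_128_summary.json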

 
