Access BeFOri for Llama2 and Llama3 Benchmarks on NVIDIA V100S and H100 Chips
At Ori, we were frustrated that there were no open-source projects to benchmark self-hosted AI model performance across different chips. So, we created our own! We named it “Benchmarking Framework from Ori”, or “BeFOri” for short. (You can check out the GitHub repo here.) Unlike standard applications, bringing cutting-edge AI applications to market requires careful selection of the hardware that will run the underlying models at scale. Hardware impacts both the performance and the economics of your AI application. Unfortunately, there’s no simple heuristic to base this decision on, but a few things are certain: you need enough RAM to hold billions of parameters in memory, enough bandwidth to get prompts and inferences to and from users, and enough compute to handle parallel requests at scale.
Long before you’re ready to select the number and type of chips to scale your application, you must first develop or select an off-the-shelf model, create or acquire data for training, and/or fine-tune the model. Fortunately, there is an abundance of AI model performance benchmarks and leaderboards for sharing results that, through the power of crowdsourcing, can provide guidance on the relevance, coherence, creativity, and safety of an out-of-the-box model, with the ability to track progress as you train.
After investing all this time and these resources developing an AI application, why would you leave the selection of the optimal hardware to chance? Trial and error without standardized benchmarks is inefficient at best, and extremely unlikely to lead to an optimal solution. To address this gap in the MLOps cycle, Ori developed an easy-to-use, open-source solution that brings scientific rigor to the chip selection process, ensuring our customers get the most out of their GPU-hour spend.
Our newly developed BeFOri framework enables you to measure four key metrics of GPU performance for large language model (LLM) inference: Time to First Token (TTFT), Inter-Token Latency (ITL), End-to-End Latency (ETEL), and Token Throughput (TT).
We chose to focus on LLM inference first because it is one of the most common workloads we see in the market today. In the future we will add capabilities to the framework to measure LLM training, as well as image generation and categorization tasks.
Currently, BeFOri supports:
The benchmarking framework leverages the Ray open-source project to parallelize Python, which enables as many concurrent requests as the hardware can support.
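As a minimal sketch of that pattern (not BeFOri's actual worker code; the `send_request` function and its body are placeholders), Ray remote tasks can fan out concurrent requests like this:

```python
import time

import ray

ray.init()

@ray.remote
def send_request(prompt: str) -> float:
    """Placeholder worker: call your self-hosted model here and return the request latency."""
    start = time.perf_counter()
    # response = my_model_client.generate(prompt)  # hypothetical client, not part of BeFOri
    return time.perf_counter() - start

# Fan out as many concurrent requests as the hardware can support.
futures = [send_request.remote(f"benchmark prompt {i}") for i in range(16)]
latencies = ray.get(futures)
print(f"completed {len(latencies)} concurrent requests")
```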
Customizations include the ability to select the average prompt length in tokens, as well as a standard deviation, to better replicate realistic scenarios. Currently, BeFOri selects from a library of Shakespeare’s sonnets to build inputs of the desired length. Finally, you can specify the average and standard deviation of the number of output tokens, which is embedded in the prompt sent to the model.
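For example, a run configured with a mean prompt length of 256 tokens (standard deviation 32) and a mean output length of 128 tokens (standard deviation 16) might draw per-request targets as in the sketch below; this illustrates the idea rather than BeFOri's actual sampling code, and `sample_token_count` is a name we made up:

```python
import random

def sample_token_count(mean: int, stddev: int) -> int:
    """Draw a per-request token count around the configured mean (illustrative only)."""
    return max(1, int(random.gauss(mean, stddev)))

input_tokens = sample_token_count(256, 32)    # desired prompt length, filled from the sonnet library
output_tokens = sample_token_count(128, 16)   # requested output length, embedded in the prompt
```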
Turning to the metrics, first we have Time to First Token (TTFT), sometimes called the prefill time, which measures the time that elapses between sending a prompt to an LLM and receiving the first word, or token back. TTFT essentially tells you how long the user must wait before they start to see the model’s response. This is especially important for real-time interactions where users will start reading the response as the first words appear. For reference, users are accustomed to waiting less than a second for responses in modern web applications. This is an indicator of how long the model takes to process the prompt and make its first inference, so it is very sensitive to the length of the prompt.
Inter-Token Latency (ITL), sometimes called time per output token, measures the time that elapses between each token an LLM generates. ITL is an indicator of how quickly the subsequent words in the response will appear after the first word, and it shapes the user’s perception of how quickly the model responds in real-time applications. For reference, many popular models available today boast inter-token latencies of 20-50 milliseconds, while people spend about 250 milliseconds per word on average when reading, so inter-token latencies below that threshold will be perceived as fast.
End-to-End Latency (ETEL), sometimes just called latency, combines TTFT and ITL with the number of tokens in the response to provide a single measure of response speed. ETEL measures the time from when a prompt is sent to the model until the final token is generated and the response is complete. This metric is especially meaningful for offline workloads and for applications where the user does not see the response until the model has generated the last token.
Token Throughput (TT) is effectively the number of generated tokens divided by the end-to-end latency for a single concurrent request; with multiple concurrent requests, it gives the best indication of aggregate performance across those requests. This is most meaningful for real-time applications where multiple users are expected to call on the model at the same time.
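As a rough illustration of how these metrics fit together for a single request (the numbers below are made up for the example, not benchmark results):

```python
# Toy numbers, chosen only to show how the metrics relate.
ttft = 0.20               # seconds until the first token arrives
itl = 0.03                # average seconds between subsequent tokens
num_output_tokens = 200

etel = ttft + itl * (num_output_tokens - 1)   # end-to-end latency for the request
throughput = num_output_tokens / etel         # token throughput for the request
print(f"ETEL = {etel:.2f}s, throughput = {throughput:.1f} tokens/s")
```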
Below you can see a code snippet demonstrating how these measurements are recorded for self-hosted models (the full version lives in the BeFOri repo).
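The version below is a simplified sketch rather than a verbatim excerpt: it assumes a Hugging Face `transformers` causal LM streamed through `TextIteratorStreamer`, and the model ID, function name, and returned keys are illustrative.

```python
import time
from threading import Thread

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

MODEL_ID = "meta-llama/Llama-2-7b-chat-hf"  # any causal LM on the Hub works the same way

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

def measure_request(prompt: str, max_new_tokens: int = 256) -> dict:
    """Time a single generation request and derive the four metrics."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True)

    # Run generation in a background thread so each streamed chunk can be timed.
    thread = Thread(
        target=model.generate,
        kwargs=dict(**inputs, streamer=streamer, max_new_tokens=max_new_tokens),
    )
    start = time.perf_counter()
    thread.start()

    chunk_times = []
    for _ in streamer:                         # each chunk roughly corresponds to one new token
        chunk_times.append(time.perf_counter())
    thread.join()
    end = time.perf_counter()

    ttft = chunk_times[0] - start              # Time to First Token
    etel = end - start                         # End-to-End Latency
    num_tokens = len(chunk_times)
    itl = (etel - ttft) / max(num_tokens - 1, 1)   # mean Inter-Token Latency
    throughput = num_tokens / etel             # Token Throughput for this request

    return {"ttft_s": ttft, "itl_s": itl, "etel_s": etel, "throughput_tok_s": throughput}
```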
To demonstrate the power of BeFOri, and to kick off the first of many benchmarking studies to come, we will share our results for benchmarking Llama2 7B and Llama3 8B models on NVIDIA V100S and H100 chips. In addition to comparing the two models and the two chips, we have also compared the results of two different input prompt lengths, 64 and 256 tokens, using the framework. Below we have visualized the results and highlighted some key insights.
Earlier this year Mark Zuckerberg reported that Meta has the equivalent of 600,000 NVIDIA H100s of compute (350,000 actual H100s, with the balance made up of other chip types). While we don’t know exactly what portion of these resources went into creating Llama3, or what it cost, it’s safe to say expectations for its improvement over Llama2 have been high. The model performance benchmarks show a strong improvement; however, we found Llama3 did not perform as quickly as Llama2.
With the exception of one concurrent request on 2 x V100S chips, Llama3 8B was on average 7.3% slower than Llama2 7B. For both models, ITL increased with more concurrent requests on a given chip type.
Llama3 8B performed much better than Llama2 7B for TTFT on 2 x V100S, but performance was about the same on the H100 chip. This indicates an improvement in prompt-processing time on less powerful chips, which is notable given that Llama3 carries roughly a billion more parameters.
Llama3 8B was slower than Llama2 7B Chat for every configuration we tested, by an average of 31.7% for ETEL.
The results for TT are mixed, with the performance of Llama2 and Llama3 falling within one standard deviation of each other for each configuration, with the exception of one concurrent request on 2 x V100S, where Llama2 7B Chat achieved nearly double the throughput of Llama3 8B.
Today you can rent one NVIDIA H100 on Ori Cloud for $3.24/h and two NVIDIA V100S for $1.91/h, which will give you the following:
| Chip | VRAM (GB) | vCPUs | RAM (GB) | SSD Storage (GB) | NVMe Storage (GB) | Bandwidth (Gbps) |
| --- | --- | --- | --- | --- | --- | --- |
| 2 x NVIDIA V100S | 64 | 30 | 90 | 500 | N/A | 4 |
| 1 x NVIDIA H100 | 80 | 30 | 380 | 50 | 3840 | 8 |
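If you want to fold these prices into the benchmark numbers yourself, a simple cost-per-token calculation looks like the sketch below; the throughput values are placeholders, not our measured results, and only the hourly prices come from the table above:

```python
# Illustrative price-performance comparison; throughput values are placeholders.
configs = {
    "2 x V100S": {"price_per_hour": 1.91, "tokens_per_second": 10.0},  # placeholder throughput
    "1 x H100":  {"price_per_hour": 3.24, "tokens_per_second": 20.0},  # placeholder throughput
}

for name, cfg in configs.items():
    tokens_per_hour = cfg["tokens_per_second"] * 3600
    cost_per_million_tokens = cfg["price_per_hour"] / tokens_per_hour * 1_000_000
    print(f"{name}: ${cost_per_million_tokens:.2f} per million output tokens")
```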
Below you can see a detailed breakdown of the improved performance you can expect by opting for the premium H100 chips.
With the exception of Llama2 7B Chat with one concurrent request, the H100 chip delivered an average 52.0% decrease in ITL compared to 2 x V100S.
For all configurations, the H100 chip decreased TTFT by an average of 40.9%.
For all configurations, the H100 chip decreased ETEL by an average of 53.7%.
With the exception of Llama2 7B Chat with one concurrent request, the H100 chip increased token throughput by an average of 0.83 tokens per second.
In addition to the results documented above, we came across a few other interesting findings:
If we’ve convinced you of the power of BeFOri and you’re interested in testing it out for yourself, you can get started with our GitHub repo today! Below you can find a video and step-by-step code to:
Tip: Make sure you request Llama2 and/or Llama3 access on HuggingFace first, wait for the approval, and replace the environment variable below with your HuggingFace Access Token.
```bash
# SSH into your Ori VM; the IP address can be found in the Ori console (username and IP below are placeholders)
ssh ubuntu@<your-vm-ip>

# Ensure packages are updated and upgraded (assumes an Ubuntu image)
sudo apt update && sudo apt upgrade -y

# Add your new repo to your Python path (adjust the clone location to match yours)
export PYTHONPATH="$PYTHONPATH:$HOME/BeFOri"

# Set your Hugging Face access token as an environment variable (variable name is illustrative; check the repo README)
export HF_TOKEN="<your-huggingface-access-token>"

# Check results (output path is illustrative)
cat results/*.json
```