Tutorial

How to run Pixtral 12B on a cloud GPU with vLLM

Discover how to get Mistral’s new multimodal LLM, Pixtral 12B, up and running on an Ori cloud GPU.

Sep 19, 2024


In our previous tutorial, we explained how to run an AI image generator on an Ori cloud GPU to create images based on a text prompt. This tutorial will demonstrate another paradigm involving images and text – how to use text prompts to understand and interpret images.

Over the past couple of years, Mistral AI has emerged as one of the key providers of open source large language models (LLMs), with its Instruct v0.2 model topping a million downloads. Mistral recently announced its first multimodal model, Pixtral 12B, which is trained to understand both images and text.

Here’s a quick rundown of Pixtral 12B specifications:

Specification	Pixtral 12B
Architecture	12B-parameter multimodal decoder + 400M-parameter vision encoder
Image attributes	Variable sizes and aspect ratios
Sequence length	128k tokens
Licensing	Apache 2.0

One of Pixtral’s key strengths is that it excels at image analysis without compromising on core text capabilities such as instruction following, coding, and math. This native multimodal capability stems from how Pixtral was trained: on interleaved image and text data. Benchmarks shared by Mistral show strong multimodal performance compared with several other leading LLMs.

Source: Mistral Blog

Pixtral enables generative AI to be applied to a wide range of use cases, including the following and many more:

Optical Character Recognition (OCR), which is used in validating financial and legal documents, CAPTCHA security systems, etc.
Image captioning, which is often used in e-commerce, social networking, and many other internet applications.
Information extraction from images, such as identifying number plate data, linking product images to SKUs, etc.
Assistive technology for blind and visually impaired users.
Analyzing and mapping complex technical images.

How to run Pixtral on an Ori virtual machine

Pre-requisites

We’ll use vLLM to serve Pixtral for this demo. Create a GPU virtual machine (VM) on Ori Global Cloud. We chose the NVIDIA H100 PCIe with 80 GB of VRAM and 380 GiB of system memory because vLLM needs 24 GB of VRAM to load the model, plus some more memory for the graph. We recommend using an NVIDIA A100 or H100 GPU to run this model. For the operating system, we chose Ubuntu 22.04; Debian is also an option.
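The 24 GB figure lines up with a simple back-of-the-envelope calculation. The sketch below assumes the weights are stored in a 16-bit format (2 bytes per parameter), which is our assumption rather than something stated above; the function name is purely illustrative. It covers weights only, so the graph, KV cache, and activations need headroom on top.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Estimate the memory needed for model weights alone, in GB (1 GB = 1e9 bytes).

    bytes_per_param=2 assumes 16-bit weights (an assumption, not a measured value).
    """
    return num_params * bytes_per_param / 1e9


DECODER_PARAMS = 12e9          # 12B-parameter multimodal decoder
VISION_ENCODER_PARAMS = 400e6  # 400M-parameter vision encoder

if __name__ == "__main__":
    decoder_gb = weight_memory_gb(DECODER_PARAMS)
    total_gb = weight_memory_gb(DECODER_PARAMS + VISION_ENCODER_PARAMS)
    print(f"Decoder weights: {decoder_gb:.1f} GB")
    print(f"Decoder + vision encoder weights: {total_gb:.1f} GB")
```

On an 80 GB card this leaves plenty of room beyond the weights, which is why the A100 and H100 are comfortable choices.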


Quick Tip

Do not use the init script when creating the VM if you plan to use vLLM, because the libraries in vLLM are compiled with CUDA 12.1, whereas the init script installs the newer CUDA 12.6.

Step 1: Once you SSH into your VM, install CUDA 12.1 and the NVIDIA drivers

CUDA 12.1:

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-1

NVIDIA Drivers

sudo add-apt-repository ppa:graphics-drivers/ppa --yes && sudo apt update && sudo apt install -y nvidia-driver-555

You do not need to install PyTorch separately, as it is bundled with vLLM.
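Since the CUDA version matters here, it can be worth confirming which toolkit release is actually on the machine before going further. The following is a minimal sketch, not part of the original setup: the helper names are hypothetical, and it assumes `nvcc` is on your PATH (after a toolkit install it may instead live under a versioned directory such as /usr/local/cuda-12.1/bin).

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_nvcc_release(nvcc_output: str) -> Optional[str]:
    """Extract the 'release X.Y' version from `nvcc --version` output."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def detect_cuda_release() -> Optional[str]:
    """Return the toolkit release reported by nvcc, or None if nvcc is not on PATH."""
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        return None
    result = subprocess.run(
        [nvcc, "--version"], capture_output=True, text=True, check=False
    )
    return parse_nvcc_release(result.stdout)


if __name__ == "__main__":
    release = detect_cuda_release()
    if release is None:
        print("nvcc not found on PATH; check the toolkit's bin directory")
    elif release == "12.1":
        print("CUDA toolkit 12.1 detected")
    else:
        print(f"Found CUDA toolkit {release}, expected 12.1")
```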

Step 2: If you haven’t done it already, install Python and create a virtual environment

sudo apt install python3.10-venv
python3.10 -m venv pixtral-env

Quick Tip

Virtual environments help you install packages safely in an isolated environment without disturbing other projects.

Activate the virtual environment

source pixtral-env/bin/activate
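If you are unsure whether the activation took effect, a quick check from Python can confirm it. This is a small sketch using only the standard library; the function name is our own.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter is running inside a venv.

    Inside a venv, sys.prefix points at the environment directory while
    sys.base_prefix still points at the base Python installation.
    """
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


if __name__ == "__main__":
    if in_virtualenv():
        print(f"Virtual environment active: {sys.prefix}")
    else:
        print("No virtual environment active; packages would install system-wide")
```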

Step 3: Install the Hugging Face CLI, if you haven’t already, and log in

pip install -U "huggingface_hub[cli]"
huggingface-cli login

Step 4: Install vLLM and upgrade mistral_common to the latest version

pip install vllm
pip install --upgrade mistral_common
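To confirm that both packages landed in the active environment, you can query their installed versions without importing them. The sketch below relies on the standard library's `importlib.metadata`; the helper name is hypothetical, and a value of `None` simply means the package was not found.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, List, Optional


def installed_versions(packages: List[str]) -> Dict[str, Optional[str]]:
    """Map each package name to its installed version, or None if it is missing."""
    found: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions(["vllm", "mistral_common"]).items():
        print(f"{name}: {ver if ver is not None else 'not installed'}")
```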

Step 5: Create a Python (.py) file with the following code to run vLLM from the terminal

from vllm import LLM
from vllm.sampling_params import SamplingParams

model_name = "mistralai/Pixtral-12B-2409"

# Cap the length of the generated response
sampling_params = SamplingParams(max_tokens=8192)

# Load the model using Mistral's tokenizer format
llm = LLM(model=model_name, tokenizer_mode="mistral")

prompt = "Describe this image in detail."
image_url = "https://cdn.pixabay.com/photo/2017/03/07/22/17/cabin-2125387_1280.jpg"

# A single user turn that combines the text prompt with the image URL
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    },
]

outputs = llm.chat(messages, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
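If you plan to try several prompts and images, it helps to wrap the message structure in a small helper. The sketch below only builds the same chat payload used in Step 5; `build_messages` is a hypothetical name of ours, not part of vLLM, and the function itself never calls the model.

```python
from typing import Any, Dict, List


def build_messages(prompt: str, image_urls: List[str]) -> List[Dict[str, Any]]:
    """Build a single user turn with one text part followed by one part per image URL."""
    content: List[Dict[str, Any]] = [{"type": "text", "text": prompt}]
    for url in image_urls:
        content.append({"type": "image_url", "image_url": {"url": url}})
    return [{"role": "user", "content": content}]


if __name__ == "__main__":
    messages = build_messages(
        "Identify the bird in the image.",
        ["https://example.com/bird.jpg"],  # placeholder URL for illustration
    )
    print(messages)
```

The returned list can be passed to `llm.chat(messages, sampling_params=sampling_params)` exactly as in the script. Sending more than one image per prompt may require additional engine configuration, so check the vLLM documentation before relying on it.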

Alternative ways to run Pixtral on the cloud

Mistral-inference

Use the mistral-inference library to run Pixtral on a cloud GPU or on a local machine. Mistral has shared a code example here.

Jupyter notebook

Install Jupyter Notebook and start a notebook server on a port of your choice; we used 8889 here.

pip3 install notebook
jupyter notebook --port 8889 --allow-root --no-browser --ip=0.0.0.0

The command returns a URL that points to the local machine. Replace the localhost string in that URL with your VM’s IP address to access the notebook server from your browser. Create a new notebook and run the Python code in it.
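The host swap is easy to do by hand, but if you script your setup, something like the sketch below works. The function name, the token value, and the IP address (taken from a documentation-only range) are all placeholders.

```python
from urllib.parse import urlsplit


def with_host(url: str, new_host: str) -> str:
    """Return the URL with its hostname replaced, keeping port, path, and query."""
    parts = urlsplit(url)
    netloc = new_host if parts.port is None else f"{new_host}:{parts.port}"
    return parts._replace(netloc=netloc).geturl()


if __name__ == "__main__":
    local_url = "http://localhost:8889/tree?token=abc123"  # placeholder token
    print(with_host(local_url, "203.0.113.10"))            # placeholder VM IP
```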

How good is Pixtral?

We asked Pixtral a variety of questions about images, covering descriptions, dominant colors, and the identification of people, birds, and animals. Pixtral handled these prompts well, apart from the odd occasion where it got a few details wrong, and in only one instance did it fail to keep its answer to the number of sentences specified in the prompt.

Here are some results from queries about open source images:


Prompt: Describe this image in detail

Image Source: Pixabay

Result: The image captures a serene winter scene at a lake. The lake, a deep shade of blue, is the central focus of the image. It's surrounded by a forest of evergreen trees, their branches heavy with snow, creating a beautiful contrast against the clear sky.

On the shore of the lake, there's a small wooden cabin. Its brown roof and white walls stand out against the snowy landscape. The cabin is situated on the right side of the image, with a red boat docked next to it, adding a pop of color to the otherwise monochrome scene.

The sky above is a light blue, with a few clouds scattered across it. The entire scene is a harmonious blend of natural elements, each contributing to the peaceful ambiance of the winter landscape.

Prompt: Identify the bird in the image.


Image Source: NPS

Result: The bird in the image appears to be a ptarmigan. This species is known for its plumage which changes with the seasons. In the winter, ptarmigan birds have white feathers to blend in with the snowy environment, while in the summer, their feathers change to a more brown or gray color to blend in with the rocky and grassy terrain. The bird shown in the image has white feathers, suggesting it is in a winter camouflage mode. Ptarmigans are often found in arctic and alpine regions and are well-adapted to cold environments.

Prompt: Identify the animal in the image.

Image Source: NPS

Result: The animal in the image is a mountain goat.

Prompt: What is the dominant color in this image?

Image Source: NPS

Result: The dominant color in this image is white, which is mostly due to the extensive snow coverage

Prompt: How many people in this image?

Image Source: Pixabay

Result: There are three people in the image.

Prompt: What colored shirt is the person in this image wearing?

Image Source: Pixabay

Result: The person in the image is wearing a black shirt.

Prompt: Translate to English


Image Source: Pixabay

Result: Sure, here is the translation of the text from German to English:

"The text reads:
'Dear Guest...
Youare welcome!'"

So, the whole translation is:
"Dear Guest...
You are welcome!"

Although the model excelled at the majority of the queries, it fell short on the following prompt, where it took two sentences to describe an image instead of the one sentence we asked for.

Prompt: Describe this image in one sentence

Image Source: NPS

Result: The image depicts a rugged and rocky landscape dominated by a towering mountain range, with a mix of barren, steeper slopes and greener, gentler foothills bathed in sunlight and shadow. The valleys below appear muddy and wet, suggesting a riverbed or recent water activity, under a partly cloudy sky with patches of blue.
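If you want to catch this kind of length-constraint miss automatically, a rough sentence counter is enough for a first pass. The sketch below is a naive heuristic of our own (it splits on sentence-ending punctuation followed by whitespace) and will miscount text containing abbreviations such as "e.g."; it is not how we evaluated the results above.

```python
import re


def count_sentences(text: str) -> int:
    """Count sentences by splitting after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return sum(1 for part in parts if part)


if __name__ == "__main__":
    result = (
        "The image depicts a rugged and rocky landscape dominated by a towering "
        "mountain range, with a mix of barren, steeper slopes and greener, gentler "
        "foothills bathed in sunlight and shadow. The valleys below appear muddy "
        "and wet, suggesting a riverbed or recent water activity, under a partly "
        "cloudy sky with patches of blue."
    )
    requested = 1
    actual = count_sentences(result)
    print(f"Requested {requested} sentence(s), got {actual}")
```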

Imagine another AI reality. Build it on Ori.

Ori Global Cloud provides flexible infrastructure for any team, model, and scale. Backed by top-tier GPUs, performant storage, and AI-ready networking, Ori helps AI startups & businesses to:

Deploy AI-optimized GPU instances for training, fine-tuning, and inference workloads.
Scale effortlessly from fractional GPUs to bespoke private clouds.
Reduce your GPU costs by up to 75% compared with hyperscale cloud providers.
