Tutorial

How to run Pixtral 12B on a cloud GPU with vLLM

Discover how to get Mistral’s new multimodal LLM, Pixtral 12B, up and running on an Ori cloud GPU.

 
In our previous tutorial, we explained how to run an AI image generator on an Ori cloud GPU to create images based on a text prompt. This tutorial will demonstrate another paradigm involving images and text – how to use text prompts to understand and interpret images.
 
Over the past couple of years, Mistral AI has emerged as one of the key providers of open-source large language models (LLMs), with their Instruct v0.2 model topping more than a million downloads. Mistral recently announced their first multimodal model, Pixtral 12B, which is trained to understand both images and text.
 
Here’s a quick rundown of Pixtral 12B specifications:

Architecture: 12B-parameter multimodal decoder + 400M-parameter vision encoder
Image attributes: variable sizes and aspect ratios
Sequence length: 128k tokens
Licensing: Apache 2.0

One of Pixtral’s key strengths is that it doesn’t compromise on key text capabilities such as instruction following, coding, and math, while excelling at image analysis at the same time. The native multimodal capability stems from how Pixtral was trained, with interleaved image and text data. Performance benchmarks shared by Mistral portray strong multimodal performance when compared to several other leading LLMs.
 Pixtral Benchmarks
 Source: Mistral Blog
 
Pixtral enables generative AI to be applied to a wide range of use cases, including the following and many more:
 
  • Optical Character Recognition (OCR) that is used in validating financial and legal documents, Captcha security systems etc.
  • Image captioning which is often used in ecommerce, social networking and many other internet applications.
  • Information extraction from images such as identifying number plate data, linking product images to SKUs etc.
  • Assistive technology for blind and visually impaired users.
  • Analyzing and mapping complex technical images.
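To make the OCR use case a little more concrete, here is a minimal sketch of how such a prompt could be structured for vLLM's chat-style interface (the helper name, prompt wording, and image URL are our own placeholders, not part of Mistral's examples):

```python
# Hedged sketch: build a chat-style payload for an OCR-type prompt.
# The message structure mirrors the multimodal format used with
# vLLM's llm.chat() later in this tutorial; the URL is a placeholder.

def build_ocr_messages(image_url: str) -> list:
    """Return a single-turn chat asking the model to transcribe an image."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Transcribe all text in this image."},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }
    ]

messages = build_ocr_messages("https://example.com/invoice.png")
```

The same payload shape works for the other use cases above; only the text prompt changes.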
 
 

How to run Pixtral on an Ori virtual machine

Prerequisites

We’ll be using the vLLM library to serve Pixtral for this demo. Create a GPU virtual machine (VM) on Ori Global Cloud. We chose the NVIDIA H100 PCIe with 80 GB VRAM and 380 GiB of system memory for this demo because loading the model weights alone takes roughly 24 GB of VRAM, and vLLM needs additional memory for the KV cache and CUDA graphs. We recommend an NVIDIA A100 or H100 GPU to run this model. For the operating system we chose Ubuntu 22.04; Debian is also an option.
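As a rough sanity check on that sizing, the weight memory can be estimated from the parameter counts alone. This is a back-of-the-envelope sketch assuming bf16 (2 bytes per parameter); the KV cache and CUDA graphs need headroom on top of it:

```python
# Back-of-the-envelope VRAM estimate for Pixtral 12B weights in bf16.
# 12B decoder + 400M vision encoder parameters, 2 bytes each.
params = 12e9 + 0.4e9
bytes_per_param = 2  # bf16
weights_gb = params * bytes_per_param / 1e9
print(f"~{weights_gb:.1f} GB for weights alone")  # ~24.8 GB
```

This is why a 24 GB consumer card is not enough headroom, and an 80 GB A100/H100 is comfortable.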

  


Quick Tip
Do not use the init script when creating the VM if you plan to use vLLM: vLLM's prebuilt libraries are compiled against CUDA 12.1, whereas the init script installs the newer CUDA 12.6.

 

 
 
Step 1: Once you SSH into your VM, install CUDA 12.1 and the NVIDIA drivers

CUDA 12.1:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-1
  
NVIDIA drivers:
sudo add-apt-repository ppa:graphics-drivers/ppa --yes && sudo apt update && sudo apt install -y nvidia-driver-555
You don't need to install PyTorch separately; it is bundled as a vLLM dependency.
 

Step 2: If you haven’t done it already, install Python and create a virtual environment

apt install python3.10-venv
python3.10 -m venv pixtral-env
  
 
 
 
Quick Tip
Virtual environments help you install packages safely in an isolated environment without disturbing other projects.
 

 


 Activate the virtual environment

source pixtral-env/bin/activate
 

Step 3: Install the Hugging Face CLI if you haven’t before, and log in

pip install -U "huggingface_hub[cli]"
huggingface-cli login

Step 4: Install vLLM and upgrade to the latest Mistral version

pip install vllm
pip install --upgrade mistral_common
 

 Step 5: Create a Python (.py) file with the following code to run vLLM from the terminal

from vllm import LLM
from vllm.sampling_params import SamplingParams

model_name = "mistralai/Pixtral-12B-2409"
sampling_params = SamplingParams(max_tokens=8192)

llm = LLM(model=model_name, tokenizer_mode="mistral")

prompt = "Describe this image in detail."
image_url = "https://cdn.pixabay.com/photo/2017/03/07/22/17/cabin-2125387_1280.jpg"

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    },
]

outputs = llm.chat(messages, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
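If your test images live on the VM rather than at a public URL, the image_url field can also carry a base64 data URL instead of a remote link. A minimal sketch (the helper name and file path are our own placeholders; confirm data-URL support against your vLLM version):

```python
import base64

# Hedged sketch: encode a local image as a data URL that can be passed
# in the "image_url" field in place of a remote link (path is a placeholder).
def to_data_url(path: str, mime: str = "image/jpeg") -> str:
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime};base64,{encoded}"

# e.g. image_url = to_data_url("/path/to/local/image.jpg")
```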
 
 Alternative ways to run Pixtral on the cloud
 

Mistral-inference

Use the mistral-inference library to run Pixtral on a cloud GPU or on a local machine. Mistral has shared a code example here.
  

Jupyter notebook

Install Jupyter Notebook and spin up a notebook on a port of your choice; we used 8889 here:
pip3 install notebook
jupyter notebook --port 8889 --allow-root --no-browser --ip=0.0.0.0
The command will return a URL pointing to localhost. Replace the localhost string with your VM’s IP address to access Jupyter via your browser, then create a new notebook and run the Python code in it.
 

How good is Pixtral?

We threw a variety of questions at Pixtral about images, ranging from descriptions and dominant colors to identification of people, birds, and animals. Pixtral aced these prompts, apart from the odd occasion where it got a few details wrong; only in one instance did it fail to summarize the result in the number of sentences specified in the prompt.

Here are some results from queries about open-source images:
 

See Pixtral 12B in action

Prompt: Describe this image in detail
 
Testing Mistral's multimodal LLM Pixtral
Image Source: Pixabay
Result: The image captures a serene winter scene at a lake. The lake, a deep shade of blue, is the central focus of the image. It's surrounded by a forest of evergreen trees, their branches heavy with snow, creating a beautiful contrast against the clear sky.
 
On the shore of the lake, there's a small wooden cabin. Its brown roof and white walls stand out against the snowy landscape. The cabin is situated on the right side of the image, with a red boat docked next to it, adding a pop of color to the otherwise monochrome scene.
 
The sky above is a light blue, with a few clouds scattered across it. The entire scene is a harmonious blend of natural elements, each contributing to the peaceful ambiance of the winter landscape.
 
Prompt: Identify the bird in the image.
 
Pixtral multimodal LLM testing with vLLM on a cloud GPU
Image Source: NPS
Result: The bird in the image appears to be a ptarmigan. This species is known for its plumage which changes with the seasons. In the winter, ptarmigan birds have white feathers to blend in with the snowy environment, while in the summer, their feathers change to a more brown or gray color to blend in with the rocky and grassy terrain. The bird shown in the image has white feathers, suggesting it is in a winter camouflage mode. Ptarmigans are often found in arctic and alpine regions and are well-adapted to cold environments.

Prompt: Identify the animal in the image.
 
How is Pixtral model
Image Source: NPS
Result: The animal in the image is a mountain goat. 
 
Prompt: What is the dominant color in this image?

Pixtral multimodal LLM testing with vLLM on a cloud GPU
 Image Source: NPS
Result: The dominant color in this image is white, which is mostly due to the extensive snow coverage
 
Prompt: How many people in this image?
 
vllm list models
 Image Source: Pixabay
Result: There are three people in the image.
 
Prompt: What colored shirt is the person in this image wearing?
 
Using vLLM to run models on a GPU
 Image Source: Pixabay
Result: The person in the image is wearing a black shirt.
 
Prompt: Translate to English
 
Mistral multimodal LLM testing with vLLM on a cloud GPU
Image Source: Pixabay
Result:  Sure, here is the translation of the text from German to English:

"The text reads:
'Dear Guest...
You are welcome!'"

So, the whole translation is:
"Dear Guest...
You are welcome!"

Although the model excelled at the majority of the queries, it did fall short on the following prompt, where it took two sentences to describe an image instead of the one sentence we asked for.
 
Prompt: Describe this image in one sentence
 
Pixtral multimodal LLM testing with vLLM on a cloud GPU
 Image Source: NPS
Result: The image depicts a rugged and rocky landscape dominated by a towering mountain range, with a mix of barren, steeper slopes and greener, gentler foothills bathed in sunlight and shadow. The valleys below appear muddy and wet, suggesting a riverbed or recent water activity, under a partly cloudy sky with patches of blue.
 
Imagine another AI reality. Build it on Ori.
 
Ori Global Cloud provides flexible infrastructure for any team, model, and scale. Backed by top-tier GPUs, performant storage, and AI-ready networking, Ori helps AI startups and businesses accelerate their AI workloads at scale.
 


Join the new class of AI infrastructure! 

Build a modern GPU cloud with Ori to accelerate your AI workloads at scale.