Here’s a quick rundown of Pixtral 12B specifications:
| | Pixtral 12B |
|---|---|
| Architecture | 12B parameter Multimodal Decoder + 400M parameter Vision Encoder |
| Image attributes | Variable size and aspect ratios |
| Sequence length | 128k tokens |
| Licensing | Apache 2.0 |
One of Pixtral’s key strengths is that it doesn’t compromise on key text capabilities such as instruction following, coding, and math, while excelling at image analysis at the same time. The native multimodal capability stems from how Pixtral was trained, with interleaved image and text data. Performance benchmarks shared by Mistral portray strong multimodal performance when compared to several other leading LLMs.
Pixtral enables generative AI to be applied to a wide range of use cases, including the following and many more:
- Optical character recognition (OCR), which is used in validating financial and legal documents, CAPTCHA security systems, etc.
- Image captioning, which is often used in ecommerce, social networking, and many other internet applications.
- Information extraction from images, such as identifying number plate data or linking product images to SKUs.
- Assistive technology for blind and visually impaired users.
- Analyzing and mapping complex technical images.
How to run Pixtral on an Ori virtual machine
Pre-requisites
We’ll be using vLLM to serve Pixtral for this demo. Create a GPU virtual machine (VM) on Ori Global Cloud. We chose the NVIDIA H100 PCIe with 80 GB VRAM and 380 GiB of system memory for this demo because vLLM needs 24 GB VRAM to load the model, and some more memory for the graph. We recommend using an NVIDIA A100 or H100 GPU to run this model. For the operating system, we chose Ubuntu 22.04; Debian is also an option.
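The 24 GB figure is consistent with a back-of-the-envelope estimate of the weight size. The sketch below assumes the weights are held in 16-bit precision (2 bytes per parameter), which is our assumption rather than something stated above; the function name is illustrative.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory needed for model weights, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


# Parameter counts taken from the specification table
decoder_gb = weight_memory_gb(12e9)    # 12B parameter multimodal decoder
encoder_gb = weight_memory_gb(400e6)   # 400M parameter vision encoder

print(f"decoder: {decoder_gb:.1f} GB, vision encoder: {encoder_gb:.1f} GB")
```

This only accounts for the weights. Whatever else vLLM allocates at run time comes on top, which is why a card with headroom beyond 24 GB is the comfortable choice.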
Step 1: Once you SSH into your VM, install CUDA 12.1 and the NVIDIA drivers
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-1
NVIDIA Drivers
sudo add-apt-repository ppa:graphics-drivers/ppa --yes && sudo apt update && sudo apt install -y nvidia-driver-555
You do not need to install PyTorch separately, as it is bundled with vLLM.
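Before moving on, it can help to confirm that the driver sees the GPU. The following is a minimal sketch that shells out to `nvidia-smi` and parses its CSV output; the helper names `parse_gpu_csv` and `list_gpus` are our own, and the script assumes `nvidia-smi` is on the `PATH` once the driver is installed.

```python
import subprocess


def parse_gpu_csv(csv_text: str) -> list[tuple[str, int]]:
    """Parse 'name, memory.total' rows (MiB, no header, no units)."""
    gpus = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split on the last comma so GPU names containing commas survive
        name, mem = (field.strip() for field in line.rsplit(",", 1))
        gpus.append((name, int(mem)))
    return gpus


def list_gpus() -> list[tuple[str, int]]:
    """Ask nvidia-smi for each GPU's name and total memory in MiB."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_csv(result.stdout)
```

Calling `list_gpus()` on the VM should return one entry per GPU. If it raises an error instead, the driver installation is the first thing to revisit.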
Step 2: If you haven’t done it already, install Python and create a virtual environment
sudo apt install -y python3.10-venv
python3.10 -m venv pixtral-env
Activate the virtual environment
source pixtral-env/bin/activate
Step 3: Install the Hugging Face CLI if you haven’t before and log in
pip install -U "huggingface_hub[cli]"
huggingface-cli login
Step 4: Install vLLM and upgrade mistral_common to the latest version
pip install vllm
pip install --upgrade mistral_common
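To confirm that both packages landed in the active virtual environment, a small check like the one below can be run. It is a sketch using only the standard library; `installed_version` is a hypothetical helper name, not part of vLLM or mistral_common.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report on the two packages installed in Step 4
for pkg in ("vllm", "mistral_common"):
    print(pkg, installed_version(pkg) or "not installed")
```

If either line reports "not installed", the most likely cause is that the virtual environment from Step 2 was not activated before running pip.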
Step 5: Create a Python (.py) file with the following code to run vLLM from the terminal
from vllm import LLM
from vllm.sampling_params import SamplingParams

# Pixtral 12B checkpoint on the Hugging Face Hub
model_name = "mistralai/Pixtral-12B-2409"
# Cap the length of the generated response
sampling_params = SamplingParams(max_tokens=8192)
# Load the model with the Mistral tokenizer mode
llm = LLM(model=model_name, tokenizer_mode="mistral")

prompt = "Describe this image in detail."
image_url = "https://cdn.pixabay.com/photo/2017/03/07/22/17/cabin-2125387_1280.jpg"

# A single user turn containing the text prompt and the image URL
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    },
]

outputs = llm.chat(messages, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
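The example above points at an image hosted on the web. For an image stored on the VM, one possible approach is to embed the file as a base64 data URL in the same message format. This is a sketch under the assumption that the installed vLLM version accepts `data:` URLs in the `image_url` field; the helper names are ours.

```python
import base64
import mimetypes
from pathlib import Path


def image_file_to_data_url(path: str) -> str:
    """Encode a local image file as a base64 data URL."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognised image file: {path}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def build_messages(prompt: str, image_url: str) -> list[dict]:
    """Build a single user turn in the message format used above."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        },
    ]
```

The result of `build_messages(prompt, image_file_to_data_url("photo.jpg"))` would then be passed to `llm.chat` in place of `messages`, where `photo.jpg` stands in for any local image path.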
Alternative ways to run Pixtral on the cloud
Mistral-inference
Jupyter notebook
Install Jupyter Notebook and spin up a notebook on a port of your choice; we specified 8889 here.
pip3 install notebook
jupyter notebook --port 8889 --allow-root --no-browser --ip=0.0.0.0
The command will return a URL that points to localhost. Replace the localhost string with your VM’s IP address to access the notebook server from your browser, then create a new notebook and run the Python code in it.
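Swapping the host by hand is easy to get wrong when the URL carries a long token. A small sketch of the substitution follows; the function name, the sample token, and the IP address (taken from a range reserved for documentation) are all illustrative.

```python
from urllib.parse import urlsplit, urlunsplit


def point_url_at_vm(jupyter_url: str, vm_ip: str) -> str:
    """Replace the host in a Jupyter URL, keeping port, path, and token."""
    parts = urlsplit(jupyter_url)
    netloc = f"{vm_ip}:{parts.port}" if parts.port else vm_ip
    return urlunsplit(parts._replace(netloc=netloc))


# Illustrative values only: substitute the URL printed by Jupyter and your VM's IP
print(point_url_at_vm("http://localhost:8889/tree?token=abc123", "203.0.113.10"))
```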
How good is Pixtral?
We threw a variety of questions at Pixtral regarding images, ranging from descriptions to dominant colors and the identification of people, birds, and animals. Pixtral aced these prompts, apart from the odd occasion where it got a few details wrong, and in only one instance did it fail to summarize the result in the number of sentences specified in the prompt.
Here are some results of queries about open source images:
See Pixtral 12B in action
Prompt: Describe this image in detail
Result: The image captures a serene winter scene at a lake. The lake, a deep shade of blue, is the central focus of the image. It's surrounded by a forest of evergreen trees, their branches heavy with snow, creating a beautiful contrast against the clear sky.
On the shore of the lake, there's a small wooden cabin. Its brown roof and white walls stand out against the snowy landscape. The cabin is situated on the right side of the image, with a red boat docked next to it, adding a pop of color to the otherwise monochrome scene.
The sky above is a light blue, with a few clouds scattered across it. The entire scene is a harmonious blend of natural elements, each contributing to the peaceful ambiance of the winter landscape.
Prompt: Identify the bird in the image.
Image Source: NPS
Result: The bird in the image appears to be a ptarmigan. This species is known for its plumage which changes with the seasons. In the winter, ptarmigan birds have white feathers to blend in with the snowy environment, while in the summer, their feathers change to a more brown or gray color to blend in with the rocky and grassy terrain. The bird shown in the image has white feathers, suggesting it is in a winter camouflage mode. Ptarmigans are often found in arctic and alpine regions and are well-adapted to cold environments.
Prompt: Identify the animal in the image.
Result: The animal in the image is a mountain goat.
Prompt: What is the dominant color in this image?
Image Source: NPS
Result: The dominant color in this image is white, which is mostly due to the extensive snow coverage
Prompt: How many people in this image?
Image Source: Pixabay
Result: There are three people in the image.
Prompt: What colored shirt is the person in this image wearing?
Result: The person in the image is wearing a black shirt.
Prompt: Translate to English
Result: Sure, here is the translation of the text from German to English:
"The text reads:
'Dear Guest...
Youare welcome!'"
So, the whole translation is:
"Dear Guest...
You are welcome!"
Although the model excelled with the majority of the queries, it fell short with the following prompt, where it took two sentences to describe an image instead of the one sentence we asked for.
Prompt: Describe this image in one sentence
Image Source: NPS
Result: The image depicts a rugged and rocky landscape dominated by a towering mountain range, with a mix of barren, steeper slopes and greener, gentler foothills bathed in sunlight and shadow. The valleys below appear muddy and wet, suggesting a riverbed or recent water activity, under a partly cloudy sky with patches of blue.
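A length constraint like this one can be checked automatically when running many prompts. The sketch below uses a deliberately naive sentence splitter (it breaks on `.`, `!`, or `?` followed by whitespace, so abbreviations would be miscounted); both function names are hypothetical.

```python
import re


def count_sentences(text: str) -> int:
    """Count sentences by splitting after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return len([part for part in parts if part])


def meets_sentence_limit(text: str, limit: int) -> bool:
    """Return True if the text uses no more than `limit` sentences."""
    return count_sentences(text) <= limit
```

Applied to the result above, `count_sentences` returns 2, so `meets_sentence_limit(result, 1)` is `False`, matching what we observed by reading the output.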
Imagine another AI reality. Build it on Ori.
Ori Global Cloud provides flexible infrastructure for any team, model, and scale. Backed by top-tier GPUs, performant storage, and AI-ready networking, Ori helps AI startups & businesses to: