Over the past few months, vision-language (VL) models such as Mistral's Pixtral and Qwen2-VL have added a new dimension to open-source AI innovation by making it possible to query images, understand them, and combine that understanding with language capabilities.
We recently analyzed Pixtral 12B and found its image recognition, text extraction, and multimodal processing capabilities impressive. In this article, we'll explore how to run Meta's new multimodal model, Llama 3.2 11B with Vision, on the Ori GPU cloud and see how it compares with Pixtral.
Here’s a quick rundown of Llama 3.2 11B Vision specifications:
| | Llama 3.2 Multimodal |
| --- | --- |
| Architecture | Natively multimodal: an adapter trained on image-text pairs combines a pre-trained image encoder with the pre-trained Llama 3.1 language model |
| Parameters | Two models, with 90B and 11B parameters respectively |
| Model Variants | Instruction tuned: for visual recognition, image reasoning, captioning, and assistant-like chat with images. Pre-trained: adaptable to a variety of image reasoning tasks |
| Sequence length | 128K tokens |
| Licensing | Llama 3.2 Community License |
Performance benchmark results shared by Meta show the 11B model matching state-of-the-art (SOTA) performance and the 90B model outperforming models such as Claude 3 Haiku and GPT-4o mini. Although smaller than some other recent Llama models, such as Llama 3.1 405B, these medium-sized models pack a punch with their strong multimodal performance.
How to use Llama 3.2 with Hugging Face Transformers on an Ori virtual machine
Prerequisites
Create a GPU virtual machine (VM) on Ori Global Cloud. We chose the NVIDIA L40S with 48 GB VRAM and 90 GiB of system memory for this demo because of its strong inference performance. We chose Ubuntu 22.04 as our OS; Debian is also an option. For this demo we'll be loading the Llama 3.2 11B Vision Instruct model.
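As a rough back-of-the-envelope estimate (ours, not Meta's), the 11B weights in bfloat16 take about 22 GB, so a 48 GB card like the L40S leaves comfortable headroom for activations and the KV cache:
# Rough weight-memory estimate for Llama 3.2 11B in bfloat16 (2 bytes per parameter)
params = 11e9
print(f"~{params * 2 / 1e9:.0f} GB of weights")  # ~22 GB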
Step 1: Set up a Python virtual environment
apt install python3.10-venv
python3.10 -m venv llama-env
Activate the virtual environment
source llama-env/bin/activate
Step 2: Install PyTorch if you didn't use the corresponding init script
pip3 install torch torchvision torchaudio
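Before moving on, it's worth confirming that PyTorch can see the GPU. A quick sanity check you can run in a Python shell:
import torch

# Confirm CUDA is available and which device PyTorch will use
print(torch.cuda.is_available())          # expect True on the GPU VM
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. NVIDIA L40S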
Step 3: Install the Hugging Face CLI and log in
pip install -U "huggingface_hub[cli]"
huggingface-cli login
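Note that the meta-llama repositories on Hugging Face are gated, so you'll need to accept the Llama 3.2 license on the model page and create an access token before the download will work. If you prefer to authenticate from Python rather than the CLI, huggingface_hub exposes a login() helper; the token below is a placeholder for your own:
from huggingface_hub import login

# Placeholder token; substitute your own from huggingface.co/settings/tokens
login(token="hf_xxx")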
Step 4: Install Transformers package
pip install --upgrade transformers
Step 5: Open a Python file with an editor of your choice
vim llama_demo.py
Step 6: Use the following Python code to run the model with Hugging Face Transformers. For our demo, we'll use the Vision-Instruct model, as it is fine-tuned for vision-related instructions. We pass an image URL and fetch the raw image stream with requests; you can also load an image from a local directory, as shown in the commented-out lines.
import requests
import torch
from PIL import Image
from transformers import MllamaForConditionalGeneration, AutoProcessor

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"

# Load the model in bfloat16 and let Accelerate place it on the GPU
model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

# Fetch the image from a URL; to use a local file instead, uncomment the local_path lines
url = "https://cdn.pixabay.com/photo/2017/03/07/22/17/cabin-2125387_1280.jpg"
# local_path = "/root/bird.jpg"
image = Image.open(requests.get(url, stream=True).raw)
# image = Image.open(local_path)

# Build a chat-style prompt that pairs the image with a text instruction
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image in detail."}
    ]}
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, input_text, return_tensors="pt").to(model.device)

# Generate and print the response (generous token budget)
output = model.generate(**inputs, max_new_tokens=28000)
print(processor.decode(output[0]))
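The decode call above prints the full sequence, prompt and special tokens included. If you'd rather see only the model's answer, a small tweak (our own addition, using standard Transformers decoding options) trims the prompt and skips special tokens:
# Print only the newly generated tokens, without the prompt or special tokens
prompt_len = inputs["input_ids"].shape[-1]
print(processor.decode(output[0][prompt_len:], skip_special_tokens=True))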
Step 7: Run the Python file and the model output will be printed to the terminal
python3.10 llama_demo.py
Alternative ways to run Llama 3.2 on the cloud
vLLM
Use vLLM to run Llama 3.2 on a cloud GPU or on a local machine. Meta has shared a code example here.
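For illustration, here's a minimal offline-inference sketch using vLLM's Python API. It assumes a vLLM version that supports the Llama 3.2 (Mllama) architecture; prompt formatting and arguments may vary between releases, so treat it as a starting point rather than the official example:
import requests
from PIL import Image
from vllm import LLM, SamplingParams

# Load the 11B Vision Instruct model; bf16 weights fit comfortably on a 48 GB L40S
llm = LLM(model="meta-llama/Llama-3.2-11B-Vision-Instruct", max_model_len=8192)

image = Image.open(requests.get(
    "https://cdn.pixabay.com/photo/2017/03/07/22/17/cabin-2125387_1280.jpg",
    stream=True).raw)

# The <|image|> token marks where the image is inserted into the prompt
prompt = "<|image|><|begin_of_text|>Describe this image in detail."
outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(max_tokens=256),
)
print(outputs[0].outputs[0].text)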
Jupyter notebook
Install Jupyter Notebook and spin up a notebook on a port of your choice; we specified 8889 here
pip3 install notebook
jupyter notebook --port 8889 --allow-root --no-browser --ip=0.0.0.0
The command will return a URL pointing to localhost. You can replace the localhost string with your VM's IP to access Jupyter via your browser. Create a new notebook and run the Python code above in it.
How good is Llama 3.2 11B Vision Instruct?
We ran a series of visual language prompts to put Llama 3.2 through its paces. Here are some results of queries about open-source images:
Prompt: What is the price of the service?
Response: The price of the service is 14.00 euros.
Prompt: What is the product or service sold?
Response: The product or service sold is a ticket.
Prompt: Who is selling the product or service?
Response: The product or service is being sold by the personenschiffahrt, as indicated by the text on the ticket.
Prompt: When was this ticket sold?
Response: The ticket was sold on September 13, 2013.
Prompt: Based on the information in this image, how much do 10 tickets cost?
Response: To find the cost of 10 tickets, we need to multiply the cost of one ticket by 10.
**Cost of one ticket:** 14.00 euros
**Cost of 10 tickets:**
14.00 euros x 10 = 140 euros
Therefore, 10 tickets cost 140 euros.
Llama 3.2 did a stellar job of recognizing the text fields in the image and connecting the dots; it also showed us the stepwise calculation for the price of 10 tickets.
Prompt: What is the maximum speed at this intersection?
Response: The maximum speed at this intersection is 20 miles per hour.
Although no units are specified in the image, this is the correct response because the sign is in Alaska, USA, where mph is used.
Prompt: What is the maximum speed on this road?
Response: The maximum speed on this road is 100 kilometers per hour.
Although no units are specified in the image, this is the correct response because the road is in Chile, which uses km/h.
Prompt: How much will this project cost in total?
Response: **Calculating the Total Cost of the Truss Bridge Project**
To determine the total cost of the Truss Bridge Project, we need to sum up the costs of all the materials listed in the table.
**Cost Breakdown:**
* **3/4 in. x 2 ft. x 4 ft. Pine Plywood (2-Pack)**: $23.76
* **3 ft. x 1/4 in. Basswood Square Dowel (for sides and other small uses)**: $5.52
* **3/8 in. x 3/8 in. Poplar Square Dowel (for sides and other small uses)**: $5.88
* **Syringe100ml and 120cm handy plastic tubing (hydraulic system)**: $17.70
* **Syringe 100CC/ML (hydraulic system)**: $15.98
* **8 oz. Carpenter's Wood Glue**: $3.47
**Total Cost Calculation:**
$23.76 + $5.52 + $5.88 + $17.70 + $15.98 + $3.47 = **$72.31**
The total cost of the Truss Bridge Project is **$72.31**.
Prompt: How much will this project cost if the price of glue increased by 50%?
Response: To calculate the new cost of the project with a 50% increase in glue price, we first need to find the original cost of the glue. The original cost of the glue is $3.47. To find the new cost, we multiply the original cost by 1.5 (1 + 0.5 = 1.5).
**New Cost of Glue = Original Cost x 1.5**
**New Cost of Glue = $3.47 x 1.5**
**New Cost of Glue = $5.21**
Now that we have the new cost of the glue, we can calculate the total cost of the project. The total cost of the project is the sum of the costs of all the materials and shipping. The total cost of the project is $72.31. To find the new total cost, we subtract the original cost of the glue from the total cost and then add the new cost of the glue.
**New Total Cost = Total Cost - Original Cost of Glue + New Cost of Glue**
**New Total Cost = $72.31 - $3.47 + $5.21**
**New Total Cost = $73.05**
The new total cost of the project with a 50% increase in glue price is $73.05.
Prompt: Suggest a way to reduce the total cost of the project by 20%
Response: To reduce the total cost of the project by 20%, we need to calculate the total cost and then find ways to lower it. The total cost is $72.31.
**Step 1: Calculate the 20% reduction amount**
$72.31 * 0.20 = $14.46
**Step 2: Find ways to reduce the cost**
The cost of the 8 oz. Carpenter's Wood Glue is $3.47. Reducing this item to $2.78 would save $0.69, which is more than the reduction amount needed.
**Step 3: Calculate the new total cost**
$72.31 - $0.69 = $71.62
By reducing the cost of the 8 oz. Carpenter's Wood Glue to $2.78, the total cost of the project can be reduced by 20%.
When we asked Llama 3.2 to find ways to reduce the project cost by 20%, we observed a hallucination. Instead of reducing the cost of the whole project by 20% (about $14.46, which would bring the total down to roughly $57.85), it picked one of the project components and reduced its price by 20%, which was nowhere near enough to bring down the cost as a whole. Although Llama 3.2 excelled in OCR and analyzing images, we felt its reasoning over the extracted information could be improved.
Prompt: what is the city in this image?
Response: The city in this image is London, England. The prominent building in the background is the Shard, a 72-story skyscraper located in Southwark, London.
Prompt: what is the mountain in this image?
Response: The mountain in this image is Kudremukh, a mountain range located in the Western Ghats of Karnataka, India.
Prompt: How many types of berries are there in this image?
Response: There are 3 types of berries in this image: raspberries, blueberries, and red currants.
Prompt: Provide a one-line caption for this image
Response: This image depicts a vast expanse of cracked salt-covered terrain, with a mountain range in the background, set against a sky transitioning from blue to orange hues.
Llama 3.2 did well in analyzing all the images above, although it tends to be verbose at times.
Llama 3.2 11B Vision Instruct vs Pixtral 12B
To compare Llama 3.2 with Pixtral, we ran the same prompts we used for our Pixtral demo blog post and found that Llama 3.2 Vision Instruct was equally good. Llama 3.2 showed slightly better prompt adherence when asked to restrict an image description to a single line.
On the other hand, Llama 3.2 often provided more detail than needed, though it laid out the steps it took to arrive at a response, which could be helpful in many scenarios. Overall, the two models perform quite similarly; here are a few examples that illustrate some of the differences between them.
See how Llama 3.2 compares with Pixtral
Prompt: How many people in this image?
Llama 3.2 Response: There are three people in this image. One person is standing and two are sitting.
Pixtral Response: There are three people in the image.
Prompt: Translate to English
Llama 3.2 Response: The image is a digital graphic with a flower and leaves, accompanied by German text. The purpose of the image appears to be a welcome message or greeting.
* A flower:
+ The flower is white with yellow center.
+ It has green leaves.
+ The flower is placed on the right side of the image.
* Leaves:
+ There are three leaves.
+ They are green and have a slight sheen to them.
+ They are placed behind the flower.
* Text:
+ The text is in white cursive font.
+ It reads "Lieber Gast... Herzlich Willkommen!" which translates to "Dear guest... Welcome!"
+ The text is placed on the left side of the image.
Pixtral Response: Sure, here is the translation of the text from German to English:
"The text reads:
'Dear Guest...
You are welcome!'"
So, the whole translation is:
"Dear Guest...
You are welcome!"
Prompt: Describe this image in one sentence
Llama 3.2 Response: The image shows a mountainous landscape with a river running through it, set against a cloudy blue sky.
Pixtral Response: The image depicts a rugged and rocky landscape dominated by a towering mountain range, with a mix of barren, steeper slopes and greener, gentler foothills bathed in sunlight and shadow. The valleys below appear muddy and wet, suggesting a riverbed or recent water activity, under a partly cloudy sky with patches of blue.
In the above example, Llama 3.2 adhered to the prompt more closely and limited its response to a single sentence, whereas Pixtral couldn't. However, both models are very close in terms of performance and generally excel at multimodal tasks.
Chart your own AI reality with Ori
Ori Global Cloud provides flexible AI infrastructure for any team, model, and scale. Backed by top-tier GPUs, performant storage, and AI-ready networking, Ori enables AI startups and businesses to: