Tutorial

How to run Llama 3.2 11B Vision with Hugging Face Transformers on a cloud GPU

Learn how to deploy Meta’s multimodal Llama 3.2 11B Vision model with Hugging Face Transformers on an Ori cloud GPU and see how it compares with Pixtral.

 
Meta’s Llama foundation models have spurred AI innovation by making AI accessible to numerous developers and startups. The Llama family of large language models (LLMs) is now the leading open-source model family, having been downloaded nearly 350 million times.
 
Over the past few months, vision-language (VL) models such as Mistral’s Pixtral and Alibaba’s Qwen2 VL have added a new dimension to open-source AI innovation by providing the ability to query images, understand them, and combine that understanding with language capabilities.
 
We recently analyzed Pixtral 12B and found its image recognition, text extraction and multimodal processing capabilities impressive. In this article, we’ll explore how to run Meta’s new multimodal model, Llama 3.2 11B Vision, on the Ori GPU cloud, and also see how it compares with Pixtral.
 
Here’s a quick rundown of Llama 3.2 11B Vision specifications:
 
| Specification | Llama 3.2 Multimodal |
| --- | --- |
| Architecture | Natively multimodal (trained on text-image pairs): an adapter combines a pre-trained vision model with the pre-trained Llama 3.1 language model |
| Parameters | Two models, with 90B and 11B parameters respectively |
| Model variants | Instruction-tuned: for visual recognition, image reasoning, captioning, and assistant-like chat with images. Pre-trained: adapted for a variety of image reasoning tasks |
| Sequence length | 128K tokens |
| Licensing | Llama 3.2 Community License: commercial and research use |
Performance benchmark results shared by Meta show the 11B model matching state-of-the-art (SOTA) performance and the 90B model outperforming models such as Claude 3 Haiku and GPT-4o-mini. Although smaller than some of the other recent Llama models, such as Llama 3.1 405B, these medium-sized models pack a punch with their strong multimodal performance.
 
Meta Llama 3.2 performance benchmarks
  Source: Meta AI Blog
 
Llama 3.2 Vision shows potential applications across a multitude of use cases such as Visual Question Answering (VQA), Document VQA, image captioning, text retrieval, visual grounding to pinpoint objects and segments, and Optical Character Recognition (OCR).
 
 

How to use Llama 3.2 with Hugging Face Transformers on an Ori virtual machine

Prerequisites

Create a GPU virtual machine (VM) on Ori Global Cloud. We chose the NVIDIA L40S with 48 GB of VRAM and 90 GiB of system memory for this demo because of its strong inference performance. We chose Ubuntu 22.04 as our OS; Debian is also an option. For this demo we’ll be loading the Llama 3.2 11B Vision Instruct model.
 
 

 

Quick Tip: Use the init script when creating the VM so NVIDIA CUDA drivers, frameworks such as PyTorch or TensorFlow, and Jupyter notebooks are preinstalled for you.

 

 
 
 
Step 1: SSH into your VM, install Python and create a virtual environment
 
apt install python3.10-venv
python3.10 -m venv llama-env
  
Activate the virtual environment
source llama-env/bin/activate
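To confirm the environment is active, you can run a quick check from inside Python (a minimal sketch; the expected path simply reflects the llama-env name used above):

```python
# Run inside `python3` after activating the virtual environment.
import sys

print(sys.prefix)  # should end with .../llama-env while the venv is active
```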
  
 
 
 
 
Quick Tip: Virtual environments help you install packages safely in an isolated environment, without disturbing other projects.
 

 

 
 
Step 2: Install PyTorch if you didn’t use the corresponding init script
pip3 install torch torchvision torchaudio
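Once PyTorch is installed, it’s worth confirming that it can see the GPU before downloading a model of this size. A minimal check:

```python
import torch

print(torch.__version__)
print(torch.cuda.is_available())      # should print True on a CUDA-enabled VM
print(torch.cuda.get_device_name(0))  # e.g. "NVIDIA L40S" on the VM used here
```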

Step 3: Install the Hugging Face CLI and log in

pip install -U "huggingface_hub[cli]"
huggingface-cli login
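Llama 3.2 is a gated model on Hugging Face, so you’ll need an access token from an account that has been granted access to the repository. If you’d rather authenticate from Python than via the CLI, huggingface_hub provides a login helper (the token below is a placeholder):

```python
from huggingface_hub import login

# Paste your own Hugging Face access token; "hf_..." is only a placeholder.
login(token="hf_...")
```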

Step 4: Install Transformers package

pip install --upgrade transformers
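The Mllama architecture used by Llama 3.2 Vision is relatively new in Transformers (it landed in the 4.45 release), so it’s worth checking which version the upgrade installed:

```python
import transformers

print(transformers.__version__)  # Mllama support requires transformers >= 4.45
```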
 

Step 5: Open a Python file with an editor of your choice

vim llama_demo.py

Step 6: Use the following Python code to run the model with Hugging Face Transformers. For our demo, we’ll use the Vision-Instruct model, as it is fine-tuned for vision-related instructions. We fetch the image from a URL and stream its raw bytes into PIL; however, you can also load an image from a local directory, as shown in the commented-out lines.

import requests
import torch
from PIL import Image
from transformers import MllamaForConditionalGeneration, AutoProcessor

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"

model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

url = "https://cdn.pixabay.com/photo/2017/03/07/22/17/cabin-2125387_1280.jpg"
# local_path = "/root/bird.jpg"
image = Image.open(requests.get(url, stream=True).raw)
# image = Image.open(local_path)

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image in detail."}
    ]}
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, input_text, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=28000)
print(processor.decode(output[0]))
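One optional tweak, not part of Meta’s example: processor.decode(output[0]) prints the prompt and special tokens along with the answer. Slicing off the prompt tokens gives cleaner terminal output:

```python
# Decode only the newly generated tokens and skip special tokens.
generated = output[0][inputs["input_ids"].shape[-1]:]
print(processor.decode(generated, skip_special_tokens=True))
```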
 
 
 
 
 
Quick Tip: This code snippet is based on the example provided by Meta. We’ve made a couple of minor edits: set max_new_tokens to 28,000 and added support for local images.

 
 
 
 
 

 

Step 7: Run the Python file; the model output will be shown on the terminal

python3.10 llama_demo.py
 

 Alternative ways to run Llama 3.2 on the cloud

vLLM

Use vLLM to run Llama 3.2 Vision on a cloud GPU or on a local machine. Meta has shared a code example.
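As a rough illustration, here is a minimal offline-inference sketch using vLLM’s Python API. It assumes a recent vLLM build with Mllama support; the prompt format, memory settings, and flags may vary between versions:

```python
import requests
from PIL import Image
from vllm import LLM, SamplingParams

# Assumption: this vLLM version supports the Mllama (Llama 3.2 Vision) architecture.
llm = LLM(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    max_model_len=4096,   # trimmed context so the KV cache fits in 48 GB of VRAM
    max_num_seqs=2,
    enforce_eager=True,
)

url = "https://cdn.pixabay.com/photo/2017/03/07/22/17/cabin-2125387_1280.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# "<|image|>" marks where the image is injected into the prompt.
outputs = llm.generate(
    {
        "prompt": "<|image|><|begin_of_text|>Describe this image in one sentence.",
        "multi_modal_data": {"image": image},
    },
    sampling_params=SamplingParams(max_tokens=128, temperature=0),
)
print(outputs[0].outputs[0].text)
```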
  

Jupyter notebook

Install Jupyter Notebook and spin up a notebook on a port of your choice; we specified 8889 here:
pip3 install notebook
jupyter notebook --port 8889 --allow-root --no-browser --ip=0.0.0.0
The command will return a URL pointing at localhost. Replace the localhost string with your VM’s IP address to access Jupyter via your browser, then create a new notebook and run the Python code in it.
 

How good is Llama 3.2 11B Vision Instruct?

We ran a series of vision-language prompts to put Llama 3.2 through its paces. Here are some results of queries about openly licensed images.
 
Llama 3.2 OCR
Image Source: Pixabay
 
Prompt: What is the price of the service?
Response: The price of the service is 14.00 euros.
 
Prompt: What is the product or service sold?
Response: The product or service sold is a ticket.
 
Prompt: Who is selling the product or service?
Response: The product or service is being sold by the personenschiffahrt, as indicated by the text on the ticket.
 
Prompt: When was this ticket sold?
Response: The ticket was sold on September 13, 2013.
 
Prompt: Based on the information in this image, how much do 10 tickets cost?
Response: To find the cost of 10 tickets, we need to multiply the cost of one ticket by 10.
 
**Cost of one ticket:** 14.00 euros
 
**Cost of 10 tickets:**
14.00 euros x 10 = 140 euros
 
Therefore, 10 tickets cost 140 euros.
 
Llama 3.2 did a stellar job of recognizing the text fields in the image and connecting the dots; it also showed us the stepwise calculation for the price of 10 tickets.
 
Llama 3.2 Image analysis
Image Source: Pixabay
 
Prompt: What is the maximum speed at this intersection?
Response: The maximum speed at this intersection is 20 miles per hour.
 
Although no units are specified in the image, this is the correct response because the sign is in Alaska, USA, where mph is used.
 
Llama 3.2 11B Vision analysis
Image Source: Pixabay
 
Prompt: What is the maximum speed on this road?
Response: The maximum speed on this road is 100 kilometers per hour.
 
Although no units are specified in the image, this is the correct response because the sign is in Chile, which uses km/h.
 
Llama 3.2 Bill of Material analysis
 Image Source: Wikimedia
 
Prompt: How much will this project cost in total?
Response: **Calculating the Total Cost of the Truss Bridge Project**
 
To determine the total cost of the Truss Bridge Project, we need to sum up the costs of all the materials listed in the table.
 
**Cost Breakdown:**
 
*   **3/4 in. x 2 ft. x 4 ft. Pine Plywood (2-Pack)**: $23.76
*   **3 ft. x 1/4 in. Basswood Square Dowel (for sides and other small uses)**: $5.52
*   **3/8 in. x 3/8 in. Poplar Square Dowel (for sides and other small uses)**: $5.88
*   **Syringe100ml and 120cm handy plastic tubing (hydraulic system)**: $17.70
*   **Syringe 100CC/ML (hydraulic system)**: $15.98
*   **8 oz. Carpenter's Wood Glue**: $3.47
 
**Total Cost Calculation:**
 
$23.76 + $5.52 + $5.88 + $17.70 + $15.98 + $3.47 = **$72.31**
 
The total cost of the Truss Bridge Project is **$72.31**.
 
Prompt: How much will this project cost if the price of glue increased by 50%?
Response: To calculate the new cost of the project with a 50% increase in glue price, we first need to find the original cost of the glue. The original cost of the glue is $3.47. To find the new cost, we multiply the original cost by 1.5 (1 + 0.5 = 1.5).
 
**New Cost of Glue = Original Cost x 1.5**
**New Cost of Glue = $3.47 x 1.5**
**New Cost of Glue = $5.21**
 
Now that we have the new cost of the glue, we can calculate the total cost of the project. The total cost of the project is the sum of the costs of all the materials and shipping. The total cost of the project is $72.31. To find the new total cost, we subtract the original cost of the glue from the total cost and then add the new cost of the glue.
 
**New Total Cost = Total Cost - Original Cost of Glue + New Cost of Glue**
**New Total Cost = $72.31 - $3.47 + $5.21**
**New Total Cost = $73.05**
 
The new total cost of the project with a 50% increase in glue price is $73.05.
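A quick check of that arithmetic, using the prices quoted above, suggests the model’s final figure is slightly off:

```python
# Recompute the adjusted total in cents to avoid floating-point surprises.
items = [2376, 552, 588, 1770, 1598, 347]  # the six line items, in cents
total = sum(items)                         # 7231 cents = $72.31, as the model found
new_glue = 521                             # $3.47 * 1.5 = $5.205, the model's $5.21
print((total - 347 + new_glue) / 100)      # 74.05, not the $73.05 given above
```

So the step-by-step reasoning is sound, but the final subtraction-and-addition lands about a dollar short.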

Prompt: Suggest a way to reduce the total cost of the project by 20%
Response: To reduce the total cost of the project by 20%, we need to calculate the total cost and then find ways to lower it. The total cost is $72.31.
 
**Step 1: Calculate the 20% reduction amount**
 
$72.31 * 0.20 = $14.46
 
**Step 2: Find ways to reduce the cost**
 
The cost of the 8 oz. Carpenter's Wood Glue is $3.47. Reducing this item to $2.78 would save $0.69, which is more than the reduction amount needed.
 
**Step 3: Calculate the new total cost**
 
$72.31 - $0.69 = $71.62
 
By reducing the cost of the 8 oz. Carpenter's Wood Glue to $2.78, the total cost of the project can be reduced by 20%.
 
When we asked Llama 3.2 to find ways to reduce the project cost by 20%, we observed a hallucination. Instead of reducing the cost of the whole project, it picked one of the project components and reduced its price by 20%, which was not enough to bring down the cost as a whole. Although Llama 3.2 excelled at OCR and analyzing images, we felt that its processing of the extracted information could be enhanced.

 

Llama 3.2 11B Multimodal analysis

Image Source: Pixabay
 
Prompt: what is the city in this image?
Response: The city in this image is London, England. The prominent building in the background is the Shard, a 72-story skyscraper located in Southwark, London.
 
Using Llama 3.2 for multimodal analysis
 Image Source: Pixabay
 
Prompt: what is the mountain in this image?
Response: The mountain in this image is Kudremukh, a mountain range located in the Western Ghats of Karnataka, India.
 
Llama 3.2 medium model
Image Source: Pixabay
 
Prompt: How many types of berries are there in this image?
Response: There are 3 types of berries in this image: raspberries, blueberries, and red currants.
 
Llama 3.2 small models
Image Source: Pixabay
 
Prompt: Provide a one-line caption for this image
Response: This image depicts a vast expanse of cracked salt-covered terrain, with a mountain range in the background, set against a sky transitioning from blue to orange hues.
 

Llama 3.2 did well in analyzing all the images above; however, it tends to be verbose at times.

Llama 3.2 11B Vision Instruct vs Pixtral 12B

To compare Llama 3.2 with Pixtral, we ran the same prompts that we used for our Pixtral demo blog post and found that Llama 3.2 Vision Instruct was equally good. However, Llama 3.2 showed slightly better prompt adherence when asked to restrict an image description to a single line.

On the other hand, Llama 3.2 often provided more detail than needed, but it described the process of arriving at a response step by step, which could be helpful in many scenarios. Overall, both models are quite similar in how well they perform; here are a few examples to illustrate some of the differences between them.

See how Llama 3.2 compares with Pixtral

 
Llama 3.2 Transformers
 Image Source: Pixabay
 
Prompt: How many people in this image?
Llama 3.2 Response: There are three people in this image. One person is standing and two are sitting.
Pixtral Response: There are three people in the image.
 
Llama 3.2 11B
Image Source: Pixabay
 
Prompt: Translate to English
Llama 3.2 Response: The image is a digital graphic with a flower and leaves, accompanied by German text. The purpose of the image appears to be a welcome message or greeting.
 
* A flower:
        + The flower is white with yellow center.
        + It has green leaves.
        + The flower is placed on the right side of the image.
* Leaves:
        + There are three leaves.
        + They are green and have a slight sheen to them.
        + They are placed behind the flower.
* Text:
        + The text is in white cursive font.
        + It reads "Lieber Gast... Herzlich Willkommen!" which translates to "Dear guest... Welcome!"
        + The text is placed on the left side of the image.
 
Pixtral Response: Sure, here is the translation of the text from German to English:
 
"The text reads:
'Dear Guest...
Youare welcome!'"
 
So, the whole translation is:
"Dear Guest...
You are welcome!"
 
Llama 3.2 Vision
 Image Source: NPS
 
Prompt: Describe this image in one sentence
Llama 3.2 Response: The image shows a mountainous landscape with a river running through it, set against a cloudy blue sky.
Pixtral Response: The image depicts a rugged and rocky landscape dominated by a towering mountain range, with a mix of barren, steeper slopes and greener, gentler foothills bathed in sunlight and shadow. The valleys below appear muddy and wet, suggesting a riverbed or recent water activity, under a partly cloudy sky with patches of blue.
 
In the above example, Llama 3.2 adhered to the prompt more closely and limited its response to a single sentence, whereas Pixtral did not. However, both models are very close in terms of performance and in general excel at multimodal tasks.
 

Chart your own AI reality with Ori

Ori Global Cloud provides flexible AI infrastructure for any team, model, and scale. Backed by top-tier GPUs, performant storage, and AI-ready networking, Ori enables AI startups and businesses to accelerate their AI workloads at scale.
