Tutorial

How to run Llama 4 on a cloud GPU

Learn how to deploy Meta’s multimodal Llama 4 models with Hugging Face Transformers and vLLM on an Ori cloud GPU, and check our comparison of Llama 4 vs Llama 3.2.

Meta’s Llama foundation models have accelerated AI innovation, empowering countless developers and startups with unprecedented access. With over a billion downloads, the Llama series has emerged as the most widely adopted open-source AI model ecosystem. In this article, we’ll demonstrate how to deploy Meta’s new generation of foundation models, Llama 4, on the Ori AI cloud, and provide a comparison with its predecessor, the multimodal Llama 3.2.
 
Here’s a brief overview of Llama 4’s key specifications:
 
 
| Specification | Llama 4 Scout and Maverick |
| :--- | :--- |
| Architecture | Mixture-of-Experts (MoE), natively multimodal (early fusion integrates text and visual tokens during pre-training) |
| Model variants | Scout (16 experts, Instruction-tuned and Base versions); Maverick (128 experts, Instruction-tuned and Base versions) |
| Parameters | Scout: 17B active parameters out of ~109B total; Maverick: 17B active parameters out of ~400B total |
| Capabilities | Instruction-tuned models optimized for code generation, visual reasoning, document summarization, and multimodal assistant tasks |
| Sequence length | Scout: 10 million tokens (16E Instruct), 256K (16E); Maverick: 1 million tokens (128E Instruct), 256K (128E) |
| Licensing | Llama 4 Community License: commercial and research use |
Meta’s Llama 4 models post strong results across various AI benchmarks when compared with leading models such as OpenAI’s GPT-4o and Google’s Gemini 2.0 Flash.
 
Llama 4 utilizes a Mixture-of-Experts (MoE) architecture to enhance efficiency and scalability. Unlike dense models like Llama 3, where all parameters are activated for each token, Llama 4 activates only a subset of specialized "experts" per token. For instance, Llama 4 Scout comprises 16 experts with a total of 109 billion parameters but activates only 17 billion parameters per token during inference, reducing computational costs significantly while maintaining high performance. This design allows Llama 4 to achieve comparable or superior results to larger dense models, such as OpenAI's GPT-4o, but with lower inference costs and improved scalability. ​
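To make the routing idea concrete, here is a minimal, illustrative sketch of top-1 expert routing in PyTorch. This is not Meta’s implementation: the layer sizes, the number of experts, and the top-1 routing choice are assumptions picked purely to show how only a subset of experts runs for each token.

import torch
import torch.nn as nn

class ToyMoELayer(nn.Module):
    """Illustrative MoE layer: each token is routed to a single expert."""
    def __init__(self, d_model=64, n_experts=16):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)   # scores every expert for every token
        self.experts = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_experts))

    def forward(self, x):                              # x: (num_tokens, d_model)
        probs = self.router(x).softmax(dim=-1)         # routing probabilities
        top_prob, top_idx = probs.max(dim=-1)          # pick the top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e                        # tokens assigned to expert e
            if mask.any():                             # only the chosen experts do any work
                out[mask] = top_prob[mask].unsqueeze(-1) * expert(x[mask])
        return out

layer = ToyMoELayer()
print(layer(torch.randn(8, 64)).shape)                 # torch.Size([8, 64])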
 
Llama 4 Scout features a 10-million-token context window, significantly surpassing GPT-4o’s 128K tokens and Gemini 2.0 Flash’s 1 million tokens. This extensive context capability enables Scout to handle long-context tasks effectively.
 
Llama 4 Maverick, with its 17 billion active parameters, excels in multimodal reasoning and coding tasks. These results highlight Llama 4's advancements in handling complex tasks and extended contexts, positioning it as a strong contender in the AI model landscape.
 

Meta Llama 4 benchmark performance numbers

Source: Meta
 


How to run Llama 4 Scout with Hugging Face Transformers on an Ori virtual machine

Prerequisites

Create a GPU virtual machine (VM) on Ori Global Cloud. We chose 4x NVIDIA H100 SXM GPUs (80 GB VRAM each) and 90 GiB of system memory for this demo; however, 8x H100s are recommended to reach the full context window. We chose Ubuntu 22.04 as our OS, although Debian is also an option. For this demo we’ll be loading the Llama 4 Scout Instruct model.
 
  

Quick Tip
Use the init script when creating the VM so that NVIDIA CUDA drivers and frameworks such as PyTorch or TensorFlow are preinstalled for you.

 

 
Step 1: SSH into your VM, install Python and create a virtual environment
 
 
apt install python3.10-venv
python3.10 -m venv llama-env
  
Activate the virtual environment
source llama-env/bin/activate
 
Step 2: Install PyTorch if you didn’t use the corresponding init script
 
pip3 install torch torchvision torchaudio
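Before moving on, it’s worth confirming that PyTorch can actually see all of the GPUs on the VM; a quick check like the one below avoids surprises later when device_map="auto" shards the model across devices.

import torch

# Confirm that PyTorch detects the GPUs on the VM
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))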
 
Step 3: Install the Hugging Face CLI and log in
 
pip install -U "huggingface_hub[cli]" huggingface-cli login
 
Step 4: Install Transformers and other dependencies
 
pip install transformers==4.51.0
pip install accelerate
pip install hf_xet
pip install auto-gptq bitsandbytes
  

 

 

Quick Tip

We ran into issues using the 4.51.1 version of Transformers, which we did not observe in 4.51.0.

 

 

Step 5: Spin up a Jupyter server and open it in your browser using your VM’s IP address
  
pip install notebook
jupyter notebook --allow-root --no-browser --ip=0.0.0.0
 
Step 6: Create a notebook and load the model with this script

from transformers import AutoProcessor, Llama4ForConditionalGeneration
import torch

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

processor = AutoProcessor.from_pretrained(model_id)
model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    attn_implementation="sdpa",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
 
Note: We ran into errors using Flex Attention; as an alternative we chose SDPA, which ran quite smoothly. This step will take a while since the weights are being downloaded and loaded for the first time.
 

Llama 4 Scout Model Loading Python Code

Here’s a snapshot of our memory usage

Llama 4 System Specifications
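Before moving on to images, a quick text-only prompt is a handy sanity check that the shards loaded correctly. This is a minimal sketch that reuses the model and processor from the step above and follows the same chat-template pattern as the multimodal example below.

# Text-only sanity check, reusing `model` and `processor` from the loading step
messages = [
    {"role": "user", "content": [{"type": "text", "text": "In one sentence, what is a Mixture-of-Experts model?"}]},
]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:])[0])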

Let’s analyze two images, one of a beach and the other of a mountain, and see how they are similar and how they differ. This code snippet is based on the example provided by Hugging Face here.

 

url1 = "https://cdn.pixabay.com/photo/2019/03/02/18/43/beach-4030372_1280.jpg" url2 = "https://cdn.pixabay.com/photo/2021/06/28/04/46/mountain-6370590_1280.jpg" messages = [ { "role": "user", "content": [ {"type": "image", "url": url1}, {"type": "image", "url": url2}, {"type": "text", "text": "Can you describe how these two images are similar, and how they differ?"}, ] }, ] inputs = processor.apply_chat_template( messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt", ).to(model.device) outputs = model.generate( **inputs, max_new_tokens=1024, ) response = processor.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:])[0] print(response)
 
Here’s the response from Llama 4 Scout:
 

Llama 4 Prompt Response

This example can be scaled to compare multiple images for visual-intensive workloads.
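For example, a third image is just another image entry in the same message. In the sketch below, url3 is a placeholder for whichever additional image you want to include.

# Extending the same pattern to three images (url3 is a placeholder URL)
url3 = "https://example.com/third-image.jpg"
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": url1},
            {"type": "image", "url": url2},
            {"type": "image", "url": url3},
            {"type": "text", "text": "Compare these three images: what do they have in common, and what sets each apart?"},
        ],
    },
]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1024)
print(processor.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:])[0])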
 

Trying out Llama 4’s Industry-leading Context Length with vLLM

One of Llama 4’s standout features is its extremely long context window (10 million tokens for Scout). Since we used only 4 GPUs, we couldn’t try the full context length; we managed about 200,000 tokens, which is still more than most other models offer.
 
Step 1: Install vLLM
 
pip install -U vllm
pip install flashinfer
 
Step 2: Run this script (based on a Llama cookbook) in a notebook or from the terminal
 
import os
from vllm import LLM, SamplingParams

# Read in our example file
def read_file_to_string(file_path):
    try:
        with open(file_path, "r") as file:
            content = file.read()
        return content
    except FileNotFoundError:
        print(f"File {file_path} not found.")
        return "File_Path_Error"

# Please remember to set `attn_temperature_tuning` to `True` for best long-context performance
def load_llm():
    llm = LLM(
        model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
        enforce_eager=False,
        tensor_parallel_size=4,
        max_model_len=200000,
        override_generation_config={
            "attn_temperature_tuning": True,
        },
    )
    return llm

llm = load_llm()

file_content = read_file_to_string("/root/book.txt")
PROMPT = f"""Write a couple of paragraphs about Anne's house and her environs\n\n\n{file_content} """

print("Showing long content")
if len(file_content) > 100:
    print(file_content[:100])
else:
    print(file_content)

conversations = [
    [
        {
            "role": "user",
            "content": PROMPT
        }
    ],
]

# Create a sampling params object.
sampling_params = SamplingParams(temperature=1, top_p=0.95, max_tokens=4000)

# Remember to use the `chat` function and not `generate` :)
outputs = llm.chat(conversations, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f" Generated text: {generated_text}")
 
We tested the needle-in-a-haystack approach with a classic novel, Anne of Green Gables (word count of roughly 114,000).
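As a sanity check before running the script, you can count how many tokens the novel occupies to confirm the prompt fits comfortably within max_model_len=200000. This is a rough sketch using the model’s tokenizer via Transformers; it assumes the book is saved at /root/book.txt as in the script above.

from transformers import AutoTokenizer

# Rough token count for the book, to confirm it fits under max_model_len=200000
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-4-Scout-17B-16E-Instruct")
with open("/root/book.txt") as f:
    text = f.read()
print(f"{len(tokenizer.encode(text))} tokens in the book")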
 
Prompt: Write a couple of paragraphs about Anne's house and her environs
Llama 4 Response:
 
Llama 4 Context Window Test and Sequence Length


How good is Llama 4 Scout, and how does it compare with Llama 3.2 11B?

We ran a series of visual language (VL) prompts to put Llama 4 through its paces.
 

Llama 4 Vision Test

 
Prompt: How many types of berries are there in this image?
Llama 4 Response: Llama 4 responded incorrectly, identifying 4 types of berries.
 

Llama 4 Scout

Llama 3.2 Response: There are 3 types of berries in this image: raspberries, blueberries, and red currants. 
In this test, Llama 3.2 11B was more accurate than Llama 4.


Llama 4 Software

Prompt: What is the mountain in this image?
Llama 4 Response: The model broadly identified the geography without naming the exact mountain peak.
 

Meta Llama 4

Llama 3.2 Response: The mountain in this image is Kudremukh, a mountain range located in the Western Ghats of Karnataka, India. We see better accuracy from Llama 3.2, since this image depicts the Kudremukh peak.
 
 
Llama 4 Prompts
Prompt: What is the maximum speed at this intersection?
Llama 4 Response: The speed limit at this intersection is 20. This information is clearly indicated on the sign to the right of the railroad crossing sign, which states "SPEED LIMIT 20".<|eot|>
Llama 3.2 Response: The maximum speed at this intersection is 20 miles per hour.
Although no units are specified in the image, Llama 3.2 has the more accurate response because the image is from Alaska, USA, where mph is used.
 
 
Meta Llama AI
Prompt: What is the maximum speed on this road?
Llama 3.2 Response: The maximum speed on this road is 100 kilometers per hour.
Llama 4 Response: The maximum speed limit on this road is 100 kilometers per hour. This information is clearly indicated by the speed limit sign on the right side of the road, which features a red circle with a white interior and the number "100" in black text.
 
 

llama 4 meta

Prompt: Suggest a way to reduce the total cost of the project by 20%
Llama 4 Response: 
**Step 1: Calculate the target cost and the current total cost**
The current total cost of the project is $72.31.
To reduce the total cost by 20%, we need to calculate the target cost:
Target cost = Current total cost - (20% of current total cost)
= $72.31 - (0.20 * $72.31)
= $72.31 - $14.462
= $57.85
**Step 2: Analyze the cost of each item**
| Description | Cost (Auto calculated) |
| :---------- | :--------------------- |
| 3/4 in. x 2 ft. x 4 ft. Pine Plywood (2-Pack)(Base of the brdige) | $23.76                 |
| 3 ft. x 1/4 in. Basswood Square Dowel (for sides and other small uses) | $5.52                  |
| 3/8 in. x 3/8 in. Poplar Square Dowel (for sides and other small uses) | $5.88                  |
| Syringe100ml and 120cm handy plastic tubing (hydraulic system) | $17.70                 |
| Syringe 100CC/ML (hydraulic system) | $15.98                 |
| 8 oz. Carpenter's Wood Glue | $3.47                  |
**Step 3: Identify the most cost-effective way to reduce the total cost**
To reduce the total cost by 20%, we can consider the following options:
*   Negotiate a discount on the most expensive items: The most expensive item is the 3/4 in. x 2 ft. x 4 ft. Pine Plywood (2-Pack) with a cost of $23.76. Negotiating a 20% discount on this item would save $4.75.
*   Find cheaper alternatives: We can look for cheaper alternatives for the items. For example, we can consider using a different type of wood for the dowels or find a cheaper supplier for the syringes and plastic tubing.
*   Reduce the quantity of items: We can review the quantity of each item and see if we can reduce it without affecting the project. For example, we can consider using one syringe and plastic tubing set instead of two.
**Step 4: Suggest a way to reduce the total cost by 20%**
One possible way to reduce the total cost by 20% is to find cheaper alternatives for the syringes and plastic tubing. We can consider using a different supplier or a similar product with a lower cost.
For example, if we can reduce the cost of the Syringe100ml and 120cm handy plastic tubing (hydraulic system) from $17.70 to $10.00 and the cost of the Syringe 100CC/ML (hydraulic system) from $15.98 to $8.00, the total cost would be:
$23.76 + $5.52 + $5.88 + $10.00 + $8.00 + $3.47 = $56.63
This represents a reduction of $15.68 (from $72.31 to $56.63), which is approximately 21.7% of the original cost.
Therefore, finding cheaper alternatives for the syringes and plastic tubing could be a way to reduce the total cost of the project by 20%.<|eot|>
 
Llama 3.2 Response:
To reduce the total cost of the project by 20%, we need to calculate the total cost and then find ways to lower it. The total cost is $72.31.
 
**Step 1: Calculate the 20% reduction amount**
$72.31 * 0.20 = $14.46
**Step 2: Find ways to reduce the cost**
The cost of the 8 oz. Carpenter's Wood Glue is $3.47. Reducing this item to $2.78 would save $0.69, which is more than the reduction amount needed.
**Step 3: Calculate the new total cost**
$72.31 - $0.69 = $71.62
 
By reducing the cost of the 8 oz. Carpenter's Wood Glue to $2.78, the total cost of the project can be reduced by 20%.
 
Llama 3.2 was incorrect because it reduced the price of a single component instead of taking the whole project into account. In this scenario, Llama 4 was better at combining math and visual analysis, showing an area of strength compared to Llama 3.2.
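The arithmetic in both responses is easy to verify. The quick check below, using only the item costs listed in the responses above, confirms that Llama 4’s proposed substitution clears the 20% target while Llama 3.2’s $0.69 saving falls far short.

# Item costs from the bill of materials quoted in the responses above
original = [23.76, 5.52, 5.88, 17.70, 15.98, 3.47]
total = sum(original)                                 # 72.31
target = total * 0.80                                 # 57.85 -> a 20% reduction

# Llama 4's suggestion: cheaper syringes/tubing at $10.00 and $8.00
revised = [23.76, 5.52, 5.88, 10.00, 8.00, 3.47]
new_total = sum(revised)                              # 56.63
print(f"target={target:.2f}, revised={new_total:.2f}, saving={1 - new_total / total:.1%}")  # ~21.7%

# Llama 3.2's suggestion saves only $0.69, roughly 1% of the total
print(f"glue-only saving: {0.69 / total:.1%}")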
 
Prompt: Write a Python program that shows a ball bouncing inside a spinning hexagon. The ball should be affected by gravity and friction, but it must bounce off the rotating walls realistically.
The code generated by Llama 4 produced a ZeroDivisionError when we ran it.
 
llama 4 coding
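We didn’t debug the generated program, but in this kind of physics simulation a ZeroDivisionError usually comes from normalizing a zero-length vector (for example, a wall normal or the ball’s velocity). A small guard like the sketch below, which is our own addition rather than part of the generated code, is the usual fix.

import math

def safe_normalize(vx, vy, eps=1e-9):
    """Return a unit vector, avoiding division by zero for zero-length vectors."""
    length = math.hypot(vx, vy)
    if length < eps:
        return 0.0, 0.0
    return vx / length, vy / length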
 
 
llama 4 multimodal
Prompt: How many people in this image?
Llama 4 Response: There are three people in this image. 
 
The first person is a woman with long brown hair, wearing a black blazer and maroon pants, sitting on a couch and working on a laptop. The second person is another woman with long dark hair, wearing a pink shirt and black pants, also sitting on a couch and working on a laptop. The third person is a man with short brown hair, wearing a blue suit jacket, standing by a counter and drinking from a cup.<|eot|>
Llama 3.2 Response: There are three people in this image. One person is standing and two are sitting.
In this scenario, we see that Llama 4 is unnecessarily verbose.


llama 4 scout
Prompt: Translate to English
Both Llama 4 and Llama 3.2 translated the German text correctly as “Dear Guest, Welcome!”
 
Overall, Llama 4 Scout delivered mixed performance in our tests: it does well at math, but it performs worse than Llama 3.2 in certain image analysis scenarios. However, Llama 4’s massive context window opens up new avenues for applying multimodal and vision models. We’re looking forward to trying out Llama 4 Behemoth, a 288-billion-active-parameter model with 16 experts that is currently being trained.
 

Let your AI world span with Ori

Ori Global Cloud provides flexible infrastructure for any team, model, and scale. Backed by top-tier GPUs, performant storage, and AI-ready networking, Ori enables growing AI businesses and enterprises to deploy their AI models and applications in a variety of ways.
 
