Meta’s Llama foundation models have accelerated AI innovation, empowering countless developers and startups with unprecedented access. With over a billion downloads, the Llama series has emerged as the most widely adopted open-source AI model ecosystem. In this article, we’ll demonstrate how to deploy Meta’s new generation of foundation models, Llama 4, on the Ori AI cloud, and provide a comparison with its predecessor, the multimodal Llama 3.2.
Here’s a brief overview of Llama 4’s key specifications:
| Specification | Llama 4 Scout and Maverick |
| --- | --- |
| Architecture | Mixture-of-Experts (MoE), natively multimodal (early fusion integrates text and visual tokens during pre-training) |
| Model variants | Scout (16 experts, Instruction-tuned and Base versions); Maverick (128 experts, Instruction-tuned and Base versions) |
| Parameters | Scout: 17B active parameters out of ~109B total; Maverick: 17B active parameters out of ~400B total |
| Capabilities | Instruction-tuned: optimized for code generation, visual reasoning, document summarization, and multimodal assistant tasks |
| Sequence length | Scout: 10 million tokens (16E Instruct), 256K (16E); Maverick: 1 million tokens (128E Instruct), 256K (128E) |
| Licensing | Llama 4 Community License |
Meta's Llama 4 models demonstrate notable performance across various AI benchmarks compared to leading models like OpenAI's GPT-4o and Google's Gemini 2.0 Flash.
Llama 4 utilizes a Mixture-of-Experts (MoE) architecture to enhance efficiency and scalability. Unlike dense models like Llama 3, where all parameters are activated for each token, Llama 4 activates only a subset of specialized "experts" per token. For instance, Llama 4 Scout comprises 16 experts with a total of 109 billion parameters but activates only 17 billion parameters per token during inference, reducing computational costs significantly while maintaining high performance. This design allows Llama 4 to achieve comparable or superior results to larger dense models, such as OpenAI's GPT-4o, but with lower inference costs and improved scalability.
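To make the routing idea concrete, here is a toy sketch of a top-1 routed MoE layer in PyTorch. This is purely illustrative and not Meta's implementation; the layer sizes, the linear router, and the top-1 routing are simplifying assumptions.
import torch
import torch.nn as nn

class ToyMoELayer(nn.Module):
    """Illustrative top-1 routed MoE layer; not Meta's implementation."""
    def __init__(self, d_model=64, n_experts=16):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)  # scores each token against each expert
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (num_tokens, d_model)
        expert_idx = self.router(x).argmax(dim=-1)  # pick one expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            if mask.any():
                out[mask] = expert(x[mask])  # each token only runs its chosen expert
        return out
Because each token only runs through its selected expert, only a fraction of the total parameters does work per token, which is how a ~109B-parameter model can run with roughly 17B active parameters.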
Llama 4 Scout features a 10-million-token context window, significantly surpassing GPT-4o's 128K tokens and Gemini 2.0 Flash's 1 million tokens. This extensive context capability enables Scout to handle long-context tasks effectively.
Llama 4 Maverick, with its 17 billion active parameters, excels in multimodal reasoning and coding tasks. These results highlight Llama 4's advancements in handling complex tasks and extended contexts, positioning it as a strong contender in the AI model landscape.

How to run Llama 4 Scout with Hugging Face Transformers on an Ori virtual machine
Pre-requisites
Create a GPU virtual machine (VM) on Ori Global Cloud. We chose 4x NVIDIA H100 SXM GPUs with 80 GB of VRAM each and 90 GiB of system memory for this demo; however, 8x H100s are recommended to achieve the full context window. We chose Ubuntu 22.04 as our OS, although Debian is also an option. For this demo we'll be loading the Llama 4 Scout Instruct model.
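Before installing anything, it's worth confirming that all four GPUs are visible on the VM:
nvidia-smi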
Step 1: Create and activate a Python virtual environment
apt install python3.10-venv
python3.10 -m venv llama-env
Activate the virtual environment:
source llama-env/bin/activate
Step 2: Install PyTorch if you didn't use the corresponding init script
pip3 install torch torchvision torchaudio
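A quick sanity check that PyTorch can see the GPUs (this assumes the default pip wheels, which ship with CUDA support):
python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
On the 4x H100 VM this should print True 4.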
Step 3: Install the Hugging Face CLI and log in
pip install -U "huggingface_hub[cli]"
huggingface-cli login
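Note that the Llama 4 repositories on Hugging Face are gated, so you'll need to have requested access to the model and log in with a token that has read permission. You can confirm you're authenticated with:
huggingface-cli whoami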
Step 4: Install Transformers and other dependencies
pip install transformers==4.51.0
pip install accelerate
pip install hf_xet
pip install auto-gptq bitsandbytes
Step 5: Spin up a Jupyter server and open it in your browser via your VM's IP address
pip install notebook
jupyter notebook --allow-root --no-browser --ip=0.0.0.0
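By default the server listens on port 8888, so open http://<your-VM-IP>:8888 in your browser and paste the token printed in the terminal.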
Step 6: Create a notebook and load the model with this script
from transformers import AutoProcessor, Llama4ForConditionalGeneration
import torch

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

processor = AutoProcessor.from_pretrained(model_id)
model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    attn_implementation="sdpa",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
Note: We ran into errors using Flex Attention; as an alternative we chose SDPA, which ran smoothly. This step will take a while, since the weights are being downloaded and loaded for the first time.
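If SDPA also causes problems on your setup, eager attention is a further fallback worth trying; it is generally slower but the most broadly compatible option. This is a variation of the loading call above, not a recommendation from Meta:
model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    attn_implementation="eager",  # fallback if sdpa/flex_attention misbehave
    device_map="auto",
    torch_dtype=torch.bfloat16,
)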

Here’s a snapshot of our memory usage

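If you prefer to check memory usage programmatically from inside the notebook rather than with nvidia-smi, a minimal sketch:
import torch

for i in range(torch.cuda.device_count()):
    allocated = torch.cuda.memory_allocated(i) / 1024**3
    reserved = torch.cuda.memory_reserved(i) / 1024**3
    print(f"GPU {i}: {allocated:.1f} GiB allocated, {reserved:.1f} GiB reserved")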
Let’s analyze two images and understand their similarities and differences: one is an image of a beach, the other of a mountain. This code snippet is based on the example provided by Hugging Face here.
url1 = "https://cdn.pixabay.com/photo/2019/03/02/18/43/beach-4030372_1280.jpg"
url2 = "https://cdn.pixabay.com/photo/2021/06/28/04/46/mountain-6370590_1280.jpg"

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": url1},
            {"type": "image", "url": url2},
            {"type": "text", "text": "Can you describe how these two images are similar, and how they differ?"},
        ]
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=1024,
)

response = processor.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:])[0]
print(response)
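As a side note, the decoded text keeps special tokens such as <|eot|> by default; if you'd rather strip them, batch_decode accepts skip_special_tokens:
response = processor.batch_decode(
    outputs[:, inputs["input_ids"].shape[-1]:],
    skip_special_tokens=True,
)[0]
print(response)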
Here’s the response from Llama 4 Scout:

This example can be scaled to compare multiple images for visual-intensive workloads.
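As a rough sketch of how that scaling might look, you can pass additional image entries in the same message; the URLs below are placeholders, not images from this demo:
# Hypothetical example: compare several images in one prompt (placeholder URLs).
image_urls = [
    "https://example.com/image1.jpg",
    "https://example.com/image2.jpg",
    "https://example.com/image3.jpg",
]

messages = [
    {
        "role": "user",
        "content": [
            *[{"type": "image", "url": u} for u in image_urls],
            {"type": "text", "text": "Compare these images and describe what they have in common."},
        ],
    },
]
# The rest of the pipeline (apply_chat_template, generate, batch_decode) is unchanged.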
Trying out Llama 4’s Industry-leading Context Length with vLLM
One of Llama 4’s standout features is its extremely long context window (10 million tokens for Scout). Since we used only 4 GPUs we couldn't try the full context length; we managed about 200,000 tokens, which is still longer than many other models support.
Step 1: Install vLLM
pip install -U vllm
pip install flashinfer
Step 2: Load the model in a notebook and run a long-context prompt
import os
from vllm import LLM, SamplingParams

# Read in our example file
def read_file_to_string(file_path):
    try:
        with open(file_path, "r") as file:
            content = file.read()
        return content
    except FileNotFoundError:
        print(f"File {file_path} not found.")
        return "File_Path_Error"

# Please remember to set `attn_temperature_tuning` to `True` for best long-context performance
def load_llm():
    llm = LLM(
        model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
        enforce_eager=False,
        tensor_parallel_size=4,
        max_model_len=200000,
        override_generation_config={
            "attn_temperature_tuning": True,
        },
    )
    return llm
llm = load_llm()
file_content = read_file_to_string("/root/book.txt")

PROMPT = f"""Write a couple of paragraphs about Anne's house and her environs\n\n\n{file_content} """

print("Showing long content")
if len(file_content) > 100:
    print(file_content[:100])
else:
    print(file_content)
conversations = [
    [
        {
            "role": "user",
            "content": PROMPT
        }
    ],
]

# Create a sampling params object.
sampling_params = SamplingParams(temperature=1, top_p=0.95, max_tokens=4000)

# Remember to use the `chat` function and not `generate` :)
outputs = llm.chat(conversations, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Generated text: {generated_text}")
Prompt: Write a couple of paragraphs about Anne's house and her environs
Llama 4 Response:
How good is Llama 4 Scout, and how does it compare with Llama 3.2 11B?
We ran a series of visual language (VL) prompts to put Llama 4 through its paces.

Prompt: How many types of berries are there in this image?
Llama 4 Response: Llama 4 responded incorrectly, identifying 4 types of berries.

Llama 3.2 Response: There are 3 types of berries in this image: raspberries, blueberries, and red currants.

Prompt: What is the mountain in this image?
Llama 4 Response: The model gave us a response that broadly identified the geography without naming the exact mountain peak.

Llama 3.2 Response: The mountain in this image is Kudremukh, a mountain range located in the Western Ghats of Karnataka, India. We see better accuracy from Llama 3.2, since this image depicts the Kudremukh peak.
Prompt: What is the maximum speed at this intersection?
Llama 4 Response: The speed limit at this intersection is 20. This information is clearly indicated on the sign to the right of the railroad crossing sign, which states "SPEED LIMIT 20".<|eot|>
Llama 3.2 Response: The maximum speed at this intersection is 20 miles per hour.
Although no units are specified in the image, Llama 3.2 has the more accurate response because the image is from Alaska, USA where mph is used.
Prompt: What is the maximum speed on this road?
Llama 3.2 Response: The maximum speed on this road is 100 kilometers per hour.
Llama 4 Response: The maximum speed limit on this road is 100 kilometers per hour. This information is clearly indicated by the speed limit sign on the right side of the road, which features a red circle with a white interior and the number "100" in black text.

Prompt: Suggest a way to reduce the total cost of the project by 20%
Llama 4 Response:
**Step 1: Calculate the target cost and the current total cost**
The current total cost of the project is $72.31.
To reduce the total cost by 20%, we need to calculate the target cost:
Target cost = Current total cost - (20% of current total cost)
= $72.31 - (0.20 * $72.31)
= $72.31 - $14.462
= $57.85
**Step 2: Analyze the cost of each item**
| Description | Cost (Auto calculated) |
| :---------- | :--------------------- |
| 3/4 in. x 2 ft. x 4 ft. Pine Plywood (2-Pack)(Base of the brdige) | $23.76 |
| 3 ft. x 1/4 in. Basswood Square Dowel (for sides and other small uses) | $5.52 |
| 3/8 in. x 3/8 in. Poplar Square Dowel (for sides and other small uses) | $5.88 |
| Syringe100ml and 120cm handy plastic tubing (hydraulic system) | $17.70 |
| Syringe 100CC/ML (hydraulic system) | $15.98 |
| 8 oz. Carpenter's Wood Glue | $3.47 |
**Step 3: Identify the most cost-effective way to reduce the total cost**
To reduce the total cost by 20%, we can consider the following options:
* Negotiate a discount on the most expensive items: The most expensive item is the 3/4 in. x 2 ft. x 4 ft. Pine Plywood (2-Pack) with a cost of $23.76. Negotiating a 20% discount on this item would save $4.75.
* Find cheaper alternatives: We can look for cheaper alternatives for the items. For example, we can consider using a different type of wood for the dowels or find a cheaper supplier for the syringes and plastic tubing.
* Reduce the quantity of items: We can review the quantity of each item and see if we can reduce it without affecting the project. For example, we can consider using one syringe and plastic tubing set instead of two.
**Step 4: Suggest a way to reduce the total cost by 20%**
One possible way to reduce the total cost by 20% is to find cheaper alternatives for the syringes and plastic tubing. We can consider using a different supplier or a similar product with a lower cost.
For example, if we can reduce the cost of the Syringe100ml and 120cm handy plastic tubing (hydraulic system) from $17.70 to $10.00 and the cost of the Syringe 100CC/ML (hydraulic system) from $15.98 to $8.00, the total cost would be:
$23.76 + $5.52 + $5.88 + $10.00 + $8.00 + $3.47 = $56.63
This represents a reduction of $15.68 (from $72.31 to $56.63), which is approximately 21.7% of the original cost.
Therefore, finding cheaper alternatives for the syringes and plastic tubing could be a way to reduce the total cost of the project by 20%.<|eot|>
Llama 3.2 Response:
To reduce the total cost of the project by 20%, we need to calculate the total cost and then find ways to lower it. The total cost is $72.31.
**Step 1: Calculate the 20% reduction amount**
$72.31 * 0.20 = $14.46
**Step 2: Find ways to reduce the cost**
The cost of the 8 oz. Carpenter's Wood Glue is $3.47. Reducing this item to $2.78 would save $0.69, which is more than the reduction amount needed.
**Step 3: Calculate the new total cost**
$72.31 - $0.69 = $71.62
By reducing the cost of the 8 oz. Carpenter's Wood Glue to $2.78, the total cost of the project can be reduced by 20%.
Llama 3.2 was incorrect in its calculations because it reduced the price of a single component instead of taking the whole project into account. In this scenario, Llama 4 was better at combining math and visual analysis, showing an area of strength compared to Llama 3.2.
Prompt: Write a Python program that shows a ball bouncing inside a spinning hexagon. The ball should be affected by gravity and friction, but it must bounce off the rotating walls realistically.
The code generated by Llama 4 produced a zero-division error.
Prompt: How many people in this image?
Llama 4 Response: There are three people in this image.
The first person is a woman with long brown hair, wearing a black blazer and maroon pants, sitting on a couch and working on a laptop. The second person is another woman with long dark hair, wearing a pink shirt and black pants, also sitting on a couch and working on a laptop. The third person is a man with short brown hair, wearing a blue suit jacket, standing by a counter and drinking from a cup.<|eot|>
Llama 3.2 Response: There are three people in this image. One person is standing and two are sitting.
In this scenario, we see that Llama 4 is unnecessarily verbose.
Prompt: Translate to English
Both Llama 4 and Llama 3.2 translated the German text correctly as “Dear Guest, Welcome!”
Overall, Llama 4 Scout delivered mixed performance in our tests: it does well at math but performs worse than Llama 3.2 in certain image-analysis scenarios. However, Llama 4's massive context window opens up new avenues for applying multimodal and vision models. We're looking forward to trying out Llama 4 Behemoth, a 288-billion-active-parameter model with 16 experts that is currently still in training.
Let your AI world span with Ori
Ori Global Cloud provides flexible infrastructure for any team, model, and scale. Backed by top-tier GPUs, performant storage, and AI-ready networking, Ori enables growing AI businesses and enterprises to deploy their AI models and applications in a variety of ways: