Tutorial

How to run Magistral Small on a cloud GPU

Learn how to deploy Mistral’s open-source Magistral Small model on a cloud GPU using Ollama and Open WebUI, plus our analysis of the Magistral model.

Mistral AI has launched Magistral, its first series of reasoning models, available in two versions: Magistral Small (open-source) and Magistral Medium (enterprise-grade, access via API and Mistral’s Le Chat). These models are based on a transformer architecture fine-tuned through Mistral’s proprietary Reinforcement Learning from Verifiable Rewards (RLVR) framework, which replaces external critics with a generator–verifier setup. This approach yields transparent, step-by-step “chain‑of‑thought” reasoning at scale.
 
Here’s a brief overview of Magistral Small’s specifications:

Architecture: Transformer, fine-tuned with Reinforcement Learning from Verifiable Rewards (RLVR) using Group Relative Policy Optimization (GRPO) as the RL algorithm
Parameters: 24 billion
Context window: 128k tokens maximum, 40.9k tokens recommended
Licensing: Apache 2.0 (commercial and research use)
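For context, GRPO (introduced with DeepSeek's work on reasoning models and adopted by Mistral here) dispenses with the separate critic network used by PPO-style methods: for each prompt it samples a group of G completions, scores each one with the verifiable reward, and uses the group-normalized score as the advantage. A sketch of that advantage term, based on the published GRPO formulation rather than Mistral's exact training code:

\[
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G)}
\]

where r_i is the verifiable reward of the i-th sampled completion.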

Magistral Small’s benchmarks demonstrate strong overall performance, exceeding the Llama 4 models but trailing DeepSeek R1 and the Qwen 3 series.

 
Model                    AIME24    AIME25    GPQA Diamond    LiveCodeBench
Magistral Small          70.68     62.76     68.18           55.84
Qwen 3 32B (Dense)       81.4      72.9      N/A             65.7
Qwen 3 30B A3B (MoE)     80.4      70.9      65.8            62.6
DeepSeek R1              79.8      70.0      71.5            65.9
DeepSeek V3              39.2      28.8      59.1            36.2
Llama 4 Maverick         N/A       N/A       69.8            43.4
Llama 4 Scout            N/A       N/A       57.2            32.8

Source: Llama 4, Qwen 3, Magistral & DeepSeek

How to run Magistral Small with Ollama

Prerequisites

Create a GPU virtual machine (VM) on Ori Global Cloud. We chose a setup with an NVIDIA L40S GPU and Ubuntu 22.04 as our OS, since we ran the Q8_0 quantized version of the model; if you choose the FP16 version, you may need an H100 GPU instead.

Quick tip
Use the init script when creating the VM so NVIDIA CUDA drivers, frameworks such as PyTorch or TensorFlow, and Jupyter notebooks are preinstalled for you.
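Once the VM is up, it's worth confirming that the driver can see the GPU before installing anything (this assumes the NVIDIA drivers are already present, for example via the init script above):

nvidia-smi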

Step 1: SSH into your VM, install Python, and create a virtual environment
apt install python3.11-venv
python3.11 -m venv mistral-env
 
Step 2: Activate the virtual environment
source mistral-env/bin/activate
 
Step 3: Install Ollama and (optionally) choose which GPU it uses
curl -fsSL https://ollama.com/install.sh | sh
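The install script registers Ollama as a background service and uses the GPU automatically. If your VM has more than one GPU and you want to pin Ollama to a specific device, one option is to stop the service and launch the server yourself with CUDA_VISIBLE_DEVICES set; this is a rough sketch, and the service name or setup may differ on your machine:

sudo systemctl stop ollama
CUDA_VISIBLE_DEVICES=0 ollama serve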
 
Step 4: Run Magistral 24B Small (Quantized Q8_0)
ollama run magistral:24b-small-2506-q8_0 --verbose
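The interactive prompt is handy for quick tests, but Ollama also exposes a local REST API on port 11434, which is useful for scripted benchmarks. A minimal example (the prompt text is just an illustration):

curl http://localhost:11434/api/generate -d '{
  "model": "magistral:24b-small-2506-q8_0",
  "prompt": "How many times does the letter r appear in strawberry?",
  "stream": false
}'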
 
Step 5: Install Open WebUI on the VM via another terminal window and run it
pip install open-webui
open-webui serve
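By default, open-webui serve listens on port 8080; the CLI also accepts a --port flag if that port is taken. To keep the server running after you close the SSH session, standard shell backgrounding works (the log file name is our own choice):

nohup open-webui serve --port 8080 > openwebui.log 2>&1 &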
 
Step 6: Access Open WebUI in your browser through the default 8080 port.
http://<VM-IP>:8080/
 
Click “Get Started” to create an Open WebUI account if you haven’t set one up on this virtual machine before.
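If you would rather not expose port 8080 to the internet, an SSH tunnel from your local machine is a common alternative (user and VM-IP are placeholders for your own values):

ssh -L 8080:localhost:8080 user@VM-IP

Open WebUI is then available at http://localhost:8080/ in your local browser.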

[Screenshot: Magistral running in Open WebUI]

Step 7: Choose magistral:24b-small-2506-q8_0 from the Models dropdown and chat away!

Is Magistral Small better than Mistral Small 3?

We tried out the Mistral Small 3 model a few months ago, so we tested Magistral on the prompts that Small 3 didn’t handle well.
 
Prompt: How many ‘r’s in “strawberry” ?
 
Mistral Small 3: The word "strawberry" contains 2 letter “r”s
 
Magistral Small: 3
 

[Screenshot: Magistral's letter-counting responses]

Prompt: How many ‘l’s in “strawberry” ?
 
Mistral Small 3:  The word "strawberry" contains 2 letter “l”s
 
Magistral Small: 0
 

Prompt: Compute the area of the region enclosed by the graphs of the given equations “y=x, y=2x, and y=6-x”. Use vertical cross-sections
 
Mistral Small 3: 7
 
Magistral Small: 3
 
The correct answer is 3 (or 3 square units).
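For reference, here is our own quick check (not the model's working): the three lines intersect pairwise at (0,0), (2,4), and (3,3), so integrating over vertical cross-sections gives

\[
A = \int_0^2 (2x - x)\,dx + \int_2^3 \bigl((6 - x) - x\bigr)\,dx = \Bigl[\tfrac{x^2}{2}\Bigr]_0^2 + \Bigl[6x - x^2\Bigr]_2^3 = 2 + 1 = 3.
\]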

 

[Screenshot: Magistral's solution to the area problem]

Overall, Magistral Small shows a significant leap over Mistral Small 3 in terms of performance. The benefits of a reasoning model are evident in the enhanced accuracy, indicating that reasoning models are the way forward for stronger performance.

Our take on Magistral Small

Speed

Magistral is comparable in speed with frontier open-source models such as Qwen 3, generating more than 26 tokens per second.
 
Both models answered the question below correctly, but Magistral took only 1 minute and 0.4 seconds whereas Qwen 3 took 1 minute and 38 seconds.
 
Prompt: What is larger: 134.59 or 134.6?
 
Magistral:

[Screenshot: Magistral's timed response]

Qwen 3:

[Screenshot: Qwen 3's timed response]
Accuracy

In our observation, Magistral Small is nearly as good as Qwen 3, with some exceptions.
 
Prompt: Exactly how many days ago did the French Revolution start? Today is June 11th, 2025.
Magistral got this question completely wrong, answering 460 days. The response also took 17 minutes.
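By our count, the correct answer is 86,164 days, taking July 14, 1789 (the storming of the Bastille) as the conventional start date. The figure is easy to verify on the VM with GNU date; the one-liner below is our own check, not part of the original test:

echo $(( ( $(date -u -d 2025-06-11 +%s) - $(date -u -d 1789-07-14 +%s) ) / 86400 ))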
 

Magistral Small failed to generate fully working code for a Tetris game, whereas Qwen 3 got it right in one shot.
 

[Screenshot: Magistral's Tetris attempt]

Both models failed to generate code that satisfied this prompt:
 
Prompt: "write a Python program that shows a ball bouncing inside a spinning hexagon. The ball should be affected by gravity and friction, and it must bounce off the rotating walls realistically"
 
Magistral's response:

[Screenshot: Magistral's bouncing-ball attempt]

Flexibility

The absence of a non-reasoning mode makes Magistral Small less flexible than Qwen 3. Magistral can fall into very long reasoning loops that run for several minutes, which is a problem for many use cases, especially when the eventual responses are incorrect.
 
Overall, Magistral is an impressive reasoning model from Mistral and a preview of the stronger reasoning models set to emerge from leading AI labs. Although it is accurate and fast, the lack of a non-reasoning mode makes it less flexible, especially for simple prompts.
 

Build your enterprise AI on Ori

Ori Global Cloud provides flexible infrastructure for any team, model, and scale. Backed by top-tier GPUs, performant storage, and AI-ready networking, Ori enables AI teams and businesses to deploy their AI models and applications in a variety of ways.
 
