Tutorial

How to run Mistral Small 3 on a cloud GPU with vLLM

Discover how to easily deploy Mistral Small 3 on a cloud GPU with vLLM, along with our analysis of the model on verbal, math, and coding prompts.

Mistral AI has established itself as a key player in the generative AI space by developing open, portable, and highly customizable models. A defining strength of its model portfolio is the ability to deliver exceptional performance with relatively compact models, such as Pixtral 12B and the Ministral series. With the launch of Mistral Small 3, the company has gone a step further, delivering a compact yet performant model for a variety of language, coding, and math tasks.

Mistral Small 3 supports multilingual input and output in 11 languages: English, French, German, Spanish, Italian, Chinese, Japanese, Korean, Portuguese, Dutch, and Polish. Mistral is targeting a range of business use cases with this model: fraud detection, virtual customer service, sentiment analysis, feedback systems, customer triaging, and more.

Here’s a quick rundown of Mistral Small 3 24B specifications: 

  • Architecture: Designed for low latency; Tekken tokenizer with a 131k vocabulary size
  • Variants: Pretrained base model and instruction-tuned
  • Parameters: 24 billion
  • Context window: 32k tokens
  • Licensing: Apache 2.0 (commercial and non-commercial use)
Benchmark results shared by Mistral indicate that Small 3 outperforms GPT-4o mini and Gemma 2 27B, while coming close to Llama 3.3 70B and Qwen 2.5 32B in terms of performance.
 
Mistral Small 3 Performance
Source: Mistral Blog
 

How to use Mistral Small 3 with vLLM on an Ori virtual machine

Prerequisites

Create a GPU virtual machine (VM) on Ori Global Cloud. We chose the NVIDIA H100 SXM GPU because the vLLM server for Mistral Small 3 Instruct in BF16 format needs about 71 GB of memory, which includes model weights, PyTorch activations, and KV cache. If you’re using smaller, quantized versions of the model, we recommend the NVIDIA L40S GPU for its strong inference performance.
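As a rough sanity check on that figure, serving memory can be approximated from the parameter count. A back-of-the-envelope sketch (the helper name and the overhead fraction are our assumptions, not vLLM figures):

```python
# Rough GPU memory estimate for serving a model in BF16 (2 bytes per parameter).
# overhead_fraction is an assumed allowance for activations and KV cache.
def estimate_serving_memory_gb(params_billions, bytes_per_param=2, overhead_fraction=0.5):
    weights_gb = params_billions * 1e9 * bytes_per_param / 1e9
    return weights_gb * (1 + overhead_fraction)

print(estimate_serving_memory_gb(24))  # → 72.0, in the same ballpark as the ~71 GB observed
```

For a 24B-parameter model, the weights alone take about 48 GB in BF16, which is why a single 80 GB H100 is a comfortable fit while smaller cards need quantized variants.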
 

 

Quick tip
Use the init script when creating the VM so NVIDIA CUDA drivers, frameworks such as Pytorch or Tensorflow and Jupyter notebooks are preinstalled for you.

 

  
 
Step 1: SSH into your VM, install Python and create a virtual environment
apt install python3.11-venv
python3.11 -m venv mistral-env
 
Activate the virtual environment
source mistral-env/bin/activate
 
Step 2: Install the Hugging Face CLI if you haven’t already, and log in. Have your Hugging Face token ready, and make sure you have accepted the model’s terms and conditions.
 
pip install -U "huggingface_hub[cli]"
huggingface-cli login
 
Step 3: Install vLLM and update tokenizer
pip install vllm
pip install --upgrade mistral_common
 
Step 4: Load the model and run the vLLM server. We chose the Mistral Small 24B Instruct variant for its better performance. 
vllm serve "mistralai/Mistral-Small-24B-Instruct-2501"

The default port for vLLM is 8000; you can change it with the --port option.

Note: It'll take a while to load all the model shards for the first time. 
 
Step 5: Query the model from another terminal window, using the JSON utility jq to format model responses
sudo apt install jq

curl -s "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-Small-24B-Instruct-2501",
    "temperature": 0.15,
    "messages": [
      {"role": "user", "content": "How many times does the letter r appear in the word strawberry?"}
    ]
  }' | jq -r '.choices[0].message.content'
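The same request can be made from Python, since vLLM exposes an OpenAI-compatible API. A minimal sketch (the helper names are ours; the payload and response shapes match the curl example above, and the server is assumed to be on the default port 8000):

```python
import json

# Build an OpenAI-compatible chat completion payload for the vLLM server.
def build_chat_payload(model, prompt, temperature=0.15):
    return {
        "model": model,
        "temperature": temperature,
        "messages": [{"role": "user", "content": prompt}],
    }

# Pull the assistant's reply text out of a decoded chat completion response.
def extract_content(response):
    return response["choices"][0]["message"]["content"]

payload = build_chat_payload(
    "mistralai/Mistral-Small-24B-Instruct-2501",
    "How many times does the letter r appear in the word strawberry?",
)
print(json.dumps(payload, indent=2))
# POST this payload to http://localhost:8000/v1/chat/completions,
# then pass the decoded JSON body to extract_content().
```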
 
 

How good is Mistral Small 3?

We threw a few verbal, math, and coding reasoning challenges at Mistral Small 3 24B. The model did quite well on verbal intelligence, but at times it struggled with our coding and math questions.
 
Prompt: What is the best French cheese?
 
Mistral Verbal Analysis
 
Mistral Small 3 handled this subjective prompt quite well: it laid out some of the options and their strengths, and let the person asking the question decide based on those attributes.
 
Prompt: Translate this sentence to French, German and Italian: “Visit our website to learn more about end-to-end AI infrastructure. Go to www.ori.co.”
 

Mistral Small 3 Translations

The model understood the question correctly and translated the statement with reasonable accuracy.

Prompt: How many ‘r’s in “strawberry” ?
 
Mistral Small 3 Strawberry Test
 

The model got this question wrong: the word strawberry contains 3 ‘r’s.

Prompt: How many ‘l’s in “strawberry”?

Mistral Small 3 Strawberry Prompt

Mistral Small 3 answered this prompt incorrectly too.
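Both letter-counting prompts can be checked directly in Python, which shows the answers the model should have given:

```python
# Ground truth for the two letter-counting prompts above.
word = "strawberry"
print(word.count("r"))  # → 3
print(word.count("l"))  # → 0
```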

Prompt: Write the game Snake in Python

Creating a simple version of the classic Snake game in Python can be a fun and educational project. We'll use the `tkinter` library for the graphical interface and `random` for generating food positions.

Below is a basic implementation of Snake using `tkinter`:

import tkinter as tk
import random

class SnakeGame:
    def __init__(self, master):
        self.master = master
        self.master.title("Snake Game")
        self.master.resizable(False, False)
        self.canvas = tk.Canvas(self.master, width=400, height=400, bg="black")
        self.canvas.pack()
        self.snake = [(100, 50), (90, 50), (80, 50)]
        self.food = self.spawn_food()
        self.direction = "Right"
        self.canvas.bind_all("<KeyPress>", self.change_direction)
        self.game_loop()

    def change_direction(self, event):
        if event.keysym == "Up" and self.direction != "Down":
            self.direction = "Up"
        elif event.keysym == "Down" and self.direction != "Up":
            self.direction = "Down"
        elif event.keysym == "Left" and self.direction != "Right":
            self.direction = "Left"
        elif event.keysym == "Right" and self.direction != "Left":
            self.direction = "Right"

    def game_loop(self):
        head = self.snake[0]
        if self.direction == "Up":
            new_head = (head[0], head[1] - 10)
        elif self.direction == "Down":
            new_head = (head[0], head[1] + 10)
        elif self.direction == "Left":
            new_head = (head[0] - 10, head[1])
        elif self.direction == "Right":
            new_head = (head[0] + 10, head[1])
        self.snake.insert(0, new_head)
        # Check for collisions
        if (
            new_head in self.snake[len(self.snake) - 1:len(self.snake)]  # Self collision
            or new_head[0] < 0
            or new_head[0] >= 400
            or new_head[1] < 0
            or new_head[1] >= 400
        ):
            self.game_over()
            return
        if new_head == self.food:
            self.food = self.spawn_food()
        else:
            self.snake.pop()
        self.canvas.delete("all")
        self.draw_snake()
        self.canvas.create_oval(self.food[0], self.food[1], self.food[0] + 10, self.food[1] + 10, fill="red")
        self.master.after(100, self.game_loop)

    def draw_snake(self):
        for segment in self.snake:
            self.canvas.create_rectangle(
                segment[0], segment[1], segment[0] + 10, segment[1] + 10, fill="white"
            )

    def spawn_food(self):
        while True:
            food = (random.randint(0, 39) * 10, random.randint(0, 39) * 10)
            if food not in self.snake:
                return food

    def game_over(self):
        self.canvas.create_text(200, 200, text="Game Over", font=("Helvetica", 24), fill="white")

if __name__ == "__main__":
    root = tk.Tk()
    game = SnakeGame(root)
    root.mainloop()

The code ran successfully and generated a basic game of Snake quite well. Here’s a snapshot of the game from the code generated by Mistral Small 3.

 
Prompt: What is Agentic AI? Explain like I’m five.
Mistral Agentic AI
 
Prompt: Write the Tetris game in Python.
The code generated by Mistral failed to run on the first pass.
 
Prompt: Give me a concise summary of the Collatz_conjecture.
Mistral Verbal Summary
 
Prompt: Compute the area of the region enclosed by the graphs of the given equations “y=x, y=2x, and y=6-x”. Use vertical cross-sections.
 

Problem source: UC Davis

Mistral Small 3 Math Performance

Mistral Small 3 Math

The answer provided by Mistral Small 3 is incorrect. The correct answer is 3 (or 3 square units).
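That value can be verified with vertical cross-sections: the upper boundary is y = 2x from x = 0 to x = 2 (where y = 2x meets y = 6 − x), then y = 6 − x from x = 2 to x = 3; the lower boundary is y = x throughout. A quick sketch evaluating the antiderivatives of the two integrands:

```python
# Area via vertical cross-sections, split at x = 2 where the top boundary changes.
def F1(x):
    # Antiderivative of (2x - x) = x on [0, 2]
    return x**2 / 2

def F2(x):
    # Antiderivative of ((6 - x) - x) = 6 - 2x on [2, 3]
    return 6 * x - x**2

area = (F1(2) - F1(0)) + (F2(3) - F2(2))
print(area)  # → 3.0
```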

Prompt: Quelles sont les sept merveilles du monde? (“What are the seven wonders of the world?”)

Overall, our perception of Mistral Small 3 was that it excelled at verbal analysis and translation, but delivered mixed results for coding and math prompts.
 

Chart your own AI reality with Ori

Ori Global Cloud provides flexible infrastructure for any team, model, and scale. Backed by top-tier GPUs, performant storage, and AI-ready networking, Ori enables growing AI businesses and enterprises to deploy their AI models and applications in a variety of ways:

  • GPU instances, on-demand virtual machines backed by top-tier GPUs to run AI workloads.
  • Inference Endpoints to run and scale your favorite open source models with just one click.
  • GPU Clusters to train and serve your most ambitious AI models.
  • Serverless Kubernetes to run inference at scale without having to manage infrastructure.
  • Private Cloud provides a single-tenant environment, top-notch performance and secure infrastructure for enterprise AI builds.

 

 
