Tutorial

How to run Qwen 3 235B on a cloud GPU

Discover how to deploy Qwen 3 235B model with Ollama and OpenWebUI on a cloud GPU and check out our model analysis.

Alibaba’s Qwen series of AI models has rapidly emerged as a strong open-source alternative to state-of-the-art (SOTA) models, often rivaling, and in some benchmarks exceeding, their performance. The latest generation, Qwen 3, is a versatile family of generative AI models that blends high performance with broad accessibility. These models are designed with hybrid reasoning capabilities: they handle simple tasks efficiently and dynamically shift gears to tackle more complex problems. The Qwen 3 lineup includes both dense and Mixture-of-Experts (MoE) architectures, ranging from 0.6 billion to 235 billion parameters, all available under the permissive Apache 2.0 license.
 
Here’s a brief overview of Qwen 3’s key specifications:

Architecture: Dense and Mixture-of-Experts (MoE) Transformers; hybrid reasoning modes (thinking and non-thinking)
Parameters: Dense: 0.6B, 1.7B, 4B, 8B, 14B, 32B; MoE: 30B (3B active), 235B (22B active)
Model variants: Dense, MoE
Context length: Dense (0.6B-4B): 32K tokens; Dense (8B-32B) and MoE: 128K tokens
Licensing: Apache 2.0 (commercial and research use)

Performance benchmarks from Artificial Analysis indicate that Qwen 3 235B A22B compares well with other top-of-the-line models from OpenAI, Google and DeepSeek.

Qwen 3 LLM Performance

 

How to run Qwen 3 with Ollama

Pre-requisites

Create a GPU virtual machine (VM) on Ori Global Cloud. We chose a setup with 4x NVIDIA H100 SXM GPUs and Ubuntu 22.04 as our OS; however, 2x H100s are enough, since Ollama needs about 143 GB of VRAM to run Qwen 3 235B.
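Once the VM is up, you can sanity-check that your GPUs add up to enough memory for the model. Here is a minimal sketch (assuming the NVIDIA drivers are installed; it falls back gracefully elsewhere):

```shell
# Sum the total VRAM across visible GPUs; compare it against the ~143 GB
# Ollama needs for Qwen 3 235B (figure from this article, not a hard limit)
if command -v nvidia-smi >/dev/null 2>&1; then
  total_mib=$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits \
    | awk '{sum += $1} END {print sum}')
  echo "Total VRAM: ${total_mib} MiB"
else
  echo "nvidia-smi not found; run this on the GPU VM"
fi
```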

  

 

Quick tip
Use the init script when creating the VM so that NVIDIA CUDA drivers, frameworks such as PyTorch or TensorFlow, and Jupyter notebooks come preinstalled.

 

  

Step 1: SSH into your VM, install Python and create a virtual environment
apt install python3.12-venv
python3.12 -m venv qwen-env
 
Step 2: Activate the virtual environment
source qwen-env/bin/activate
 
Step 3: Install Ollama and specify the number of GPUs to be used
curl -fsSL https://ollama.com/install.sh | sh
export OLLAMA_GPU_COUNT=4
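Note that environment variables exported in your shell only affect processes launched from that shell; if Ollama runs as a background service, the variable may need to be set there instead. As a more general way to control which GPUs a process sees, CUDA honors the standard CUDA_VISIBLE_DEVICES variable (a CUDA-wide mechanism, not specific to Ollama):

```shell
# Restrict the process to GPUs 0-3 (indices as reported by nvidia-smi)
export CUDA_VISIBLE_DEVICES=0,1,2,3
echo "Using GPUs: ${CUDA_VISIBLE_DEVICES}"
```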
 
Step 4: Run Qwen 3 235B with Ollama
ollama run qwen3:235b --verbose
 

Here’s what our setup looks like with Ollama running


Qwen 3 system VRAM memory requirements

Step 5: In another terminal window, install Open WebUI on the VM and run it
pip install open-webui
open-webui serve
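If you want Open WebUI to keep serving after you close the SSH session, one common approach is nohup (a sketch; tmux or screen work just as well):

```shell
# Launch Open WebUI detached from the terminal, logging to a file;
# skip gracefully if it isn't installed yet
if command -v open-webui >/dev/null 2>&1; then
  nohup open-webui serve > openwebui.log 2>&1 &
  echo "Open WebUI started in the background (PID $!)"
else
  echo "open-webui not installed; run 'pip install open-webui' first"
fi
```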
 
Step 6: Access Open WebUI in your browser through the default port 8080.
http://<VM-IP>:8080/
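If your VM's firewall doesn't expose port 8080 publicly, you can tunnel it over SSH instead and browse to http://localhost:8080/ on your own machine. The snippet below just prints the command to run (the "ubuntu" username is an assumption; use your VM's login user):

```shell
# Forward local port 8080 to the Open WebUI port on the VM.
# -N means "no remote command", i.e. the connection only carries the tunnel.
VM_IP="<VM-IP>"   # replace with your VM's public IP
tunnel_cmd="ssh -N -L 8080:localhost:8080 ubuntu@${VM_IP}"
echo "$tunnel_cmd"
```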
 
Click “Get Started” to create an Open WebUI account if you haven’t set one up on this virtual machine before.

Qwen 3 Openwebui

Step 7: Choose qwen3:235b from the Models drop down and chat away!

Comparing Thinking and Non-Thinking modes

As a hybrid model, Qwen 3 235B A22B can switch between thinking and non-thinking modes. Append the “/think” or “/no_think” tag to your prompts to choose the mode you want.
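The same soft switches work over Ollama's HTTP API, which listens on port 11434 by default. A sketch that just assembles and prints the request; uncomment the curl line on the VM to actually send it:

```shell
# Ask a quick question in non-thinking mode via Ollama's /api/generate endpoint.
# "stream": false returns a single JSON object instead of a token stream.
PAYLOAD='{"model": "qwen3:235b", "prompt": "Which is larger: 134.59 or 134.6? /no_think", "stream": false}'
echo "$PAYLOAD"
# curl -s http://localhost:11434/api/generate -d "$PAYLOAD"
```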

Here is a comparison of thinking and non-thinking responses to our prompts.

Prompt: Compute the area of the region enclosed by the graphs of the equations y = x, y = 2x, and y = 6 - x. Use vertical cross-sections.

Qwen 3 got the answer (3) right in both modes. However, the thinking mode took far longer (4m 16s vs 15s), with the model second-guessing itself continuously.
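For reference, the answer of 3 checks out by hand: the three lines intersect pairwise at (0, 0), (2, 4) and (3, 3), so with vertical cross-sections the region splits at x = 2, with upper boundary y = 2x on [0, 2] and y = 6 - x on [2, 3], and lower boundary y = x throughout:

```latex
A = \int_{0}^{2} (2x - x)\,dx + \int_{2}^{3} \bigl((6 - x) - x\bigr)\,dx
  = \left[\tfrac{x^2}{2}\right]_{0}^{2} + \left[\,6x - x^2\,\right]_{2}^{3}
  = 2 + (9 - 8) = 3
```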

Thinking Mode

Qwen 3 Math

Non-thinking Mode

Qwen3 Math

 

Prompt: What is larger: 134.59 or 134.6?

Although both modes returned the correct answer, that 134.6 is larger, the thinking mode took 12 times longer than the non-thinking one.

Thinking Mode

Non-thinking Mode

Our thoughts on Qwen 3

Speed

We tried a few coding and math prompts on Qwen 3 with Ollama’s verbose mode. In terms of speed, we saw strong performance at 23-25 tokens per second on our NVIDIA H100 SXM setup.

Accuracy

Qwen 3 answered most of our prompts correctly, including generating Python code for Snake and Tetris games.
 
However, it struggled with the prompt below.
 
Prompt: "write a Python program that shows a ball bouncing inside a spinning hexagon. The ball should be affected by gravity and friction, and it must bounce off the rotating walls realistically"
 
The generated Python code produced an animation in which the ball bounced outside the hexagon.

Qwen 3 Coding Performance

Reasoning

Qwen 3’s hybrid operation (thinking and non-thinking modes) lets users turn on thinking only for genuinely hard problems. However, Qwen 3 is prone to “overthinking”: it tends to reason for too long even on fairly straightforward prompts.
 
For example, for the math problem below, Qwen 3 reasoned for several minutes longer than DeepSeek R1 70B Distill.

Qwen 3 Hybrid

Qwen 3 is an impressive step forward for open-source AI. It’s fast, flexible, and capable of handling everything from simple queries to complex reasoning, thanks to its hybrid architecture. Running the 235B model on Ori’s H100 GPU instances with Ollama was smooth and efficient, even with its hefty requirements. The ability to toggle between "thinking" and "non-thinking" modes gives users control over speed and depth, though it’s clear the model can sometimes overthink when it doesn’t need to. For teams looking to experiment, build, or deploy powerful AI models on secure infrastructure, Qwen 3 on Ori is a solid combination.
 

Chart your own AI reality with Ori

Ori Global Cloud provides flexible infrastructure for any team, model, and scale. Backed by top-tier GPUs, performant storage, and AI-ready networking, Ori enables growing AI businesses and enterprises to deploy their AI models and applications in a variety of ways:
 

