Tutorial

How to run Qwen 3 235B on a cloud GPU

Discover how to deploy Qwen 3 235B model with Ollama and OpenWebUI on a cloud GPU and check out our model analysis.

Alibaba’s Qwen series of AI models has rapidly emerged as a strong open-source alternative to state-of-the-art (SOTA) models, often rivaling, and in some benchmarks exceeding, their performance. The latest generation, Qwen 3, is a versatile family of generative AI models that blends high performance with broad accessibility. These models are designed with hybrid reasoning capabilities: they handle simple tasks efficiently and dynamically shift gears to tackle more complex problems. The Qwen 3 lineup includes both dense and Mixture-of-Experts (MoE) architectures, ranging from 0.6 billion to 235 billion parameters, all available under the permissive Apache 2.0 license.
 
Here’s a brief overview of Qwen 3’s key specifications:

Architecture: Dense and Mixture-of-Experts (MoE) Transformers; hybrid reasoning modes (thinking and non-thinking)
Parameters: Dense: 0.6B, 1.7B, 4B, 8B, 14B, 32B; MoE: 30B (3B active), 235B (22B active)
Model variants: Dense, MoE
Context length: Dense (0.6B-4B): 32K tokens; Dense (8B-32B) and MoE: 128K tokens
Licensing: Apache 2.0 (commercial and research use)

Performance benchmarks from Artificial Analysis indicate that Qwen 3 235B A22B compares well with other top-of-the-line models from OpenAI, Google and DeepSeek.

Qwen 3 LLM Performance

 

How to run Qwen 3 with Ollama

Pre-requisites

Create a GPU virtual machine (VM) on Ori Global Cloud. We chose a setup with 4x NVIDIA H100 SXM GPUs and Ubuntu 22.04 as our OS; however, 2x H100s are enough, since Ollama needs about 143 GB of VRAM to run Qwen 3 235B.
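Once the VM is up, you can sanity-check that your GPUs add up to enough memory for the model. Here is a minimal sketch (assuming the NVIDIA drivers are installed; it falls back gracefully elsewhere):

```shell
# Sum the total VRAM across visible GPUs; compare it against the ~143 GB
# Ollama needs for Qwen 3 235B (figure from this article, not a hard limit)
if command -v nvidia-smi >/dev/null 2>&1; then
  total_mib=$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits \
    | awk '{sum += $1} END {print sum}')
  echo "Total VRAM: ${total_mib} MiB"
else
  echo "nvidia-smi not found; run this on the GPU VM"
fi
```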

  

 

Quick tip
Use the init script when creating the VM so that NVIDIA CUDA drivers, frameworks such as PyTorch or TensorFlow, and Jupyter notebooks come preinstalled.

 

  

Step 1: SSH into your VM, install Python and create a virtual environment
apt install python3.12-venv
python3.12 -m venv qwen-env
 
Step 2: Activate the virtual environment
source qwen-env/bin/activate
 
Step 3: Install Ollama and specify the number of GPUs to be used
curl -fsSL https://ollama.com/install.sh | sh
export OLLAMA_GPU_COUNT=4
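Note that environment variables exported in your shell only affect processes launched from that shell; if Ollama runs as a background service, the variable may need to be set there instead. As a more general way to control which GPUs a process sees, CUDA honors the standard CUDA_VISIBLE_DEVICES variable (a CUDA-wide mechanism, not specific to Ollama):

```shell
# Restrict the process to GPUs 0-3 (indices as reported by nvidia-smi)
export CUDA_VISIBLE_DEVICES=0,1,2,3
echo "Using GPUs: ${CUDA_VISIBLE_DEVICES}"
```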
 
Step 4: Run Qwen 3 235B with Ollama
ollama run qwen3:235b --verbose
 

Here’s what our setup looks like with Ollama running


Qwen 3 system VRAM memory requirements

Step 5: In another terminal window, install Open WebUI on the VM and run it
pip install open-webui
open-webui serve
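If you want Open WebUI to keep serving after you close the SSH session, one common approach is nohup (a sketch; tmux or screen work just as well):

```shell
# Launch Open WebUI detached from the terminal, logging to a file;
# skip gracefully if it isn't installed yet
if command -v open-webui >/dev/null 2>&1; then
  nohup open-webui serve > openwebui.log 2>&1 &
  echo "Open WebUI started in the background (PID $!)"
else
  echo "open-webui not installed; run 'pip install open-webui' first"
fi
```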
 
Step 6: Access Open WebUI in your browser through the default port 8080.
http://<VM-IP>:8080/
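If your VM's firewall doesn't expose port 8080 publicly, you can tunnel it over SSH instead and browse to http://localhost:8080/ on your own machine. The snippet below just prints the command to run (the "ubuntu" username is an assumption; use your VM's login user):

```shell
# Forward local port 8080 to the Open WebUI port on the VM.
# -N means "no remote command", i.e. the connection only carries the tunnel.
VM_IP="<VM-IP>"   # replace with your VM's public IP
tunnel_cmd="ssh -N -L 8080:localhost:8080 ubuntu@${VM_IP}"
echo "$tunnel_cmd"
```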
 
Click “Get Started” to create an Open WebUI account if you haven’t set one up on this virtual machine before.

Qwen 3 Openwebui

Step 7: Choose qwen3:235b from the Models drop down and chat away!

Comparing Thinking and Non-Thinking modes

As a hybrid model, Qwen 3 235B A22B can switch between thinking and non-thinking modes. Append the “/think” or “/no_think” tag to your prompts to choose the mode you want.
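The same soft switches work over Ollama's HTTP API, which listens on port 11434 by default. A sketch that just assembles and prints the request; uncomment the curl line on the VM to actually send it:

```shell
# Ask a quick question in non-thinking mode via Ollama's /api/generate endpoint.
# "stream": false returns a single JSON object instead of a token stream.
PAYLOAD='{"model": "qwen3:235b", "prompt": "Which is larger: 134.59 or 134.6? /no_think", "stream": false}'
echo "$PAYLOAD"
# curl -s http://localhost:11434/api/generate -d "$PAYLOAD"
```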

Here is a comparison of thinking and non-thinking responses to our prompts.

Prompt: Compute the area of the region enclosed by the graphs of the equations y = x, y = 2x, and y = 6 - x. Use vertical cross-sections.

Qwen 3 got the answer (3) right in both modes. However, the thinking mode took far longer (4m 16s vs 15s), with the model second-guessing itself continuously.
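For reference, the answer of 3 checks out by hand: the three lines intersect pairwise at (0, 0), (2, 4) and (3, 3), so with vertical cross-sections the region splits at x = 2, with upper boundary y = 2x on [0, 2] and y = 6 - x on [2, 3], and lower boundary y = x throughout:

```latex
A = \int_{0}^{2} (2x - x)\,dx + \int_{2}^{3} \bigl((6 - x) - x\bigr)\,dx
  = \left[\tfrac{x^2}{2}\right]_{0}^{2} + \left[\,6x - x^2\,\right]_{2}^{3}
  = 2 + (9 - 8) = 3
```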

Thinking Mode

Qwen 3 Math

Non-thinking Mode

Qwen3 Math

 

Prompt: What is larger: 134.59 or 134.6?

Although both modes returned the correct answer, that 134.6 is larger, the thinking mode took 12 times longer than the non-thinking one.

Thinking Mode

Non-thinking Mode

Our thoughts on Qwen 3

Speed

We tried a few coding and math prompts on Qwen 3 with Ollama’s verbose mode. In terms of speed, we saw strong performance at 23-25 tokens per second on our NVIDIA H100 SXM setup.

Accuracy

Qwen 3 answered most of our prompts correctly, including generating Python code for Snake and Tetris games.
 
However, it struggled with the prompt below.
 
Prompt: "write a Python program that shows a ball bouncing inside a spinning hexagon. The ball should be affected by gravity and friction, and it must bounce off the rotating walls realistically"
 
The generated Python code produced an animation in which the ball bounced outside the hexagon.

Qwen 3 Coding Performance

Reasoning

Qwen 3’s hybrid operation (thinking and non-thinking modes) lets users turn on thinking only for genuinely hard problems. However, Qwen 3 is prone to “overthinking”: it tends to reason for too long even on fairly straightforward prompts.
 
For example, for the math problem below, Qwen 3 reasoned for several minutes longer than DeepSeek R1 70B Distill.

Qwen 3 Hybrid

Qwen 3 is an impressive step forward for open-source AI. It’s fast, flexible, and capable of handling everything from simple queries to complex reasoning, thanks to its hybrid architecture. Running the 235B model on Ori’s H100 GPU instances with Ollama was smooth and efficient, even with its hefty requirements. The ability to toggle between "thinking" and "non-thinking" modes gives users control over speed and depth, though it’s clear the model can sometimes overthink when it doesn’t need to. For teams looking to experiment, build, or deploy powerful AI models on secure infrastructure, Qwen 3 on Ori is a solid combination.
 

Chart your own AI reality with Ori

Ori Global Cloud provides flexible infrastructure for any team, model, and scale. Backed by top-tier GPUs, performant storage, and AI-ready networking, Ori enables growing AI businesses and enterprises to deploy their AI models and applications in a variety of ways:
 

