Alibaba’s Qwen series of AI models has rapidly emerged as a strong open-source alternative to state-of-the-art (SOTA) models, often rivaling, and on some benchmarks exceeding, their performance. The latest generation, Qwen 3, is a versatile family of generative AI models that blends high performance with broad accessibility. These models are designed with hybrid reasoning capabilities, allowing them to handle simple tasks efficiently while dynamically shifting to tackle more complex problems. The Qwen 3 lineup includes both dense and Mixture-of-Experts (MoE) architectures, ranging from 0.6 billion to 235 billion parameters, all available under the permissive Apache 2.0 license.
Here’s a brief overview of Qwen 3’s key specifications:
| Qwen 3 | |
|---|---|
| Architecture | Dense and Mixture-of-Experts (MoE) Transformers; hybrid reasoning modes (thinking & non-thinking) |
| Parameters | Dense: 0.6B, 1.7B, 4B, 8B, 14B, 32B; MoE: 30B (3B active), 235B (22B active) |
| Model variants | Dense, MoE |
| Context length / generation length | Dense (0.6B-4B): 32K tokens; Dense (8B-32B) & MoE: 128K tokens |
| Licensing | Apache 2.0 |
Performance benchmarks from Artificial Analysis indicate that Qwen 3 235B A22B compares well with other top-of-the-line models from OpenAI, Google, and DeepSeek.

How to run Qwen 3 with Ollama
Prerequisites
Create a GPU virtual machine (VM) on Ori Global Cloud. We chose a setup with 4x NVIDIA H100 SXM GPUs and Ubuntu 22.04 as our OS; however, 2x H100s are enough, since Ollama needs about 143 GB of VRAM to run Qwen 3 235B.
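That 143 GB is roughly the weight memory of the quantized model. A back-of-envelope estimate, assuming roughly 4.85 bits per weight for Ollama's default q4_K_M quantization (KV cache and runtime overhead add more on top):

```python
# Rough VRAM estimate for a quantized model's weights alone.
# bits_per_weight ~4.85 is an assumption for q4_K_M-style quantization.
def estimate_vram_gb(params_billion: float, bits_per_weight: float = 4.85) -> float:
    """Approximate weight memory in GB for a quantized model."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

print(round(estimate_vram_gb(235), 1))  # → 142.5
```

This lands close to the ~143 GB Ollama actually allocates, which is why two 80 GB H100s suffice.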
Step 1: Create a Python virtual environment
apt install python3.12-venv
python3.12 -m venv qwen-env
Step 2: Activate the virtual environment
source qwen-env/bin/activate
Step 3: Install Ollama and specify the number of GPUs to be used
curl -fsSL https://ollama.com/install.sh | sh
export OLLAMA_GPU_COUNT=4
Step 4: Run Qwen 3 235B with Ollama
ollama run qwen3:235b --verbose
Here’s what our setup looks like with Ollama running:

Step 5: Install Open WebUI on the VM via another terminal window and run it
pip install open-webui
open-webui serve
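Open WebUI is optional: Ollama also serves a REST API on localhost port 11434 that you can script against directly. Here's a minimal sketch using only the Python standard library; it assumes the Ollama server from Step 4 is running and the model name matches the one pulled above:

```python
import json
import urllib.request

def build_request(prompt: str, model: str = "qwen3:235b") -> bytes:
    """Build the JSON body for Ollama's /api/generate endpoint."""
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()

def generate(prompt: str) -> str:
    """Send a prompt to the local Ollama server and return the full response."""
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=build_request(prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# generate("Why is the sky blue?")  # requires the Ollama server to be up
```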
Step 6: Access Open WebUI in your browser through the default 8080 port.
http://<VM-IP>:8080/
Click “Get Started” to create an Open WebUI account, if you haven’t set one up on this virtual machine before.

Step 7: Choose qwen3:235b from the Models dropdown and chat away!
Comparing Thinking and Non-Thinking modes
Being a hybrid model, Qwen 3 235B A22B can switch between thinking and non-thinking modes. Append the “/think” or “/no_think” tag to your prompts to choose the mode you want to use.
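If you're scripting prompts rather than typing them into a chat window, a tiny helper (our own illustration, not part of any Qwen or Ollama API) keeps the tagging consistent:

```python
# Append Qwen 3's soft mode switch ("/think" or "/no_think") to a prompt
# before sending it to the model.
def with_mode(prompt: str, thinking: bool) -> str:
    return f"{prompt} {'/think' if thinking else '/no_think'}"

print(with_mode("What is larger: 134.59 or 134.6?", thinking=False))
```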
Here is a comparison of thinking and non-thinking responses to our prompt.
Prompt: Compute the area of the region enclosed by the graphs of the equations y = x, y = 2x, and y = 6 - x. Use vertical cross-sections.
Qwen 3 got the answer (3) right in both modes. However, thinking mode took far longer (4m 16s vs. 15s), with the model second-guessing itself continuously.
Thinking Mode

Non-thinking Mode

Prompt: What is larger: 134.59 or 134.6?
Although both modes returned the correct answer that 134.6 is larger, the thinking mode took 12 times longer than the non-thinking one.
Thinking Mode

Non-thinking Mode

Our thoughts on Qwen 3
Speed
We tried a few coding and math prompts on Qwen 3 with Ollama’s verbose mode. In terms of speed, we saw strong performance at 23-25 tokens per second on our NVIDIA H100 SXM setup.
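Ollama derives that rate from two metrics it reports both in the `--verbose` summary and in the `/api/generate` response: `eval_count` (generated tokens) and `eval_duration` (in nanoseconds). The numbers below are illustrative values in the ballpark of what we observed, not a logged measurement:

```python
# tokens/sec = generated tokens / generation time in seconds,
# where eval_duration is reported in nanoseconds.
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    return eval_count / (eval_duration_ns / 1e9)

# Hypothetical example: 720 tokens generated in 30 seconds.
print(round(tokens_per_second(720, 30_000_000_000), 1))  # → 24.0
```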
Accuracy
Qwen 3 got most of our prompts right, such as writing Python code to generate Snake and Tetris games.
However, it struggled with the prompt below.
Prompt: "write a Python program that shows a ball bouncing inside a spinning hexagon. The ball should be affected by gravity and friction, and it must bounce off the rotating walls realistically"
The generated Python code produced a visual where the ball bounced outside the hexagon.
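For reference, the physics core of that prompt can be sketched briefly: keeping the ball inside comes down to reflecting its velocity off each wall correctly. Below is a minimal, hypothetical illustration of specular reflection, v' = v - 2(v·n)n, against a wall's unit normal; a full solution would also need to add the rotating wall's own velocity at the contact point, plus gravity and friction:

```python
# Reflect a 2D velocity vector v off a wall with unit normal n:
# v' = v - 2 (v . n) n
def reflect(v, n):
    dot = v[0] * n[0] + v[1] * n[1]
    return (v[0] - 2 * dot * n[0], v[1] - 2 * dot * n[1])

# A ball falling onto a horizontal floor (normal pointing up) bounces upward:
print(reflect((3.0, -4.0), (0.0, 1.0)))  # → (3.0, 4.0)
```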

Reasoning
Qwen 3’s hybrid operation (thinking and non-thinking) lets users turn on thinking mode only for very hard problems. However, Qwen 3 is prone to “overthinking”: it tends to reason for too long even on fairly straightforward prompts.
For example, for the math problem below, Qwen 3 reasoned for several minutes longer than DeepSeek R1 70B Distill.






Qwen 3 is an impressive step forward for open-source AI. It’s fast, flexible, and capable of handling everything from simple queries to complex reasoning, thanks to its hybrid architecture. Running the 235B model on Ori’s H100 GPU instances with Ollama was smooth and efficient, even with its hefty requirements. The ability to toggle between "thinking" and "non-thinking" modes gives users control over speed and depth, though it’s clear the model can sometimes overthink when it doesn’t need to. For teams looking to experiment, build, or deploy powerful AI models on secure infrastructure, Qwen 3 on Ori is a solid combination.
Chart your own AI reality with Ori
Ori Global Cloud provides flexible infrastructure for any team, model, and scale. Backed by top-tier GPUs, performant storage, and AI-ready networking, Ori enables growing AI businesses and enterprises to deploy their AI models and applications in a variety of ways: