Tutorial

How to run Genmo Mochi 1 video generation on a cloud GPU

Discover how to deploy Genmo Mochi 1 with ComfyUI on an Ori GPU instance, and read our analysis of this new open source video generation model.

 

Video generation is the next frontier for generative AI. Unlike generating images or text, generative video is harder because it needs more compute, has fewer accessible training datasets, and involves more variables such as smooth motion, temporal coherence, and frame aesthetics.

Models like Llama, Pixtral, Flux, and many others have demonstrated how open-source AI drives faster and more widespread innovation across the field. That’s why Genmo’s announcement of the Mochi 1 model is a key step forward in advancing generative AI. 

Here's a snapshot of the Mochi 1 specifications:

Model architecture: Diffusion model built on the Asymmetric Diffusion Transformer (AsymmDiT) architecture
Parameters: 10B
Context window: 44,520 video tokens
Resolution: 480p
Frames: Up to 30 frames per second, with clips up to 5.4 seconds long
Licensing: Apache 2.0 (personal and commercial use)

Genmo AI's benchmark results showcase state-of-the-art (SOTA) performance in both prompt adherence and Elo scores. These metrics indicate how closely the output follows the user's prompt and how fluid the motion is.

[Figure: Mochi 1 prompt adherence and Elo benchmark results. Source: Genmo AI]
 


Deploy Genmo Mochi Video with ComfyUI on an Ori GPU instance

ComfyUI is an open-source AI tool developed and maintained by Comfy Org for running image and video generation models. Check out their GitHub repository.

Pre-requisites:

Create a GPU virtual machine (VM) on Ori Global Cloud. We chose the NVIDIA H100 SXM with 80 GB VRAM for this demo, but the optimized ComfyUI version also runs on GPUs with less memory. More GPU memory helps the variational autoencoder (VAE) decode without tiling; if memory is limited, ComfyUI automatically switches to tiled VAE decoding.
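Once you're connected to the VM over SSH, a quick sanity check (assuming the NVIDIA driver is already installed on the image) is to confirm the GPU is visible and see how much memory is free before you start installing:

# Show the GPU model plus total and used memory
nvidia-smi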

Install ComfyUI:

Step 1: Update packages and install Git
sudo apt update
sudo apt install git
 
Step 2: Download the ComfyUI files
git clone https://github.com/comfyanonymous/ComfyUI.git
cd ComfyUI
 
Step 3: If you didn't add the PyTorch init script when creating the virtual machine, install PyTorch
  pip install torch torchvision torchaudio
 
Step 4: Install dependencies
pip install -r requirements.txt
 
Step 5: Install ComfyUI Manager, which helps you manage your custom nodes and instance.
cd custom_nodes
git clone https://github.com/ltdrdata/ComfyUI-Manager.git
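Before launching ComfyUI, it's worth a quick check that the PyTorch build you installed can actually see the GPU:

# Should print True; if it prints False, reinstall PyTorch with CUDA support
python3 -c "import torch; print(torch.cuda.is_available())"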
 

Run Mochi 1 on ComfyUI

Step 1: Download the Mochi weights from Hugging Face
cd ComfyUI/models/diffusion_models
wget https://huggingface.co/Comfy-Org/mochi_preview_repackaged/resolve/main/split_files/diffusion_models/mochi_preview_bf16.safetensors
wget https://huggingface.co/Comfy-Org/mochi_preview_repackaged/resolve/main/split_files/diffusion_models/mochi_preview_fp8_scaled.safetensors
 
Step 2: Download the FP16 or FP8 text encoder, or both.
cd ComfyUI/models/clip
wget https://huggingface.co/Comfy-Org/mochi_preview_repackaged/resolve/main/split_files/text_encoders/t5xxl_fp16.safetensors
wget https://huggingface.co/Comfy-Org/mochi_preview_repackaged/resolve/main/split_files/text_encoders/t5xxl_fp8_e4m3fn_scaled.safetensors
 
Step 3: Download the VAE
cd ComfyUI/models/vae
wget https://huggingface.co/Comfy-Org/mochi_preview_repackaged/resolve/main/split_files/vae/mochi_vae.safetensors
 
Step 4: Run ComfyUI
 
python3.10 main.py --listen=0.0.0.0 --port=8080
ComfyUI uses port 8188 by default, but we configured it to use a port that is open by default on the SXM node for easier access.
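If you'd rather not expose the port publicly, an alternative sketch (assuming you connect as root and that <virtual machine address> is your VM's public IP) is to forward the port over SSH and then open http://localhost:8080 locally:

# Forward local port 8080 to the ComfyUI port on the VM
ssh -L 8080:localhost:8080 root@<virtual machine address>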
 
Step 5: In your browser, open http://<virtual machine address>:8080. Drag the wolf image from this link into your ComfyUI tab; it will automatically load the FP16 workflow.
 
Your ComfyUI workflow will resemble the image below. Adjust length (frame count), frames per second, and model iterations, then click "Queue Prompt" to generate the video.
 

[Screenshot: Mochi 1 workflow in ComfyUI]

How good is Genmo Mochi 1?

 
Mochi 1 Preview demonstrated strong prompt adherence and impressive video dynamics. Reducing the iteration count and frame count can significantly shorten video generation time, which could otherwise take 45 minutes for a 5-second clip at 200 iterations and 30 fps. We recommend testing the model several times to find a frame rate and iteration count that balance your needs and efficiency.
 
We observed that using detailed prompts—including specifics on camera angles, motion type, lighting, and environment—yielded better results. With the right prompts, Mochi 1 showed remarkable flexibility in frame aesthetics. While the motion occasionally appeared glitchy, it was generally smooth and fluid. The Genmo Mochi AI model also excelled in executing close-up shots with impressive clarity.
 
The model struggles with text insertion, a challenge that has also affected image generation models in the past. However, recent improvements in image models show promise, and we expect Genmo to enhance this feature in future updates. Each model iteration takes about 14 seconds, which means generating a 5-second clip with 200 iterations takes more than 45 minutes (14 × 200 = 2,800 seconds), whereas closed-source video generation models are typically much faster.
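As a rough back-of-the-envelope check before committing to a long run, you can estimate total generation time from the per-iteration figure (the ~14 seconds per iteration noted above is what we measured; your timing will vary with GPU and settings):

# Estimate total generation time: iterations x seconds per iteration
# 200 x 14 = 2800 seconds, i.e. roughly 46 minutes
echo "$((200 * 14)) seconds = $((200 * 14 / 60)) minutes"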
 
Though Genmo currently limits the model to 480p resolution and 5-second videos, we’re excited about their upcoming 720p version of the text-to-video model and the future of open source video generation. 
 
Here are some examples of video content generated by Mochi 1:
 
Prompt: A panda waiting for a train is reading a book in a picturesque setting of a train station during the tranquil Afternoon hours. Set against the backdrop of a serene summer landscape characterized by a Sunny atmosphere, enhance the scene with the inclusion of benches strategically placed to complement the surroundings. As the scene unfolds, use a gradual zoom in camera movement to gradually reveal the intricacies of the environment.

 
Prompt: 
A streetcar runs slowly through an avenue during autumn. Trees on both sides of the avenue are decked with fall colors. The camera focuses on the vivid details of the streetcar as it passes through the street. The lighting that comes through the trees creates a magical environment.
 
 
Prompt:
A movie trailer featuring a 30 year old woman on Mars who is walking in a cool spacesuit, cinematic style, shot on 35mm film. A large, circular building with glass windows is clearly visible in the backdrop. The scene is brightly lit with tones of red. The environment evokes awe and creates excitement.
 
 
Prompt: A large, majestic dragon with olive green scales and flaming red eyes is set against the backdrop of a serene, snowy valley. The video begins with a tracking shot of the valley and then the camera zooms in on the dragon highlighting the details on its face. To maintain visual clarity of this video, every element within the frame is crisp and discernible. 
 
 
 Prompt: A close up of cold pressed fruit juice being poured into a glass bottle.
 
 
Prompt: Floor tiles lit up in neon blue in the style of PCB wiring
 
 
Prompt: Top view of a vibrant rainforest with a macaw sitting on a tree. As the scene progresses, the camera zooms in on the macaw.
 
 
Prompt: A title screen with a rugged and grainy grey backdrop. The lighting creates an environment of wonder and excitement. The word “ORI” appears on the screen in slow motion and the camera moves closer to the text.
 
 
As you can see in the video above, Mochi 1 failed to generate the text “ORI”.
 
Genmo AI has also launched a playground that allows users to create a limited number of videos per day. Check out the community creations to see what others are making with this new AI text-to-video generator.
 
Mochi 1 is an excellent tool for generating short video clips with impressive motion quality. While it has some limitations, including lower resolution, challenges in text handling, and longer generation times, it marks a significant milestone in open source AI. As generative video continues to grow in popularity, we look forward to exploring Mochi 1 HD soon.
 

Alternative ways to run Mochi 1 Preview

If you want to run the full version as released by Genmo, with Gradio or directly from the CLI, check out these steps:
 

Genmo Mochi Requirements

Genmo recommends using 4 NVIDIA H100 GPUs to run the full version of Mochi 1. We suggest using FlashAttention to optimize memory usage.
 

Pre-requisites:

Create a GPU virtual machine (VM) on Ori Global Cloud. We chose 4x NVIDIA H100 PCIe GPUs with 320 GB of VRAM in total for this demo. Select the PyTorch + CUDA 12.6 init script when creating the virtual machine so you don't have to install them later.
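Once the VM is up, it's worth confirming that all four GPUs are visible before building anything (this assumes the NVIDIA driver is installed as part of the image or init script):

# List every GPU with its total memory; you should see four H100 PCIe entries
nvidia-smi --query-gpu=index,name,memory.total --format=csv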
 
Step 1: Create and activate a Python virtual environment
python3.10 -m venv mochi1-env
source mochi1-env/bin/activate
 
Step 2: Clone the Genmo models repository from GitHub
git clone https://github.com/genmoai/models
cd models
 
Step 3: Install dependencies:
pip install setuptools
pip install -r requirements.txt
sudo apt install ffmpeg
 
Step 4: Set up flash attention dependencies. This is optional and needed only if you are using flash attention to optimize memory usage.
pip install packaging
pip install ninja
 
Step 5: Build with flash attention, which compiles significantly faster on machines with more than 96 GB of RAM.
pip install -e .[flash] --no-build-isolation
 
Step 6: Download the weights from the Genmo Mochi Hugging Face link; this might take several minutes as the files are quite large.
python3 ./scripts/download_weights.py /root/models
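The weights are several large safetensors files, so it's worth confirming the download completed before moving on (the /root/models path matches the argument used in the command above):

# Check that the weights directory contains the expected multi-gigabyte files
ls -lhR /root/models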
 
Step 7: Run the model via Gradio UI or directly from the terminal. If you run the Gradio demo, enable public share mode in the demo Python file, as shown in the screenshot below.
[Screenshot: enabling public share mode in the Gradio demo file]
 
python3 ./demos/gradio_ui.py --model_dir /root/models
or
python3 ./demos/cli.py --model_dir /root/models  --num_steps 200

In CLI mode, you can change the text prompt, number of frames, and model iterations either as arguments or in cli.py. 
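For example, a CLI run with a custom prompt might look like the sketch below. Only --model_dir and --num_steps appear in the command above; treat --prompt and --num_frames as assumed flag names and confirm the exact arguments in demos/cli.py before relying on them:

# Sketch only: --prompt and --num_frames are assumed flag names, verify them in demos/cli.py
python3 ./demos/cli.py --model_dir /root/models --num_steps 100 \
  --prompt "A streetcar runs slowly through an avenue during autumn" \
  --num_frames 85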

The screenshot below shows sampling in progress and the average iteration time at the bottom right side of the terminal.

Step 8: Create a Jupyter server from another terminal window and download the video file in a browser window.
pip install notebook
jupyter notebook --port 8889 --allow-root --no-browser --ip=0.0.0.0
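If you'd prefer not to run a Jupyter server, copying the file to your local machine with scp also works. This is a sketch that assumes you connect as root; check where cli.py or the Gradio demo actually writes its .mp4 output:

# Run from your local machine: replace <path-to-video> with the .mp4 path printed by the demo
scp root@<virtual machine address>:<path-to-video> .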

If you chose Gradio, you should see a link in the terminal after you run the demo file.


Imagine another AI reality. Build it on Ori.

Ori Global Cloud is the first AI infrastructure provider with the native expertise, comprehensive capabilities and end-to-endless flexibility to support any model, team, or scale.
 

 
