Product updates

Introducing Ori Inference Endpoints

Say hello to Ori Inference Endpoints, an easy and scalable way to deploy state-of-the-art machine learning models as API endpoints.

We’re excited to announce Ori’s newest AI infrastructure service, Ori Inference Endpoints. Designed for rapidly growing AI businesses and enterprises, it blends simplicity, flexibility, and cost-efficiency to help you deliver AI that delights your customers.
 

Effortless AI inference at any scale

Deploy the model of your choice: Whether it’s Llama 3, Qwen or Mistral, deploying a multi-billion parameter model is just a click away. 

Select a GPU and region: Unlock seamless inference by serving your models on NVIDIA H100 SXM, H100 PCIe, L40S or L4 GPUs, with more GPU options coming soon, and deploying in a region that helps minimize latency for your users. Not sure which GPU suits your needs? We’ll recommend one, helping you balance model performance and GPU costs.

Autoscale without limits: Ori Inference Endpoints dynamically scales up or down based on demand. Specify the minimum and maximum number of replicas you expect to need to serve your requests, and we’ll handle the scaling for you. You can also scale all the way down to zero, helping you reduce GPU costs when your endpoints are idle.
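
To make the scaling model concrete, here is a minimal sketch; the setting names and values below are hypothetical illustrations, not Ori’s actual configuration schema:

    # Hypothetical illustration of autoscaling settings (not Ori's actual schema).
    # An endpoint scales between the minimum and maximum replica counts based on
    # demand; a minimum of 0 lets it scale to zero and stop incurring GPU costs
    # while the endpoint is idle.
    autoscaling = {
        "min_replicas": 0,   # scale to zero when there is no traffic
        "max_replicas": 4,   # upper bound on replicas under peak load
    }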

Optimized for quick starts: Model loading is designed for near-instant launches, so scaling stays fast even when starting from zero.

HTTPS-secured API endpoints: Experience peace of mind with HTTPS endpoints and authentication to keep them safe from unauthorized use.
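
For a sense of what this looks like from the client side, here is a minimal sketch of calling an HTTPS endpoint with an auth token; the URL, payload shape, and token handling are hypothetical placeholders, not Ori’s documented API:

    # Minimal sketch of calling an HTTPS-secured inference endpoint with a token.
    # The URL, payload shape, and token value are hypothetical placeholders;
    # see Ori's technical documentation for the actual API.
    import requests

    ENDPOINT_URL = "https://your-endpoint.example.com/v1/generate"  # placeholder
    API_TOKEN = "YOUR_API_TOKEN"  # requests without a valid token are rejected

    response = requests.post(
        ENDPOINT_URL,
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        json={"prompt": "Summarise this product update in one sentence."},
        timeout=30,
    )
    response.raise_for_status()
    print(response.json())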

Pay for what you use, by the minute: Starting at $0.021/min, our per-minute pricing helps you keep your AI infrastructure affordable and costs predictable. No long-term commitments, just transparent, usage-based billing.
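
As a back-of-the-envelope example of how per-minute billing adds up (the $0.021/min figure is the published starting price; the usage pattern is purely illustrative):

    # Rough cost estimate at the starting rate of $0.021 per minute.
    # The usage pattern (8 active hours a day, scaled to zero the rest of the
    # time) is an illustrative assumption, not a measurement.
    rate_per_minute = 0.021        # USD, starting price
    active_hours_per_day = 8       # hours the endpoint is serving traffic
    days_per_month = 30

    active_minutes = active_hours_per_day * 60 * days_per_month
    monthly_cost = active_minutes * rate_per_minute
    print(f"Estimated monthly cost: ${monthly_cost:,.2f}")  # about $302.40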


Why you’ll love Ori Inference Endpoints

  • No DevOps expertise needed: Ori Inference Endpoints’ one-click deploy makes it easy to run and scale AI inference without the overhead associated with infrastructure management. 

  • Designed to delight your users: Deliver low-latency experiences with high-performance GPUs, and ensure high availability by scaling from zero to thousands of GPUs.

  • Scale your inference, not your costs: Scale up when you need to, scale to zero when you don’t. Ori’s per-minute pricing model helps your infrastructure budget stay optimized. Whether for steady-state workloads or bursty demand, you’ll only pay for what you use.

See Inference Endpoints in action

Check out how easy it is to deploy and scale an AI/ML inference endpoint.

 
 
Find out more about Ori Inference Endpoints in our technical documentation.

Launch powerful AI models with just a few clicks

Serve state-of-the-art AI models to your users in minutes, without breaking your infrastructure budget. Looking to scale inference across thousands of GPUs? Contact our sales team.

 
