Product updates

Introducing Ori Inference Endpoints

Say hello to Ori Inference Endpoints, an easy and scalable way to deploy state-of-the-art machine learning models as API endpoints.

We’re excited to announce Ori’s newest AI infrastructure service, Ori Inference Endpoints. Designed for rapidly growing AI businesses and enterprises, it blends simplicity, flexibility, and cost-efficiency to help you deliver AI that delights your customers.
 

Effortless AI inference at any scale

Deploy the model of your choice: Whether it’s Llama 3, Qwen or Mistral, deploying a multi-billion parameter model is just a click away. 

Select a GPU and region: Unlock seamless inference by serving your models on NVIDIA H100 SXM, H100 PCIe, L40S or L4 GPUs, with more GPU options coming soon, and deploying in a region that helps minimize latency for your users. Not sure which GPU suits your needs? We’ll recommend one, helping you balance model performance and GPU costs.

Autoscale without limits: Ori Inference Endpoints dynamically scales up or down based on demand. Specify the minimum and maximum number of replicas you expect to need to serve your requests, and we’ll handle the scaling for you. You can also scale all the way down to zero, helping you reduce GPU costs when your endpoints are idle.
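
To make the scaling model concrete, here is a minimal sketch; the setting names and values below are hypothetical illustrations, not Ori’s actual configuration schema:

    # Hypothetical illustration of autoscaling settings (not Ori's actual schema).
    # An endpoint scales between the minimum and maximum replica counts based on
    # demand; a minimum of 0 lets it scale to zero and stop incurring GPU costs
    # while the endpoint is idle.
    autoscaling = {
        "min_replicas": 0,   # scale to zero when there is no traffic
        "max_replicas": 4,   # upper bound on replicas under peak load
    }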

Optimized for quick starts: Model loading is designed for near-instant launches, so scaling stays fast even when starting from zero.

HTTPS-secured API endpoints: Experience peace of mind with HTTPS endpoints and authentication to keep them safe from unauthorized use.
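
For a sense of what this looks like from the client side, here is a minimal sketch of calling an HTTPS endpoint with an auth token; the URL, payload shape, and token handling are hypothetical placeholders, not Ori’s documented API:

    # Minimal sketch of calling an HTTPS-secured inference endpoint with a token.
    # The URL, payload shape, and token value are hypothetical placeholders;
    # see Ori's technical documentation for the actual API.
    import requests

    ENDPOINT_URL = "https://your-endpoint.example.com/v1/generate"  # placeholder
    API_TOKEN = "YOUR_API_TOKEN"  # requests without a valid token are rejected

    response = requests.post(
        ENDPOINT_URL,
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        json={"prompt": "Summarise this product update in one sentence."},
        timeout=30,
    )
    response.raise_for_status()
    print(response.json())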

Pay for what you use, by the minute: Starting at $0.021/min, our per-minute pricing helps you keep your AI infrastructure affordable and costs predictable. No long-term commitments, just transparent, usage-based billing.
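
As a back-of-the-envelope example of how per-minute billing adds up (the $0.021/min figure is the published starting price; the usage pattern is purely illustrative):

    # Rough cost estimate at the starting rate of $0.021 per minute.
    # The usage pattern (8 active hours a day, scaled to zero the rest of the
    # time) is an illustrative assumption, not a measurement.
    rate_per_minute = 0.021        # USD, starting price
    active_hours_per_day = 8       # hours the endpoint is serving traffic
    days_per_month = 30

    active_minutes = active_hours_per_day * 60 * days_per_month
    monthly_cost = active_minutes * rate_per_minute
    print(f"Estimated monthly cost: ${monthly_cost:,.2f}")  # about $302.40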


Why you’ll love Ori Inference Endpoints

  • No DevOps expertise needed: Ori Inference Endpoints’ one-click deploy makes it easy to run and scale AI inference without the overhead associated with infrastructure management. 

  • Designed to delight your users: Deliver low-latency experiences with high-performance GPUs, and ensure high availability by scaling from zero to thousands of GPUs.

  • Scale your inference, not your costs: Scale up when you need to, scale to zero when you don’t. Ori’s per-minute pricing model helps your infrastructure budget stay optimized. Whether for steady-state workloads or bursty demand, you’ll only pay for what you use.

See Inference Endpoints in action

Check out how easy it is to deploy and scale an AI/ML inference endpoint.

 
 
Find out more about Ori Inference Endpoints in our technical documentation.

Launch powerful AI models with just a few clicks

Serve state-of-the-art AI models to your users in minutes, without breaking your infrastructure budget. Looking to scale inference across thousands of GPUs? Contact our sales team.

 
