Case study

nCompass (YC W24) leverages Ori Serverless Kubernetes to make AI inference 2x more cost-effective

Find out how Ori Serverless Kubernetes is helping nCompass run cost-effective LLM inference at scale.

Ori’s Serverless Kubernetes platform has been crucial in allowing us to dynamically scale our inference workloads while reducing costs. With Ori, we’ve been able to focus on growing our platform and making AI model inference accessible to everyone.

 
nCompass simplifies the hosting and acceleration of large-scale, open-source AI models. Founders Aditya Rajagopal and Diederik Vink, Imperial College London PhD graduates who specialized in hardware acceleration of AI models, quickly realized that inference API providers heavily rate-limit users, which hinders both the prototyping and deployment of AI solutions. By focusing on cost efficiency and scalability, nCompass aims to make AI model inference more accessible to developers and businesses.
 

Reimagining the inference API

After being part of the YC Winter 2024 batch, nCompass secured $1.7M in seed funding and has since focused on building its inference platform. nCompass’ new self-serve inference API, powered by proprietary technology, optimizes GPU utilization for inference, enabling users to scale up their AI services and handle hundreds of requests per second on a single GPU while cutting hardware expenses by up to 2x.

nCompass’ initial optimizations ensure that at high request rates, time-to-first-token (TTFT) doesn’t balloon the way it does with vLLM. Benchmark results shared by nCompass show up to 18x better TTFT than vLLM under heavy workloads. This means that as their user base grows, the availability and quality of service of their API offering won’t degrade. Alongside the API, nCompass also provides dedicated model instance deployments as well as on-prem deployments that minimize users’ GPU bills.

Source: Benchmark testing provided by nCompass


Ori Serverless Kubernetes, a balance of simplicity and flexibility

Although Kubernetes is a powerful tool and the industry standard for container orchestration, implementations can become complex very quickly. nCompass experimented with a few cloud providers before choosing Ori Serverless Kubernetes for its balance of simplicity and flexibility.

Unlike other GPU Kubernetes providers whose fully managed platforms create vendor lock-in, nCompass found Ori’s Vanilla Kubernetes with customization options a refreshing change. Ori Serverless Kubernetes’ flexibility is also helping nCompass plan future innovations, such as on-premises deployment with minimal changes to their API architecture.

With help from Ori’s client engineering team, we were able to deploy sophisticated clusters within a few weeks, which would have been impossible with bare-metal Kubernetes. At the same time, we kept our platform feature-rich, which is usually hard to achieve on other managed Kubernetes services.

 

Achieving inference economics at scale with Ori

The nCompass team started with virtual machines for internal testing and prototyping, but quickly realized that Ori Serverless Kubernetes was the right choice for achieving the economics of an inference API service: it combines access to a large pool of GPUs, usage-based pricing, and the ability to scale up and down as needed.

“Serverless Kubernetes was a natural fit for our API service; it allows us to scale compute capacity as demand bursts and optimize costs,” says Aditya.

nCompass also values the per-minute pricing model, which makes their operations more dynamic and lets them scale capacity in line with demand.
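As an illustration of how this kind of burst scaling is commonly expressed on Kubernetes, a Horizontal Pod Autoscaler manifest might look like the sketch below. All names and thresholds here are hypothetical, not nCompass’ actual configuration.

```yaml
# Hypothetical sketch: autoscale an LLM inference deployment on Kubernetes.
# The deployment name ("llm-inference") and thresholds are illustrative only.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference
  minReplicas: 1          # shrink to a single GPU pod when traffic is low
  maxReplicas: 8          # cap spend during demand bursts
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # add replicas when average CPU exceeds 70%
```

With per-minute billing, scaling the replica count down during quiet periods translates directly into lower GPU costs.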
  

Ori’s GPU costs have been very competitive and customer support has been superior to many other cloud providers we’ve tried.

 
As nCompass accelerates its go-to-market journey, the founders plan to expand their selection of LLMs (including closed-source models), extend their API service to vision models, and keep improving the economics of AI inference. Get started with the nCompass inference API!
 

Chart your own AI reality with Ori Serverless Kubernetes

Ori Serverless Kubernetes is an AI infrastructure service that combines powerful scalability, simple management, and affordability, enabling you to scale into the future of AI.


Similar posts

Join the new class of AI infrastructure! 

Build a modern GPU cloud with Ori to accelerate your AI workloads at scale.