At nCompass, we’re building IP that can reduce the cost of serving AI models at scale by 50%. We enable a single GPU to handle up to 4x more requests/s without degrading quality of service (QoS).

State-of-the-art serving systems such as vLLM publish benchmarks that demonstrate reasonable QoS when serving just 1-20 requests/s. When LLM inference is stressed at request rates above this range, response times on a single GPU degrade catastrophically. Currently, the only remedy is to scale out to more GPUs, which is expensive.
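You can observe this saturation behavior yourself. The sketch below (a minimal illustration, not our benchmark harness) fires requests at a fixed rate against an OpenAI-compatible streaming endpoint such as one served by vLLM, and reports the mean time-to-first-token; the endpoint URL, model name, and prompt are placeholders.

```python
# Minimal sketch: measure mean time-to-first-token (TTFT) while issuing requests
# at a fixed rate against an OpenAI-compatible streaming endpoint (e.g. vLLM).
# ENDPOINT and MODEL are placeholder assumptions, not a specific deployment.
import asyncio
import time

import aiohttp

ENDPOINT = "http://localhost:8000/v1/completions"   # assumed local server
MODEL = "meta-llama/Llama-3.1-8B-Instruct"           # placeholder model name


async def one_request(session: aiohttp.ClientSession) -> float:
    """Send one streaming completion and return the TTFT in seconds."""
    payload = {"model": MODEL, "prompt": "Hello", "max_tokens": 64, "stream": True}
    start = time.perf_counter()
    async with session.post(ENDPOINT, json=payload) as resp:
        async for _chunk in resp.content:  # first streamed chunk ~ first token
            return time.perf_counter() - start
    return float("inf")


async def run(request_rate: float, num_requests: int) -> None:
    """Open-loop load: launch requests at `request_rate` req/s, report mean TTFT."""
    async with aiohttp.ClientSession() as session:
        tasks = []
        for _ in range(num_requests):
            tasks.append(asyncio.create_task(one_request(session)))
            await asyncio.sleep(1.0 / request_rate)  # fixed inter-arrival time
        ttfts = await asyncio.gather(*tasks)
        print(f"{request_rate} req/s -> mean TTFT {sum(ttfts) / len(ttfts):.2f}s")


if __name__ == "__main__":
    asyncio.run(run(request_rate=50.0, num_requests=500))
```

Sweeping `request_rate` from ~10 up past 100 req/s makes the knee in the latency curve visible: TTFT stays flat until the GPU saturates, then climbs sharply.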

Reduce AI GPU Infrastructure Bills by 50%

Our hardware-aware request scheduler and Kubernetes autoscaler enable users to serve >100 requests/s on a single GPU without compromising QoS, as illustrated below.
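For intuition, here is a deliberately naive autoscaling loop that adds GPU replicas once the observed request rate exceeds a per-GPU capacity target. It is a hypothetical illustration only: the `get_request_rate()` helper, deployment name, namespace, and per-GPU threshold are assumptions, and this is not nCompass's scheduler or autoscaler.

```python
# Hypothetical sketch: scale a GPU-backed Deployment so that each replica
# handles at most REQS_PER_GPU requests/s. Metric collection is stubbed out.
import math
import time

from kubernetes import client, config

REQS_PER_GPU = 100           # assumed per-GPU capacity target
DEPLOYMENT = "llm-server"    # placeholder Deployment name
NAMESPACE = "default"


def get_request_rate() -> float:
    """Placeholder: return current aggregate requests/s (e.g. from Prometheus)."""
    raise NotImplementedError


def autoscale_loop(interval_s: float = 30.0) -> None:
    config.load_kube_config()
    apps = client.AppsV1Api()
    while True:
        # One replica per REQS_PER_GPU requests/s, never fewer than one.
        desired = max(1, math.ceil(get_request_rate() / REQS_PER_GPU))
        apps.patch_namespaced_deployment_scale(
            name=DEPLOYMENT,
            namespace=NAMESPACE,
            body={"spec": {"replicas": desired}},
        )
        time.sleep(interval_s)
```

The point of our approach is to push `REQS_PER_GPU` well above what stock serving stacks sustain, so the autoscaler needs far fewer replicas for the same traffic.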

Improve AI model responsiveness by up to 18x

We maintain a time-to-first-token (TTFT) of <1s on a single GPU while serving >100 requests/s, improving responsiveness by up to 18x compared to vLLM.