Performance

Raw latency doesn't matter. Latency under load does.

Verified 21,803 requests per second at sub-2ms median on a single node, while Python alternatives fail below 1,000.

LATENCY vs LOAD slow fast low traffic 21,000 RPS Python proxy Vidai · sub-2ms, flat Everyone is fast at low traffic. Only one stays flat at the peak.

Why load is now the only number

A benchmark used to be a vanity metric. Agents made it a survival metric.

In the chat era, your AI traffic was paced by people: one prompt, one call, human typing speed as the ceiling. Agentic AI removes that ceiling. One task now fans out into ten or more autonomous calls, and the traffic through your in-path layer is climbing 10× and still rising. The layer in front of every AI request is no longer handling chat-pace load. It is handling agent-pace load, and its speed under that load is now your AI's speed.

See what 10× looks like at your scale →

1 Human era paced by people 2 Chat era paced by people 10× Agent era paced by the agent and still growing

What it means for the business

Fast enough that nobody notices it's there.

Every 100ms of added latency measurably reduces engagement. Vidai sits in the path of every AI request and adds deterministic, sub-millisecond overhead. It doesn't become the bottleneck when traffic spikes 10×.

Performance is a property of being a control plane, not a feature you tune. What's a control plane? →

Verified benchmark

Under a 21,803 RPS stress test.

GatewayPeak RPSMedianp95p99Status
Vidai (Rust)21,8031.95 ms 23.5 ms40 ms Stable
Python proxies< 200197 ms 1.3 s6.0 s Failing
Cloud-managed gateways~ 640100 ms 395 ms1.2 s At limit

A single legacy 8-core node, an Intel Xeon E3-1240 v3 from 2013, chosen so the figure is a floor, not a best case. Full methodology and reproduction steps in the technical write-up.

Per-core scaling

Why a single node carries so much.

The whole-machine number above is the headline. The number underneath it is the more telling one. Each core in the test added roughly 7,000 requests per second of headroom in a linear scaling curve, up to about 29,000 RPS in total before shared-hardware contention (NIC, cache) prevented further increase. On the same harness, Python-based AI proxies bottlenecked around 75–110 RPS per core: a per-core efficiency gap of roughly 94×.

That gap is the reason a fleet of Python nodes does not bridge the difference: 94 of them, behind a load balancer, replicate the bottleneck 94 times. The fastest path through this is not more nodes, it's a faster runtime.

94× per-core throughput gap
Vidai ~7,000 Python proxy ~75–110
Requests per second per core. Same benchmark harness.

Infrastructure footprint

What decides your node count: headroom, or a ceiling.

Nobody runs 21,803 requests per second sustained; that is a stress-test peak. The number that matters is what one node clears with room to spare. With that much headroom, real traffic, even agent-pace traffic, fits inside one node, so you add nodes for redundancy, not to keep up. The slower the runtime, the sooner each node redlines, and then node count is set by load, not by choice.

See how it scales out →

Adding nodes is never free. "Just load-balance it" is a real answer, with a real bill: each node is compute, load-balancer capacity, and one more thing to patch and page for. The question is whether you add nodes by choice or by necessity.
Vidai nodes scale for HA, not headroom. A single node already carries the load with margin. The second node is the one you fail over to, not the one that stops you redlining.
A redlining node has no margin for the spike. When traffic jumps 10× for an hour, headroom absorbs it. A node already near its ceiling has nowhere to put it, so the fleet has to be sized for the peak, all the time.
Efficiency per node is the lever. Cut what one node clears and you multiply the fleet, the load balancer and the on-call surface with it. Agent-pace traffic only sharpens that maths.

Why each node carries more

The headroom is a property of a Rust hot path.

The per-node margin that lets you scale by choice is not a tuning trick. It falls out of language and architecture choices that hold the same shape under 10× the load.

See the platform-engineering use case →

No garbage collection. No tail-latency spikes when the collector runs mid-request.
No GIL. Real multi-core throughput on every node, not a single-threaded process that scales only by adding more processes.
Zero-copy networking. The request isn't re-serialised three times on the way through.
~25MB engine. Small footprint, low hardware bill, trivial to replicate.
Vidai Mesh In progress Scale out for HA by adding a server; peers gossip directly, with no Redis or coordination tier to stand up. More on the engineering page →

Want the benchmark methodology.

A 20-minute technical walkthrough, including how to reproduce the numbers.