Rust vs. Python for AI Infrastructure: Bridging a 3,400x Performance Gap
As enterprises move from generative AI pilots to production-scale agentic workflows, a new bottleneck has emerged: the performance tax. Every 100ms of latency can kill up to 7% of user engagement, yet current orchestration layers often force a choice between safety and speed.
When we began building VIDAI, our thesis was that an AI gateway should be "invisible", providing routing, guardrails, and telemetry without becoming the bottleneck. To test this, we benchmarked our Rust-native engine against the most popular gateways in the ecosystem: Bifrost (Go), LiteLLM (Python), and Portkey (Node.js).
For transparency, the scripts and configs we used are available here under the Apache License.
The Methodology: Transparency in Testing
To ensure a fair comparison, all gateways were hosted on a single bare-metal machine to eliminate network variance.
- Hardware: Intel Xeon E3-1240 v3 @ 3.40GHz (8 cores/8 threads) with 31GB RAM. This circa-2013 hardware was picked up on the LowEndTalk forums for 99 USD/year. (Not a production machine, but a good performance floor for these experiments.)
- The Backend: We used VidaiMock, a 7MB LLM-native Rust binary capable of 50,000+ RPS, to simulate realistic LLM behaviors like Server-Sent Events (SSE) and per-token timing. We had to build this to test our infrastructure, because existing mocks were not tuned for LLM testing. Most were too bloated and targeted at standard API testing, and we wanted something more invisible, more performant, and better suited to our use case.
Configuration & Feature Baseline
To understand the results, it is important to see what each gateway was actually doing during the test. While the other gateways were set to their absolute "minimal" proxy mode, VidaiServer was tested with its production features active. The config files in the GitHub repo are exactly as used.
| Gateway | Config Level | Functions Enabled | Runtime |
|---|---|---|---|
| Bifrost | Minimal | Pure passthrough routing | Go |
| Portkey | Minimal | Header-based routing | Node.js |
| LiteLLM | Minimal | Basic model mapping | Python |
| VidaiServer L1 | Base | Auth, API key validation, Rate limiting, Routing | Rust |
| VidaiServer L2 | Guardrails | All L1 + Guardrails | Rust |
| VidaiServer L3 | Enterprise | All L2 + PostgreSQL telemetry & 100% logging | Rust |
1. The 1,000x Performance Gap
Source Data: Vidai L1 adds +1.7ms overhead vs. LiteLLM at +5,788.8ms.
The data reveals a stark performance tiering based on memory management. At 5,000 RPS, the Rust and Go gateways operate with negligible overhead, while the interpreted runtimes collapse under the weight of Garbage Collection (GC) pauses and high-level serialization overhead.
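As a quick check on the headline figure, the ratio can be computed directly from the two overhead values quoted in the source data. This is only arithmetic on the published numbers; the variable names are illustrative and not taken from the benchmark scripts.

```python
# Added gateway overhead in milliseconds, as quoted in the source data.
vidai_l1_overhead_ms = 1.7
litellm_overhead_ms = 5788.8

# How many times more latency LiteLLM added compared with Vidai L1.
ratio = litellm_overhead_ms / vidai_l1_overhead_ms
print(f"LiteLLM adds about {ratio:,.0f}x the overhead of Vidai L1")
```

The result rounds to roughly 3,400x, which is where the figure in the title comes from.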
2. Scaling to 10,000+ RPS
Real-world production environments for agentic workflows require extreme density. We tested p95 latency as load increased from 500 to 12,000 RPS. The 3,400x gap is telling.
Source Data: VidaiServer and Bifrost maintain sub-50ms p95 latency at 10k RPS, while Portkey and LiteLLM exceed this threshold early. At 10,000+ RPS, Rust (Vidai) clearly edges ahead of Go (Bifrost).
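For readers less familiar with the metric, here is a minimal sketch of how a p95 value can be derived from raw latency samples using the nearest-rank definition. The sample values are hypothetical, and load-testing tools may use an interpolating definition instead, so their reported numbers can differ slightly from this one.

```python
import math

def percentile(samples_ms, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]

# Hypothetical latencies (ms): 18 fast requests and 2 slow outliers.
latencies_ms = [4.0] * 18 + [250.0] * 2

p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
print(f"p50={p50}ms p95={p95}ms")
```

In this made-up sample the median stays at 4ms while p95 lands on the 250ms outliers, which is why tail percentiles are the more useful signal for a gateway under load.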
3. The "Inverted Catastrophe"
A counter-intuitive discovery emerged when testing gateways against "fast" (50ms) vs. "slow" (500ms) backends.
⚠️ Portkey excluded - does not forward custom headers for latency injection
⚠️ Medium (200ms) profile was not run due to an oversight. Only fast and slow profiles were tested.
In a typical system, a faster backend improves total response time. However, for interpreted gateways like LiteLLM, a fast backend triggered a catastrophe: overhead jumped from 9 seconds to 20 seconds. This is a classic manifestation of the Global Interpreter Lock (GIL) bottleneck: as response frequency increases, the overhead of acquiring the lock for JSON parsing becomes a serial choke point that cannot be solved by adding more threads. In short, the faster your LLM is, the more your gateway slows you down.
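The serial choke point described above can be illustrated with a toy queueing model. This is a sketch under stated assumptions, not a measurement of LiteLLM or of the GIL itself: it assumes a single serial worker with a fixed per-response cost, and the intervals and service time are hypothetical numbers chosen only to show the shape of the effect. With fixed concurrency, a faster backend returns responses more often, which shortens the arrival interval at the serial step.

```python
def mean_queue_wait_ms(arrival_interval_ms, service_ms, n_requests):
    """Mean time responses spend waiting for one serial worker
    (deterministic FIFO queue with evenly spaced arrivals)."""
    worker_free_at = 0.0
    total_wait = 0.0
    for i in range(n_requests):
        arrival = i * arrival_interval_ms
        start = max(arrival, worker_free_at)  # wait if the worker is busy
        total_wait += start - arrival
        worker_free_at = start + service_ms
    return total_wait / n_requests

# Hypothetical: the serial step costs 1 ms per response.
SERVICE_MS = 1.0

# Slow backend: a response arrives every 2 ms, so the worker keeps up.
slow_backend_wait = mean_queue_wait_ms(2.0, SERVICE_MS, 10_000)
# Fast backend: a response arrives every 0.5 ms, so the backlog grows.
fast_backend_wait = mean_queue_wait_ms(0.5, SERVICE_MS, 10_000)

print(slow_backend_wait, fast_backend_wait)
```

Once responses arrive faster than the serial step can process them, the wait grows with every additional request instead of settling, which matches the direction of the effect reported here even though the real numbers depend on the gateway's internals.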
The Battle of High-Performers: Why Rust is Flatter than Go
| RPS Level | VidaiServer L1 (Rust) | Bifrost (Go) | Gap (ms) |
|---|---|---|---|
| 5,000 RPS | ~4ms p95 | ~4ms p95 | Negligible |
| 10,000 RPS | ~20ms p95 | ~35ms p95 | +15ms |
| 12,000 RPS | ~35ms p95 | ~50ms p95 | +15ms |
While the gap between VidaiServer and Python is a chasm, the race against Bifrost (Go) was a technical sprint. At 10,000+ RPS, the data shows VidaiServer maintaining a flatter, more deterministic latency curve compared to Bifrost's gradual climb.
- Garbage Collection (GC) Jitter: Go relies on a background garbage collector. At extreme throughput (10k+ RPS), the sheer volume of short-lived objects forces the GC to work harder, leading to microscopic "Stop-the-World" pauses that manifest as unpredictable spikes in tail latency. Rust’s compile-time memory management removes this background overhead entirely.
- True Parallelism vs. Goroutines: VidaiServer leverages the tokio runtime to manage thousands of concurrent LLM streams using stackless futures. This provides higher "Work-per-Core" density compared to Go's stackful goroutines, which begin to suffer from increased context-switching costs at the edge of saturation.
- Zero-Cost Abstractions: Unlike Go, where runtime interface lookups can add overhead, Rust’s abstractions compile down to lean machine code. This allows VidaiServer to perform active authentication and rate limiting while remaining faster than a pure passthrough proxy in other runtimes.
4. Are Enterprise Features "Free"?
We isolated the cost of safety and visibility by moving from Vidai L1 to L3. Guardrails (L2) proved nearly free in Rust, adding less than 0.1ms. Even with PostgreSQL telemetry (L3), the overhead only increased by ~9ms. By utilizing Rust’s zero-cost abstractions, VIDAI performs complex regex guardrails and async, non-blocking writes without a runtime performance tax. The user-facing request path remains untouched while telemetry is offloaded to background tasks with near-hardware efficiency.
| RPS | L1 p95 | L2 p95 | L3 p95 | L2 overhead | L3 overhead |
|---|---|---|---|---|---|
| 5K | ~4ms | ~4ms | ~4ms | ~0ms | ~0ms |
| 8K | ~10ms | ~10ms | ~11ms | ~0ms | ~1ms |
| 10K | ~20ms | ~22ms | ~25ms | ~2ms | ~5ms |
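The pattern of keeping telemetry off the request path can be sketched as a queue drained by a background task. This is an illustration of the general technique only, written in Python with asyncio rather than Rust and tokio; it is not VidaiServer's implementation. The names `telemetry_writer`, `handle_request`, and `sink` are hypothetical, and `sink` is a plain list standing in for a PostgreSQL insert.

```python
import asyncio

async def telemetry_writer(queue, sink, batch_size=100):
    """Background task: drain records in batches.
    `sink` is a stand-in for a database write."""
    while True:
        batch = [await queue.get()]
        # Opportunistically batch whatever is already queued.
        while len(batch) < batch_size and not queue.empty():
            batch.append(queue.get_nowait())
        sink.append(batch)
        for _ in batch:
            queue.task_done()

async def handle_request(request_id, queue):
    """Request path: enqueue a record and return without awaiting any write."""
    response = f"response-{request_id}"
    # Unbounded queue is a simplification; a real system needs a
    # backpressure policy for when the writer falls behind.
    queue.put_nowait({"request_id": request_id, "status": 200})
    return response

async def main():
    queue = asyncio.Queue()
    sink = []
    writer = asyncio.create_task(telemetry_writer(queue, sink))
    responses = [await handle_request(i, queue) for i in range(250)]
    await queue.join()  # demo only: wait until every record is flushed
    writer.cancel()
    await asyncio.gather(writer, return_exceptions=True)
    return responses, sink

responses, batches = asyncio.run(main())
print(len(responses), sum(len(b) for b in batches))
```

The design point is that the caller's latency depends only on the enqueue, while write cost is amortized across batches in the background.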
5. Learning from the Industry: The Kong Comparison
Our findings regarding the performance ceilings of Python and Node.js are independently corroborated by Kong’s 2025 AI Gateway Benchmark. Their testing showed that C/Lua-based architectures achieved over 800% higher throughput than LiteLLM. However, a direct comparison reveals that VidaiServer’s Rust core pushes efficiency even further.
The Efficiency Gap: Work vs. Resources
To understand the difference, we must look at Work-per-Core. Kong’s benchmarks were conducted on modern AWS EKS clusters using 12 dedicated vCPUs in "pure-proxy" mode (no auth or policies). In contrast, VidaiServer was tested on a decade-old, bare-metal CPU shared with the load generator and database, while performing active authentication and rate limiting.
| Metric | Kong AI Gateway (Lua/C) | VidaiServer (Rust) |
|---|---|---|
| Hardware Context | Modern AWS c5.4xlarge (16 vCPU) | Legacy Xeon E3-1240 v3 (8 Core) |
| Resource Logic | Dedicated / Isolated | Shared / Contended |
| Workload | Pure Passthrough | Auth + Rate Limiting + Routing |
| Throughput Density | ~670 RPS / Core (Extrapolated) | ~1,250+ RPS / Core |
In fact, when we tested VidaiServer on an OVH VPS-5, with k6 running independently on a separate VPS-3, we were able to push 6,000+ RPS per core without breaking a sweat, easily reaching 29,000+ RPS. Beyond that point, we suspect the load generator was already saturated (we had only used 500% CPU on the main machine, which had plenty of additional cores to spare). That is very high density.
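The per-core figures can be reproduced with simple division. The sketch below makes two assumptions that are not stated explicitly in the text: that the ~1,250 RPS/core figure comes from 10,000 RPS spread over the 8 cores of the bare-metal machine, and that 500% CPU on the OVH machine corresponds to roughly 5 busy cores. The Kong value is the extrapolated figure from the table.

```python
def rps_per_core(total_rps, cores_used):
    """Throughput density: requests per second handled per busy core."""
    return total_rps / cores_used

# Assumption: 10,000 RPS over the 8 cores of the bare-metal Xeon.
bare_metal = rps_per_core(10_000, 8)
# Assumption: 29,000 RPS at ~500% CPU, i.e. about 5 cores busy.
ovh = rps_per_core(29_000, 5)
# Extrapolated Kong figure from the comparison table (RPS per core).
kong_extrapolated = 670

print(f"bare metal: {bare_metal:.0f} RPS/core")
print(f"OVH:        {ovh:.0f} RPS/core")
print(f"bare metal vs Kong: {bare_metal / kong_extrapolated:.2f}x")
print(f"OVH vs bare metal:  {ovh / bare_metal:.2f}x")
```

Under these assumptions the bare-metal run comes out at about 1.87x Kong's extrapolated density, and the OVH run at about 4.6x the bare-metal figure.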
Why Rust Edges Out Lua
While Kong’s LuaJIT is exceptionally fast for traditional API management, it still relies on a Garbage Collector and a high-level scripting abstraction. VidaiServer’s Rust implementation removes these abstractions, providing:
- Zero-Cost Safety: Performing complex regex guardrails and API key validation without the overhead of a virtual machine or interpreter.
- Predictable Tail Latency: Avoiding the "jitter" common in garbage-collected languages, even under 90%+ CPU saturation.
- True Parallelism: VIDAI leverages the tokio runtime to dynamically schedule thousands of concurrent LLM streams across all cores using work-stealing efficiency. This outperforms the pre-forked worker-process model used in Nginx/Lua, which often suffers from imbalanced CPU utilization under heavy, asymmetric LLM traffic.
The Bottom Line: Kong is a powerful general-purpose engine, but VidaiServer is a purpose-built "performance specialist." We are achieving nearly double the throughput-per-core on hardware that is four generations older. As mentioned earlier, we reached about 6,000 RPS per core on the OVH setup with ease, which is more than 4x the bare-metal per-core figure.
Conclusion
Engineering is fundamentally about selecting the right tool for the specific job. No single gateway is the "best" for every stage of the AI lifecycle.
- For Development & Prototyping: LiteLLM and Portkey remain the gold standard for developer experience and rapid iteration. If your priority is immediate access to 100+ model providers and your throughput is low, these Python and Node.js-based tools are the correct choice.
- For High-Performance Routing: Bifrost is a formidable, highly capable Go-based alternative. It handles high-throughput routing with ease and is an ideal fit for teams already optimized for the Go ecosystem. While it showed a minor latency climb at extreme 10k+ RPS loads due to standard Garbage Collection (GC) behavior, it remains in the "excellent" performance tier.
- For Production-Scale Density: VIDAI is purpose-built for environments where the gateway must be "invisible". It is the choice for teams that need to layer on authentication, rate limiting, and regex guardrails without paying a "performance tax" or dealing with the GC jitter inherent in other runtimes.
When your agentic workflows scale to thousands of requests per second, the choice of infrastructure becomes the difference between a responsive application and a 20-second "catastrophe".
Notes on Methodology
| Aspect | Details |
|---|---|
| Self-hosted | All gateways on same machine (no cloud variance) |
| Shared resources | k6, gateways, VidaiMock, PostgreSQL compete for CPU/memory |
| PostgreSQL | Basic Docker container, default settings, no PgBouncer or other optimisations |
| VidaiMock latency | Configurable per-test (0ms for throughput, 50-500ms for realistic) |
| Step duration | 20 seconds per RPS level to reach steady state |
| Log scale | Y-axis spans 4 orders of magnitude (0.1ms to 10,000ms) |
Portkey Custom Headers Limitation
The version of Portkey tested does not forward custom headers to the backend. Tests using the X-Response-Size, X-Vidai-Latency, or X-Vidai-Chaos-Drop headers show Portkey receiving default VidaiMock responses instead of the configured test profiles. Custom header forwarding may be available only in the cloud version; at least, we could not figure out how to get it working. This affected our payload size tests, realistic latency tests, and chaos tests. These tests and charts are in the GitHub repo for reference.