It’s Tuesday, May 26th: Welcome to another edition of The Byte.

In this essay, Sriranjani Ramasubramanian argues that the future of economically viable AI agents depends less on better demos or model capability and more on inference infrastructure built for agentic workloads. Drawing on her background in Product at AMD, she explains that because agents require repeated planning, tool calls, memory retrieval, and long-running reasoning loops, their costs scale very differently from those of simple chatbots.

The Infrastructure Shift That Makes AI Agents Economically Viable

What Task Completion Rate Cannot See

by Sriranjani R.

Introduction

Many promising AI agent demos eventually collide with the same wall. The prototype works, the pilot looks good, real users start using it, context windows grow, tool calls multiply, and finally the inference bill becomes a business problem. A startup recently documented this: their fraud detection agent cost $5,000 per month serving 50 users. Four months later, with 500 users (roughly one-tenth of enterprise scale), the bill had tripled. The agent was working, but the economics weren’t.

This pattern is showing up across industries, and it is less a model or architecture problem than an infrastructure problem. The hardware layer is quietly solving it in ways that most operators building on top of it haven't registered yet. Understanding that shift is the difference between an agent product that scales and one that gets killed in a quarterly budget review.

Why Agents Break the Economics of Traditional Inference

When a user sends a single query to a chatbot, the compute cost is simple: one input, one output, one bill. Agents don't work that way. An agentic workflow chains together planning steps, tool calls, memory retrievals, and intermediate reasoning loops. Each one generates tokens and incurs cost. A task that looks like one request is actually ten or more model calls. Multi-agent orchestration multiplies this further. Moving from a single-agent system to a multi-agent system is often five to ten times more expensive, especially once you account for orchestration logic, failure handling, and shared context.
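The multiplication described above can be sketched as a small cost model. This is a minimal illustration, not a measurement: the per-million-token prices, the token counts, and the `Step` and `workflow_cost` names are all hypothetical, chosen only to show how one visible request turns into many billed calls with a growing context.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One model call inside a workflow (planning, tool call, retrieval, ...)."""
    input_tokens: int
    output_tokens: int


def workflow_cost(steps, usd_per_m_input, usd_per_m_output):
    """Total cost in USD of every model call in one workflow."""
    total = 0.0
    for step in steps:
        total += step.input_tokens / 1e6 * usd_per_m_input
        total += step.output_tokens / 1e6 * usd_per_m_output
    return total


# Hypothetical prices (USD per million tokens), for illustration only.
PRICE_IN, PRICE_OUT = 3.00, 15.00

# A chatbot turn: one call.
chatbot = [Step(input_tokens=500, output_tokens=300)]

# An agent task: ten calls, with the context growing by 1,000 tokens per
# step as history and tool results accumulate (assumed growth pattern).
agent = [Step(input_tokens=2_000 + 1_000 * i, output_tokens=300) for i in range(10)]

chat_cost = workflow_cost(chatbot, PRICE_IN, PRICE_OUT)
agent_cost = workflow_cost(agent, PRICE_IN, PRICE_OUT)
print(f"chatbot: ${chat_cost:.4f}  agent: ${agent_cost:.4f}  "
      f"ratio: {agent_cost / chat_cost:.1f}x")
```

Under these assumed numbers the gap is wider than the step count alone suggests, because every step re-reads a context that keeps growing.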

A deeper issue is concurrency. A chatbot serves users in small batches. An always-on agent, such as one that monitors a process, runs in the background, and responds to events, needs to be available continuously, alongside thousands of identical agents serving other users. This creates a fundamentally different demand profile: sustained, high-concurrency inference over long sessions rather than short, intermittent bursts. The infrastructure that works for a chatbot starts breaking down here, and the costs that look manageable at 50 users become existential at 5,000.

Source: Industry operator data (TechAhead, 2026); disaggregated efficiency estimates based on SemiAnalysis InferenceX v2 benchmarks. Illustrative projections - actual costs vary by model, context length, and provider.

The Two-Phase Problem Nobody Talks About

Every AI response, whether from a chatbot or an agent, goes through two fundamentally different computational stages. The first is prefill: the model reads the prompt, the conversation history, or the tool results and builds a compressed memory of it called a key-value (KV) cache. This stage is compute-intensive, and it happens fast. The second is decode: the model generates a response one token at a time, reading from that cache at every step. This stage is memory-intensive and runs as long as the output runs.

These two stages have almost nothing in common from an infrastructure perspective. Prefill is a parallel operation that benefits from raw processing speed. Decode is sequential, and it is bottlenecked by how fast memory can be accessed and how much of it is available. The longer the output, that is, the more an agent reasons, explains, or plans, the harder decode is to run efficiently.

Traditional serving infrastructure handles both stages on the same hardware, which is an acceptable tradeoff when sessions are short and outputs are predictable. It stops being acceptable when you're running long-context agents at high concurrency. The same hardware that's optimized to start responses fast becomes a bottleneck when it needs to sustain long outputs across thousands of simultaneous sessions. You end up paying for capabilities you don't use while being throttled by ones you need most. 

What Disaggregated Serving Actually Changes

The infrastructure response to this problem is called disaggregated serving: splitting the two phases across dedicated hardware pools rather than running both on the same system. One cluster handles prefill, building the KV cache and passing it downstream. A different cluster handles the decode phase, focused entirely on generating the output at scale. Each pool is tuned for its task.
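The split described above can be sketched as a toy handoff between two pools. This is a conceptual illustration only, not how any production serving system is implemented: `KVCache`, `PrefillPool`, `DecodePool`, and `serve` are hypothetical names, and "prefill" here just counts whitespace-separated words instead of running a model.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class KVCache:
    """Stand-in for the per-request cache built during prefill."""
    request_id: str
    prompt_tokens: int


class PrefillPool:
    """Compute-oriented pool: reads the prompt and builds the cache."""

    def run(self, request_id, prompt):
        # A real system processes the whole prompt in parallel here; this
        # toy version only counts whitespace-separated tokens.
        return KVCache(request_id, prompt_tokens=len(prompt.split()))


@dataclass
class DecodePool:
    """Memory-oriented pool: holds many caches while generating output."""
    max_sessions: int
    sessions: dict = field(default_factory=dict)

    def admit(self, cache):
        if len(self.sessions) >= self.max_sessions:
            return False  # no room: the request has to wait
        self.sessions[cache.request_id] = cache
        return True

    def finish(self, request_id):
        self.sessions.pop(request_id, None)


def serve(requests, prefill, decode):
    """Prefill each request, hand its cache to decode, return ids left waiting."""
    waiting = deque()
    for request_id, prompt in requests:
        cache = prefill.run(request_id, prompt)
        if not decode.admit(cache):
            waiting.append(request_id)
    return list(waiting)
```

The point of the sketch is the boundary: the two pools share nothing except the cache handed across, so each side can be sized and tuned on its own.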

Several organizations, including Meta, LinkedIn, and Mistral, are running disaggregated serving in production. Recent academic research describes this prefill-decode (PD) disaggregation as the standard architecture for LLM inference, citing as its core advantage the ability to scale each phase independently and to exploit hardware heterogeneity across the cluster.

For agents, what matters is how much you can now hold in memory simultaneously. Decode nodes in a disaggregated system are dedicated to one job: sustaining ongoing generation across many concurrent sessions. The number of users you can serve before performance degrades is determined by how many of those cached sessions fit in memory at once. When that constraint loosens, the cost-per-user math changes. When it loosens significantly, agent products that were previously uneconomical at scale become viable.


This is the shift that matters for anyone building on top of inference infrastructure: not that hardware got faster, but that it got more specialized. And specialized hardware, applied correctly, compresses the cost curve in ways that generalist hardware cannot.
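To make the memory constraint on decode nodes concrete, here is a rough sizing sketch. It uses the common approximation that each token stores one key and one value vector per layer. The model shape (80 layers, 8 KV heads, head dimension 128), the 100 GiB budget, and the function names are hypothetical, and real systems add overheads this ignores.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value):
    """Bytes of KV cache one token occupies: a key and a value per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def max_concurrent_sessions(kv_budget_bytes, tokens_per_session, bytes_per_token):
    """How many sessions of a given context length fit in the KV budget."""
    return kv_budget_bytes // (tokens_per_session * bytes_per_token)


# Hypothetical model shape, with 2 bytes per value (a 16-bit format).
per_token = kv_cache_bytes_per_token(
    num_layers=80, num_kv_heads=8, head_dim=128, bytes_per_value=2
)

# Hypothetical memory left for KV cache after the weights are loaded.
budget = 100 * 1024**3  # 100 GiB

for context in (4_000, 32_000, 128_000):
    sessions = max_concurrent_sessions(budget, context, per_token)
    print(f"{context:>7} tokens per session -> {sessions} concurrent sessions")
```

The shape of the result is what matters: the number of sessions a node can hold falls in proportion to context length, which is why long-running agents hit the memory ceiling long before chatbots do.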

The Compounding Effect of FP8 and Multi-Token Prediction

Two additional techniques are combining with disaggregated serving to further close the economics gap for agents, and they're worth understanding at a high level because they interact.

FP8, an 8-bit floating-point format that represents model weights and activations in half the bits of the 16-bit formats that were the previous standard, has become the effective baseline for production inference. The accuracy tradeoff is negligible; the efficiency gain is real. Models run faster, more sessions fit in memory, and cost per token drops. For a single-turn query, that efficiency gain is useful. For an agent running hundreds of steps per session, it compounds across every one of them.

Multi-token prediction (MTP) takes this further by generating several output tokens simultaneously rather than one at a time. MTP puts significant additional pressure on memory bandwidth and capacity, the dimension that decode hardware is being optimized for. The techniques reinforce each other: disaggregation creates the right environment for FP8 and MTP to deliver their full benefit, because decode nodes can be built specifically for the memory profile these techniques require.
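The claim that FP8 lets more sessions fit in memory follows from simple byte counting, sketched below. The 70-billion-parameter model and the 192 GB memory pool are hypothetical figures chosen for illustration, and the sketch ignores activations and runtime overhead.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Memory needed just to hold the model weights, in GB (1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def kv_budget_gb(device_memory_gb, num_params, bytes_per_param):
    """Memory left over for KV cache once the weights are loaded."""
    return device_memory_gb - weight_memory_gb(num_params, bytes_per_param)


NUM_PARAMS = 70e9        # hypothetical model size
DEVICE_MEMORY_GB = 192   # hypothetical memory pool on a decode node

for label, bytes_per_param in (("16-bit", 2), ("FP8", 1)):
    weights = weight_memory_gb(NUM_PARAMS, bytes_per_param)
    left = kv_budget_gb(DEVICE_MEMORY_GB, NUM_PARAMS, bytes_per_param)
    print(f"{label}: weights {weights:.0f} GB, left for KV cache {left:.0f} GB")
```

Halving the bytes per weight frees memory for cached sessions, and if the cache itself is also stored in 8 bits, each session takes half the space as well, so the two effects stack.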

The trajectory here matters. Gartner's March 2026 forecast projects that inference costs for large models will fall more than 90% by 2030 relative to 2025 levels. 

Sources: Gartner (March 2026) — projects 90%+ cost reduction by 2030 vs. 2025; H100 FP8 baseline ~$0.18/M tokens at on-demand pricing (Zylos Research, 2026); historical Claude/GPT-4 pricing (public API records). Projections are illustrative; actual costs vary by model size, provider, and utilization.

That drop doesn't happen through model improvements alone. It happens through stack-level optimization: disaggregation enables specialization, specialization enables FP8 and MTP, and all of it reduces cost per token at scale. The applications that are economically marginal today become commercially viable as that curve moves.
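For planning purposes, the projected 90% drop can be translated into a yearly rate. The sketch below assumes, purely for illustration, that the decline compounds at a constant rate over the five years from 2025 to 2030. The forecast itself does not specify the path, and because it says "more than 90%", the result is a lower bound.

```python
def implied_annual_decline(total_decline, years):
    """Constant yearly decline that compounds to total_decline over `years`."""
    remaining = 1.0 - total_decline
    return 1.0 - remaining ** (1.0 / years)


rate = implied_annual_decline(total_decline=0.90, years=5)
print(f"implied decline: {rate:.1%} per year")

# Check: applying that rate five times leaves about 10% of the 2025 cost.
print(f"cost remaining after 5 years: {(1 - rate) ** 5:.1%}")
```

Under that constant-rate assumption, costs would need to fall by a bit more than a third every year, which is a useful yardstick when comparing provider pricing from one year to the next.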

What This Means for Operators Building Agent Products

If you're building an agent product, the practical questions this raises aren't about hardware specs. They're about how you architect for the infrastructure underneath you.

The first question is whether your inference provider is running disaggregated serving. Not all of them are, and the cost-per-token difference at scale is not marginal. The foundational academic work, DistServe, showed disaggregated systems handling 7.4x more requests within the same latency constraints compared to monolithic serving. AWS's production llm-d implementation reported 70% throughput gains when it launched in March 2026. The gains are most pronounced for exactly the traffic pattern agents generate: long inputs, heavy outputs, high concurrency. The caveat worth naming: these gains aren't automatic. Disaggregation requires meaningful engineering investment to configure correctly, which is part of why widespread production adoption only accelerated in 2025 despite the research existing since 2024. 

The second question is how you're measuring cost. Most teams track cost per API call or cost per token. The more useful metric for agents is cost per completed task: a customer issue resolved, a report generated, a process step executed. That framing normalizes across the variable session lengths and step counts that agents produce, and it's what determines whether the unit economics of your product hold at scale. A team that doesn't measure this until they're in production is the team that discovers, at 500 users, that their pricing model doesn't work.
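A minimal sketch of that metric, assuming you already log total inference spend and an outcome flag per session. The `Session` record and its field names are hypothetical. Note that the spend on failed sessions stays in the numerator: you paid for those tokens whether or not the task completed.

```python
from dataclasses import dataclass


@dataclass
class Session:
    cost_usd: float   # total inference spend across every step of the session
    completed: bool   # did the agent actually finish the task?


def cost_per_completed_task(sessions):
    """Total spend (including failed sessions) divided by completed tasks."""
    total_cost = sum(s.cost_usd for s in sessions)
    completed = sum(1 for s in sessions if s.completed)
    if completed == 0:
        return None  # nothing completed, so the metric is undefined
    return total_cost / completed


# Illustrative log: two completed tasks and one expensive failure.
log = [Session(0.20, True), Session(0.50, False), Session(0.30, True)]
print(cost_per_completed_task(log))
```

Tracked alongside cost per token, this makes it visible when a cheaper model or a shorter prompt lowers the token bill but raises the failure rate enough to make each finished task more expensive.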

The third question is about session architecture. Agents that reload full conversation history at each step pay full price for that context on every call. The discount for avoiding that is substantial: Anthropic prices cached input tokens at $0.30 per million versus $3.00 for fresh input, a 90% reduction on the repeated portion. In production, ProjectDiscovery's multi-agent security platform reported 59–70% overall cost reduction after implementing caching on workflows where a 20,000-token system prompt was previously being re-sent at every one of 26 average steps. These are application-level decisions, but they interact directly with infrastructure-level improvements: the more efficiently the infrastructure handles memory at the serving layer, the more leverage application-level caching decisions give you. 
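The arithmetic behind that example can be worked through directly using the figures quoted above: a 20,000-token system prompt, 26 steps, $3.00 per million fresh input tokens, and $0.30 per million cached. The sketch assumes the first step pays the fresh price and every later step pays the cached price. It ignores any cache-write surcharge or cache expiry, which a real bill may include, and `prompt_cost` is a hypothetical helper.

```python
def prompt_cost(prompt_tokens, steps, fresh_usd_per_m, cached_usd_per_m=None):
    """Cost in USD of sending the same prompt prefix at every step of a task."""
    if cached_usd_per_m is None:
        # No caching: every step pays the fresh input price.
        return prompt_tokens * steps / 1e6 * fresh_usd_per_m
    # Assumption: the first step is billed as fresh, the rest as cached reads.
    first = prompt_tokens / 1e6 * fresh_usd_per_m
    rest = prompt_tokens * (steps - 1) / 1e6 * cached_usd_per_m
    return first + rest


uncached = prompt_cost(20_000, 26, fresh_usd_per_m=3.00)
cached = prompt_cost(20_000, 26, fresh_usd_per_m=3.00, cached_usd_per_m=0.30)
saving = 1 - cached / uncached

print(f"uncached: ${uncached:.2f} per task")
print(f"cached:   ${cached:.2f} per task")
print(f"saving on the system prompt: {saving:.1%}")
```

This covers only the repeated system prompt. Output tokens and uncached input are excluded, so the result is not directly comparable to the overall reduction reported for the full workflow.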

The Window That's Opening

The economics of always-on agents have been a real constraint. Not a theoretical one. Products have been killed, pilots abandoned, roadmaps deferred, not because the technology didn't work, but because the cost of running it at scale made the business case collapse. That constraint is loosening, driven by infrastructure changes that most people building AI products haven't fully priced in.

Disaggregated serving, combined with FP8 precision and multi-token prediction, is compressing the cost curve for exactly the workload profile that agents generate. The applications that couldn't survive a quarterly budget review at 500 users will have a different economic profile at the infrastructure layer they'll be running on in 2027.

The product question is whether you're building for the economics of today's infrastructure or the infrastructure that's already being deployed. The gap between those two is where the next generation of agent businesses will be built.

FROM COLLECTIVE HQ

🚀 Humans in AI Week is coming!

This June, AIC is hosting 100+ events in one week, all built around a single question: what does it mean to be human in the AI era? It's the largest human-centered AI gathering we've ever run, across every chapter, on six continents.

Read the announcement, and pledge your voice below.

The AI Collective is built by volunteers across 180+ chapters in 40 countries.

Thank you to the thousands of volunteers around the world who make this work possible. We truly could not do this without you.

🧑‍💻 About the Author

Sriranjani Ramasubramanian is a product and AI infrastructure leader working across semiconductors, inference acceleration, developer ecosystems, and go-to-market strategy. With experience at Qualcomm, Ampere, Untether AI, and AMD, she has moved from physical design and hardware execution into product management, product marketing, and developer enablement for AI platforms. Her work focuses on turning complex hardware-software infrastructure, from accelerators and SDKs to ROCm, local AI, and inference tooling, into products developers and enterprise customers can use. Her perspective emphasizes that the AI infrastructure race is not just about the best chip, but about making the full stack accessible, measurable, and commercially useful.

✍️ Editors

About Josh Evans

Josh is a Managing Editor at The AI Collective Newsletter and leads content for The Byte. Outside of AIC, Josh works in Content Protection at Spotify.
