AI Budget Overruns? Cut Cost per Token

The Wall Street Journal reported in late May that enterprises are hitting their annual AI budgets in as little as three months. Others have watched AI bills double or triple. Uber blew through its annual budget for agentic AI by March. At one major financial institution, employees were burning hundreds of thousands of dollars a month on tokens, sometimes using premium-tier models for small talk. Corporate leaders, per the Journal, are now scrambling to ration AI use across their organizations.

Rationing is the wrong fix. The problem was never how much AI your teams use. The problem is how inefficiently your infrastructure produces each token.

Tokens are the new datacenter currency

Every AI interaction your company runs, every chatbot answer, every agentic workflow, every code completion, is measured in tokens. The datacenter has become a factory: power goes in, tokens come out. Jensen Huang made the economics explicit on NVIDIA's most recent earnings call: "inference tokens per watt translates directly to the revenues" of the providers. NVIDIA's own guidance to enterprises now says it plainly: compute cost and FLOPS per dollar are input metrics. Cost per token is the output metric, the one that determines whether AI scales profitably.

The corollary for enterprises is uncomfortable. Every idle GPU is a direct hit to margin. And most GPUs are idle: enterprise GPU utilization runs near 5% on average, what VentureBeat has called a $401 billion infrastructure problem. You are paying for 100% of the hardware and collecting a sliver of the output.

Why agentic AI broke your budget model

Budgets set last fall assumed chatbot economics: a person asks, the model answers, done. Agents work differently. They chain model calls, re-reading their full working context at every step. Manus, the agent company Meta acquired for $2 billion, reports that AI agents generate up to 100x more tokens than human users. Google now processes 3.2 quadrillion tokens a month, seven times last year's volume. Demand is not slowing. It is compounding.

And if you consume those tokens through cloud APIs, the meter structure punishes the agentic pattern twice. Anthropic's published pricing makes the mechanics visible: a cached input token costs $0.50 per million, while an uncached one costs $5 per million. Same token, 10x the price, depending on whether the infrastructure remembered it. Agents re-send mostly identical context every step. Infrastructure that forgets converts most of your bill into recomputed work.
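A rough sketch of that arithmetic, using the two per-million prices quoted above. The step count, context sizes, and the function name `agent_input_cost` are illustrative assumptions, and the model deliberately ignores output tokens and any cache-write surcharge a provider may apply.

```python
CACHED_PER_M = 0.50    # USD per million cached input tokens (figure quoted above)
UNCACHED_PER_M = 5.00  # USD per million uncached input tokens (figure quoted above)


def agent_input_cost(steps, base_context, new_tokens_per_step, cache_hit):
    """Input-token cost in USD for one agent run (simplified model).

    Every step re-sends the full context accumulated so far. With
    cache_hit=True, the prefix seen on the previous step is billed at the
    cached rate and only the newly added tokens at the uncached rate.
    With cache_hit=False, every token is billed at the uncached rate.
    """
    total = 0.0
    context = base_context
    for step in range(steps):
        if cache_hit and step > 0:
            cached = context - new_tokens_per_step
            fresh = new_tokens_per_step
        else:
            cached, fresh = 0, context
        total += cached / 1e6 * CACHED_PER_M + fresh / 1e6 * UNCACHED_PER_M
        context += new_tokens_per_step
    return total


# Hypothetical run: 20 steps, 20,000-token starting context, 1,000 new tokens per step.
forgetful = agent_input_cost(20, 20_000, 1_000, cache_hit=False)
remembering = agent_input_cost(20, 20_000, 1_000, cache_hit=True)
print(f"no cache: ${forgetful:.2f}  with cache: ${remembering:.2f}")
```

Under these assumed inputs the uncached run costs several times more than the cached one for identical work, and the gap widens as the number of steps grows.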

The WSJ also surfaced what employees did with frictionless access: "tokenmaxxing," burning compute to look AI-forward. That is real waste, and worth governing. But Meta CTO Andrew Bosworth's internal guidance cuts to the actual metric: "token usage alone is not a measure of impact." The question is not how many tokens your company consumes. It is what each one costs and returns.

The infrastructure answer

There is a better conversation to have with your CFO, and it starts with three numbers.

First: 60-75% lower total cost per token. That is what becomes possible when GPU utilization rises from single digits toward full use, when repeated context is cached in memory instead of recomputed at 10x the cost, and when batch sizes grow because memory is no longer the constraint. These gains compound because they attack different parts of the same waste.

Second: 4-6 months to break even on-prem versus cloud. Organizations running steady inference above roughly 60% GPU utilization reach the crossover in months, not years. After break-even, your token cost is power plus depreciation. The per-token cloud bill disappears.
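The break-even logic can be sketched as a one-line calculation: up-front hardware spend divided by the monthly saving versus cloud. Every dollar figure below is a placeholder chosen for illustration, not a measured result, and `break_even_months` is a hypothetical helper name.

```python
def break_even_months(capex_usd, cloud_monthly_usd, onprem_monthly_opex_usd):
    """Months until on-prem capex is recovered by savings versus cloud.

    Returns None when on-prem running costs meet or exceed the cloud bill,
    because the investment is then never recovered.
    """
    monthly_saving = cloud_monthly_usd - onprem_monthly_opex_usd
    if monthly_saving <= 0:
        return None
    return capex_usd / monthly_saving


# Placeholder inputs: $500k of hardware, a $120k/month cloud bill,
# and $20k/month of on-prem power and operations.
print(break_even_months(500_000, 120_000, 20_000))
```

Substitute your own capex, cloud invoice, and operating costs. The result is sensitive to how steady the workload is, since an underused cluster shrinks the monthly saving.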

Third: 4-16x more tokens per GPU dollar. NVIDIA's own published analysis shows why output, not hardware price, decides the economics: its newest platform costs about twice as much per GPU-hour as the prior generation yet delivers roughly 35x lower cost per million tokens, because delivered throughput, not sticker price, is the denominator. The same logic applies inside your datacenter. Utilization is the denominator you control.
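To make the denominator argument concrete, here is a minimal sketch of cost per million tokens as hourly GPU cost divided by tokens actually delivered in that hour. The hourly prices and throughput figures are invented placeholders. The 70x throughput ratio is simply what a 2x price and 35x lower cost per token imply together, and the function name is an assumption.

```python
def cost_per_million_tokens(gpu_hour_usd, peak_tokens_per_sec, utilization=1.0):
    """USD per million tokens for one GPU.

    gpu_hour_usd: all-in hourly cost of the GPU (placeholder input).
    peak_tokens_per_sec: throughput while the GPU is busy (placeholder input).
    utilization: fraction of each hour the GPU spends serving tokens.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = peak_tokens_per_sec * 3600 * utilization
    return gpu_hour_usd / tokens_per_hour * 1_000_000


# Generation comparison: twice the hourly price, 70x the throughput.
old_gen = cost_per_million_tokens(1.0, 1_000)
new_gen = cost_per_million_tokens(2.0, 70_000)
generation_gain = old_gen / new_gen        # about 35x cheaper per token

# Same hardware at 5% versus 60% utilization.
idle_heavy = cost_per_million_tokens(2.0, 1_000, utilization=0.05)
well_used = cost_per_million_tokens(2.0, 1_000, utilization=0.60)
utilization_gain = idle_heavy / well_used  # about 12x cheaper per token
```

In this simple model, cost per token scales inversely with utilization, which is why the same server can look expensive or cheap depending on how busy it is kept.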

The mechanism is composability. Liqid's software pools memory and GPUs at the fabric level instead of locking those devices inside individual servers. When a workload needs GPUs, it gets them in seconds; when it finishes, they serve the next job. Memory works the same way: up to 100TB is pooled and allocated where needed, which keeps agent context cached so that repeated context is reused cheaply rather than recomputed at 10x the cost. Nothing sits stranded on your depreciation schedule.

Liqid is not selling you new servers. It is a software and fabric layer that makes the servers you already buy from Dell, Cisco, HPE, or Fujitsu produce more tokens per dollar invested. Liqid's measured results against static deployments: 2x more tokens per watt, 5x more tokens per dollar.

The question to ask this quarter

Your CFO read the same coverage you did. The board conversation is coming, if it has not happened already. When it does, "we told the teams to use less AI" is a weak answer. "We cut our cost per token by more than half and our AI bill is now power plus depreciation" is a strong one.

Start with the assessment: what is your actual GPU utilization today, what share of your token spend is recomputed context, and where does your cloud bill cross the on-prem break-even line? Those three answers tell you exactly how much of your AI budget is paying for waste.

The companies that win the next phase of enterprise AI will not be the ones that ration tokens. They will be the ones that produce tokens cheaper than their competitors. That is an infrastructure decision, and it is sitting on your desk right now.

Run the numbers for your own cluster. Let us help you run a token cost assessment and show you how Liqid can change your tokenomics equation.
