When an inference cluster underperforms, the first instinct is to blame the GPUs. Buy more, buy newer, buy bigger. But trace the data path on a stalled agentic workload and you will almost always find the same culprit: the GPU is fine. It is starving.
The memory wall, the widening gap between processor speed and the memory system's ability to feed it, has been a known problem in computer architecture for decades. Large language models turned it from an academic concern into the dominant cost driver in enterprise AI.
What the wall looks like in production
Modern LLM serving leans hard on CPU-side DRAM. The KV cache for long contexts, attention state, batch coordination, embedding stores, and the accumulating tool-call state of agentic workflows all live there. A 70B-parameter model serving 128K-token contexts generates a KV cache on the order of 70GB per active context. Run concurrent agents and the demand multiplies.
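A rough way to sanity-check that sizing is to multiply out the cache dimensions. The sketch below is a back-of-envelope estimate, not a measurement: the layer count, KV-head count, head dimension, and 16-bit precision are assumed values for a typical 70B-class model with grouped-query attention, and `kv_cache_bytes` is a hypothetical helper. Under those assumptions it lands at about 43 GB per 128K-token context, the same order of magnitude as the figure above. Models with more KV heads or higher precision come out larger.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache for one context.

    Each token stores one key vector and one value vector (the factor of 2)
    per layer and per KV head.
    """
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return tokens * per_token


# Assumed dimensions (not from the article): 80 layers, 8 KV heads,
# head dimension 128, 16-bit values.
ctx_bytes = kv_cache_bytes(tokens=128 * 1024, layers=80, kv_heads=8, head_dim=128)

print(f"{ctx_bytes / 1e9:.1f} GB per 128K-token context")
print(f"{16 * ctx_bytes / 1e12:.2f} TB for 16 concurrent contexts")
```

Because the result scales linearly with context length and with the number of concurrent contexts, a handful of agents is enough to push a node toward the 1-2TB ceiling.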
Against that demand, a conventional server gives you 1-2TB of DRAM per CPU. The ceiling is physical: DIMM slots and CPU pin counts limit how much memory you can hang off the parallel bus, and high-density RDIMMs get expensive fast as you chase capacity.
When the DRAM runs out, the serving stack typically spills to NVMe. DRAM access runs around 200 nanoseconds. NVMe runs around 100,000 nanoseconds. That is a 500x latency penalty landing in the middle of your token generation loop, and GPUs cannot hide it. Pipeline stalls propagate backward: batch sizes collapse because there is nowhere to hold concurrent request state, tensor core occupancy drops, throughput falls, and energy per token climbs. The cluster looks busy. It is mostly waiting.
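To see how quickly a small amount of spill dominates, the two latency figures quoted above can be combined into a weighted average. This is a simplified model that assumes every access costs either the DRAM figure or the NVMe figure and ignores queuing, prefetching, and overlap. `effective_latency_ns` and the spill fractions are illustrative assumptions.

```python
DRAM_NS = 200        # DRAM access latency quoted above
NVME_NS = 100_000    # NVMe access latency quoted above


def effective_latency_ns(spill_fraction, dram_ns=DRAM_NS, nvme_ns=NVME_NS):
    """Average access latency when a fraction of accesses spill to NVMe."""
    if not 0.0 <= spill_fraction <= 1.0:
        raise ValueError("spill_fraction must be between 0 and 1")
    return (1 - spill_fraction) * dram_ns + spill_fraction * nvme_ns


print(f"NVMe / DRAM ratio: {NVME_NS / DRAM_NS:.0f}x")
for spill in (0.0, 0.01, 0.05, 0.20):
    avg = effective_latency_ns(spill)
    print(f"spill={spill:>4.0%}  avg={avg:>8.0f} ns  slowdown={avg / DRAM_NS:.1f}x")
```

In this model, even 5% of accesses going to NVMe raises the average from 200 ns to roughly 5,190 ns, about 26 times slower than keeping everything in DRAM.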
Why adding servers makes the economics worse
The traditional answer to a memory shortfall is to buy more servers, because servers are how you buy memory. This is exactly backwards for AI workloads.
Each added server brings CPUs you did not need, power and cooling overhead you did not want, and another 1-2TB DRAM silo that is stranded inside that box. Meanwhile measured memory utilization across server fleets typically averages 30-50%, because every box is provisioned for its own peak. You end up overbuying memory in aggregate while individual workloads still hit the wall. Stranded capacity on one node cannot help a starving workload on the next rack over.
That is the defining feature of the memory wall in enterprise AI: it is not a capacity shortage. It is an allocation problem. The memory exists; it is just locked in the wrong boxes.
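The allocation problem can be illustrated with a toy placement check. The fleet below is hypothetical (8 nodes with 2TB each at 40% utilization), and the pooled case is idealized: it treats all free capacity as if it were reachable over a fabric, which is an assumption made purely for illustration. The function names are likewise invented for this sketch.

```python
def place_siloed(node_free_gb, job_gb):
    """Return the index of the first node whose own free DRAM fits the job, else None."""
    for i, free in enumerate(node_free_gb):
        if free >= job_gb:
            return i
    return None


def fits_pooled(node_free_gb, job_gb):
    """Idealized pool: the job fits if the aggregate free capacity covers it."""
    return sum(node_free_gb) >= job_gb


# Hypothetical fleet: 8 nodes x 2,000 GB, each 40% utilized -> 1,200 GB free per node.
free_gb = [1200] * 8
job_gb = 6000  # one long-context job

print("aggregate free GB:", sum(free_gb))
print("siloed placement:", place_siloed(free_gb, job_gb))
print("fits in pool:", fits_pooled(free_gb, job_gb))
```

The fleet has 9,600 GB free in aggregate, yet no single node can host the 6,000 GB job. Capacity is not the constraint; where it sits is.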
Memory pooling changes the architecture
Memory pooling is the structural answer. It is built on open interconnect standards that let memory live outside the server and attach coherently over a high-speed, low-latency fabric. Instead of memory being a fixed property of each box, it becomes a pooled resource the fabric allocates where it is needed.
Liqid's implementation is the EX-5410C, the industry's leading memory pooling fabric.
- Up to 100TB of DRAM pooled in a single deployment
- Shared across as many as 32 server nodes
- Roughly 200 nanoseconds of access latency, near native DRAM speed
The OS sees ordinary memory. Your serving stack, your database, your scheduler all run unmodified. The difference is that when a node needs 6TB for a long-context inference job, it gets 6TB from the pool, and when the job completes, the capacity returns for the next allocation. Liqid Matrix handles the orchestration, with Kubernetes, Slurm, OpenShift, and Ansible integrations so allocation happens at the speed of the scheduler rather than the speed of procurement.
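The allocate-and-return lifecycle described above can be modeled in a few lines. This is a toy sketch only: `MemoryPool` and its methods are hypothetical names invented for illustration and are not the Liqid Matrix API or any scheduler integration. The 100TB capacity and 6TB job size are taken from the text.

```python
class MemoryPool:
    """Toy model of a shared DRAM pool (hypothetical, for illustration only)."""

    def __init__(self, capacity_tb):
        self.capacity_tb = capacity_tb
        self.allocations = {}  # job id -> TB currently held

    @property
    def free_tb(self):
        return self.capacity_tb - sum(self.allocations.values())

    def allocate(self, job_id, size_tb):
        """Reserve capacity for a job, failing if the pool cannot cover it."""
        if job_id in self.allocations:
            raise ValueError(f"{job_id} already holds an allocation")
        if size_tb > self.free_tb:
            raise MemoryError(f"pool has {self.free_tb} TB free, need {size_tb} TB")
        self.allocations[job_id] = size_tb

    def release(self, job_id):
        """Return a job's capacity to the pool and report how much was freed."""
        return self.allocations.pop(job_id)


pool = MemoryPool(capacity_tb=100)
pool.allocate("long-context-job", 6)
print("free while job runs:", pool.free_tb, "TB")
pool.release("long-context-job")
print("free after job completes:", pool.free_tb, "TB")
```

The point of the model is that capacity is accounted for once, at the pool, so a finished job's memory is immediately available to whichever node asks next.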
What happens downstream of the fix
Remove the wall and the second-order effects show up across the serving stack.
KV cache stays in fast memory instead of spilling, which means the 500x NVMe penalty disappears from the token loop. Batch sizes grow 4-8x because concurrent request state finally fits, which lifts tensor core occupancy and cuts per-token energy. GPU utilization rises because the accelerators spend their cycles computing instead of stalling. Liqid's measured outcome against static configurations: 2x tokens per watt and 50% higher tokens per dollar.
And the fleet-level economics invert. Instead of buying servers to get memory, you size memory independently of compute. The DRAM stranded behind that 30-50% average utilization becomes one pool sized to actual aggregate demand. Peak-provisioning per node ends, because the peak is provisioned once, at the fabric level, and shared.
Where this goes next
Liqid Memory Pooling is shipping in production today. The CXL Consortium's 3.0/3.1 roadmap targets multi-rack memory pooling for late 2026 to 2027, which extends the same architecture from rack scale to row scale. Liqid is already at the frontier of memory pooling and is positioned to carry deployments through that transition.
The takeaway for anyone planning 2026-2027 infrastructure: the memory wall is no longer a fact of life you design around. It is a solved architecture problem you can buy your way out of, at near-DRAM latency, on an open standard, without touching your application layer. The teams still adding servers to chase memory are paying the wall a tax it no longer has the right to collect.
Want the data path analysis for your own cluster? Request a demo review at liqid.com



