The Agent Scale Problem

Most agentic RAG systems in production don’t look like a clever prompt wrapped around a vector database. They look like a pipeline.
A typical setup starts by aggressively normalizing documents into something machines can actually reason over:
- PDFs are converted to structured text (Markdown, HTML, or JSON) for better reasoning.
- Images and scanned pages are converted to text that preserves layout and tables.
- Indices and embeddings are built for the resulting text and metadata (sections, dates, parties, source authority).
Only then does the agentic RAG layer come into play: handling multi-part questions, enforcing precise output formats, grounding answers with citations, and achieving high recall with minimal tool calls.
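A minimal sketch of that ingestion step is below. The `parse_pdf`, `embed`, and `index` arguments are placeholders for whichever parser, embedding API, and vector store a given stack uses; the chunker is deliberately simplistic.

```python
# Illustrative ingestion sketch: normalize documents, chunk them, embed, and index.
# parse_pdf, embed, and index are placeholders, not a specific vendor API.
from dataclasses import dataclass, field

@dataclass
class Chunk:
    doc_id: str
    section: str
    text: str
    metadata: dict = field(default_factory=dict)  # dates, parties, source authority, ...

def split_by_section(doc_id: str, markdown: str) -> list[Chunk]:
    """Split normalized Markdown on headings; real pipelines use smarter chunkers."""
    chunks, section, lines = [], "preamble", []
    for line in markdown.splitlines():
        if line.startswith("#"):
            if lines:
                chunks.append(Chunk(doc_id, section, "\n".join(lines)))
            section, lines = line.lstrip("# ").strip(), []
        else:
            lines.append(line)
    if lines:
        chunks.append(Chunk(doc_id, section, "\n".join(lines)))
    return chunks

def ingest(pdf_paths, parse_pdf, embed, index):
    """Normalize each PDF, chunk it, embed each chunk, and store text plus metadata."""
    for path in pdf_paths:
        markdown = parse_pdf(path)                   # PDF -> structured text
        for chunk in split_by_section(path, markdown):
            index.add(embed(chunk.text), chunk)      # one embedding call per chunk
```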
At low document volume, this architecture works exactly as you’d expect. A standard RAG pipeline feels simple, predictable, and manageable.
At production scale, cost and complexity escalate even if the system appears to work.
Agentic RAG cost grows with query complexity, not just page count, and that complexity ripples across every API in the chain.
In a production environment, a single query may span thousands of pages of searchable context and require extracting many distinct fields, not just producing a summary. What looks like “one question” is actually a coordinated workflow:
- Ingestion: Convert PDFs to structured text and generate embeddings.
- Retrieval: Query large indexes to assemble relevant context.
- Extraction: Answer multiple questions with strict formatting and verification.
Services often include document parsing, embedding APIs, vector search, reranking, and LLM inference (frequently from different providers).
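Sketched in code, the per-question fan-out looks roughly like the following. The `retriever`, `reranker`, and `llm` clients are hypothetical stand-ins for separate services, often from different providers.

```python
# Rough sketch of the per-question fan-out across services (placeholder clients).
def answer_question(question, fields, retriever, reranker, llm):
    """Every extracted field costs a retrieval query, a rerank pass, and an LLM call."""
    answers = {}
    for field in fields:                                     # "one question" = many fields
        candidates = retriever.search(f"{question} {field}", top_k=50)
        context = reranker.rerank(query=field, docs=candidates, top_n=8)
        answers[field] = llm.extract(field=field, context=context)
    return answers                                           # at least 3 * len(fields) service calls
```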
As queries get more complex, the pipeline does more work per question: more calls, more intermediate artifacts, more coordination.
The result isn’t just higher dollar cost. Latency increases as steps serialize. Failure points multiply across service boundaries. Operational complexity grows as retries, rate limits, and partial failures become part of the steady-state system. The models still produce answers, but the pipeline becomes slower, harder to reason about, and harder to operate.
An orchestration problem, not just a retrieval problem
Even perfectly formatted Markdown won't save a naïve RAG pipeline at scale. The challenge isn't just finding the right text; it's orchestrating how reasoning happens over that text without exploding cost, latency, or operational complexity.
A naïve pipeline asks a single model to do everything: understand the query, decide which tools to call, scan thousands of pages, extract multiple data points, and format the final output.
That creates an immediate tradeoff:
- Large frontier models excel at complex reasoning and precise formatting, but they’re slow and expensive when forced to sift through thousands of pages per request.
- Smaller or open models are fast and cheap, but struggle with complex instructions, precise extraction, and reliable tool use.
This is where many production systems hit a wall. Use a frontier model for every task and your margins vanish. Use a small model end-to-end and your accuracy vanishes.
Production AI isn’t about picking a model. It’s about assigning the right models to the right stages of the pipeline.
What’s needed is clear separation of concerns:
- A workhorse to do high-volume extraction and tool execution, and
- A synthesizer to reason over already-extracted facts and produce a clean final answer.
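A minimal sketch of that split, again with placeholder clients and model names: the workhorse handles per-field extraction over retrieved context, and the synthesizer only ever sees the already-extracted facts.

```python
# Two-stage flow: cheap workhorse for bulk extraction, frontier model for final synthesis.
# workhorse_llm and synth_llm are placeholders for whatever models a team chooses.
def two_stage_answer(question, fields, retriever, workhorse_llm, synth_llm):
    facts = {}
    for field in fields:
        context = retriever.search(f"{question} {field}", top_k=20)
        facts[field] = workhorse_llm.extract(field=field, context=context)  # high volume, low cost
    # one expensive call at the end: reason over the facts and enforce the output format
    return synth_llm.compose(question=question, facts=facts, output_format="strict-json")
```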
As volume rises, the question shifts from “can we do this?” to “can we do this profitably and predictably?”
When cost, latency, and reliability become part of the product
At scale, your AI stack is part of your product.
- Cost spikes hit margin
- Latency drives churn
- Accuracy builds or erodes trust
- Reliability defines usability
These aren't just engineering problems; they're product constraints. Consider your end user's priorities:
- The bid team cares that the answer missed the deadline. Not that the delay came from a queue.
- The compliance team cares that a missed clause introduced risk. Not that the timeout came from a model.
- The support team cares that exceptions became tickets. Not that the model “usually works.”
Shift the mindset from "does this prompt work in principle?" to "how do we minimize work per extracted fact?"
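One way to make that question measurable is to track cost per extracted fact rather than per request. The function below is a minimal sketch; the token counts and per-token prices are purely illustrative.

```python
# Purely illustrative numbers: 300k prompt tokens, 4k completion tokens,
# $3 / $15 per million input/output tokens, 40 facts extracted in one request.
def cost_per_fact(prompt_tokens, completion_tokens, price_in, price_out, facts):
    """Dollar cost attributable to each extracted fact in a single request."""
    dollars = prompt_tokens * price_in + completion_tokens * price_out
    return dollars / max(facts, 1)

print(cost_per_fact(300_000, 4_000, 3e-6, 15e-6, 40))  # ~$0.024 per extracted fact
```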
The shift: from static pipelines to an agent optimization engine
Instead of treating RAG as a static chain, we built an Agent Optimization Engine.
Split the pipeline by function, not convenience. That’s the core idea.
- Dynamic model selection: The system selects the best model for each task rather than falling back to a single default.
- The workhorse layer: Highly optimized, smaller models handle bulk extraction and tool execution across large context windows. They are fast, cheap, and tuned specifically for retrieval and structured extraction.
- The synthesis layer: A larger model is invoked only at the final step to assemble extracted facts, apply reasoning, and enforce precise output formats.
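As a rough sketch of what that separation can look like in configuration: each pipeline stage maps to a model tier, and a registry supplied by the deployment resolves tiers to concrete models. The stage and tier names here are illustrative, not the actual AI Factory API.

```python
# Illustrative routing table: pipeline stages map to model tiers, not to one default model.
ROUTES = {
    "parse":      "tooling",     # document conversion, no LLM involved
    "retrieve":   "embedding",   # vector search plus rerank
    "extract":    "workhorse",   # small model tuned for structured extraction
    "synthesize": "frontier",    # invoked once to assemble facts and enforce format
}

def pick_model(stage: str, registry: dict) -> str:
    """Resolve a pipeline stage to whichever concrete model the registry provides for its tier."""
    return registry[ROUTES[stage]]
```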
Our AI Factory approach turns a cost problem into a tractable engineering system without sacrificing accuracy.
The result: lower cost, more headroom
Standard RAG pipelines work well for short documents, where token costs and orchestration overhead are negligible. At the scale of millions of pages and multi-step extraction, they break.
With a split-model architecture running on Voltage Park's AI Factory, teams operating over large corpora see inference costs cut by as much as half, with better reliability and predictability.
The gains don’t come from better prompts. They come from reducing unnecessary reasoning work and isolating it to where it actually adds value. The AI Factory was built for this. Try it now for free.
Why this pattern shows up across industries
You can’t out-prompt a systems problem. The same failure mode appears anywhere unstructured document volume spikes:
- Legal: due diligence across thousands of contracts
- Finance: extracting specific KPIs from quarterly filings
- Healthcare: synthesizing patient history from years of unstructured notes
If your RAG agents are getting more expensive, slower, and harder to operate as usage grows, the issue isn’t hallucinations or model choice.
You need infrastructure designed for agentic orchestration, where ingestion, model selection, and cost predictability are first-class concerns.
That’s what the Voltage Park AI Factory is built for.
Talk to our engineering team for a no-pressure check on how to keep your costs predictable and your margins intact as you scale.
