Agents · 3 May 2026 · 8 min read

Choosing the right LLM for production agents: decision frameworks

By ArchVerify Editorial

Benchmark scores are seductive. They feel like definitive inputs to a decision. They aren't. Production agent systems rarely resemble benchmark tasks. They're IO-heavy loops where the model receives data, reasons over context, produces a decision, and waits for feedback. The differences that matter—cost-per-loop, latency budget, context handling—barely move on a leaderboard. We need a framework that starts with workload shape, not score.

Benchmark suites measure single-turn reasoning on curated questions. Production agents are iterative feedback systems. The model usually isn't the bottleneck; decision latency, token throughput, and cost per loop are. Here's what changes:

Benchmarks test single prompts with best-effort answers, latency targets under 2 seconds, and no retry or feedback logic. Agent systems run loops: trigger, prompt, parse, store, feedback. Multi-turn reasoning is common, async and batch processing are ordinary, and cost is highly sensitive to context window size and token efficiency. A model that scores three points higher on reasoning can still be the wrong choice for an agent if its pricing structure forces expensive re-prompting, or if its maximum output length truncates our decision logic. The right question isn't 'which model is smartest?' but 'which model's cost and latency fit our decision loop?'

Two patterns emerge.

Real-time classification agents (fraud detection, intent routing, content moderation) care about latency under 200ms and throughput. Context window is small. Model quality matters, but API infrastructure does too. Batch inference isn't an option.

Batch decision loops (overnight report generation, content validation, multi-document analysis) care about cost-per-decision and total throughput. Latency is measured in minutes or hours. Context window is huge: we're feeding entire document sets to avoid re-fetching. Batch pricing becomes relevant.

No single model wins at both. Trying to optimise agent architecture around benchmark scores leads to wrong sizing decisions.

Token pricing is a distraction. What matters is total cost per decision loop iteration.

Consider a document-analysis agent loop: fetch a document (outside model cost), prompt the model to read 50K tokens and extract key facts (50K input), generate structured facts (2K output), store the result, and count that as one decision. Cost per decision: $5 per million tokens input ($0.25 on 50K tokens) plus $25 per million tokens output ($0.05 on 2K tokens) equals $0.30 per decision.
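The arithmetic for this loop can be sketched in a few lines. This is a minimal illustration, not a billing tool: the function name `cost_per_decision` and its parameters are our own, and the prices are the per-million-token figures quoted in the example.

```python
def cost_per_decision(input_tokens: int, output_tokens: int,
                      input_price_per_m: float, output_price_per_m: float) -> float:
    """Dollar cost of one model call, given per-million-token prices."""
    input_cost = input_tokens / 1_000_000 * input_price_per_m
    output_cost = output_tokens / 1_000_000 * output_price_per_m
    return input_cost + output_cost


# The document-analysis loop: 50K tokens in, 2K tokens out, at $5 / $25 per million.
cost = cost_per_decision(50_000, 2_000, 5.0, 25.0)
print(f"${cost:.2f} per decision")  # prints "$0.30 per decision"
```

Keeping input and output as separate terms makes it easy to see which side dominates: here the 50K-token read accounts for most of the cost, which is why context reuse matters so much for this kind of agent.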

Now consider a second query on the same document using cached context. The new 5K tokens are billed at $5 per million tokens again, but the 50K from step one is reused from cache. With prompt caching on Anthropic's direct API, we pay full price only for non-cached tokens; cached tokens are billed at a reduced cache-read rate. On AWS Bedrock, we don't have caching; we repay the full 50K every time. The token price difference between models becomes secondary. The integration choice (direct API with caching versus Bedrock without) can shift total cost per decision by 30 to 40 percent. Architects who focus on token price and ignore integration costs will underestimate agent infrastructure costs by significant margins.
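To see how much caching moves the input side of a follow-up query, here is a rough sketch. The function `followup_input_cost` and the `cache_read_multiplier` parameter are our own names; the 0.1 value used below is an assumption for illustration (check the provider's current cache-read rate), and the one-off cost of writing the cache on the first request is ignored. Setting the multiplier to 0.0 reproduces the simplification of paying only for non-cached tokens.

```python
from typing import Optional


def followup_input_cost(new_tokens: int, cached_tokens: int, price_per_m: float,
                        cache_read_multiplier: Optional[float] = None) -> float:
    """Input cost of a follow-up query on context that was already sent once.

    cache_read_multiplier=None models an integration without caching:
    the previously sent tokens are re-sent and billed at full price.
    """
    new_cost = new_tokens / 1_000_000 * price_per_m
    reused_cost = cached_tokens / 1_000_000 * price_per_m
    if cache_read_multiplier is not None:
        reused_cost *= cache_read_multiplier  # assumed discount on cached reads
    return new_cost + reused_cost


# 5K new tokens on top of the 50K-token document, at $5 per million input tokens.
without_cache = followup_input_cost(5_000, 50_000, 5.0)
with_cache = followup_input_cost(5_000, 50_000, 5.0, cache_read_multiplier=0.1)
print(f"no cache: ${without_cache:.3f}, cached: ${with_cache:.3f}")
```

These figures cover input cost for the follow-up query only. Output tokens and the first, uncached request are billed the same either way, so the effect on total cost per decision is smaller than the gap between these two numbers.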

Three levers move cost-per-decision: token count (shorter prompts, fewer re-fetches, better system-prompt design), context window (larger windows mean we load data once and ask multiple questions, amortising the input cost across decisions), and batch scheduling (if latency allows, batch inference is discounted by 50%). For multi-tenant SaaS agents (variable load, unpredictable bursts), batch is risky: we can't safely batch requests from different tenants, and scheduling adds latency. Direct cost control matters more than batch discounts.

Output token limits affect long-form generation tasks. Claude Opus 4.7 caps output at 128,000 tokens. Most agents won't hit this, but long-form work (writing multi-section documents, generating test suites) can. Hitting an output ceiling forces fallback or chunking, and both add cost and latency.

No model is universally best for agents. Fit depends on workload shape and context patterns.

Claude for long-context, batch-friendly work: Document analysis pipelines benefit from Claude's 1,000,000-token context window at standard pricing. Context windows at this scale are uncommon in the market. Competitors either charge premium rates above 100K tokens or don't offer them at all. For agents that analyse large document sets, codebases, or multi-page transcripts without chunking, this is load-bearing infrastructure. Example: overnight report generation with 500K tokens of quarterly data and 50 structured questions. A 100K context window cannot hold the data, so we chunk it and re-send instructions, questions, and overlapping context with every request; at 10 requests of 100K tokens each, that is 1 million input tokens. A 1 million token context window handles the whole job in one request of roughly 500K input tokens. That halves input cost in this example, and the gap widens with every additional pass over the same data. For large-context workloads, context window size is the single biggest cost lever. Pricing: $5 per million tokens input and $25 per million tokens output for Opus.

Gemini for cost-sensitive, smaller-context work: Real-time classification where context is small. Gemini 3 Flash's input price of $0.50 per million tokens is among the lowest in the market. If our agent does high-volume, low-context decisions (intent classification on short messages, simple content moderation, simple routing), Gemini's cost advantage is material. Trade-off: Gemini's context window pricing is bucketed. Gemini 3.1 Pro costs $2.00 per million input tokens for prompts at or below 200K tokens, then more above 200K. For agents with unpredictable or growing context sizes, this bucketing becomes a constraint: we cross a boundary and cost jumps discontinuously. If our workload stays below 200K consistently, Gemini is cost-competitive. If not, plan for pricing volatility.
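The discontinuity is easiest to see in code. The sketch below assumes the whole prompt is billed at the rate of the bucket it lands in, which is what makes the cost jump rather than bend. The above-threshold price is not given here, so `HIGH_PRICE` is a hypothetical placeholder, and `tiered_input_cost` is our own name.

```python
def tiered_input_cost(prompt_tokens: int, low_price_per_m: float,
                      high_price_per_m: float, threshold: int = 200_000) -> float:
    """Input cost when the entire prompt is billed at its bucket's rate (assumption)."""
    rate = low_price_per_m if prompt_tokens <= threshold else high_price_per_m
    return prompt_tokens / 1_000_000 * rate


LOW_PRICE = 2.00   # $ per million input tokens at or below 200K
HIGH_PRICE = 4.00  # hypothetical above-threshold price, for illustration only

for tokens in (199_000, 200_000, 201_000):
    print(tokens, round(tiered_input_cost(tokens, LOW_PRICE, HIGH_PRICE), 4))
```

Under these assumptions, one thousand extra tokens across the boundary roughly doubles the bill for that request. An agent whose context grows over a session would want a guard that trims or summarises context before it crosses the threshold.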

GPT for reasoning-heavy, real-time work: GPT-5.5's input price of $5 per million tokens is on par with Claude Opus. But we're paying for reasoning capability most agents don't require. GPT models excel at novel reasoning and multi-step chains. For agents doing simple classification or extraction, that reasoning overhead is waste. GPT's real value is complex decision logic that needs to reason across 5 or more uncertain branches.

Multi-model strategy: Many production agents use more than one model: route high-complexity decisions to Claude or GPT, simple decisions to Gemini. This requires orchestration (which model to call when?) and fallback logic. Orchestration adds latency and complexity, but can cut cost by 30 to 50 percent if workload distribution is uneven.
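A routing layer can start very small. The sketch below is one possible shape, assuming we already have a complexity score between 0 and 1 for each decision (how that score is produced is out of scope). The tier labels, the `Route` type, and the 0.7 threshold are all hypothetical, not real model identifiers or recommended values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model: str     # hypothetical tier label, not a real API model ID
    fallback: str  # tier to try if the primary is unavailable


def choose_route(complexity: float, threshold: float = 0.7) -> Route:
    """Send decisions scored at or above the threshold to the stronger tier."""
    if not 0.0 <= complexity <= 1.0:
        raise ValueError("complexity must be in [0, 1]")
    if complexity >= threshold:
        return Route(model="strong-tier", fallback="cheap-tier")
    return Route(model="cheap-tier", fallback="strong-tier")


def blended_cost(simple_share: float, cheap_cost: float, strong_cost: float) -> float:
    """Average cost per decision when simple_share of traffic goes to the cheap tier."""
    return simple_share * cheap_cost + (1 - simple_share) * strong_cost


print(choose_route(0.9).model, choose_route(0.2).model)
# Hypothetical per-decision costs: $0.03 on the cheap tier, $0.30 on the strong tier.
print(round(blended_cost(0.5, 0.03, 0.30), 3))
```

`blended_cost` is the quick check for whether routing is worth the added complexity: if nearly all traffic ends up on the strong tier anyway, the saving will not cover the orchestration overhead.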

Three paths to production: direct API, AWS Bedrock, or Vertex AI. Each has trade-offs.

Direct API (Anthropic, OpenAI, Google): Lowest latency, access to latest model versions, and feature support (prompt caching for Claude, structured output for GPT). We manage multi-provider orchestration and custom cost attribution. Request-level cost tracking requires custom middleware unless we use a platform.

AWS Bedrock: Bedrock consolidates Claude and other providers' models under AWS IAM and cost centre frameworks. Claude 3.5 Sonnet input costs $6.00 per million tokens on the Bedrock Standard tier (versus $5 per million tokens for Opus 4.7 on the direct Claude API). Bedrock's Claude is older (3.5 Sonnet, not Opus 4.7), so we're paying Bedrock markup for an older model. For agents where model version matters, this is poor value. Bedrock doesn't support prompt caching. If caching would improve our cost model, Bedrock forces higher costs than direct API. The Flex tier's 50% discount relative to Standard is useful if we want cheaper pricing than Standard but aren't ready for batch. For agent systems, the choice is usually between direct API (fastest iteration, best features) and Bedrock (operational simplicity if already on AWS).

Vertex AI (Google Cloud): Similar to Bedrock—consolidated GCP cost tracking, older model versions, higher pricing than direct Google API. Less mature for multi-model orchestration than Bedrock.

Multi-provider strategies (Claude plus Gemini fallback) tend to favour direct APIs because we avoid vendor markup on both. Bedrock also carries additional charges: $0.10–$0.17 per 1,000 text units for content filtering if we use Guardrails, and $1.00 per 1,000 requests if we use Intelligent Prompt Routing.

Token pricing is transparent. Total agent cost is not.

Retry and fallback logic: If our primary model is unavailable or rate-limited, we fall back to a backup. Fallback logic adds latency (we wait for a timeout before retrying) and cost (retried requests are billed again, doubling spend on those requests). At a 1 percent failure rate, a fallback path that issues two extra requests per failure adds about 2 percent overhead. We design retry budgets and test them against actual failure rates. None of this appears in vendor pricing tables.
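A retry budget can be enforced with a small wrapper. This is a sketch under assumptions: providers are plain callables that take a prompt and either return text or raise, and the names `call_with_fallback`, `RetryBudgetExceeded`, and `expected_overhead` are our own. A real client would catch its SDK's specific timeout and rate-limit exceptions rather than `Exception`.

```python
from typing import Callable, Optional, Sequence, Tuple


class RetryBudgetExceeded(Exception):
    """Raised when every permitted attempt has failed."""


def call_with_fallback(providers: Sequence[Callable[[str], str]], prompt: str,
                       max_attempts: int = 2) -> Tuple[str, int]:
    """Try providers in order; return (response, attempts_used)."""
    attempts = 0
    last_error: Optional[Exception] = None
    for provider in providers:
        if attempts >= max_attempts:
            break  # budget spent: stop before calling another provider
        attempts += 1
        try:
            return provider(prompt), attempts
        except Exception as exc:  # sketch only; narrow this in real code
            last_error = exc
    raise RetryBudgetExceeded(f"gave up after {attempts} attempts") from last_error


def expected_overhead(failure_rate: float, extra_requests_per_failure: int) -> float:
    """Extra request volume as a fraction of baseline, assuming independent failures."""
    return failure_rate * extra_requests_per_failure


print(expected_overhead(0.01, 2))  # 1% failures, two extra requests each
```

Returning the attempt count alongside the response lets the caller log how often the fallback path actually fires, which is the number needed to check the budget against real failure rates.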

Prompt engineering iteration: Production agents rarely ship with the first prompt. We iterate when the model doesn't understand the format, test on real data, find edge cases, and rewrite. Every iteration is billed per request, just like production traffic. The payoff scales with volume: a 10 percent improvement in prompt efficiency (shaving 500 tokens per request) saves 500,000 input tokens per thousand requests, which at $5 per million input tokens is $2.50 per thousand requests and $2,500 per million requests. Budget for iteration cost upfront, or we'll underprice agents by 20 to 30 percent.

Decision loop termination: Agents that don't terminate cleanly incur unbounded cost. Endless loops, stuck agents, or waiting for feedback that never arrives can cost thousands per hour. Termination logic is not model work; it's orchestration. We design explicit termination (max loop count, timeout, manual override) and monitor it.
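The three termination controls named above (max loop count, timeout, manual override) fit in one small loop wrapper. This is a minimal sketch: `run_agent_loop`, `step`, and `should_stop` are hypothetical names, and `step` stands in for whatever one iteration of our agent does. Note that the timeout is only checked between steps, so a single step that blocks forever still needs its own request-level timeout.

```python
import time
from typing import Callable, Optional, Tuple


def run_agent_loop(step: Callable[[int], Optional[str]],
                   max_iterations: int = 10,
                   timeout_seconds: float = 30.0,
                   should_stop: Callable[[], bool] = lambda: False
                   ) -> Tuple[Optional[str], str]:
    """Run step(i) until it returns a result or a termination condition fires.

    Returns (result, reason); reason is one of "completed", "max_iterations",
    "timeout", or "manual_override".
    """
    deadline = time.monotonic() + timeout_seconds
    for i in range(max_iterations):
        if should_stop():                  # operator kill switch
            return None, "manual_override"
        if time.monotonic() >= deadline:   # wall-clock budget for the whole loop
            return None, "timeout"
        result = step(i)
        if result is not None:
            return result, "completed"
    return None, "max_iterations"


# A toy step that finishes on its third iteration.
print(run_agent_loop(lambda i: "done" if i == 2 else None))
```

Returning the reason alongside the result gives monitoring something concrete to count: a rising share of "max_iterations" or "timeout" exits is an early signal of stuck agents before it shows up on the bill.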

Cost attribution: If we're multi-tenant, we need cost attribution per customer. Request-level cost logging (which model, how many tokens, how long) requires custom middleware unless we use Bedrock (which gives us cost centre allocation). If we're on direct APIs, we'll build this ourselves or miss cost tracking entirely. Plan for it.
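If we build attribution ourselves, the core is a per-request usage record and an aggregation by tenant. The sketch below is one possible shape, not a complete middleware: `UsageRecord`, `PRICES`, and the `"opus"` label are hypothetical names, and the price table reuses the $5 / $25 per-million figures from earlier in this post.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class UsageRecord:
    tenant_id: str
    model: str
    input_tokens: int
    output_tokens: int
    latency_ms: float


# Hypothetical price table: model label -> (input $/M tokens, output $/M tokens).
PRICES: Dict[str, Tuple[float, float]] = {"opus": (5.0, 25.0)}


def record_cost(rec: UsageRecord) -> float:
    """Dollar cost of a single logged request."""
    in_price, out_price = PRICES[rec.model]
    return (rec.input_tokens / 1_000_000 * in_price
            + rec.output_tokens / 1_000_000 * out_price)


def cost_by_tenant(records: Iterable[UsageRecord]) -> Dict[str, float]:
    """Sum request costs per tenant."""
    totals: Dict[str, float] = defaultdict(float)
    for rec in records:
        totals[rec.tenant_id] += record_cost(rec)
    return dict(totals)


log = [
    UsageRecord("tenant-a", "opus", 50_000, 2_000, 840.0),
    UsageRecord("tenant-a", "opus", 50_000, 2_000, 910.0),
    UsageRecord("tenant-b", "opus", 1_000_000, 0, 4_200.0),
]
print(cost_by_tenant(log))
```

In practice the records would be written by a wrapper around every model call, using the token counts the provider returns in its response, and persisted somewhere queryable rather than held in memory.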

Sources

  1. Claude Opus 4.7 input token pricing. Anthropic docs (Models overview).
  2. Claude Opus 4.7 output token pricing. Anthropic docs (Models overview).
  3. AWS Bedrock Batch inference pricing discount relative to on-demand pricing. AWS Bedrock pricing page.
  4. Claude Opus 4.7 maximum output token limit. Anthropic docs (Models overview).
  5. Claude Sonnet 4.6 and Claude Opus 4.7 support 1 million token context window at standard pricing. Anthropic docs (Models overview).
  6. Gemini 3 Flash input token pricing. Google AI docs (Pricing).
  7. Gemini 3.1 Pro input token pricing (≤200k token prompts). Google AI docs (Pricing).
  8. GPT-5.5 input token pricing. OpenAI API docs (GPT-5.5 Model).
  9. Claude 3.5 Sonnet input token price on AWS Bedrock (Standard tier, US regions). AWS Bedrock pricing page.
  10. AWS Bedrock Flex pricing tier discount relative to Standard tier. AWS Bedrock pricing page.
  11. AWS Bedrock Guardrails pricing. AWS Bedrock pricing page.
  12. AWS Bedrock Intelligent Prompt Routing pricing per request. AWS Bedrock pricing page.