Foundations · 11 April 2026 · 18 min read

AI foundations for solution architects

By ArchVerify Editorial

You have been asked to add a chatbot to an internal knowledge base of about 8,000 Confluence pages. The CTO wants a working demo in two weeks. You have spent the last decade building cloud workloads and you have never shipped anything with an LLM in it. You do not have time to read three vendor blogs and an academic paper before the next standup, and you cannot afford to put a number into a design document that turns out to be wrong in front of the architecture review board.

This post is the mental model you need to make that call without pretending to become an ML engineer. It covers what an LLM actually is, how tokens and context windows work, what embeddings are for, where the real costs hide, and the small set of things that change weekly versus the things that are stable enough to commit to in an ADR. By the end you should be able to size a project, name the trade-offs out loud, and tell the difference between an AI problem and a problem that just sounds like one.

TL;DR

An LLM is a managed service that turns text into more text, billed by the token. The four things an architect needs to internalise are: tokens are how everything gets measured and billed; context windows are working memory, not storage; embeddings are how meaning gets compared without keyword matching; and the cost model is linear in tokens, which means the bill scales with how chatty your prompts are. Everything else in the AI stack is plumbing on top of those four ideas.

Is this actually an AI problem?

The first question to answer is not which model to use. It is whether the problem actually needs a model at all. Most of the bad AI projects of the last two years are sitting on the wrong side of this line — using an expensive, slow, non-deterministic tool to solve a problem that a regular expression and a SQL query would have handled deterministically in a tenth of the time. The useful test is whether your existing engineers could solve the problem in a week of deterministic code, or whether they would need six months of edge cases. If a week, the LLM is the wrong tool. If six months, the LLM is probably the right tool.

When AI is the right shape of answer

LLMs are good at problems where the input is ambiguous natural language, the desired output is also natural language or structured data extracted from natural language, the cost of being slightly wrong is acceptable, and the cost of being completely wrong is recoverable. Customer support routing, document summarisation, classification when the categories are fuzzy, code suggestions, meeting notes — all of these fit. The model is allowed to be wrong sometimes because a human is in the loop, or because the wrong answer is cheap to throw away and try again.

When AI is the wrong tool

LLMs are wrong when the input is structured and the rules are well-defined, when the output has to be exactly correct every time, or when the cost of being wrong is high and irreversible. Tax calculations, payment authorisation, access control decisions, anything with a regulator looking over your shoulder. A regular expression or a SQL query or a state machine will outperform a model on those problems and will not hallucinate at three in the morning when the support engineer cannot reproduce the issue. Architects who cannot articulate why their use case is on the AI side of this line should not be on the AI side of the line.

What an LLM actually is

Forget the academic framing for a moment. From an architect's point of view, a large language model is a managed service that takes text in and produces text out. That is the entire interface. You send a request containing some prompt text. You get back a response containing some generated text. Everything else — temperature, top-p, system prompts, function calling, structured outputs — is configuration on top of that one operation.

The most useful framing: a stateless Lambda that knows English

The analogy that lands fastest with architects is that a model is like a stateless Lambda function that happens to know how to write English. You call it, you get a response, the next call has no memory of the previous one. Any sense of conversation is something you build by sending the previous turns back in the next request. The model is not 'thinking' between calls. It is not running in the background. There is no session state on the server.

This matters because it tells you immediately how the model fits into your existing cloud mental model. It is not a database. It is not a queue. It is a request-response service with high latency, variable output, and per-call billing. You wrap it in retries, you log its outputs, you cache where you can, you put a circuit breaker around it, and you treat its responses as untrusted input to whatever code calls it next. None of that should feel exotic. The exotic part is what comes out the other end.
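To make the statelessness concrete, here is a minimal sketch. The `call_model` function is a hypothetical stand-in for any provider SDK — the point is that the client, not the server, carries the conversation, and every call resends the whole history.

```python
# Hypothetical sketch: the client owns all conversation state.
# call_model stands in for a real provider SDK call; it sees only
# the messages handed to it on this call and nothing else.

def call_model(messages: list[dict]) -> str:
    # Placeholder for a real API call; reports how many turns it was shown.
    return f"(model saw {len(messages)} messages)"

def chat_turn(history: list[dict], user_text: str) -> str:
    history.append({"role": "user", "content": user_text})
    reply = call_model(history)  # the full history is resent on every call
    history.append({"role": "assistant", "content": reply})
    return reply

history: list[dict] = [{"role": "system", "content": "You are a helpful assistant."}]
chat_turn(history, "First question")
reply = chat_turn(history, "Follow-up question")
```

Drop a turn from `history` and the model has genuinely never seen it — there is no server-side session to fall back on.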

Tokens — the unit everything is measured in

A token is a chunk of text the model processes as a single unit. It is not a word and it is not a character. For English prose, one token is roughly three quarters of a word, so 1,000 tokens is about 750 words. For code, punctuation, or non-English text, the ratio is different and usually worse.

Why every estimate starts with tokens

Tokens are the unit everything in this world is measured in. Context window size is in tokens. Pricing is per million tokens. Rate limits are in tokens per minute. Latency is roughly proportional to token count. If an architect is going to internalise one number, it is the rough conversion from words to tokens, because every estimate in the project will start there. Get that wrong and every downstream estimate is wrong by the same factor.
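The conversion is back-of-envelope arithmetic, but it is worth writing down once so every estimate in the project uses the same assumption. A sketch, using the post's rough ratio of 0.75 words per token for English prose:

```python
def estimate_tokens(word_count: int, words_per_token: float = 0.75) -> int:
    """Rough token estimate for English prose: ~0.75 words per token.

    Code, punctuation-heavy text, and non-English text usually tokenise
    worse, so treat this as a floor, not a ceiling.
    """
    return round(word_count / words_per_token)
```

So 750 words is about 1,000 tokens, and 150,000 words fills a 200,000-token window.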

A worked example: 10,000 documents per day

A 10-page customer document is about 5,000 words, which is about 6,500 tokens. If you are sending that document plus a 500-token instruction prompt to an LLM 10,000 times a day, you are processing 70 million input tokens daily. At Claude Sonnet 4.6 pricing of $3 per million input tokens, that is over $200 a day for the input alone, before you count the model's response, before retries, and before the prompt scaffolding the orchestrator adds. The bill is real and it scales linearly with how chatty you are.
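The arithmetic behind that worked example, as a sketch you can rerun with your own numbers — note it deliberately counts input tokens only:

```python
def daily_input_cost(doc_tokens: int, prompt_tokens: int,
                     requests_per_day: int, usd_per_million_input: float) -> float:
    """Input-side cost only: excludes output tokens, retries, and the
    prompt scaffolding an orchestrator adds on top."""
    tokens_per_request = doc_tokens + prompt_tokens
    daily_tokens = tokens_per_request * requests_per_day
    return daily_tokens / 1_000_000 * usd_per_million_input

# The post's example: 6,500-token document plus a 500-token instruction
# prompt, 10,000 requests a day, $3 per million input tokens.
cost = daily_input_cost(6_500, 500, 10_000, 3.0)  # 70M tokens/day
```

That comes out at $210 a day before the model has generated a single output token.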

Context windows — working memory, not storage

The context window is the amount of text the model can 'see' in a single request. It includes the user's current prompt, any previous turns of the conversation, any documents you have attached, and the system prompt setting up the model's role. Everything has to fit inside the window or the request is rejected.

The desk-versus-library analogy

Think of the context window as the model's desk, not its library. The desk is finite. Anything not currently on the desk is invisible to the model on this call. If you need the model to refer to information that does not fit on the desk, you cannot just hope it remembers — you have to put it back on the desk before you ask the question. Every time.
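The desk metaphor turns into a budget check before every call. A sketch (the token counts and the output reserve are illustrative assumptions, not provider constants):

```python
def fits_on_desk(system_tokens: int, history_tokens: int,
                 document_tokens: int, question_tokens: int,
                 window: int = 200_000, reserve_for_output: int = 4_000) -> bool:
    """Everything the model must 'see' has to fit in one request,
    leaving headroom for the tokens it will generate in reply."""
    used = system_tokens + history_tokens + document_tokens + question_tokens
    return used + reserve_for_output <= window
```

If this returns False, something has to come off the desk — usually by retrieving fewer documents or summarising the history.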

Numbers worth memorising

The standard Claude context window is 200,000 tokens, which sounds enormous and is enormous. To put a number on it, that translates to about 150,000 words or roughly 500 pages. The Opus 4.6 and Sonnet 4.6 models support up to 1,000,000 tokens in extended context mode. For most architecture conversations, even a 200,000-token window is more than you will fill on a single request — but it fills surprisingly fast once you start stuffing in retrieved documents, conversation history, and tool definitions.

Why bigger windows are not automatically better

There is a subtlety. As the context grows, models get measurably worse at finding specific information inside it — a phenomenon documented as 'context rot'. A model with 200,000 tokens of context will not necessarily answer a question accurately if the relevant fact is buried in the middle. So the practical move is to retrieve only the documents relevant to the current question, not to dump everything you have in case it helps. This is the part of the architecture that retrieval-augmented generation (RAG) is for, and it is the subject of a separate post.

Why context costs money

Here is the part that catches teams out. Context is not free. Every token you put in the context window is an input token that you pay for, on every call. If you are running a chatbot that includes the entire conversation history in each request — which is how stateless chat works — the bill grows quadratically with conversation length, not linearly. Turn one sends 100 tokens. Turn two sends those 100 plus the previous response. Turn ten sends all nine previous turns plus the original system prompt, every single time.
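The quadratic growth is easy to demonstrate with a few lines of arithmetic. A sketch with illustrative token counts — doubling the conversation length far more than doubles the cumulative billed input:

```python
def cumulative_input_tokens(turns: int, tokens_per_message: int = 100,
                            system_tokens: int = 200) -> int:
    """Total input tokens billed across a whole conversation where every
    call resends the system prompt plus all previous turns."""
    total = 0
    history = 0  # tokens of prior user and assistant messages
    for _ in range(turns):
        total += system_tokens + history + tokens_per_message  # this call's input
        history += 2 * tokens_per_message  # user message + model reply join history
    return total

short_chat = cumulative_input_tokens(10)  # 12,000 tokens billed
long_chat = cumulative_input_tokens(20)   # 44,000 -- well over double
```

Twice the turns, nearly four times the bill: that is the quadratic term showing up on the invoice.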

Input versus output pricing

Claude Sonnet 4.6 input pricing is $3 per million input tokens. Output pricing is $15 per million output tokens. The five-times multiplier on output is normal across the industry — generated tokens cost more than input tokens because they take more compute to produce. This means a chatbot that generates short responses to long prompts is mostly paying for input. A chatbot that generates long responses to short prompts is mostly paying for output. A chatbot that does both is paying for both, and the bill is the sum, not the average.

Three mitigations that actually work

There are three practical levers for keeping the bill in check. The first is prompt caching: most providers will charge a fraction of the input price for tokens that match a recent previous request, which makes it cheap to keep reusing the same long system prompt across calls. The second is context trimming: do not send the full history when you can send a summary, and do not send the summary when you can send nothing. The third is model selection — smaller and cheaper models exist for the parts of the workflow that do not need the smartest model in the family. A classification step can usually run on a model an order of magnitude cheaper than the model doing the actual generation.

Embeddings — coordinates for meaning

Embeddings are the second concept worth internalising. An embedding is a list of numbers — usually somewhere between 384 and 3,072 of them — that represents the meaning of a piece of text as a point in a high-dimensional space. Two texts that are about similar things end up with embeddings close to each other in that space, even if they share no actual words.

The postcode-for-meaning analogy

An embedding is a postcode for meaning. Two articles about the same topic live in the same neighbourhood of meaning-space, regardless of whether one uses the word 'cancellation' and the other says 'churn'. A keyword search would miss the connection. A search-by-embedding would find both because they sit close together in the coordinate system. The architect's takeaway: the model can find conceptually similar things without knowing the vocabulary in advance.
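Closeness in that coordinate system is usually measured with cosine similarity — the cosine of the angle between two vectors. A sketch with toy 3-dimensional vectors (real embeddings have hundreds or thousands of dimensions, and the numbers here are invented for illustration):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors: close to 1.0
    means similar meaning, close to 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings.
cancellation = [0.9, 0.1, 0.0]
churn        = [0.85, 0.15, 0.05]
invoice      = [0.0, 0.2, 0.95]
```

`cancellation` and `churn` score high despite sharing no words; `invoice` sits in a different neighbourhood entirely.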

How embeddings unlock RAG

This is the entire foundation of semantic search and of retrieval-augmented generation. Convert your documents to embeddings, store them in a vector database, then when a user asks a question, convert the question to an embedding and find the documents whose embeddings are nearest. The retrieved documents go into the context window, the model answers with that context in scope, and the user gets an answer grounded in the source material instead of the model guessing.
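The retrieval step reduces to a nearest-neighbour search. A brute-force sketch over an in-memory list of (text, embedding) pairs — a vector database does exactly this, with an index instead of a full scan, and the corpus and embeddings here are illustrative:

```python
import math

def _cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a))
                  * math.sqrt(sum(x * x for x in b)))

def retrieve(question_emb: list[float],
             corpus: list[tuple[str, list[float]]], top_k: int = 2) -> list[str]:
    """Brute-force nearest-neighbour search over (text, embedding) pairs."""
    ranked = sorted(corpus, key=lambda item: _cosine(question_emb, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:top_k]]

def build_prompt(question: str, passages: list[str]) -> str:
    """The retrieved passages go into the context window ahead of the question."""
    context = "\n\n".join(passages)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```

Swap the list scan for a vector store and the embedding literals for an embedding API call and this is the standard RAG request path.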

What this means for an architect: any AI feature that needs the model to refer to your own data — your product docs, your support tickets, your internal wiki — is going to need an embedding pipeline and a vector store. That is not an optional extra. It is the way the model gets access to anything it was not trained on. Which vector store to pick is the subject of a separate deep page; for now it is enough to know the pipeline exists and it is the standard pattern.

When to reach for AI versus traditional code

A side-by-side, because architects scan tables faster than they read prose. The comparison table appears further down the post, after the safety-reading section.

The model as a managed service

The mental model that gets architects unstuck fastest is to treat the model the same way you treat any other managed cloud service. It has an SLA you do not control, latency you have to design around, a rate limit you have to respect, a pricing model that rewards efficiency, a dashboard somewhere you should be alerting on, and a vendor that will deprecate models faster than you would like.

What transfers from your existing cloud playbook

Claude, GPT-4, Gemini, and the open models on Bedrock or Vertex are all managed services in the same operational sense as DynamoDB, App Service, or BigQuery. The architectural disciplines you already have — circuit breakers, retries with exponential backoff, dead letter queues, request hedging, fallback paths, observability, cost dashboards — all transfer directly. Use them. None of this is new architecture.
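One of those disciplines, retries with exponential backoff and jitter, translates directly. A sketch — `TransientError` is a hypothetical stand-in for the 429 or 503 a provider SDK would raise:

```python
import random
import time

class TransientError(Exception):
    """Stand-in for a rate-limit or overload error from the provider."""

def call_with_backoff(call, max_attempts: int = 5, base_delay: float = 1.0):
    """Exponential backoff with jitter -- the same discipline you would
    wrap around any rate-limited managed service."""
    for attempt in range(max_attempts):
        try:
            return call()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error to the caller
            delay = base_delay * (2 ** attempt) * random.uniform(0.5, 1.5)
            time.sleep(delay)
```

Remember that every retry rebills the full input tokens, so the backoff wrapper belongs next to the cost dashboard, not instead of it.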

What does NOT transfer

The only thing that does not transfer is the mental habit of expecting the same input to give the same output. Models are non-deterministic by default. Two identical requests will return slightly different responses. Code that expects determinism will break in subtle ways and the architect's job is to design around the non-determinism, not to fight it. A useful anchor: anything you would not let an intern with Wikipedia access do unsupervised, you should not let an LLM do unsupervised. The skill ceiling is higher than an intern's, but the failure mode is the same — confidently wrong about specific facts, fluent enough that the wrongness is not obvious until someone checks. Design the workflow so that someone always checks.
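"Someone always checks" can be code as well as a human. A sketch of treating model output as untrusted input, using a hypothetical ticket-routing example: parse strictly, validate against an allow-list, and fail loudly so the caller can retry instead of passing garbage downstream.

```python
import json

ALLOWED_QUEUES = {"billing", "technical", "account"}

def parse_routing(model_output: str) -> str:
    """Validate a model's routing decision the way you would validate
    user input in a web form: strict parse, allow-list, loud failure."""
    try:
        payload = json.loads(model_output)
    except json.JSONDecodeError as exc:
        raise ValueError(f"unparseable model output: {exc}") from exc
    queue = payload.get("queue")
    if queue not in ALLOWED_QUEUES:
        raise ValueError(f"queue {queue!r} not in allow-list")
    return queue
```

A fluent-but-wrong answer fails the allow-list check here instead of reaching the ticketing system.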

How this plays out at three scales

The same foundations apply at every scale, but the right answer to 'what should we do' changes drastically with the size of the team and the regulatory weight of the company.

At regulated-enterprise scale

Multi-million-customer financial services, insurance, healthcare, anything with a risk committee that approves architecture decisions. The right first move is almost never building. It is selecting a managed service from a vendor your procurement and compliance teams have already approved, putting it behind a strict prompt and output filter, logging every interaction for audit, and starting with a non-customer-facing internal use case that does not touch personally identifiable information. The bill is not the constraint. The audit trail and the blast radius are. Everything moves at the speed of the architecture review board, which is fortnightly, which is fine because the wrong shortcut here costs more than the project.

At mid-market scale-up scale

A few hundred employees, single cloud, lean architecture team, growing fast. The constraint is engineering time, not budget or compliance. The right move is to use whichever frontier model the team already has API access to, build a thin abstraction layer so the model provider can be swapped later, and start with one well-scoped feature that has clear success criteria. Avoid frameworks that lock you in to a particular orchestration style until you understand your own workload. Most of the architectural value at this scale comes from picking a few patterns and applying them consistently, not from using every tool in the ecosystem.

At greenfield-startup scale

A handful of engineers, founder still in the codebase, AI is the product not a feature. The constraints invert. Cost matters because credits run out, latency matters because users abandon, and lock-in matters less because the whole stack will be rewritten at least twice in the first eighteen months anyway. Use the cheapest model that meets the quality bar, batch where possible, cache aggressively, and accept that the architecture you ship in month three will be replaced in month nine. The discipline at this scale is to keep the abstraction surface small enough to rewrite, not to design for an enterprise the company is not yet.

What's stable enough to commit to in an ADR

The concepts in this post — what tokens are, what context windows are, what embeddings are, the model-as-managed-service framing, the managed-service operational disciplines — are stable. They have not changed since 2023 and they are not going to change in the next year. An architecture decision record built on these concepts will still make sense in eighteen months. Concept-level claims age slowly; you can lean on them.

What changes weekly and needs to be checked

The snapshots — specific pricing, specific context window sizes, specific model names — are not stable. They change quickly, and they change with no notice from the vendor. A bill or an SLA built on a snapshot needs to be checked at least quarterly. The right answer to 'which model should we use' last summer is often the wrong answer this autumn. ArchVerify's approach to this on the blog is to mark every quantitative claim with a source and a change-sensitivity tag, so you can see which numbers to trust for a year and which to verify before your next sprint planning. The dog-food post 'Verifying AI systems against real sources' explains the methodology.

The minimum safety reading every architect should do

There is a single document worth reading once before you ship anything with a model in it: the OWASP Top 10 for LLM Applications, currently in its 2025 edition. It is short, it is free, and it is the closest thing the AI safety field has to the cloud-security checklists you have already internalised. Treat it as the equivalent of OWASP Top 10 for web apps — you do not need to memorise it, but you need to know it exists and refer to it when you are designing something. It is the cheapest hour an architect can spend on AI safety.

| Problem shape | Reach for AI when | Reach for traditional code when |
| --- | --- | --- |
| Input format | Natural language, free text, messy human input | Structured data, defined schema, validated forms |
| Required correctness | Mostly right is acceptable; humans review edges | Must be exactly right every time; no human review |
| Cost of being wrong | Recoverable, low blast radius, easy to retry | Irreversible, regulated, financial or safety impact |
| Rules clarity | Rules are fuzzy, exceptions everywhere, hard to enumerate | Rules are clear, exceptions are few, easy to test |
| Engineering effort | Six months of edge cases in deterministic code | A week of code your team could write tomorrow |

Common mistakes

  1. Assuming the model has memory between calls. It does not. Every call is stateless. Conversation history is something you build by sending previous turns back in the next request, and you pay for those tokens every time.

  2. Picking the largest model by default. Frontier models are expensive and slow. Smaller models exist for a reason and most workflows have at least one step that does not need the smartest model in the family. Classification, routing, and extraction can usually run on a model an order of magnitude cheaper than the generation step.

  3. Treating output as trusted input. Model output is untrusted text that happens to look fluent. Code that calls an LLM and then passes the result into a downstream system without validation is the most common production AI bug, and it is the same shape as failing to sanitise user input in a 2005 web app.

  4. Estimating cost from a single test run. A successful prototype on ten test cases tells you almost nothing about production cost. The bill scales with token volume, conversation length, retries, and prompt scaffolding. Budget by extrapolating from realistic traffic, not from the demo.

  5. Conflating context window size with knowledge. A 200,000-token window does not mean the model knows your data. It means you can put up to 200,000 tokens of your data on the model's desk for a single call. The data is gone the next call. Any persistent knowledge requires an embedding pipeline and a retrieval step.

  6. Skipping the safety read. The OWASP Top 10 for LLM Applications takes about an hour to read once and prevents the most common production failure modes. Architects who skip it tend to discover prompt injection the hard way.

Frequently asked

Do I need to understand the maths behind LLMs to architect with them?

No. You need to understand the operational behaviour: stateless, token-billed, non-deterministic, finite context, untrusted output. The maths matters if you are training models. It does not matter if you are calling them from your application code.

Should I host my own model or use an API?

For most teams, use an API. Self-hosting a frontier-quality model means buying or renting GPUs, paying for the operations to run them, and falling six months behind whatever the API providers ship next. Self-hosting makes sense at very high volumes where the unit economics flip, or when regulatory constraints prevent the data leaving your VPC. Otherwise the API is cheaper, faster, and lower-effort.

How do I budget for an AI feature before I have built it?

Estimate the average input tokens per request, the average output tokens per response, the requests per day, and the model's published per-million-token pricing. Multiply them together and double the result for retries, prompt scaffolding, and the things you have not thought of yet. Then track the actual numbers from day one and adjust. The estimate will be wrong but the discipline of estimating it before you ship is what stops the runaway bill.

What's the difference between RAG and fine-tuning?

RAG (retrieval-augmented generation) means looking up relevant documents at request time and putting them in the context window so the model can refer to them. Fine-tuning means changing the model's weights so it learns a new style or domain. RAG is for adding knowledge the model does not have. Fine-tuning is for changing how the model talks. Most projects need RAG. Fewer need fine-tuning. They are not alternatives — they solve different problems. There is a separate deep page on when to use which.

How non-deterministic are these models in practice?

Set the temperature to zero and you will get nearly the same output for the same input most of the time, but not always. Even at temperature zero, provider-side batching and floating-point arithmetic can produce slightly different outputs for identical requests. Design the workflow so that small variations in output do not break the next step — usually by parsing the output strictly and retrying on parse failure, or by asking the model for structured output the orchestrator can validate.

Which model should I start with?

Pick whichever frontier model your team already has API access to, ship a prototype with it, and only revisit the choice once the prototype is working. Architects burn weeks comparing models for projects that turn out not to need the absolute best one. Get it working, measure where the actual quality gap is, and optimise for that gap rather than for general benchmarks.

Where can I find the vocabulary I do not yet know?

ArchVerify maintains a separate AI jargon page at /jargon/ai with the 30 terms an architect new to AI needs to understand. It is ordered by which terms to learn first. If you hit a word in any of these posts that does not click, that is the place to start.


Sources

  1. Standard Claude context window size: Claude API Docs — Context windows
  2. Claude extended context window for Opus 4.6 and Sonnet 4.6: Claude Platform release notes
  3. Approximate words equivalent for 200,000 tokens: AWS Bedrock — Anthropic
  4. Claude Sonnet 4.6 input pricing: Anthropic — Claude Sonnet 4.6
  5. Claude Sonnet 4.6 output pricing: Anthropic — Claude Sonnet 4.6
  6. Current published version of OWASP Top 10 for LLM Applications: OWASP GenAI Security Project — LLM Top 10