Dogfood · 9 April 2026 · 10 min read

Verifying AI systems against real sources

By ArchVerify Editorial

Architects bill by the hour and hate doing the same work twice. AI writing tools that confabulate force exactly that: every claim has to be re-verified by hand before it can land in a decision record. We built ArchVerify because we got tired of paying that tax. This post explains the methodology behind every other post on this site, and why the publishing pipeline itself is the safety mechanism rather than the prompt.

The problem we built ArchVerify to solve

Most AI writing tools optimise for fluency. We optimise for traceability. When an architect asks for the p99 latency of a managed vector database at one million vectors, they do not want a confident paragraph. They want a number, the date that number was published, and a link to the page it came from. If the number changes next month, they want to know without re-reading the post.

The cost of an unverified claim in an architecture decision record is asymmetric. A correct claim saves an hour. An incorrect claim that gets baked into a design and surfaces six weeks later in review costs the team a day, plus the trust hit with whichever stakeholder spotted it first. Multiply that across a portfolio and the cheap option — confident prose with no provenance — turns out to be the expensive one.

The target reader for this site is a solution architect with five to fifteen years of cloud experience who is now being asked to make decisions about LLMs, agents, vector stores and AI governance with very little of the same accumulated tacit knowledge. They are billable. They have an hour between meetings. They do not have time to read three vendor blogs and a NIST PDF to answer one question, and they cannot afford to take a confident-sounding answer at face value when it ends up in a board paper. Everything we publish is shaped around that hour.

What fact-first actually means

Every quantitative claim on this blog is stored as a structured fact, not as prose. A fact has six fields: claim, value, source URL, source name, category, and change sensitivity. The renderer substitutes facts into sentences via tokens, then assembles a numbered Sources block at the bottom of every post. If a source URL goes 404 or the value drifts, we know — because the fact is a row in a table, not a sentence buried in two thousand words.
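
As a rough sketch, such a fact row might look like the following Python dataclass. The field names mirror the six fields described above; everything else (types, the example values) is illustrative, not the actual schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Fact:
    """One quantitative claim, stored as data rather than prose."""
    claim: str               # what is being asserted
    value: str               # the number or figure itself
    source_url: str          # the page the value was taken from
    source_name: str         # human-readable name of the source
    category: str            # e.g. "vendor-doc-bound", "rate-bound"
    change_sensitivity: str  # "cosmetic" | "minor" | "material" | "structural"

# Hypothetical example row
fact = Fact(
    claim="p99 latency at one million vectors",
    value="42 ms",
    source_url="https://example.com/benchmarks",
    source_name="Example Vendor Benchmarks",
    category="vendor-doc-bound",
    change_sensitivity="material",
)
```

Because the row, not the sentence, is the unit of storage, checking for a dead URL or a drifted value is a table scan rather than a re-read of the prose.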

This is the opposite of how most AI-assisted writing works. The default flow is: model generates prose, human spot-checks, publish. Ours is: gather facts first, write prose around them, and the server rejects the post if any quantitative claim is not bound to a source. The constraint lives in the schema, not in the writer's discipline.

The practical effect on the page is that a sentence like 'the p99 latency at one million vectors is X' is never written by hand. The writer writes 'the p99 latency at one million vectors is {fact:0}', and the renderer substitutes the value at request time from a row that has its own URL, its own timestamp, and its own change-sensitivity flag. The same fact can appear in three different posts and update everywhere at once when the source moves. Conversely, a stale fact cannot hide in one post while another post quietly serves the new value, because the row is the source of truth.
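
A minimal sketch of that substitution step, assuming facts are passed in as plain dicts (the real renderer is not shown in this post):

```python
import re

def render(prose: str, facts: list[dict]) -> str:
    """Substitute {fact:N} tokens and append a numbered Sources block."""
    def sub(match: re.Match) -> str:
        # The token index selects the fact row whose value is inlined.
        return facts[int(match.group(1))]["value"]

    body = re.sub(r"\{fact:(\d+)\}", sub, prose)
    sources = "\n".join(
        f"{i + 1}. {f['source_name']} - {f['source_url']}"
        for i, f in enumerate(facts)
    )
    return f"{body}\n\nSources:\n{sources}"

facts = [{"value": "42 ms",
          "source_name": "Vendor docs",
          "source_url": "https://example.com"}]
print(render("The p99 latency at one million vectors is {fact:0}.", facts))
```

Because the Sources block is generated from the same rows that feed the tokens, the citation list cannot drift out of sync with the prose.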

Fact categories and why they matter

We classify every fact by what governs it, because that determines how it ages. The categories we use are roughly:

  • vendor-doc-bound: AWS, Azure, GCP, Anthropic, OpenAI, Google docs. Changes when the vendor ships.
  • statute-bound: EU AI Act and equivalent regulation. Changes when the law changes.
  • version-bound: NIST AI RMF, ISO standards, PCI DSS. Changes on a documented release cycle.
  • rate-bound: pricing, API limits, quotas. Changes with no notice.
  • year-bound: thresholds and deadlines that move with the calendar.
  • data-bound: public datasets that update on the publisher's schedule.

A claim about object storage pricing and a claim about EU AI Act risk tiers need different revisit cadences. Treating them the same is how stale content rots a knowledge base from the inside.
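
One way to make that cadence difference operational is a lookup from category to revisit interval. The intervals below are invented for illustration; the post states the categories but not the actual schedule.

```python
from datetime import timedelta

# Illustrative revisit cadences per fact category. The real intervals
# are a policy decision, not something stated in the post.
REVISIT_CADENCE = {
    "rate-bound":       timedelta(weeks=2),   # pricing moves with no notice
    "vendor-doc-bound": timedelta(weeks=6),   # changes when the vendor ships
    "data-bound":       timedelta(weeks=12),  # publisher's update schedule
    "year-bound":       timedelta(weeks=26),  # moves with the calendar
    "version-bound":    timedelta(weeks=26),  # documented release cycle
    "statute-bound":    timedelta(weeks=52),  # law changes slowly
}

def next_review(category: str) -> timedelta:
    """How long a fact in this category can go before a revisit."""
    return REVISIT_CADENCE[category]
```

The point is the ordering, not the numbers: a rate-bound fact should always come up for review before a statute-bound one.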

Change sensitivity and what it triggers

Each fact also carries a change sensitivity: cosmetic, minor, material, or structural. Cosmetic changes — a typo on the source page — do not trigger a regeneration. Minor changes do, automatically. Material changes regenerate the affected pages and alert us. Structural changes — a new version of a standard, a major regulatory revision — freeze the affected pages until a human reviews them.
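
The four tiers and their triggers can be sketched as a simple dispatch table. The action names are assumptions; the tier-to-behaviour mapping follows the paragraph above.

```python
from enum import Enum

class Sensitivity(Enum):
    COSMETIC = "cosmetic"
    MINOR = "minor"
    MATERIAL = "material"
    STRUCTURAL = "structural"

def on_source_change(sensitivity: Sensitivity) -> list[str]:
    """Return the actions a source change triggers at each tier."""
    actions = {
        Sensitivity.COSMETIC:   [],                                  # no regeneration
        Sensitivity.MINOR:      ["regenerate"],                      # automatic
        Sensitivity.MATERIAL:   ["regenerate", "alert"],             # regenerate and notify
        Sensitivity.STRUCTURAL: ["freeze", "request_human_review"],  # hold until reviewed
    }
    return actions[sensitivity]
```

Structural is the only tier that blocks publication, which is exactly how the human-attention budget described below stays small.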

This matters because the half-life of 'current' information varies wildly. Pricing rots in weeks. Standards rot in years. The page should know which kind of claim it is carrying, and behave accordingly when the underlying source moves.

The four-tier system also gives us a sane way to budget human attention. Cosmetic and minor changes can be handled by automation indefinitely. Material changes need to be looked at, but only the affected page and only when something actually moves. Structural changes are the only events that pull a human off something else, and they happen rarely enough that it is worth doing properly when they do. Most knowledge bases either review everything on a fixed cadence — wasteful — or never review anything at all. Tagging the source material at write time lets us spend the review budget where it matters.

How we dog-food this on the blog you are reading

We use the same pipeline on every post here. The publishing API rejects any draft that contains a numerical claim without a corresponding fact entry. It rejects banned marketing phrases at the server: not a style guide we ask writers to follow, but an HTTP 422 the API returns if a draft contains one. The renderer auto-generates the Sources block from the fact rows, which means we cannot forget to include it — there is no manual formatting step to skip.
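
A toy version of that server-side check might look like this. The banned-phrase list is invented, and a real validator would need to allow digits inside identifiers like 'p99'; this is a sketch of the shape of the check, not the actual API.

```python
import re

BANNED_PHRASES = {"game-changing", "revolutionary", "cutting-edge"}  # illustrative

def validate_draft(prose: str) -> tuple[int, str]:
    """Return (status, reason) the way the publishing API might.

    Numbers are only allowed inside {fact:N} tokens, so any bare
    digit left in the prose is an unbound quantitative claim.
    """
    lowered = prose.lower()
    for phrase in BANNED_PHRASES:
        if phrase in lowered:
            return 422, f"banned phrase: {phrase!r}"
    stripped = re.sub(r"\{fact:\d+\}", "", prose)  # remove bound claims
    if re.search(r"\d", stripped):
        return 422, "quantitative claim without a fact binding"
    return 200, "ok"
```

The draft either round-trips with a 200 or it does not ship; there is no path where a bad day produces a published post.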

Building the constraints into the system is the only way to stay honest at volume. Style guides decay. Schemas do not.

A writer working under a style guide can have a bad day, miss a phrase, ship a post that drifts. A writer working against a schema cannot. The post either round-trips through the API or it does not. The same applies to the LLM that drafts the prose: it has to produce output that conforms, and when it does not, the failure is immediate and visible rather than slow and reputational. Treating publishing as a typed contract instead of a culture is the single biggest difference between a blog that ages well and one that quietly fills with stale claims.

The rules we will not break

Six rules govern everything we publish:

  1. No claim without a source. If we cannot link it, we do not write it.
  2. No quoted passage longer than fifteen words from any single source. Paraphrase and cite.
  3. One direct quote per source, maximum. Usually zero.
  4. No marketing language. The publishing API carries a list of phrases it will not accept and returns a 422 if a draft contains any of them. The list lives on the server, not in a writer's style guide, so it cannot quietly drift over time.
  5. No speculation about hypothetical future AI capabilities. Architects work with what ships this quarter, not with what some essay claims will exist later. That whole class of content is out of scope.
  6. No claims we cannot stand behind in a code review. The blog and the product live in the same repository for a reason.

Why this matters more for AI than for cloud

Cloud architects are used to vendor docs that are correct, dated, and stable for months at a time. The AWS pricing page does not lie, and when it changes, the change is small and announced. The mental model of 'read the docs once, internalise, apply for a year' works.

AI infrastructure does not behave that way. Model versions deprecate on three-month cycles. Inference pricing drops by half between quarters. Context windows double without warning. The OWASP LLM Top 10 has already been revised once. Vendor SDKs ship breaking changes between minor versions. The half-life of a confidently written 'how to use X' post in this space is measured in weeks, not years, and the cost of acting on a stale claim is the same as it ever was.

This is the gap the methodology is designed for. A blog that treats AI infrastructure facts the way it treats cloud facts will be wrong about half its claims by the end of the quarter, and the reader has no way to tell which half. A blog that tags every claim with its source, its category, and its change-sensitivity can at least surface the staleness rather than hide it. That is a lower bar than 'always correct', and it is the only honest bar to aim for in a domain that moves this fast.

What this means if you are reading a post here

If you read a post on this site and want to verify a number, scroll to the bottom. Every quantitative claim in the prose appears as a numbered fact with the URL we pulled it from and the date we recorded it. Click through. If the source has moved or the value has drifted, the comment box on every page is the right place to flag it — 'this fact is stale' is the comment we read first.

The point of ArchVerify is to give architects a place where the work of verification has already been done. Not perfectly: sources go 404, vendors ship breaking changes, regulations get amended. But traceably. You can always see what we were looking at and when, and decide for yourself whether it still holds.

If you have spent the last few years building cloud workloads and are now being asked to make decisions about LLMs, vector stores, agent frameworks and AI governance — and you have a fraction of the tacit knowledge you had built up about EC2 or App Service — this site is built for that gap. Every other post on the blog assumes the reader is exactly that person: a working architect, short on time, allergic to marketing, and unwilling to put a number into a design document without knowing where it came from.

That is what we mean by verifying AI systems against real sources. It is not a feature on a roadmap. It is the entire premise of the site, and the rest of the posts in this series exist because that premise needs filling out one decision at a time.

Common mistakes

  1. Treating 'the model said it' as a source. It is not. A source is a URL on the open web that someone other than the model is responsible for keeping accurate.

  2. Citing the model's training data as if it were current. Training cut-offs lag by months or years; pricing pages and standards both move faster than that.

  3. Mixing fact categories without noticing. A page that draws on both regulatory text and live vendor pricing in the same paragraph needs two different review cadences. Most blogs treat them the same and rot the same.

  4. Hand-formatting citations. Once a step is manual, it gets skipped. Make the renderer do it or it will not happen consistently.

  5. Writing the prose first and citing afterwards. That order produces confident sentences that then have to be backed into. Gather the fact first, write the sentence around it, and the citation is already in your hand.

Frequently asked

Why do you call this 'dog-food'?

Because the methodology we describe is the same one we use to write the post you are reading. There is no separate process for our own content.

What happens when a source goes 404?

The fact row carries a TTL and a change-sensitivity flag. When the URL fails a check or the value drifts, the affected page flags itself for review and the corresponding sentence renders with a staleness warning.
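
The TTL half of that check reduces to date arithmetic. This sketch assumes the row stores the date the fact was recorded; field names are illustrative.

```python
from datetime import date, timedelta

def is_stale(recorded_on: date, ttl: timedelta, today: date) -> bool:
    """A fact is stale once its TTL has elapsed since it was recorded."""
    return today - recorded_on > ttl

# e.g. a rate-bound fact recorded 30 days ago with a 14-day TTL
stale = is_stale(date(2026, 3, 1), timedelta(days=14), date(2026, 3, 31))
```

The URL liveness check and the value-drift check sit alongside this; any one of the three failing is enough to flag the page.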

Can I trust a post that has zero facts?

Posts with no facts are always concept explainers — definitions, mental models, opinion pieces. They are labelled as such. No quantitative claim means no claim that can rot.

How is this different from a normal blog with citations?

A normal blog has citations next to claims. We have a structured fact table that the renderer reads from, which means stale facts can be detected and flagged automatically. Citations alone do not have that property.

Do you use AI to write these posts?

Yes. The constraint that every claim must be bound to a real source URL is what stops the model from inventing things. The architecture of the publishing pipeline is the safety mechanism, not the prompt.