Most AI governance programs we see started the same way. A product team shipped a model. Marketing wanted to announce the feature. Someone on the risk side asked who had signed off, and the room went quiet. From there the program either sprints to catch up or quietly defers governance until a regulator forces the issue.
There is a smaller, more useful version of this work that almost any team can put in place in a quarter. It doesn't require a dedicated AI risk committee, a SOC for AI, or a 200-page framework. It requires four artifacts, kept honest. This post covers what those artifacts are, why each exists, and what auditors are starting to ask for as the EU AI Act and ISO/IEC 42001 move from "coming soon" to "next inspection."
Why most programs stall before they start
AI governance, as a discipline, has a scope problem. The frameworks describe hundreds of controls. The vendor pitches describe platforms. The internal proposal describes a committee. The team shipping the actual AI feature hears all of that and reasonably concludes the governance program will land sometime next year, after they ship.
The honest answer is that you don't need most of the framework on day one. You need enough surface area that, if a regulator showed up next month, you could show what you have in production, which risk tier each system sits in and why, how you test it, and where a human intervenes when it gets something wrong. That's four artifacts, not four hundred.
The reason teams skip these four is rarely the cost of producing them. It is the ambiguity about whether they are even required for "this kind of AI." That ambiguity is going away as the EU AI Act and ISO/IEC 42001 become operational. The good news is that the same four artifacts answer both frameworks.
The four artifacts you need on day one
The minimum surface area is small.
A model inventory. One row per deployed model. For each row: the model name and version, the owner team, whether it is hosted by a third party (OpenAI, Anthropic, AWS Bedrock, etc.) or run in-house, the inputs it sees, the intended use, and the prohibited or out-of-scope uses. That last column matters more than teams initially think — it is what stops the model from being repurposed sideways without a fresh review.
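As a sketch, one inventory row can be captured as structured data rather than a spreadsheet cell. The field names and example values below are illustrative, not a schema prescribed by any framework.

```python
from dataclasses import dataclass, field

@dataclass
class ModelInventoryRow:
    """One row per deployed model. Field names are illustrative."""
    name: str                    # e.g. "support-summariser"
    version: str                 # pin the version this row describes
    owner_team: str              # who answers for this model
    hosting: str                 # "third-party" (OpenAI, Anthropic, Bedrock, ...) or "in-house"
    inputs: list[str]            # the data the model actually sees
    intended_use: str            # what it was reviewed and approved for
    prohibited_uses: list[str] = field(default_factory=list)  # the column that blocks sideways repurposing

inventory = [
    ModelInventoryRow(
        name="support-summariser",
        version="2024-05-01",
        owner_team="customer-support-platform",
        hosting="third-party",
        inputs=["ticket text", "public help-centre articles"],
        intended_use="Summarise inbound support tickets for agents",
        prohibited_uses=["automated replies to customers", "any HR or hiring use"],
    ),
]
```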
A risk tier per model. A short, written assignment to one of three tiers (minimal, limited, high — or whatever scheme matches your regulator), with a one-paragraph justification. The tier drives every downstream control: how often you re-evaluate, whether a human reviews each output, how much documentation the model needs.
An evaluation plan. For each model, what does "good enough to ship" mean, in numbers? Capture the dataset, the metrics, the thresholds, and the cadence of re-evaluation. The discipline here is to write the thresholds before measuring the model against them. Post-hoc thresholds are not thresholds.
A human-in-the-loop policy. Which decisions does the model make unsupervised, which require human confirmation, and how does the human escalate disagreement? For high-risk uses this isn't optional under the EU AI Act; for everything else it's still the cheapest way to bound the downside of a bad output.
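As a sketch, the policy can show up directly in code at the point where an output takes effect. The tier names and return values below are placeholders for whatever scheme and review tooling you actually use.

```python
def route_output(risk_tier: str, human_confirmed: bool = False) -> str:
    """Decide what happens to a model output under a simple human-in-the-loop policy.

    Tier names and outcomes are illustrative; map them to your own scheme.
    """
    if risk_tier == "high":
        # High-risk: nothing takes effect until a named human confirms it.
        return "applied" if human_confirmed else "queued_for_human_review"
    if risk_tier == "limited":
        # Limited-risk: applied immediately, but reversible and sampled for review.
        return "applied_with_sampled_review"
    # Minimal-risk: unsupervised, covered by periodic spot checks.
    return "applied_unsupervised"


assert route_output("high") == "queued_for_human_review"
assert route_output("minimal") == "applied_unsupervised"
```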
That's the surface area. Each artifact fits on a page. The discipline is in keeping them current, not in making them elaborate.
These four artifacts, together, answer most of what auditors and regulators are asking for. Skipping any one of them is where programs typically come apart.
Risk tiering — the part teams skip
Risk tiering is the artifact teams most often defer, and the one that pays back fastest. Without a tier, every model receives the same treatment: a DSAR chatbot drawing on a public knowledge base gets the same controls as a model deciding which CV to surface for a hiring manager. The two are not remotely comparable in risk, and pretending they are produces both under-protected high-risk uses and over-engineered low-risk ones.
A simple, defensible scheme starts with two axes: data sensitivity and decision impact.
- Data sensitivity. Public data, internal-but-non-personal data, personal data, special category data (health, financial, biometric).
- Decision impact. Informational only, advisory to a human, automated decision with reversal, automated decision without reversal.
Plot the model on those two axes, document the result, and apply a proportional control set. The EU AI Act effectively does this same exercise top-down; ISO 42001 wants you to do it bottom-up. Either way, the artifact is the same.
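As a sketch, the bottom-up version can be a small lookup from the two axes to a tier. The scales and cut-offs below are illustrative defaults, not values taken from either framework, and any use a regulation names explicitly (hiring under the EU AI Act, for instance) stays high-risk regardless of the score.

```python
# Ordinal scales for the two axes. Positions and cut-offs are illustrative.
DATA_SENSITIVITY = ["public", "internal", "personal", "special_category"]
DECISION_IMPACT = ["informational", "advisory", "automated_reversible", "automated_irreversible"]

def risk_tier(sensitivity: str, impact: str) -> str:
    """Map the two axes to a tier. Thresholds are assumptions, not prescribed values."""
    if sensitivity == "special_category" or impact == "automated_irreversible":
        return "high"  # either axis at its maximum forces a high tier
    score = DATA_SENSITIVITY.index(sensitivity) + DECISION_IMPACT.index(impact)
    return "limited" if score >= 3 else "minimal"

assert risk_tier("public", "informational") == "minimal"
assert risk_tier("personal", "automated_reversible") == "limited"
assert risk_tier("special_category", "advisory") == "high"
```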
Evaluations are where governance meets engineering
The risk tier is the input; the evaluation plan is the output. For each model, the plan should answer the following (a sketch of the plan captured as data follows the list):
- What dataset are we evaluating against? Production-representative, versioned, refreshed on a schedule.
- Which metrics matter? For a generative system, that usually means some combination of correctness, hallucination, bias and fairness, plus any task-specific metrics (faithfulness for RAG, tool correctness for agents).
- Who judges? A human, a deterministic check, or an LLM-as-a-judge with a documented rubric. The judge belongs in the audit trail next to everything else.
- What thresholds trigger action? Numeric pass/fail levels, set before you measure. A failure mode that doesn't trigger an action is not a failure mode — it's a status update.
- How often does this run? At launch, on every material change, and on a recurring cadence (monthly is typical for medium-tier models, weekly or per-deploy for high-tier).
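Captured as data rather than prose, a plan along those lines might look like this. Every value is a placeholder; the point is only that the dataset, thresholds, judges, and on-failure action are written down before the first run.

```python
evaluation_plan = {
    "model": "support-summariser@2024-05-01",
    "dataset": {
        "name": "support-tickets-eval-v3",   # production-representative, versioned
        "refresh": "quarterly",
    },
    "metrics": {
        # thresholds are committed before the model is measured against them
        "faithfulness":       {"judge": "LLM-as-judge, rubric v2", "threshold_min": 0.90},
        "hallucination_rate": {"judge": "LLM-as-judge, rubric v2", "threshold_max": 0.02},
        "pii_leakage":        {"judge": "deterministic checks + human review", "threshold_max": 0.0},
    },
    "on_failure": "block deploy; notify owning team",  # a failure triggers an action
    "cadence": ["at launch", "on material change", "monthly"],
}
```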
The artifact regulators and auditors are asking for isn't a score. It is the trail behind the score: who tested what, against which dataset, on which model version, with which judge, against which metrics, with which result. The score on its own is a number with no audit value.
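One way to keep that trail is a record per evaluation run. The field names below are illustrative; what matters is that each run ties a result back to a model version, a dataset, a judge, and a pre-committed threshold.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class EvaluationRun:
    """One record per evaluation run: the trail behind the score."""
    model: str          # model name and version under test
    dataset: str        # dataset name and version
    metric: str         # which metric was measured
    judge: str          # human, deterministic check, or LLM-as-judge plus rubric
    threshold: float    # the pre-committed pass level
    result: float       # the measured value
    passed: bool        # result vs. threshold
    run_at: datetime    # when the run happened
    run_by: str         # who or what ran it (person or pipeline)
```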
For an in-depth discussion of evaluation methods themselves — deterministic assertions, LLM-as-a-judge, continuous evaluation on production traffic — the next post in this series goes into more detail.
How this maps to the EU AI Act and ISO 42001
The two frameworks emphasise slightly different things, but they share a spine.
The EU AI Act classifies systems by risk tier, requires technical documentation, mandates testing and quality management for high-risk systems, and demands post-market monitoring from providers, with deployers obliged to monitor operation and report serious incidents. The four artifacts above produce most of the documentation the Act asks for: the inventory and risk tier feed the system description, the evaluation plan feeds the testing and post-market monitoring sections, the human-in-the-loop policy feeds the human oversight requirement.
ISO/IEC 42001 asks for an AI management system: continuous, documented assurance that the AI in your portfolio keeps performing inside the risk thresholds you've defined. The same four artifacts produce most of the evidence: whether the inventory is complete, whether the tiers are current, whether evaluations run on cadence, and whether the human-in-the-loop policy is actually being followed.
You don't need to choose between the two. The artifacts overlap by design; the differences show up at the management-system layer (governance bodies, review cadence, continual improvement), which is the layer you build once the artifacts exist.
What to do this quarter
If your team is shipping or planning to ship anything LLM-based and these artifacts don't exist yet, the smallest useful first step is the inventory. List every model you have in production today. For each, fill in five fields: name, owner, hosting model, inputs, intended use. One hour, one document.
Then assign a risk tier and a one-paragraph justification to each row. Most of your models will be low- or medium-tier; the exercise is largely about identifying the one or two that are actually high-risk, so the proportional controls land where they matter.
From there, the evaluation plan and human-in-the-loop policy follow from the tier. The work compounds: each artifact makes the next one easier to produce, and at the end of the quarter you have something defensible without having paused the team's shipping motion.
The cost of doing this work is small. The cost of being asked for it under inspection, and not having it, is the rest of the program done in a hurry. Which version of the same work you'd rather do is, in the end, the governance question.