Assurance a board will actually accept.
A model card is not an audit, and a vendor’s promise is not evidence; what a board will accept in a system that acts on its own is a record it can test — measured, traceable, mapped to the standards its overseers cite, and owned by the institution that has to answer for it.
Boards have started asking a harder question about AI, and most trust claims cannot survive it. It is no longer “is the model good?” It is “show me what this system did on this decision, prove it stayed inside its limits, and tell me who was answerable.” A model card does not answer that. Neither does a benchmark score, a vendor’s safety white paper, or a policy binder. Those are trust claims — marketing-grade, describing the tool in the abstract. What a board, a regulator, or an examiner will accept is a different thing entirely: evidence, produced on demand, that can be independently tested. The distance between the claim and the evidence is the whole of what assurance has to close.
That distance widens the moment AI stops advising and starts acting. When a system only recommends, “it works” is a defensible standard, because a person still sits between the output and the consequence. When an agent acts — on a customer, a transaction, a record — the standard becomes “we can prove it worked, show why it acted, and defend the decision after the fact.” Autonomous action creates a question the org chart was not built to answer: when the agent is wrong, who is accountable, and how do they show exactly what happened? Assurance is the discipline of answering that before it is asked.
Real assurance is built from the bottom up, and the foundation is measurement. You cannot assure what you do not measure. In practice that means layered evaluation rather than a single accuracy figure carried over from a demo: offline regression against ground truth that gates every change, online scoring on a canary share of live traffic to catch the drift offline tests miss, and adversarial red-teaming for prompt injection, data exfiltration, and unsafe tool use — probing the system the way an attacker would. Together they convert “it seems to work” into a defensible number with a method behind it.
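One way to picture the three layers working together is as a release gate that refuses a change unless every layer has reported and met its bar. The sketch below is illustrative only: the `EvalResult` type, the layer names, and the threshold values are assumptions made for the example, not figures drawn from any framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    layer: str   # "offline_regression", "online_canary", or "red_team"
    passed: int  # cases that met the expected outcome
    total: int   # cases run

    @property
    def pass_rate(self) -> float:
        return self.passed / self.total if self.total else 0.0


# Hypothetical thresholds; an institution would set and justify its own.
THRESHOLDS = {
    "offline_regression": 0.98,  # regression suite against ground truth
    "online_canary": 0.95,       # scored sample of live traffic
    "red_team": 1.00,            # every adversarial probe must be contained
}


def release_gate(results: list[EvalResult]) -> tuple[bool, list[str]]:
    """Return (approved, reasons). Every layer must be present and meet its threshold."""
    by_layer = {r.layer: r for r in results}
    reasons = []
    for layer, minimum in THRESHOLDS.items():
        result = by_layer.get(layer)
        if result is None:
            reasons.append(f"{layer}: no result recorded")
        elif result.pass_rate < minimum:
            reasons.append(f"{layer}: {result.pass_rate:.3f} below {minimum:.2f}")
    return (not reasons, reasons)


if __name__ == "__main__":
    approved, reasons = release_gate([
        EvalResult("offline_regression", 990, 1000),
        EvalResult("online_canary", 480, 500),
        EvalResult("red_team", 39, 40),
    ])
    print(approved, reasons)
```

The design point is that a missing layer blocks the release just as a failing one does, so the number a board sees always has all three methods behind it.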
Measurement shows how a system behaves in aggregate; a board also needs to reconstruct a single decision. So every consequential action must leave an immutable, replayable trail — the inputs, the context retrieved, the model and version, the reasoning, the action taken, and the human who approved it where approval was required. This is explainability of the decision path — what the system did and why — which is what an examiner actually needs, and is distinct from interpretability of model weights, which they rarely ask for. An action that cannot be reconstructed cannot be defended, however convincing it looked at the time.
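A common way to approximate such a trail in software is an append-only log in which each entry carries a hash of the one before it, so that any later alteration is detectable. The sketch below assumes that approach; the `DecisionTrail` class and its field names are hypothetical, and a hash chain makes a log tamper-evident rather than truly immutable, so a production system would pair it with write-once storage.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash that precedes the first entry


def _digest(body: dict) -> str:
    # Canonical JSON so the same content always yields the same hash.
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class DecisionTrail:
    """Append-only, hash-chained record of consequential actions (illustrative)."""

    def __init__(self) -> None:
        self._entries: list[dict] = []

    def append(self, *, inputs, context, model, model_version,
               reasoning, action, approved_by=None) -> dict:
        prev_hash = self._entries[-1]["hash"] if self._entries else GENESIS
        body = {
            "inputs": inputs,                # what the agent was given
            "context": context,              # what it retrieved
            "model": model,
            "model_version": model_version,
            "reasoning": reasoning,
            "action": action,
            "approved_by": approved_by,      # None where no approval was required
            "prev_hash": prev_hash,
        }
        entry = {**body, "hash": _digest(body)}
        self._entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Recompute every hash; False if any entry was altered or reordered."""
        prev_hash = GENESIS
        for entry in self._entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if entry["prev_hash"] != prev_hash or _digest(body) != entry["hash"]:
                return False
            prev_hash = entry["hash"]
        return True
```

Because each entry stores the model, version, inputs, and retrieved context, the same record that proves integrity also supplies what is needed to replay the decision.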
Evidence only counts if it maps to the standard the overseer will cite. Layered evaluation, decision traceability, and enforced runtime policy (the guardrails and approval gates that limit what an agent may do) each produce a specific artifact: test histories, immutable logs, and guardrail and approval records, respectively. An evidence model lines those artifacts up against the frameworks a reviewer already works from, the NIST AI Risk Management Framework and the EU AI Act among them. Built that way, an audit becomes a query against a system of record rather than a scramble to assemble one. The honest posture matters here as well: the discipline is to produce the evidence those frameworks expect, not to claim a certification that has not been earned.
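To make "an audit becomes a query" concrete, here is a minimal sketch of an evidence model as a lookup from artifact kinds to framework references. The `Artifact` type and the specific pairings in `CONTROL_MAP` are assumptions for illustration; they are not an official crosswalk, and any real mapping would need to be checked against the framework texts by the people accountable for it.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Artifact:
    artifact_id: str
    kind: str          # "test_history", "decision_log", or "approval_record"
    produced_on: date


# Illustrative pairings only, not an authoritative or complete mapping.
CONTROL_MAP = {
    "test_history": ["NIST AI RMF: MEASURE", "EU AI Act: Art. 15"],
    "decision_log": ["NIST AI RMF: MANAGE", "EU AI Act: Art. 12"],
    "approval_record": ["NIST AI RMF: GOVERN", "EU AI Act: Art. 14"],
}


def evidence_for(reference: str, artifacts: list[Artifact]) -> list[Artifact]:
    """Artifacts mapped to a framework reference, most recent first."""
    matching = [a for a in artifacts if reference in CONTROL_MAP.get(a.kind, [])]
    return sorted(matching, key=lambda a: a.produced_on, reverse=True)


def uncovered(artifacts: list[Artifact]) -> list[str]:
    """Framework references in the map that no artifact currently supports."""
    all_refs = {ref for refs in CONTROL_MAP.values() for ref in refs}
    covered = {ref for a in artifacts for ref in CONTROL_MAP.get(a.kind, [])}
    return sorted(all_refs - covered)
```

The `uncovered` query reflects the posture described above: it reports gaps in the evidence rather than asserting compliance.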
None of this holds still. Models are updated, data shifts, dependencies change, and a point-in-time certification expires the moment the system moves beneath it. Assurance for an agentic system therefore cannot be a document filed once; it has to be an operating property, produced continuously by the same loop that runs the system — evaluation on every change, traces on every action, evidence that is always current. Point-in-time compliance is theater; continuous assurance is the real thing.
There is one further condition a board should insist on, and it is the one most often missed. Evidence that lives only inside a vendor’s platform is evidence the institution does not really control: it can be reformatted, degraded, or lost when the contract ends, and it answers to a vendor’s roadmap rather than the examiner’s question. Board-grade assurance requires that the audit trail, the evaluation history, and the evidence model belong to the institution itself. The institution owns the intellectual property; the cloud is just where it runs. A board can only accept assurance it is in a position to produce on its own authority.
So the thing a board will actually accept is not a reassurance; it is a record — measured, traceable, mapped to the standards its overseers use, kept current as the system changes, and owned by the firm that has to answer for it. That record is what turns governance from the function that blocks deployment into the one that licenses it — and it is the real line between AI a vendor calls trustworthy and AI an institution can defend.