Insight · Operations

Operating agentic AI is a different job than building it.

Building an agentic system and running it for the years after launch are not the same job — the second is a standing discipline, not a longer version of the first. Evaluation, observability, drift, escalation, cost control: the operating craft is the part the market keeps pricing as an afterthought of the build.

Building an agentic system and running it are related jobs, but they are not the same job. One has an endpoint; a system passes its tests, clears its demo, and ships. The other has none: it begins the day real traffic arrives and continues for as long as the system stays live. The market tends to collapse the two — treating operation as the tail end of the build, handled by the same engineers a little later, between the next two projects. That collapse is where a great deal of production AI quietly comes undone.

The jobs diverge because the thing being operated does not hold still. Classical software, once correct, stays correct until someone edits the code; an agentic system has no such property. The model underneath it is updated on a schedule no one on your side controls, and its behavior shifts without a line of your code changing. The data it reads moves. The tools it calls evolve and occasionally break. The mix of inputs arriving months later looks nothing like the set the launch evaluation ever saw. A system certified at launch is not certified the following month — certification, for a system like this, is a rate rather than an event, which means the skill that carries a build to launch is not the skill that keeps it standing afterward.

That afterward has a shape, and most of it the build never teaches. It is observability granular enough that any single action can be reconstructed after the fact. It is continuous evaluation — scoring live behavior against ground truth on a standing basis, not a one-time test that passed once and was retired. It is incident response, with on-call rotations, runbooks, and a pinned rollback path, because a misbehaving agent is a production incident and deserves to be handled like one. It is change management that treats every model and dependency update as a controlled deployment rather than a surprise. And it is cost governance, because the meter runs on every call whether the output was right or wrong, and margin is decided in the aggregate of those calls.

A distinct part of that craft is knowing when a machine should stop and a person should decide. Well-run agentic systems are human-supervised by design — not because autonomy is unreachable, but because judgment about where to place the human is itself an operating skill: calibrated against how often the system needs correcting, tightened where the stakes are high, relaxed only where the record has earned it. That calibration is learned in production, watching real behavior over time. It is not something a build hands over finished.

Operating also means settling, in advance, on what running well is — because running well is a number or it is an opinion. Task success against acceptance criteria. The share of runs that still need a human. Cost per successful outcome, where reliability meets margin. Tail latency under load. The lag between a regression beginning and an alarm firing. These service levels are what make an operation ownable and reportable, and a build by itself produces none of them. What it produces is a system that worked on the day it was measured.

This is why operating production AI is starting to look less like a phase of engineering and more like a profession in its own right — the way site reliability became a discipline distinct from writing software, and the controller’s function became distinct from closing the sale. Each emerged when something valuable began degrading in the gap between building a thing and running it, and someone had to own the running. Agentic AI has reached that point. The operator is not a better builder; the operator does a different job, with its own instruments, standards, and cadence.

Most organizations still staff only the first job. They fund the build, mark the launch, and assume the operation falls out of it at no cost. What actually happens is that a system no one runs to a standard decays in silence — making steadily worse decisions for weeks before the quarter’s numbers expose it, by which point the loss is the compounded bad output plus the confidence that does not come back. An agent without a named owner, a service level, and a change protocol is not in production; it is in a long, unmonitored beta that everyone has agreed to call a success.

The durable asset here is not the model — that is rented, and interchangeable by design. It is the operating discipline built around it, and that discipline is best held by a party answerable to the outcome rather than to a platform’s roadmap. You own the IP. The cloud is just where it runs. Building the system is the visible achievement. Running it to a standard, for as long as it stays live, is the quieter and harder one — and it is a job unto itself.

Engage

Begin with a Charter.

A fixed-fee diagnostic that turns these arguments into a plan for your operation — scoped, costed, and run by the people who would operate it.