BeanSprout AI Research · White Paper · Agentic AI Engineering

Engineering Agentic Systems That Hold in Production

Architecture, failure modes, and reliability discipline for multi-agent operations.

Scott Jay Ringle — Chief AI Officer, BeanSprout AI
Tejesh Priyatham Kalidindi — AI Research Scientist & Senior Agentic AI Engineer, BeanSprout AI
June 2026 · Version 1.0 · Series: The Operating Stack (1 of 6)

Executive Summary

Most enterprise AI pilots demonstrate well and never reach production. The gap is rarely the model — it is reliability. An agent that succeeds 95% of the time at each step completes a twenty-step workflow barely a third of the time, and no business runs on a third. This paper argues that production reliability is an engineering discipline, not an emergent property of a larger model, and sets out the architecture and operating practices that make agentic systems hold. The commercial logic is direct: reliability is the precondition for autonomy, autonomy is the precondition for margin, and an agent that cannot be trusted unattended is a demonstration, not an operation. We give executives a concrete bar to require before an agentic system is allowed to run any real process: a measured task-success rate against a defined workload, a typed boundary on what the agent may do, a dedicated verification plane that checks its work, and a named owner accountable for the number.

Abstract

We characterize the failure modes of large-language-model agents operating over multi-step, tool-using workflows and present a reference architecture for production reliability comprising five planes: context, reasoning & planning, a typed tool/action boundary, persistent memory, and a dedicated verification plane operating under bounded autonomy. We formalize the compounding-error problem — end-to-end task success decaying approximately as pⁿ over n sequential steps — and show why per-step reliability, not raw model capability, governs outcomes on long-horizon work. We define an operating metric set (step-level reliability, task success rate, mean steps between failures, and cost per successful outcome) and an evaluation methodology suited to continuous production assurance, and close with the organizational implications for firms operating agentic systems at scale.

Keywords: agentic AI engineering · LLM agents · multi-agent systems · production reliability · compounding error · orchestration · verification · AI assurance

1  The Production Gap

The defining shift of 2025–2026 was not a more capable model. It was the move from single-turn assistants, which answer, to agents, which act — decomposing a goal, calling tools, reading results, and proceeding over many steps with limited supervision. A chatbot that is wrong is an annoyance the user corrects in the next message. An agent that is wrong has already taken the action, and the next step builds on it.

This is why pilots mislead. A scripted demonstration exercises the happy path a handful of times; production exercises the full distribution of inputs thousands of times, unattended, where every percentage point of unreliability compounds into rework, oversight cost, and eroded trust. The binding constraint on enterprise agentic AI is therefore not capability but reliability under autonomy — and reliability is engineered, not summoned by scale [6].

2  The Anatomy of a Production Agent

A production-grade agent is not a prompt; it is a system with separable planes, each separately testable and governable. We find five are load-bearing.

2.1  The context plane

What the agent knows when it reasons. Retrieval, grounding, and the assembly of relevant, current, permissioned information into the working context. Most "hallucination" in production is a context-plane failure — the model reasoned correctly over the wrong or missing facts [1].
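The context-plane checks described above can be sketched as a filter that runs before anything reaches the model. This is a minimal illustration, not a reference implementation: the `Document` record, its field names, the scope strings, and the 30-day freshness window are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Document:
    """Hypothetical retrieved item; field names are illustrative."""
    text: str
    source: str            # provenance: where the fact came from
    fetched_at: datetime   # recency: when it was last confirmed
    scope: str             # permission scope required to read it


def assemble_context(docs, allowed_scopes, now, max_age=timedelta(days=30)):
    """Keep only documents the caller may read and that are fresh enough.

    Returns (kept, rejected) so that rejected items can be logged with a
    reason instead of being dropped silently.
    """
    kept, rejected = [], []
    for doc in docs:
        if doc.scope not in allowed_scopes:
            rejected.append((doc, "not permitted"))
        elif now - doc.fetched_at > max_age:
            rejected.append((doc, "stale"))
        else:
            kept.append(doc)
    return kept, rejected


# Example data (invented for illustration).
now = datetime(2026, 6, 1, tzinfo=timezone.utc)
docs = [
    Document("Refund window is 30 days.", "policy-wiki", now - timedelta(days=2), "support"),
    Document("Refund window is 14 days.", "old-faq", now - timedelta(days=400), "support"),
    Document("Q2 salary bands.", "hr-drive", now - timedelta(days=1), "hr"),
]
kept, rejected = assemble_context(docs, allowed_scopes={"support"}, now=now)
```

Returning the rejected items with a reason keeps the context plane testable on its own: an evaluation can assert which facts were excluded and why, independently of what the model later does with the rest.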

2.2  The reasoning & planning plane

Goal decomposition, step selection, and routing — the model deciding what to do next, often interleaving reasoning with action [2]. This plane benefits from explicit structure: smaller, verifiable sub-goals outperform a single open-ended trajectory [6].

2.3  The tool & action boundary

Where intent becomes effect. A typed, permissioned boundary — schemas, validation, and least-privilege scopes — converts a class of free-text errors into caught, structured failures before they reach a system of record [3,7].
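A typed boundary of this kind might look like the following sketch, which uses only the Python standard library. The `RefundRequest` schema, the `refunds:write` scope name, and the amount limit are hypothetical; the point is only that a malformed or unpermitted call fails as a structured error before any side effect occurs.

```python
from dataclasses import dataclass


class ToolCallRejected(Exception):
    """Structured failure raised at the boundary, before any side effect."""


@dataclass(frozen=True)
class RefundRequest:
    """Hypothetical schema for one tool; fields and limits are illustrative."""
    order_id: str
    amount_cents: int

    def __post_init__(self):
        if not isinstance(self.order_id, str) or not self.order_id.startswith("ord_"):
            raise ToolCallRejected("order_id must be a string starting with 'ord_'")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(self.amount_cents, bool) or not isinstance(self.amount_cents, int):
            raise ToolCallRejected("amount_cents must be an integer")
        if not 0 < self.amount_cents <= 50_000:
            raise ToolCallRejected("amount_cents outside permitted range")


def call_tool(raw_args: dict, granted_scopes: set) -> RefundRequest:
    """Validate a model-proposed call; return a typed request or raise."""
    if "refunds:write" not in granted_scopes:
        raise ToolCallRejected("agent lacks scope 'refunds:write'")
    unknown = set(raw_args) - {"order_id", "amount_cents"}
    if unknown:
        raise ToolCallRejected(f"unexpected fields: {sorted(unknown)}")
    try:
        return RefundRequest(**raw_args)
    except TypeError as exc:  # a required field is missing
        raise ToolCallRejected(str(exc)) from exc
```

Only a validated `RefundRequest` would be handed to the code that touches the system of record; free-text output from the model never reaches it directly.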

2.4  The memory plane

State that survives the step: episodic traces, semantic stores, and task state. Memory is what lets an agent be consistent across a long task — and, ungoverned, what lets a stale or poisoned fact propagate [5].

2.5  The verification plane

The plane most pilots omit and every production system needs: a dedicated check on the agent's work — evaluators, policy gates, and human-in-the-loop approval at defined thresholds [4]. Verification is the lever that arrests the compounding described next.
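As a rough sketch of a policy gate with a human-in-the-loop threshold, the function below routes a proposed action to approval, rejection, or escalation. The `ProposedAction` fields, the evaluator score, and both thresholds are assumptions for illustration; a real deployment would set them per process and risk class.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    APPROVE = "approve"
    ESCALATE = "escalate_to_human"
    REJECT = "reject"


@dataclass(frozen=True)
class ProposedAction:
    """Hypothetical record of what the agent wants to do next."""
    description: str
    value_usd: float        # business value at risk if the action is wrong
    evaluator_score: float  # 0..1, from an automated evaluator (assumed to exist)


def verify(action: ProposedAction,
           min_score: float = 0.8,
           human_above_usd: float = 500.0) -> Decision:
    """Reject low-scoring work; escalate high-value work; approve the rest."""
    if action.evaluator_score < min_score:
        return Decision.REJECT
    if action.value_usd > human_above_usd:
        return Decision.ESCALATE
    return Decision.APPROVE
```

The gate is deliberately deterministic code outside the model, so its thresholds can be reviewed, versioned, and audited like any other policy.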

3  The Compounding-Error Problem

Treat each step of a task as succeeding with probability p. If a task requires n sequential steps that each must be correct, and we take steps as approximately independent, end-to-end success is:

P(task) ≈ pⁿ

The consequence is unforgiving. At a genuinely strong 95% per step, a 10-step task succeeds 60% of the time and a 20-step task only 36%. Even at 99% per step, a 30-step task fails roughly one time in four. The headline capability of the model barely moves this curve; the number of unverified steps moves it decisively.
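The figures quoted above follow directly from the formula and can be checked in a few lines. The function name `task_success` is introduced here for illustration only, and the independence of steps is assumed, as in the text.

```python
def task_success(p: float, n: int) -> float:
    """End-to-end success for n sequential steps, each succeeding with
    probability p, under the independence assumption used in the text."""
    return p ** n


# Print the curve at a few workflow lengths for three per-step reliabilities.
for p in (0.99, 0.95, 0.90):
    row = ", ".join(f"n={n}: {task_success(p, n):.0%}" for n in (10, 20, 30))
    print(f"p={p:.2f} -> {row}")
```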

[Figure 1: line plot. Horizontal axis: sequential steps (n), 0 to 30. Vertical axis: end-to-end task success, 0% to 100%. Curves: p = 0.99, p = 0.95, p = 0.90.]

Figure 1. End-to-end task success as a function of workflow length, for three per-step reliabilities. Long-horizon autonomy is unreliable by default; the practical levers are reducing the number of independent steps and inserting verification that resets the chain.

Two corrections follow directly. First, shorten the unverified chain: decompose long tasks into checkpointed sub-tasks so that a failure is caught and retried locally rather than propagated. Second, insert verification so that errors stop compounding: a check that catches and repairs an error mid-chain restores the probability of a correct state toward 1 at that point, which resets the chain and flattens the curve. This is why the verification plane is not optional polish; it is the control that makes long-horizon work viable at all.
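The effect of both corrections can be estimated with a simplified model. The sketch below assumes, beyond anything stated in the paper, that the n steps split evenly into sub-tasks of k steps, that the verifier at each checkpoint detects every failure, and that retries are independent. Real verifiers miss some errors, so the result is an optimistic illustration rather than a prediction.

```python
def unverified(p: float, n: int) -> float:
    """Success of n sequential steps with no checkpoints."""
    return p ** n


def checkpointed(p: float, n: int, k: int, retries: int) -> float:
    """Success when n steps are split into n/k verified sub-tasks.

    Each sub-task of k steps is checked by an (assumed perfect) verifier
    and retried up to `retries` times if the check fails.
    """
    if n % k != 0:
        raise ValueError("k must divide n in this simplified model")
    block_once = p ** k                                      # one attempt at a sub-task
    block_with_retries = 1 - (1 - block_once) ** (retries + 1)
    return block_with_retries ** (n // k)


baseline = unverified(0.95, 20)
improved = checkpointed(0.95, 20, k=5, retries=2)
print(f"unverified: {baseline:.1%}, checkpointed: {improved:.1%}")
```

Under these assumptions, a 20-step task at p = 0.95 rises from about 36% unverified to about 95% with four verified sub-tasks and up to two retries each. The retries are not free: they add steps and cost, which is why reliability and cost per successful outcome have to be read together.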

4  Reliability by Construction

Reliability is achieved by mapping each failure mode to a specific control, then enforcing it in the architecture rather than hoping the model behaves.

| Failure mode | Engineering control |
| --- | --- |
| Compounding error over long horizons | Decomposition into checkpointed sub-tasks; verification plane that catches and repairs mid-chain |
| Malformed / unsafe actions | Typed, schema-validated tool boundary with least-privilege scopes and idempotent, reversible operations |
| Context errors (stale, missing, poisoned) | Grounded retrieval, provenance and recency checks, permissioned context assembly |
| Goal / reward misspecification | Explicit acceptance criteria per task; evaluators that score against the business outcome, not task completion |
| Non-determinism & drift | Deterministic scaffolding for logic the model needn't own; pinned versions; continuous regression evals on model updates |
| Unbounded blast radius | Bounded autonomy: human-in-the-loop gates above defined risk/value thresholds; full tracing and rollback |

A useful design heuristic underlies the table: move every responsibility the model does not need to own out of the model. Control flow, validation, permissions, and idempotency belong in deterministic code around the agent; the model is reserved for the judgment only it can supply. The most reliable agentic systems are, by design, the ones that ask the model to do the least.
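Idempotency is one responsibility that sits naturally in deterministic code around the agent. The sketch below is an assumption-laden illustration: `IdempotentExecutor`, its in-memory result store, and the `step_id` convention are hypothetical, and a production system would use a durable store. It shows how a retried step can be made safe without asking the model to remember what it has already done.

```python
import hashlib
import json


class IdempotentExecutor:
    """Runs each distinct (step, tool, args) at most once.

    Repeated calls with the same key return the stored result instead of
    repeating the side effect.
    """

    def __init__(self, effect):
        self._effect = effect   # the real side-effecting function (caller-supplied)
        self._results = {}      # in-memory for the sketch; durable in practice

    @staticmethod
    def key_for(step_id: str, tool: str, args: dict) -> str:
        """Derive a stable key; sort_keys makes argument order irrelevant."""
        payload = json.dumps({"step": step_id, "tool": tool, "args": args},
                             sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def run(self, step_id: str, tool: str, args: dict):
        key = self.key_for(step_id, tool, args)
        if key not in self._results:
            self._results[key] = self._effect(tool, args)
        return self._results[key]
```

Including a step identifier in the key is a design choice: it lets two legitimately identical actions in different steps both execute, while a retry of the same step is deduplicated.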

5  Measuring What Holds

An agentic system that is not measured is not operated — it is hoped. Four metrics form the minimum operating set:

Step-level reliability (p). The per-action success rate, measured against ground truth on a representative evaluation set. This is the lever Figure 1 magnifies.

Task success rate. End-to-end completion to the defined acceptance criteria — the number the business actually experiences.

Mean steps between failures. How far the agent runs unattended before intervention — the direct measure of how much autonomy is safe to grant.

Cost per successful outcome. Total token and tool cost divided by successful tasks — failures are not free, and this ties reliability to unit economics [8] (the subject of a companion paper in this series).
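The four metrics above can be computed from per-task run records. The sketch below assumes a hypothetical `TaskRun` log format and approximates mean steps between failures as total steps divided by the number of incorrect steps; an operator might instead count human interventions, which the record format here does not capture.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskRun:
    """Hypothetical per-task log record; field names are illustrative."""
    steps_attempted: int
    steps_correct: int
    succeeded: bool      # met the task's acceptance criteria end to end
    cost_usd: float      # total token and tool cost for the run


def operating_metrics(runs):
    """Compute the four-metric operating set from a list of TaskRun records."""
    if not runs:
        raise ValueError("no runs to measure")
    total_steps = sum(r.steps_attempted for r in runs)
    correct_steps = sum(r.steps_correct for r in runs)
    failed_steps = total_steps - correct_steps
    successes = sum(1 for r in runs if r.succeeded)
    total_cost = sum(r.cost_usd for r in runs)
    return {
        "step_reliability": correct_steps / total_steps,
        "task_success_rate": successes / len(runs),
        "mean_steps_between_failures":
            total_steps / failed_steps if failed_steps else float("inf"),
        # Failed runs still cost money, so all cost is charged to the successes.
        "cost_per_successful_outcome":
            total_cost / successes if successes else float("inf"),
    }
```

Dividing total cost, including the cost of failed runs, by successful tasks only is what makes the last metric sensitive to reliability.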

These are produced by a three-layer evaluation regime: offline regression suites that gate every change, online canary measurement on live traffic shares, and continuous re-evaluation triggered by model and dependency updates. Without this regime, a model provider's silent update can move p — in either direction — with no one the wiser until the quarter's numbers arrive.
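The offline layer of such a regime can be as simple as a gate that compares the measured pass rate on a fixed evaluation set against a recorded baseline. The function below is a minimal sketch; the name `regression_gate` and the one-point tolerance are assumptions, and a real gate would also account for sampling noise on small evaluation sets.

```python
def regression_gate(results, baseline_rate, tolerance=0.01):
    """Decide whether a change may ship.

    results: one boolean per evaluation case (True = case passed).
    Returns (allowed, measured_rate). The change is allowed only if the
    measured rate has not dropped more than `tolerance` below baseline.
    """
    if not results:
        raise ValueError("empty evaluation set")
    rate = sum(results) / len(results)
    return rate >= baseline_rate - tolerance, rate
```

The same function could be rerun on a schedule or on every model or dependency update, which is how a provider-side change in p would surface as a failed gate rather than in the quarter's numbers.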

6  Business Implications

The chain from engineering to margin is short and strict. Reliability earns trust; trust permits autonomy; autonomy removes the human cost that otherwise caps the return on every agent. An agent supervised on every action saves little; an agent trusted to run unattended within a bounded scope changes the operating cost of the function it runs. Everything upstream in this paper exists to make that trust defensible.

For the executive, the practical output is a procurement and operating bar. Before an agentic system is allowed to run a real process, require: (i) a measured task-success rate against a named workload and acceptance criteria — not a demo; (ii) a typed boundary documenting exactly what the agent can and cannot do; (iii) a dedicated verification plane with defined human-in-the-loop thresholds; and (iv) a single accountable owner of the number, with the metric set above reported on a standing cadence. A vendor or team that cannot produce these is selling capability, not an operation.

This is also where the build-versus-operate question resolves. The model is a commodity input; the reliability discipline around it is the durable asset, and it is owned by whoever is accountable for the outcome — ideally an operator inside the customer's perimeter, answerable to the customer's KPI rather than a platform's roadmap [8].

7  Limitations & Open Problems

The pⁿ model assumes step independence; in practice errors correlate — a poisoned context degrades many downstream steps at once — so the curve is a clarifying lower-bound intuition, not a precise predictor. Evaluation coverage is itself an open problem: a task-success number is only as honest as the representativeness of its eval set, and adversarial and long-tail inputs remain hard to enumerate. Non-stationarity — provider model updates changing behavior beneath a stable interface — makes one-time certification insufficient and continuous assurance mandatory. Finally, the frontier of genuinely long-horizon autonomy, where verification is itself performed by models, is active research; we treat human-defined acceptance criteria as the ground truth of record until that matures.

8  Conclusion

Production agentic AI is won or lost in the engineering between the model and the business, not in the model itself. The compounding-error curve makes the stakes quantitative: reliability must be constructed, measured, and owned. Firms that treat it as a discipline (typed boundaries, verification planes, bounded autonomy, and a metric set with a name attached) will operate agents the business can trust unattended. Firms that treat reliability as something the next model will provide will keep shipping demonstrations. AI strategy ends where the bill begins; agentic engineering is where reliability begins.

References

  1. Wei, J., et al. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv:2201.11903, 2022.
  2. Yao, S., et al. ReAct: Synergizing Reasoning and Acting in Language Models. arXiv:2210.03629, 2022.
  3. Schick, T., et al. Toolformer: Language Models Can Teach Themselves to Use Tools. arXiv:2302.04761, 2023.
  4. Shinn, N., et al. Reflexion: Language Agents with Verbal Reinforcement Learning. arXiv:2303.11366, 2023.
  5. Park, J. S., et al. Generative Agents: Interactive Simulacra of Human Behavior. arXiv:2304.03442, 2023.
  6. Anthropic. Building Effective Agents. Engineering publication, 2024.
  7. Anthropic. Model Context Protocol (MCP) Specification. 2024.
  8. BeanSprout AI. The AI Operator's Brief, Vol. I, Issues 01–03, 2026.

About the authors

Scott Jay Ringle is Chief AI Officer of BeanSprout AI and a fractional CAIO, CEO, and corporate-development executive with more than 30 years turning frontier technologies into category-defining companies. He has co-founded and led companies to NASDAQ IPOs and strategic acquisitions, including Alteon Web Systems and AirWave Wireless (now Aruba Networks, acquired by HPE), and works at the intersection of frontier AI and financial value creation, trusted by boards, venture investors, and private-equity sponsors.

Tejesh Priyatham Kalidindi is an AI Research Scientist and Senior Agentic AI Engineer at BeanSprout AI, working across the research and full-stack engineering of production agentic systems.

About BeanSprout AI

BeanSprout AI is an agentic-AI operations firm. We advise, build, operate, and assure agentic AI in production — and run it for as long as it is live. This paper reflects methods used in our own engagements. It is drawn from primary, publicly reported sources and the authors' operating experience; it does not draw on confidential or non-public information of any current or former employer of the authors.