The 3 Most Important Principles of Building AI Agents

Learn the 3 core principles of building AI agents that actually work in production, with real-world examples, code examples, and model comparisons.

Posted May 26, 2026

Your agent worked the first ten times. On the eleventh run, it called a function that doesn't exist, or looped on itself, eating tokens until you killed the process, or returned something confidently wrong that a downstream system acted on. You searched "principles of building AI agents" because the mental model you brought from regular software engineering is missing something, and you're right, it is. The principles you need aren't a curriculum to study before you start. They're the named lessons embedded in the specific ways agents fail in production, and once you map your failure to the principle you violated, the fix is usually obvious.

This article organizes them that way. Every principle is anchored to the failure it prevents, in language you'll recognize from the bug you're currently debugging.

What Are AI Agents? Core Concepts Before You Build

Before applying the principles, it helps to be precise about what an AI agent actually is and what most people build instead.

AI agents are software systems in the field of artificial intelligence and computer science that perceive inputs, make decisions, and take actions, often without requiring human intervention at every step. What separates them from a standard script or a chatbot is the loop. An agent calls external tools, receives results, analyzes data, and decides what to do next, repeating that cycle until it reaches a stopping condition or a bound is triggered. Large language models (LLMs) and natural language processing (NLP) are the two technologies that power most modern agents. LLMs handle the reasoning and generation; NLP enables the agent to interpret human language in user queries, process text-based inputs from external systems, and produce structured outputs your code can act on.

Unlike simpler AI applications that produce one output per prompt, sophisticated AI agents can accomplish tasks that require multi-step planning, dynamic decision making, and coordination across multiple tools and external systems. That capability is what makes them useful and what makes them hard to build reliably.

The Five Agent Types You Need to Know

Computer science classifies agents by how they make decisions. Understanding agent types before you write code saves you from over-engineering simple problems and under-engineering complex ones.

Agent TypeHow It DecidesBest ForLimitations
Simple reflex agentsFollows predefined rules (if X, then Y)Repetitive tasks in stable, observable environmentsNo memory, no planning, no learning capabilities
Model-based reflex agentsMaintains an internal model of its environment to reason about unobserved statesPartially observable environments; NPC behavior in gamesCannot optimize across multiple goals
Goal-based agentsEvaluates actions against desired outcomes; plans sequences of stepsMulti-step tasks with defined end statesMay find suboptimal paths
Utility-based agentsUses a utility function to weigh multiple paths and maximize expected utilityRecommendation systems, financial trading, and resource allocationRequires an accurate utility function design
Learning agentsImproves performance over time through a structured learning process; the agent learns from feedback and past interactionsLong-running systems that need to adapt: virtual assistant applicationsRequires an evaluation infrastructure to verify learning

Multi-agent systems are a separate layer: they deploy multiple AI agents, each specializing in different sub-tasks. that share the same tools or communicate across external systems to complete tasks too complex for a single agent alone. I will cover multi-agent systems specifically later in this article.

Most modern, sophisticated AI agents built on large language models are goal-based or utility-based at the system level, with learning capabilities layered on top. A practical example: a virtual assistant that handles user queries in natural language, connects to external tools for scheduling and data retrieval, and learns from past interactions to improve response quality over time. That is a learning agent with goal-based planning and NLP at its core.

How AI Agents Work Mechanically

At the foundation, how AI agents work comes down to a perception-decision-action cycle:

  1. The agent receives input. A user query, a tool result, an external data feed
  2. The LLM processes that input alongside the accumulated context (working memory)
  3. The model generates output, either a tool call, a response to the user, or an internal reasoning step
  4. Your code parses the output and executes any tool calls
  5. Tool results are appended to the context
  6. The loop repeats until a stop condition fires

The model is not "taking actions" in any direct sense. Your code is taking actions based on what the model wrote, and feeding the results back. Computer vision extends this loop beyond text; agents that incorporate image understanding can interpret screenshots, visual data, and physical environments as inputs, enabling physical actions like navigating a UI or reading receipts as part of a broader workflow.

Every failure mode covered in this article traces back to one specific step in that loop, breaking down.

Real-World AI Agent Examples

The principles in this article apply across every industry where agents are deployed. Grounding them in real-world examples first makes the technical detail that follows easier to connect to actual use cases.

Financial Trading

AI trading agents use large language models to process market data in real time, identify patterns across historical prices and news sentiment, and execute trades without requiring human intervention on routine decisions. Multi-agent systems in financial services partition the workflow: one agent handles fundamental analysis, another monitors technical indicators, a third manages risk exposure, and a fourth handles execution. These systems share the same tools but operate on different scopes of data. Documented deployments in financial services report fraud detection accuracy improvements from 87% to 96%, with average detection times dropping to under three seconds per transaction.

Web Development and Data Engineering

Coding agents handle task automation across the software development lifecycle, generating code, running tests, calling external systems to push changes, and looping back when errors are returned. In data engineering, agents automate complex tasks like pipeline monitoring, schema validation, and anomaly detection. These are workflows that previously required human resources to manage manually, often on a shift-by-shift basis. Agents operating independently on these workloads free engineering teams to focus on architecture and edge-case handling rather than routine monitoring.

Virtual Assistants and Business Processes

Consumer-facing AI applications like virtual assistants handle user queries in natural language, store and retrieve user preferences across sessions, and connect to external tools to complete tasks like booking, scheduling, and information retrieval. In business processes, multi-agent systems route support tickets, process claims, manage compliance checks, and flag exceptions for human agents, handling repetitive tasks end-to-end while escalating genuinely complex tasks to people. The result is cost efficiency: organizations deploying these systems report 5x to 10x reductions in per-task cost for high-volume, low-complexity workflows when using tiered model routing.

Computer Vision and Physical Action Systems

Agents that incorporate computer vision extend beyond text-in/text-out. They interpret images, screenshots, and real-world environments as inputs that enable them to take physical actions like navigating a UI, reading product labels, inspecting invoices, or monitoring physical equipment as part of a multi-step workflow. Google Cloud's Vision AI is one example of a computer vision layer that engineers connect to agentic pipelines for document processing and visual quality control.

What an AI Agent Actually Does

Before applying any of the three principles, confirm you have actually built an agent. The question is simple: at each step in your system, who decides what happens next: your code, or the model?

  • If your code decides: You have a workflow.
  • If the model decides: You have an agent.
  • If there are no steps (one prompt in, one response out): You have a chained prompt.

This is not a pedantic distinction. Workflows have their own bugs, but they do not loop forever and they do not invent tools to call. Agents do both of those things because the model is in charge of the flow.

Chained PromptWorkflowAgent
Who controls flowNobody (single call)Your codeThe model
When to useSingle transformation, no branchingKnown steps, possible branchingGenuinely open-ended planning required
Failure modesOutput formatting driftWrong branch on ambiguous inputHallucinated tool calls, runaway loops, context pollution, goal drift

Roughly 80% of production "agents" should have been workflows. Dynamic agents where the model genuinely directs open-ended planning across dynamic environments with unpredictable input spaces are the exception, not the default. Building AI agents focuses on cases where the input space is too varied for a static workflow, where the task requires genuine reasoning about which steps to take, or where the user expects conversational iteration with the system. If none of those three apply to your use case, generative AI can still power your system effectively through a deterministic workflow with LLM calls at specific, bounded steps.

When you do need an agent, the loop is straightforward: the model generates text → your code parses some of that text as a structured tool call → the tool runs → the result gets appended to context → the model generates again → repeat until a stop condition fires. Every failure this article addresses is a specific thing that goes wrong somewhere in that loop.

If this section is making you reconsider whether you needed an agent in the first place, the step-by-step beginner's walkthrough of building your first agentis worth reading alongside the distinction between agentic AI and AI agents.

Principle 1: Constrain How AI Agents Work at the Tool Boundary

Failure prevents: Hallucinated tool calls

Your agent calls get_user_data() when the function is actually fetch_user_record(). Or it calls the right function but passes {user_id: "unknown"} when the parameter should be an integer. Or the JSON is technically valid but missing a required field. Your first instinct is to add more detail to the system prompt. That instinct is wrong.

Why AI Agents Hallucinate Tool Calls

The structural cause is this: the model generates tool calls token by token, exactly the same way it generates any other text. Without explicit grounding in the tool schema at decode time, nothing forces the output to match what your code expects. Adding more prompt detail nudges the probability distribution and does not constrain it.

The model is not "using" external tools the way a human agent uses software systems. It is generating tokens that describe a tool call, and your code is interpreting those tokens as instructions. Without explicit enforcement, the gap between what the model wrote and what your schema expects is where hallucinations live.

Three Fixes, in Priority Order

Fix 1: Use a function-calling-native API

If you are describing tools as plaintext in the system prompt, stop. OpenAI, Anthropic, and Google all expose structured tool-use APIs where the schema is enforced by the inference stack, not by your prompt. OpenAI's strict: true mode on tool definitions eliminates most schema drift for OpenAI models. The model output is constrained to match your JSON schema at the decoder level. Anthropic's tool-use endpoint and Google Cloud's function calling work the same way. Use them.

Fix 2: Validate every tool call against the schema before executing, then retry with the error in context

When validation fails, do not throw an exception into the void. Pass the schema violation back to the model as the next turn's input. The model usually self-corrects when shown what it got wrong, because the violation is now part of the context it is conditioning on.

Code Example: Validate-and-Retry with Error Context

result = validate(tool_call, schema)

if not result.ok:

context.append({

"role": "tool_error",

"content": f"Tool call invalid: {result.error}. Schema: {schema}"

})

tool_call = model.generate(context)

# retry once, then escalate to human supervision

This pattern handles the majority of schema violations without any prompt engineering changes. The key is that the error message must include the specific schema the model violated, not just a generic "invalid input" response.

Fix 3: For open-weight models, use constrained decoding

Libraries like Outlines and Instructor modify the decode step to only sample tokens that are valid under your schema. This is slower than native tool use but reliable. It enforces schema compliance at the generation level rather than relying on post-hoc validation.

Tool Calling and Prompt Engineering: How They Work Together

Prompt engineering is the second layer of defense after schema enforcement — not the first. The system prompt should describe what each tool does and when to use it, not what the parameters look like. Parameter constraints belong in the schema.

The pattern that works in production:

  • Schema: Constrains what the model can generate (parameter types, required fields, valid values)
  • Prompt engineering: Guides when the model chooses to call each tool, and in what sequence

This separation keeps tool calling reliable even as the system prompt changes. When you mix the two, you create a system where prompt engineering changes can inadvertently break tool call validity, and schema changes are not reflected in the model's generation behavior.

The single most common cause of hallucinated tool calls in client agents is tools described in plaintext rather than registered through the provider's tool-use API. This is a one-hour refactor that eliminates the entire failure class.

Principle 2: Treat Memory as a Pipeline, Not a Storage Bucket

Failure prevents: Context pollution and forgotten state

Two opposite failures share the same root cause. Failure A: your agent forgot the user's instruction from three turns ago and did the wrong thing. Failure B: your agent's context was filled with verbose tool outputs and the model started losing focus, ignoring earlier instructions, and repeating itself. Both are memory bugs, even though they look like opposites.

The Two Opposite Memory Failures

Most engineers initially think of memory as a database, a place where the agent stores and retrieves information. That mental model produces both failures. Storing everything produces context pollution; being selective about what to store produces a forgotten state.

The correct mental model is a pipeline: memory is a sequence of processors that decide, for every event entering context, whether it stays, gets transformed, or gets dropped.

Event (user turn/tool result/model thought)

→ Filter (drop noise)

→ Transformer (summarize, compress)

→ Context Window

The question is not "do I have memory?" The question is "what does each processor in my pipeline do with each kind of event?"

The Memory Pipeline: Processor Patterns That Work

Four processor patterns appear in every well-built production agent:

ProcessorWhat It DoesWhen to Use
TokenLimiterCaps total context size; evicts oldest non-essential turns when approaching the model's windowAlways. Set this before anything else
ToolCallFilterDrops verbose intermediate tool outputs after they have been usedWhen tool results are large (database query results, API responses with many fields)
Summarization processorCompresses old turns into a running summary once they pass an age thresholdMulti-turn conversations longer than 10 exchanges
Working memory updaterMaintains a structured summary of the current task state, updated each turnAny agent with a task that spans more than three steps

The hierarchical pattern most production agents converge on has three layers:

  1. Short-term memory: Last N raw turns (exact, recent). The model needs this verbatim for immediate context.
  2. Working memory: A maintained summary of the current task state. The model uses this to track what it is doing and why.
  3. Long-term memory: Stores facts about the user, past sessions, and completed tasks, retrieved via search when relevant, not loaded into every context window.

The rule: if information needs to be exact and recent, keep it raw in short-term memory. If it needs to persist across the task but does not need to be exact, summarize it into working memory. If it needs to outlive the session, persist it and retrieve on demand.

Working Memory, Episodic Memory, and Agent Observability

Agent observability is the practice of making the agent's internal model visible at each step of the decision-making process. It is what separates an agent you can debug from one you can only guess at.

When an agent fails the first diagnostic question is: what was in working memory when it made that decision? Without instrumentation, that question has no answer. With it, the fix is usually obvious in under five minutes. The internal model the agent holds is the primary signal for whether the memory pipeline is working correctly.

A well-instrumented memory pipeline exposes:

  • The exact contents of working memory at each model call
  • Which processor handled each incoming event, and what decision did it make
  • The delta between what was entered into the context and what was retained after filtering

In Python, LangGraph's checkpointing handles working memory persistence across turns. In TypeScript, Mastra exposes these processors as named components (TokenLimiter, ToolCallFilter) with documented behavior for each event type.

User Preferences and Long-Running Tasks

For agents that handle long-running tasks or return conversations from past sessions, user preferences are a class of episodic memory that most teams underweight in their initial build.

If a user told the agent their preferred output format, their timezone, or their industry context three sessions ago, that information should be in long-term memory and retrieved at the start of every relevant task. Agents that learn from past interactions by persisting user preferences across sessions are meaningfully more useful than agents that start fresh every time. This is part of what building AI agents focuses on at the product layer: not just whether the agent completes the task, but whether it gets better at completing tasks for each specific user over time.

The learning process for an agent learns from past interactions only if you build the infrastructure to persist, retrieve, and apply what it has learned. The learning capabilities of the underlying LLM do not help here; those weights do not change at runtime. The learning happens in your memory architecture.

Principle 3: Bound the Loop and Maintain Human Oversight

Failure prevents: Runaway agents, silent goal drift, and expensive production incidents

Three failure modes, one missing principle.

  • Infinite loop: The agent calls a tool, gets a result, decides to call a tool again, and never decides it is done.
  • Retry storm: The agent retries the same failed tool eight times with the same input, accumulating cost and latency without progress.
  • Goal drift: The agent gradually shifts what it is working on as context accumulates. By step fifteen, it has forgotten what the user originally asked.

The Four Bounds Every Agent Loop Needs

Every agent loop must be bounded along four axes.

BoundStarting ValueWhy It Matters
Max steps/tool calls per run10Prevents infinite loops
Max wall-clock time per run60 secondsPrevents latency-silent hangs
Max cost per run$0.50 in early productionThe most-skipped bound; causes the largest incidents
Progress check every N stepsEvery 3 stepsCatches goal drift before it compounds

Steps and cost decouple the moment your agent starts using a long-context model for sub-tasks or chaining expensive reasoning calls. A run that hits its 10-step limit at $0.30 is fine. A run that uses 4 steps, but each step costs $2.00, is the one that triggers a 3 am alert. Bound both, independently.

The "original goal in system prompt" pattern matters for goal drift: do not rely on the model remembering the user's request from turn 1 when you are at turn 15. Restate it in the system prompt on every turn, or in a persistent header that travels with working memory.

Every bound must trigger a defined behavior. Decide in advance:

  • What does the agent return to the user when it hits max steps?
  • Who gets notified when it hits max cost?
  • Does it return a partial result, ask for clarification, or escalate to a human?

An agent that crashes at the bound is barely better than an agent with no bound at all. The user still got a broken experience.

Human Oversight, Human Supervision, and Minimizing Risk

Bounding the loop is a technical control. Human oversight is the organizational control that sits above it. Both are required for agentic AI systems operating in dynamic environments with real consequences.

Build explicit human supervision checkpoints into any agent that touches irreversible actions:

  • Financial transactions
  • Outbound communications sent to real customers
  • Writes to production systems
  • Decisions affecting human resources

The goal is not to remove human agents from every decision. The goal is to identify where human judgment is irreplaceable and reserve those decision points for people. Repetitive tasks with well-defined desired outcomes and low failure cost are the right candidates for full automation. High-stakes decisions with ambiguous inputs and irreversible consequences need a human in the loop, not a safety-oriented system prompt, but an actual approval step in the architecture.

Security measures belong in this section, too. Every production agent should include test cases that verify:

  • The agent refuses prompt injection attempts embedded in tool results or user inputs
  • The agent does not pass sensitive data to external systems, it should not have access to
  • The agent respects predefined rules around what actions it is permitted to take, even when prompted to circumvent them
  • Human intervention is logged and tracked so you can identify which classes of tasks consistently require it

Minimizing risk in agentic AI is not primarily a prompt engineering problem. It is a bounds, observability, and oversight architecture problem.

Performance Metrics for Production Agents

Performance metrics for agents differ from performance metrics for regular AI applications. A single PASS/FAIL output quality score tells you very little about whether your agent is actually improving.

Track these metrics across production traces:

MetricWhat It Tells You
Task completion rateDid the agent accomplish the task the user requested without human intervention?
Step efficiencyHow many tool calls did it take? High step counts signal goal drift or poor tool design.
Cost per completed taskCost per task outcome. This is the number that matters for business processes.
Bound trigger rateHow often is the agent hitting max steps or max cost? High rates signal architecture problems.
Human intervention rateWhat percentage of runs required a human to step in? This is your primary quality signal.
Error recovery rateWhen a tool call fails, how often does the agent recover correctly vs. spiral?

These metrics, pulled from production traces, are the primary signal for whether your agent is improving over time. An agent that completes tasks in 6 steps this month versus 9 steps last month is genuinely getting better, even if the output quality score looks the same.

How to Choose the Right Model for Building AI Agents

The model you would pick for a chatbot is often wrong for an agent. Chatbots care about response quality on a single turn. Agents care about three properties that chatbots barely test.

What Large Language Models Actually Do Inside an Agent

Large language models are the reasoning engine inside most modern AI agents. They process natural language inputs using natural language processing, maintain context across multi-turn conversations, and generate structured outputs that your orchestration layer acts on. Generative AI has expanded what is possible here considerably: models that could barely follow a three-step instruction two years ago can now coordinate across dozens of tool calls with strong instruction-following under long context.

The choice of LLM affects three things directly:

  • 1. Tool-calling reliability. How often does the model produce malformed tool calls or hallucinate function names? This is not the same as general benchmark capability. A model that scores lower on reasoning benchmarks can be a full tier ahead on tool-calling success rate — especially on schemas with many parameters or nested objects. Test this on your specific schemas. Do not trust benchmark headlines.
  • 2. Instruction-following under a long context. Agents accumulate context fast. By turn 15, your context can contain 50K tokens of user input, system prompts, tool results, and intermediate reasoning. Some models that nominally support 200K+ context windows degrade noticeably past 30K. They start ignoring instructions buried earlier in the prompt or losing track of the original goal. Google Cloud's Gemini 2.5 Pro supports a 1M-token context window that holds up credibly at scale, which makes it a meaningful option for agents that need to analyze data across large codebases or long document collections before acting.
  • 3. Cost per agent run, not cost per token. Chatbots make one model call per user turn. Agents make 5 to 20. A model that is 2x cheaper per token but takes 1.5x more steps to reach the same answer is barely cheaper at all. A model that is 1.5x more expensive per token but reliably finishes tasks in half the steps is meaningfully cheaper per task.

Model Comparison for Agentic AI Applications

ModelTool-CallingLong ContextCost ProfileBest For
Claude Sonnet 4.5 (Anthropic)Strong; native tool use with schema validationHolds up well past 100K tokensMid-tier per-token; efficient per-runDefault choice for most production agents
Claude Opus 4 (Anthropic)StrongStrongHigh per-tokenHard reasoning sub-tasks within an agent
GPT-5 / GPT-4.1 (OpenAI)Strong, strict, and true schema enforcementGenerally strongMid-tierStrong default; preferred for OpenAI ecosystem integration
o-series (OpenAI reasoning)StrongStrongExpensive and slow per callPlanning steps only; not in tight tool-use loops
Gemini 2.5 Pro (Google Cloud)Improving rapidly1M+ context; credible at scaleMid-tierCost-sensitive agents; tasks requiring very large context windows
Gemini 2.5 Flash (Google Cloud)GoodLarge contextLowest cost among credible optionsCheap sub-tasks in tiered multi-agent systems
Llama 4 (Meta, open-weight)Variable; needs constrained decodingVariableFree if self-hostedSelf-hosted agents with controlled schemas and data residency requirements

Verify current pricing at the provider documentation before committing, as the model rates change frequently.

Read: Agentic AI vs. AI Agents: Differences & What You Need to Know

Cost Efficiency Through Model Tiering

The biggest cost efficiency lever in production agentic AI systems is model tiering: route decision making for classification, routing, and output formatting to a small fast model; reserve frontier models for the hard reasoning steps that actually require sophisticated AI agents' capabilities.

A practical tiering pattern for multi-agent systems:

  • Tier 1 (routing and classification): Gemini Flash, Claude Haiku, GPT-4.1 mini: sub-penny per 1K tokens
  • Tier 2 (standard reasoning and tool use): Claude Sonnet, GPT-4.1, Gemini Pro: mid-tier
  • Tier 3 (hard reasoning, planning): Claude Opus, GPT-5, o-series: expensive per call; use sparingly

Organizations using two-tier routing in their agentic AI systems typically report 5x to 10x cost reduction per task with no measurable quality loss on business processes that mix routine and complex tasks. The math: if 70% of your agent's tool calls are routing or formatting decisions, moving those to a Tier 1 model cuts the cost of those calls by roughly 90%. The hard reasoning calls still run on the frontier model, but they are now a smaller share of total cost.

Multi-Agent Systems: When to Deploy Multiple AI Agents

A single AI agent perceives inputs, reasons, and acts within one decision loop. Multi-agent systems deploy multiple AI agents that communicate and coordinate to complete tasks too complex for any one agent to handle reliably.

When Multiple AI Agents Outperform a Single Agent

There are four clear situations where the added complexity of multi-agent systems pays off:

1. Parallel processing across multiple data streams. One agent can analyze data from one data source while another agent monitors a second in parallel. For a financial trading system, this means fundamental analysis, technical analysis, and risk monitoring run simultaneously rather than sequentially, cutting total decision time without cutting quality.

2. Different sub-tasks require different model tiers or different external tools. A routing agent on a cheap model can delegate to other agents running on appropriate model tiers for each sub-task. The orchestrator does not need to know how each specialist agent accomplishes its task, only when to hand off and what to pass along.

3. Human oversight at different stages of a complex workflow. A supervisor agent can monitor other AI agents in the system and flag exceptions for human review at specific checkpoints, without blocking the main workflow on every step. This scales human supervision across high-volume agentic pipelines.

4. Tasks that exceed a single context window. Some complex tasks, full codebase review, large document analysis, and multi-source research, exceed what a single agent can hold in working memory reliably. Multi-agent systems split the task across agents that each operate within their context limits and report results to an orchestrator.

The Added Complexity of Other Agents Communicating

Every failure mode from the three principles above applies to each agent in a multi-agent system and then multiplies. A hallucinated tool call from one agent becomes the input to another agent, compounding the error downstream. A memory failure in a sub-agent produces an incorrect state that the orchestrator acts on as if it were correct.

The practical rule: bound, observe, and evaluate each agent individually before connecting them. A multi-agent system built from reliable individual agents has a chance of working reliably. A multi-agent system built from unreliable individual agents fails in ways that are exponentially harder to debug.

Other agents in the system should not be treated as reliable external systems. They should be treated as tools that can fail with the same validate-and-retry patterns you apply to API calls and database queries.

Multi-agent systems are not the starting point. They are the architecture you graduate to after a single reliable agent is running in production.

Making the Agent's Reasoning Visible

Failure prevents: Silent regression on prompt and model changes

You changed the system prompt to fix one failure mode. Three days later, a customer reported a different failure you

did not know you had introduced. You diffed the prompt, could not find the cause, and started debugging by reading outputs. This is what life looks like without evals.

Why Traditional Testing Does Not Work for AI Agents

Traditional unit tests do not work for agents. Outputs are non-deterministic. Strict assertion tests flake constantly; loose assertion tests miss real regressions. Evals are the regression-detection mechanism that replaces unit tests for systems whose outputs are distributions, not fixed values.

A Minimum-Viable Eval Setup

A minimum-viable eval setup has four components:

1. A fixed test set. Start with 20 to 50 real production traces like actual user inputs paired with the trajectory and outcome you would want. Pull these from your traces (which you should already have; see the tracing section below). Include at least five adversarial cases designed to trigger prompt injection attempts, test security measures, or push the agent toward goal drift.

2. A scoring function. Mix two types:

  • Deterministic checks: Did the agent call the right tool? Did the output validate against the expected schema? Did the JSON parse?
  • LLM-as-judge: Did the response satisfy the user's intent? Did the agent reach the desired outcome by a reasonable path?

3. A baseline score on the current version. Before any changes, run your eval set and record the score. This is the number you compare against.

4. A pre-change run. Re-run the eval before every prompt change, model swap, or architecture change. Compared to baseline. Block the change if the score regresses on dimensions you care about.

Every agent needs both trajectory evals (did the agent take a sensible path, right tools, right order, no unnecessary detours?) and outcome evals (did the final answer satisfy the user, regardless of path?). Trajectory evals catch problems before they manifest in outcomes. Outcome evals catch problems that trajectory evals miss.

Tooling options: Braintrust, LangSmith, Promptfoo, Inspect, OpenAI Evals. The discipline matters more than the tool choice.

Tracing: The Foundation of Agent Observability

Every agent run should produce a structured, persisted trace, but a queryable record of every event in the loop.

Bare-minimum trace fields:

run_id, timestamp, user_input, final_output

For every model call:

- model name

- full prompt

- full output

- input tokens, output tokens

- latency

For every tool call:

- tool name

- arguments

- result

- latency

- error (if any)

Aggregates:

- total tokens

- total cost

- total wall-clock time

- number of steps

- which bounds fired

You cannot debug a non-deterministic system by re-running it. You can only debug it by reading what it did. Traces are the read.

How Traces Become Performance Metrics

Prompt engineering decisions that look correct in isolation often produce subtle regressions when combined with new tool definitions or a different model version. Tracing the full decision-making process for each run is the only reliable way to catch this class of bug before it reaches users.

Production traces are also the raw material for performance metrics. Aggregate across runs: average steps to task completion, bound trigger rate, cost per task, human intervention rate, and eval score over time. These metrics make the learning process for your agent empirical rather than intuitive. You can see whether a prompt engineering change improved performance across the full eval set or only on the specific input you were debugging when you made the change.

Engineers who instrument tracing on day one ship faster than engineers who add it after their first incident. Their evals, debugging, and iteration loop all run on the same substrate. Engineers who add tracing reactively spend a week wiring it in while their incident is still open.

Multi-Agent Systems: A Dedicated Section on Multiple AI Agents

Building AI agents at scale often means thinking beyond a single agent loop. Multi-agent systems introduce coordination patterns that change how you apply the three principles, and they introduce failure modes that do not exist in single-agent architectures.

Architecture Patterns for Multiple AI Agents

The three most common patterns in production multi-agent deployments:

1. Orchestrator-Subagent. A central orchestrator agent receives the user's task, breaks it into sub-tasks, delegates each to a specialized subagent, and assembles the results. The orchestrator uses the same tools to coordinate as the subagents use to execute. This is the most common pattern for complex tasks that span multiple domains (research + writing + formatting, or analysis + risk assessment + execution in financial trading).

2. Peer-to-Peer Agent Network. Multiple AI agents with overlapping capabilities communicate directly. Each agent can hand off to other agents when it encounters a task outside its specialization. This pattern scales well but is harder to debug.

3. Supervisor-Worker A supervisor agent monitors other AI agents in the system, checks their outputs against quality criteria, and either accepts, rejects, or escalates each result. Human oversight connects at the supervisor level. The supervisor flags cases that exceed its confidence threshold for human review, rather than sending every result to a human agent.

Applying the Three Principles to Multi-Agent Systems

PrincipleSingle AgentMulti-Agent Addition
Constrain at the tool boundaryValidate tool calls for one agentValidate inter-agent message schemas the same way you validate tool call schemas
Treat memory as a pipelineOne working memory per agentDecide what state the orchestrator shares with subagents and what stays local; shared state is a coordination failure point
Bound the loopFour bounds per loopFour bounds per agent, plus a global bound on the entire multi-agent run, a subagent that hits its step limit should not silently kill the orchestrator's task

When You Should Not Build an Agent

Half the engineers who describe an "agent project" in a first coaching call are describing a problem that should be a workflow. Some are describing a problem that should not use AI at all.

Do not build an agent when:

  • The task has a fixed sequence with no real branching. If you can write the steps on a whiteboard before the agent runs, write them in code. Use a chained prompt or a script with one or two LLM calls. Model-directed flow is the wrong tool when there is no decision for the model to make.
  • The failure cost is irreversible and you cannot add a human-in-the-loop step. Financial transactions, irreversible writes to production systems, and communications sent to real customers. The right architecture here is a workflow with explicit human approval steps before any irreversible action. Security measures in a system prompt do not substitute for a human supervision checkpoint in the architecture.
  • The task is high-volume and latency-sensitive. Search, autocomplete, real-time recommendations. Agent loops add latency (multiple model calls per response) and cost (many tokens per task) that simpler AI applications avoid. If you need sub-500ms response times at high request volume, an agent is almost never the answer.

The Cost Shape, Made Concrete

A typical agent run makes 5 to 20 model calls. At frontier-model rates of roughly $5 to $15 per million input tokens and $15 to $75 per million output tokens, with 5K to 50K input tokens and 1K to 5K output tokens per run, you are looking at $0.05 to $1.00 per agent run before optimization. At 100K runs per month, that is $5,000 to $100,000. The spread is real, and where you land depends entirely on optimization choices.

Two cost levers that move the needle most:

Prompt caching: Anthropic offers up to ~90% input cost reduction on cached context. OpenAI applies an automatic 50% discount on cached input tokens. If your agent has a stable system prompt and tool definitions (it should), most of your input cost is cacheable.

Model tiering: Route simple decisions to a small, fast model; reserve frontier models for hard reasoning. A 5x to 10x cost reduction is achievable on most business processes without measurable quality loss.

Agents are the right answer when the task genuinely requires open-ended planning, the input space is too varied for a static workflow, or the user expects conversational iteration with the system. If you don't have one of those three, you probably want simpler AI automation patterns that don't require an agent.

What to Do Next, Based on What Just Failed

Map your current failure to the next action:

Failure First Move
Hallucinated tool callsSwitch to your provider's function-calling-native API today. Add validate-and-retry-with-error-context as the second layer. For open-weight models, install Outlines or Instructor.
Memory issues (forgotten state or context pollution)Instrument tracing first, so you can see what is actually in context when the failure happens. Then add the processor pipeline: TokenLimiter, ToolCallFilter, and summarization for old turns.
Runaway loops or cost spikesAdd the four bounds before your next deploy: max steps, max wall-clock, max cost per run, progress check. Define what the agent returns when each bound fires.
Silent regressions on prompt or model changesBuild a 20-input eval set this week, pulled from real traces. Run it before every change. Pick a tool (Braintrust, LangSmith, Promptfoo) and commit.
Multi-agent coordination failuresApply the three principles to each agent individually before debugging inter-agent communication. Treat messages between agents as tool calls, validate schemas, bound retries, and trace every hop.

Resources worth queuing: Anthropic’s Building Effective Agents, OpenAI's function calling guide, LangGraph quickstart, and Langfuse for self-hosted observability setup.

If you want to go deeper than architecture principles, the Leland AI Builder Program offers hands-on guidance focused on shipping production-ready AI systems. And if you want faster feedback before deployment, Leland’s AI engineering coaches can review your architecture, identify reliability risks early, and help you catch the design mistakes that usually only surface in production. For a lower-commitment starting point, free live AI strategy events connect you with practitioners actively building and scaling real-world agent workflows.

Top Coaches

Read these next:


FAQs

What's the difference between an AI agent and an AI workflow?

  • In a workflow, your code decides each step. In an agent, the model decides which tool to call and when to stop. About 80% of systems labeled "agents" should actually be workflows. Use model-directed flow only when the task genuinely requires open-ended planning.

Why does my AI agent keep hallucinating function calls?

  • The model is generating tool calls without grounding in the actual schema. Use a native function-calling API (OpenAI strict: true, Anthropic tool use, or Gemini function calling), validate every call against the schema before executing it, and feed errors back so the model can self-correct.

How do I stop my AI agent from looping forever?

  • Bound the loop across four axes: max steps (start at 10), max wall-clock time (60 sec), max cost per run ($0.50), and a progress check every N steps to confirm the agent is still on track. Each bound must return a defined result.

Should I use LangChain, LangGraph, CrewAI, or AutoGen?

  • Choose by language first, then complexity. In Python, LangGraph suits complex agents, CrewAI works well for role-based multi-agent setups, and AutoGen is Microsoft's conversational multi-agent option. In TypeScript, Mastra is the strongest native choice — and if you have fewer than three tools, no framework at all is often the right call.

How much does it cost to run an AI agent in production?

  • Expect $0.05–$1.00 per run before optimization. Cut costs with prompt caching (up to 90% off with Anthropic, 50% with OpenAI), model tiering for simple sub-tasks, and a hard cost ceiling per run.

How do I know if my AI agent is actually working?

  • Build evals: 20-50 real production traces, a scoring function (deterministic checks + LLM-as-judge), and re-run before every change. Traditional unit tests don't work, and outputs are non-deterministic.

How do AI agents work, mechanically?

  • The model generates text → your code parses it as a tool call → the tool runs → the result is added to context → repeat until a stop condition fires. The model doesn't take actions. Your code does, based on what the model wrote.

When should I NOT build an AI agent?

  • When the task has fixed steps (use a script), when wrong actions are irreversible (use a workflow with approval), or when you need low latency at high volume (agents add overhead). Build an agent only when the task genuinely requires open-ended planning.

Find your coach today.

Browse Related Articles

 
Sign in
Free events
Bootcamps