Testing and Evaluating AI Agents in Production
AI agent evaluation is the systematic practice of measuring whether autonomous AI systems produce correct, safe, and useful outcomes in real-world operating conditions. Unlike traditional software testing — which verifies deterministic outputs against fixed assertions — agent evaluation must handle probabilistic outputs, multi-step reasoning chains, tool-use sequences, and emergent behaviors that only manifest under production workloads. Effective evaluation combines pre-deployment benchmarks, real-time monitoring, and continuous human feedback loops to maintain agent reliability over time.
The stakes are high and rising. Gartner projects that by 2028, 40% of enterprise AI failures will trace to inadequate evaluation and monitoring of agent systems rather than model capability gaps. [Source: Gartner, “AI Risk Management Predictions,” 2026] The gap exists because most teams apply traditional software testing practices to fundamentally non-deterministic systems. A function that returns different outputs for the same input is buggy in traditional software — it is normal behavior for an LLM-powered agent. This distinction demands an entirely different evaluation methodology, one that most engineering organizations have not yet built.
At The Thinking Company, we operate multi-agent systems in production across content generation, research workflows, and client-facing AI product builds. This article distills what we have learned about evaluating agents that run continuously, not just agents that pass a benchmark once.
Why Traditional Testing Fails for AI Agents
Traditional software testing rests on three assumptions that do not hold for AI agents:
Assumption 1: Deterministic outputs. Given the same input, the function produces the same output. AI agents produce variable outputs by design — temperature settings, model updates, and context variations all introduce controlled randomness. Running the same prompt through the same agent 10 times may yield 10 different (but hopefully all acceptable) outputs.
Assumption 2: Binary correctness. An output is either right or wrong. Agent outputs exist on a quality spectrum. A research summary might be 90% accurate but miss one relevant source. A code generation agent might produce working code that is functionally correct but architecturally poor. Evaluation must be graded, not binary.
Assumption 3: Static behavior. The system behaves the same way today as it did yesterday. Agent behavior shifts when foundation models are updated, when tool APIs change, or when the distribution of user inputs drifts. OpenAI’s GPT-4 showed measurable behavior changes across versions — tasks that performed at 97% accuracy in March 2023 dropped to 87% by June 2023 on the same benchmark. [Source: arXiv, “How Is ChatGPT’s Behavior Changing over Time?” Chen et al., 2023] Production agents require continuous evaluation, not point-in-time certification.
These three failures compound in multi-agent systems, where coordination between agents introduces additional variance. An evaluation framework that handles single-agent variability but ignores multi-agent interaction effects will miss the most dangerous failure modes.
The Agent Evaluation Stack: Four Layers
Robust agent evaluation operates across four layers, each catching different categories of failure.
Layer 1: Capability Benchmarks (Pre-Deployment)
Before an agent reaches production, it must pass capability benchmarks that verify it can perform its core tasks at acceptable quality levels. These benchmarks are agent-specific — a research agent has different benchmarks than a coding agent.
Benchmark design principles:
- Representative: Test inputs must reflect the actual distribution of production inputs, not a curated set of easy cases.
- Diverse: Include edge cases, adversarial inputs, and multilingual content if relevant.
- Versioned: Benchmarks evolve as agent capabilities expand and user expectations shift.
- Human-validated: Reference outputs must be reviewed by domain experts, not generated by another LLM.
For coding agents, SWE-bench provides a widely used capability benchmark — agents must resolve real GitHub issues from popular open-source projects. Current state-of-the-art agents resolve approximately 49% of SWE-bench Verified issues, up from 12% in early 2024. [Source: SWE-bench leaderboard, March 2026] For business-process agents, no equivalent public benchmark exists, so organizations must build custom benchmarks from historical task data.
Benchmark size: We maintain 50–100 benchmark scenarios per agent, stratified by difficulty (easy/medium/hard at roughly 30/50/20 distribution). Each scenario includes the input, expected output characteristics (not exact text), and evaluation criteria with weights. Running the full benchmark takes 15–30 minutes per agent and costs USD 5–20 in API calls.
Layer 2: Integration Testing (Pre-Deployment)
Individual agents may pass their benchmarks but fail when connected to other agents or real-world tools. Integration testing verifies that agents work correctly within the broader system.
Critical integration tests:
| Test Category | What It Validates | Example |
|---|---|---|
| Handoff fidelity | Agent A’s output matches Agent B’s expected input schema | Research agent output parses correctly as analysis agent input |
| Tool reliability | Agent correctly invokes tools and handles tool errors | Agent retries failed API calls, handles rate limits, processes timeouts |
| Context propagation | Relevant context reaches all agents that need it | Business requirements from intake propagate to final delivery |
| Error recovery | System handles individual agent failures gracefully | Pipeline continues or retries when one agent returns malformed output |
| Boundary conditions | System handles extreme inputs | Very long documents, empty inputs, adversarial prompts |
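Handoff fidelity in particular lends itself to a lightweight automated check. A minimal sketch, assuming the downstream agent publishes the field types it expects (the schema below is hypothetical):

```python
def validate_handoff(output: dict, required_fields: dict[str, type]) -> list[str]:
    """Return a list of schema violations; an empty list means the handoff is valid."""
    errors = []
    for name, expected_type in required_fields.items():
        if name not in output:
            errors.append(f"missing field: {name}")
        elif not isinstance(output[name], expected_type):
            errors.append(f"wrong type for {name}: got {type(output[name]).__name__}")
    return errors

# Hypothetical schema the analysis agent expects from the research agent.
ANALYSIS_INPUT_SCHEMA = {"summary": str, "sources": list, "confidence": float}
```

Running this check at every boundary turns silent handoff corruption into an explicit, loggable failure.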
Microsoft Research found that 61% of multi-agent system failures in enterprise deployments originate at agent boundaries — the handoff points between agents — rather than within individual agents. [Source: Microsoft Research, 2025] This makes integration testing disproportionately important.
Test execution: We run integration tests nightly against the full agent pipeline using a dedicated test environment. Tests use production model versions but test data. Any integration test failure blocks deployment of affected agents.
Layer 3: Production Monitoring (Post-Deployment)
Once agents are live, monitoring must detect quality degradation, behavioral drift, and novel failure modes that benchmarks did not anticipate.
Metric categories for production monitoring:
Task completion metrics:
- End-to-end completion rate (target: >90% for well-defined workflows)
- Partial completion rate (agent made progress but did not finish)
- Failure rate by failure type (model error, tool error, timeout, quality rejection)
- Human intervention rate (how often humans must override or correct agent output)
Quality metrics:
- Output quality scores (automated scoring against rubrics)
- Factual accuracy rate (sampled and human-verified)
- Consistency score (similar inputs producing outputs within expected variation)
- Regression detection (quality drops on previously-handled task types)
Operational metrics:
- Latency per agent and per workflow (p50, p95, p99)
- Token usage per agent and per workflow
- Cost per task completion
- Error rate by agent, by tool, by time period
Safety metrics:
- Hallucination rate (claims not supported by provided sources)
- Out-of-scope action rate (agent attempting actions outside its designated authority)
- PII exposure incidents (agent including personal data in outputs where it should not)
- Governance violation rate (actions exceeding agent’s authorization level)
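The task-completion metrics above can be computed directly from invocation logs. A sketch, assuming each log entry carries a status, an optional failure type, and an intervention flag (field names are illustrative):

```python
from collections import Counter

def summarize_invocations(logs: list[dict]) -> dict:
    """Aggregate per-invocation logs into task-completion metrics.

    Each entry is assumed to have 'status' ("completed" | "partial" | "failed"),
    'failure_type' (None unless failed), and 'human_intervened' (bool).
    """
    n = len(logs)
    completed = sum(e["status"] == "completed" for e in logs)
    partial = sum(e["status"] == "partial" for e in logs)
    failures = Counter(e["failure_type"] for e in logs if e["status"] == "failed")
    intervened = sum(e["human_intervened"] for e in logs)
    return {
        "completion_rate": completed / n,
        "partial_rate": partial / n,
        "failure_by_type": dict(failures),
        "intervention_rate": intervened / n,
    }
```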
Anthropic’s operational data shows that the most predictive leading indicator of major agent failure is a gradual increase in human intervention rate. A human override rate climbing from 5% to 12% over two weeks typically precedes a system-level quality incident within the following week. [Source: Anthropic, “Observations on Enterprise Agent Operations,” 2025]
Layer 4: Human Feedback Loops (Continuous)
Automated metrics catch measurable degradation but miss subtle quality issues — outputs that are technically correct but unhelpful, tone-inappropriate, or strategically misaligned. Human feedback closes this gap.
Feedback mechanisms:
Inline rating: Users rate each agent output on a 1–5 scale. Simple to implement but suffers from low response rates (typically 5–15%) and selection bias (users rate extreme outputs more than adequate ones).
Sampled review: Quality analysts review a random sample (5–10%) of agent outputs against detailed rubrics. More reliable than inline ratings but resource-intensive. We use this as our primary quality signal for high-stakes workflows.
Comparative evaluation: Periodically generate outputs from the current agent version and a candidate version for the same inputs. Human evaluators pick the better output without knowing which is which (blind A/B testing). This detects improvements and regressions that absolute scoring misses.
Escalation analysis: Track the patterns in cases where humans override agent outputs. Cluster the overrides by cause (factual error, wrong tone, missing context, poor formatting) to identify systematic weaknesses. This is the highest-signal feedback channel because it represents real failures that affected real users.
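The comparative evaluation mechanism hinges on one detail: the evaluator must not know which version produced which output. A minimal sketch of the blind pairing, where the agent and judge arguments are assumed callables rather than any particular API:

```python
import random

def blind_pairwise_trials(inputs, current_agent, candidate_agent, judge, seed=0):
    """Blind A/B comparison: present both outputs in random order so the judge
    cannot tell which version produced which. Returns the candidate's win rate.

    judge(a, b) is assumed to return 0 if it prefers the first output, 1 if the second.
    """
    rng = random.Random(seed)
    candidate_wins = 0
    for x in inputs:
        outputs = [("current", current_agent(x)), ("candidate", candidate_agent(x))]
        rng.shuffle(outputs)  # hide which system is which
        preferred = outputs[judge(outputs[0][1], outputs[1][1])][0]
        candidate_wins += preferred == "candidate"
    return candidate_wins / len(inputs)
```

The same harness works whether `judge` is a human review queue or an LLM-as-judge call; only the preference function changes.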
Building an Evaluation Framework: Step by Step
Step 1: Define Agent Success Criteria
Before building evaluation infrastructure, define what “good” means for each agent. This requires collaboration between technical teams (who understand agent capabilities) and business stakeholders (who understand outcome requirements).
Success criteria template:
Agent: [Name]
Primary objective: [What the agent should accomplish]
Quality dimensions:
- Accuracy: [How correct must outputs be? 95%? 99%?]
- Completeness: [What must every output include?]
- Timeliness: [Maximum acceptable latency]
- Safety: [What must the agent never do?]
Minimum viable quality: [Threshold below which output is rejected]
Target quality: [Quality level that represents full success]
Step 2: Build the Benchmark Suite
Construct benchmarks from three sources:
- Historical task data: If the agent is automating an existing human workflow, collect 50–100 examples of human-performed tasks with their inputs and outputs. These become reference scenarios.
- Edge cases: Identify the inputs most likely to cause failures — ambiguous requests, inputs requiring multi-step reasoning, inputs involving external knowledge, adversarial inputs. Create 20–30 edge case scenarios.
- Synthetic generation: Use a capable LLM to generate additional test scenarios that fill gaps in the historical data. Always human-validate synthetic scenarios before including them in the benchmark.
Scoring methodology: Each benchmark scenario needs a scoring function. For tasks with objectively correct answers (data extraction, classification), automated scoring works. For tasks with subjective quality (writing, analysis, recommendations), use LLM-as-judge scoring calibrated against human evaluations.
LLM-as-judge scoring has been shown to correlate 0.85–0.92 with expert human judgments when the judge uses detailed rubrics with scoring examples. [Source: arXiv, “Judging LLM-as-a-Judge,” Zheng et al., 2024] Without rubrics, correlation drops to 0.60–0.75. The rubric is more important than the judge model.
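Since the rubric matters more than the judge model, the scoring code can stay thin. A sketch of rubric-prompted judging, where `call_llm` is an assumed callable (prompt string in, raw text out) rather than any specific provider SDK:

```python
RUBRIC = """Score the output 1-5 on each dimension, using these anchors:
- accuracy: 5 = all claims supported by sources ... 1 = major factual errors
- completeness: 5 = covers every required element ... 1 = misses most elements
Respond with one line per dimension, formatted '<dimension>: <score>'."""

def judge_output(task: str, output: str, call_llm) -> dict[str, int]:
    """Rubric-based LLM-as-judge sketch: build the prompt, parse per-dimension scores."""
    prompt = f"{RUBRIC}\n\nTask:\n{task}\n\nOutput to score:\n{output}"
    scores = {}
    for line in call_llm(prompt).splitlines():
        dim, sep, value = line.partition(":")
        if sep and value.strip().isdigit():  # ignore lines that are not scores
            scores[dim.strip().lower()] = int(value.strip())
    return scores
```

In practice the rubric would carry full anchor examples per score level, as the cited study recommends; the two-dimension rubric above is deliberately abbreviated.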
Step 3: Implement Production Monitoring
Deploy monitoring infrastructure alongside the agent, not after. Required components:
Logging pipeline: Every agent invocation logs input, output, tool calls, latency, token usage, and any errors. Store logs in a queryable format (structured JSON to a data warehouse). Retention: minimum 90 days for debugging, 1 year for trend analysis.
Quality scoring pipeline: Run automated quality scoring on a sample of production outputs (10–100% depending on volume and criticality). Scores feed into dashboards and alerting.
Alerting rules:
- Completion rate drops below threshold (critical alert)
- Quality score drops below threshold (warning at -5%, critical at -10%)
- Human intervention rate exceeds threshold (warning)
- Latency exceeds SLA (warning at p95, critical at p99)
- Cost per task exceeds budget (warning)
Dashboard: Real-time visibility into all metric categories across all agents and workflows. We use Grafana with custom panels for agent-specific metrics, though any dashboarding tool works. The key requirement is that both engineering and business stakeholders can access the dashboard.
Step 4: Establish Feedback Loops
Set up at least two feedback channels: one automated (sampled quality scoring) and one human (sampled review or inline rating). Feed insights back into agent improvement through a structured process:
- Weekly review: Analyze quality metrics and human feedback from the past week. Identify the top 3 failure patterns.
- Root cause analysis: For each failure pattern, determine whether the cause is prompt-related, model-related, tool-related, or data-related.
- Improvement implementation: Modify agent instructions, tool configurations, or evaluation criteria to address identified failures.
- Regression testing: Run the full benchmark suite after any agent change to ensure improvements do not degrade other capabilities.
Evaluating Multi-Agent Systems: Special Considerations
Multi-agent systems require evaluation at the agent level AND the system level. System-level quality is not simply the average of individual agent quality — it depends on coordination, handoff fidelity, and emergent behavior.
End-to-End Evaluation
Test the full agent pipeline from initial input to final output. Compare against reference outputs produced by humans or by a known-good system version. End-to-end evaluation catches coordination failures that individual agent testing misses.
Execution approach: Maintain a suite of 30–50 end-to-end scenarios that exercise different workflow paths. Run weekly. Score against multiple dimensions (accuracy, completeness, coherence, formatting). Track scores over time to detect drift.
Ablation Testing
Remove one agent from the system and measure the quality impact. This reveals which agents contribute most to overall quality and which are redundant. If removing an agent does not measurably degrade output quality, that agent may be unnecessary overhead.
We conduct ablation testing quarterly. In one case, ablation testing revealed that a dedicated “tone adjustment” agent in our content pipeline added zero measurable quality improvement — its function was already adequately handled by the writing agent’s instructions. Removing it cut pipeline latency by 18% and cost by 15%.
Interaction Pattern Analysis
Log the full sequence of inter-agent communications for each workflow execution. Analyze patterns to identify:
- Unnecessary loops: Agents repeatedly passing work back and forth without converging.
- Bottleneck agents: One agent that consistently takes the longest, dominating end-to-end latency.
- Silent failures: Agents that accept low-quality inputs without flagging them, propagating errors downstream.
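Two of these patterns, ping-pong loops and bottleneck agents, fall out of a simple pass over a workflow's handoff trace. A sketch, assuming each handoff record carries a source agent, a destination agent, and the destination's processing time (field names are illustrative):

```python
from collections import Counter

def analyze_trace(handoffs: list[dict]) -> dict:
    """Flag ping-pong loops and the latency-dominating agent in one workflow trace.

    Each record is assumed to have 'src', 'dst', and 'duration_s' (time spent
    by the destination agent after receiving the handoff).
    """
    # A pair of agents exchanging work 3+ times suggests a non-converging loop.
    pair_counts = Counter(frozenset((h["src"], h["dst"])) for h in handoffs)
    loops = [tuple(sorted(pair)) for pair, n in pair_counts.items() if n >= 3]

    time_by_agent = Counter()
    for h in handoffs:
        time_by_agent[h["dst"]] += h["duration_s"]
    bottleneck = max(time_by_agent, key=time_by_agent.get) if time_by_agent else None
    return {"ping_pong_pairs": loops, "bottleneck_agent": bottleneck}
```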
A16z’s analysis of multi-agent startups found that teams who instrument inter-agent communication discover and resolve performance issues 3x faster than teams who only monitor individual agents. [Source: a16z, “Debugging Multi-Agent Systems,” 2026]
The Model Update Problem
Foundation model providers update their models without notice, and these updates can change agent behavior in unexpected ways. Anthropic, OpenAI, and Google all publish model version identifiers, but the behavioral differences between versions are not documented in detail.
Impact data: A study by Scale AI found that 34% of enterprises experienced unexpected agent behavior changes following a model update in 2025, with 12% experiencing production incidents severe enough to require human intervention. [Source: Scale AI, “State of AI Evaluation,” 2025]
Mitigation strategies:
Pin model versions: Use specific model version identifiers in production rather than aliases like “latest.” This prevents unexpected updates. Upgrade deliberately after running the full benchmark suite against the new version.
Continuous benchmarking: Run capability benchmarks daily, even when no changes have been made. Sudden score changes indicate an external factor — model update, tool API change, or data drift.
Shadow testing: Before promoting a new model version to production, run it in shadow mode — processing the same inputs as the production system and comparing outputs. Promote only when shadow performance meets or exceeds production baseline.
Rollback readiness: Maintain the ability to revert to the previous model version within minutes. This requires version-pinned configurations and automated deployment infrastructure. The MLOps practices that enterprises built for traditional ML models apply directly.
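The shadow-testing promotion decision can be made explicit as a gate over benchmark scores. A sketch of the "meets or exceeds baseline" rule, comparing mean scores from the production version and the shadow candidate:

```python
def shadow_gate(prod_scores: list[float], shadow_scores: list[float],
                min_ratio: float = 1.0) -> bool:
    """Promote the shadow model version only when its mean benchmark score
    meets or exceeds the production baseline (shadow >= min_ratio * prod)."""
    prod_mean = sum(prod_scores) / len(prod_scores)
    shadow_mean = sum(shadow_scores) / len(shadow_scores)
    return shadow_mean >= min_ratio * prod_mean
```

A stricter variant would also require per-dimension parity (no safety regression even if average quality improves), which the mean-only comparison above deliberately omits.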
Evaluation Anti-Patterns
Anti-pattern 1: Evaluating on curated inputs only. Benchmarks built from carefully selected examples consistently overestimate production performance. Include messy, ambiguous, and adversarial inputs that reflect real-world conditions. Our benchmarks allocate 20% of scenarios to deliberately difficult or ambiguous inputs.
Anti-pattern 2: Using the same model as evaluator and producer. When the same model evaluates its own outputs, it exhibits systematic blind spots — it rates its own failure modes as acceptable. Use a different model for evaluation, or better yet, combine LLM-as-judge with regular human evaluation.
Anti-pattern 3: Optimizing for benchmark scores instead of production outcomes. Agents can be tuned to score well on benchmarks while performing poorly on real tasks. Always validate benchmark improvements against production metrics. If benchmark scores improve but production quality does not, the benchmark needs updating, not the agent.
Anti-pattern 4: Measuring only accuracy. Accuracy is necessary but insufficient. An agent that produces correct outputs in 60 seconds when the user expects 10 seconds has failed. An agent that produces correct outputs containing PII has failed. Evaluate across all relevant dimensions simultaneously.
Anti-pattern 5: Treating evaluation as a one-time activity. Agent evaluation is an operational practice, not a pre-launch checklist. Teams that evaluate thoroughly before launch but stop monitoring post-launch consistently experience quality degradation within 30–60 days. Deloitte’s analysis of enterprise AI programs found that continuous evaluation reduces production incidents by 67% compared to periodic evaluation. [Source: Deloitte, “AI Ops Maturity,” 2025]
Cost of Evaluation
Evaluation infrastructure has real costs that must be budgeted.
| Component | Typical Monthly Cost | Notes |
|---|---|---|
| Benchmark suite execution | USD 500–2,000 | Depends on agent count and benchmark size |
| Production quality scoring | USD 1,000–5,000 | LLM-as-judge on sampled outputs |
| Human review (sampled) | USD 2,000–8,000 | 5–10% sample at 15–20 min per review |
| Monitoring infrastructure | USD 500–2,000 | Logging, dashboards, alerting |
| Shadow testing | USD 1,000–3,000 | Duplicate inference during model transitions |
| Total | USD 5,000–20,000 | 10–25% of agent operating costs |
This represents 10–25% of typical agent operating costs — a significant investment that consistently pays for itself through prevented incidents and maintained quality. Organizations that skip evaluation investment typically spend 3–5x more on incident response and quality remediation. [Source: McKinsey Digital, 2025]
Production Evaluation at The Thinking Company
Our agent evaluation practice runs on a weekly cadence. Every Monday, the evaluation pipeline executes the full benchmark suite for all production agents (currently 12 agents across 4 workflows). Results are compared against the previous week, and any agent showing more than a 3% quality decline is flagged for investigation.
We sample 10% of production outputs for human review, with higher sampling rates (25%) for client-facing workflows. Reviews follow a structured rubric with 8 quality dimensions scored on a 1–5 scale. Aggregated rubric scores are tracked monthly and reported in our operational dashboard.
For organizations building their first agent evaluation practice, we recommend starting with Layer 1 (benchmarks) and Layer 3 (production monitoring) — these catch the highest-impact failures with the lowest implementation effort. Layers 2 and 4 can be added as the agent portfolio matures. Our AI Build Sprint (EUR 50–80K) includes evaluation infrastructure setup as a standard deliverable because agents without evaluation are agents waiting to fail.
Frequently Asked Questions
How often should AI agents be evaluated?
Capability benchmarks should run at minimum weekly, with daily runs recommended for high-stakes production agents. Production monitoring should be continuous — every agent invocation should be logged and key metrics should be calculated in real-time. Human review sampling should occur weekly, with results analyzed and acted upon in a weekly review cycle. Model version changes should trigger an immediate full benchmark run regardless of the regular schedule.
What metrics matter most for AI agent evaluation?
The single most important metric is task completion rate — the percentage of invocations where the agent produces a usable output without human intervention. This metric integrates accuracy, reliability, and safety into one number. The second most important metric is human override rate — how often humans need to correct or replace agent outputs. Rising override rates are the most reliable leading indicator of system-level quality problems. All other metrics (latency, cost, quality scores) are important but secondary to these two.
How do you evaluate an AI agent that produces creative or subjective outputs?
For creative or subjective tasks (writing, design, strategy recommendations), use rubric-based evaluation with multiple human evaluators. Define 5–8 quality dimensions with clear descriptions and anchor examples for each score level. Calculate inter-rater agreement to ensure rubric reliability. LLM-as-judge scoring can supplement human evaluation for high-volume assessment — research shows 0.85+ correlation with human scores when detailed rubrics are provided. Never rely solely on automated scoring for subjective outputs. [Source: arXiv, Zheng et al., 2024]
What is an acceptable failure rate for production AI agents?
Acceptable failure rates depend on the consequences of failure. For informational tasks (research summaries, data analysis), a 5–10% failure rate with human review of flagged cases is typically acceptable. For transactional tasks (sending emails, updating records, making purchases), failure rates must be below 1%, with mandatory human approval for high-value actions. For safety-critical tasks (medical, financial, legal), no autonomous failure is acceptable — human-in-the-loop verification is required for every consequential output. Define your acceptable failure rate before deployment and build monitoring to detect breaches.
How do you handle AI agent evaluation across multiple languages?
Multilingual agent evaluation requires language-specific benchmarks and language-qualified human reviewers. Machine-translated benchmarks are insufficient because they miss cultural context and natural phrasing. Build separate benchmark suites for each target language using native speakers as reference output creators. LLM-as-judge scoring works across major languages but requires language-specific rubrics. Monitor quality metrics per language — agents frequently perform worse in non-English languages, and this degradation increases for less-resourced languages. [Source: Stanford HELM, 2025]
Should we build evaluation infrastructure in-house or use third-party tools?
For most organizations, a hybrid approach works best. Use third-party tools for foundational capabilities — logging (Datadog, Grafana), LLM monitoring (Langfuse, Braintrust, Patronus AI), and alerting (PagerDuty). Build custom tooling for organization-specific evaluation criteria, benchmark management, and human review workflows. The third-party ecosystem for agent evaluation is maturing rapidly but still lacks domain-specific evaluation capabilities that enterprises need. Budget 2–4 weeks for initial evaluation infrastructure setup.
How does agent evaluation differ from traditional ML model evaluation?
Traditional ML evaluation measures model performance on a fixed test set at a point in time — accuracy, precision, recall, F1 score. Agent evaluation must additionally account for: (1) multi-step reasoning chains where errors compound, (2) tool usage correctness and reliability, (3) behavioral consistency over time as models update, (4) system-level coordination in multi-agent architectures, and (5) safety properties like authorization boundary compliance. The evaluation surface area for agents is roughly 5–10x larger than for traditional ML models, requiring proportionally more investment in evaluation infrastructure.
What role does governance play in agent evaluation?
Governance and evaluation are deeply linked. Governance defines what agents are allowed to do — their authorization boundaries, decision rights, and escalation triggers. Evaluation measures whether agents actually stay within those boundaries. Every governance requirement should map to at least one evaluation metric. For example, if governance policy states that agents must escalate decisions above USD 10,000, evaluation must include test scenarios that verify this escalation behavior. Governance without evaluation is aspirational. Evaluation without governance has no standard to measure against. Build them together.