How to Build an AI Agent Pipeline
Building an AI agent pipeline means designing, implementing, and deploying a sequence of specialized AI agents that collaborate to transform an input (a user request, a data trigger, a scheduled event) into a business outcome — with each agent handling a distinct stage of the process. The pipeline architecture separates concerns (research from analysis, analysis from generation, generation from quality assurance), enabling independent optimization of each stage. A well-built agent pipeline processes work with consistent quality at scale, handles failures gracefully, and produces auditable outputs — qualities that ad-hoc agent implementations consistently fail to deliver.
This is not a theoretical exercise. Agent pipelines are the operational backbone of the emerging agentic AI architecture pattern that enterprises are adopting at accelerating rates. Gartner estimates that 35% of enterprise AI workloads will run through agent pipelines by 2028, up from approximately 5% in early 2026. [Source: Gartner, “AI Agent Infrastructure Forecast,” 2026] The growth driver is straightforward: pipelines produce more reliable outcomes than monolithic agents, and reliability is the primary concern for enterprise buyers. At The Thinking Company, we build agent pipelines for client engagements and operate them for our own workflows — this guide reflects what works in production, not just in prototypes.
Prerequisites: What You Need Before Building
Before writing a single line of agent code, validate that you have the necessary foundations in place. Skipping prerequisites is the most common cause of agent pipeline failures — 58% of failed enterprise agent projects trace to inadequate preparation rather than technical issues. [Source: McKinsey Digital, “AI Project Failure Analysis,” 2025]
Technical Prerequisites
LLM API access with sufficient quota. Agent pipelines consume significantly more tokens than single-agent systems — a 5-stage pipeline might use 10–20x the tokens of a single prompt. Ensure your API quota and budget accommodate this. For Claude models, Anthropic offers enterprise tiers with dedicated capacity. For OpenAI models, Azure OpenAI provides provisioned throughput.
Development environment. Python 3.10+ with async support (agents run concurrently). Key libraries: an orchestration framework (LangGraph, CrewAI, or custom), an LLM client library (Anthropic SDK, OpenAI SDK), a structured output parser (Pydantic), and a state management library. We recommend VS Code or Cursor with Claude Code for agent development — the AI-assisted coding loop significantly accelerates agent prompt engineering.
External tool infrastructure. Agents need tools: web search APIs, database connections, file system access, RAG systems, CRM APIs, email APIs. Identify which tools each pipeline stage needs and verify that access is provisioned and tested before building agents.
Evaluation infrastructure. Set up logging, quality scoring, and monitoring from day one — not as a post-deployment afterthought. See our agent evaluation guide for the complete evaluation stack.
Business Prerequisites
A well-defined workflow to automate. Agent pipelines work best for workflows that are currently performed by humans following a repeatable process. Document the current human workflow step by step: what inputs trigger it, what steps are performed, what tools are used, what decisions are made, and what outputs are produced. This workflow documentation becomes the blueprint for your pipeline design.
Success criteria. Define what “good enough” looks like for the pipeline’s output. What quality dimensions matter? What error rate is acceptable? What latency is acceptable? What cost per execution is acceptable? Without success criteria, you cannot evaluate whether the pipeline is working. [Source: Based on professional judgment]
Governance framework. Before agents take autonomous actions, define their authorization boundaries, escalation rules, and accountability model. See our agent governance guide for the complete framework. Building governance after deployment is 3–5x more expensive than building it alongside the pipeline.
Step 1: Map the Workflow and Identify Agent Boundaries
Time estimate: 3–5 days
The first step transforms your human workflow documentation into an agent architecture. This is the highest-leverage design decision in the entire pipeline — getting the agent boundaries right determines everything downstream.
Workflow Mapping Process
1. List every step in the current human workflow. Be granular. “Research the topic” is too broad. “Search for market size data from analyst reports,” “Identify the top 5 competitors by revenue,” “Find 3 relevant case studies” — these are the right level of granularity.
2. Group steps by cognitive type. Research steps (finding information), analysis steps (interpreting information), generation steps (producing new content), and verification steps (checking quality). Each group becomes a candidate for an agent.
3. Identify dependencies. Which steps require output from previous steps? Which steps can execute independently? Dependencies define your pipeline topology — sequential where dependencies exist, parallel where they do not.
4. Define inter-stage contracts. For each handoff between agents, specify exactly what data the upstream agent produces and what the downstream agent expects. Use structured schemas (JSON with typed fields). Vague handoffs are the primary source of pipeline failures.
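As an illustration of step 4, a handoff contract can be expressed as a typed schema that the downstream agent validates before starting work. A minimal sketch using Pydantic; the model and field names here are hypothetical, not from any specific framework:

```python
from pydantic import BaseModel, Field

# Hypothetical contract for a Research -> Writing handoff
class SourceRecord(BaseModel):
    url: str
    summary: str

class ResearchHandoff(BaseModel):
    topic: str
    sources: list[SourceRecord] = Field(min_length=1)  # downstream agent needs at least one source
    key_statistics: list[str]

# The downstream agent validates the upstream output before consuming it
payload = {
    "topic": "agent pipelines",
    "sources": [{"url": "https://example.com", "summary": "Overview of agent orchestration."}],
    "key_statistics": ["35% of workloads by 2028"],
}
handoff = ResearchHandoff.model_validate(payload)
```

If validation fails, the pipeline knows the contract was violated at the handoff, not somewhere downstream.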
Example: Content Production Pipeline
| Step | Cognitive Type | Agent | Input | Output |
|---|---|---|---|---|
| Generate outline from brief | Planning | Planning Agent | Content brief (topic, audience, requirements) | Structured outline with H2s, key points per section, source requirements |
| Research data and sources | Research | Research Agent | Outline with source requirements | Source list with summaries, statistics, quotes, and URLs |
| Write draft content | Generation | Writing Agent | Outline + research package | Full prose draft with inline citations |
| Verify claims and sources | Verification | Fact-Check Agent | Draft with citations | Verified draft with accuracy annotations |
| Score against quality rubric | Evaluation | Quality Agent | Verified draft + rubric | Quality score with dimension breakdown and specific feedback |
| Format and add metadata | Production | Formatting Agent | Approved draft | Final content with schema markup, internal links, metadata |
This 6-agent pipeline maps directly to how skilled human content teams operate — but executes in 3–5 minutes instead of 4–8 hours.
Agent Count Guidelines
A16z’s analysis of production agent systems found that the sweet spot for most enterprise workflows is 3–7 agents. [Source: a16z, “State of AI Agents,” 2026] Use this decision framework:
- 2 agents: Workflow has one core task and one verification task. Minimum viable pipeline.
- 3–4 agents: Workflow has distinct preparation, execution, and quality stages. Most common for initial deployments.
- 5–7 agents: Workflow has multiple distinct cognitive types (research, analysis, generation, verification, formatting). Appropriate for complex business processes.
- 8+ agents: Consider hierarchical orchestration or splitting into multiple pipelines.
Step 2: Design Individual Agents
Time estimate: 5–7 days
Each agent in the pipeline needs four components: a system prompt, a tool set, an output schema, and evaluation criteria.
System Prompt Design
The system prompt is the most important component of each agent. It defines the agent’s role, capabilities, constraints, and quality standards. A well-crafted system prompt is the difference between an agent that produces production-quality output and one that produces generic, unreliable output.
System prompt structure:
1. Role definition (who the agent is, what it does)
2. Task specification (exactly what to produce, in what format)
3. Quality criteria (what "good" looks like, with examples)
4. Constraints (what to avoid, what boundaries to respect)
5. Tool usage instructions (when and how to use each tool)
6. Output format specification (exact schema to follow)
Prompt length guidelines: Anthropic’s research shows that agent instruction-following accuracy peaks at system prompts of 500–2,000 tokens and degrades beyond 3,000 tokens. [Source: Anthropic, “Building Effective Agents,” 2025] If your system prompt exceeds 3,000 tokens, consider splitting the agent into two agents or moving reference material into a retrievable knowledge base rather than the prompt.
Key principle: Be specific, not verbose. “Write in a professional tone” is vague. “Write at a Flesch-Kincaid grade level of 10–12, using active voice, with sentences under 25 words” is specific. Specific instructions produce consistent outputs; vague instructions produce variable outputs.
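To make the six-part structure concrete, here is a skeleton system prompt for a hypothetical research agent. The section contents are illustrative placeholders, not a tested production prompt:

```python
# Illustrative system prompt following the six-part structure above.
# Tool names (web_search, fetch_page) and schema names are assumptions.
RESEARCH_AGENT_PROMPT = """\
# Role
You are a research agent that gathers sources for a content pipeline.

# Task
Given an outline, return a source list with summaries, statistics, and URLs.

# Quality criteria
Sources must be from the last 2 years and directly relevant to the outline.

# Constraints
Never invent URLs or statistics. If a fact cannot be sourced, list it under gaps.

# Tool usage
Use web_search for discovery; call fetch_page only on URLs web_search returned.

# Output format
Return JSON matching the ResearchOutput schema. No prose outside the JSON.
"""
```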
Tool Set Design
Each agent should have access to the minimum set of tools required for its task — no more. Tool bloat degrades agent performance. Research from Microsoft found that agents with 5 or fewer tools select the correct tool 94% of the time; agents with 15+ tools select correctly only 76% of the time. [Source: Microsoft Research, 2025]
Tool design principles:
- Each tool does one thing well
- Tool descriptions clearly state what the tool does and when to use it
- Tool inputs and outputs use typed schemas
- Tools handle their own errors and return informative error messages
- Tools enforce rate limits and timeouts internally
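A minimal sketch of a tool built to these principles: single purpose, an internal timeout, and an informative error object rather than a raised exception. The endpoint and response shape are hypothetical:

```python
import json
from urllib.request import urlopen

# Hypothetical single-purpose tool; the endpoint is an assumption for illustration.
def fetch_company_profile(domain: str, timeout_s: float = 5.0,
                          base_url: str = "https://api.example.com/profile") -> dict:
    """Look up a company profile by domain; always returns {"ok": bool, ...}."""
    try:
        with urlopen(f"{base_url}?domain={domain}", timeout=timeout_s) as resp:
            return {"ok": True, "profile": json.load(resp)}
    except OSError as exc:  # URLError, timeouts, and socket errors all subclass OSError
        # Informative error the agent can reason about, instead of a raw traceback
        return {"ok": False, "error": f"profile lookup failed: {exc}"}
```

Returning a structured error keeps the agent's tool-use loop intact: the model sees what failed and can follow its error-handling instructions instead of crashing the stage.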
Output Schema Design
Every agent must produce structured output that the next agent (or the final consumer) can reliably parse. We use Pydantic models to define output schemas:
```python
from pydantic import BaseModel

# Source and Statistic are sibling models; their fields here are illustrative
class Source(BaseModel):
    url: str
    summary: str

class Statistic(BaseModel):
    value: str
    citation: str

class ResearchOutput(BaseModel):
    topic: str
    sources: list[Source]
    key_statistics: list[Statistic]
    key_findings: list[str]
    confidence_score: float  # 0.0 to 1.0
    gaps_identified: list[str]
```
Structured outputs eliminate the parsing ambiguity that causes downstream failures. When Agent B expects a list of sources with URLs and summaries, and Agent A produces unstructured prose mentioning sources inline, the handoff breaks. Structured schemas make the contract explicit and enforceable.
Anthropic’s structured output feature (tool_use with JSON schema) and OpenAI’s structured output mode both enforce schema compliance at the model level — the model is constrained to produce valid JSON matching the specified schema. Use these features for every agent output.
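One way to wire this up is to derive the enforced JSON schema directly from the Pydantic output model, then validate the returned arguments as a last line of defense. A sketch, with the provider API call itself omitted and illustrative model names:

```python
from pydantic import BaseModel

class SourceItem(BaseModel):
    url: str
    summary: str

class ResearchResult(BaseModel):
    topic: str
    sources: list[SourceItem]

# The Pydantic model doubles as the enforced schema: pass model_json_schema()
# as the tool's input schema in the provider request (request code omitted here).
schema = ResearchResult.model_json_schema()

# The provider returns arguments matching the schema; re-validating catches drift.
raw_args = {
    "topic": "agent pipelines",
    "sources": [{"url": "https://example.com", "summary": "Overview."}],
}
parsed = ResearchResult.model_validate(raw_args)
```

Keeping one Pydantic model as the single source of truth means the schema sent to the provider and the schema enforced at the handoff can never diverge.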
Evaluation Criteria
Define per-agent evaluation criteria before building the agent. These criteria serve double duty: they guide prompt engineering (you know what to optimize for) and they enable automated quality monitoring in production.
Per-agent evaluation template:
Agent: Research Agent
Evaluation dimensions:
- Source relevance (1-5): Are sources directly relevant to the topic?
- Source recency (1-5): Are sources from the last 2 years?
- Statistical coverage (1-5): Does the research include quantitative data?
- Completeness (1-5): Are all requested research areas covered?
- Accuracy (1-5): Are source attributions correct?
Minimum passing score: 3.5 average across all dimensions
Target score: 4.0 average
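The template above translates directly into an automated gate. A minimal sketch, assuming dimension scores arrive as a dict of 1–5 ratings:

```python
# Thresholds mirror the Research Agent template above
MIN_PASSING = 3.5
TARGET = 4.0

def evaluate_agent_run(scores: dict[str, int]) -> dict:
    """scores maps dimension name -> 1-5 rating; returns the average plus pass flags."""
    avg = sum(scores.values()) / len(scores)
    return {"average": round(avg, 2), "passed": avg >= MIN_PASSING, "at_target": avg >= TARGET}

result = evaluate_agent_run({
    "source_relevance": 4, "source_recency": 3,
    "statistical_coverage": 4, "completeness": 5, "accuracy": 4,
})
```

The same function serves both uses mentioned above: scoring benchmark runs during prompt engineering and gating sampled outputs in production monitoring.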
Step 3: Build and Test Individual Agents
Time estimate: 5–10 days
Build each agent independently and test it in isolation before connecting agents into the pipeline. This is the unit testing phase.
Agent Development Loop
For each agent:
- Write the initial system prompt based on the workflow mapping and agent design from Steps 1–2.
- Configure tools and verify that each tool works independently.
- Run 10–20 representative inputs through the agent and manually evaluate outputs.
- Refine the prompt based on observed failure patterns. Common refinements: adding examples of good output, adding explicit error-handling instructions, tightening output format requirements.
- Build the benchmark suite — 50–100 test scenarios with evaluation criteria (see our evaluation guide for benchmark methodology).
- Run the full benchmark and verify that the agent meets minimum quality thresholds.
- Select the model tier — test the agent on both a smaller (Haiku/Sonnet) and larger (Opus) model. Use the smallest model that meets quality thresholds, reserving larger models for agents that genuinely need the capability.
Model Selection Per Agent
Not every agent needs the most capable model. Match model capability to task complexity:
| Task Type | Recommended Model Tier | Rationale |
|---|---|---|
| Information retrieval and formatting | Small (Haiku-class) | Structured retrieval does not require deep reasoning |
| Research synthesis | Medium (Sonnet-class) | Requires source evaluation and summary skills |
| Analysis and reasoning | Medium-Large (Sonnet/Opus) | Benefits from deeper reasoning chains |
| Creative generation | Medium (Sonnet-class) | Good balance of quality and cost for writing |
| Quality evaluation and judging | Large (Opus-class) | Evaluation requires the highest reasoning capability |
| Supervision and orchestration | Large (Opus-class) | Must understand and coordinate all other agents |
This tiering approach typically reduces pipeline costs by 40–60% compared to using the largest model for every agent, with less than 5% quality degradation. [Source: Anthropic, “Model Selection and Routing,” 2025]
Handling Tool Errors
Agents will encounter tool failures in production — API timeouts, rate limits, authentication failures, malformed responses. Build error handling into each agent’s instructions:
Instruction pattern for tool errors:
If a tool call fails:
1. Retry once after 2 seconds.
2. If retry fails, note the failure in your output under "tool_errors".
3. Continue with available information. Do not hallucinate data you could not retrieve.
4. If the failed tool was essential (not supplementary), set confidence_score below 0.5.
Agents that silently ignore tool errors and fill gaps with hallucinated data are the most dangerous failure mode in production pipelines. Explicit error-handling instructions reduce hallucination-on-failure by 83%. [Source: Microsoft Research, “Agent Reliability Patterns,” 2025]
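The same pattern can also be enforced at the orchestration layer rather than relying solely on the prompt. A minimal retry-wrapper sketch, using the document's retry count; the flaky tool below is a stand-in for a real API call:

```python
import time

def call_tool_with_retry(tool, *args, retries: int = 1, delay_s: float = 2.0):
    """Call a tool; on failure retry (once by default), then return a structured error."""
    last_error = None
    for attempt in range(retries + 1):
        try:
            return {"ok": True, "result": tool(*args)}
        except Exception as exc:
            last_error = exc
            if attempt < retries:
                time.sleep(delay_s)
    # Surface the failure so the agent can record it under "tool_errors"
    return {"ok": False, "error": str(last_error)}

# Demo: a tool that times out once, then succeeds (delay shortened for the example)
attempts = {"n": 0}
def flaky_search(query: str) -> str:
    attempts["n"] += 1
    if attempts["n"] == 1:
        raise TimeoutError("search API timed out")
    return f"results for {query}"

outcome = call_tool_with_retry(flaky_search, "agent pipelines", delay_s=0.0)
```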
Step 4: Connect Agents into a Pipeline
Time estimate: 3–5 days
With individual agents tested, connect them into the pipeline using your chosen orchestration framework.
Orchestration Implementation
Choose an orchestration pattern based on your workflow characteristics:
- Sequential Pipeline: Use LangGraph's `StateGraph` or a simple async chain if your workflow is strictly linear.
- Supervisor + Specialists: Use LangGraph's `create_react_agent` with tool-based agent invocation, or implement custom supervisor logic.
- Parallel Fan-Out: Use Python's `asyncio.gather()` with agent invocations as async tasks, combined with a synthesis agent.
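The parallel fan-out pattern can be sketched with stubbed agents; in production, each coroutine would wrap an LLM or tool round-trip:

```python
import asyncio

# Stubbed agents: each coroutine stands in for a real LLM/tool call
async def research_agent(subtopic: str) -> str:
    await asyncio.sleep(0)  # placeholder for API latency
    return f"findings on {subtopic}"

async def synthesis_agent(findings: list[str]) -> str:
    # A real synthesis agent would merge and deduplicate; here we just join
    return " | ".join(findings)

async def fan_out_pipeline(subtopics: list[str]) -> str:
    # gather() runs all research agents concurrently, preserving input order
    findings = await asyncio.gather(*(research_agent(s) for s in subtopics))
    return await synthesis_agent(list(findings))

report = asyncio.run(fan_out_pipeline(["market size", "competitors", "case studies"]))
```

Because pipelines are I/O-bound, concurrent fan-out typically cuts stage latency to roughly that of the slowest branch rather than the sum of all branches.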
State Management Implementation
The pipeline needs a shared state object that accumulates outputs from each agent. We implement this as a Pydantic model that grows as the pipeline progresses:
```python
from typing import Optional
from pydantic import BaseModel, Field

# UserRequest, PipelineMetadata, PipelineError, and the stage-output models
# (PlanOutput, ResearchOutput, ...) are defined alongside their agents
class PipelineState(BaseModel):
    input: UserRequest
    plan: Optional[PlanOutput] = None
    research: Optional[ResearchOutput] = None
    draft: Optional[DraftOutput] = None
    fact_check: Optional[FactCheckOutput] = None
    quality_score: Optional[QualityOutput] = None
    final_output: Optional[FormattedOutput] = None
    errors: list[PipelineError] = Field(default_factory=list)
    metadata: PipelineMetadata
```
Each agent reads the fields it needs from the state object and writes its output to its designated field. The state object serves as both the communication channel between agents and the audit trail for the pipeline execution.
Inter-Agent Validation
At every handoff point, validate that the upstream agent’s output matches the downstream agent’s expected input:
- Schema validation: Does the output conform to the expected Pydantic model?
- Completeness check: Are all required fields populated?
- Quality gate: Does the output meet minimum quality thresholds (e.g., research confidence > 0.5)?
Failed validations trigger the appropriate response: retry the upstream agent with error context, skip the failed stage and continue with degraded quality (if the stage is optional), or halt the pipeline and escalate.
Step 5: Integration Testing
Time estimate: 3–5 days
Integration testing validates that the pipeline produces coherent end-to-end outputs — individual agents may work fine in isolation but produce incoherent results when combined.
End-to-End Test Suite
Build 30–50 end-to-end test scenarios that exercise different pipeline paths:
- Happy path scenarios (60%): Standard inputs that should flow through the pipeline without errors.
- Edge case scenarios (25%): Unusual inputs — very long, very short, ambiguous, multilingual, technical, non-technical.
- Error scenarios (15%): Inputs designed to trigger failures — tool unavailability, schema violations, quality gate failures.
What to Evaluate in Integration Tests
| Dimension | How to Evaluate |
|---|---|
| Output quality | Score against rubric (LLM-as-judge + human review) |
| Pipeline coherence | Does the final output reflect all intermediate stages consistently? |
| Error handling | Do failures trigger appropriate retries, degradation, or escalation? |
| Latency | Total pipeline time within SLA? |
| Cost | Token usage within budget? |
| State integrity | Is the state object complete and consistent after each run? |
Regression Baseline
Run the full integration test suite and record results as the baseline. Every subsequent code change must pass the full suite at or above baseline quality. This prevents incremental changes from degrading the pipeline — a pattern so common it has a name: “agent quality erosion.”
Scale AI’s analysis of production agent systems found that pipelines without regression testing lose an average of 1.5% quality per month through accumulated minor changes, totaling 18% annual quality degradation. Pipelines with weekly regression testing maintain quality within 2% of baseline. [Source: Scale AI, 2025]
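A regression gate implementing the "within 2% of baseline" rule might look like this; dimension names and baseline values are illustrative:

```python
# Illustrative baseline recorded from the first full integration run
BASELINE = {"quality": 4.1, "coherence": 4.3, "error_handling": 3.9}
TOLERANCE = 0.02  # allow 2% relative drift below baseline

def passes_regression(current: dict[str, float]) -> bool:
    """Every dimension must score at or above baseline, minus the tolerance."""
    return all(
        current[dim] >= base * (1 - TOLERANCE)
        for dim, base in BASELINE.items()
    )
```

Wiring this into CI so that any prompt or code change must pass before merging is what prevents the gradual erosion described above.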
Step 6: Deploy to Production
Time estimate: 3–5 days
Production deployment requires infrastructure, monitoring, and operational procedures beyond the pipeline code itself.
Infrastructure Requirements
Compute: Agent pipelines are I/O-bound (waiting for LLM API responses), not compute-bound. A single server or container can handle many concurrent pipeline executions. Autoscale based on concurrent pipeline count, not CPU utilization.
Storage: Pipeline state objects must persist for the duration of execution (minutes to hours) and audit logs must persist for compliance retention periods (1–3 years). Use a database (PostgreSQL, DynamoDB) for pipeline state and an object store (S3, GCS) for audit logs.
Secrets management: API keys for LLM providers, tool APIs, and internal services must be securely stored and rotated. Use a secrets manager (AWS Secrets Manager, HashiCorp Vault) — never hardcode API keys in pipeline configuration.
Network: Ensure reliable connectivity to LLM provider APIs and tool APIs. Implement retry logic with exponential backoff for transient network failures. Consider proximity to LLM provider endpoints for latency-sensitive pipelines.
Deployment Strategy
Shadow deployment (Week 1): Run the pipeline in parallel with the existing human workflow. Compare pipeline outputs to human outputs. Fix any discrepancies before switching to live operation.
Gradual rollout (Weeks 2–3): Route 10% of traffic to the pipeline, then 25%, then 50%, monitoring quality metrics at each step. If quality metrics degrade at any stage, halt the rollout and investigate.
Full deployment (Week 4): Route 100% of target traffic to the pipeline. Maintain the ability to revert to manual processing for 30 days post-deployment.
BCG’s analysis of enterprise AI deployments found that organizations using gradual rollout strategies experience 62% fewer production incidents than those using immediate cutover. [Source: BCG, 2025] The extra 2–3 weeks of gradual rollout consistently save more time than they cost.
Monitoring Setup
Deploy the monitoring stack described in our evaluation guide:
- Real-time dashboard: Agent latency, error rates, quality scores, cost per execution.
- Alerting: Automated alerts for quality drops, error rate spikes, and cost anomalies.
- Logging: Structured logs for every agent invocation with correlation IDs for pipeline tracing.
- Quality sampling: Automated LLM-as-judge scoring on 10–100% of outputs (based on volume and criticality).
Step 7: Operate and Iterate
Time estimate: Ongoing
Production agent pipelines require active operation — they are not set-and-forget systems.
Weekly Operations Cadence
Monday: Review the past week’s quality metrics. Identify the top 3 quality issues by frequency or severity.
Tuesday–Thursday: Investigate and fix identified issues. Common fixes: refine agent prompts, adjust quality gate thresholds, update tool configurations, add new test scenarios to the regression suite.
Friday: Run the full regression test suite to verify that fixes do not introduce new issues. Deploy validated changes.
Monthly Operations Cadence
Monthly quality review: Analyze quality trends. Is average quality improving, stable, or declining? Are specific agents responsible for quality issues?
Monthly cost review: Analyze cost per execution trends. Are costs stable? Which agents consume the most tokens? Can model tiers be adjusted?
Monthly benchmark update: Add new test scenarios based on production failures encountered during the month. Remove scenarios that no longer represent realistic inputs.
Model Version Management
When LLM providers release new model versions, do not auto-upgrade production pipelines. Follow this process:
- Pin the current model version in production configuration.
- Test the new version against the full benchmark suite in a staging environment.
- Compare results — quality scores, latency, cost, and error rates — between the current and new versions.
- Shadow deploy the new version for 3–5 days, comparing live outputs.
- Promote only when the new version meets or exceeds the current version across all metrics.
Anthropic and OpenAI both provide version-pinned model identifiers. Use them. Unpinned model references (e.g., “claude-sonnet” without a version) will auto-upgrade and can silently change pipeline behavior. [Source: Anthropic documentation, 2026]
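A minimal sketch of per-agent version pinning in configuration; the identifier strings are examples only, so check your provider's model list for the current pinned names:

```python
# Example pinned identifiers; the exact strings are illustrative assumptions
MODEL_CONFIG = {
    "research_agent": "claude-sonnet-4-20250514",
    "quality_agent": "claude-opus-4-20250514",
}

def model_for(agent: str) -> str:
    # Fail loudly on an unknown agent rather than falling back to an unpinned alias
    return MODEL_CONFIG[agent]
```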
Common Pipeline Architectures by Use Case
Content Production Pipeline
[Brief Intake] → [Research Agent] → [Outline Agent] → [Writing Agent] → [Fact-Check Agent] → [Quality Agent] → [Format Agent] → Published Content
Key characteristics: 6–7 stages, evaluator-optimizer loop between Writing and Quality agents, batch processing (40–60 items), total pipeline time 3–5 minutes per article, cost USD 0.30–0.80 per article.
Customer Service Pipeline
[Intent Router] → {FAQ Agent | Troubleshooting Agent | Escalation Agent} → [Response Agent] → [Compliance Check Agent] → Customer Response
Key characteristics: Router pattern with 3–4 specialist branches, real-time (target <10 seconds), human escalation for complex issues, compliance gate for regulated industries, cost USD 0.02–0.10 per interaction.
Data Processing Pipeline
[Ingestion Agent] → [Extraction Agent] → [Validation Agent] → [Enrichment Agent] → [Load Agent] → Structured Data Store
Key characteristics: 5 stages, high-volume batch processing, strict schema validation at every stage, low cost per record (USD 0.01–0.05), error quarantine for failed records.
Research and Analysis Pipeline
[Question Decomposition Agent] → [Parallel Research Agents (3-5)] → [Synthesis Agent] → [Analysis Agent] → [Report Agent] → Research Report
Key characteristics: Fan-out pattern for parallel research, synthesis as critical coordination point, deep analysis stage using capable model, total pipeline time 5–15 minutes, cost USD 1.00–5.00 per report.
Cost Estimation Framework
Estimate pipeline costs before building to validate business viability:
Per-agent cost formula:
Cost per agent invocation = (input_tokens * input_price) + (output_tokens * output_price)
Pipeline cost formula:
Cost per pipeline execution = Sum of all agent invocation costs + tool API costs
Monthly cost formula:
Monthly cost = (Cost per execution * daily executions * 30) + infrastructure costs
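The three formulas translate directly into code. The per-token prices below are purely illustrative; substitute your provider's current rates:

```python
def agent_cost(input_tokens: int, output_tokens: int,
               input_price: float, output_price: float) -> float:
    """Cost of one agent invocation; prices are per token."""
    return input_tokens * input_price + output_tokens * output_price

def pipeline_cost(agent_costs: list[float], tool_api_costs: float) -> float:
    """Cost of one end-to-end pipeline execution."""
    return sum(agent_costs) + tool_api_costs

def monthly_cost(cost_per_execution: float, daily_executions: int,
                 infrastructure: float) -> float:
    return cost_per_execution * daily_executions * 30 + infrastructure

# Example: a two-agent pipeline at illustrative per-token prices
a1 = agent_cost(10_000, 2_000, input_price=3e-6, output_price=15e-6)
a2 = agent_cost(5_000, 1_000, input_price=0.8e-6, output_price=4e-6)
per_run = pipeline_cost([a1, a2], tool_api_costs=0.01)
```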
Benchmark costs (March 2026 pricing):
| Pipeline Type | Agents | Model Mix | Cost per Execution | 1K/day Monthly Cost |
|---|---|---|---|---|
| Content production | 6 | 1 Opus + 5 Sonnet | USD 0.50–0.80 | USD 15K–24K |
| Customer service | 4 | 1 Sonnet + 3 Haiku | USD 0.02–0.10 | USD 600–3K |
| Data processing | 5 | 5 Haiku | USD 0.01–0.05 | USD 300–1.5K |
| Research/analysis | 5 | 1 Opus + 4 Sonnet | USD 1.00–5.00 | USD 30K–150K |
Compare these costs against the current human labor cost for the same workflow. Deloitte’s analysis shows that agent pipelines typically achieve 60–80% cost reduction compared to equivalent human workflows for structured, repeatable tasks. [Source: Deloitte, 2025]
Building Agent Pipelines with The Thinking Company
We build production agent pipelines through two engagement models:
AI Build Sprint (EUR 50–80K, 4–6 weeks): Ideal for organizations with a well-defined workflow ready for agent automation. Includes workflow analysis, pipeline design, agent development, testing, deployment, and 30-day post-launch support. Most first-time agent pipeline projects fit this model.
AI Product Build (EUR 200–400K+, 3–6 months): Ideal for organizations building agent-powered products or automating multiple complex workflows. Includes multi-agent system design, custom orchestration architecture, governance framework implementation, and comprehensive evaluation infrastructure.
Organizations at Stage 2+ of the AI maturity model typically have the data infrastructure and technical capability to benefit from agent pipelines. Organizations at Stage 1 should first invest in foundational data and AI strategy through our workshop and diagnostic offerings.
Frequently Asked Questions
How long does it take to build a production AI agent pipeline?
For a well-scoped, single-workflow pipeline (3–5 agents), expect 4–8 weeks from design to production: workflow mapping and agent design (1–2 weeks), individual agent development and testing (1–2 weeks), pipeline integration and testing (1–2 weeks), and deployment and stabilization (1–2 weeks). Complex pipelines with multiple workflows, custom integrations, or regulatory compliance requirements may require 3–6 months. The timeline is driven more by integration complexity and testing thoroughness than by agent development speed. Cutting testing time to accelerate delivery consistently results in higher production failure rates.
What programming language should I use for agent pipelines?
Python dominates the agent pipeline ecosystem — approximately 85% of production agent systems use Python as their primary language. [Source: a16z, 2026] The ecosystem of LLM client libraries (Anthropic SDK, OpenAI SDK), orchestration frameworks (LangGraph, CrewAI), and evaluation tools (Langfuse, Braintrust) is overwhelmingly Python-first. TypeScript is a viable alternative, particularly for teams building agent pipelines within Node.js/Next.js web applications. Use Python unless you have a compelling reason not to.
Can I build an agent pipeline without a framework like LangGraph?
Yes. For simple sequential pipelines (3–4 agents), custom orchestration code using direct API calls and async Python is straightforward and avoids framework learning curves. We use custom orchestration for our production content engine because it gave us more control over state management and quality gating than any framework offered. Use a framework when you need complex orchestration patterns (conditional routing, parallel fan-out, dynamic replanning) or when you want to reduce development time at the cost of flexibility. Start simple and introduce a framework only when custom code becomes unwieldy.
How do I handle agent pipeline failures in production?
Design three levels of failure handling: (1) Agent-level retry — if an agent fails, retry with error context appended to the prompt (max 2 retries). (2) Stage-level degradation — if an agent cannot complete its task after retries, determine if the stage is critical or optional. Optional stages can be skipped with a quality flag. Critical stages trigger pipeline halt. (3) Pipeline-level escalation — if the pipeline cannot produce an acceptable output, route the task to a human operator with the partial pipeline state as context. Log every failure for pattern analysis. Review failure patterns weekly and address the top causes.
What is the minimum team size to build and operate an agent pipeline?
A senior engineer with LLM development experience can build a simple pipeline (3–4 agents, single workflow) solo in 4–6 weeks. Operating the pipeline in production requires approximately 0.25 FTE for monitoring, maintenance, and iteration. For complex pipelines (5+ agents, multiple workflows, compliance requirements), a team of 2–3 engineers plus a part-time product manager is typical for the build phase, with 0.5–1.0 FTE for ongoing operations. The bottleneck is usually prompt engineering expertise and evaluation methodology, not traditional software engineering skills.
How do I know if my workflow is a good candidate for an agent pipeline?
Score your workflow against these criteria: (1) Repeatability — the workflow follows a consistent process with defined steps (score 0–3). (2) Digital inputs/outputs — inputs and outputs are digital (text, data, documents), not physical (score 0–3). (3) Defined quality criteria — you can articulate what a “good” output looks like (score 0–3). (4) Sufficient volume — the workflow processes enough items (50+/month) to justify automation investment (score 0–3). (5) Current cost — the workflow consumes enough human time to make automation economically worthwhile (score 0–3). Workflows scoring 10+ out of 15 are strong candidates. Workflows scoring below 7 are likely poor candidates until the underlying process is better defined.
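The five-criterion screen, scored in code; the criterion keys are shorthand for the list above:

```python
def workflow_candidacy(scores: dict[str, int]) -> str:
    """Each criterion scored 0-3; thresholds follow the answer above."""
    total = sum(scores.values())
    if total >= 10:
        return "strong candidate"
    if total < 7:
        return "poor candidate"
    return "borderline"

verdict = workflow_candidacy({
    "repeatability": 3, "digital_io": 3, "quality_criteria": 2,
    "volume": 2, "current_cost": 1,
})
```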
Should I use RAG or fine-tuning for agents that need domain knowledge?
Use RAG as the default approach for domain knowledge. RAG is faster to implement (days vs. weeks), easier to update (swap documents vs. retrain), and provides source attribution (the agent can cite where its knowledge came from). Fine-tuning is appropriate only when you need the agent to adopt a specific behavioral pattern (writing style, reasoning approach) that cannot be achieved through prompting, OR when the knowledge base is extremely large and retrieval costs become prohibitive. In practice, fewer than 10% of enterprise agent pipelines require fine-tuning. RAG combined with well-crafted system prompts covers the vast majority of domain knowledge requirements.
How do I measure the ROI of an agent pipeline?
Calculate ROI using this framework: (1) Current cost — hours per month spent on the workflow by humans * average loaded hourly rate. (2) Pipeline cost — monthly LLM API costs + tool API costs + infrastructure costs + operation/maintenance labor costs. (3) Quality adjustment — if the pipeline produces higher or lower quality than human work, adjust the value accordingly. (4) ROI = (Current cost - Pipeline cost) / Pipeline cost * 100%. Most enterprise agent pipelines achieve 200–500% ROI for structured, high-volume workflows within 6 months of deployment. Pipelines for low-volume, highly variable workflows may never achieve positive ROI — this is an important viability check before investing in development. [Source: Deloitte, 2025]
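Steps 1–4 reduce to a one-line calculation once the quality-adjusted costs are known. A sketch with illustrative figures:

```python
def pipeline_roi(current_monthly_cost: float, pipeline_monthly_cost: float) -> float:
    """ROI as a percentage: (current - pipeline) / pipeline * 100."""
    return (current_monthly_cost - pipeline_monthly_cost) / pipeline_monthly_cost * 100

# Illustrative figures: workflow costs 40K/month manually, 10K/month as a pipeline
roi = pipeline_roi(current_monthly_cost=40_000, pipeline_monthly_cost=10_000)  # → 300.0
```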
Can agent pipelines process real-time requests, or are they only for batch work?
Agent pipelines serve both use cases, but the architecture differs. Batch pipelines (content production, data processing, report generation) optimize for throughput and quality, with latency measured in minutes. Real-time pipelines (customer service, chatbots, interactive tools) optimize for latency, with responses expected in seconds. Real-time pipelines use fewer agents (2–3), smaller models (Haiku/Sonnet), and simpler orchestration (router + specialist) to minimize response time. Batch pipelines can use more agents (5–7), larger models, and more thorough quality checking because latency tolerance is higher. Design your pipeline for the latency profile your use case requires.