AI AgentsJanuary 6, 202622 min read

Multi-Agent Orchestration: The Real Production Guide

Founder & CEO, Agentik{OS}

Most multi-agent demos crumble in production. Here's how to build orchestration that survives real workloads, error storms, and 3am failures.

Multi-Agent Orchestration: The Real Production Guide

One agent is a tool. Multiple agents working together are a team. And the difference between a productive team and a chaotic pile of competing individuals comes down to one thing: orchestration.

I run multi-agent systems in production. Not demos. Not proofs of concept that only work when the conditions are perfect. Systems that handle real workloads, serve real users, and need to keep running at 3am on a Sunday when nobody is watching and the API is rate-limiting and one of the agents is in a confused state.

The gap between "multi-agent demo that kills on stage" and "multi-agent system that actually works in production" is enormous. This is everything I've learned about closing that gap.

The One-Agent Trap

Every multi-agent system starts with a temptation: build one super-agent that does everything. Give it every tool, every piece of context, every capability the system needs. One prompt to rule them all.

This fails. Not always immediately. Usually at scale.

A single agent handling 50 tools makes poor tool selection decisions. The model has to reason about all 50 tools at once when selecting which one to use for each step. Tool selection quality degrades with the number of available tools.

A single agent handling every type of work conflates contexts. When the same agent writes code, reviews it, deploys it, and monitors it, the context from writing bleeds into the reviewing. The review is less critical because the agent knows what the code "should" be doing.

A single agent handling long workflows loses track of state. By the 15th step of a 20-step workflow, the early context has faded and the agent makes decisions without full awareness of what happened at the beginning.

Specialized agents solve all three problems. An agent optimized for code generation has 5-10 relevant tools and a focused system prompt. It makes better decisions because it's reasoning about a smaller, more relevant context. A separate review agent has no emotional attachment to the code it's reviewing because it didn't write it. A dedicated deployment agent doesn't carry the context of the code generation phase.

The right analogy is a software development team, not a superhero developer. Teams produce better output than individuals on complex problems because specialization and independent verification catch what generalism misses.

Architecture: What a Production Orchestrator Does

The orchestrator is the most critical component in a multi-agent system. It's also the component most tutorials gloss over.

A production orchestrator handles four responsibilities:

Task Decomposition

Take complex requests and break them into executable subtasks. This needs to be systematic, not improvised.

typescript

// src/orchestrator/task-decomposer.ts
import Anthropic from '@anthropic-ai/sdk'

const anthropic = new Anthropic()

interface Task {
  id: string
  type: 'code_generation' | 'code_review' | 'testing' | 'documentation' | 'deployment'
  description: string
  dependencies: string[] // IDs of tasks that must complete first
  assignedAgent: string
  status: 'pending' | 'in_progress' | 'completed' | 'failed'
  result?: unknown
  error?: string
}

export async function decomposeRequest(request: string, context: {
  projectType: string
  existingFeatures: string[]
  techStack: string[]
}): Promise<Task[]> {
  const response = await anthropic.messages.create({
    model: 'claude-sonnet-4-5',
    system: `You are a technical project manager who breaks down feature requests into executable subtasks.
    
Output a JSON array of tasks. Each task has:
- id: unique string
- type: one of [code_generation, code_review, testing, documentation, deployment]
- description: specific, executable description (not vague)
- dependencies: array of task ids that must complete before this task
- assignedAgent: which specialist agent handles this

Rules:
- Be specific. "Generate the user profile schema" not "work on database"
- Make dependencies explicit. A review task always depends on its generation task
- Testing always depends on code generation
- Deployment always depends on testing`,
    messages: [
      {
        role: 'user',
        content: `Decompose this feature request into tasks:
${request}

Project context:
- Type: ${context.projectType}
- Tech stack: ${context.techStack.join(', ')}
- Existing features: ${context.existingFeatures.join(', ')}`,
      },
    ],
  })

  const content = response.content[0]
  if (content.type !== 'text') throw new Error('Expected text response')

  return JSON.parse(content.text) as Task[]
}

Agent Routing

Match each subtask to the appropriate specialist agent:

typescript

// src/orchestrator/agent-router.ts
const AGENT_REGISTRY = {
  code_generation: {
    model: 'claude-opus-4-5',
    tools: ['read_file', 'write_file', 'search_codebase', 'run_linter'],
    systemPrompt: 'You are a senior software engineer. Write clean, tested, production-ready code.',
  },
  code_review: {
    model: 'claude-opus-4-5',
    tools: ['read_file', 'search_codebase', 'check_types'],
    systemPrompt: 'You are a code reviewer. Be critical. Find bugs, security issues, and design problems. Do not be lenient because the code compiles.',
  },
  testing: {
    model: 'claude-sonnet-4-5',
    tools: ['read_file', 'write_file', 'run_tests', 'check_coverage'],
    systemPrompt: 'You are a QA engineer. Write comprehensive tests. Aim for 95% coverage. Test happy paths and failure cases equally.',
  },
  documentation: {
    model: 'claude-haiku-3-5',  // Cheaper for doc generation
    tools: ['read_file', 'write_file'],
    systemPrompt: 'You write clear, concise technical documentation for developers.',
  },
  deployment: {
    model: 'claude-sonnet-4-5',
    tools: ['run_build', 'run_tests', 'deploy', 'check_health', 'rollback'],
    systemPrompt: 'You are a DevOps engineer. Deploy carefully. Verify before and after. Roll back at the first sign of problems.',
  },
}

export function routeTask(task: Task) {
  const agent = AGENT_REGISTRY[task.type]
  if (!agent) throw new Error(`No agent registered for task type: ${task.type}`)
  return agent
}

State Management

Track progress across all subtasks. This is where most tutorial implementations fail. They don't handle concurrent execution, partial failures, or state reconstruction after interruption.

typescript

// src/orchestrator/state-manager.ts
import { Redis } from '@upstash/redis'

const redis = Redis.fromEnv()

interface WorkflowState {
  workflowId: string
  request: string
  tasks: Task[]
  startedAt: Date
  completedAt?: Date
  status: 'running' | 'completed' | 'failed' | 'partial'
}

export class WorkflowStateManager {
  async createWorkflow(workflowId: string, request: string, tasks: Task[]): Promise<void> {
    const state: WorkflowState = {
      workflowId,
      request,
      tasks,
      startedAt: new Date(),
      status: 'running',
    }
    // Persist with 24-hour TTL
    await redis.set(`workflow:${workflowId}`, JSON.stringify(state), { ex: 86400 })
  }

  async updateTaskStatus(
    workflowId: string,
    taskId: string,
    status: Task['status'],
    result?: unknown,
    error?: string
  ): Promise<void> {
    const state = await this.getWorkflow(workflowId)
    if (!state) throw new Error(`Workflow ${workflowId} not found`)

    const task = state.tasks.find(t => t.id === taskId)
    if (!task) throw new Error(`Task ${taskId} not found in workflow ${workflowId}`)

    task.status = status
    if (result !== undefined) task.result = result
    if (error !== undefined) task.error = error

    await redis.set(`workflow:${workflowId}`, JSON.stringify(state), { ex: 86400 })
  }

  async getReadyTasks(workflowId: string): Promise<Task[]> {
    const state = await this.getWorkflow(workflowId)
    if (!state) return []

    const completedTaskIds = new Set(
      state.tasks.filter(t => t.status === 'completed').map(t => t.id)
    )

    return state.tasks.filter(task =>
      task.status === 'pending' &&
      task.dependencies.every(depId => completedTaskIds.has(depId))
    )
  }

  private async getWorkflow(workflowId: string): Promise<WorkflowState | null> {
    const data = await redis.get<string>(`workflow:${workflowId}`)
    return data ? JSON.parse(data) : null
  }
}

Error Handling at Every Level

In a multi-agent system, errors propagate. Agent A fails. Its output is missing. Agent B receives incomplete input. Agent B's output is wrong. Agent C builds on wrong output. By the time you detect the problem, multiple agents have wasted work and the user has been waiting for 20 minutes.

Three-level error handling prevents cascade failures.

Agent Level

Each agent validates inputs before starting:

typescript

// Every agent validates its input before execution
async function validateAgentInput(task: Task, context: Record<string, unknown>): Promise<void> {
  const required = getRequiredContextFields(task.type)

  for (const field of required) {
    if (context[field] === undefined || context[field] === null) {
      throw new Error(
        `Task ${task.id} (${task.type}) missing required context: ${field}. ` +
        `This dependency task may have failed.`
      )
    }
  }

  // Validate output quality from dependency tasks
  if (task.type === 'code_review' && context.generatedCode) {
    const code = context.generatedCode as string
    if (code.length < 10) {
      throw new Error(`Code generation produced suspiciously short output: ${code}`)
    }
  }
}

Orchestrator Level

Check each agent's output before passing it downstream:

typescript

// Quality gate before passing output to downstream agents
async function validateAgentOutput(task: Task, output: unknown): Promise<boolean> {
  switch (task.type) {
    case 'code_generation':
      // Code must be non-empty and not contain obvious errors
      const code = output as string
      if (!code || code.length < 20) return false
      if (code.includes('TODO: implement') && !code.includes('// TODO')) return false
      return true

    case 'testing':
      // Tests must exist and pass
      const testResult = output as { passed: number; failed: number; coverage: number }
      if (testResult.failed > 0) return false
      if (testResult.coverage < 80) return false
      return true

    case 'code_review':
      // Review must not contain CRITICAL issues
      const review = output as { severity: 'CRITICAL' | 'HIGH' | 'MEDIUM' | 'LOW'; issues: string[] }
      return review.severity !== 'CRITICAL'

    default:
      return true
  }
}

System Level

Circuit breakers remove broken agents from rotation:

typescript

// Track agent health and remove unreliable agents
class AgentHealthMonitor {
  private failureCount = new Map<string, number>()
  private disabledAgents = new Set<string>()
  private readonly FAILURE_THRESHOLD = 3

  recordFailure(agentType: string): void {
    const count = (this.failureCount.get(agentType) ?? 0) + 1
    this.failureCount.set(agentType, count)

    if (count >= this.FAILURE_THRESHOLD) {
      this.disabledAgents.add(agentType)
      console.error(`Agent ${agentType} disabled after ${count} consecutive failures`)
      this.scheduleReenablement(agentType, 5 * 60 * 1000) // Try again in 5 minutes
    }
  }

  recordSuccess(agentType: string): void {
    this.failureCount.set(agentType, 0)
    this.disabledAgents.delete(agentType)
  }

  isAvailable(agentType: string): boolean {
    return !this.disabledAgents.has(agentType)
  }

  private scheduleReenablement(agentType: string, delayMs: number): void {
    setTimeout(() => {
      this.disabledAgents.delete(agentType)
      this.failureCount.set(agentType, 0)
      console.log(`Agent ${agentType} re-enabled for testing`)
    }, delayMs)
  }
}

Resource Management: The Thing That Kills You at Scale

Five agents running concurrently, each making LLM API calls, executing tools, maintaining context. At 10 concurrent workflows, that's up to 50 simultaneous API calls. Without resource management, you hit rate limits, memory limits, and API quotas in ways that are hard to diagnose.

typescript

// src/orchestrator/resource-manager.ts
import pLimit from 'p-limit'

export class ResourceManager {
  // Limit concurrent LLM API calls
  private apiCallLimiter = pLimit(10)

  // Limit concurrent tool executions (they use more memory)
  private toolExecutionLimiter = pLimit(5)

  // Token budget per workflow (prevents runaway costs)
  private tokenBudgets = new Map<string, number>()
  private tokenUsage = new Map<string, number>()

  async executeWithRateLimit<T>(
    fn: () => Promise<T>,
    type: 'api_call' | 'tool_execution'
  ): Promise<T> {
    const limiter = type === 'api_call'
      ? this.apiCallLimiter
      : this.toolExecutionLimiter

    return limiter(() => fn())
  }

  setTokenBudget(workflowId: string, budget: number): void {
    this.tokenBudgets.set(workflowId, budget)
    this.tokenUsage.set(workflowId, 0)
  }

  recordTokenUsage(workflowId: string, tokens: number): void {
    const current = this.tokenUsage.get(workflowId) ?? 0
    this.tokenUsage.set(workflowId, current + tokens)
  }

  isOverBudget(workflowId: string): boolean {
    const budget = this.tokenBudgets.get(workflowId) ?? Infinity
    const used = this.tokenUsage.get(workflowId) ?? 0
    return used > budget
  }
}

Observability: You Can't Operate What You Can't See

Multi-agent systems fail in non-obvious ways. An agent produces subtly incorrect output that passes all quality gates. The mistake propagates through downstream agents. By the time the final output reaches the user, the error source is completely obscured.

Log everything:

typescript

// Every agent invocation produces a structured trace
interface AgentTrace {
  workflowId: string
  taskId: string
  agentType: string
  startTime: Date
  endTime: Date
  durationMs: number
  inputSummary: string
  outputSummary: string
  tokenUsage: { input: number; output: number }
  success: boolean
  error?: string
}

async function traceAgentExecution(
  workflowId: string,
  task: Task,
  fn: () => Promise<unknown>
): Promise<unknown> {
  const startTime = new Date()
  let result: unknown
  let error: Error | undefined

  try {
    result = await fn()
    return result
  } catch (e) {
    error = e instanceof Error ? e : new Error(String(e))
    throw error
  } finally {
    const trace: AgentTrace = {
      workflowId,
      taskId: task.id,
      agentType: task.type,
      startTime,
      endTime: new Date(),
      durationMs: Date.now() - startTime.getTime(),
      inputSummary: summarizeForLog(task.description),
      outputSummary: result ? summarizeForLog(result) : 'no output',
      tokenUsage: { input: 0, output: 0 }, // populated from LLM response
      success: !error,
      error: error?.message,
    }

    await persistTrace(trace)
  }
}

With complete traces, debugging a production issue means following the trace: which agent ran first, what it produced, what the next agent received, where the output diverged from expectations. It's like having a complete runtime replay.

Start Small, Scale Deliberately

Don't launch a full multi-agent system on day one. Start with two agents and an orchestrator: one specialist and one reviewer. Get coordination right. Get error handling right. Get observability right.

Add agents one at a time. Each new agent adds orchestration complexity. Validate that the capability justifies the complexity.

The best multi-agent systems I've seen in production run a small, focused set of agents. Enough specialization to produce high-quality output. Few enough to be debuggable when something goes wrong.

For the communication protocols between these agents, agent-to-agent communication patterns covers the message passing, shared memory, and negotiation patterns that keep coordinated agents productive rather than chaotic.

FAQ

Q: What is multi-agent orchestration?

Multi-agent orchestration is the coordination of multiple specialized AI agents working together on complex tasks. Instead of one general-purpose agent, specialized agents handle specific domains (coding, testing, deployment, review) and communicate through defined protocols. An orchestrator manages task distribution, dependency resolution, and conflict handling.

Q: How do you coordinate multiple AI agents in production?

Production multi-agent systems use a central orchestrator that decomposes tasks and assigns them to specialized agents, standardized communication protocols (like MCP) for agent-to-agent messaging, and shared context stores that maintain project state across agent interactions.

Q: What are the common patterns for multi-agent systems?

The five core patterns are prompt chaining (sequential agents with validation gates), routing (directing tasks to the right specialist), parallelization (multiple agents simultaneously), orchestrator-workers (coordinator with specialists), and evaluator-optimizer (self-improving loops with feedback).

Q: When should you use multi-agent systems vs a single agent?

Use multi-agent systems when tasks require different expertise domains, benefit from parallel execution, or need specialized tool access. A single agent suffices for focused tasks with clear scope. Multi-agent coordination overhead is only justified when task complexity exceeds what one agent can handle.

Sources

The One-Agent Trap

This fails. Not always immediately. Usually at scale.

The right analogy is a software development team, not a superhero developer. Teams produce better output than individuals on complex problems because specialization and independent verification catch what generalism misses.

Architecture: What a Production Orchestrator Does

The orchestrator is the most critical component in a multi-agent system. It's also the component most tutorials gloss over.

A production orchestrator handles four responsibilities:

Task Decomposition

Take complex requests and break them into executable subtasks. This needs to be systematic, not improvised.

typescript

// src/orchestrator/task-decomposer.ts
import Anthropic from '@anthropic-ai/sdk'

const anthropic = new Anthropic()

interface Task {
  id: string
  type: 'code_generation' | 'code_review' | 'testing' | 'documentation' | 'deployment'
  description: string
  dependencies: string[] // IDs of tasks that must complete first
  assignedAgent: string
  status: 'pending' | 'in_progress' | 'completed' | 'failed'
  result?: unknown
  error?: string
}

export async function decomposeRequest(request: string, context: {
  projectType: string
  existingFeatures: string[]
  techStack: string[]
}): Promise<Task[]> {
  const response = await anthropic.messages.create({
    model: 'claude-sonnet-4-5',
    system: `You are a technical project manager who breaks down feature requests into executable subtasks.
    
Output a JSON array of tasks. Each task has:
- id: unique string
- type: one of [code_generation, code_review, testing, documentation, deployment]
- description: specific, executable description (not vague)
- dependencies: array of task ids that must complete before this task
- assignedAgent: which specialist agent handles this

Rules:
- Be specific. "Generate the user profile schema" not "work on database"
- Make dependencies explicit. A review task always depends on its generation task
- Testing always depends on code generation
- Deployment always depends on testing`,
    messages: [
      {
        role: 'user',
        content: `Decompose this feature request into tasks:
${request}

Project context:
- Type: ${context.projectType}
- Tech stack: ${context.techStack.join(', ')}
- Existing features: ${context.existingFeatures.join(', ')}`,
      },
    ],
  })

  const content = response.content[0]
  if (content.type !== 'text') throw new Error('Expected text response')

  return JSON.parse(content.text) as Task[]
}

Agent Routing

Match each subtask to the appropriate specialist agent:

typescript

// src/orchestrator/agent-router.ts
const AGENT_REGISTRY = {
  code_generation: {
    model: 'claude-opus-4-5',
    tools: ['read_file', 'write_file', 'search_codebase', 'run_linter'],
    systemPrompt: 'You are a senior software engineer. Write clean, tested, production-ready code.',
  },
  code_review: {
    model: 'claude-opus-4-5',
    tools: ['read_file', 'search_codebase', 'check_types'],
    systemPrompt: 'You are a code reviewer. Be critical. Find bugs, security issues, and design problems. Do not be lenient because the code compiles.',
  },
  testing: {
    model: 'claude-sonnet-4-5',
    tools: ['read_file', 'write_file', 'run_tests', 'check_coverage'],
    systemPrompt: 'You are a QA engineer. Write comprehensive tests. Aim for 95% coverage. Test happy paths and failure cases equally.',
  },
  documentation: {
    model: 'claude-haiku-3-5',  // Cheaper for doc generation
    tools: ['read_file', 'write_file'],
    systemPrompt: 'You write clear, concise technical documentation for developers.',
  },
  deployment: {
    model: 'claude-sonnet-4-5',
    tools: ['run_build', 'run_tests', 'deploy', 'check_health', 'rollback'],
    systemPrompt: 'You are a DevOps engineer. Deploy carefully. Verify before and after. Roll back at the first sign of problems.',
  },
}

export function routeTask(task: Task) {
  const agent = AGENT_REGISTRY[task.type]
  if (!agent) throw new Error(`No agent registered for task type: ${task.type}`)
  return agent
}

State Management

Track progress across all subtasks. This is where most tutorial implementations fail. They don't handle concurrent execution, partial failures, or state reconstruction after interruption.

typescript

// src/orchestrator/state-manager.ts
import { Redis } from '@upstash/redis'

const redis = Redis.fromEnv()

interface WorkflowState {
  workflowId: string
  request: string
  tasks: Task[]
  startedAt: Date
  completedAt?: Date
  status: 'running' | 'completed' | 'failed' | 'partial'
}

export class WorkflowStateManager {
  async createWorkflow(workflowId: string, request: string, tasks: Task[]): Promise<void> {
    const state: WorkflowState = {
      workflowId,
      request,
      tasks,
      startedAt: new Date(),
      status: 'running',
    }
    // Persist with 24-hour TTL
    await redis.set(`workflow:${workflowId}`, JSON.stringify(state), { ex: 86400 })
  }

  async updateTaskStatus(
    workflowId: string,
    taskId: string,
    status: Task['status'],
    result?: unknown,
    error?: string
  ): Promise<void> {
    const state = await this.getWorkflow(workflowId)
    if (!state) throw new Error(`Workflow ${workflowId} not found`)

    const task = state.tasks.find(t => t.id === taskId)
    if (!task) throw new Error(`Task ${taskId} not found in workflow ${workflowId}`)

    task.status = status
    if (result !== undefined) task.result = result
    if (error !== undefined) task.error = error

    await redis.set(`workflow:${workflowId}`, JSON.stringify(state), { ex: 86400 })
  }

  async getReadyTasks(workflowId: string): Promise<Task[]> {
    const state = await this.getWorkflow(workflowId)
    if (!state) return []

    const completedTaskIds = new Set(
      state.tasks.filter(t => t.status === 'completed').map(t => t.id)
    )

    return state.tasks.filter(task =>
      task.status === 'pending' &&
      task.dependencies.every(depId => completedTaskIds.has(depId))
    )
  }

  private async getWorkflow(workflowId: string): Promise<WorkflowState | null> {
    const data = await redis.get<string>(`workflow:${workflowId}`)
    return data ? JSON.parse(data) : null
  }
}

Error Handling at Every Level

Three-level error handling prevents cascade failures.

Agent Level

Each agent validates inputs before starting:

typescript

// Every agent validates its input before execution
async function validateAgentInput(task: Task, context: Record<string, unknown>): Promise<void> {
  const required = getRequiredContextFields(task.type)

  for (const field of required) {
    if (context[field] === undefined || context[field] === null) {
      throw new Error(
        `Task ${task.id} (${task.type}) missing required context: ${field}. ` +
        `This dependency task may have failed.`
      )
    }
  }

  // Validate output quality from dependency tasks
  if (task.type === 'code_review' && context.generatedCode) {
    const code = context.generatedCode as string
    if (code.length < 10) {
      throw new Error(`Code generation produced suspiciously short output: ${code}`)
    }
  }
}

Orchestrator Level

Check each agent's output before passing it downstream:

typescript

// Quality gate before passing output to downstream agents
async function validateAgentOutput(task: Task, output: unknown): Promise<boolean> {
  switch (task.type) {
    case 'code_generation':
      // Code must be non-empty and not contain obvious errors
      const code = output as string
      if (!code || code.length < 20) return false
      if (code.includes('TODO: implement') && !code.includes('// TODO')) return false
      return true

    case 'testing':
      // Tests must exist and pass
      const testResult = output as { passed: number; failed: number; coverage: number }
      if (testResult.failed > 0) return false
      if (testResult.coverage < 80) return false
      return true

    case 'code_review':
      // Review must not contain CRITICAL issues
      const review = output as { severity: 'CRITICAL' | 'HIGH' | 'MEDIUM' | 'LOW'; issues: string[] }
      return review.severity !== 'CRITICAL'

    default:
      return true
  }
}

System Level

Circuit breakers remove broken agents from rotation:

typescript

// Track agent health and remove unreliable agents
class AgentHealthMonitor {
  private failureCount = new Map<string, number>()
  private disabledAgents = new Set<string>()
  private readonly FAILURE_THRESHOLD = 3

  recordFailure(agentType: string): void {
    const count = (this.failureCount.get(agentType) ?? 0) + 1
    this.failureCount.set(agentType, count)

    if (count >= this.FAILURE_THRESHOLD) {
      this.disabledAgents.add(agentType)
      console.error(`Agent ${agentType} disabled after ${count} consecutive failures`)
      this.scheduleReenablement(agentType, 5 * 60 * 1000) // Try again in 5 minutes
    }
  }

  recordSuccess(agentType: string): void {
    this.failureCount.set(agentType, 0)
    this.disabledAgents.delete(agentType)
  }

  isAvailable(agentType: string): boolean {
    return !this.disabledAgents.has(agentType)
  }

  private scheduleReenablement(agentType: string, delayMs: number): void {
    setTimeout(() => {
      this.disabledAgents.delete(agentType)
      this.failureCount.set(agentType, 0)
      console.log(`Agent ${agentType} re-enabled for testing`)
    }, delayMs)
  }
}

Resource Management: The Thing That Kills You at Scale

typescript

// src/orchestrator/resource-manager.ts
import pLimit from 'p-limit'

export class ResourceManager {
  // Limit concurrent LLM API calls
  private apiCallLimiter = pLimit(10)

  // Limit concurrent tool executions (they use more memory)
  private toolExecutionLimiter = pLimit(5)

  // Token budget per workflow (prevents runaway costs)
  private tokenBudgets = new Map<string, number>()
  private tokenUsage = new Map<string, number>()

  async executeWithRateLimit<T>(
    fn: () => Promise<T>,
    type: 'api_call' | 'tool_execution'
  ): Promise<T> {
    const limiter = type === 'api_call'
      ? this.apiCallLimiter
      : this.toolExecutionLimiter

    return limiter(() => fn())
  }

  setTokenBudget(workflowId: string, budget: number): void {
    this.tokenBudgets.set(workflowId, budget)
    this.tokenUsage.set(workflowId, 0)
  }

  recordTokenUsage(workflowId: string, tokens: number): void {
    const current = this.tokenUsage.get(workflowId) ?? 0
    this.tokenUsage.set(workflowId, current + tokens)
  }

  isOverBudget(workflowId: string): boolean {
    const budget = this.tokenBudgets.get(workflowId) ?? Infinity
    const used = this.tokenUsage.get(workflowId) ?? 0
    return used > budget
  }
}

Observability: You Can't Operate What You Can't See

Log everything:

typescript

// Every agent invocation produces a structured trace
interface AgentTrace {
  workflowId: string
  taskId: string
  agentType: string
  startTime: Date
  endTime: Date
  durationMs: number
  inputSummary: string
  outputSummary: string
  tokenUsage: { input: number; output: number }
  success: boolean
  error?: string
}

async function traceAgentExecution(
  workflowId: string,
  task: Task,
  fn: () => Promise<unknown>
): Promise<unknown> {
  const startTime = new Date()
  let result: unknown
  let error: Error | undefined

  try {
    result = await fn()
    return result
  } catch (e) {
    error = e instanceof Error ? e : new Error(String(e))
    throw error
  } finally {
    const trace: AgentTrace = {
      workflowId,
      taskId: task.id,
      agentType: task.type,
      startTime,
      endTime: new Date(),
      durationMs: Date.now() - startTime.getTime(),
      inputSummary: summarizeForLog(task.description),
      outputSummary: result ? summarizeForLog(result) : 'no output',
      tokenUsage: { input: 0, output: 0 }, // populated from LLM response
      success: !error,
      error: error?.message,
    }

    await persistTrace(trace)
  }
}

Start Small, Scale Deliberately

Add agents one at a time. Each new agent adds orchestration complexity. Validate that the capability justifies the complexity.

The best multi-agent systems I've seen in production run a small, focused set of agents. Enough specialization to produce high-quality output. Few enough to be debuggable when something goes wrong.

FAQ

Q: What is multi-agent orchestration?

Q: How do you coordinate multiple AI agents in production?

Q: What are the common patterns for multi-agent systems?

Q: When should you use multi-agent systems vs a single agent?

Multi-Agent Orchestration: The Real Production Guide

The One-Agent Trap

Architecture: What a Production Orchestrator Does

Task Decomposition

Agent Routing

State Management

Error Handling at Every Level

Agent Level

Orchestrator Level

System Level

Resource Management: The Thing That Kills You at Scale

Observability: You Can't Operate What You Can't See

Start Small, Scale Deliberately

FAQ

Sources

Further Reading

Related Articles

Want to Implement This?

Multi-Agent Orchestration: The Real Production Guide

The One-Agent Trap

Architecture: What a Production Orchestrator Does

Task Decomposition

Agent Routing

State Management

Error Handling at Every Level

Agent Level

Orchestrator Level

System Level

Resource Management: The Thing That Kills You at Scale

Observability: You Can't Operate What You Can't See

Start Small, Scale Deliberately

FAQ

Sources

Further Reading

Related Articles

Want to Implement This?