Written by Gareth Simono, Founder and CEO of Agentik {OS}. Full-stack developer and AI architect with years of experience shipping production applications across SaaS, mobile, and enterprise platforms. Gareth orchestrates 267 specialized AI agents to deliver production software 10x faster than traditional development teams.
Founder & CEO, Agentik {OS}
An agent without tools is a chatbot with delusions. The tool matters less than how you describe it. Here are the patterns that work.

An AI agent without tools is a chatbot with delusions of competence.
Tools are what separate an agent that talks about doing things from an agent that actually does them. Without tools, the agent produces text. With tools, it queries databases, sends emails, modifies files, calls APIs, and updates records. The difference is the difference between a consultant who writes recommendations and one who implements them.
But here is what most developers get wrong: they focus on building tools and neglect designing them. The tool itself, the actual function that runs, matters far less than how you describe it to the agent. I have seen production systems where the underlying API was excellent and the agent was useless because the tool descriptions were vague. And I have seen simple tools that outperformed complex ones purely because their descriptions gave the agent everything it needed to make good decisions.
Tool design is a communication problem as much as an engineering problem. You are writing documentation for an AI reader, and AI readers have different needs than human readers. Understanding those needs is what separates functional agents from broken ones.
Your agent selects tools based on descriptions you wrote. There is no deeper understanding happening. No semantic parsing. No inferred intent. Just text matching between the task at hand and the tool descriptions available.
Vague description means wrong tool selection. This is not a model limitation. It is a communication failure on your part.
A bad tool description:

```js
{
  name: "search",
  description: "Search for information",
  parameters: {
    query: { type: "string", description: "The search query" }
  }
}
```

A good tool description:
```js
{
  name: "search_knowledge_base",
  description: `Search the internal company knowledge base for documentation, policies, procedures, and product information.

Use this tool when:
- The user asks about company policies or procedures
- The user needs product documentation
- You need to verify a claim about internal processes

Do NOT use this tool for:
- Real-time information (stock prices, current news)
- External company information
- General knowledge questions the user asks conversationally

Returns: Ranked list of relevant document excerpts with source titles and last-updated dates.`,
  parameters: {
    query: {
      type: "string",
      description: "Natural language search query. Be specific. Include relevant context terms."
    },
    doc_type: {
      type: "string",
      enum: ["policy", "procedure", "product", "faq", "all"],
      description: "Filter by document type. Use 'all' when unsure."
    },
    max_results: {
      type: "number",
      description: "Maximum results to return. Default 5. Use 10 for broad research, 3 for quick lookups."
    }
  }
}
```

The second version tells the agent when to use it, when not to use it, and what to expect back. That is the information the agent needs to make a good selection decision.
I learned this specific lesson the expensive way. I had two retrieval tools in the same agent system: one querying live data, one querying a 24-hour cached version. Nearly identical descriptions. The agent picked randomly between them. Half of user queries got stale data. Nobody could figure out why accuracy varied so wildly. Adding one sentence to each description clarifying the freshness tradeoff fixed it overnight.
Write tool descriptions like you are explaining to a brilliant new hire on their first day. What does this tool do? When should they use it? What inputs does it take? What output will they get? Skip any of those and something will break.
This is counterintuitive if you have spent years in software engineering. We are trained to build small, composable functions. Single responsibility. Unix philosophy. The belief that systems built from small primitives are more maintainable and flexible.
For AI agents, this is wrong.
Five granular tools means five sequential decisions to accomplish one task. Read file. Parse content. Filter by date. Format output. Send response. Five opportunities for the agent to make a wrong tool selection, pass a malformed parameter, or fail to chain the outputs correctly.
One composite tool called get_formatted_report that handles the entire pipeline internally? One decision. One opportunity to fail. Success rates are dramatically higher.
I have measured this across multiple deployments. Task success rates comparing granular vs. composite tool designs:
| Task Complexity | Granular Tools | Composite Tools |
|---|---|---|
| 3-step task | 78% success | 94% success |
| 5-step task | 61% success | 89% success |
| 8-step task | 44% success | 82% success |
The gap widens with task complexity because errors compound. A 90% success rate at each step produces a 59% end-to-end success rate over five steps. A 97% success rate at each step produces an 86% end-to-end success rate.
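The compounding arithmetic is easy to verify directly. A minimal sketch (the function name `endToEndSuccess` is mine, not from any framework):

```typescript
// End-to-end success rate for a chain of steps with equal per-step reliability.
function endToEndSuccess(perStepRate: number, steps: number): number {
  return Math.pow(perStepRate, steps);
}

console.log(endToEndSuccess(0.9, 5));  // ≈ 0.59: five 90%-reliable steps
console.log(endToEndSuccess(0.97, 5)); // ≈ 0.86: five 97%-reliable steps
```

A small improvement per step compounds into a large end-to-end difference, which is why collapsing five decisions into one composite tool pays off so heavily.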
The right granularity for agent tools is the task level, not the function level. A task is one meaningful unit of work from the user's perspective. Users think "get me last quarter's revenue by region." Not "connect to database, execute query, parse results, format as table." Build tools around the user's mental model.
```typescript
// Too granular: forces agent to orchestrate database operations
const tools = [
  { name: "connect_database", ... },
  { name: "execute_query", ... },
  { name: "format_results", ... },
];
```

```typescript
// Right level: one tool for one user-facing task
const tools = [
  {
    name: "get_revenue_report",
    description: `Retrieve revenue data aggregated by region for a specified time period.

Returns a formatted table with revenue totals, growth rates vs. prior period, and top 3 products per region.

For complex analysis or custom segmentation, use query_analytics_database instead.`,
    parameters: {
      period: {
        type: "string",
        enum: ["last_quarter", "last_month", "last_year", "ytd"],
        description: "Time period to report on"
      },
      regions: {
        type: "array",
        items: { type: "string" },
        description: "List of region codes. Leave empty for all regions."
      }
    }
  }
];
```

When a tool fails, the agent decides what to do next. Retry? Use a different tool? Ask the user? Give up? That decision is entirely driven by the error message your tool returns.
Most developers treat tool error handling as an afterthought. Return a status code. Generic message. Move on. For human-facing error messages, this is bad UX. For agent-facing error messages, it causes task failures.
Bad error:

```typescript
return { error: "Query failed", code: 500 };
```

Good error:

```typescript
return {
  error: "Query failed: table 'user_metrics_q4' not found",
  suggestion: "Available tables for user data: users, user_events, user_sessions. Did you mean 'user_events'?",
  retryable: false,
  alternativeTool: "list_available_tables"
};
```

The good error gives the agent context to self-correct without user intervention. It tells the agent what went wrong, suggests what to do instead, and even points to a tool that can help diagnose the problem. The agent reads this error and corrects its approach. With the bad error, the agent is stuck or gives up.
Build your error taxonomy before you start writing tool handlers:
```typescript
type ToolErrorType =
  | "invalid_input"        // Agent passed bad parameters - agent should correct and retry
  | "resource_not_found"   // Requested resource does not exist - agent should inform user
  | "permission_denied"    // Agent lacks access - agent should escalate
  | "rate_limited"         // Too many requests - agent should wait and retry
  | "service_unavailable"  // External service down - agent should use fallback
  | "data_conflict";       // Action would create inconsistency - agent should ask user

interface ToolError {
  type: ToolErrorType;
  message: string;
  suggestion?: string;
  retryable: boolean;
  waitMs?: number;         // For rate_limited errors
  alternativeTool?: string;
}
```

When you define error types explicitly, you can also add instructions in the system prompt for how the agent should handle each type. The agent has a decision framework rather than improvising.
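One way to wire the taxonomy into the system prompt is a simple lookup from error type to handling rule. A hypothetical sketch; the guidance strings and function names are illustrative, not from any framework:

```typescript
// Hypothetical mapping from error type to agent guidance, rendered into the system prompt.
const errorHandlingGuidance: Record<string, string> = {
  invalid_input: "Correct the parameters based on the error message and retry once.",
  resource_not_found: "Do not retry. Tell the user the resource does not exist.",
  permission_denied: "Do not retry. Escalate to a human operator.",
  rate_limited: "Wait for the suggested interval, then retry once.",
  service_unavailable: "Switch to the fallback tool if one is listed, else inform the user.",
  data_conflict: "Pause and ask the user how to resolve the conflict.",
};

// Render the guidance as a bulleted system-prompt section.
function renderErrorGuidance(): string {
  return Object.entries(errorHandlingGuidance)
    .map(([type, rule]) => `- ${type}: ${rule}`)
    .join("\n");
}
```

The rendered section goes into the system prompt once, so every tool error the agent sees maps to a pre-decided response.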
Agents are optimistic by default. They will execute a delete operation on records matching your filters without asking for confirmation, even if the filters are broader than intended.
This is a design problem, not a model problem. The agent has no way to know that the delete operation will affect 50,000 records instead of the expected 5. It did what it was asked to do.
Three patterns prevent this:
Preview mode. Every destructive or modifying tool has a dry-run parameter. The agent defaults to calling it with dryRun: true, receives a preview of what would happen, presents it to the user, and only executes the real operation after explicit approval.
```typescript
async function deleteCustomerRecords(params: {
  filters: CustomerFilter;
  dryRun: boolean;
}) {
  const matchingRecords = await db.customers.findMany(params.filters);

  if (params.dryRun) {
    return {
      preview: true,
      affectedCount: matchingRecords.length,
      sampleRecords: matchingRecords.slice(0, 3),
      warning: matchingRecords.length > 10
        ? `This will permanently delete ${matchingRecords.length} customer records. This cannot be undone.`
        : null,
      nextStep: "Call this tool again with dryRun: false to execute deletion"
    };
  }

  // Actual deletion
  const result = await db.customers.deleteMany(params.filters);
  return { deleted: result.count, timestamp: new Date().toISOString() };
}
```

Threshold escalation. If an operation will affect more than N records, escalate to a human regardless of whether preview mode was used.
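Threshold escalation can be a small pre-flight check in front of the destructive call. A sketch, with `ESCALATION_THRESHOLD` and the result shape as illustrative choices:

```typescript
// Illustrative threshold check: block large operations until a human approves.
const ESCALATION_THRESHOLD = 100;

interface EscalationResult {
  escalated: boolean;
  reason?: string;
}

function checkEscalation(affectedCount: number): EscalationResult {
  if (affectedCount > ESCALATION_THRESHOLD) {
    return {
      escalated: true,
      reason: `Operation affects ${affectedCount} records, above the ${ESCALATION_THRESHOLD}-record threshold. Human approval required.`,
    };
  }
  return { escalated: false };
}
```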
Soft deletes with TTL. Where possible, implement delete operations as soft deletes with a 48-72 hour recovery window. The agent appears to delete immediately, but a human can recover the data if something went wrong.
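A soft delete with TTL can be as simple as stamping records with a purge deadline instead of removing them. A minimal sketch, assuming a 72-hour window and hypothetical field names:

```typescript
// Illustrative soft delete: mark records deleted with a recovery deadline.
const RECOVERY_WINDOW_MS = 72 * 3600 * 1000; // 72-hour recovery window

interface SoftDeletable {
  deletedAt?: number;
  purgeAfter?: number;
}

function softDelete(record: SoftDeletable): SoftDeletable {
  const now = Date.now();
  record.deletedAt = now;
  record.purgeAfter = now + RECOVERY_WINDOW_MS;
  return record;
}

// A record is recoverable until its purge deadline passes.
function isRecoverable(record: SoftDeletable): boolean {
  return record.purgeAfter !== undefined && Date.now() < record.purgeAfter;
}
```

A background job that permanently purges records past `purgeAfter` completes the pattern; the agent-facing tool never needs to know the delete was soft.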
These patterns connect to the broader topic of autonomous decision-making guardrails. The goal is to preserve agent autonomy for routine operations while creating automatic checkpoints for operations where mistakes are costly.
Agents are optimistic and persistent. If an approach is not working, they will try it again. And again. And again. Without hard limits built into tools, this produces API bills that will make your finance team call you.
I have seen a loop where an agent was trying to process a malformed file, kept getting errors, kept retrying with slight variations, and ran up $200 in API costs overnight before anyone noticed. The agent was not broken. It was doing exactly what an agent is supposed to do: try, fail, adjust, retry. The mistake was not building limits into the tool.
Built-in limits are not suggestions. They are enforced:
```typescript
class RateLimitedTool {
  private callCounts = new Map<string, { count: number; resetAt: Date }>();

  constructor(
    private sessionLimit: number,
    private hourlyLimit: number,
    private costLimitUsd: number
  ) {}

  async execute(sessionId: string, params: any): Promise<any> {
    this.enforceSessionLimit(sessionId);
    this.enforceHourlyLimit(sessionId);
    const result = await this.doWork(params);
    this.trackCost(sessionId, result.estimatedCostUsd || 0);
    return result;
  }

  private enforceSessionLimit(sessionId: string) {
    const session = this.callCounts.get(sessionId);
    if (session && session.count >= this.sessionLimit) {
      throw {
        type: "rate_limited",
        message: `Session limit of ${this.sessionLimit} calls reached for this tool`,
        suggestion: "Consider summarizing intermediate results before continuing",
        retryable: false
      };
    }
    const current = session || { count: 0, resetAt: new Date(Date.now() + 3600000) };
    this.callCounts.set(sessionId, { ...current, count: current.count + 1 });
  }

  // enforceHourlyLimit, trackCost, and doWork follow the same pattern and are
  // elided here for brevity.
}
```

Build limits at three levels: per-session, per-hour, and per-dollar. The per-dollar limit is often the most important one for preventing budget surprises.
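The per-dollar limit can be enforced the same way as the call limits: accumulate estimated cost per session and refuse once the budget is exceeded. An illustrative sketch (`CostTracker` and its field names are assumptions, not a real library):

```typescript
// Illustrative per-session cost guardrail.
class CostTracker {
  private spentUsd = new Map<string, number>();

  constructor(private costLimitUsd: number) {}

  track(sessionId: string, costUsd: number): void {
    const total = (this.spentUsd.get(sessionId) ?? 0) + costUsd;
    if (total > this.costLimitUsd) {
      // Same error shape as the other limits, so the agent handles it the same way.
      throw {
        type: "rate_limited",
        message: `Session cost of $${total.toFixed(2)} exceeds the $${this.costLimitUsd} budget`,
        suggestion: "Stop and report partial results to the user",
        retryable: false
      };
    }
    this.spentUsd.set(sessionId, total);
  }
}
```

The budget check happens before recording the spend, so a session is cut off at the first call that would cross the line rather than one call too late.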
For context on how the broader ecosystem is standardizing tool definitions, the Model Context Protocol is worth understanding. MCP defines a standard wire format for tool definitions, enabling tools to be shared across different agent frameworks and model providers.
The practical implication: if you build your tools following MCP conventions, they can be used by any MCP-compatible agent framework. You are not locked into one agent stack. As the ecosystem matures, this reusability becomes increasingly valuable.
MCP also standardizes tool capability discovery, which enables dynamic tool loading. An agent can query a tool registry, discover available tools at runtime, and load only the ones relevant to the current task. For large tool libraries (50+ tools), this prevents context window bloat from tool descriptions.
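Dynamic loading can be approximated even without a registry: score each tool description against the task and keep only the best matches. This naive keyword version is purely illustrative; a real MCP client would use the protocol's discovery mechanism rather than string matching:

```typescript
// Illustrative task-scoped tool selection by keyword overlap.
interface ToolDef {
  name: string;
  description: string;
}

function selectRelevantTools(registry: ToolDef[], task: string, maxTools = 10): ToolDef[] {
  const terms = task.toLowerCase().split(/\s+/);
  const scored = registry.map(tool => ({
    tool,
    // Count how many task terms appear in the tool's description.
    score: terms.filter(t => tool.description.toLowerCase().includes(t)).length,
  }));
  return scored
    .filter(s => s.score > 0)
    .sort((a, b) => b.score - a.score)
    .slice(0, maxTools)
    .map(s => s.tool);
}
```

Even this crude filter keeps a 50-tool library from flooding the context window when only a handful of tools are relevant to the current task.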
Unit tests on individual tool functions tell you almost nothing about agent behavior. Agents call tools with inputs you never imagined. Chain tools in orders you did not design. Pass one tool's output directly as another's input.
The only testing that matters is integration testing with a real agent running real scenarios:
```typescript
describe("Revenue Reporting Agent Integration", () => {
  const agent = new Agent({
    tools: [getRevenueReport, queryAnalytics, exportToCsv],
    systemPrompt: AGENT_SYSTEM_PROMPT,
  });

  const scenarios = [
    {
      input: "Show me Q3 revenue by region",
      expectedToolCalls: ["get_revenue_report"],
      expectedOutputContains: ["Q3", "region"],
    },
    {
      input: "Export last year's revenue to CSV",
      expectedToolCalls: ["get_revenue_report", "export_to_csv"],
      expectedOutputContains: ["exported", ".csv"],
    },
    {
      input: "Delete all Q2 records", // Should not comply without confirmation
      expectedBehavior: "escalate_or_refuse",
    },
  ];

  for (const scenario of scenarios) {
    it(scenario.input, async () => {
      const result = await agent.run(scenario.input);
      // Assert tool calls, output content, behavior
    });
  }
});
```

Run 20-30 representative scenarios. Run them daily. Track success rates over time. A tool change that breaks agent behavior shows up in integration tests before users report it.
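Tracking success rates over time only needs a small scoreboard over scenario runs. A minimal sketch with illustrative types:

```typescript
// Illustrative scoreboard: pass rate across a batch of scenario runs.
interface ScenarioRun {
  scenario: string;
  passed: boolean;
}

function successRate(runs: ScenarioRun[]): number {
  if (runs.length === 0) return 0;
  return runs.filter(r => r.passed).length / runs.length;
}
```

Log the daily rate somewhere durable; a tool-description change that drops the rate from 0.95 to 0.80 is visible the next morning instead of in a support ticket.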
Agent testing and quality assurance covers the full testing infrastructure, including how to measure tool selection accuracy and catch regression in tool call patterns.
Q: What is tool use in AI agents?
Tool use allows AI agents to interact with external systems — databases, APIs, file systems, browsers, and code execution. Instead of only generating text, agents take real actions: query databases, call APIs, read files, execute code. This bridges AI intelligence and real-world capability.
Q: What are the best patterns for AI agent tool use?
Key patterns are minimal tool sets (fewer, well-defined tools work better), explicit descriptions, per-tool error handling, tool chaining (output feeds into next tool), and permission boundaries limiting access based on task context.
Q: How does MCP relate to AI tool use?
MCP standardizes how agents discover and use tools. Instead of defining tools inline, MCP externalizes definitions into reusable servers that can be discovered, shared across projects, and controlled through permission systems. MCP is becoming the standard protocol for AI tool use.