Written by Gareth Simono, Founder and CEO of Agentik {OS}. Full-stack developer and AI architect with years of experience shipping production applications across SaaS, mobile, and enterprise platforms. Gareth orchestrates 267 specialized AI agents to deliver production software 10x faster than traditional development teams.
Founder & CEO, Agentik {OS}
$47,000 per month on LLM calls for 2,000 users. That's $23.50 per user on inference alone. Most of it was waste. Here's how to fix agent economics.

I reviewed an AI agent startup's cloud bill last quarter. $47,000 per month on LLM API calls. Monthly active users: 2,000. That's $23.50 per user per month on inference alone, before hosting, storage, or anything else.
Their product charged $29 per month. They were losing money on every single user at any usage level above zero.
Not unusual. Agent systems are the most expensive software most engineering teams have ever built, and the majority of the spending delivers no user value. It's unnecessary computation that nobody audited.
Before optimizing anything, understand where money actually goes. Most teams have intuitions. Intuitions are wrong in ways that send optimization effort in the wrong direction.
For a typical agent system:
| Cost Category | Typical Share | Optimization Potential |
|---|---|---|
| LLM API calls | 60-80% | Very High |
| Embedding generation | 10-15% | High |
| Vector database queries | 5-10% | Medium |
| Tool execution / external APIs | 5-10% | Medium |
| Infrastructure | 5-10% | Low |
LLM costs dominate. Every optimization project that doesn't start there is optimizing the wrong thing.
```typescript
async function auditCostStructure(periodStart: Date, periodEnd: Date): Promise<CostAudit> {
  return {
    byCategory: await getCostByCategory(periodStart, periodEnd),
    byTaskType: await getCostByTaskType(periodStart, periodEnd),
    byModelTier: await getCostByModelTier(periodStart, periodEnd),
    byAgent: await getCostByAgent(periodStart, periodEnd),
    wasted: {
      uncachedDuplicates: await getUncachedDuplicateCost(periodStart, periodEnd),
      oversizedPrompts: await getPromptBloatEstimate(periodStart, periodEnd),
      unnecessaryHighTierCalls: await getOverqualifiedModelCost(periodStart, periodEnd),
    },
  };
}
```

Build this breakdown before writing a single optimization. It tells you where the leverage is.
Model tiering, routing each task to the cheapest model that can handle it, is the highest-impact optimization available to most systems. Not incremental. Order-of-magnitude.
Most systems use one expensive model for everything. Including calls where a $0.001/1K token model produces identical results.
Once audited, typical call distributions skew heavily toward simple tasks a cheap model handles just as well. Weighted saving from optimized routing: approximately a 70-75% reduction in LLM spend with no perceptible quality loss on user-facing outputs.
```typescript
class TaskComplexityRouter {
  private readonly modelTiers = {
    fast: "claude-3-haiku-20240307",
    balanced: "claude-sonnet-4-20250514",
    powerful: "claude-opus-4-20250514",
  };

  async selectModel(task: AgentTask): Promise<string> {
    if (FAST_TIER_TASK_TYPES.has(task.type)) return this.modelTiers.fast;
    // Use the cheapest model to classify complexity
    const complexity = await this.classifyComplexity(task);
    if (complexity.score < 0.3) return this.modelTiers.fast;
    if (complexity.score < 0.7) return this.modelTiers.balanced;
    return this.modelTiers.powerful;
  }

  private async classifyComplexity(task: AgentTask): Promise<ComplexityScore> {
    return callLLM(task, {
      model: this.modelTiers.fast, // Pay almost nothing to save a lot
      systemPrompt: COMPLEXITY_CLASSIFIER_PROMPT,
    }).then(parseComplexityScore);
  }
}
```

The routing infrastructure requires coordination/execution separation. See the scaling architecture article for how to build it.
Without measurement, most teams assume workloads are mostly unique. With measurement, they find 30-60% duplication.
```typescript
class ResponseCache {
  async get(input: AgentInput): Promise<CachedResult | null> {
    const key = this.buildCacheKey(input);
    const entry = await this.store.get<CacheEntry>(key);
    if (!entry || this.isExpired(entry)) return null;
    await this.metrics.record("cache_hit", {
      type: "exact_match",
      taskType: input.type,
      savedCostUsd: entry.originalCostUsd,
    });
    return entry.result;
  }

  private buildCacheKey(input: AgentInput): string {
    const normalized = this.normalizer.normalize(input);
    return createHash("sha256")
      .update(JSON.stringify(normalized, Object.keys(normalized).sort()))
      .digest("hex");
  }
}
```

TTL varies by content type. Policy answers: 24 hours. Real-time data: 5 minutes. Document summaries: 7 days.
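Those TTLs can live in a small policy map so cache writes stay consistent across agents. A minimal sketch; the content-type keys and the one-hour fallback are assumptions for illustration:

```typescript
// Hypothetical TTL policy map; values in seconds, following the guidance above.
const TTL_POLICY: Record<string, number> = {
  policy_answer: 24 * 60 * 60,    // 24 hours
  realtime_data: 5 * 60,          // 5 minutes
  doc_summary: 7 * 24 * 60 * 60,  // 7 days
};

function ttlFor(taskType: string, fallbackSeconds = 60 * 60): number {
  // Unknown content types get a conservative one-hour default
  return TTL_POLICY[taskType] ?? fallbackSeconds;
}
```

Centralizing the policy means changing a TTL is a one-line edit instead of a hunt through every cache write site.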
Near-duplicate queries. "How do I reset my password?" and "I forgot my password, how do I change it?" are semantically identical. Embed, check similarity, return cached response above threshold.
Set thresholds conservatively (0.95+). Monitor false positive rate. Even cautious thresholds add 10-20% cache hits on top of exact matching.
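The similarity check itself is simple once queries are embedded. This sketch assumes an in-memory index of already-embedded entries; the entry shape and function names are hypothetical, and in production the index would live in a vector database:

```typescript
// Each cached entry pairs an embedding vector with its stored response.
type SemanticEntry = { vector: number[]; response: string };

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the best cached response above the threshold, or null on a miss.
function semanticLookup(
  queryVector: number[],
  index: SemanticEntry[],
  threshold = 0.95
): string | null {
  let best: { score: number; response: string } | null = null;
  for (const entry of index) {
    const score = cosineSimilarity(queryVector, entry.vector);
    if (score >= threshold && (!best || score > best.score)) {
      best = { score, response: entry.response };
    }
  }
  return best ? best.response : null;
}
```

The threshold parameter is the knob to monitor: log every semantic hit with its score so false positives surface in review.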
Cache expensive intermediate steps, not just final outputs.
```typescript
async function getOrComputeContext(
  query: string,
  knowledgeBase: KnowledgeBase,
  cache: Cache
): Promise<ProcessedContext> {
  return cache.getOrCompute(
    `context:${hashQuery(query)}`,
    async () => {
      const raw = await knowledgeBase.retrieve(query); // Expensive
      return summarizeAndStructure(raw); // Also expensive
    },
    { ttl: 3600 }
  );
}
```

Aggressive multi-layer caching reduces LLM costs 50-70% at scale. The difference between viable unit economics and burning money.
Every token in your system prompt is paid on every request. Bloated prompts are cash flow problems at scale.
Common prompt bloat patterns:

- Redundant instructions. Three variations of "be professional and accurate." Pick one.
- Over-specified examples. Three full examples in the system prompt when one would do. Examples repeat on every request.
- Static context that should be dynamic. Including entire product documentation in every prompt when most of it is irrelevant. Move to dynamic injection: retrieve and inject only what's relevant to each specific request.
```typescript
// EXPENSIVE: static full context, ~2,000 tokens on every request
const bloatedSystemPrompt = `You are a customer support agent.\n\n${FULL_PRODUCT_DOCS}\n\n${ALL_POLICIES}`;

// CHEAPER: dynamic context injection, 300-500 tokens on average
const systemPrompt = `You are a customer support agent.`;
const relevantContext = await retrieveRelevantContext(userQuery);
const messages = [
  { role: "system", content: systemPrompt },
  { role: "user", content: `Context: ${relevantContext}\n\nQuestion: ${userQuery}` },
];
```

A system prompt reduction from 2,000 to 600 tokens is a 70% input cost reduction on that agent. At 10,000 requests per day, this is real money.
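The arithmetic behind "real money" is worth making explicit. The per-token rate here is an assumed figure for illustration, not a quoted price:

```typescript
// Illustrative input-cost math; $3 per million input tokens is an assumed rate.
const RATE_PER_INPUT_TOKEN = 3 / 1_000_000;
const requestsPerDay = 10_000;

function dailySystemPromptCost(promptTokens: number): number {
  return promptTokens * RATE_PER_INPUT_TOKEN * requestsPerDay;
}

const before = dailySystemPromptCost(2000); // ~$60/day on the system prompt alone
const after = dailySystemPromptCost(600);   // ~$18/day
const savingsPct = Math.round((1 - after / before) * 100); // 70% reduction
```

At those assumed rates, the trimmed prompt saves on the order of $1,200 per month for a single agent, before any other optimization.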
Multi-turn conversations accumulate history. Naive implementations include full history on every turn. Token costs grow linearly with conversation length.
```typescript
class ConversationCompressor {
  async getCompressedHistory(
    fullHistory: Message[],
    maxHistoryTokens: number = 1000
  ): Promise<Message[]> {
    if (countTokens(fullHistory) <= maxHistoryTokens) return fullHistory;
    const recentTurns = fullHistory.slice(-4); // Keep last 2 exchanges verbatim
    const olderTurns = fullHistory.slice(0, -4);
    const summary = await this.summarizer.summarize(olderTurns, {
      preserveKeyFacts: true,
      maxTokens: 300,
    });
    return [
      { role: "system", content: `Previous conversation summary: ${summary}` },
      ...recentTurns,
    ];
  }
}
```

Compressing older history reduces token cost by 60-80% for longer conversations with minimal quality impact for most use cases.
Not every task requires real-time response. Identify latency-tolerant workloads and batch them.
Provider batch API pricing typically offers 50% discounts with 24-hour turnaround. If the latency is acceptable, this is money left on the table.
Natural batch candidates: content moderation, analytical processing, background enrichment, scheduled report generation, data extraction from large document sets.
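Routing work into the batch lane can be as simple as tagging each task with a latency tolerance and partitioning against the provider's turnaround window. A sketch under assumed names; the task shape and flag are hypothetical:

```typescript
// Hypothetical task shape: each task declares how long it can wait.
type Task = { id: string; latencyToleranceMs: number };

// Provider batch turnaround per the discussion above: up to 24 hours.
const BATCH_WINDOW_MS = 24 * 60 * 60 * 1000;

function partitionByLatencyTolerance(tasks: Task[]): { realtime: Task[]; batch: Task[] } {
  const realtime: Task[] = [];
  const batch: Task[] = [];
  for (const task of tasks) {
    // Only tasks that can tolerate the full window go to the discounted lane
    (task.latencyToleranceMs >= BATCH_WINDOW_MS ? batch : realtime).push(task);
  }
  return { realtime, batch };
}
```

Everything in the `batch` bucket then gets submitted through the provider's batch API at the discounted rate; everything else stays on the synchronous path.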
Explicit budgets at every level prevent surprises and enable graceful degradation.
```typescript
class HierarchicalBudgetManager {
  async checkAndApprove(request: AgentRequest): Promise<BudgetDecision> {
    const checks = await Promise.all([
      this.checkRequestLimit(request),
      this.checkUserDailyLimit(request.userId),
      this.checkSystemBudget(),
    ]);
    const binding = checks.reduce((prev, curr) =>
      curr.remainingTokens < prev.remainingTokens ? curr : prev
    );
    if (binding.remainingTokens < request.estimatedTokens) {
      const degradedEstimate = request.estimatedTokens * 0.4;
      if (binding.remainingTokens >= degradedEstimate) {
        return { approved: true, modelTier: "fast", reason: "budget_constrained" };
      }
      return { approved: false, reason: binding.limitType, availableTokens: binding.remainingTokens };
    }
    return { approved: true, modelTier: "standard" };
  }
}
```

Budget exhaustion should degrade gracefully. Switch to a cheaper model tier first. Increase cache aggressiveness second. Throttle new requests third. Hard stop only as a last resort.
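That four-step ladder can be encoded as a single pure function keyed on remaining budget. The thresholds below are illustrative assumptions to tune against your own telemetry, not recommended values:

```typescript
type DegradationAction = "normal" | "cheap_tier" | "aggressive_cache" | "throttle" | "hard_stop";

// Map the fraction of budget remaining to the ladder described above.
// Threshold values are illustrative; tune against real budget telemetry.
function degradationAction(budgetRemainingFraction: number): DegradationAction {
  if (budgetRemainingFraction > 0.5) return "normal";
  if (budgetRemainingFraction > 0.25) return "cheap_tier";      // switch models first
  if (budgetRemainingFraction > 0.1) return "aggressive_cache"; // then lean on cache
  if (budgetRemainingFraction > 0) return "throttle";           // then slow intake
  return "hard_stop";                                           // last resort only
}
```

Keeping the ladder in one pure function makes the policy testable and auditable, instead of scattering threshold checks across request handlers.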
Cost optimization for agent systems isn't a project you complete. It's an ongoing operational discipline.
Repeat the cost audit monthly, compare against the previous period, and target the largest remaining category. Teams that run this discipline consistently reduce agent costs 60-80% from initial deployment levels within 6 months. Not from a single optimization. From systematic, measured iteration.
The startups spending $23.50 per user per month on inference are not unusual. The ones at $3-5 per user while maintaining quality aren't smarter. They're disciplined.
For complete cost visibility, pair optimization with proper agent observability. You can't optimize what you can't measure.
Q: How do you reduce AI agent costs?
Reduce costs through model tiering (use cheap models for simple tasks, expensive ones only for complex tasks), caching (avoid redundant API calls), prompt optimization (shorter prompts that achieve the same result), batch processing (route latency-tolerant work through discounted batch APIs), and smart routing (skip unnecessary agent steps when possible).
Q: What is model tiering for AI agents?
Model tiering assigns different AI models to tasks based on complexity: Claude Haiku for classification and extraction, Claude Sonnet for standard code generation and analysis, Claude Opus for complex architecture and reasoning. This approach reduces costs 60-80% compared to using the most expensive model for everything.
Q: What is the typical cost of running AI agents in production?
Production AI agent costs range from $0.01-$0.50 per task depending on complexity and model used. A development team using AI agents typically spends $500-$5,000/month on API costs — dramatically less than the salary of equivalent human developers. The key is optimizing model selection and caching.