Written by Gareth Simono, Founder and CEO of Agentik {OS}. Full-stack developer and AI architect with years of experience shipping production applications across SaaS, mobile, and enterprise platforms. Gareth orchestrates 267 specialized AI agents to deliver production software 10x faster than traditional development teams.
Picking the wrong model is the most expensive mistake nobody talks about. Here's what we learned routing millions of requests across multiple AI providers.

I routed about three million AI requests last year across Claude, GPT-4o, Gemini, and a handful of open-source models. The single biggest cost-saving action was not negotiating better rates. It was using the right model for each task.
Most teams pick one model and use it for everything. One model for customer support chatbot responses, code generation, document summarization, data extraction, and creative writing. This is like using a sports car for every trip because it is the nicest car you own. It works. It is wasteful. And for some trips, it is actually slower.
Model selection is a systems problem. Here is the framework.
The model ecosystem consolidates around three tiers:
Frontier models: The most capable, most expensive options. Used for tasks requiring the highest-quality reasoning, creative output, or nuanced judgment.
Mid-tier models: Excellent capability at significantly lower cost. Handle the vast majority of production tasks with quality indistinguishable from frontier models for those tasks.
Fast/cheap models: Sub-second latency, very low cost. Right for classification, extraction, simple generation, and routing. Wrong for complex reasoning.
| Provider | Frontier | Mid-Tier | Fast/Cheap |
|---|---|---|---|
| Anthropic | Claude Opus 4.6 | Claude Sonnet 4.6 | Claude Haiku 3.5 |
| OpenAI | GPT-4o | GPT-4o-mini | GPT-4o-mini |
| Google | Gemini 2.0 Ultra | Gemini 2.0 Flash | Gemini 2.0 Flash Lite |
| Meta (self-hosted) | Llama 4 Scout | Llama 3.3 70B | Llama 3.2 3B |
The specific model names and capabilities change quarterly. The tier structure is stable. Pick your model within each tier for your use cases and revisit the specific choice every six months.
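One way to keep the stable tier structure separate from the churning model names is a single config object. A minimal sketch, using the model IDs from the table above as placeholders: update the IDs at each review cycle and leave the tier keys, and every call site, untouched.

```typescript
// Hypothetical tier map. The model IDs are the article's examples,
// not endorsements; swap them at your six-month review.
type Tier = 'frontier' | 'mid' | 'fast';

const TIER_MODELS: Record<Tier, string> = {
  frontier: 'claude-opus-4-6',      // highest-quality reasoning
  mid: 'claude-sonnet-4-20250514',  // default for most production tasks
  fast: 'claude-haiku-3-5',         // classification, extraction, routing
};

function modelForTier(tier: Tier): string {
  return TIER_MODELS[tier];
}
```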
The most important routing decisions:
Complex reasoning is frontier-model territory: tasks that require synthesizing multiple considerations, weighing trade-offs, or handling genuinely ambiguous situations.
Examples: legal document review, medical literature analysis, complex code architecture decisions, financial modeling, strategic recommendations.
Why frontier: the quality gap between a frontier model and a mid-tier model on these tasks is real and measurable. A wrong answer on a complex reasoning task can be catastrophically wrong, not just slightly wrong.
Cost consideration: these tasks are expensive per-call but infrequent enough that the total cost is manageable. Do not compromise here to save money.
Mid-tier models handle the majority of code generation tasks at production quality. The frontier model advantage shows up for novel algorithm design, complex refactoring, and debugging subtle issues.
Route most code generation to mid-tier. Escalate to frontier for the hard cases.
```typescript
const CODE_ROUTING = {
  // Mid-tier: well-defined tasks with clear specs
  simple_generation: 'claude-sonnet-4-20250514',
  boilerplate: 'claude-sonnet-4-20250514',
  test_writing: 'claude-sonnet-4-20250514',
  documentation: 'claude-sonnet-4-20250514',
  // Frontier: requires judgment and deep understanding
  architecture_decisions: 'claude-opus-4-6',
  debugging_complex_issues: 'claude-opus-4-6',
  security_code_review: 'claude-opus-4-6',
  novel_algorithm: 'claude-opus-4-6',
  // Fast: mechanical tasks
  syntax_check: 'claude-haiku-3-5',
  variable_naming: 'claude-haiku-3-5',
  simple_classification: 'claude-haiku-3-5',
};
```

Fast/cheap models dominate this category. Classifying a support ticket, extracting structured data from a document, detecting language, routing a message to the right team: these are mechanical tasks that small models handle reliably.
Running these through frontier models is like sending a package by private jet. You spend 20x more and arrive at exactly the same time.
```typescript
async function classifyTicket(ticketText: string): Promise<TicketCategory> {
  // Haiku: fast, cheap, sufficient for classification
  const response = await anthropic.messages.create({
    model: 'claude-haiku-3-5',
    max_tokens: 50, // Classification needs minimal tokens
    messages: [{
      role: 'user',
      content: `Classify this support ticket. Return ONLY one of: billing, technical, account, feature, other\n\n${ticketText}`,
    }],
  });
  return parseCategory(response.content[0].type === 'text' ? response.content[0].text : 'other');
}

// Cost comparison for 10,000 daily ticket classifications:
// claude-haiku-3-5: ~$2-5/day
// claude-opus-4-6: ~$50-100/day
// Quality difference: negligible for simple classification
```

For creative writing, the comparison is more nuanced. Frontier models produce noticeably higher quality creative output. But whether that quality difference matters depends on the use case.
Marketing copy that will be reviewed by humans before publishing: mid-tier is sufficient. The human review catches quality issues.
Autonomous content generation at scale with no human review: frontier is worth it. The quality gap is measurable and matters.
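That rule is simple enough to encode directly in a routing decision. A minimal sketch under the assumption that human review is known at request time (`contentTier` is a hypothetical helper, not a library API):

```typescript
// Pick a quality tier for generated content based on whether a human
// reviews it before publishing. Hypothetical helper for illustration.
type ContentTier = 'standard' | 'premium';

function contentTier(humanReviewed: boolean): ContentTier {
  // No human review means no safety net: pay for frontier quality.
  // Reviewed copy can ship from mid-tier; review catches quality issues.
  return humanReviewed ? 'standard' : 'premium';
}
```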
Some tasks require processing very long documents. Contract review, codebase analysis, long research papers.
All frontier models handle long contexts (100K+ tokens). Mid-tier models vary in long-context performance. Test your specific long-context tasks with mid-tier models before assuming you need frontier.
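One way to act on those test results is a token-length cutoff: escalate to frontier only past the point where your own evaluations showed mid-tier quality degrading. A sketch with a placeholder threshold:

```typescript
// The 60_000-token cutoff is a placeholder; derive yours empirically by
// testing your actual long-context tasks, as described above.
function longContextTier(
  estimatedTokens: number,
  midTierSafeUpTo = 60_000,
): 'standard' | 'premium' {
  return estimatedTokens <= midTierSafeUpTo ? 'standard' : 'premium';
}
```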
Do not hardcode model selection throughout your application. Build a routing layer.
```typescript
type TaskType =
  | 'classification'
  | 'extraction'
  | 'simple_generation'
  | 'complex_reasoning'
  | 'code_generation'
  | 'creative_writing'
  | 'long_context_analysis';

type QualityTier = 'fast' | 'standard' | 'premium';

interface ModelConfig {
  provider: 'anthropic' | 'openai' | 'google';
  model: string;
  maxTokens: number;
  temperature: number;
}

const MODEL_ROUTER: Record<TaskType, Record<QualityTier, ModelConfig>> = {
  classification: {
    fast: { provider: 'anthropic', model: 'claude-haiku-3-5', maxTokens: 100, temperature: 0 },
    standard: { provider: 'anthropic', model: 'claude-haiku-3-5', maxTokens: 100, temperature: 0 },
    premium: { provider: 'anthropic', model: 'claude-sonnet-4-20250514', maxTokens: 200, temperature: 0 },
  },
  extraction: {
    fast: { provider: 'anthropic', model: 'claude-haiku-3-5', maxTokens: 500, temperature: 0 },
    standard: { provider: 'anthropic', model: 'claude-haiku-3-5', maxTokens: 500, temperature: 0 },
    premium: { provider: 'anthropic', model: 'claude-sonnet-4-20250514', maxTokens: 1000, temperature: 0 },
  },
  complex_reasoning: {
    fast: { provider: 'anthropic', model: 'claude-sonnet-4-20250514', maxTokens: 2000, temperature: 0.3 },
    standard: { provider: 'anthropic', model: 'claude-sonnet-4-20250514', maxTokens: 2000, temperature: 0.3 },
    premium: { provider: 'anthropic', model: 'claude-opus-4-6', maxTokens: 4000, temperature: 0.3 },
  },
  // ... other task types
};

function selectModel(taskType: TaskType, qualityTier: QualityTier = 'standard'): ModelConfig {
  return MODEL_ROUTER[taskType][qualityTier];
}

// Usage in application code
const modelConfig = selectModel('classification');
const response = await callModel(modelConfig, input);
```

The routing layer has two major advantages: you can change model selections in one place, and you can A/B test different models without touching application code.
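A/B testing fits naturally behind the routing layer. One sketch: bucket requests by a deterministic hash of the request ID, so the same user always lands on the same variant. `abSelect` is a hypothetical helper, not part of any SDK.

```typescript
// Deterministically assign a request to variant A or B based on its ID.
// bShare is the fraction of traffic sent to variant B (default 10%).
function abSelect<T>(requestId: string, a: T, b: T, bShare = 0.1): T {
  let hash = 0;
  for (const ch of requestId) {
    // Simple string hash kept in unsigned 32-bit range
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0;
  }
  return hash % 100 < bShare * 100 ? b : a;
}
```

Because the assignment is a pure function of the request ID, results are reproducible across retries and services, which keeps your per-variant quality metrics clean.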
For high-volume applications, self-hosted open-source models are increasingly viable. Llama 4 Scout, Mistral models, and Qwen 2.5 provide excellent capability for specific task types at near-zero marginal cost.
The economics work when request volume is high enough that fixed infrastructure and operations costs come in below per-token API pricing, and when the workload is narrow enough that a smaller open model matches API quality. The best self-hosted use cases are high-volume, well-defined tasks: the same classification and extraction work that the fast tier handles.
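A back-of-envelope break-even check makes that trade-off concrete. The figures below are illustrative placeholders, not real quotes:

```typescript
// Monthly request volume above which a fixed self-hosting cost beats
// per-request API pricing. Both inputs are assumptions you supply.
function breakEvenRequests(monthlyInfraCost: number, apiCostPerRequest: number): number {
  return Math.ceil(monthlyInfraCost / apiCostPerRequest);
}

// e.g. $2,000/month of GPUs and ops vs $0.004 per API request:
// self-hosting wins above breakEvenRequests(2000, 0.004) = 500,000 requests/month
```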
Benchmarks tell you how a model performs on benchmark tasks. Production tells you how it performs on your tasks. These are not the same.
I have seen models that score highly on coding benchmarks produce mediocre results for specific frameworks. Models that score well on reasoning benchmarks struggle with the specific types of reasoning required by a particular application.
Before committing to a model for a production use case, evaluate it on your actual inputs. Build a test set of 50-100 representative examples from your use case. Run all candidate models against it. Score the outputs. Make the decision based on your data, not benchmark leaderboards.
```typescript
async function benchmarkModels(
  testCases: Array<{ input: string; expectedQuality: string[] }>,
  models: ModelConfig[]
): Promise<Record<string, number>> {
  const scores: Record<string, number> = {};
  for (const model of models) {
    const modelKey = `${model.provider}/${model.model}`;
    const results = await Promise.all(
      testCases.map(async ({ input, expectedQuality }) => {
        const output = await callModel(model, input);
        return scoreOutput(output, expectedQuality);
      })
    );
    scores[modelKey] = results.reduce((sum, s) => sum + s, 0) / results.length;
  }
  return scores;
}
```

The goal is not to minimize model cost. It is to minimize cost while maintaining acceptable quality.
Three levers:
Right-sizing by task. Use this guide. Do not use frontier models for classification.
Response length control. Set max_tokens appropriately for each task. A classification task with max_tokens=1000 is paying for tokens you will never use.
Caching. Identical inputs produce cacheable outputs. For applications with many repeated queries (FAQ chatbots, standard document templates), caching can reduce API calls by 20-40%.
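A minimal sketch of where the cache sits relative to the model call. This in-memory version is deliberately simplified: production systems would typically use an external store with TTLs, and `call` here stands in for whatever client function you actually use.

```typescript
// In-memory response cache keyed on (model, prompt). Illustrative only:
// no eviction, no TTL, and a raw string key instead of a hash.
const cache = new Map<string, string>();

async function cachedCall(
  model: string,
  prompt: string,
  call: (model: string, prompt: string) => Promise<string>,
): Promise<string> {
  const key = `${model}:${prompt}`; // hash this in production
  const hit = cache.get(key);
  if (hit !== undefined) return hit; // cache hit: zero API cost
  const result = await call(model, prompt);
  cache.set(key, result);
  return result;
}
```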
```typescript
const MAX_TOKENS_BY_TASK: Record<TaskType, number> = {
  classification: 50,
  extraction: 500,
  simple_generation: 500,
  complex_reasoning: 4000,
  code_generation: 2000,
  creative_writing: 2000,
  long_context_analysis: 4000,
};
```

Routinely review your token usage by task type. High average token counts on classification tasks usually mean the model is writing prose explanations when you asked for a category. Fix the prompt.
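That review can be automated. A sketch that compares observed average completion length per task type against your configured caps and flags the outliers (`flagVerboseTasks` is a hypothetical helper):

```typescript
// Flag task types whose observed average completion length is near or
// above the configured cap: usually a prompt inviting prose where a
// bare label was asked for.
function flagVerboseTasks(
  observedAvgTokens: Record<string, number>,
  expectedMax: Record<string, number>,
  tolerance = 0.8, // flag when the average exceeds 80% of the cap
): string[] {
  return Object.keys(observedAvgTokens).filter(
    (task) => observedAvgTokens[task] > (expectedMax[task] ?? Infinity) * tolerance,
  );
}
```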
The sophistication of model selection is increasing. Most production AI systems in 2026 are running multiple models, routing requests based on complexity, cost constraints, and task type.
Specialized models for specific domains (legal, medical, code) are increasingly competing with general frontier models for domain-specific tasks, at lower cost.
The pattern: broad routing based on task type, with domain specialists for high-volume domain-specific work, and frontier models for the hardest tasks across all domains.
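That pattern reduces to a small decision function. The model names below are placeholders for whatever occupies each slot in your own stack:

```typescript
// Sketch of the routing pattern: frontier for the hardest work, a domain
// specialist when one exists, otherwise the general tiered default.
const DOMAIN_SPECIALISTS: Record<string, string> = {
  legal: 'legal-specialist-model',     // hypothetical domain model
  medical: 'medical-specialist-model', // hypothetical domain model
};

function routePattern(
  domain: string | null,
  complexity: 'low' | 'medium' | 'high',
): string {
  if (complexity === 'high') return 'frontier-model';
  if (domain && DOMAIN_SPECIALISTS[domain]) return DOMAIN_SPECIALISTS[domain];
  return complexity === 'low' ? 'fast-model' : 'mid-tier-model';
}
```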
Build the routing layer now. The specific models it routes to will change. The architecture for intelligent routing will remain.
Q: How do you choose the right AI model in 2026?
Choose based on task requirements: Claude Opus for complex reasoning, architecture, and nuanced decisions. Claude Sonnet for balanced performance and cost on most tasks. Claude Haiku for high-volume, simple tasks where speed and cost matter most. GPT-4o for multi-modal tasks. Open source models (Llama, Mistral) for on-premises deployment or cost-sensitive batch processing.
Q: What is the difference between Claude, GPT-4, and open source models?
Claude excels at nuanced reasoning, code generation, and following complex instructions. GPT-4o is strong in multi-modal tasks and has broad capabilities. Open source models (Llama 3, Mistral) offer cost savings and deployment flexibility but trail in complex reasoning. For production AI agents, Claude and GPT-4 offer the highest reliability.
Q: Should I use one AI model or multiple models?
Most production systems benefit from using multiple models: a capable model (Claude Opus, GPT-4) for complex decisions and code generation, a balanced model (Claude Sonnet) for routine tasks, and a fast model (Claude Haiku) for classification, extraction, and high-volume processing. This tiered approach optimizes cost while maintaining quality.