Written by Gareth Simono, Founder and CEO of Agentik {OS}. Full-stack developer and AI architect with years of experience shipping production applications across SaaS, mobile, and enterprise platforms. Gareth orchestrates 267 specialized AI agents to deliver production software 10x faster than traditional development teams.
Filling a 128K context window with everything degrades output quality. The skill is using it wisely. Here's how to prioritize, compress, and budget tokens.

More context is not always better. That statement runs counter to most developer intuitions about language models, and getting it wrong is expensive.
When frontier models announced 128K and 200K context windows, many teams responded by stuffing everything they could into the context. Full codebase. Entire conversation history. Complete documentation. Every potentially relevant document. The assumption: more information helps the model give better answers.
The reality: past a certain point, more context actively hurts. Models struggle to maintain focus across very long contexts. Relevant information gets diluted by irrelevant material. Response quality degrades. Latency climbs. Cost climbs faster.
Context window optimization is the discipline of deciding what belongs in the context, what does not, and how to represent what does belong as efficiently as possible.
The "lost in the middle" problem is real and well-documented. Research shows that language models perform significantly better when relevant information appears at the beginning or end of a long context compared to the middle.
If you insert the answer to a question in position 50 of a 100-document context, models retrieve it far less reliably than if it appears in positions 1-5 or 95-100. Bury the most important content in the middle of a massive context and you might as well not have included it.
The practical implication: do not just add context. Structure it. Put the most critical information at the beginning. Do not assume the model will weight all content equally regardless of position.
Beyond the lost-in-the-middle problem, irrelevant context is actively harmful. It introduces noise. The model has to allocate attention across everything in the context. Irrelevant content consumes attention budget that should go to relevant content.
Think of it like asking someone to answer a question while reading a mixed stack of relevant documents and random junk. The relevant signal is there, but finding it requires filtering through noise. Human performance degrades. So does model performance.
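One way to act on the position effect: when you assemble retrieved documents, place them so the highest-ranked ones sit at the edges of the context rather than the middle. A minimal sketch (the helper name and the alternating placement strategy are illustrative, not a standard API):

```typescript
// Assumes `docsByRelevance` is sorted by descending relevance.
// Alternates placement so the best documents land at the start and end
// of the context, pushing the weakest into the middle.
function orderForPosition<T>(docsByRelevance: T[]): T[] {
  const front: T[] = [];
  const back: T[] = [];
  docsByRelevance.forEach((doc, i) => {
    if (i % 2 === 0) front.push(doc);      // 1st, 3rd, 5th best → front
    else back.unshift(doc);                 // 2nd, 4th best → end
  });
  return [...front, ...back];
}
```

With five documents ranked 1 (best) to 5 (worst), this yields the order 1, 3, 5, 4, 2: the two strongest documents bracket the context, and the weakest sits in the middle where retrieval reliability is lowest.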
Every model has a context window limit. Within that limit, you have four categories of tokens:
| Category | What It Contains | Priority |
|---|---|---|
| System prompt | Instructions, persona, constraints | Highest |
| Retrieved context | RAG results, relevant documents | High |
| Conversation history | Prior turns in the conversation | Medium |
| User input | Current query | Always included |
Output tokens also count against the context window: the model's response must fit in the same window as the input. Reserve budget for the expected output length.
A practical token budget for a 128K model:
```typescript
const TOKEN_BUDGET = {
  total: 128_000,
  system_prompt: 2_000,         // Well-written system prompts should be concise
  retrieved_context: 60_000,    // RAG results, selected documents
  conversation_history: 20_000, // Rolling window of prior turns
  user_input: 4_000,            // Current message
  output_reserve: 4_000,        // Expected max response length
  // Remaining ~38K: buffer for edge cases
};

function checkBudget(budget: typeof TOKEN_BUDGET): boolean {
  // Sum every allocation except `total` itself
  const { total, ...allocations } = budget;
  const allocated = Object.values(allocations).reduce((sum, val) => sum + val, 0);
  return allocated <= total;
}
```

Budgeting forces intentionality. Instead of adding context until you hit the limit, you allocate purposefully and cut aggressively.
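"Cut aggressively" can be mechanical. A sketch of enforcing the retrieved-context allocation, assuming documents arrive sorted by descending relevance and using the same rough 4-characters-per-token estimate this article uses later (restated here so the snippet is self-contained):

```typescript
function estimateTokens(text: string): number {
  // Rough estimate: 1 token per 4 characters of English text
  return Math.ceil(text.length / 4);
}

// Keep the most relevant documents until the budget is exhausted.
// Everything past the first overflow is dropped entirely.
function trimToBudget(docs: string[], maxTokens: number): string[] {
  const kept: string[] = [];
  let used = 0;
  for (const doc of docs) {
    const cost = estimateTokens(doc);
    if (used + cost > maxTokens) break;
    kept.push(doc);
    used += cost;
  }
  return kept;
}
```

Because the input is relevance-ordered, the budget always goes to the strongest documents first; weak tail results are the ones that get cut.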
The single highest-leverage optimization for most RAG applications: retrieval quality. If you retrieve the right documents, you can use far fewer tokens and get better results than retrieving many documents of uncertain relevance.
Basic ANN search retrieves approximately similar documents. Adding a re-ranking step improves precision significantly.
```typescript
async function retrieveWithReranking(
  query: string,
  topK: number = 20,
  finalK: number = 5
): Promise<Document[]> {
  // Step 1: Retrieve more candidates than you need
  const candidates = await vectorDB.query({
    vector: await getEmbedding(query),
    topK, // Get 20 candidates
    includeMetadata: true,
  });

  // Step 2: Re-rank with a cross-encoder model
  // Cross-encoders are slower but much more precise
  const reranked = await reranker.rank({
    query,
    documents: candidates.matches
      .map(m => m.metadata?.content)
      .filter((c): c is string => typeof c === 'string'), // Drop matches with no content
    returnDocuments: true,
    topN: finalK, // Keep only top 5
  });

  return reranked;
}
```
```typescript
// Cohere reranker example
import { CohereClient } from 'cohere-ai';

const cohere = new CohereClient({ token: process.env.COHERE_API_KEY });

async function rerankWithCohere(
  query: string,
  docs: string[],
  topN: number
) {
  const response = await cohere.rerank({
    model: 'rerank-english-v3.0',
    query,
    documents: docs,
    topN,
  });
  return response.results;
}
```

The quality improvement from adding re-ranking often means you can reduce finalK from ten to five and still get better answers, because the five documents you do include are more precisely relevant.
Even when you retrieve the right document, you often do not need the whole thing. A ten-paragraph document might have one paragraph directly relevant to the query.
Contextual compression extracts the relevant portion before adding it to the context.
```typescript
async function compressContext(
  query: string,
  document: string
): Promise<string> {
  const response = await anthropic.messages.create({
    model: 'claude-3-5-haiku-20241022', // Use a fast/cheap model for compression
    max_tokens: 500,
    messages: [{
      role: 'user',
      content: `Given this query: "${query}"

Extract ONLY the portions of the following document that are directly relevant to answering the query.
Preserve exact wording. Omit irrelevant sections entirely.
If nothing is relevant, respond with: [NOT RELEVANT]

Document:
${document}`,
    }],
  });

  const compressed = response.content[0].type === 'text'
    ? response.content[0].text
    : '[ERROR]';

  return compressed === '[NOT RELEVANT]' ? '' : compressed;
}
```

This costs an additional API call for the compression step, but the overall cost is often lower because the main model processes far fewer tokens and produces higher-quality output.
Multi-turn conversations are a common context killer. Every turn adds tokens. By turn twenty, you are carrying substantial history that may be mostly irrelevant to the current question.
The simplest approach: keep only the last N turns. Works well when earlier turns have minimal bearing on the current query.
```typescript
function rollingWindow(
  history: Message[],
  maxTokens: number
): Message[] {
  let tokenCount = 0;
  const trimmed: Message[] = [];

  // Process from most recent to oldest
  for (let i = history.length - 1; i >= 0; i--) {
    const estimated = estimateTokens(history[i].content);
    if (tokenCount + estimated > maxTokens) break;
    trimmed.unshift(history[i]); // Maintain chronological order
    tokenCount += estimated;
  }

  return trimmed;
}

function estimateTokens(text: string): number {
  // Rough estimate: 1 token per 4 characters of English text
  return Math.ceil(text.length / 4);
}
```

For conversations where early turns establish important context, a summarization approach preserves key information while reducing token count.
```typescript
async function compressHistory(
  history: Message[],
  keepRecent: number = 4
): Promise<{ summary: string; recentTurns: Message[] }> {
  const recent = history.slice(-keepRecent);
  const toSummarize = history.slice(0, -keepRecent);

  if (toSummarize.length === 0) {
    return { summary: '', recentTurns: recent };
  }

  const historyText = toSummarize
    .map(m => `${m.role}: ${m.content}`)
    .join('\n');

  const summaryResponse = await anthropic.messages.create({
    model: 'claude-3-5-haiku-20241022',
    max_tokens: 300,
    messages: [{
      role: 'user',
      content: `Summarize this conversation concisely, preserving key decisions, facts established, and context needed to continue the conversation:\n\n${historyText}`,
    }],
  });

  const summary = summaryResponse.content[0].type === 'text'
    ? summaryResponse.content[0].text
    : '';

  return { summary, recentTurns: recent };
}
```

Insert the summary as a system message at the start of the context. The model uses the summary as background and the recent turns for immediate context.
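Putting the pieces together, assembling the final request might look like this (a sketch; the `buildRequest` helper and the simplified `Message` shape are illustrative):

```typescript
type Message = { role: 'user' | 'assistant'; content: string };

// Combine the compressed summary (as system context) with the
// verbatim recent turns and the current user message.
function buildRequest(
  summary: string,
  recentTurns: Message[],
  userMessage: string
): { system?: string; messages: Message[] } {
  return {
    system: summary ? `Conversation summary so far:\n${summary}` : undefined,
    messages: [...recentTurns, { role: 'user', content: userMessage }],
  };
}
```

The summary rides in the system field, so the model treats it as background rather than as a turn it should respond to directly.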
System prompts repeat with every API call. A bloated system prompt wastes tokens on every request, compounds across many requests, and often dilutes focus.
Principles for lean system prompts:
Eliminate redundancy. If you say "always respond professionally" and also include a professional example response, the instruction is redundant. The example does more work.
Use examples over explanations. One good example is worth ten sentences of instruction.
Remove aspirational statements. "You strive to provide the most accurate information possible" does nothing. Remove it.
Cut the introduction. Many system prompts start with background about the company or product that the model does not need. It needs constraints and formats, not marketing copy.
```typescript
// Bloated system prompt (high token cost, low signal)
const bloated = `
You are an AI assistant for Acme Corporation, a leading provider of cloud software solutions.
Our company was founded in 2015 and serves over 10,000 customers worldwide.
Your job is to help customers with their questions about our products.
You should always be professional, helpful, and accurate.
If you don't know something, say so.
Always provide clear and concise answers.
You strive to provide excellent customer service.
...
`;

// Lean system prompt (lower token cost, higher signal)
const lean = `
You are an Acme customer support agent.

Scope: Answer questions about Acme products and services only.
If asked about competitors or out-of-scope topics: "I can only help with Acme-related questions."
If uncertain: say so rather than guessing.

Format: Direct answers. No preamble. Lists for multi-part answers.
`;

// The lean version uses fewer tokens, but more importantly:
// clearer constraints = more consistent behavior
```

For high-volume applications, prompt caching can dramatically reduce costs. Anthropic's prompt caching feature caches the beginning of a prompt and reuses the KV cache across requests with the same prefix.
If you have a 2,000-token system prompt that you send with every request, caching it reduces the cost of those tokens by 90% after the first request.
```typescript
const response = await anthropic.messages.create({
  model: 'claude-sonnet-4-20250514',
  max_tokens: 1024,
  system: [
    {
      type: 'text',
      text: systemPrompt, // Your stable system prompt
      cache_control: { type: 'ephemeral' }, // Cache this block
    },
    {
      type: 'text',
      text: dynamicContext, // Not cached (changes per request)
    },
  ],
  messages: [{ role: 'user', content: userMessage }],
});

// First request: full cost for the system prompt (plus a small cache-write surcharge)
// Subsequent requests with the same prefix: ~90% cost reduction on those tokens
```

Cache the stable parts. Do not cache the dynamic parts. The default cache TTL is 5 minutes, refreshed on each hit, so this works best for high-frequency applications.
You cannot optimize what you do not measure.
Track these metrics per request and in aggregate:
```typescript
interface TokenMetrics {
  input_tokens: number;
  output_tokens: number;
  cache_creation_tokens: number;
  cache_read_tokens: number;
  request_cost_usd: number;
  response_quality_score?: number; // From your eval framework
}

async function trackRequest(
  response: Anthropic.Message,
  qualityScore?: number
): Promise<void> {
  const metrics: TokenMetrics = {
    input_tokens: response.usage.input_tokens,
    output_tokens: response.usage.output_tokens,
    cache_creation_tokens: response.usage.cache_creation_input_tokens || 0,
    cache_read_tokens: response.usage.cache_read_input_tokens || 0,
    request_cost_usd: calculateCost(response.usage, response.model), // Your pricing table
    response_quality_score: qualityScore,
  };

  await metricsDB.insert(metrics); // Your metrics store
}
```

When you have metrics, you can make data-driven optimization decisions. Reduce retrieval from 10 results to 5 and measure whether quality drops. Enable prompt caching and measure the cost reduction. Add contextual compression and see if quality improves while costs decrease.
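Two aggregates worth computing from those records are the cache hit rate and the average cost per request. A sketch (fields mirror the `TokenMetrics` interface above, restated so the snippet stands alone):

```typescript
interface TokenMetrics {
  input_tokens: number;        // Uncached input tokens
  cache_read_tokens: number;   // Input tokens served from cache
  request_cost_usd: number;
}

// Share of input tokens served from cache, and mean cost per request.
function summarize(metrics: TokenMetrics[]) {
  const totalInput = metrics.reduce(
    (s, m) => s + m.input_tokens + m.cache_read_tokens, 0);
  const cached = metrics.reduce((s, m) => s + m.cache_read_tokens, 0);
  const totalCost = metrics.reduce((s, m) => s + m.request_cost_usd, 0);
  return {
    cacheHitRate: totalInput > 0 ? cached / totalInput : 0,
    avgCostUsd: metrics.length > 0 ? totalCost / metrics.length : 0,
  };
}
```

A low cache hit rate on a high-volume endpoint is a signal that the stable prefix is changing between requests, often because dynamic content leaked into the cached block.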
If you are optimizing a production AI application for the first time, work through the techniques above in order: measure current token usage and cost, trim the system prompt and enable prompt caching, add re-ranking so you can reduce topK without sacrificing precision, then layer in contextual compression and conversation history management.

For each step, measure before and after. The goal is not the minimum possible tokens. The goal is the optimal balance of quality and cost.
Q: What is context window optimization?
Context window optimization is the practice of maximizing the value of the limited text an AI model can process at once. Strategies include prioritizing relevant information, compressing verbose content, using RAG to fetch only needed context, and structuring prompts to front-load the most important information.
Q: How do you fit more useful information in a context window?
Fit more information through semantic chunking (keeping related content together), summarization of less critical sections, dynamic context loading (RAG retrieves only relevant passages), structured formatting (tables and lists are more token-efficient than prose), and prompt compression techniques.
Q: What happens when you exceed the context window?
Exceeding the context window causes information loss — the model ignores or poorly processes content beyond its limit. This leads to inconsistent responses, missed context, and errors. Prevention strategies include monitoring token counts, using RAG for large knowledge bases, and hierarchical summarization for long conversations.
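The cheapest prevention is a pre-flight check: estimate total tokens before calling the API and trim (history first) if the request would overflow. A sketch using the rough 4-characters-per-token heuristic from earlier (restated here for self-containment):

```typescript
function estimateTokens(text: string): number {
  // Rough estimate: 1 token per 4 characters of English text
  return Math.ceil(text.length / 4);
}

// True if the combined input plus the reserved output budget
// would exceed the model's context window.
function wouldOverflow(
  parts: string[],            // system prompt, context, history, user input
  contextWindow: number,
  outputReserve: number
): boolean {
  const inputTokens = parts.reduce((s, p) => s + estimateTokens(p), 0);
  return inputTokens + outputReserve > contextWindow;
}
```

Because the character-based estimate is coarse, leave a safety margin rather than running right up to the limit; for exact counts, most providers expose a token-counting endpoint or tokenizer library.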