Written by Gareth Simono, Founder and CEO of Agentik {OS}. Full-stack developer and AI architect with years of experience shipping production applications across SaaS, mobile, and enterprise platforms. Gareth orchestrates 267 specialized AI agents to deliver production software 10x faster than traditional development teams.
Your agent confidently cites a policy updated six months ago. Not a hallucination problem. A knowledge problem. RAG fixes it. Here is how.

Your AI agent is confidently wrong. It cites a policy updated six months ago. References a product that was deprecated last quarter. Quotes pricing that changed in January.
This is not a hallucination problem. Hallucination is when a model fabricates facts from nothing. This is different. The model is faithfully reporting what it learned during training. The problem is that training data has a cutoff date. Everything after that cutoff is invisible. Everything before it might be stale. The agent cannot tell the difference between current truth and outdated truth because to the agent, all of its knowledge looks the same.
Retrieval-Augmented Generation (RAG) fixes this at the architectural level. Instead of relying on frozen training knowledge, RAG connects your agent to a live knowledge base, retrieves the documents most relevant to the current query, and grounds the response in that retrieved context rather than in training memory alone. The agent stops being an oracle drawing on ancient wisdom and becomes a researcher who checks sources before answering.
The idea is elegant. The implementation is where most teams stumble. I have built and debugged RAG systems across dozens of production deployments. The naive version works in demos. The production version requires understanding five critical decisions that determine whether users trust the system or quietly stop using it.
Every RAG tutorial shows the same basic flow. Embed documents. Store in a vector database. At query time, embed the query. Find similar vectors. Stuff the matching chunks into the context. Ask the LLM. Get an answer.
This works. Until it does not.
I have watched naive RAG systems fail in three specific ways, repeatedly, across different teams and use cases.
Retrieval miss. The system retrieves chunks that are semantically adjacent to the question but not actually relevant. A user asks about the enterprise pricing tier. The system retrieves chunks about pricing philosophy, value-based pricing, and pricing FAQs, none of which contain the actual enterprise pricing numbers. The agent confidently answers with related but wrong information.
Context fragmentation. The answer spans multiple document sections. Section 3.1 says the policy applies to all accounts. Section 7.4 lists the exceptions. Naive chunking by character count splits these sections into separate chunks. The retrieval system finds one but not the other. The agent gives a partial answer that omits the exceptions, creating real liability if this is a legal or compliance context.
Recency blindness. The knowledge base contains both the current policy and an archived version from two years ago. Both are semantically similar. The retrieval system might surface either one. The agent has no reliable way to prefer the current document unless you built metadata filtering into the system from day one.
These failures share a root cause. Naive RAG treats retrieval as a pure similarity problem. Real retrieval is a relevance problem, and relevance requires understanding document structure, metadata, recency, and the difference between semantic adjacency and actual answer containment.
The gap between a demo RAG system and a production RAG system is not model quality. It is retrieval quality. A better LLM cannot compensate for retrieving the wrong documents.
Production RAG has five layers. Each layer is a decision point. Getting each one right compounds into a system users actually trust.
Chunking is the most underestimated decision in RAG architecture. It determines the granularity of what you can retrieve, which determines the precision of what the agent can use.
Fixed-size chunking, splitting every 500 tokens regardless of content structure, is the wrong default despite being the default in most tutorials. It destroys document structure. A table split across two chunks becomes two orphaned fragments that individually make no sense. A section header ends up in one chunk while the content it describes is in the next. The answer to a user's question exists in the document, but it exists across a boundary that your chunking created.
Semantic chunking respects document structure. For markdown documents, split at header boundaries and keep headers with their content. For code, split at function or class boundaries. For prose, split at paragraph breaks and merge short paragraphs until you reach a meaningful size threshold.
interface ChunkOptions {
  docType: "markdown" | "code" | "prose" | "table";
  targetSize: number;  // tokens
  overlapSize: number; // tokens of overlap between chunks
  minSize: number;     // minimum viable chunk size
}

function chunkDocument(content: string, options: ChunkOptions): Chunk[] {
  const { docType, targetSize, overlapSize, minSize } = options;
  if (docType === "markdown") {
    return chunkMarkdown(content, targetSize, overlapSize);
  }
  if (docType === "code") {
    return chunkCode(content, targetSize);
  }
  return chunkProse(content, targetSize, overlapSize, minSize);
}

function chunkMarkdown(content: string, targetSize: number, overlap: number): Chunk[] {
  // Split at H1, H2, H3 boundaries
  const sections = content.split(/(?=^#{1,3} )/m).filter(s => s.trim().length > 0);
  const chunks: Chunk[] = [];
  for (const section of sections) {
    const tokenCount = estimateTokens(section);
    if (tokenCount <= targetSize) {
      chunks.push({ content: section, tokenCount });
    } else {
      // Section too large: split into paragraphs with overlap
      const header = section.match(/^#{1,3} .+/m)?.[0] || "";
      // Drop the header line from the body so it is not duplicated below
      const body = header ? section.replace(header, "") : section;
      const paragraphs = body.split(/\n\n+/).filter(p => p.trim().length > 0);
      let current = header;
      let currentTokens = estimateTokens(header);
      for (const para of paragraphs) {
        const paraTokens = estimateTokens(para);
        if (currentTokens + paraTokens > targetSize && currentTokens > 0) {
          chunks.push({ content: current.trim(), tokenCount: currentTokens });
          // Keep last N tokens as overlap
          const overlapText = getLastNTokens(current, overlap);
          current = overlapText + "\n\n" + para;
          currentTokens = estimateTokens(current);
        } else {
          current += "\n\n" + para;
          currentTokens += paraTokens;
        }
      }
      if (current.trim()) {
        chunks.push({ content: current.trim(), tokenCount: currentTokens });
      }
    }
  }
  return chunks;
}

Two additional practices matter here. First, add overlap between chunks. The last 100 tokens of chunk N appear at the start of chunk N+1. This ensures that content near chunk boundaries gets retrieved regardless of which chunk the retrieval system finds. Second, add contextual headers to orphaned chunks. If chunk 5 is a sub-section from a larger document, prepend the document title and section hierarchy so the chunk is interpretable on its own.
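The second practice can be sketched directly. The `ContextChunk` shape and its field names here are illustrative, not part of the pipeline above:

```typescript
// Sketch: prepend document title and section hierarchy so a chunk
// remains interpretable when retrieved on its own.
interface ContextChunk {
  content: string;
  documentTitle: string;
  sectionPath: string[]; // e.g. ["7. Exceptions", "7.4 Trial accounts"]
}

function addContextualHeader(chunk: ContextChunk): string {
  const breadcrumb = [chunk.documentTitle, ...chunk.sectionPath].join(" > ");
  return `[Source: ${breadcrumb}]\n\n${chunk.content}`;
}

const withHeader = addContextualHeader({
  content: "Exceptions apply to trial accounts.",
  documentTitle: "Refund Policy",
  sectionPath: ["7. Exceptions", "7.4 Trial accounts"],
});
```

A re-ranker or LLM reading this chunk now knows it came from section 7.4 of the Refund Policy, even though the surrounding document was never retrieved.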
Every chunk needs metadata. This is non-negotiable in production systems.
interface ChunkMetadata {
  chunkId: string;
  documentId: string;
  sourceUrl: string;
  documentTitle: string;
  section?: string;
  subsection?: string;
  documentType: "policy" | "faq" | "technical" | "legal" | "product";
  createdAt: Date;
  updatedAt: Date;
  version?: string;
  tags: string[];
  language: string;
  authoritative: boolean; // Is this the canonical source?
}

Metadata enables filtered retrieval, which is qualitatively more powerful than pure semantic search. Instead of finding the ten most semantically similar chunks across your entire knowledge base, you find the five most semantically similar chunks from the "policy" document type that were updated in the last 90 days. Filtered retrieval dramatically reduces false positive retrievals.
When building your indexing pipeline, invest the time to extract metadata automatically where possible. Document creation dates are usually in file metadata or document headers. Document types can often be classified by a lightweight LLM call at index time. Tags can be extracted from document structure. The upfront cost is worth it every time.
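As a sketch, heuristic rules can handle the obvious cases at index time; the regexes below are illustrative, and a lightweight LLM call would take over for documents these rules cannot place:

```typescript
// Sketch: cheap index-time document-type classification. The patterns
// are illustrative assumptions, not a production taxonomy.
type DocType = "policy" | "faq" | "technical" | "legal" | "product";

function classifyDocType(content: string): DocType {
  const head = content.slice(0, 2000).toLowerCase();
  if (/\bq:\s/.test(head) || head.includes("frequently asked")) return "faq";
  if (/\bpolicy\b|\bterms of\b/.test(head)) return "policy";
  if (/\bwhereas\b|\bhereby\b|\bindemnif/.test(head)) return "legal";
  if (/\bapi\b|\bfunction\b|\binstall\b/.test(head)) return "technical";
  // Ambiguous documents would fall through to an LLM classifier here
  return "product";
}
```

Running heuristics first and reserving the LLM call for ambiguous documents keeps indexing cost low while still producing the documentType field that filtered retrieval depends on.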
Pure semantic search fails on exact-match queries. If a user asks "what is the maximum file size limit?", the exact phrase "file size limit" might not be the most semantically central concept in the most relevant chunk. A keyword search would find it immediately. Pure keyword search fails on paraphrase queries. Hybrid search handles both.
async function hybridSearch(
  query: string,
  index: VectorIndex,
  options: {
    topK: number;
    semanticWeight: number; // 0-1, keyword weight is 1 - semanticWeight
    filters?: Partial<ChunkMetadata>;
  }
): Promise<RankedChunk[]> {
  const { topK, semanticWeight, filters } = options;
  const keywordWeight = 1 - semanticWeight;
  // Run both searches in parallel
  const [semanticResults, keywordResults] = await Promise.all([
    index.semanticSearch(query, { topK: topK * 2, filters }),
    index.keywordSearch(query, { topK: topK * 2, filters }),
  ]);
  // Reciprocal rank fusion
  const scores = new Map<string, number>();
  const k = 60; // RRF constant
  semanticResults.forEach((result, rank) => {
    const current = scores.get(result.chunkId) || 0;
    scores.set(result.chunkId, current + semanticWeight * (1 / (k + rank + 1)));
  });
  keywordResults.forEach((result, rank) => {
    const current = scores.get(result.chunkId) || 0;
    scores.set(result.chunkId, current + keywordWeight * (1 / (k + rank + 1)));
  });
  // Merge, sort, return top K
  const allChunks = new Map<string, Chunk>();
  [...semanticResults, ...keywordResults].forEach(r => allChunks.set(r.chunkId, r.chunk));
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .slice(0, topK)
    .map(([chunkId, score]) => ({ chunk: allChunks.get(chunkId)!, score }));
}

Reciprocal Rank Fusion is the right merging strategy here. It combines rankings from multiple retrieval methods without requiring you to normalize scores across different scoring systems. A result that ranks highly in both semantic and keyword search rises to the top. A result that only appears in one system is penalized.
Retrieval is fast but approximate. Re-ranking is slower but precise. Use them together.
The pattern: retrieve the top 20 candidates with fast hybrid search, then re-rank with a cross-encoder model that considers the query and each candidate together. Return the top 5 after re-ranking. This two-stage approach balances speed and relevance quality.
async function retrieveAndRerank(
  query: string,
  index: VectorIndex,
  reranker: CrossEncoderModel
): Promise<Chunk[]> {
  // Stage 1: Fast retrieval of candidates
  const candidates = await hybridSearch(query, index, {
    topK: 20,
    semanticWeight: 0.7,
  });
  // Stage 2: Precise re-ranking
  const rerankedScores = await reranker.score(
    candidates.map(c => ({ query, document: c.chunk.content }))
  );
  return candidates
    .map((c, i) => ({ ...c, rerankScore: rerankedScores[i] }))
    .sort((a, b) => b.rerankScore - a.rerankScore)
    .slice(0, 5)
    .map(c => c.chunk);
}

Cross-encoder models like Cohere Rerank or BGE-Reranker evaluate the full relationship between query and document, not just their embedding proximity. They catch cases where a document is not particularly close in embedding space but is actually the most relevant answer to the specific question asked.
What you put in the context window, and how you arrange it, affects response quality significantly.
Do not dump chunks in the order you retrieved them. Group chunks from the same document together. Present source metadata before each group so the agent knows what it is reading. Present the most relevant material first since LLMs attend more strongly to the beginning of long contexts.
function assembleContext(retrievedChunks: RankedChunk[]): string {
  // Group by source document
  const byDocument = new Map<string, RankedChunk[]>();
  for (const chunk of retrievedChunks) {
    const docId = chunk.chunk.metadata.documentId;
    const existing = byDocument.get(docId) || [];
    byDocument.set(docId, [...existing, chunk]);
  }
  const parts: string[] = [
    "Use the following sources to answer the question. Cite sources by document title.",
    "",
  ];
  for (const [docId, chunks] of byDocument) {
    const meta = chunks[0].chunk.metadata;
    const lastUpdated = meta.updatedAt.toLocaleDateString();
    parts.push(`### ${meta.documentTitle}`);
    parts.push(`Source: ${meta.sourceUrl} | Last updated: ${lastUpdated}`);
    parts.push("");
    // Sort by document position, not retrieval score
    const sorted = [...chunks].sort((a, b) =>
      (a.chunk.position || 0) - (b.chunk.position || 0)
    );
    for (const chunk of sorted) {
      parts.push(chunk.chunk.content);
      parts.push("");
    }
  }
  return parts.join("\n");
}

Users phrase questions in their own words. The knowledge base uses the organization's terminology. These often do not match.
A customer asks "why can't I add team members?" The relevant document talks about "seat limits" and "license tiers." A simple semantic search might miss this because the query and the document use completely different vocabulary to describe the same problem.
Query expansion uses a fast, cheap LLM call to generate alternative phrasings before retrieval:
async function expandQuery(query: string): Promise<string[]> {
  const response = await anthropic.messages.create({
    model: "claude-haiku-4-5",
    max_tokens: 300,
    messages: [
      {
        role: "user",
        content: `Generate 3 alternative search queries for: "${query}"
The alternatives should use different vocabulary but seek the same information.
Return as a JSON array of strings only, no explanation.`,
      },
    ],
  });
  try {
    const alternatives = JSON.parse(
      response.content[0].type === "text" ? response.content[0].text : "[]"
    );
    return [query, ...alternatives.slice(0, 3)];
  } catch {
    return [query];
  }
}

Run all queries in parallel and deduplicate results before re-ranking. The marginal cost of three extra embedding lookups is negligible compared to the improvement in retrieval recall.
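That fan-out-and-deduplicate step might look like this; `search` stands in for whatever retrieval call you use (the hybrid search above, for example), typed here as a plain async function so the sketch is self-contained:

```typescript
// Sketch: run the original query and its expansions in parallel,
// then deduplicate by chunkId, keeping the best score seen per chunk.
interface Hit {
  chunkId: string;
  score: number;
}

async function multiQueryRetrieve(
  queries: string[],
  search: (q: string) => Promise<Hit[]>
): Promise<Hit[]> {
  const resultSets = await Promise.all(queries.map(q => search(q)));
  const best = new Map<string, Hit>();
  for (const set of resultSets) {
    for (const hit of set) {
      const prev = best.get(hit.chunkId);
      if (!prev || hit.score > prev.score) best.set(hit.chunkId, hit);
    }
  }
  return Array.from(best.values()).sort((a, b) => b.score - a.score);
}
```

The deduplicated pool then feeds the re-ranker exactly as a single-query candidate set would.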
RAG is only as good as its data freshness. A stale knowledge base is worse than no RAG at all because it gives users false confidence in outdated information.
Event-driven re-indexing is the right architecture. When a document is created or updated in your source system, trigger a re-indexing job immediately. Not nightly. Not weekly. Immediately. Document updated at 2pm should be available for retrieval by 2:05pm.
// Webhook handler for document updates
async function handleDocumentUpdate(event: DocumentUpdateEvent) {
  const { documentId, documentUrl, eventType } = event;
  if (eventType === "deleted") {
    await index.deleteByDocumentId(documentId);
    return;
  }
  // Fetch, chunk, embed, upsert
  const content = await fetchDocument(documentUrl);
  const chunks = chunkDocument(content, {
    docType: detectDocType(content),
    targetSize: 500,  // example defaults; tune per corpus
    overlapSize: 100,
    minSize: 100,
  });
  const embeddings = await batchEmbed(chunks.map(c => c.content));
  const indexableChunks = chunks.map((chunk, i) => ({
    ...chunk,
    embedding: embeddings[i],
    metadata: {
      ...chunk.metadata,
      documentId,
      updatedAt: new Date(),
    },
  }));
  await index.upsertChunks(indexableChunks);
}

Version your index. When you change chunking strategy, embedding model, or metadata schema, create a new index and migrate. Never mutate the current production index with breaking changes. Keep the old index available for rollback for at least 24 hours.
Monitor retrieval quality continuously. Sample 20-30 queries per day, check what was retrieved, and ask whether a human would consider those the right sources. Automated relevance metrics like NDCG and MRR are useful, but human spot-checks catch problems automated metrics miss.
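MRR, for example, is only a few lines to compute over a labeled sample:

```typescript
// Mean Reciprocal Rank over a labeled query set: for each query, take
// the reciprocal of the rank at which the first relevant chunk appears
// (0 if it never appears), then average. A drop over time signals
// retrieval drift. The result shape here is illustrative.
function meanReciprocalRank(
  results: { retrievedIds: string[]; relevantId: string }[]
): number {
  const total = results.reduce((sum, r) => {
    const rank = r.retrievedIds.indexOf(r.relevantId);
    return sum + (rank === -1 ? 0 : 1 / (rank + 1));
  }, 0);
  return total / results.length;
}
```

Track this number alongside the human spot-checks; the metric tells you something changed, and the spot-checks tell you what.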
I worked with a SaaS company that had a support agent confidently giving customers wrong answers about their subscription limits. Classic naive RAG problem. The correct information existed in the knowledge base. The retrieval was surfacing similar but wrong chunks.
After rebuilding with structured chunking, filtered retrieval by document type, and re-ranking, accuracy on subscription-related questions went from 64% to 91%. More importantly, the failure mode changed. Instead of wrong answers with false confidence, the agent now correctly escalated the 9% of cases where retrieved context was insufficient. Users learned to trust the system because when it answered, it was right, and when it was not sure, it said so.
The goal of RAG is not to make your agent answer more questions. It is to make the answers it gives trustworthy. An agent that declines to answer when it lacks good retrieval is more valuable than one that always answers.
These retrieval patterns connect directly to agent memory architecture, where the same retrieval infrastructure serves both RAG knowledge bases and episodic memory stores. Build them with the same rigor.
Not all embedding models are equal, and the gap matters more than most people realize.
General-purpose embedding models like OpenAI text-embedding-3-large or Cohere embed-english-v3.0 work well for general knowledge bases. For specialized domains, domain-specific models substantially outperform general ones.
The test: take 100 representative queries from your use case. For each query, identify the ground-truth relevant document manually. Run all queries through candidate embedding models, run retrieval, and measure Recall@5 (did the relevant document appear in the top 5 results?). The model with the highest Recall@5 on your queries is the right model for your system, regardless of public benchmarks.
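The Recall@5 measurement itself is simple once you have the retrieval runs and ground-truth labels; this sketch assumes you have already collected both:

```typescript
// Recall@K over a labeled query set: the fraction of queries whose
// ground-truth relevant document appears in the top K retrieved results.
function recallAtK(
  runs: { retrievedDocIds: string[]; relevantDocId: string }[],
  k: number
): number {
  const hits = runs.filter(r =>
    r.retrievedDocIds.slice(0, k).includes(r.relevantDocId)
  ).length;
  return hits / runs.length;
}
```

Compute this per candidate embedding model over the same 100 queries and the comparison is direct, with no benchmark extrapolation needed.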
Do this test before you build your production system. Re-running it after indexing 50,000 documents is expensive.
Q: What is RAG (Retrieval-Augmented Generation)?
RAG enhances AI responses by retrieving relevant information from external knowledge bases before generating answers. Instead of relying solely on training data, RAG searches a vector database, finds relevant passages, and includes them in context. This grounds responses in actual data, reducing hallucinations.
Q: How does RAG work with AI agents?
AI agents use RAG to access domain-specific knowledge not in their training data. When needing information, the agent queries a vector database, retrieves similar passages, and uses them as context. This enables working with private data and recent information without fine-tuning.
Q: When should you use RAG vs fine-tuning?
Use RAG for grounding AI in specific, frequently changing data (documentation, knowledge bases). Use fine-tuning to change model behavior or reasoning patterns. RAG is cheaper, faster to implement, and easier to update. Most production systems use RAG.