Written by Gareth Simono, Founder and CEO of Agentik {OS}. Full-stack developer and AI architect with years of experience shipping production applications across SaaS, mobile, and enterprise platforms. Gareth orchestrates 267 specialized AI agents to deliver production software 10x faster than traditional development teams.
Basic RAG works in demos and breaks in production. Here is what naive implementations get wrong and how to build the version that handles real users.

The basic RAG tutorial is everywhere.
Embed documents. Store in a vector database. At query time, embed the query. Find similar documents. Stuff them in context. Ask the model.
This works. For demos. For documents you hand-picked. For the test set you wrote yourself.
For production systems with diverse documents, complex questions, and users who ask things you never anticipated, basic RAG fails in specific, predictable ways. The failures are not random. They are structural weaknesses in the naive approach that become obvious under real load.
This guide fixes those failures with code you can actually run.
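For reference, the naive pipeline can be sketched in a few lines. Here embed, vectorSearch, and askModel are hypothetical stubs standing in for your embedding provider, vector database, and model call; chunkEvery is the fixed-size split this guide argues against:

```typescript
// The naive pipeline, end to end. This is the version that breaks.
function chunkEvery(text: string, size: number): string[] {
  const out: string[] = [];
  for (let i = 0; i < text.length; i += size) out.push(text.slice(i, i + size));
  return out;
}

// Hypothetical stubs: swap in your actual embedding provider, vector DB, and model client.
declare function embed(text: string): Promise<number[]>;
declare function vectorSearch(queryVec: number[], topK: number): Promise<string[]>;
declare function askModel(prompt: string): Promise<string>;

async function naiveRAG(query: string, docs: string[]): Promise<string> {
  // Index: fixed-size chunks, one general-purpose embedding for everything
  const chunks = docs.flatMap(d => chunkEvery(d, 1000));
  // (a real system would embed and store `chunks` here)
  const queryVec = await embed(query);
  const top = await vectorSearch(queryVec, 3); // blunt top-K
  // Context stuffing: chunks dumped in with no structure or provenance
  return askModel(`${top.join("\n\n")}\n\nQuestion: ${query}`);
}
```

Each section below replaces one step of this sketch with something that survives real users.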
Four structural failures that basic RAG tutorials do not address.
Chunking destroys semantic units. Splitting at fixed character counts breaks sentences mid-thought, separates tables from their headers, and severs list items from the parent sentence that gives them context. Retrieved chunks that start in the middle of a thought are useless.
Top-K retrieval is a blunt instrument. Fetching the three closest chunks by cosine similarity misses chunks that are relevant in different ways. A question about a policy change needs the current policy, the previous policy, and the changelog entry. Cosine similarity to a single query vector will not reliably retrieve all three.
Embedding quality is inconsistent across content types. Code, tables, and structured data embed differently than prose. Using a single general-purpose embedding model for all content types produces uneven retrieval quality across your document corpus.
Context stuffing confuses the model. Dumping retrieved chunks directly into context without structure makes it hard for the model to understand which information comes from where and whether sources might contradict each other.
Each failure has a concrete fix.
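The sections that follow show code for chunking, retrieval, and context assembly; the embedding-quality failure gets no dedicated snippet, so here is a minimal sketch of one mitigation: route content types to different embedding models and linearize tables before embedding. The model ids are placeholders, not recommendations, and DocType is duplicated from the chunking code so the sketch stands alone:

```typescript
type DocType = "prose" | "code" | "markdown" | "structured";

// Placeholder model ids -- substitute whatever your provider offers.
const EMBEDDING_MODEL_BY_TYPE: Record<DocType, string> = {
  prose: "general-text-embedding",
  markdown: "general-text-embedding",
  code: "code-embedding", // a code-tuned model, if available
  structured: "general-text-embedding",
};

function selectEmbeddingModel(docType: DocType): string {
  return EMBEDDING_MODEL_BY_TYPE[docType];
}

// Pipe-table syntax embeds poorly with prose-trained models; linearizing
// rows into plain word sequences before embedding often helps.
function preprocessForEmbedding(content: string, docType: DocType): string {
  if (docType === "structured") {
    return content
      .split("\n")
      .map(line => line.replace(/\|/g, " ").replace(/\s+/g, " ").trim())
      .filter(Boolean)
      .join(". ");
  }
  return content;
}
```

Note that you embed the preprocessed text but store and display the original chunk content.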
Chunk by semantic unit, not character count. The implementation varies by document format.
type DocType = "prose" | "code" | "markdown" | "structured";
function chunkBySemanticUnit(document: string, docType: DocType): string[] {
switch (docType) {
case "markdown":
// Split at headers, keeping header with its content
const sections = document.split(/(?=^#{1,3} )/m);
return sections
.filter(s => s.trim().length > 100)
.flatMap(section => {
// Section too large? Split at paragraph boundaries
if (section.length > 2000) {
return section
.split(/\n\n+/)
.filter(p => p.trim().length > 50);
}
return [section];
});
case "code":
// Split at function/class boundaries
return document
.split(/(?=\n(?:export\s+)?(?:function|class|const|async\s+function|interface|type)\s)/)
.filter(s => s.trim().length > 50);
case "structured":
// For tables, keep each row with header
const lines = document.split("\n");
const headerLine = lines.find(l => l.includes("|"));
if (!headerLine) return [document];
return lines
.filter(l => l.includes("|") && l !== headerLine && !/^\|[\s:\-|]+\|$/.test(l))
.map(row => `${headerLine}\n${row}`);
default: // prose
// Split at paragraph boundaries, merge short ones
const paragraphs = document
.split(/\n\n+/)
.filter(p => p.trim().length > 50);
const chunks: string[] = [];
let current = "";
for (const para of paragraphs) {
if ((current + para).length > 1500 && current.length > 0) {
chunks.push(current.trim());
current = para;
} else {
current += "\n\n" + para;
}
}
if (current.trim()) chunks.push(current.trim());
return chunks;
}
}

Every chunk gets rich metadata:
interface DocumentChunk {
id: string;
documentId: string;
content: string;
metadata: {
source: string;
title: string;
section?: string;
docType: DocType;
pageNumber?: number;
lastUpdated: Date;
contentHash: string; // For detecting when chunks need re-indexing
};
embedding: number[];
}

Metadata is what enables filtered retrieval. "Find chunks from the API documentation updated after January 2026" is dramatically more useful than "find chunks that seem related to the API."
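The retrieval code later in this article calls applyFilters with a MetadataFilters argument without defining either. One plausible shape, assumed here rather than prescribed:

```typescript
type DocType = "prose" | "code" | "markdown" | "structured";

interface ChunkMetadata {
  source: string;
  title: string;
  docType: DocType;
  lastUpdated: Date;
}

// Each field narrows the search space; omit a field to skip that filter.
interface MetadataFilters {
  docType?: DocType;
  sourcePrefix?: string; // e.g. restrict to one documentation tree
  updatedAfter?: Date;
}

function applyFilters<T extends { metadata: ChunkMetadata }>(
  chunks: T[],
  filters: MetadataFilters
): T[] {
  return chunks.filter(({ metadata }) => {
    if (filters.docType && metadata.docType !== filters.docType) return false;
    if (filters.sourcePrefix && !metadata.source.startsWith(filters.sourcePrefix)) return false;
    if (filters.updatedAfter && metadata.lastUpdated < filters.updatedAfter) return false;
    return true;
  });
}
```

In a real vector database these become query predicates evaluated server-side; the in-memory version above only illustrates the semantics.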
Pure semantic search misses exact keyword matches. Pure keyword search misses semantic similarity. Combine them.
interface SearchResult {
chunk: DocumentChunk;
semanticScore: number;
keywordScore: number;
combinedScore: number;
}
async function hybridSearch(
query: string,
chunks: DocumentChunk[],
options: { topK: number; semanticWeight?: number; filters?: MetadataFilters }
): Promise<SearchResult[]> {
const { topK, semanticWeight = 0.7, filters } = options;
const keywordWeight = 1 - semanticWeight;
// Apply metadata filters first to narrow the search space
const filteredChunks = filters ? applyFilters(chunks, filters) : chunks;
// Semantic search via embedding similarity
const queryEmbedding = await embedText(query);
const results: SearchResult[] = filteredChunks.map(chunk => ({
chunk,
semanticScore: cosineSimilarity(queryEmbedding, chunk.embedding),
keywordScore: 0,
combinedScore: 0,
}));
// BM25-style keyword scoring
const queryTerms = query.toLowerCase().split(/\s+/).filter(t => t.length > 2);
for (const result of results) {
const content = result.chunk.content.toLowerCase();
const words = content.split(/\s+/).length;
let score = 0;
for (const term of queryTerms) {
// Escape regex metacharacters; note `\\b`, not `\b` (a backspace), inside a template literal
const escaped = term.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
const occurrences = (content.match(new RegExp(`\\b${escaped}\\b`, "g")) || []).length;
if (occurrences > 0) {
// TF-IDF inspired scoring
score += (1 + Math.log(occurrences)) / Math.sqrt(words);
}
}
result.keywordScore = score;
}
// Normalize scores to [0, 1] range
const maxSemantic = Math.max(...results.map(r => r.semanticScore), 0.001);
const maxKeyword = Math.max(...results.map(r => r.keywordScore), 0.001);
for (const result of results) {
result.combinedScore =
semanticWeight * (result.semanticScore / maxSemantic) +
keywordWeight * (result.keywordScore / maxKeyword);
}
return results
.sort((a, b) => b.combinedScore - a.combinedScore)
.slice(0, topK);
}

Users ask questions in their own words. Your knowledge base uses different words. Bridge the gap by expanding queries before retrieval.
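Two helpers the retrieval code assumes are never shown: embedText and cosineSimilarity. Cosine similarity is a few lines; the embedding call is provider-specific, so it is only stubbed here:

```typescript
// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// embedText is provider-specific; wire this to your embeddings API of choice.
async function embedText(text: string): Promise<number[]> {
  throw new Error("connect to your embedding provider");
}
```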
async function expandQuery(originalQuery: string): Promise<string[]> {
const response = await anthropic.messages.create({
model: "claude-haiku-4-5",
max_tokens: 300,
system: `Generate 3 alternative phrasings of this search query that would match
different but relevant documents. Focus on:
- Different terminology for the same concept
- More specific versions of the query
- Related questions that would have useful overlapping answers
Return a JSON array of strings only.`,
messages: [{ role: "user", content: originalQuery }]
});
try {
const text = response.content[0].type === "text" ? response.content[0].text : "[]";
const alternatives = JSON.parse(text);
return [originalQuery, ...alternatives.slice(0, 3)];
} catch {
return [originalQuery]; // Fall back to original if parsing fails
}
}
async function retrieveWithExpansion(
query: string,
chunks: DocumentChunk[],
topK: number
): Promise<SearchResult[]> {
const expandedQueries = await expandQuery(query);
// Run parallel searches for each query variant
const allResultArrays = await Promise.all(
expandedQueries.map(q => hybridSearch(q, chunks, { topK, semanticWeight: 0.7 }))
);
// Deduplicate: keep highest score for each chunk seen
const bestScores = new Map<string, SearchResult>();
for (const results of allResultArrays) {
for (const result of results) {
const existing = bestScores.get(result.chunk.id);
if (!existing || result.combinedScore > existing.combinedScore) {
bestScores.set(result.chunk.id, result);
}
}
}
return Array.from(bestScores.values())
.sort((a, b) => b.combinedScore - a.combinedScore)
.slice(0, topK);
}

Vector search is fast but approximate. Re-rank the top candidates with a more precise relevance model.
async function rerankResults(
query: string,
candidates: SearchResult[],
topN: number
): Promise<SearchResult[]> {
if (candidates.length <= topN) return candidates;
// Use LLM to score relevance of each candidate
const scoringPrompt = `Score the relevance of each passage to the query.
Query: "${query}"
${candidates.map((r, i) => `[${i}] ${r.chunk.content.substring(0, 300)}...`).join("\n\n")}
Return a JSON array of numbers from 0-10 representing relevance scores for each passage in order.`;
const response = await anthropic.messages.create({
model: "claude-haiku-4-5",
max_tokens: 200,
messages: [{ role: "user", content: scoringPrompt }]
});
try {
const text = response.content[0].type === "text" ? response.content[0].text : "[]";
const scores: number[] = JSON.parse(text);
return candidates
.map((result, i) => ({ ...result, rerankedScore: scores[i] ?? 0 }))
.sort((a, b) => b.rerankedScore - a.rerankedScore)
.slice(0, topN);
} catch {
return candidates.slice(0, topN); // Fall back to original ranking
}
}

Do not dump chunks into context. Assemble them with structure that helps the model understand provenance.
function assembleStructuredContext(
query: string,
retrievedChunks: SearchResult[]
): string {
// Group chunks by source document
const byDocument = new Map<string, SearchResult[]>();
for (const result of retrievedChunks) {
const docId = result.chunk.documentId;
if (!byDocument.has(docId)) byDocument.set(docId, []);
byDocument.get(docId)!.push(result);
}
const contextParts: string[] = [
`RETRIEVED CONTEXT FOR QUERY: "${query}"`,
"=".repeat(50),
];
let sourceIndex = 1;
for (const [, docResults] of byDocument) {
const meta = docResults[0].chunk.metadata;
contextParts.push(`\n[Source ${sourceIndex}: ${meta.title} | ${meta.docType} | Updated: ${meta.lastUpdated.toLocaleDateString()}]`);
// Sort chunks within document by page/section order
const sorted = docResults.sort(
(a, b) => (a.chunk.metadata.pageNumber || 0) - (b.chunk.metadata.pageNumber || 0)
);
for (const result of sorted) {
contextParts.push(result.chunk.content);
}
sourceIndex++;
}
return contextParts.join("\n\n");
}

Putting it all together:
interface RAGOptions {
topKRetrieval?: number; // Initial candidates to retrieve
topNRerank?: number; // Final chunks after re-ranking
semanticWeight?: number; // Balance semantic vs keyword search
enableQueryExpansion?: boolean;
metadataFilters?: MetadataFilters;
}
async function productionRAGQuery(
userQuery: string,
knowledgeBase: DocumentChunk[],
options: RAGOptions = {}
): Promise<{ answer: string; sources: string[] }> {
const {
topKRetrieval = 12,
topNRerank = 5,
semanticWeight = 0.7,
enableQueryExpansion = true,
metadataFilters,
} = options;
// Step 1: Retrieve candidates (with or without query expansion)
const candidates = enableQueryExpansion
? await retrieveWithExpansion(userQuery, knowledgeBase, topKRetrieval)
: await hybridSearch(userQuery, knowledgeBase, { topK: topKRetrieval, semanticWeight, filters: metadataFilters });
// Step 2: Re-rank to get final context
const finalChunks = await rerankResults(userQuery, candidates, topNRerank);
// Step 3: Assemble structured context
const context = assembleStructuredContext(userQuery, finalChunks);
// Step 4: Generate grounded response
const response = await anthropic.messages.create({
model: "claude-sonnet-4-6",
max_tokens: 1500,
system: `You are a helpful assistant. Answer the user's question based ONLY on the provided context.
If the context does not contain enough information to answer fully, say what you know from the context and clearly indicate what is missing.
Always cite sources by referring to the [Source N] labels in the context.`,
messages: [{
role: "user",
content: `${context}\n\n${"-".repeat(50)}\n\nQuestion: ${userQuery}`
}]
});
const answer = response.content[0].type === "text" ? response.content[0].text : "";
// Map [Source N] labels back to documents. Numbering must match
// assembleStructuredContext, which assigns one number per document, not per chunk.
const docOrder = [...new Map(finalChunks.map(r => [r.chunk.documentId, r])).values()];
const sourcesUsed = docOrder
.filter((_, i) => answer.includes(`Source ${i + 1}`))
.map(r => `${r.chunk.metadata.title} (${r.chunk.metadata.source})`);
return { answer, sources: sourcesUsed };
}

A RAG system is only as good as its data. Build the indexing pipeline with updates in mind from the start.
class DocumentIndexer {
async indexDocument(document: RawDocument): Promise<void> {
// Detect document type for appropriate chunking
const docType = this.detectDocType(document);
// Chunk by semantic unit
const textChunks = chunkBySemanticUnit(document.content, docType);
// Generate embeddings in batches to respect rate limits
const batchSize = 20;
const chunks: DocumentChunk[] = [];
for (let i = 0; i < textChunks.length; i += batchSize) {
const batch = textChunks.slice(i, i + batchSize);
const embeddings = await this.embedBatch(batch);
for (let j = 0; j < batch.length; j++) {
chunks.push({
id: `${document.id}-chunk-${i + j}`,
documentId: document.id,
content: batch[j],
metadata: {
source: document.url,
title: document.title,
docType,
lastUpdated: document.updatedAt,
contentHash: this.hash(batch[j]),
},
embedding: embeddings[j],
});
}
}
// Remove old chunks and insert new ones atomically
await this.vectorDB.transaction(async (tx) => {
await tx.deleteWhere({ documentId: document.id });
await tx.insertBatch(chunks);
});
console.log(`Indexed ${chunks.length} chunks for: ${document.title}`);
}
async handleDocumentDeletion(documentId: string): Promise<void> {
await this.vectorDB.deleteWhere({ documentId });
console.log(`Removed chunks for deleted document: ${documentId}`);
}
}

For production vector database setup, Pinecone, Weaviate, and Qdrant each have strong operational characteristics. The vector databases explained for developers guide covers the tradeoffs in detail.
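One refinement worth noting: the chunk metadata stores a contentHash "for detecting when chunks need re-indexing", but the indexer above re-embeds everything unconditionally. A sketch of a diffing step that could sit in front of embedBatch, reusing the `${documentId}-chunk-${n}` id convention from the indexer:

```typescript
// Compare the hashes stored for a document against the hashes of freshly
// chunked content, and plan the minimal re-embedding work.
interface ReindexPlan {
  toEmbed: number[];   // indices of chunks that changed or are new
  toDelete: string[];  // stored chunk ids that no longer exist
  unchanged: number;
}

function planReindex(
  storedHashesById: Map<string, string>, // chunk id -> contentHash in the DB
  documentId: string,
  newHashes: string[]                    // hash per fresh chunk, in order
): ReindexPlan {
  const toEmbed: number[] = [];
  const seen = new Set<string>();
  let unchanged = 0;
  newHashes.forEach((hash, i) => {
    const id = `${documentId}-chunk-${i}`;
    seen.add(id);
    if (storedHashesById.get(id) === hash) unchanged++;
    else toEmbed.push(i);
  });
  const toDelete = [...storedHashesById.keys()].filter(id => !seen.has(id));
  return { toEmbed, toDelete, unchanged };
}
```

Because ids are positional, inserting a paragraph early in a document shifts every later chunk and defeats the diff; hashing content rather than position in the id would avoid that, at the cost of a slightly more involved delete step.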
Before shipping, measure quality:
interface RAGEvaluation {
retrievalPrecision: number; // Fraction of retrieved chunks that are relevant
retrievalRecall: number; // Fraction of relevant chunks that were retrieved
answerFaithfulness: number; // Fraction of answer claims supported by context
answerRelevance: number; // How well the answer addresses the question
}
async function evaluateRAG(
testCases: Array<{ query: string; relevantDocIds: string[]; expectedAnswer: string }>,
knowledgeBase: DocumentChunk[]
): Promise<RAGEvaluation> {
const results = await Promise.all(
testCases.map(async (tc) => {
const { answer, sources } = await productionRAGQuery(tc.query, knowledgeBase);
const retrievedIds = sources.map(s => extractDocId(s));
const truePositives = retrievedIds.filter(id => tc.relevantDocIds.includes(id)).length;
const precision = truePositives / Math.max(retrievedIds.length, 1);
const recall = truePositives / Math.max(tc.relevantDocIds.length, 1);
return { precision, recall, answer, expectedAnswer: tc.expectedAnswer };
})
);
return {
retrievalPrecision: average(results.map(r => r.precision)),
retrievalRecall: average(results.map(r => r.recall)),
answerFaithfulness: 0, // Requires LLM-as-judge evaluation
answerRelevance: 0, // Requires LLM-as-judge evaluation
};
}

Target benchmarks: retrieval precision above 0.75, recall above 0.70. If you are below these, the embedding model or chunking strategy is usually the culprit, not the retrieval algorithm.
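The two zeroed metrics need LLM-as-judge scoring. One possible sketch, reusing the anthropic client assumed throughout the article; the claims/supported JSON verdict and the resulting 0-1 ratio are choices of this sketch, not a standard metric:

```typescript
// Parse a judge model's {"claims": N, "supported": M} verdict into a 0-1 score.
function faithfulnessFromJudgment(raw: string): number {
  try {
    const { claims, supported } = JSON.parse(raw);
    if (typeof claims !== "number" || typeof supported !== "number") return 0;
    return claims > 0 ? Math.min(supported / claims, 1) : 1;
  } catch {
    return 0; // score unparseable judgments as 0 rather than guessing
  }
}

// The judge call itself, using the anthropic client the article assumes.
declare const anthropic: any;

async function judgeFaithfulness(context: string, answer: string): Promise<number> {
  const response = await anthropic.messages.create({
    model: "claude-haiku-4-5",
    max_tokens: 100,
    messages: [{
      role: "user",
      content:
        `Context:\n${context}\n\nAnswer:\n${answer}\n\n` +
        `Count the factual claims in the answer and how many the context supports. ` +
        `Return only JSON: {"claims": N, "supported": M}`,
    }],
  });
  const text = response.content[0].type === "text" ? response.content[0].text : "{}";
  return faithfulnessFromJudgment(text);
}
```

Run the judge on a fixed test set, not live traffic, so scores are comparable across pipeline changes.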
Q: How do you build a RAG system?
Building a RAG system involves four steps: chunking (splitting documents into semantic passages), embedding (converting text to vector representations), indexing (storing embeddings in a vector database like Pinecone or Chroma), and retrieval (finding relevant passages for each query using similarity search, then including them in the AI prompt as context).
Q: What vector database should I use for RAG?
Popular choices include Pinecone (fully managed, easy to scale), Chroma (open source, good for development), Weaviate (hybrid search combining vector and keyword), and pgvector (PostgreSQL extension, good if you already use Postgres). Choose based on scale (Pinecone for production), flexibility (Weaviate for hybrid search), or simplicity (Chroma for prototypes).
Q: How do you improve RAG retrieval quality?
Improve RAG quality through better chunking strategies (semantic chunking instead of fixed-size), hybrid search (combining vector similarity with keyword matching), reranking (using a cross-encoder to refine initial results), metadata filtering (narrowing search by document type or date), and query expansion (generating multiple search queries from one user question).