
I tried fine-tuning a model on our company docs. Took a week. Cost $400. The model hallucinated our refund policy.
Then I built a RAG system in an afternoon. Cost $0.02 per query. Cites its sources. When our docs change, the answers change automatically.
RAG isn't sexy. It's plumbing. But it's plumbing that makes AI actually useful for your specific business.
Retrieval-Augmented Generation. Three words, one idea:
Before the AI answers a question, it searches your documents for relevant information. Then it answers using that information. Instead of guessing from training data, it reads your actual docs.
User: "What's your refund policy?"
Without RAG: AI guesses based on training data. Probably wrong.
With RAG:
1. Search your docs for "refund policy"
2. Find the actual refund policy document
3. Give the AI that document as context
4. AI answers based on YOUR policy, not a guess
That's the entire concept. Everything else is implementation detail.
Documents --> Chunking --> Embedding --> Vector DB
                                             |
User Query --> Embedding --> Search ---------+
                                |
                                v
                   Top K Results --> LLM --> Answer
Five components. Let me explain each one in plain language.
Chunking: Your documents are too long to use as context. Split them into smaller pieces. A paragraph or section at a time.
Embedding: Convert text into numbers (vectors) that capture meaning. "Refund policy" and "return merchandise" end up as similar vectors even though the words are different.
Vector Database: Stores the embeddings. When you search, it finds the vectors most similar to your query.
Search: Convert the user's question into an embedding, find the closest document chunks (the sketch after this list shows what "closest" means).
Generation: Give the LLM the retrieved chunks plus the question. It answers using the provided context.
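To make "closest" concrete: under the hood it's usually cosine similarity between vectors. Here's an illustrative sketch of the math the vector database runs for you at scale; you won't write this yourself:

// Cosine similarity: ~1.0 means "same meaning direction", ~0 means unrelated.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// The embeddings of "refund policy" and "return merchandise" score high
// against each other; either scores low against "reset my password".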
import { readdir, readFile } from "fs/promises";
import path from "path";

interface DocumentChunk {
  id: string;
  content: string;
  metadata: {
    source: string;
    title: string;
    section: string;
    chunkIndex: number;
  };
  score?: number; // similarity score, filled in at retrieval time
}

async function loadDocuments(directory: string): Promise<DocumentChunk[]> {
  const files = await readdir(directory, { recursive: true });
  const mdFiles = files.filter((f) => f.toString().endsWith(".md"));

  const chunks: DocumentChunk[] = [];
  for (const file of mdFiles) {
    const filePath = path.join(directory, file.toString());
    const content = await readFile(filePath, "utf-8");
    const fileChunks = chunkDocument(content, filePath);
    chunks.push(...fileChunks);
  }
  return chunks;
}

This is where most RAG implementations fail. Bad chunking means bad retrieval means bad answers.
function chunkDocument(
  content: string,
  source: string,
  maxChunkSize: number = 800,
  overlap: number = 100
): DocumentChunk[] {
  const chunks: DocumentChunk[] = [];

  // Split by headers first (semantic boundaries)
  const sections = content.split(/\n(?=#{1,3} )/);

  let chunkIndex = 0;
  for (const section of sections) {
    const title = section.match(/^#{1,3} (.+)/)?.[1] || "Untitled";

    if (section.length <= maxChunkSize) {
      // Section fits in one chunk
      chunks.push({
        id: `${source}-${chunkIndex}`,
        content: section.trim(),
        metadata: {
          source,
          title,
          section: title,
          chunkIndex: chunkIndex++,
        },
      });
    } else {
      // Split long sections by paragraphs with overlap
      const paragraphs = section.split(/\n\n+/);
      let currentChunk = "";

      for (const paragraph of paragraphs) {
        if ((currentChunk + paragraph).length > maxChunkSize && currentChunk) {
          chunks.push({
            id: `${source}-${chunkIndex}`,
            content: currentChunk.trim(),
            metadata: {
              source,
              title,
              section: title,
              chunkIndex: chunkIndex++,
            },
          });
          // Keep overlap from previous chunk
          const words = currentChunk.split(" ");
          currentChunk =
            words.slice(-Math.floor(overlap / 5)).join(" ") + "\n\n" + paragraph;
        } else {
          currentChunk += (currentChunk ? "\n\n" : "") + paragraph;
        }
      }

      // Don't forget the last chunk
      if (currentChunk.trim()) {
        chunks.push({
          id: `${source}-${chunkIndex}`,
          content: currentChunk.trim(),
          metadata: {
            source,
            title,
            section: title,
            chunkIndex: chunkIndex++,
          },
        });
      }
    }
  }

  return chunks;
}

Key decisions: split on headers first so chunks follow semantic boundaries, cap chunks at around 800 characters, and carry roughly 100 characters of overlap between chunks so context isn't lost at the boundaries.
I use Chroma for local development and Pinecone for production. But any vector database works.
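The indexing and retrieval code below assumes a vectorDb client is already set up. As a rough sketch only, here's what that could look like with the Pinecone Node SDK, wrapped thinly so the single-record calls below line up with Pinecone's array-based upsert (the index name is made up; check the current SDK docs, and Chroma's client is analogous):

import { Pinecone } from "@pinecone-database/pinecone";

// Assumes an existing index created with the same dimension as your
// embedding model (1536 for text-embedding-3-small).
const pc = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });
const index = pc.index("company-docs"); // hypothetical index name

const vectorDb = {
  // Pinecone's upsert takes an array of records, so wrap the single record
  upsert: (record: { id: string; values: number[]; metadata: Record<string, any> }) =>
    index.upsert([record]),
  query: (params: { vector: number[]; topK: number; includeMetadata: boolean }) =>
    index.query(params),
};

The wrapper isn't required; it just keeps the later snippets database-agnostic.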
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic();

async function embedText(text: string): Promise<number[]> {
  // Use an embedding model.
  // Anthropic doesn't have embeddings yet, so we use OpenAI or Cohere.
  const response = await fetch("https://api.openai.com/v1/embeddings", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "text-embedding-3-small",
      input: text,
    }),
  });
  const data = await response.json();
  return data.data[0].embedding;
}

async function indexChunks(chunks: DocumentChunk[]) {
  for (const chunk of chunks) {
    const embedding = await embedText(chunk.content);
    await vectorDb.upsert({
      id: chunk.id,
      values: embedding,
      metadata: {
        content: chunk.content,
        ...chunk.metadata,
      },
    });
  }
  console.log(`Indexed ${chunks.length} chunks`);
}

text-embedding-3-small costs $0.02 per million tokens. For most document sets, indexing costs less than a dollar.
async function retrieveRelevant(
  query: string,
  topK: number = 5
): Promise<DocumentChunk[]> {
  const queryEmbedding = await embedText(query);

  const results = await vectorDb.query({
    vector: queryEmbedding,
    topK,
    includeMetadata: true,
  });

  return results.matches.map((match) => ({
    id: match.id,
    content: match.metadata!.content as string,
    metadata: {
      source: match.metadata!.source as string,
      title: match.metadata!.title as string,
      section: match.metadata!.section as string,
      chunkIndex: match.metadata!.chunkIndex as number,
    },
    score: match.score,
  }));
}

topK = 5 means "give me the five most relevant chunks." For most questions, 3-5 is enough. More than 10 and you're stuffing the context window with noise.
async function answerQuestion(userQuery: string): Promise<string> {
  const relevantChunks = await retrieveRelevant(userQuery, 5);

  const context = relevantChunks
    .map(
      (chunk) =>
        `[Source: ${chunk.metadata.source} - ${chunk.metadata.title}]\n${chunk.content}`
    )
    .join("\n\n---\n\n");

  const response = await anthropic.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 1024,
    system: `You are a helpful assistant that answers questions based on the provided documentation.
Rules:
- ONLY answer based on the provided context. If the context doesn't contain the answer, say so.
- Cite your sources by mentioning the document name.
- Be concise. Answer in 2-3 sentences unless more detail is needed.
- Never make up information that isn't in the context.`,
    messages: [
      {
        role: "user",
        content: `Context documents:
${context}
---
Question: ${userQuery}`,
      },
    ],
  });

  return response.content[0].type === "text" ? response.content[0].text : "";
}

The system prompt is critical. "Only answer based on the provided context" prevents hallucination. "Cite your sources" builds user trust.
How do you know your RAG system is actually good?
Build a test set. Twenty questions with known correct answers. Run them through the system weekly.
const testCases = [
  {
    question: "What is the refund policy?",
    expectedSource: "policies/refund.md",
    expectedContent: ["30 days", "full refund", "no questions asked"],
  },
  {
    question: "How do I reset my password?",
    expectedSource: "help/account.md",
    expectedContent: ["settings", "security", "email verification"],
  },
];

async function evaluateRAG() {
  let passed = 0;

  for (const test of testCases) {
    const chunks = await retrieveRelevant(test.question);

    // Check if the right source was retrieved
    const sourceFound = chunks.some((c) =>
      c.metadata.source.includes(test.expectedSource)
    );

    // Check if key content was in retrieved chunks
    const allContent = chunks.map((c) => c.content).join(" ");
    const contentFound = test.expectedContent.every((keyword) =>
      allContent.toLowerCase().includes(keyword.toLowerCase())
    );

    if (sourceFound && contentFound) passed++;
    else {
      console.log(`FAILED: "${test.question}"`);
      console.log(`  Source found: ${sourceFound}`);
      console.log(`  Content found: ${contentFound}`);
    }
  }

  console.log(
    `\nRetrieval accuracy: ${passed}/${testCases.length} (${Math.round((passed / testCases.length) * 100)}%)`
  );
}

If retrieval accuracy drops below 80%, your chunking strategy needs work. If it's above 90% but answers are still bad, your generation prompt needs work.
Re-indexing. When documents change, re-index them. I run this on a webhook from our CMS. Document updated? Re-chunk, re-embed, re-index. Old chunks deleted.
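A rough sketch of that webhook handler, assuming an Express server and a made-up CMS payload shape (the delete step depends on your vector database's API):

import express from "express";

const app = express();
app.use(express.json());

// Hypothetical CMS webhook: fires whenever a document is published or updated.
app.post("/webhooks/document-updated", async (req, res) => {
  const { path: docPath, content } = req.body; // assumed payload shape

  // 1. Remove the old chunks for this document (implementation depends on your
  //    vector DB, e.g. delete by a `source` metadata filter or by id prefix).
  await deleteChunksForSource(docPath);

  // 2. Re-chunk, re-embed, re-index the new version.
  const chunks = chunkDocument(content, docPath);
  await indexChunks(chunks);

  res.sendStatus(200);
});

// Placeholder: wire this up to your vector database's delete API.
async function deleteChunksForSource(source: string): Promise<void> {
  /* ... */
}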
Caching. Same question asked repeatedly? Cache the answer. Invalidate when the underlying documents change.
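A minimal in-memory sketch, keyed on the normalized question (in production you'd more likely reach for Redis with a TTL):

const answerCache = new Map<string, string>();

async function answerQuestionCached(userQuery: string): Promise<string> {
  const key = userQuery.trim().toLowerCase();

  const cached = answerCache.get(key);
  if (cached) return cached;

  const answer = await answerQuestion(userQuery);
  answerCache.set(key, answer);
  return answer;
}

// Call this from the re-indexing webhook so stale answers don't outlive the docs.
function invalidateAnswerCache(): void {
  answerCache.clear();
}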
Hybrid search. Vector search alone misses exact matches. Combine with keyword search for best results. Most vector databases support this.
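If yours doesn't, a crude sketch is to run a naive keyword pass over the raw chunks and merge it with the vector results; the boost weight here is arbitrary, and a native hybrid or BM25 option will do better:

async function hybridRetrieve(
  query: string,
  allChunks: DocumentChunk[],
  topK: number = 5
): Promise<DocumentChunk[]> {
  // Semantic side: over-fetch from the vector DB.
  const vectorResults = await retrieveRelevant(query, topK * 2);

  // Keyword side: naive exact-term matching over the raw chunks.
  const terms = query.toLowerCase().split(/\s+/).filter((t) => t.length > 2);
  const keywordHits = new Set(
    allChunks
      .filter((c) => terms.some((t) => c.content.toLowerCase().includes(t)))
      .map((c) => c.id)
  );

  // Merge: boost chunks both searches agree on, then keep the best topK.
  return vectorResults
    .map((c) => ({ ...c, score: (c.score ?? 0) + (keywordHits.has(c.id) ? 0.2 : 0) }))
    .sort((a, b) => (b.score ?? 0) - (a.score ?? 0))
    .slice(0, topK);
}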
Chunk size tuning. Start with 500-800 tokens. If answers lack detail, increase. If retrieval is noisy, decrease. Test with your evaluation set.
Cost management. Embedding is cheap. The LLM call is the expensive part. Use Haiku for simple Q&A, Sonnet for complex analysis. Route based on query complexity.
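A sketch of that routing with a deliberately simple heuristic (the Haiku alias is an assumption; the Sonnet ID comes from the snippet above):

function pickModel(userQuery: string): string {
  // Crude heuristic: short, factual questions go to the cheap model,
  // long or analytical ones go to the stronger model.
  const looksComplex =
    userQuery.length > 200 ||
    /\b(compare|analyze|why|explain|tradeoff)\b/i.test(userQuery);
  return looksComplex ? "claude-sonnet-4-20250514" : "claude-3-5-haiku-latest";
}

// Then pass pickModel(userQuery) as the `model` in answerQuestion's
// anthropic.messages.create call instead of hardcoding Sonnet.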
RAG is the 80/20 of making AI useful for your business. Eighty percent of the value for twenty percent of the complexity of fine-tuning.
Build it. Test it. Deploy it. Your support team will thank you.
