Written by Gareth Simono, Founder and CEO of Agentik {OS}. Full-stack developer and AI architect with years of experience shipping production applications across SaaS, mobile, and enterprise platforms. Gareth orchestrates 267 specialized AI agents to deliver production software 10x faster than traditional development teams.
Basic RAG works in demos and breaks in production. Here is what naive implementations get wrong and how to build the version that handles real users.

The basic RAG tutorial is everywhere.
Embed documents. Store in a vector database. At query time, embed the query. Find similar documents. Stuff them in context. Ask the model.
This works. For demos. For documents you hand-picked. For the test set you wrote yourself.
For production systems with diverse documents, complex questions, and users who ask things you never anticipated, basic RAG fails in specific, predictable ways. The failures are not random. They are structural weaknesses in the naive approach that become obvious under real load.
This guide fixes those failures with code you can actually run.
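For reference, the naive pipeline can be sketched in a few lines. Here embed, vectorSearch, and askModel are hypothetical stubs standing in for your embedding provider, vector database, and model call; chunkEvery is the fixed-size split this guide argues against:

```typescript
// The naive pipeline, end to end. This is the version that breaks.
function chunkEvery(text: string, size: number): string[] {
  const out: string[] = [];
  for (let i = 0; i < text.length; i += size) out.push(text.slice(i, i + size));
  return out;
}

// Hypothetical stubs: swap in your actual embedding provider, vector DB, and model client.
declare function embed(text: string): Promise<number[]>;
declare function vectorSearch(queryVec: number[], topK: number): Promise<string[]>;
declare function askModel(prompt: string): Promise<string>;

async function naiveRAG(query: string, docs: string[]): Promise<string> {
  // Index: fixed-size chunks, one general-purpose embedding for everything
  const chunks = docs.flatMap(d => chunkEvery(d, 1000));
  // (a real system would embed and store `chunks` here)
  const queryVec = await embed(query);
  const top = await vectorSearch(queryVec, 3); // blunt top-K
  // Context stuffing: chunks dumped in with no structure or provenance
  return askModel(`${top.join("\n\n")}\n\nQuestion: ${query}`);
}
```

Each section below replaces one step of this sketch with something that survives real users.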
Four structural failures that basic RAG tutorials do not address.
Chunking destroys semantic units. Splitting at fixed character counts breaks sentences mid-thought, separates tables from their headers, and severs list items from the parent sentence that gives them context. Retrieved chunks that start in the middle of a thought are useless.
Top-K retrieval is a blunt instrument. Fetching the three closest chunks by cosine similarity misses chunks that are relevant in different ways. A question about a policy change needs the current policy, the previous policy, and the changelog entry. Cosine similarity to a single query vector will not reliably retrieve all three.
Embedding quality is inconsistent across content types. Code, tables, and structured data embed differently than prose. Using a single general-purpose embedding model for all content types produces uneven retrieval quality across your document corpus.
Context stuffing confuses the model. Dumping retrieved chunks directly into context without structure makes it hard for the model to understand which information comes from where and whether sources might contradict each other.
Each failure has a concrete fix.
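The sections that follow show code for chunking, retrieval, and context assembly; the embedding-quality failure gets no dedicated snippet, so here is a minimal sketch of one mitigation: route content types to different embedding models and linearize tables before embedding. The model ids are placeholders, not recommendations, and DocType is duplicated from the chunking code so the sketch stands alone:

```typescript
type DocType = "prose" | "code" | "markdown" | "structured";

// Placeholder model ids -- substitute whatever your provider offers.
const EMBEDDING_MODEL_BY_TYPE: Record<DocType, string> = {
  prose: "general-text-embedding",
  markdown: "general-text-embedding",
  code: "code-embedding", // a code-tuned model, if available
  structured: "general-text-embedding",
};

function selectEmbeddingModel(docType: DocType): string {
  return EMBEDDING_MODEL_BY_TYPE[docType];
}

// Pipe-table syntax embeds poorly with prose-trained models; linearizing
// rows into plain word sequences before embedding often helps.
function preprocessForEmbedding(content: string, docType: DocType): string {
  if (docType === "structured") {
    return content
      .split("\n")
      .map(line => line.replace(/\|/g, " ").replace(/\s+/g, " ").trim())
      .filter(Boolean)
      .join(". ");
  }
  return content;
}
```

Note that you embed the preprocessed text but store and display the original chunk content.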
Chunk by semantic unit, not character count. The implementation varies by document format.
type DocType = "prose" | "code" | "markdown" | "structured";
function chunkBySemanticUnit(document: string, docType: DocType): string[] {
switch (docType) {
case "markdown":
// Split at headers, keeping header with its content
const sections = document.split(/(?=^#{1,3} )/m);
return sections
.filter(s => s.trim().length > 100)
.flatMap(section => {
// Section too large? Split at paragraph boundaries
if (section.length > 2000) {
return section
.split(/\n\n+/)
.filter(p => p.trim().length > 50);
}
return [section];
});
case "code":
// Split at function/class boundaries
return document
.split(/(?=\n(?:export\s+)?(?:function|class|const|async\s+function|interface|type)\s)/)
.filter(s => s.trim().length > 50);
case "structured":
// For tables, keep each row with header
const lines = document.split("\n");
const headerLine = lines.find(l => l.includes("|"));
if (!headerLine) return [document];
return lines
.filter(l => l.includes("|") && l !== headerLine && !/^\|[\s:\-|]+\|$/.test(l))
.map(row => `${headerLine}\n${row}`);
default: // prose
// Split at paragraph boundaries, merge short ones
const paragraphs = document
.split(/\n\n+/)
.filter(p => p.trim().length > 50);
const chunks: string[] = [];
let current = "";
for (const para of paragraphs) {
if ((current + para).length > 1500 && current.length > 0) {
chunks.push(current.trim());
current = para;
} else {
current += "\n\n" + para;
}
}
if (current.trim()) chunks.push(current.trim());
return chunks;
}
}

Every chunk gets rich metadata:
interface DocumentChunk {
id: string;
documentId: string;
content: string;
metadata: {
source: string;
title: string;
section?: string;
docType: DocType;
pageNumber?: number;
lastUpdated: Date;
contentHash: string; // For detecting when chunks need re-indexing
};
embedding: number[];
}

Metadata is what enables filtered retrieval. "Find chunks from the API documentation updated after January 2026" is dramatically more useful than "find chunks that seem related to the API."
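The retrieval code later in this article calls applyFilters with a MetadataFilters argument without defining either. One plausible shape, assumed here rather than prescribed:

```typescript
type DocType = "prose" | "code" | "markdown" | "structured";

interface ChunkMetadata {
  source: string;
  title: string;
  docType: DocType;
  lastUpdated: Date;
}

// Each field narrows the search space; omit a field to skip that filter.
interface MetadataFilters {
  docType?: DocType;
  sourcePrefix?: string; // e.g. restrict to one documentation tree
  updatedAfter?: Date;
}

function applyFilters<T extends { metadata: ChunkMetadata }>(
  chunks: T[],
  filters: MetadataFilters
): T[] {
  return chunks.filter(({ metadata }) => {
    if (filters.docType && metadata.docType !== filters.docType) return false;
    if (filters.sourcePrefix && !metadata.source.startsWith(filters.sourcePrefix)) return false;
    if (filters.updatedAfter && metadata.lastUpdated < filters.updatedAfter) return false;
    return true;
  });
}
```

In a real vector database these become query predicates evaluated server-side; the in-memory version above only illustrates the semantics.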
Pure semantic search misses exact keyword matches. Pure keyword search misses semantic similarity. Combine them.
interface SearchResult {
chunk: DocumentChunk;
semanticScore: number;
keywordScore: number;
combinedScore: number;
}
async function hybridSearch(
query: string,
chunks: DocumentChunk[],
options: { topK: number; semanticWeight?: number; filters?: MetadataFilters }
): Promise<SearchResult[]> {
const { topK, semanticWeight = 0.7, filters } = options;
const keywordWeight = 1 - semanticWeight;
// Apply metadata filters first to narrow the search space
const filteredChunks = filters ? applyFilters(chunks, filters) : chunks;
// Semantic search via embedding similarity
const queryEmbedding = await embedText(query);
const results: SearchResult[] = filteredChunks.map(chunk => ({
chunk,
semanticScore: cosineSimilarity(queryEmbedding, chunk.embedding),
keywordScore: 0,
combinedScore: 0,
}));
// BM25-style keyword scoring
const queryTerms = query.toLowerCase().split(/\s+/).filter(t => t.length > 2);
for (const result of results) {
const content = result.chunk.content.toLowerCase();
const words = content.split(/\s+/).length;
let score = 0;
for (const term of queryTerms) {
// Escape regex metacharacters; note `\\b`, not `\b` (a backspace), inside a template literal
const escaped = term.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
const occurrences = (content.match(new RegExp(`\\b${escaped}\\b`, "g")) || []).length;
if (occurrences > 0) {
// TF-IDF inspired scoring
score += (1 + Math.log(occurrences)) / Math.sqrt(words);
}
}
result.keywordScore = score;
}
// Normalize scores to [0, 1] range
const maxSemantic = Math.max(...results.map(r => r.semanticScore), 0.001);
const maxKeyword = Math.max(...results.map(r => r.keywordScore), 0.001);
for (const result of results) {
result.combinedScore =
semanticWeight * (result.semanticScore / maxSemantic) +
keywordWeight * (result.keywordScore / maxKeyword);
}
return results
.sort((a, b) => b.combinedScore - a.combinedScore)
.slice(0, topK);
}

Users ask questions in their own words. Your knowledge base uses different words. Bridge the gap by expanding queries before retrieval.
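Two helpers the retrieval code assumes are never shown: embedText and cosineSimilarity. Cosine similarity is a few lines; the embedding call is provider-specific, so it is only stubbed here:

```typescript
// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// embedText is provider-specific; wire this to your embeddings API of choice.
async function embedText(text: string): Promise<number[]> {
  throw new Error("connect to your embedding provider");
}
```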
async function expandQuery(originalQuery: string): Promise<string[]> {
const response = await anthropic.messages.create({
model: "claude-haiku-4-5",
max_tokens: 300,
system: `Generate 3 alternative phrasings of this search query that would match
different but relevant documents. Focus on:
- Different terminology for the same concept
- More specific versions of the query
- Related questions that would have useful overlapping answers
Return a JSON array of strings only.`,
messages: [{ role: "user", content: originalQuery }]
});
try {
const text = response.content[0].type === "text" ? response.content[0].text : "[]";
const alternatives = JSON.parse(text);
return [originalQuery, ...alternatives.slice(0, 3)];
} catch {
return [originalQuery]; // Fall back to original if parsing fails
}
}
async function retrieveWithExpansion(
query: string,
chunks: DocumentChunk[],
topK: number
): Promise<SearchResult[]> {
const expandedQueries = await expandQuery(query);
// Run parallel searches for each query variant
const allResultArrays = await Promise.all(
expandedQueries.map(q => hybridSearch(q, chunks, { topK, semanticWeight: 0.7 }))
);
// Deduplicate: keep highest score for each chunk seen
const bestScores = new Map<string, SearchResult>();
for (const results of allResultArrays) {
for (const result of results) {
const existing = bestScores.get(result.chunk.id);
if (!existing || result.combinedScore > existing.combinedScore) {
bestScores.set(result.chunk.id, result);
}
}
}
return Array.from(bestScores.values())
.sort((a, b) => b.combinedScore - a.combinedScore)
.slice(0, topK);
}

Vector search is fast but approximate. Re-rank the top candidates with a more precise relevance model.
async function rerankResults(
query: string,
candidates: SearchResult[],
topN: number
): Promise<SearchResult[]> {
if (candidates.length <= topN) return candidates;
// Use LLM to score relevance of each candidate
const scoringPrompt = `Score the relevance of each passage to the query.
Query: "${query}"
${candidates.map((r, i) => `[${i}] ${r.chunk.content.substring(0, 300)}...`).join("\n\n")}
Return a JSON array of numbers from 0-10 representing relevance scores for each passage in order.`;
const response = await anthropic.messages.create({
model: "claude-haiku-4-5",
max_tokens: 200,
messages: [{ role: "user", content: scoringPrompt }]
});
try {
const text = response.content[0].type === "text" ? response.content[0].text : "[]";
const scores: number[] = JSON.parse(text);
return candidates
.map((result, i) => ({ ...result, rerankedScore: scores[i] ?? 0 }))
.sort((a, b) => b.rerankedScore - a.rerankedScore)
.slice(0, topN);
} catch {
return candidates.slice(0, topN); // Fall back to original ranking
}
}

Do not dump chunks into context. Assemble them with structure that helps the model understand provenance.
function assembleStructuredContext(
query: string,
retrievedChunks: SearchResult[]
): string {
// Group chunks by source document
const byDocument = new Map<string, SearchResult[]>();
for (const result of retrievedChunks) {
const docId = result.chunk.documentId;
if (!byDocument.has(docId)) byDocument.set(docId, []);
byDocument.get(docId)!.push(result);
}
const contextParts: string[] = [
`RETRIEVED CONTEXT FOR QUERY: "${query}"`,
"=".repeat(50),
];
let sourceIndex = 1;
for (const [, docResults] of byDocument) {
const meta = docResults[0].chunk.metadata;
contextParts.push(`\n[Source ${sourceIndex}: ${meta.title} | ${meta.docType} | Updated: ${meta.lastUpdated.toLocaleDateString()}]`);
// Sort chunks within document by page/section order
const sorted = docResults.sort(
(a, b) => (a.chunk.metadata.pageNumber || 0) - (b.chunk.metadata.pageNumber || 0)
);
for (const result of sorted) {
contextParts.push(result.chunk.content);
}
sourceIndex++;
}
return contextParts.join("\n\n");
}

Putting it all together:
interface RAGOptions {
topKRetrieval?: number; // Initial candidates to retrieve
topNRerank?: number; // Final chunks after re-ranking
semanticWeight?: number; // Balance semantic vs keyword search
enableQueryExpansion?: boolean;
metadataFilters?: MetadataFilters;
}
async function productionRAGQuery(
userQuery: string,
knowledgeBase: DocumentChunk[],
options: RAGOptions = {}
): Promise<{ answer: string; sources: string[] }> {
const {
topKRetrieval = 12,
topNRerank = 5,
semanticWeight = 0.7,
enableQueryExpansion = true,
metadataFilters,
} = options;
// Step 1: Retrieve candidates (with or without query expansion)
const candidates = enableQueryExpansion
? await retrieveWithExpansion(userQuery, knowledgeBase, topKRetrieval)
: await hybridSearch(userQuery, knowledgeBase, { topK: topKRetrieval, semanticWeight, filters: metadataFilters });
// Step 2: Re-rank to get final context
const finalChunks = await rerankResults(userQuery, candidates, topNRerank);
// Step 3: Assemble structured context
const context = assembleStructuredContext(userQuery, finalChunks);
// Step 4: Generate grounded response
const response = await anthropic.messages.create({
model: "claude-sonnet-4-6",
max_tokens: 1500,
system: `You are a helpful assistant. Answer the user's question based ONLY on the provided context.
If the context does not contain enough information to answer fully, say what you know from the context and clearly indicate what is missing.
Always cite sources by referring to the [Source N] labels in the context.`,
messages: [{
role: "user",
content: `${context}\n\n${"-".repeat(50)}\n\nQuestion: ${userQuery}`
}]
});
const answer = response.content[0].type === "text" ? response.content[0].text : "";
// Map [Source N] labels back to documents. Numbering must match
// assembleStructuredContext, which assigns one number per document, not per chunk.
const docOrder = [...new Map(finalChunks.map(r => [r.chunk.documentId, r])).values()];
const sourcesUsed = docOrder
.filter((_, i) => answer.includes(`Source ${i + 1}`))
.map(r => `${r.chunk.metadata.title} (${r.chunk.metadata.source})`);
return { answer, sources: sourcesUsed };
}

A RAG system is only as good as its data. Build the indexing pipeline with updates in mind from the start.
class DocumentIndexer {
async indexDocument(document: RawDocument): Promise<void> {
// Detect document type for appropriate chunking
const docType = this.detectDocType(document);
// Chunk by semantic unit
const textChunks = chunkBySemanticUnit(document.content, docType);
// Generate embeddings in batches to respect rate limits
const batchSize = 20;
const chunks: DocumentChunk[] = [];
for (let i = 0; i < textChunks.length; i += batchSize) {
const batch = textChunks.slice(i, i + batchSize);
const embeddings = await this.embedBatch(batch);
for (let j = 0; j < batch.length; j++) {
chunks.push({
id: `${document.id}-chunk-${i + j}`,
documentId: document.id,
content: batch[j],
metadata: {
source: document.url,
title: document.title,
docType,
lastUpdated: document.updatedAt,
contentHash: this.hash(batch[j]),
},
embedding: embeddings[j],
});
}
}
// Remove old chunks and insert new ones atomically
await this.vectorDB.transaction(async (tx) => {
await tx.deleteWhere({ documentId: document.id });
await tx.insertBatch(chunks);
});
console.log(`Indexed ${chunks.length} chunks for: ${document.title}`);
}
async handleDocumentDeletion(documentId: string): Promise<void> {
await this.vectorDB.deleteWhere({ documentId });
console.log(`Removed chunks for deleted document: ${documentId}`);
}
}

For production vector database setup, Pinecone, Weaviate, and Qdrant each have strong operational characteristics. The vector databases explained for developers guide covers the tradeoffs in detail.
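One refinement worth noting: the chunk metadata stores a contentHash "for detecting when chunks need re-indexing", but the indexer above re-embeds everything unconditionally. A sketch of a diffing step that could sit in front of embedBatch, reusing the `${documentId}-chunk-${n}` id convention from the indexer:

```typescript
// Compare the hashes stored for a document against the hashes of freshly
// chunked content, and plan the minimal re-embedding work.
interface ReindexPlan {
  toEmbed: number[];   // indices of chunks that changed or are new
  toDelete: string[];  // stored chunk ids that no longer exist
  unchanged: number;
}

function planReindex(
  storedHashesById: Map<string, string>, // chunk id -> contentHash in the DB
  documentId: string,
  newHashes: string[]                    // hash per fresh chunk, in order
): ReindexPlan {
  const toEmbed: number[] = [];
  const seen = new Set<string>();
  let unchanged = 0;
  newHashes.forEach((hash, i) => {
    const id = `${documentId}-chunk-${i}`;
    seen.add(id);
    if (storedHashesById.get(id) === hash) unchanged++;
    else toEmbed.push(i);
  });
  const toDelete = [...storedHashesById.keys()].filter(id => !seen.has(id));
  return { toEmbed, toDelete, unchanged };
}
```

Because ids are positional, inserting a paragraph early in a document shifts every later chunk and defeats the diff; hashing content rather than position in the id would avoid that, at the cost of a slightly more involved delete step.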
Before shipping, measure quality:
interface RAGEvaluation {
retrievalPrecision: number; // Fraction of retrieved chunks that are relevant
retrievalRecall: number; // Fraction of relevant chunks that were retrieved
answerFaithfulness: number; // Fraction of answer claims supported by context
answerRelevance: number; // How well the answer addresses the question
}
async function evaluateRAG(
testCases: Array<{ query: string; relevantDocIds: string[]; expectedAnswer: string }>,
knowledgeBase: DocumentChunk[]
): Promise<RAGEvaluation> {
const results = await Promise.all(
testCases.map(async (tc) => {
const { answer, sources } = await productionRAGQuery(tc.query, knowledgeBase);
const retrievedIds = sources.map(s => extractDocId(s));
const truePositives = retrievedIds.filter(id => tc.relevantDocIds.includes(id)).length;
const precision = truePositives / Math.max(retrievedIds.length, 1);
const recall = truePositives / Math.max(tc.relevantDocIds.length, 1);
return { precision, recall, answer, expectedAnswer: tc.expectedAnswer };
})
);
return {
retrievalPrecision: average(results.map(r => r.precision)),
retrievalRecall: average(results.map(r => r.recall)),
answerFaithfulness: 0, // Requires LLM-as-judge evaluation
answerRelevance: 0, // Requires LLM-as-judge evaluation
};
}

Target benchmarks: retrieval precision above 0.75, recall above 0.70. If you are below these, the embedding model or chunking strategy is usually the culprit, not the retrieval algorithm.
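The two zeroed metrics need LLM-as-judge scoring. One possible sketch, reusing the anthropic client assumed throughout the article; the claims/supported JSON verdict and the resulting 0-1 ratio are choices of this sketch, not a standard metric:

```typescript
// Parse a judge model's {"claims": N, "supported": M} verdict into a 0-1 score.
function faithfulnessFromJudgment(raw: string): number {
  try {
    const { claims, supported } = JSON.parse(raw);
    if (typeof claims !== "number" || typeof supported !== "number") return 0;
    return claims > 0 ? Math.min(supported / claims, 1) : 1;
  } catch {
    return 0; // score unparseable judgments as 0 rather than guessing
  }
}

// The judge call itself, using the anthropic client the article assumes.
declare const anthropic: any;

async function judgeFaithfulness(context: string, answer: string): Promise<number> {
  const response = await anthropic.messages.create({
    model: "claude-haiku-4-5",
    max_tokens: 100,
    messages: [{
      role: "user",
      content:
        `Context:\n${context}\n\nAnswer:\n${answer}\n\n` +
        `Count the factual claims in the answer and how many the context supports. ` +
        `Return only JSON: {"claims": N, "supported": M}`,
    }],
  });
  const text = response.content[0].type === "text" ? response.content[0].text : "{}";
  return faithfulnessFromJudgment(text);
}
```

Run the judge on a fixed test set, not live traffic, so scores are comparable across pipeline changes.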
Q: How do you build a RAG system?
Building a RAG system involves four steps: chunking (splitting documents into semantic passages), embedding (converting text to vector representations), indexing (storing embeddings in a vector database like Pinecone or Chroma), and retrieval (finding relevant passages for each query using similarity search, then including them in the AI prompt as context).
Q: What vector database should I use for RAG?
Popular choices include Pinecone (fully managed, easy to scale), Chroma (open source, good for development), Weaviate (hybrid search combining vector and keyword), and pgvector (PostgreSQL extension, good if you already use Postgres). Choose based on scale (Pinecone for production), flexibility (Weaviate for hybrid search), or simplicity (Chroma for prototypes).
Q: How do you improve RAG retrieval quality?
Improve RAG quality through better chunking strategies (semantic chunking instead of fixed-size), hybrid search (combining vector similarity with keyword matching), reranking (using a cross-encoder to refine initial results), metadata filtering (narrowing search by document type or date), and query expansion (generating multiple search queries from one user question).