Written by Gareth Simono, Founder and CEO of Agentik {OS}. Full-stack developer and AI architect with years of experience shipping production applications across SaaS, mobile, and enterprise platforms. Gareth orchestrates 267 specialized AI agents to deliver production software 10x faster than traditional development teams.
Keyword search returns documents containing your words. Semantic search returns documents matching your intent. Here is how to build one for production.

Keyword search fails in predictable ways. A user searches for "how to cancel my account" and your support search returns nothing because your documentation says "delete your account." Same intent. Different words. Zero results.
Semantic search solves this. It retrieves documents that match the meaning of a query, not just the words. Built correctly, it makes search feel like the system actually understood what you were asking.
Built incorrectly, it is slow, expensive, and no better than keyword search. Here is the difference.
A production semantic search system has four components: document processing, chunking and embedding, vector storage, and query processing.
The indexing pipeline runs once (plus updates when content changes). The query processing runs on every search request. Getting both right is necessary; getting only one right produces poor results.
The indexing pipeline determines the upper bound of your search quality. Bad indexing cannot be fixed at query time.
Before embedding, clean and normalize your content.
interface ProcessedDocument {
  id: string;
  title: string;
  content: string;
  url: string;
  lastModified: Date;
  metadata: Record<string, unknown>;
}

function processDocument(rawDoc: RawDocument): ProcessedDocument {
  return {
    id: rawDoc.id,
    title: cleanText(rawDoc.title),
    content: cleanText(rawDoc.body),
    url: rawDoc.url,
    lastModified: new Date(rawDoc.updatedAt),
    metadata: {
      category: rawDoc.category,
      author: rawDoc.author,
      wordCount: rawDoc.body.split(/\s+/).length,
      language: detectLanguage(rawDoc.body),
    },
  };
}

function cleanText(text: string): string {
  return text
    .replace(/<[^>]*>/g, ' ') // Strip HTML tags before touching whitespace
    .replace(/\[([^\]]*)\]\([^)]*\)/g, '$1') // Keep link text, drop markdown URLs
    .replace(/^#{1,6}\s+/gm, '') // Strip markdown headers while line breaks still exist
    .replace(/\s+/g, ' ') // Normalize whitespace last
    .trim();
}

Chunking strategy has an outsized impact on search quality. The wrong chunk size makes even perfect embeddings useless.
interface Chunk {
  id: string;
  docId: string;
  content: string;
  title: string; // Include parent document title for context
  chunkIndex: number;
  metadata: Record<string, unknown>;
}

function chunkDocument(doc: ProcessedDocument): Chunk[] {
  // Split on natural boundaries first (headers, paragraphs)
  const sections = doc.content.split(/\n\n+/);
  const chunks: Chunk[] = [];
  let buffer = '';
  let bufferTokens = 0;
  const TARGET_TOKENS = 400;
  const OVERLAP_TOKENS = 50;
  // Carry document-level fields into each chunk so the indexer can store them
  const chunkMetadata = {
    ...doc.metadata,
    url: doc.url,
    lastModified: doc.lastModified,
  };
  for (const section of sections) {
    const sectionTokens = estimateTokens(section);
    if (bufferTokens + sectionTokens > TARGET_TOKENS && buffer) {
      // Flush current buffer as a chunk
      chunks.push({
        id: `${doc.id}-chunk-${chunks.length}`,
        docId: doc.id,
        content: buffer.trim(),
        title: doc.title, // Title provides crucial context
        chunkIndex: chunks.length,
        metadata: chunkMetadata,
      });
      // Keep the last ~50 words (roughly 50 tokens) for overlap
      const words = buffer.split(' ');
      buffer = words.slice(-OVERLAP_TOKENS).join(' ') + ' ' + section;
      bufferTokens = OVERLAP_TOKENS + sectionTokens;
    } else {
      buffer += (buffer ? ' ' : '') + section;
      bufferTokens += sectionTokens;
    }
  }
  // Don't forget the last chunk
  if (buffer.trim()) {
    chunks.push({
      id: `${doc.id}-chunk-${chunks.length}`,
      docId: doc.id,
      content: buffer.trim(),
      title: doc.title,
      chunkIndex: chunks.length,
      metadata: chunkMetadata,
    });
  }
  return chunks;
}

function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4); // ~4 chars per token for English
}

A chunk without its document title loses context. The same sentence means something different as part of a troubleshooting guide versus a sales page.
function buildEmbeddingText(chunk: Chunk): string {
  // Prepending the title dramatically improves retrieval quality:
  // the model knows what document this chunk belongs to
  return `${chunk.title}\n\n${chunk.content}`;
}

This single change, prepending the document title to each chunk before embedding, consistently improves retrieval quality by 10-20% in my experience. Do not skip it.
async function indexDocuments(
  documents: ProcessedDocument[],
  vectorDB: VectorDB
): Promise<{ indexed: number; failed: number }> {
  let indexed = 0;
  let failed = 0;
  const allChunks = documents.flatMap(doc => chunkDocument(doc));
  const BATCH_SIZE = 100;
  for (let i = 0; i < allChunks.length; i += BATCH_SIZE) {
    const batch = allChunks.slice(i, i + BATCH_SIZE);
    try {
      // Embed the batch
      const embeddingTexts = batch.map(chunk => buildEmbeddingText(chunk));
      const embeddings = await batchEmbed(embeddingTexts);
      // Store in vector database
      await vectorDB.upsert(
        batch.map((chunk, j) => ({
          id: chunk.id,
          values: embeddings[j],
          metadata: {
            docId: chunk.docId,
            title: chunk.title,
            content: chunk.content,
            url: chunk.metadata.url,
            category: chunk.metadata.category,
            lastModified: (chunk.metadata.lastModified as Date).toISOString(),
          },
        }))
      );
      indexed += batch.length;
    } catch (error) {
      console.error(`Failed to index batch starting at ${i}:`, error);
      failed += batch.length;
    }
    // Respect API rate limits
    await new Promise(r => setTimeout(r, 200));
  }
  return { indexed, failed };
}

interface SearchResult {
  id: string;
  docId: string;
  title: string;
  content: string;
  url: string;
  score: number;
  highlights?: string[];
}
async function semanticSearch(
  query: string,
  options: {
    topK?: number;
    category?: string;
    minScore?: number;
  } = {}
): Promise<SearchResult[]> {
  const { topK = 10, category, minScore = 0.5 } = options;
  // Embed the query
  const queryEmbedding = await getEmbedding(query);
  // Build metadata filter
  const filter: Record<string, unknown> = {};
  if (category) filter.category = { $eq: category };
  // Query vector database
  const results = await vectorDB.query({
    vector: queryEmbedding,
    topK: topK * 2, // Retrieve more than needed for re-ranking
    includeMetadata: true,
    filter,
  });
  // Filter by minimum score and deduplicate by document
  const seen = new Set<string>();
  const filtered = results.matches
    .filter(match => match.score >= minScore)
    .filter(match => {
      const docId = match.metadata?.docId as string;
      if (seen.has(docId)) return false;
      seen.add(docId);
      return true;
    })
    .slice(0, topK);
  return filtered.map(match => ({
    id: match.id,
    docId: match.metadata?.docId as string,
    title: match.metadata?.title as string,
    content: match.metadata?.content as string,
    url: match.metadata?.url as string,
    score: match.score,
  }));
}

Vector similarity retrieves approximately relevant results. Re-ranking scores the top candidates much more precisely.
async function semanticSearchWithReranking(
  query: string,
  topK: number = 5
): Promise<SearchResult[]> {
  // Step 1: Get 3x more candidates than needed
  const candidates = await semanticSearch(query, { topK: topK * 3 });
  if (candidates.length === 0) return [];
  // Step 2: Re-rank the candidates
  const reranked = await cohereRerank(
    query,
    candidates.map(c => c.content),
    topK
  );
  // Step 3: Return top results with updated scores
  return reranked.map(result => ({
    ...candidates[result.index],
    score: result.relevanceScore,
  }));
}

async function cohereRerank(
  query: string,
  documents: string[],
  topN: number
): Promise<Array<{ index: number; relevanceScore: number }>> {
  const response = await cohere.rerank({
    model: 'rerank-english-v3.0',
    query,
    documents,
    topN,
  });
  return response.results.map(r => ({
    index: r.index,
    relevanceScore: r.relevanceScore,
  }));
}

Pure semantic search misses exact matches. Pure keyword search misses semantic matches. Hybrid search gets both.
async function hybridSearch(
  query: string,
  topK: number = 10
): Promise<SearchResult[]> {
  // Run both in parallel
  const [semanticResults, keywordResults] = await Promise.all([
    semanticSearch(query, { topK: topK * 2 }),
    keywordSearch(query, topK * 2), // Your existing full-text search
  ]);
  // Reciprocal Rank Fusion
  const scores = new Map<string, number>();
  const resultMap = new Map<string, SearchResult>();
  const k = 60; // RRF constant
  semanticResults.forEach((result, rank) => {
    const current = scores.get(result.id) ?? 0;
    scores.set(result.id, current + 1 / (k + rank + 1));
    resultMap.set(result.id, result);
  });
  keywordResults.forEach((result, rank) => {
    const current = scores.get(result.id) ?? 0;
    scores.set(result.id, current + 1 / (k + rank + 1));
    if (!resultMap.has(result.id)) resultMap.set(result.id, result);
  });
  // Sort by RRF score
  return Array.from(scores.entries())
    .sort(([, a], [, b]) => b - a)
    .slice(0, topK)
    .map(([id, score]) => ({ ...resultMap.get(id)!, score }));
}

Hybrid search with RRF outperforms either approach alone. The extra implementation complexity is worth it for any serious search application.
Not all queries are equal. Short queries lose information. Vague queries have many valid interpretations. Misspelled queries mismatch at the character level.
For short or vague queries, expand them before embedding:
async function expandQuery(query: string): Promise<string> {
  if (query.split(' ').length > 6) return query; // Already detailed enough
  const expanded = await anthropic.messages.create({
    model: 'claude-3-5-haiku-20241022',
    max_tokens: 150,
    messages: [{
      role: 'user',
      content: `Expand this search query with related terms and concepts that would help find relevant documents. Return ONLY the expanded query as a single sentence, nothing else.
Original query: ${query}`,
    }],
  });
  // Embed the expanded query instead of the raw original
  const expandedText = expanded.content[0].type === 'text'
    ? expanded.content[0].text
    : query;
  return expandedText;
}

The HyDE (Hypothetical Document Embeddings) pattern takes this further: instead of expanding the query, generate a hypothetical document that would answer the query, then embed that. The hypothetical document is in the same representation space as real documents, which often produces better retrieval than embedding the query directly.
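As a sketch of that pattern: `hydeEmbedding` below is a name of my choosing, and it takes the two external calls as parameters (`generateDoc` standing in for an LLM call like the one in `expandQuery`, `embed` for the `getEmbedding` helper):

```typescript
// Sketch of the HyDE pattern: generate a hypothetical answer document,
// then embed that document instead of the raw query.
async function hydeEmbedding(
  query: string,
  generateDoc: (q: string) => Promise<string>,
  embed: (text: string) => Promise<number[]>
): Promise<number[]> {
  // Ask the model to write a short passage that would answer the query
  const hypotheticalDoc = await generateDoc(query);
  // Fall back to the raw query if generation returns nothing
  const textToEmbed = hypotheticalDoc.trim() || query;
  // The hypothetical document lives in the same space as real documents,
  // so its nearest neighbors are real documents with similar answers
  return embed(textToEmbed);
}
```

Because the generation step adds an LLM round trip per query, HyDE is best reserved for short or vague queries where plain query embedding underperforms.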
The bottleneck in semantic search is almost always the embedding API call, not the vector database query.
For single-user applications: query latency of 200-500ms is acceptable (100-200ms embedding + 10-50ms vector DB query).
For multi-user production systems: cache frequent queries.
const queryCache = new Map<string, { embedding: number[]; timestamp: number }>();
const CACHE_TTL = 5 * 60 * 1000; // 5 minutes

async function getCachedEmbedding(query: string): Promise<number[]> {
  const normalized = query.toLowerCase().trim();
  const cached = queryCache.get(normalized);
  if (cached && Date.now() - cached.timestamp < CACHE_TTL) {
    return cached.embedding;
  }
  const embedding = await getEmbedding(normalized);
  queryCache.set(normalized, { embedding, timestamp: Date.now() });
  return embedding;
}

For high-volume applications, use Redis for distributed query caching instead of in-memory.
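A minimal sketch of the distributed version, written against a hypothetical `RedisLike` interface rather than a specific client library (node-redis or ioredis would fit behind a thin adapter), with the embedding call passed in as a parameter:

```typescript
// Minimal surface a Redis-style cache needs for this use case
interface RedisLike {
  get(key: string): Promise<string | null>;
  set(key: string, value: string, ttlSeconds: number): Promise<unknown>;
}

async function getDistributedEmbedding(
  redis: RedisLike,
  query: string,
  embed: (text: string) => Promise<number[]>,
  ttlSeconds = 300
): Promise<number[]> {
  const normalized = query.toLowerCase().trim();
  const key = `emb:${normalized}`;
  const cached = await redis.get(key);
  if (cached) return JSON.parse(cached) as number[]; // Cache hit: no API call
  const embedding = await embed(normalized);
  // TTL expires server-side, so every app instance shares one cache
  await redis.set(key, JSON.stringify(embedding), ttlSeconds);
  return embedding;
}
```

The win over the in-memory Map is that all instances behind a load balancer share hits, and restarts do not cold-start the cache.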
Good search is invisible. Bad search is immediately obvious. But without metrics, you cannot tell if it is getting better or worse over time.
Mean Reciprocal Rank (MRR) and Normalized Discounted Cumulative Gain (NDCG) are the standard metrics. Both require a ground truth dataset: queries paired with their correct results.
Start by collecting search queries from real users and having humans label which results were relevant. Even fifty labeled queries give you a baseline to measure against. Every search system change should be measured against this baseline.
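For NDCG, here is a minimal sketch using binary relevance labels (gain 1 for a relevant result, 0 otherwise); the `ndcgAtK` helper name is my own:

```typescript
// NDCG@k with binary relevance: each relevant result earns a gain of 1,
// discounted by log2 of its position; the sum is normalized against the
// best possible ranking so scores are comparable across queries.
function ndcgAtK(
  rankedDocIds: string[],
  relevantDocIds: Set<string>,
  k: number
): number {
  const dcg = rankedDocIds.slice(0, k).reduce(
    (sum, id, i) => sum + (relevantDocIds.has(id) ? 1 / Math.log2(i + 2) : 0),
    0
  );
  // Ideal DCG: all relevant documents ranked at the top
  const idealHits = Math.min(relevantDocIds.size, k);
  let idcg = 0;
  for (let i = 0; i < idealHits; i++) idcg += 1 / Math.log2(i + 2);
  return idcg === 0 ? 0 : dcg / idcg;
}
```

Unlike MRR, which only looks at the first relevant result, NDCG rewards putting every relevant document as high as possible.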
async function meanReciprocalRank(
  queries: Array<{ query: string; relevantDocIds: string[] }>,
  searchFunction: (q: string) => Promise<SearchResult[]>
): Promise<number> {
  const reciprocalRanks = await Promise.all(
    queries.map(async ({ query, relevantDocIds }) => {
      const results = await searchFunction(query);
      const relevantSet = new Set(relevantDocIds);
      const firstRelevantRank = results.findIndex(r => relevantSet.has(r.docId));
      return firstRelevantRank === -1 ? 0 : 1 / (firstRelevantRank + 1);
    })
  );
  return reciprocalRanks.reduce((sum, r) => sum + r, 0) / reciprocalRanks.length;
}

Semantic search done right is a significant engineering investment. The payoff, search that actually understands what users are looking for, is worth it.
Start simple. Basic ANN search with good chunking and title prepending. Measure quality. Add re-ranking if precision is not high enough. Add hybrid search if keyword matches are being missed. Add query expansion for short query handling.
Each addition should be justified by a measurable improvement in search quality. Build what you can measure, measure what you build.
Q: What is semantic search?
Semantic search finds content based on meaning rather than exact keyword matches. It converts search queries and documents into vector embeddings, then finds the most similar documents using distance metrics. This means searching for 'how to fix slow website' finds results about 'performance optimization' even without matching keywords.
Q: How do you implement semantic search?
Implementation involves four steps: generate embeddings for all your content using an embedding model, store them in a vector database, at search time embed the user's query, and find the nearest vectors using cosine similarity. Add hybrid search (combining vector + keyword) for best results.
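The cosine similarity step mentioned in that answer is small enough to show inline; in practice the vector database computes it (or an approximation) internally:

```typescript
// Cosine similarity: dot product of two vectors divided by the product
// of their magnitudes. 1 means identical direction, 0 means orthogonal.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```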
Q: What is hybrid search and when should you use it?
Hybrid search combines semantic search (vector similarity) with traditional keyword search (BM25). Use it when users sometimes search for exact terms (product names, error codes) and sometimes search by concept (how to fix a problem). Hybrid search handles both cases well and typically outperforms either approach alone.