Written by Gareth Simono, Founder and CEO of Agentik {OS}. Full-stack developer and AI architect with years of experience shipping production applications across SaaS, mobile, and enterprise platforms. Gareth orchestrates 267 specialized AI agents to deliver production software 10x faster than traditional development teams.
Keyword search returns documents containing your words. Semantic search returns documents matching your intent. Here is how to build one for production.

Keyword search fails in predictable ways. A user searches for "how to cancel my account" and your support search returns nothing because your documentation says "delete your account." Same intent. Different words. Zero results.
Semantic search solves this. It retrieves documents that match the meaning of a query, not just the words. Built correctly, it makes search feel like the system actually understood what you were asking.
Built incorrectly, it is slow, expensive, and no better than keyword search. Here is the difference.
A production semantic search system has four components: document processing, chunking and embedding, vector storage, and query processing.
The indexing pipeline runs once (plus updates when content changes). The query processing runs on every search request. Getting both right is necessary; getting only one right produces poor results.
The indexing pipeline determines the upper bound of your search quality. Bad indexing cannot be fixed at query time.
Before embedding, clean and normalize your content.
interface ProcessedDocument {
  id: string;
  title: string;
  content: string;
  url: string;
  lastModified: Date;
  metadata: Record<string, unknown>;
}

function processDocument(rawDoc: RawDocument): ProcessedDocument {
  return {
    id: rawDoc.id,
    title: cleanText(rawDoc.title),
    content: cleanText(rawDoc.body),
    url: rawDoc.url,
    lastModified: new Date(rawDoc.updatedAt),
    metadata: {
      category: rawDoc.category,
      author: rawDoc.author,
      wordCount: rawDoc.body.split(/\s+/).length,
      language: detectLanguage(rawDoc.body),
    },
  };
}

function cleanText(text: string): string {
  return text
    .replace(/<[^>]*>/g, ' ') // Strip HTML tags before touching whitespace
    .replace(/\[([^\]]*)\]\([^)]*\)/g, '$1') // Keep link text, drop markdown URLs
    .replace(/^#{1,6}\s+/gm, '') // Strip markdown headers while line breaks still exist
    .replace(/\s+/g, ' ') // Normalize whitespace last
    .trim();
}

Chunking strategy has an outsized impact on search quality. The wrong chunk size makes even perfect embeddings useless.
interface Chunk {
  id: string;
  docId: string;
  content: string;
  title: string; // Include parent document title for context
  chunkIndex: number;
  metadata: Record<string, unknown>;
}

function chunkDocument(doc: ProcessedDocument): Chunk[] {
  // Split on natural boundaries first (headers, paragraphs)
  const sections = doc.content.split(/\n\n+/);
  const chunks: Chunk[] = [];
  let buffer = '';
  let bufferTokens = 0;
  const TARGET_TOKENS = 400;
  const OVERLAP_TOKENS = 50;
  // Carry document-level fields into each chunk so the indexer can store them
  const chunkMetadata = {
    ...doc.metadata,
    url: doc.url,
    lastModified: doc.lastModified,
  };
  for (const section of sections) {
    const sectionTokens = estimateTokens(section);
    if (bufferTokens + sectionTokens > TARGET_TOKENS && buffer) {
      // Flush current buffer as a chunk
      chunks.push({
        id: `${doc.id}-chunk-${chunks.length}`,
        docId: doc.id,
        content: buffer.trim(),
        title: doc.title, // Title provides crucial context
        chunkIndex: chunks.length,
        metadata: chunkMetadata,
      });
      // Keep the last ~50 words (roughly 50 tokens) for overlap
      const words = buffer.split(' ');
      buffer = words.slice(-OVERLAP_TOKENS).join(' ') + ' ' + section;
      bufferTokens = OVERLAP_TOKENS + sectionTokens;
    } else {
      buffer += (buffer ? ' ' : '') + section;
      bufferTokens += sectionTokens;
    }
  }
  // Don't forget the last chunk
  if (buffer.trim()) {
    chunks.push({
      id: `${doc.id}-chunk-${chunks.length}`,
      docId: doc.id,
      content: buffer.trim(),
      title: doc.title,
      chunkIndex: chunks.length,
      metadata: chunkMetadata,
    });
  }
  return chunks;
}

function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4); // ~4 chars per token for English
}

A chunk without its document title loses context. The same sentence means something different as part of a troubleshooting guide versus a sales page.
function buildEmbeddingText(chunk: Chunk): string {
  // Prepending the title dramatically improves retrieval quality:
  // the model knows what document this chunk belongs to
  return `${chunk.title}\n\n${chunk.content}`;
}

This single change, prepending the document title to each chunk before embedding, consistently improves retrieval quality by 10-20% in my experience. Do not skip it.
async function indexDocuments(
  documents: ProcessedDocument[],
  vectorDB: VectorDB
): Promise<{ indexed: number; failed: number }> {
  let indexed = 0;
  let failed = 0;
  const allChunks = documents.flatMap(doc => chunkDocument(doc));
  const BATCH_SIZE = 100;
  for (let i = 0; i < allChunks.length; i += BATCH_SIZE) {
    const batch = allChunks.slice(i, i + BATCH_SIZE);
    try {
      // Embed the batch
      const embeddingTexts = batch.map(chunk => buildEmbeddingText(chunk));
      const embeddings = await batchEmbed(embeddingTexts);
      // Store in vector database
      await vectorDB.upsert(
        batch.map((chunk, j) => ({
          id: chunk.id,
          values: embeddings[j],
          metadata: {
            docId: chunk.docId,
            title: chunk.title,
            content: chunk.content,
            url: chunk.metadata.url,
            category: chunk.metadata.category,
            lastModified: (chunk.metadata.lastModified as Date).toISOString(),
          },
        }))
      );
      indexed += batch.length;
    } catch (error) {
      console.error(`Failed to index batch starting at ${i}:`, error);
      failed += batch.length;
    }
    // Respect API rate limits
    await new Promise(r => setTimeout(r, 200));
  }
  return { indexed, failed };
}

interface SearchResult {
  id: string;
  docId: string;
  title: string;
  content: string;
  url: string;
  score: number;
  highlights?: string[];
}
async function semanticSearch(
  query: string,
  options: {
    topK?: number;
    category?: string;
    minScore?: number;
  } = {}
): Promise<SearchResult[]> {
  const { topK = 10, category, minScore = 0.5 } = options;
  // Embed the query
  const queryEmbedding = await getEmbedding(query);
  // Build metadata filter
  const filter: Record<string, unknown> = {};
  if (category) filter.category = { $eq: category };
  // Query vector database
  const results = await vectorDB.query({
    vector: queryEmbedding,
    topK: topK * 2, // Retrieve more than needed for re-ranking
    includeMetadata: true,
    filter,
  });
  // Filter by minimum score and deduplicate by document
  const seen = new Set<string>();
  const filtered = results.matches
    .filter(match => match.score >= minScore)
    .filter(match => {
      const docId = match.metadata?.docId as string;
      if (seen.has(docId)) return false;
      seen.add(docId);
      return true;
    })
    .slice(0, topK);
  return filtered.map(match => ({
    id: match.id,
    docId: match.metadata?.docId as string,
    title: match.metadata?.title as string,
    content: match.metadata?.content as string,
    url: match.metadata?.url as string,
    score: match.score,
  }));
}

Vector similarity retrieves approximately relevant results. Re-ranking scores the top candidates much more precisely.
async function semanticSearchWithReranking(
  query: string,
  topK: number = 5
): Promise<SearchResult[]> {
  // Step 1: Get 3x more candidates than needed
  const candidates = await semanticSearch(query, { topK: topK * 3 });
  if (candidates.length === 0) return [];
  // Step 2: Re-rank the candidates
  const reranked = await cohereRerank(
    query,
    candidates.map(c => c.content),
    topK
  );
  // Step 3: Return top results with updated scores
  return reranked.map(result => ({
    ...candidates[result.index],
    score: result.relevanceScore,
  }));
}

async function cohereRerank(
  query: string,
  documents: string[],
  topN: number
): Promise<Array<{ index: number; relevanceScore: number }>> {
  const response = await cohere.rerank({
    model: 'rerank-english-v3.0',
    query,
    documents,
    topN,
  });
  return response.results.map(r => ({
    index: r.index,
    relevanceScore: r.relevanceScore,
  }));
}

Pure semantic search misses exact matches. Pure keyword search misses semantic matches. Hybrid search gets both.
async function hybridSearch(
  query: string,
  topK: number = 10
): Promise<SearchResult[]> {
  // Run both in parallel
  const [semanticResults, keywordResults] = await Promise.all([
    semanticSearch(query, { topK: topK * 2 }),
    keywordSearch(query, topK * 2), // Your existing full-text search
  ]);
  // Reciprocal Rank Fusion
  const scores = new Map<string, number>();
  const resultMap = new Map<string, SearchResult>();
  const k = 60; // RRF constant
  semanticResults.forEach((result, rank) => {
    const current = scores.get(result.id) ?? 0;
    scores.set(result.id, current + 1 / (k + rank + 1));
    resultMap.set(result.id, result);
  });
  keywordResults.forEach((result, rank) => {
    const current = scores.get(result.id) ?? 0;
    scores.set(result.id, current + 1 / (k + rank + 1));
    if (!resultMap.has(result.id)) resultMap.set(result.id, result);
  });
  // Sort by RRF score
  return Array.from(scores.entries())
    .sort(([, a], [, b]) => b - a)
    .slice(0, topK)
    .map(([id, score]) => ({ ...resultMap.get(id)!, score }));
}

Hybrid search with RRF outperforms either approach alone. The extra implementation complexity is worth it for any serious search application.
Not all queries are equal. Short queries lose information. Vague queries have many valid interpretations. Misspelled queries mismatch at the character level.
For short or vague queries, expand them before embedding:
async function expandQuery(query: string): Promise<string> {
  if (query.split(' ').length > 6) return query; // Already detailed enough
  const expanded = await anthropic.messages.create({
    model: 'claude-3-5-haiku-20241022',
    max_tokens: 150,
    messages: [{
      role: 'user',
      content: `Expand this search query with related terms and concepts that would help find relevant documents. Return ONLY the expanded query as a single sentence, nothing else.
Original query: ${query}`,
    }],
  });
  // Embed the expanded query instead of the raw original
  const expandedText = expanded.content[0].type === 'text'
    ? expanded.content[0].text
    : query;
  return expandedText;
}

The HyDE (Hypothetical Document Embeddings) pattern takes this further: instead of expanding the query, generate a hypothetical document that would answer the query, then embed that. The hypothetical document is in the same representation space as real documents, which often produces better retrieval than embedding the query directly.
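As a sketch of that pattern: `hydeEmbedding` below is a name of my choosing, and it takes the two external calls as parameters (`generateDoc` standing in for an LLM call like the one in `expandQuery`, `embed` for the `getEmbedding` helper):

```typescript
// Sketch of the HyDE pattern: generate a hypothetical answer document,
// then embed that document instead of the raw query.
async function hydeEmbedding(
  query: string,
  generateDoc: (q: string) => Promise<string>,
  embed: (text: string) => Promise<number[]>
): Promise<number[]> {
  // Ask the model to write a short passage that would answer the query
  const hypotheticalDoc = await generateDoc(query);
  // Fall back to the raw query if generation returns nothing
  const textToEmbed = hypotheticalDoc.trim() || query;
  // The hypothetical document lives in the same space as real documents,
  // so its nearest neighbors are real documents with similar answers
  return embed(textToEmbed);
}
```

Because the generation step adds an LLM round trip per query, HyDE is best reserved for short or vague queries where plain query embedding underperforms.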
The bottleneck in semantic search is almost always the embedding API call, not the vector database query.
For single-user applications: query latency of 200-500ms is acceptable (100-200ms embedding + 10-50ms vector DB query).
For multi-user production systems: cache frequent queries.
const queryCache = new Map<string, { embedding: number[]; timestamp: number }>();
const CACHE_TTL = 5 * 60 * 1000; // 5 minutes

async function getCachedEmbedding(query: string): Promise<number[]> {
  const normalized = query.toLowerCase().trim();
  const cached = queryCache.get(normalized);
  if (cached && Date.now() - cached.timestamp < CACHE_TTL) {
    return cached.embedding;
  }
  const embedding = await getEmbedding(normalized);
  queryCache.set(normalized, { embedding, timestamp: Date.now() });
  return embedding;
}

For high-volume applications, use Redis for distributed query caching instead of in-memory.
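A minimal sketch of the distributed version, written against a hypothetical `RedisLike` interface rather than a specific client library (node-redis or ioredis would fit behind a thin adapter), with the embedding call passed in as a parameter:

```typescript
// Minimal surface a Redis-style cache needs for this use case
interface RedisLike {
  get(key: string): Promise<string | null>;
  set(key: string, value: string, ttlSeconds: number): Promise<unknown>;
}

async function getDistributedEmbedding(
  redis: RedisLike,
  query: string,
  embed: (text: string) => Promise<number[]>,
  ttlSeconds = 300
): Promise<number[]> {
  const normalized = query.toLowerCase().trim();
  const key = `emb:${normalized}`;
  const cached = await redis.get(key);
  if (cached) return JSON.parse(cached) as number[]; // Cache hit: no API call
  const embedding = await embed(normalized);
  // TTL expires server-side, so every app instance shares one cache
  await redis.set(key, JSON.stringify(embedding), ttlSeconds);
  return embedding;
}
```

The win over the in-memory Map is that all instances behind a load balancer share hits, and restarts do not cold-start the cache.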
Good search is invisible. Bad search is immediately obvious. But without metrics, you cannot tell if it is getting better or worse over time.
Mean Reciprocal Rank (MRR) and Normalized Discounted Cumulative Gain (NDCG) are the standard metrics. Both require a ground truth dataset: queries paired with their correct results.
Start by collecting search queries from real users and having humans label which results were relevant. Even fifty labeled queries give you a baseline to measure against. Every search system change should be measured against this baseline.
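For NDCG, here is a minimal sketch using binary relevance labels (gain 1 for a relevant result, 0 otherwise); the `ndcgAtK` helper name is my own:

```typescript
// NDCG@k with binary relevance: each relevant result earns a gain of 1,
// discounted by log2 of its position; the sum is normalized against the
// best possible ranking so scores are comparable across queries.
function ndcgAtK(
  rankedDocIds: string[],
  relevantDocIds: Set<string>,
  k: number
): number {
  const dcg = rankedDocIds.slice(0, k).reduce(
    (sum, id, i) => sum + (relevantDocIds.has(id) ? 1 / Math.log2(i + 2) : 0),
    0
  );
  // Ideal DCG: all relevant documents ranked at the top
  const idealHits = Math.min(relevantDocIds.size, k);
  let idcg = 0;
  for (let i = 0; i < idealHits; i++) idcg += 1 / Math.log2(i + 2);
  return idcg === 0 ? 0 : dcg / idcg;
}
```

Unlike MRR, which only looks at the first relevant result, NDCG rewards putting every relevant document as high as possible.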
async function meanReciprocalRank(
  queries: Array<{ query: string; relevantDocIds: string[] }>,
  searchFunction: (q: string) => Promise<SearchResult[]>
): Promise<number> {
  const reciprocalRanks = await Promise.all(
    queries.map(async ({ query, relevantDocIds }) => {
      const results = await searchFunction(query);
      const relevantSet = new Set(relevantDocIds);
      const firstRelevantRank = results.findIndex(r => relevantSet.has(r.docId));
      return firstRelevantRank === -1 ? 0 : 1 / (firstRelevantRank + 1);
    })
  );
  return reciprocalRanks.reduce((sum, r) => sum + r, 0) / reciprocalRanks.length;
}

Semantic search done right is a significant engineering investment. The payoff, search that actually understands what users are looking for, is worth it.
Start simple. Basic ANN search with good chunking and title prepending. Measure quality. Add re-ranking if precision is not high enough. Add hybrid search if keyword matches are being missed. Add query expansion for short query handling.
Each addition should be justified by a measurable improvement in search quality. Build what you can measure, measure what you build.
Q: What is semantic search?
Semantic search finds content based on meaning rather than exact keyword matches. It converts search queries and documents into vector embeddings, then finds the most similar documents using distance metrics. This means searching for 'how to fix slow website' finds results about 'performance optimization' even without matching keywords.
Q: How do you implement semantic search?
Implementation involves four steps: generate embeddings for all your content using an embedding model, store them in a vector database, at search time embed the user's query, and find the nearest vectors using cosine similarity. Add hybrid search (combining vector + keyword) for best results.
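The cosine similarity step mentioned in that answer is small enough to show inline; in practice the vector database computes it (or an approximation) internally:

```typescript
// Cosine similarity: dot product of two vectors divided by the product
// of their magnitudes. 1 means identical direction, 0 means orthogonal.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```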
Q: What is hybrid search and when should you use it?
Hybrid search combines semantic search (vector similarity) with traditional keyword search (BM25). Use it when users sometimes search for exact terms (product names, error codes) and sometimes search by concept (how to fix a problem). Hybrid search handles both cases well and typically outperforms either approach alone.