Written by Gareth Simono, Founder and CEO of Agentik {OS}. Full-stack developer and AI architect with years of experience shipping production applications across SaaS, mobile, and enterprise platforms. Gareth orchestrates 267 specialized AI agents to deliver production software 10x faster than traditional development teams.
Your brilliant AI agent forgets everything between sessions. Here's how to build memory systems that make agents genuinely useful over time.

I spent three months building an AI agent that was genuinely impressive. Smart prompts. Good tool integrations. Solid error handling. Users loved it for about five minutes.
Then they asked it something like: "Remember last week when you helped me analyze that report? Can we do something similar for this quarter?" And the agent had absolutely no idea what they were talking about. It had forgotten everything the moment the session ended.
That is the memory problem. Every AI agent today is essentially stateless by default. It wakes up for a session, does useful work, then forgets everything. Like Groundhog Day, except the user remembers and gets increasingly frustrated.
Building memory systems that actually work is one of the hardest engineering challenges in agent development. Not because the technology is obscure, but because getting it right requires thinking clearly about what kind of memory you need and when.
Not all memory is the same. Conflating the different types leads to architectures that work in demos and collapse under real usage.
Working memory is what the agent knows right now. Everything in the active context window. Current conversation, recent tool results, immediate task state.
This is the simplest form and every agent has it. The limit is the context window size, which ranges from 8K to 200K+ tokens depending on the model.
The practical problem: long tasks accumulate context fast. Tool results, intermediate reasoning, conversation history. A complex task that runs for twenty minutes can easily overflow even a 200K token context if you are not careful.
Context management strategies:
Summarization. Periodically compress older exchanges into summaries. Keep the summary, discard the raw exchange. You lose some detail but preserve the signal.
Selective retention. Not all context is equally valuable. Tool results from three steps ago matter less than the current goal. Implement relevance scoring and prune low-value context aggressively.
Chunking. For very long tasks, break them into phases with defined handoff points. Complete a phase, summarize its outcomes, start the next phase with a clean context initialized from the summary.
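The selective retention strategy above can be sketched as a scoring-and-pruning pass over context items. The scoring weights and the four-characters-per-token estimate below are illustrative assumptions, not fixed rules; tune them against your own traces.

```typescript
// Selective retention sketch: score each context item by recency and role,
// then keep the highest-value items that fit the token budget.
type ContextItem = {
  role: "system" | "user" | "assistant" | "tool";
  content: string;
  turnsAgo: number; // how many turns back this item was produced
};

function estimateTokens(text: string): number {
  // Rough heuristic: ~4 characters per token for English text.
  return Math.ceil(text.length / 4);
}

function relevanceScore(item: ContextItem): number {
  // Recency dominates; tool results age out faster than user messages.
  const recency = 1 / (1 + item.turnsAgo);
  const roleWeight =
    item.role === "tool" ? 0.5 : item.role === "user" ? 1.0 : 0.8;
  return recency * roleWeight;
}

function pruneContext(items: ContextItem[], tokenBudget: number): ContextItem[] {
  // Consider items from highest to lowest relevance.
  const ranked = [...items].sort((a, b) => relevanceScore(b) - relevanceScore(a));
  const kept: ContextItem[] = [];
  let used = 0;
  for (const item of ranked) {
    const cost = estimateTokens(item.content);
    if (used + cost <= tokenBudget) {
      kept.push(item);
      used += cost;
    }
  }
  // Restore original ordering so the conversation still reads chronologically.
  return items.filter((i) => kept.includes(i));
}
```

The key design choice is that pruning is score-driven rather than purely positional: a stale tool dump gets cut before the user's current goal, even if the goal message is older.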
Episodic memory stores past interactions. The user's preferences discovered last week. The analysis you ran last month. The bugs you fixed yesterday.
This is stored outside the context window, retrieved when relevant, and injected into the current context. The architecture: vector database for semantic search, retrieval pipeline to find relevant memories, injection into the system prompt or early context.
```typescript
async function buildContextWithMemory(
  userId: string,
  currentQuery: string,
  baseSystemPrompt: string
): Promise<string> {
  // Retrieve relevant past interactions
  const relevantMemories = await vectorDB.search({
    collection: "user_interactions",
    query: currentQuery,
    filter: { userId },
    limit: 5,
    minScore: 0.75,
  });
  if (relevantMemories.length === 0) {
    return baseSystemPrompt;
  }
  const memoryContext = relevantMemories
    .map((m) => `[${m.timestamp}]: ${m.summary}`)
    .join("\n");
  return `${baseSystemPrompt}\n\n## Relevant past interactions with this user:\n${memoryContext}`;
}
```

The retrieval step is critical. Bad retrieval injects irrelevant memories that confuse the agent. Good retrieval surfaces exactly the past context that makes the current response better.
Semantic memory is the agent's knowledge base. Product documentation. Domain expertise. Company policies. Data that the agent needs to answer questions or make decisions.
This is the domain of RAG (Retrieval-Augmented Generation). You store knowledge in a vector database, retrieve relevant chunks when needed, inject them into context. The agent reasons over retrieved knowledge rather than relying solely on training data.
The distinction from episodic memory: semantic memory is factual knowledge that does not expire or change based on individual interactions. Episodic memory is the record of what happened in specific past sessions.
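The storage side of that RAG pipeline starts with splitting documents into overlapping chunks before embedding them. Here is a minimal sketch; `chunkDocument`, the 800-character chunk size, and the 100-character overlap are illustrative assumptions, not canonical values.

```typescript
// Split a document into overlapping character windows. The overlap ensures
// that a fact straddling a chunk boundary still appears intact in one chunk.
function chunkDocument(text: string, chunkSize = 800, overlap = 100): string[] {
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break;
    // Step forward by chunkSize minus overlap so adjacent chunks share text.
    start += chunkSize - overlap;
  }
  return chunks;
}
```

Each chunk is then embedded and upserted into the vector database; at query time you embed the question and retrieve the nearest chunks, exactly as in the episodic-memory example above.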
Procedural memory encodes skills and workflows. How to handle a specific type of request. The sequence of steps for a complex task. The heuristics that work for a particular domain.
This lives primarily in the system prompt and in fine-tuning. When you notice the agent consistently making the same type of mistake, you update the system prompt to encode the correct approach. Over time, the system prompt becomes a library of learned procedures.
For teams that run many agent sessions, consider fine-tuning on your successful interactions. This is the deepest form of procedural memory: baking learned behaviors into the model weights themselves.
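One lightweight way to implement prompt-level procedural memory is to keep learned rules as data and render them into the system prompt at session start. The `Procedure` shape and the heading below are assumptions for illustration, not a standard format.

```typescript
// Procedural memory as data: each learned rule pairs a trigger condition
// with the instruction that corrects the agent's past mistake.
type Procedure = { trigger: string; instruction: string };

function buildSystemPrompt(base: string, procedures: Procedure[]): string {
  if (procedures.length === 0) return base;
  const rules = procedures
    .map((p, i) => `${i + 1}. When ${p.trigger}: ${p.instruction}`)
    .join("\n");
  return `${base}\n\n## Learned procedures\n${rules}`;
}
```

Keeping procedures as structured data rather than hand-edited prompt text makes it easy to review, version, and prune them as the library grows.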
For episodic and semantic memory, vector databases are the standard choice. They store high-dimensional embeddings and support semantic similarity search: find the ten most conceptually similar items to a query, even without exact keyword matches.
| Database | Strengths | Weaknesses | Best For |
|---|---|---|---|
| Pinecone | Managed, fast, scalable | Expensive at scale | Production semantic search |
| Weaviate | Hybrid search, open-source | More ops overhead | Complex retrieval needs |
| Chroma | Simple, local-friendly | Not enterprise-scale | Development and prototyping |
| pgvector | Postgres native | Less optimized for vectors | Teams already using Postgres |
| Qdrant | Fast, open-source, feature-rich | Newer ecosystem | High-performance requirements |
For most agent systems, start with pgvector if you are already running Postgres, or Chroma for development. Migrate to Pinecone or Qdrant when you hit scale.
Everyone focuses on memory retrieval. The harder problem is memory writing. What do you store? When? How do you decide what is worth remembering?
Storing everything is tempting and wrong. You end up with a massive database of noise that degrades retrieval quality and costs significant money.
A practical framework for deciding what to store:
Store outcomes, not processes. The fact that a user prefers brief responses matters. The full transcript of a three-hour session usually does not. Extract the signal, store the signal.
Store corrections explicitly. When a user corrects the agent, store the correction with high weight. "That is not what I meant. I prefer X, not Y" is high-value episodic memory.
Store decisions with rationale. When the agent makes a significant decision, store what it decided and why. This helps with consistency across sessions and makes the agent's reasoning auditable.
Set retention policies. Old memories become less relevant. Implement TTL (time-to-live) for episodic memory. What mattered six months ago probably matters less now. Decay the relevance score of old memories over time.
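The decay idea can be sketched as an exponential half-life applied on top of the vector-similarity score at query time. The 90-day half-life here is an illustrative assumption; pick one that matches how quickly your domain's facts go stale.

```typescript
// Time-decayed relevance: halve a memory's effective score every
// `halfLifeDays` days, so old memories rank below fresh ones at equal
// similarity without being deleted outright.
function decayedScore(
  similarity: number,
  ageDays: number,
  halfLifeDays = 90
): number {
  const decay = Math.pow(0.5, ageDays / halfLifeDays);
  return similarity * decay;
}
```

Apply this after retrieval, re-rank the candidates, and drop anything whose decayed score falls below your injection threshold.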
```typescript
async function storeInteractionMemory(
  userId: string,
  interaction: Interaction
): Promise<void> {
  // Extract key facts from the interaction
  const factsToStore = await llm.extract({
    text: interaction.transcript,
    schema: {
      preferences: "User preferences discovered during this interaction",
      corrections: "Cases where the user corrected the agent",
      decisions: "Significant decisions made and their rationale",
      context: "Background information about the user or their situation",
    },
  });
  // Only store if any extracted field has meaningful content
  const hasContent = Object.values(factsToStore).some(
    (field) => field.length > 0
  );
  if (!hasContent) {
    return; // Nothing worth storing
  }
  const embedding = await embeddings.create(JSON.stringify(factsToStore));
  await vectorDB.upsert({
    collection: "user_memory",
    id: `${userId}-${interaction.id}`,
    vector: embedding,
    metadata: {
      userId,
      timestamp: interaction.endTime,
      ...factsToStore,
    },
  });
}
```

For sessions that run long, context management is not optional. Here is a production-ready approach.
```typescript
class ContextManager {
  private maxTokens: number;
  private messages: Message[];

  constructor(maxTokens = 100000) {
    this.maxTokens = maxTokens;
    this.messages = [];
  }

  async add(message: Message): Promise<void> {
    this.messages.push(message);
    await this.pruneIfNeeded();
  }

  private async countTokens(): Promise<number> {
    // Approximation: ~4 characters per token. Swap in the tokenizer that
    // matches your model for production accuracy.
    const chars = this.messages.reduce((sum, m) => sum + m.content.length, 0);
    return Math.ceil(chars / 4);
  }

  private async pruneIfNeeded(): Promise<void> {
    const currentTokens = await this.countTokens();
    if (currentTokens <= this.maxTokens * 0.8) return;
    // Summarize the older half of the conversation when we hit 80% of the limit
    const midpoint = Math.floor(this.messages.length / 2);
    const oldMessages = this.messages.slice(0, midpoint);
    const summary = await this.summarize(oldMessages);
    // Replace old messages with the summary
    this.messages = [
      { role: "system", content: `[Context summary]: ${summary}` },
      ...this.messages.slice(midpoint),
    ];
  }

  private async summarize(messages: Message[]): Promise<string> {
    const response = await llm.complete({
      messages,
      system:
        "Summarize the key information from this conversation. " +
        "Focus on decisions made, user preferences, and task state. " +
        "Be concise. This summary will be used to maintain context in a long-running session.",
    });
    return response.content;
  }
}
```

Context window management is the difference between an agent that handles real-world tasks and one that works only in demos. Real tasks are long. Plan for it from day one.
When multiple agents work together, memory becomes a coordination challenge. Agents need to share state without conflicts.
Two patterns work:
Shared memory store. All agents read from and write to a central memory store. Works well when agents have complementary roles that do not overlap. Use optimistic locking or versioning to handle concurrent writes.
Memory-passing protocols. Agents explicitly pass memory objects between handoffs. Agent A completes its work, packages its findings into a structured memory object, and passes it to Agent B. More overhead but better isolation.
For most multi-agent systems, the shared memory store is simpler to implement and reason about. Lock contention is rarely a problem because well-designed agent teams minimize concurrent writes to the same memory regions.
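The optimistic-locking half of that pattern can be sketched as a version check at write time. The in-memory `Map` below stands in for a real database, and the record shape is an assumption for illustration.

```typescript
// Optimistic locking: every record carries a version number. A write only
// succeeds if the caller read the version it claims to be updating; a stale
// write fails and the agent must re-read and retry.
type MemoryRecord = { value: string; version: number };

class SharedMemoryStore {
  private records = new Map<string, MemoryRecord>();

  read(key: string): MemoryRecord | undefined {
    return this.records.get(key);
  }

  // Returns true on success, false if another agent wrote first.
  write(key: string, value: string, expectedVersion: number): boolean {
    const current = this.records.get(key);
    const currentVersion = current?.version ?? 0;
    if (currentVersion !== expectedVersion) return false;
    this.records.set(key, { value, version: currentVersion + 1 });
    return true;
  }
}
```

A failed write is not an error condition; it is the signal to re-read, merge, and retry, which is exactly the conflict-handling discipline that keeps concurrent agents from silently overwriting each other.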
Memory is not free. Vector database queries, embedding generation, and the additional tokens injected into context all add up.
For a production agent handling 10,000 sessions per day, those costs compound into a real line item. But the ROI is almost always positive when the memory system meaningfully improves agent usefulness. A support agent that remembers customer history resolves issues faster and requires fewer follow-up interactions. The token cost is a rounding error compared to the value of fewer support tickets.
For strategies on managing the broader cost picture, see agent cost optimization.
Q: How does AI agent memory work?
AI agent memory spans four types: working memory (the active context window), episodic memory (records of past interactions), semantic memory (the agent's knowledge base), and procedural memory (learned skills and workflows). Effective agents combine all four to maintain context across sessions and learn from past interactions.
Q: What is the context window limitation and how do you work around it?
The context window is the maximum amount of text an AI model can process at once, typically 8K to 200K+ tokens depending on the model. Workarounds include RAG to fetch relevant information on demand, summarization to compress older context, hierarchical memory at different levels of granularity, and strategic context management that prioritizes the most relevant information.
Q: How do AI agents maintain context across sessions?
Agents maintain cross-session context through persistent stores — vector databases for semantic search, structured databases for explicit facts, and summary files (like CLAUDE.md) capturing project knowledge. The agent retrieves relevant memories at session start and updates them as new information is learned.
