
Your Agent Remembers Everything Except What Matters

ICLR 2026 MemAgents research reveals when AI agents need episodic memory (what happened) vs semantic memory (what's true). Covers MAGMA, Mem0, AdaMem papers, comparison of Mem0 vs Letta vs Zep, and architecture patterns with TypeScript examples.

Dean Grover, Co-founder
March 20, 2026
18 min read
[Hero image: abstract neural pathways splitting into two branches, representing episodic and semantic memory systems]

Our agent remembered that the customer called twice last week. It had perfect episodic recall: timestamps, transcript snippets, the exact tool calls from each session. But when the customer called a third time and said "I'm the one who always has billing problems," the agent drew a blank. It had the episodes. It never distilled the pattern.

Two types of memory. One was missing. And that gap cost us a fifteen-minute call that should have taken three. Conventional wisdom says "more memory is better." The research says the type of memory matters more than the amount.

This isn't a theoretical problem. It's the central architectural question facing every team building customer-facing AI agents in 2026. The ICLR 2026 MemAgents workshop in Rio de Janeiro, the first major venue dedicated entirely to agent memory, put it in formal terms: episodic memory stores what happened; semantic memory stores what's true. Most production agents have one or the other. The research says you need both, and more importantly, you need to know when each one matters.


Two memory systems, one agent

The distinction comes from cognitive science. Endel Tulving proposed it in 1972, and it has held up through fifty years of neuroscience research. The MemAgents workshop proposal explicitly bridges these perspectives, calling out episodic, semantic, and working memory as the three architectural pillars, alongside neuroscience-inspired consolidation as a design pattern.

Episodic memory records specific events with temporal context. "On March 12 at 2:47 PM, the customer called about a double charge on invoice #8291. They were frustrated. The agent issued a $47.50 refund and the customer accepted."

Semantic memory stores distilled facts without event context. "This customer prefers email follow-ups. They have a history of billing disputes. They're on the Enterprise plan."

The difference isn't just academic. It determines what your agent can do:

| Capability | Episodic | Semantic |
| --- | --- | --- |
| "What happened last time?" | Direct recall of the event | Cannot answer |
| "What do we know about this customer?" | Must search all past episodes | Direct lookup |
| "Why did we give them a refund?" | Full context with reasoning chain | Cannot answer |
| "Does this customer prefer phone or email?" | Must infer from multiple episodes | Direct answer |
| Audit trail and compliance | Complete event log | Insufficient |
| Personalization at scale | Too slow to search everything | Fast, pre-computed |

Human brains run both systems simultaneously. The hippocampus rapidly encodes episodes. The neocortex slowly consolidates patterns into semantic knowledge, mostly during sleep. The MemAgents workshop calls this complementary learning systems theory, and it's now a first-class design pattern for agents.

What the research says

The first quarter of 2026 produced more agent memory research than all of 2024 combined. Here are the papers that matter for practitioners.

MAGMA: four graphs, one memory

MAGMA (Multi-Graph based Agentic Memory Architecture) treats each memory item as a node that lives simultaneously in four orthogonal graphs: semantic, temporal, causal, and entity. When retrieving, a policy-guided traversal walks across whichever graph dimensions the query needs.

Ask "what happened after the customer complained?" and the traversal follows temporal and causal edges. Ask "what do we know about this customer's preferences?" and it follows semantic and entity edges. Same memory store, different retrieval paths.

The results on the LoCoMo benchmark are striking: a judge score of 0.70, outperforming full-context baselines (0.481) by 45.5% and beating prior memory systems like A-MEM (0.58) and Nemori (0.59) by 18-20%. The insight isn't just that multi-graph works. It's that explicitly separating the semantic and temporal views of the same memory enables fundamentally better retrieval.

Mem0: production-scale memory with a 26% accuracy edge

Mem0's paper on the LoCoMo benchmark demonstrated that combining episodic and semantic extraction with graph-based representations delivers a 26% relative improvement in accuracy over OpenAI's built-in memory (66.9% vs 52.9% overall LLM-as-Judge score). The graph-enhanced variant adds another 2% on top.

Beyond accuracy, the production numbers matter: 91% lower p95 latency and 90% token savings compared to stuffing full conversation history into context. That's the difference between a memory system that works in a demo and one that works at scale.

Mem0 explicitly categorizes memories into episodic (interaction-specific events with temporal markers) and semantic (extracted knowledge without event context). The system scores each memory using a composite of semantic similarity (60%), extraction confidence (20%), recency (10%), and access frequency (10%).
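That composite can be sketched as a simple weighted sum. Here's a minimal TypeScript version using the weights from the paper; the `ScoredMemory` shape and the normalization of the recency and frequency signals are our own illustrative assumptions, not Mem0's actual API.

```typescript
// Hypothetical memory shape for illustration -- not Mem0's API.
interface ScoredMemory {
  similarity: number;           // cosine similarity to the query, 0..1
  extractionConfidence: number; // 0..1
  ageDays: number;              // days since the memory was written
  accessCount: number;          // how often it has been recalled
}

// Weights from the Mem0 paper: 60/20/10/10.
// The decay constant and access cap are illustrative assumptions.
function compositeScore(m: ScoredMemory): number {
  const recency = Math.exp(-m.ageDays / 30);         // decays over ~a month
  const frequency = Math.min(m.accessCount / 10, 1); // saturates at 10 recalls
  return (
    0.6 * m.similarity +
    0.2 * m.extractionConfidence +
    0.1 * recency +
    0.1 * frequency
  );
}
```

The weighting biases retrieval the way you'd want: a highly relevant but stale memory still outranks a fresh but off-topic one, because similarity carries six times the weight of recency.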

E-mem: episodic context without compression loss

E-mem took a different angle on episodic memory. Instead of compressing episodes into summaries (which loses detail), it uses a hierarchical multi-agent architecture where subordinate agents each maintain uncompressed episode windows while a master agent orchestrates retrieval across them.

The result: 54%+ F1 on LoCoMo (7.75% above the previous state of the art) while reducing token cost by 70%. The lesson for practitioners: if your use case requires high-fidelity episodic recall (compliance, legal, healthcare), compression-based approaches may sacrifice too much detail.
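The hierarchical idea can be sketched in a few lines: subordinate agents each hold an uncompressed window of episodes, and a master fans the query out and merges the best hits. Every name below is illustrative, not E-mem's actual interface, and the keyword scoring stands in for the LLM-based relevance judgments the paper uses.

```typescript
// Sketch of the E-mem pattern under stated assumptions.
interface Episode { text: string; timestamp: number }

class WindowAgent {
  constructor(private episodes: Episode[]) {}

  // Score by naive keyword overlap; a real subordinate agent
  // would judge relevance with an LLM over its full window.
  search(query: string, k: number): Episode[] {
    const terms = query.toLowerCase().split(/\s+/);
    return [...this.episodes]
      .map(e => ({
        e,
        score: terms.filter(t => e.text.toLowerCase().includes(t)).length,
      }))
      .filter(({ score }) => score > 0)
      .sort((a, b) => b.score - a.score)
      .slice(0, k)
      .map(({ e }) => e);
  }
}

class MasterAgent {
  constructor(private workers: WindowAgent[]) {}

  // Fan out to every window, then merge newest-first.
  retrieve(query: string, k = 3): Episode[] {
    return this.workers
      .flatMap(w => w.search(query, k))
      .sort((a, b) => b.timestamp - a.timestamp)
      .slice(0, k);
  }
}
```

The point of the structure is that no episode is ever summarized away: each window stays verbatim, and only the retrieval is coordinated.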

MEM-alpha: teaching agents to manage their own memory

MEM-alpha, under review at ICLR 2026, frames memory construction as a reinforcement learning problem. The agent learns when to store, update, summarize, or discard memories through interaction and feedback, using separate episodic, semantic, and core memory components with specialized tools for each.

The most impressive finding: despite training on sequences of only 30K tokens, the agents generalize to sequences exceeding 400K tokens, over 13x the training length. The memory management policy transfers because the episodic-semantic distinction itself is generalizable. An agent that learns "distill repeated patterns into semantic facts" applies that strategy regardless of conversation length.

AdaMem: four memory types in concert

AdaMem, published March 2026 from Tsinghua and Tencent, organizes dialogue history into working memory, episodic memory, persona memory, and graph-based memory. The key innovation is question-conditioned retrieval: the system examines each query and decides which memory types to activate. A "what happened" question triggers episodic retrieval. A "what kind of person" question triggers persona memory. A "how are X and Y related" question triggers graph expansion.

This achieved state-of-the-art results on both LoCoMo and PERSONAMEM benchmarks, confirming that the right memory type for a query is as important as the quality of any single memory type.

The consolidation problem

Here's where the storyline from our opening comes back. Our agent had excellent episodic memory. It stored every interaction faithfully. But it never consolidated those episodes into semantic facts.

The customer called three times about billing issues. Each episode was stored independently. When the customer said "I always have billing problems," the agent couldn't confirm or deny. It would need to search all episodes, find the pattern, and synthesize it in real-time. That's expensive, slow, and often inaccurate.

Memory consolidation is the process of converting episodic memories into semantic knowledge. In neuroscience, this happens during sleep: the hippocampus replays recent experiences for the neocortex. In AI agents, it's a background process that periodically scans recent episodes and extracts durable facts.

[Diagram: Memory consolidation — episodes become facts. Three episodes (March 5: billing dispute, frustrated; March 9: billing error, requested email; March 14: refund request, billing again) feed a consolidation process that yields three semantic facts: recurring billing issues, prefers email follow-up, frustration escalation risk.]

The A-MAC paper (Adaptive Memory Admission Control) from the MemAgents workshop formalizes this with five factors for deciding what to consolidate: future utility, factual confidence, semantic novelty, temporal recency, and content type prior. Not every episode deserves to become a semantic fact. The customer mentioning the weather doesn't consolidate. The customer mentioning they're switching to a competitor does.

Without consolidation, episodic memory grows unbounded and retrieval quality degrades. With naive consolidation, you lose the source episodes and can't answer "when did this happen?" or "who told us this?" The production answer is both: consolidate into semantic facts while retaining episodic sources as references.

Architecture patterns

Three architectural approaches have emerged from the 2026 research. Each makes a different trade-off between episodic fidelity and semantic efficiency.

Pattern 1: Dual-store with consolidation

Separate stores for episodes and facts, with a background consolidation process. This is closest to the biological model and is what Mem0 and AWS AgentCore implement.

typescript
interface EpisodicMemory {
  id: string;
  timestamp: Date;
  // Full event context -- who, what, when, why
  event: string;
  participants: string[];
  sentiment: 'positive' | 'neutral' | 'negative';
  embedding: number[];
}
 
interface SemanticMemory {
  id: string;
  // Distilled fact without temporal context
  fact: string;
  confidence: number;
  // Link back to supporting episodes for auditability
  sourceEpisodeIds: string[];
  lastUpdated: Date;
  embedding: number[];
}
 
async function consolidate(
  episodes: EpisodicMemory[],
  existingFacts: SemanticMemory[]
): Promise<SemanticMemory[]> {
  // Ask the LLM to extract durable facts from recent episodes
  // that aren't already captured in existing semantic memory
  const prompt = `Given these recent interactions:\n${
    episodes.map(e => `[${e.timestamp}] ${e.event}`).join('\n')
  }\n\nAnd these known facts:\n${
    existingFacts.map(f => f.fact).join('\n')
  }\n\nExtract NEW durable facts about this customer.
  Only include facts likely to be true in future interactions.
  Do not include one-time events or transient states.`;
 
  const response = await llm.generate(prompt);
  return parseFactsFromResponse(response, episodes);
}

When to use: Customer-facing agents where you need both personalization (semantic) and accountability (episodic). Most production use cases land here.

Pattern 2: Multi-graph unified store

MAGMA's approach: one memory store, multiple graph views. Every memory has semantic, temporal, causal, and entity edges. Retrieval traverses whichever dimensions the query requires.

typescript
interface UnifiedMemoryNode {
  id: string;
  content: string;
  // Semantic edges: "similar to" relationships
  semanticNeighbors: { nodeId: string; similarity: number }[];
  // Temporal edges: "happened before/after"
  temporalEdges: { nodeId: string; relation: 'before' | 'after' }[];
  // Causal edges: "caused by" / "resulted in"
  causalEdges: { nodeId: string; relation: 'cause' | 'effect' }[];
  // Entity edges: "involves" specific people, products, accounts
  entityEdges: { entityId: string; role: string }[];
}
 
// Query routing decides which graph dimensions to traverse
type GraphDimension = 'semantic' | 'temporal' | 'causal' | 'entity';
 
function routeQuery(query: string): GraphDimension[] {
  if (query.includes('what happened') || query.includes('when'))
    return ['temporal', 'causal'];
  if (query.includes('who') || query.includes('customer'))
    return ['entity', 'semantic'];
  if (query.includes('why') || query.includes('because'))
    return ['causal', 'semantic'];
  // Default: semantic similarity
  return ['semantic'];
}

When to use: Complex reasoning tasks where queries require navigating relationships between events. Think enterprise CRM, legal research, medical history analysis.

Pattern 3: Self-editing memory blocks

Letta's approach, inspired by MemGPT: the agent maintains structured memory blocks inside its context window and directly edits them during conversation. No separate retrieval step. Memory is always present.

typescript
// Memory lives in the system prompt, updated by tool calls
const coreMemory = {
  // Agent edits these blocks directly during conversation
  userProfile: "Name: Sarah Chen. Plan: Enterprise. Preference: email.",
  conversationState: "Third call this month. All about billing.",
  knownIssues: "Recurring billing discrepancies. Last refund: $47.50.",
};
 
// Agent uses tool calls to update memory in real-time
const memoryTools = [
  {
    name: "update_memory",
    description: "Update a memory block with new information",
    // Agent decides what to remember and how to phrase it
    parameters: { block: "string", content: "string" }
  },
  {
    name: "search_archival",
    description: "Search long-term storage for older memories",
    parameters: { query: "string" }
  }
];

When to use: Agents that need to actively reason about their own knowledge state. Personal assistants, tutoring systems, agents that explain their reasoning. The trade-off is smaller total memory capacity (limited by context window) but zero retrieval latency for core facts.
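The tool definitions above still need a handler on the application side that applies the agent's edits and rebuilds the prompt. A minimal sketch of that wiring follows; it's our own illustration of the loop, not Letta's implementation, which also enforces block size limits and archival eviction.

```typescript
type CoreMemory = Record<string, string>;

// Apply an agent-issued update_memory tool call to the in-context blocks.
// Returns a new object so the caller can rebuild the system prompt.
function applyMemoryTool(
  memory: CoreMemory,
  call: { name: string; args: { block: string; content: string } }
): CoreMemory {
  if (call.name !== "update_memory") return memory;
  if (!(call.args.block in memory)) {
    throw new Error(`Unknown memory block: ${call.args.block}`);
  }
  return { ...memory, [call.args.block]: call.args.content };
}

// Re-render the memory section of the system prompt after each edit.
function renderCoreMemory(memory: CoreMemory): string {
  return Object.entries(memory)
    .map(([block, content]) => `<${block}>\n${content}\n</${block}>`)
    .join("\n");
}
```

After every tool call the re-rendered blocks go back into the system prompt, which is what gives this pattern its zero retrieval latency: the memory is simply always in context.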

Platform comparison

The memory platform landscape in 2026 has consolidated around three major open-source options plus managed services. Here's how they map to the episodic-semantic decision.

| Feature | Mem0 | Zep (Graphiti) | Letta | AWS AgentCore |
| --- | --- | --- | --- | --- |
| Memory model | Hybrid vector + graph | Temporal knowledge graph | Self-editing blocks + archival | Managed extraction |
| Episodic support | Events with temporal markers | Full temporal graph with validity windows | Recall memory (conversation history) | Conversation summaries |
| Semantic support | Extracted facts, composite scoring | Entity facts with temporal evolution | Core memory blocks, agent-edited | User preferences and facts |
| Consolidation | Automatic extraction + graph linking | Autonomous graph updates with temporal tracking | Agent-driven via tool calls | Automatic extraction |
| Temporal reasoning | Basic recency scoring | Native (tracks when facts became true/false) | Limited to agent's own edits | Basic |
| LoCoMo score | 66.9% (LLM-as-Judge) | N/A (different benchmark: 94.8% DMR) | N/A (benchmark requested via GitHub issue) | N/A |
| Latency | 91% lower p95 vs full-context | Sub-100ms retrieval | Zero for core memory (in-context) | AWS-managed |
| Self-hosted | Yes (open source) | Yes (Graphiti is open source) | Yes (open source) | No (AWS only) |
| Best for | General-purpose agents | Enterprise with evolving relationships | Agents that reason about knowledge | AWS-native stacks |
Choosing between them: For a deeper build-vs-buy analysis of these platforms, see our memory platform comparison. The short version: if your agent handles customer support or sales where facts about customers matter most, Mem0's composite scoring gives you the best accuracy-latency trade-off. If your agent operates in domains where facts change over time (insurance, healthcare, finance), Zep's temporal knowledge graph tracks when facts were true and when they were superseded. If your agent needs to actively manage what it knows (tutoring, personal assistants), Letta's self-editing model gives the agent direct control.

For teams already on AWS, AgentCore Memory provides a managed path. The March 2026 streaming notifications feature pushes memory changes to Kinesis, enabling downstream workflows without polling. It's available in 15 regions and handles both episodic (conversation summaries) and semantic (extracted preferences and facts) memory automatically.
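A consumer of those streamed notifications can stay simple: decode each Kinesis record and route the change to a downstream workflow. The event payload shape below is a guess for illustration only; check the AgentCore documentation for the real schema.

```typescript
// Hypothetical memory-change event shape -- not the documented
// AgentCore schema, just an illustration of the consumer pattern.
interface MemoryChangeEvent {
  memoryType: 'episodic' | 'semantic';
  action: 'created' | 'updated' | 'deleted';
  memoryId: string;
}

// Kinesis delivers record data base64-encoded.
function decodeRecord(data: string): MemoryChangeEvent {
  return JSON.parse(Buffer.from(data, 'base64').toString('utf8'));
}

// Route changes to downstream workflows without polling.
// Workflow names here are illustrative.
function routeChange(event: MemoryChangeEvent): string {
  if (event.memoryType === 'semantic' && event.action === 'updated')
    return 'refresh-personalization-cache';
  if (event.action === 'deleted')
    return 'audit-log';
  return 'noop';
}
```

The design point is push instead of pull: a semantic-fact update can invalidate a personalization cache within seconds, rather than whenever the next polling cycle happens to run.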

When each type matters

Not every agent needs both memory types equally. The 2026 research points to clear guidelines based on your use case.

Episodic-heavy use cases need high-fidelity event recall. Compliance and audit requirements. Customer support where "what happened last time" is the most common question. Legal or medical agents where the source and context of information matters as much as the information itself. E-mem's uncompressed episodic architecture specifically targets these, preserving full event detail instead of summarizing it away.

Semantic-heavy use cases need fast personalization at scale. Sales agents that need to know preferences instantly. Recommendation engines that synthesize patterns across hundreds of interactions. Onboarding flows that adapt to what the system already knows. Here, Mem0's composite scoring and 90% token savings matter most. You need the answer, not the audit trail.

Both equally is the most common production scenario. Customer-facing agents that personalize (semantic) while maintaining accountability (episodic). Contact center agents that greet returning customers by name and preference (semantic) but can explain exactly when and why a previous decision was made (episodic). This is where the dual-store architecture from Pattern 1 delivers, and it's what most of the 2026 research optimizes for.

The AdaMem paper gives a practical heuristic: let the query decide. Build both memory types, then route each retrieval to the appropriate store based on what the question is actually asking. A "what happened" question goes to episodic. A "what kind of person" question goes to semantic. A "how are X and Y related" question goes to the graph layer.
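That heuristic can be made concrete with a tiny store-level router. The keyword lists below are our own illustrative stand-in; production systems (and AdaMem itself) would use a learned classifier rather than regexes.

```typescript
type MemoryRoute = 'episodic' | 'semantic' | 'graph';

// Route a query to the store that can actually answer it.
// Regex heuristics are a placeholder for a trained classifier.
function routeToStore(query: string): MemoryRoute {
  const q = query.toLowerCase();
  // Relationship questions go to the graph layer
  if (/\b(related|connection|between|linked)\b/.test(q)) return 'graph';
  // Event and timeline questions go to episodic memory
  if (/\b(when|happened|last time|previously|timeline)\b/.test(q))
    return 'episodic';
  // Preferences and standing facts default to semantic memory
  return 'semantic';
}
```

Routing before retrieval is cheap and pays twice: the right store answers faster, and the wrong store never pollutes the context with irrelevant memories.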

Implementation guide

Here's a minimal dual-store implementation that handles both memory types with consolidation. This connects to any vector store and any LLM. The pattern is what matters.

typescript
import { openai } from './clients';
 
// Store an episode after each conversation turn
async function storeEpisode(
  customerId: string,
  event: string,
  metadata: Record<string, unknown>
): Promise<EpisodicMemory> {
  const embedding = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: event,
  });
 
  return db.episodicMemories.insertOne({
    customerId,
    timestamp: new Date(),
    event,
    metadata,
    embedding: embedding.data[0].embedding,
  });
}
 
// Query the right memory type based on the question
async function recall(
  customerId: string,
  query: string
): Promise<{ episodic: EpisodicMemory[]; semantic: SemanticMemory[] }> {
  const queryEmbedding = await embed(query);
 
  // Always retrieve relevant semantic facts -- cheap, fast
  const semanticResults = await db.semanticMemories.vectorSearch({
    filter: { customerId },
    vector: queryEmbedding,
    limit: 5,
    minScore: 0.3,
  });
 
  // Retrieve episodic memories for temporal/causal queries
  // or when semantic results are insufficient
  const needsEpisodic = isTemporalQuery(query)
    || semanticResults.length < 2;
 
  const episodicResults = needsEpisodic
    ? await db.episodicMemories.vectorSearch({
        filter: { customerId },
        vector: queryEmbedding,
        limit: 10,
        minScore: 0.25,
      })
    : [];
 
  return { episodic: episodicResults, semantic: semanticResults };
}
 
// Simple heuristic -- production systems use a classifier
function isTemporalQuery(query: string): boolean {
  const temporalSignals = [
    'when', 'last time', 'previously', 'before',
    'after', 'history', 'what happened', 'timeline',
  ];
  return temporalSignals.some(s =>
    query.toLowerCase().includes(s)
  );
}

The consolidation process runs on a schedule: hourly, daily, or triggered by a threshold of new episodes.

typescript
async function runConsolidation(customerId: string) {
  // Get episodes not yet consolidated
  const recentEpisodes = await db.episodicMemories.find({
    customerId,
    consolidated: { $ne: true },
    timestamp: { $gte: daysAgo(7) },
  });
 
  if (recentEpisodes.length < 3) return; // Not enough signal
 
  const existingFacts = await db.semanticMemories.find({ customerId });
 
  const newFacts = await consolidate(recentEpisodes, existingFacts);
 
  // Store new semantic facts with source references
  for (const fact of newFacts) {
    await db.semanticMemories.insertOne({
      ...fact,
      customerId,
      // Retain provenance -- which episodes support this fact
      sourceEpisodeIds: recentEpisodes.map(e => e.id),
    });
  }
 
  // Mark episodes as consolidated (don't delete them)
  await db.episodicMemories.updateMany(
    { _id: { $in: recentEpisodes.map(e => e._id) } },
    { $set: { consolidated: true } }
  );
}

The key detail: marking episodes as consolidated, not deleting them. You need both the distilled fact ("customer has recurring billing issues") and the source episodes ("specifically, calls on March 5, 9, and 14") for the system to be auditable. If you've worked through building a memory system from scratch, this consolidation layer is what sits on top of the persistent + semantic stores from that tutorial.
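Provenance makes audit answers cheap. A sketch of the lookup as a pure function; in production the episodes would come from `db.episodicMemories` as in the snippets above, and the interface names here are illustrative.

```typescript
// Illustrative shapes -- slimmed-down versions of the stored records.
interface ProvenancedFact { fact: string; sourceEpisodeIds: string[] }
interface SourceEpisode { id: string; timestamp: string; event: string }

// Answer "why do we believe this?" by resolving the provenance links
// from a semantic fact back to its supporting episodes.
function explainFact(
  fact: ProvenancedFact,
  episodes: SourceEpisode[]
): string {
  const sources = episodes.filter(e => fact.sourceEpisodeIds.includes(e.id));
  const lines = sources.map(e => `- [${e.timestamp}] ${e.event}`);
  return `${fact.fact}\nSupported by ${sources.length} episode(s):\n${lines.join('\n')}`;
}
```

This is the payoff of marking episodes as consolidated instead of deleting them: the distilled fact and its evidence trail stay one join apart.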

The admission control problem

Even with both memory types working, you face a harder question: what deserves to be remembered at all?

The A-MAC paper from the MemAgents workshop frames this as memory admission control. Current systems either store everything (expensive, noisy) or use opaque LLM-driven policies (costly, hard to audit). A-MAC decomposes memory value into five interpretable factors:

  1. Future utility: will this information matter in future interactions?
  2. Factual confidence: is this information reliable?
  3. Semantic novelty: do we already know this?
  4. Temporal recency: is this current?
  5. Content type prior: does this category of information tend to be useful?

This maps directly to the episodic-semantic decision. High future utility + high confidence + low novelty = don't store (we already know it). High future utility + high confidence + high novelty = store as semantic fact. Moderate utility + high temporal relevance = store as episode (might matter in context, not worth promoting to a fact yet).

The practical implementation is a scoring function that runs before every memory write:

typescript
function shouldAdmit(candidate: MemoryCandidate): {
  admit: boolean;
  type: 'episodic' | 'semantic' | 'discard';
} {
  const score = (
    candidate.futureUtility * 0.35 +
    candidate.factualConfidence * 0.25 +
    candidate.semanticNovelty * 0.20 +
    candidate.temporalRecency * 0.10 +
    candidate.contentTypePrior * 0.10
  );
 
  if (score < 0.3) return { admit: false, type: 'discard' };
 
  // High-confidence, durable facts go to semantic store
  if (candidate.factualConfidence > 0.8 && candidate.futureUtility > 0.7)
    return { admit: true, type: 'semantic' };
 
  // Everything else starts as an episode
  // (may consolidate to semantic later)
  return { admit: true, type: 'episodic' };
}

Without admission control, agents accumulate hallucinated facts, obsolete preferences, and conversational noise. With it, memory stays focused on what actually matters for future interactions. This is especially critical for privacy-first memory design. Admission control is your first line of defense against storing information you shouldn't.

What this means for production

The 2026 research converges on a clear message: the episodic-semantic split isn't optional. It's the foundation that every other memory capability builds on. MAGMA's four-graph architecture works because it explicitly separates temporal and semantic views. Mem0's 26% accuracy gain works because it extracts and scores both memory types. AdaMem's query routing works because different questions need different memory stores.

If you're building memory into an agent today, start with the dual-store pattern. Store episodes from every conversation. Run consolidation to extract semantic facts. Route queries to the appropriate store based on what's being asked. That's the architecture that 2026's best-performing systems all share.

The agent from our opening? We added a nightly consolidation job. Three episodes about billing disputes became one semantic fact: "Customer has recurring billing issues, prefers email resolution, escalation risk." The next time they called, the agent knew them before they said a word. The call lasted four minutes.

Two types of memory. Both present. That's what makes an agent feel like it actually knows you.

Give your agents memory that learns

Chanl's memory system stores episodic events and distills semantic facts automatically. Your agents remember what happened and what it means.
