
Agent Memory Is an Architecture Problem, Not a Context-Size Problem

When an AI agent fails to recall what it learned last session, the instinct is to add more context. That instinct is wrong. Working memory, episodic memory, and semantic memory are three structurally different problems; designing all three is what separates agents that compound value across sessions from agents that start from scratch each time.

Antonio J. del Águila

Knaisoma

When an AI agent fails on a task it completed successfully last session, the standard diagnosis is that the model forgot. The standard fix is a bigger context window. Both the diagnosis and the fix are usually wrong.

The model did not forget. It was never given a design that lets it remember. The context window it ran in was cleared at the end of the previous session, like a whiteboard wiped between meetings, and nothing had been deliberately written anywhere else. The architectural decisions (which information deserves persistence, in what form, with what retrieval mechanism, and with what retention policy) were simply never made.

Buying a larger context window in response treats the symptom. A 1-million-token window does not solve the problem of what to put in it, when to write information back out, and how to retrieve it reliably across sessions. Across the teams we work with that are moving agents from pilot to production, the failure mode that accounts for the most remediation effort is not a capability gap. It is a missing memory architecture.

The context window is working memory, not storage

A CPU has registers for fast, volatile computation. An AI agent’s context window is its equivalent: it holds the current task state, recent observations, the conversation history, and whatever grounding the agent needs right now. When the task ends, it is cleared. Nothing persists automatically. Everything that matters for future sessions has to be written somewhere deliberately, with an explicit read path back in.

That constraint has a straightforward engineering implication. Every durable piece of information needs an explicit home outside the context window, with a defined write policy and a defined retrieval strategy. The context window is where computation happens, not where knowledge lives.
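As a minimal sketch of what "an explicit home with a write path and a read path" can look like, the snippet below persists one fact to a JSON file at the end of one session and loads it at the start of the next. The file name `agent_notes.json` and the helpers `write_note` and `load_notes` are hypothetical names chosen for illustration, not part of any particular framework.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict


def write_note(store: Path, key: str, value: str) -> None:
    """Deliberate write: persist one durable fact outside the context window."""
    notes = json.loads(store.read_text()) if store.exists() else {}
    notes[key] = value
    store.write_text(json.dumps(notes, indent=2))


def load_notes(store: Path) -> Dict[str, str]:
    """Explicit read path: called at session start to rebuild context."""
    return json.loads(store.read_text()) if store.exists() else {}


# Session 1 learns something and writes it down before its context is cleared.
store_path = Path(tempfile.mkdtemp()) / "agent_notes.json"  # hypothetical location
write_note(store_path, "test_mock_path", "/test/mocks/auth")

# Session 2 starts with an empty context and reads the note back in.
session_two_context = load_notes(store_path)
print(session_two_context["test_mock_path"])
```

A flat file is only the simplest possible home. The point is that both the write and the read are explicit calls someone decided to make.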

Anthropic’s 2026 Agentic Coding Trends Report makes this concrete: agents now complete an average of 20 autonomous actions before requiring human input, a figure that doubled in six months. More pointedly, projects with well-maintained context files — explicit structured descriptions of what the agent is permitted to know and do — see 40% fewer agent errors and 55% faster task completion than projects without them. The implication is not that larger contexts help. It is that curated contexts help. Curation is an architectural discipline, not a window-size decision.

Three anti-patterns that follow from treating context as storage

When teams skip the architecture, they land in one of three places.

Context stuffing is the most common. The instinct is to load everything the agent might need into the system prompt: the full codebase, all past conversations, every policy document, every previous session summary. Context stuffing hits three failure modes quickly. First, the cost and latency profile becomes punishing at production volume. Second, accuracy degrades as the relevant content drifts toward the middle of a very long context: research into long-context model behavior consistently finds that models attend disproportionately to content near the window boundaries. Third, and most practically damaging, the context becomes stale and unmanageable. When everything is in the prompt, nothing is owned, nothing is versioned, and nothing is pruned.

Stateless agents go the other direction. Each session starts from scratch with no knowledge of previous outcomes. Stateless agents look clean in demos and brittle in production. A coding agent that cannot remember that the test suite uses a non-standard path, that a dependency must be pinned because of a known incompatibility, or that a specific API call requires a workaround documented two sessions ago will produce those failures again. Each repeat failure is a human review cycle that a design decision could have prevented.

Indiscriminate persistence is the failure mode that arrives after teams recognize the stateless problem and overcorrect. Everything gets written to a vector store: session summaries, intermediate reasoning traces, tool-call responses, partial drafts, discarded directions, superseded instructions. Within weeks the store is large, retrieval is slow, and the retrieved content is an indiscriminate mixture of confirmed outcomes and abandoned approaches. The agent becomes confused by its own history.

A three-tier model for agent memory

The problem with treating memory as a single system is that three structurally different problems are being conflated. Each tier has different retention requirements, different retrieval mechanisms, and different cost profiles.

| Tier | What it holds | Where it lives | Retrieval | Retention |
| --- | --- | --- | --- | --- |
| Working memory | Current task state, active scratchpad, live conversation, in-progress plan | Context window | Immediate: already present | Ephemeral: cleared at session end |
| Episodic memory | Session outcomes, confirmed decisions, named failures, handoff-relevant observations | Structured store: task log, CLAUDE.md, lightweight DB | Explicit load at session start or task handoff | Medium-term: sessions to weeks; prune on supersession |
| Semantic memory | Stable facts: architecture patterns, naming conventions, service topology, coding standards, approved dependencies | Document store or RAG index | Vector search or exact lookup, injected on demand | Long-lived: indefinite, updated by deliberate revision |

The tiers are not alternatives. A production agent uses all three in combination: working memory is where the agent thinks, episodic memory is where it records what happened, and semantic memory is where stable knowledge lives so it does not have to be re-explained on every session.

The architecture question is not “should we add memory” but “which tier does this piece of information belong in, and what is the write-and-read policy for that tier.”
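One way to make the per-tier policy explicit is to encode it as data that the rest of the agent code has to consult. The sketch below is an assumption about how that might look: the `Tier` enum, the `TierPolicy` fields, and the `policy_for` helper are illustrative names, and the strings simply restate the three-tier model described above.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    WORKING = "working"
    EPISODIC = "episodic"
    SEMANTIC = "semantic"


@dataclass(frozen=True)
class TierPolicy:
    home: str       # where the tier lives
    retrieval: str  # how content gets back into the context window
    retention: str  # when content is dropped or revised


# One policy per tier; a tier without an entry here has no defined behaviour.
POLICIES = {
    Tier.WORKING: TierPolicy(
        "context window", "already present", "cleared at session end"),
    Tier.EPISODIC: TierPolicy(
        "structured store", "explicit load at session start or handoff",
        "pruned on supersession"),
    Tier.SEMANTIC: TierPolicy(
        "document store or RAG index", "vector search or exact lookup on demand",
        "updated by deliberate revision"),
}


def policy_for(tier: Tier) -> TierPolicy:
    """Look up the write-and-read policy a piece of information inherits."""
    return POLICIES[tier]


print(policy_for(Tier.EPISODIC).retention)
```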

72.9%: full-context baseline accuracy, LoCoMo benchmark (Mem0, 2026)

91.6%: two-layer memory architecture accuracy, LoCoMo (Mem0, 2026)

Fewer tokens per query with selective memory vs. context stuffing (Mem0, 2026)

The LoCoMo benchmark, which tests multi-session conversational recall across 1,540 questions, makes the trade-off visible in production terms. Loading everything into context achieves 72.9% accuracy while consuming roughly 26,000 tokens per query at a p95 latency of 17.12 seconds. A two-layer memory architecture that routes stable knowledge out of the context window achieves 91.6% accuracy using roughly 6,956 tokens per query at a p95 latency of 1.44 seconds: 18.7 percentage points more accurate, nearly 4x fewer tokens, and 91% lower p95 latency. That gap does not come from a stronger model. It comes from a better design.
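The comparison can be recomputed directly from the figures quoted above. This short check uses only those numbers; the variable names are illustrative, and the headline ratios in the text are rounded versions of what it prints.

```python
# Figures as quoted in the text above (Mem0, 2026).
full_context = {"accuracy": 72.9, "tokens": 26_000, "p95_s": 17.12}
two_layer = {"accuracy": 91.6, "tokens": 6_956, "p95_s": 1.44}

accuracy_gain = two_layer["accuracy"] - full_context["accuracy"]  # percentage points
token_ratio = full_context["tokens"] / two_layer["tokens"]        # times fewer tokens
latency_cut = 1 - two_layer["p95_s"] / full_context["p95_s"]      # fractional reduction

print(f"{accuracy_gain:.1f} points, {token_ratio:.1f}x fewer tokens, "
      f"{latency_cut:.1%} lower p95 latency")
# prints: 18.7 points, 3.7x fewer tokens, 91.6% lower p95 latency
```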

The assignment decision: what belongs where

The practical question is how to classify information at write time. Most teams discover too late that the hard problem is not retrieval but placement. Once the wrong data is in the wrong tier, retrieval will not rescue the downstream accuracy.

A working rule for each tier:

Working memory holds the current task context: the specific goal for this session, the active file being edited, the partial plan in progress, the live tool-call responses. It is volatile by design, and that is correct. The cost of pushing ephemeral task state into persistent storage is retrieval noise in future sessions, where the agent has to filter out last Thursday’s in-progress plan from the confirmed outcomes it actually needs.

Episodic memory holds outcomes, decisions, and named failures: “the authentication service test suite uses a custom mock path at /test/mocks/auth,” “the deployment pipeline expects a RELEASE_TAG environment variable,” “this migration was attempted and rolled back on 2026-05-20 because of a foreign-key constraint in the user_preferences table.” Episodic memory is what a human engineer would write in a handoff note before going on leave. It is session-to-multi-week relevant, and it should be pruned when it becomes superseded rather than allowed to accumulate indefinitely.
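Pruning on supersession can be sketched as a log in which a new note on a topic marks older notes on the same topic as superseded. This is one possible design, not a prescribed one: the `Episode` and `EpisodicLog` classes, the topic keys, the follow-up dates, and the final "re-run" note are all hypothetical, while the first two notes reuse the examples above.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class Episode:
    topic: str  # stable key used to detect supersession
    note: str
    recorded: date
    superseded: bool = False


@dataclass
class EpisodicLog:
    entries: List[Episode] = field(default_factory=list)

    def record(self, topic: str, note: str, recorded: date) -> None:
        # A new note on the same topic supersedes every older one.
        for entry in self.entries:
            if entry.topic == topic:
                entry.superseded = True
        self.entries.append(Episode(topic, note, recorded))

    def prune(self) -> int:
        """Drop superseded entries and return how many were removed."""
        before = len(self.entries)
        self.entries = [e for e in self.entries if not e.superseded]
        return before - len(self.entries)

    def handoff_notes(self) -> List[str]:
        """What gets loaded into context at the next session start."""
        return [e.note for e in self.entries if not e.superseded]


log = EpisodicLog()
log.record("user-prefs-migration",
           "Attempted and rolled back: foreign-key constraint in user_preferences.",
           date(2026, 5, 20))
log.record("release-tag",
           "Deployment pipeline expects a RELEASE_TAG environment variable.",
           date(2026, 5, 21))
# Hypothetical follow-up that makes the rollback note obsolete.
log.record("user-prefs-migration",
           "Migration re-run successfully after the constraint was fixed.",
           date(2026, 5, 27))
removed = log.prune()
print(removed, log.handoff_notes())
```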

Semantic memory holds stable, reusable knowledge: the overall architecture of the system, the naming conventions, the approved dependency list, the service topology, the testing patterns. This is the knowledge that does not change between sessions and does not become more accurate by being repeated. It lives in a maintained document or index. In a Claude Code workflow, it lives in a well-maintained CLAUDE.md and the project documentation the agent is pointed at. In a multi-agent enterprise deployment, it lives in a shared knowledge base with versioning and an explicit update policy.

The placement test is temporal: if the information changes or expires within a session, it is working memory. If it is session-specific but relevant for future handoffs, it is episodic. If it is stable and reusable across all sessions in a project or organization, it is semantic.
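The placement test reduces to two questions asked at write time, which a small function can make explicit. The function name `place` and its two boolean parameters are assumptions made for this sketch; in practice the answers would come from whoever, or whatever, is writing the record.

```python
def place(expires_within_session: bool, stable_across_sessions: bool) -> str:
    """Apply the temporal placement test and return the tier name."""
    if expires_within_session:
        return "working"   # changes or expires before the session ends
    if stable_across_sessions:
        return "semantic"  # stable and reusable across all sessions
    return "episodic"      # session-specific, but relevant to future handoffs


# The partial plan for the current edit
print(place(expires_within_session=True, stable_across_sessions=False))   # working
# "This migration was rolled back on 2026-05-20"
print(place(expires_within_session=False, stable_across_sessions=False))  # episodic
# The project's naming conventions
print(place(expires_within_session=False, stable_across_sessions=True))   # semantic
```

Checking expiry first is a deliberate choice here: anything that will be stale by the end of the session stays in working memory, regardless of how it was described.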

Operational constraints that determine what is actually feasible

Memory architecture does not exist in isolation from compliance, cost, and latency requirements.

Compliance and data residency are the hardest constraint in regulated industries. Persisting session content, conversation history, and reasoning traces across sessions means that content is stored somewhere and is subject to the data classification and retention policies that apply to its most sensitive element. Teams building agents in healthcare, finance, or public sector contexts regularly discover that the episodic memory they planned to build runs directly into a retention policy that prohibits keeping conversation history beyond the session boundary. The architecture has to accommodate the policy. Treating compliance as a refinement to add post-deployment is how teams end up rebuilding their persistence layer under deadline pressure.
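A retention policy can be enforced as a gate in front of the persistence layer instead of being audited afterwards. The sketch below assumes a made-up scheme with three classification labels and per-label retention limits; the labels, the day counts, and the function names are hypothetical, and real values would come from the applicable policy.

```python
from dataclasses import dataclass
from typing import FrozenSet

# Hypothetical labels and limits. 0 means "may not outlive the session".
MAX_RETENTION_DAYS = {"public": 365, "internal": 30, "restricted": 0}


@dataclass(frozen=True)
class Record:
    text: str
    classifications: FrozenSet[str]  # labels of every element in the record


def effective_retention_days(record: Record) -> int:
    """The most sensitive element governs the whole record.

    Unknown labels and unclassified records are treated as non-persistable.
    """
    return min(
        (MAX_RETENTION_DAYS.get(label, 0) for label in record.classifications),
        default=0,
    )


def may_persist(record: Record) -> bool:
    """Gate called before any write to episodic or semantic memory."""
    return effective_retention_days(record) > 0


summary = Record("Session summary quoting a customer message",
                 frozenset({"internal", "restricted"}))
print(may_persist(summary))
```

Defaulting unknown or missing labels to "do not persist" is the conservative choice. A team could reasonably pick a different default, but it should be an explicit decision.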

Retrieval latency becomes real quickly at production volume. A semantic memory retrieval that adds 800 milliseconds to every agent step is manageable in a developer demo and unacceptable in an automated pipeline chaining twenty steps. The latency budget per step is finite, and aggressive retrieval at every decision point compounds across long workflows. The right answer is not to retrieve less but to be explicit about which steps require retrieval and which can run on what is already in context.
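The compounding effect is simple arithmetic, and writing it down makes the budget conversation concrete. The sketch below uses the 800 ms and twenty-step figures from the paragraph above; which steps actually need retrieval is a hypothetical plan chosen only for illustration.

```python
from typing import List

RETRIEVAL_MS = 800  # per-step retrieval cost from the example above
STEPS = 20          # length of the chained pipeline


def retrieval_overhead_ms(needs_retrieval: List[bool], cost_ms: int = RETRIEVAL_MS) -> int:
    """Total retrieval latency added across a workflow."""
    return sum(cost_ms for needed in needs_retrieval if needed)


# Retrieve at every decision point.
retrieve_everywhere = [True] * STEPS

# Hypothetical plan: only steps 0, 7 and 15 need a fresh semantic lookup;
# the other steps run on what is already in context.
selective = [step in {0, 7, 15} for step in range(STEPS)]

print(retrieval_overhead_ms(retrieve_everywhere))  # 16000 ms
print(retrieval_overhead_ms(selective))            # 2400 ms
```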

Multi-agent write conflicts are the failure mode that surprises teams most in late-stage production deployments. In a multi-agent pipeline where several agents write to the same episodic or semantic store, you get the same class of problems as a shared mutable database without transactions: stale reads, concurrent writes that produce inconsistent state, and one agent’s corrections overwriting another agent’s current working assumption. Read-your-writes guarantees and explicit write ownership per tier are not optional once more than one agent writes to shared memory.
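Write ownership and stale-write detection can be sketched with a version number per entry, in the style of optimistic concurrency control. This is an in-process illustration under stated assumptions, not a production store: the `SharedMemory` class, the exception names, the agent names, and the note text are hypothetical, and a real deployment would rely on the guarantees of its underlying database.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


class StaleWrite(Exception):
    """The writer's view of the entry is out of date."""


class NotOwner(Exception):
    """An agent tried to write to a key owned by another agent."""


@dataclass
class Entry:
    value: str
    version: int
    owner: str


class SharedMemory:
    def __init__(self) -> None:
        self._entries: Dict[str, Entry] = {}

    def read(self, key: str) -> Tuple[str, int]:
        """Return (value, version); version 0 means the key does not exist."""
        entry = self._entries.get(key)
        return (entry.value, entry.version) if entry else ("", 0)

    def write(self, key: str, value: str, agent: str, expected_version: int) -> int:
        entry = self._entries.get(key)
        if entry is None:
            if expected_version != 0:
                raise StaleWrite(key)
            self._entries[key] = Entry(value, 1, agent)  # first writer becomes owner
            return 1
        if entry.owner != agent:
            raise NotOwner(f"{key} is owned by {entry.owner}")
        if entry.version != expected_version:
            raise StaleWrite(f"{key} is at v{entry.version}, writer saw v{expected_version}")
        entry.value, entry.version = value, entry.version + 1
        return entry.version


memory = SharedMemory()
v1 = memory.write("release-tag", "RELEASE_TAG must be set",
                  agent="deployer", expected_version=0)
_, seen = memory.read("release-tag")
v2 = memory.write("release-tag", "RELEASE_TAG must be set before deploy",
                  agent="deployer", expected_version=seen)
print(v1, v2, memory.read("release-tag"))
```

A second write that still carries `expected_version=1` raises `StaleWrite`, and a write from any agent other than `deployer` raises `NotOwner`. The class is not thread-safe as written; it only shows the checks a shared store needs to perform.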

The production architecture that holds up under these constraints has three properties: writes are deliberate and schema-qualified, not append-everything; retrieval is tier-specific and filtered for relevance at the tier level; and each tier has an explicit owner and a defined pruning policy. Teams that get this right designed the memory architecture before deploying the agent. Teams that did not are retrofitting it after the first production incident, which is the more expensive way to arrive at the same destination.

Context windows will keep getting larger. That trend does not eliminate the need for a memory architecture — it changes the trade-offs at the working memory tier while leaving episodic and semantic tiers just as necessary. Anthropic’s data puts the current situation in useful relief: developers use AI in roughly 60% of their work, but report being able to fully delegate only 0 to 20% of tasks. The ceiling on that delegation is not model capability; the models are capable enough. It is design. Agents that compound value across sessions rather than resetting on every run are the ones whose working, episodic, and semantic memory tiers were designed explicitly, not discovered through production failures.

If your agents hold up well in single-session workflows and become unreliable across multi-session tasks, we have helped teams work through the memory architecture decision and are glad to think through yours.

AI Architecture Platform Engineering