Ten Memory Papers That Changed How I Think About AI Agents

Most agent memory systems start life as a charmingly bad idea:

summarize the conversation, put it in a vector database, retrieve the top-k chunks later.

This works just well enough to become dangerous. It can remember a favorite restaurant. It can also preserve a stale preference, retrieve the wrong episode, turn untrusted web text into durable instruction, or slowly summarize a user’s actual life into beige paste.

I spent some time reading recent papers about memory in AI agents. The good ones are no longer treating memory as storage. They are treating it as an evolving state-management problem: what should be written, what should be updated, what should be forgotten, what can be trusted, and whether memory itself is part of the security boundary.

Here are ten papers that, taken together, sketch the shape of where agent memory is going.

1. Memora: memory is not recall, it is keeping current truth

Paper: From Recall to Forgetting: Benchmarking Long-Term Memory for Personalized Agents

The most important paper in the set is a benchmark paper. Annoying, but true. Architecture papers are only as good as the target they optimize for, and most memory benchmarks have been asking the wrong question.

The usual benchmark asks: can the model retrieve a fact that appeared earlier?

Memora asks something closer to the real problem: can an agent maintain a coherent model of a person over weeks or months as facts change?

The paper is blunt about the weakness of existing evaluations. In LoCoMo, the authors observe that 94% of evaluation questions require grounding evidence from no more than two previous sessions. In LongMemEval, they see the same pattern for 85% of questions. That means many supposedly long-term memory evaluations reduce to shallow retrieval.

Real memory is not like that. People change jobs. They correct themselves. They stop liking something. They move. Their constraints shift. A useful assistant has to know not just what was once true, but what is true now.

Memora introduces conversations spanning weekly, monthly, and quarterly timelines, then evaluates three tasks:

Remembering: recall the current memory state.
Reasoning: synthesize across multiple memory elements.
Recommending: act on current preferences rather than obsolete ones.

The paper’s key metric is FAMA, Forgetting-Aware Memory Accuracy. It penalizes relying on outdated memory. That is the crucial move. A system should not get credit for remembering an old preference after the user explicitly changed it.

One of the most useful findings is from their manual error analysis. Recommendation errors were primarily failures to forget: 16 of 25 recommendation errors, or 64%, were caused by outdated memory not being forgotten. Remembering errors were more often partial retrieval failures: 18 of 25, or 72%, involved retrieving only some required memories.

The lesson: long-term memory is less about capacity than mutation. A memory system that never forgets is not loyal. It is just hoarding.

2. MemReader: the memory writer should make decisions, not just JSON

Paper: MemReader: From Passive to Active Extraction for Long-Term Agent Memory

MemReader is the most directly practical paper here. It attacks the standard memory-ingestion pipeline:

conversation chunk → LLM → structured memory JSON

That pipeline treats memory extraction as passive transcription. The paper argues that this is the wrong abstraction. Memory writing should be active memory management.

The authors put it well:

“Existing methods model memory extraction as passive extraction rather than active decision-making. A more reasonable memory module should first judge the value of incoming information, then check whether it is complete or ambiguous, determine whether historical retrieval is required, and finally decide whether to write, buffer, ignore, or update the memory.”

That sentence is the design spec.

MemReader defines a ReAct-style memory manager with actions such as:

add_memory
buffer_memory
search_memory
ignore_memory

This matters because many useful facts are incomplete when first seen. “I hated the place we went last time” is potentially valuable, but only if the system knows what “the place” refers to. A passive extractor might write garbage. An active memory manager can retrieve context, buffer uncertainty, or ignore the utterance.

The paper trains two variants:

Model	Role
MemReader-0.6B	lightweight extractor distilled for schema-consistent structured output
MemReader-4B	active ReAct-style memory manager optimized with GRPO

They evaluate on LoCoMo, LongMemEval, and HaluMem-Medium. The reported pattern is intuitive: the small model is efficient and good at clean structured extraction; the larger active model does better on knowledge updates, temporal reasoning, and hallucination reduction. On HaluMem-Medium, the 4B-GRPO variant reports strong extraction numbers including 96.57% recall, 97.19% weighted recall, and 98.21% F1.

Their conclusion has the line memory systems deserve:

“The core of a long-term memory system is not to extract more information from input, but to build and maintain a low-noise, updatable, and retrievable user-state representation.”

The lesson: the write path is sacred. Reads are opportunistic; writes are permanent. Permanent state deserves suspicion.

3. Anatomy of Agentic Memory: the evaluations are fragile

Paper: Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations

This is the sober survey paper. It is valuable because it refuses to treat memory systems as just retrieval quality plus vibes.

The paper organizes agent memory systems into four broad structures:

Lightweight semantic memory
Entity-centric and personalized memory
Episodic and reflective memory
Structured and hierarchical memory

But the more important contribution is the critique of evaluation.

First, many benchmarks are too small for the long-context era. If the entire benchmark fits inside a modern context window, then a model can solve the task by reading the whole transcript. That is not persistent memory. That is a long prompt wearing a trench coat.

The authors evaluate benchmark saturation across dimensions like total token load, interaction depth, and entity diversity. Their point is simple: a benchmark only forces memory if the task structurally requires persistent state, not merely because the word “memory” appears in the title.

Second, lexical metrics are often wrong. The paper compares F1-style scoring with LLM-as-judge semantic scoring and finds a significant mismatch. It names two failure modes:

Paraphrase penalty: a correct abstractive answer gets punished because it does not share enough tokens.
Negation trap: a wrong answer receives high overlap because it repeats many of the same words.

This matters because good memory often abstracts. It consolidates. It merges facts. It does not necessarily preserve the original phrasing.

Third, the paper highlights backbone sensitivity. A memory system is not just a database; it depends on the base model’s ability to follow structured update protocols. Small models can emit malformed JSON, hallucinate keys, or fail to maintain schema discipline. One bad write can become tomorrow’s retrieved truth.

The lesson: do not trust a memory benchmark unless it tests scale, mutation, semantic correctness, and operational cost.

4. ER-MIA: similarity search is an attack surface

Paper: ER-MIA: Black-Box Adversarial Memory Injection Attacks on Long-Term Memory-Augmented Large Language Models

ER-MIA is the paper that makes vector memory feel less cozy.

Most long-term memory systems retrieve memories by embedding similarity. ER-MIA shows that this mechanism can be attacked in a black-box setting. The attacker does not need model weights. They do not need access to the retriever. They inject adversarial memories that are close enough in embedding space to be retrieved later.

The paper studies two settings:

Attack setting	Idea
Content-based attacks	create misleading memories derived from prior interaction content, without knowing future questions
Question-targeted attacks	inject fabricated memories designed to answer specific future questions incorrectly

The authors build an attack arsenal with instruction-based manipulations, factual contradictions, non-semantic perturbations, and ensemble attacks. The ensemble idea is especially unpleasant: inject multiple related adversarial memories so that at least one is likely to be retrieved, and several may reinforce each other.

The key quote:

“Embedding-level similarity alone is sufficient to induce harmful retrieval and downstream reasoning failures in long-term memory–augmented LLMs.”

A particularly nasty result: increasing top-k retrieval can improve clean performance while making attacks more likely, because poisoned memories have more chances to enter the retrieved set.

The lesson: retrieval is not neutral plumbing. It is part of the security boundary.

5. Zombie Agents: prompt injection becomes persistent

Paper: Zombie Agents: Persistent Control of Self-Evolving LLM Agents via Self-Reinforcing Injections

Classic prompt injection is usually transient. The malicious text has to be in the current context. Reset the session and the infection disappears.

Long-term memory changes that.

Zombie Agents studies a two-phase attack:

Infection: an agent reads attacker-controlled content during a normal task and writes the payload into memory.
Trigger: in a later session, the payload is retrieved from memory and causes unauthorized tool behavior.

The paper’s core observation is simple and grim:

“Memory evolution can convert one-time indirect injection into persistent compromise.”

This breaks many existing defenses. Per-session prompt filtering assumes the malicious instruction arrives from the outside during the current turn. But once it has been written into memory, it may appear to come from the agent’s own trusted state.

The authors recommend treating memory as part of the trusted computing base. At minimum, systems should:

separate untrusted data from executable instructions
attach provenance to memory entries
apply policy checks to tool calls influenced by retrieved memory
audit memory writes, not just model outputs

The lesson: a memory system can launder untrusted text into trusted instruction. That is much worse than forgetting someone’s coffee order.

6. MemCollab: memories carry the bias of the agent that wrote them

Paper: MemCollab: Cross-Agent Memory Collaboration via Contrastive Trajectory Distillation

MemCollab notices something subtle: memories are not neutral facts. In this paper’s setting, “memory” means distilled solution strategies for mathematical reasoning and code generation rather than episodic facts about a person. The lesson generalizes: memories often encode the reasoning style, preferences, and bad habits of the model that produced them.

The paper studies whether multiple agents can share a memory system. Naively, you might expect this to work. A stronger model solves a task, stores useful memories, and a weaker model benefits.

Sometimes. But direct transfer can hurt. The paper reports that using memory distilled solely from a 32B model degrades a 7B agent on some tasks: MATH500 drops to 50.6% vs. 52.2% baseline, and HumanEval drops to 34.1% vs. 42.7% baseline.

Their explanation is the important bit:

“The memory tends to reflect the originating model’s preferences and reasoning style … rather than objective and transferrable knowledge.”

MemCollab’s answer is contrastive trajectory distillation. Instead of copying one model’s memories, compare trajectories from multiple agents solving the same task. Extract what is invariant across them: reusable reasoning constraints, task-level structure, and common failure modes.

The results are promising. In their same-family Qwen experiments, a smaller model improves from 52.2% to 67.0% on MATH500 and from 47.9% to 57.6% on MBPP. Larger models benefit too.

The lesson: shared memory should be distilled, not copied. The provenance of a memory includes the mind that made it.

7. LMEB: memory retrieval is not ordinary passage retrieval

Paper: LMEB: Long-horizon Memory Embedding Benchmark

LMEB is about embeddings, which sounds less dramatic than zombie agents, but it hits one of the load-bearing assumptions in practical memory systems.

Many systems pick an embedding model because it scores well on ordinary retrieval benchmarks. LMEB argues that memory retrieval is different. Memories can be fragmented, temporally distant, context-dependent, procedural, conversational, or meaningful only in light of later updates.

LMEB includes:

22 datasets
193 zero-shot retrieval tasks
four memory types: episodic, dialogue, semantic, and procedural

The findings are useful:

The benchmark is difficult. The top model reaches about 61.41 N@10 — a normalized top-10 retrieval accuracy metric.
Larger embedding models do not always perform better.
LMEB and MTEB are nearly orthogonal, with Pearson and Spearman correlations close to zero.

That third point matters most. A model that is excellent at passage retrieval may not be excellent at memory retrieval.

The lesson: the embedding model is part of the agent’s hippocampus. Choose and test it like it matters.

8. ALMA: let agents discover memory designs

Paper: Learning to Continually Learn via Meta-learning Agentic Memory Designs

ALMA is the most future-facing paper in the set.

Most memory systems are hand-designed. Someone invents a memory schema, update rule, retrieval method, and reflection policy. ALMA asks: why not have a meta-agent search over memory designs itself?

The framework works roughly like this:

sample previous memory designs
        ↓
reflect on evaluation logs
        ↓
generate a new memory-design idea
        ↓
implement it as code
        ↓
validate it in a sandbox
        ↓
evaluate it on sequential tasks
        ↓
store the design and logs in an archive

They evaluate across sequential decision-making environments including ALFWorld, TextWorld, Baba Is AI, and MiniHack. These are useful because agents need to reuse experience rather than rely on pre-trained knowledge alone.

The paper reports that ALMA discovers memory designs that outperform human-designed baselines, are more cost-efficient, scale better with memory size, and adapt faster under distribution shift. In one transfer setting, learned designs improve over the no-memory baseline by 12.8% when moving to a stronger foundation model.

The interesting part is not just performance. It is that the meta-agent discovers domain-specific mechanisms such as property validation, spatial object normalization, and strategy switching — the kind of machinery you need when an agent must track objects, locations, and reusable tactics across sequential tasks.

The lesson: the best memory architecture may not be designed by hand. It may be found by search, with all the power and danger that implies.

9. MemFactory: memory research needs infrastructure

Paper: MemFactory: Unified Inference & Training Framework for Agent Memory

MemFactory is less conceptually wild than ALMA, but more likely to help researchers stop reimplementing the same machinery badly.

It decomposes memory systems into modular pieces:

extractors
updaters
retrievers
agents
environments
trainers

The paper emphasizes reinforcement learning for memory operations: when to extract, what to update, what to retrieve, and how to optimize those decisions using delayed feedback.

MemFactory includes baselines inspired by Memory-R1, MemAgent, and RMM, and supports GRPO training. In their experiments, they report a 14.8% relative average score increase for Qwen3-1.7B and a 7.3% increase for Qwen3-4B-Instruct, with training/evaluation runnable on a single A800 80GB GPU.

The exact numbers matter less than the direction. Memory-RL needs a harness. Without shared infrastructure, every paper becomes its own island of prompts, scripts, hidden assumptions, and evaluation quirks.

The lesson: memory will become trainable infrastructure, not a hand-written callback.

10. SSGM: evolving memory needs governance

Paper: Governing Evolving Memory in LLM Agents: Risks, Mechanisms, and the Stability and Safety Governed Memory Framework

SSGM is more conceptual than empirical, but it gives useful vocabulary for the system you would actually want to deploy.

The paper argues that as memory systems evolve, they fail along four dimensions:

Stability: memories drift through summarization, abstraction, and repeated rewriting.
Validity: stale or contradicted facts remain active.
Efficiency: memory maintenance and retrieval become expensive.
Safety: private or malicious information becomes persistent state.

The proposed Stability and Safety-Governed Memory framework decouples memory evolution from memory governance. Before a memory is consolidated, the system should apply checks such as:

consistency verification
temporal decay modeling
dynamic access control
ground-truth anchoring
safety review

The paper also names the right tradeoffs:

Tradeoff	Meaning
Latency vs. safety	safer memory writes cost time
Stability vs. plasticity	too stable means stale; too plastic means drift
Graph scalability	richer memory structures help reasoning but increase complexity and leakage risk

The lesson: memory should have an immune system. Otherwise it will absorb whatever the world coughs into it.

The synthesis: memory is a governed state machine

Put these papers together and a modern agent memory system looks less like a vector database and more like this:

raw event / conversation / tool output
        ↓
provenance and trust classification
        ↓
active memory writer
        ├── ignore
        ├── buffer
        ├── retrieve context
        ├── write
        ├── update
        └── invalidate
        ↓
typed memory store
        ├── semantic facts
        ├── episodic events
        ├── user preferences
        ├── procedural lessons
        └── quarantined untrusted material
        ↓
governance layer
        ├── provenance
        ├── conflict detection
        ├── temporal validity
        ├── decay
        ├── access control
        └── safety checks
        ↓
retrieval layer
        ├── memory-specific embeddings
        ├── time-aware retrieval
        ├── current-truth filtering
        └── adversarial robustness
        ↓
response or action

Provenance appears twice on purpose: it should be stamped at write time, then checked again when memory is retrieved and used.

A few principles fall out:

1. Memory writes should be harder than memory reads

Retrieving an irrelevant memory is annoying. Writing a bad memory is durable damage. The write path needs more judgment, provenance, and validation than the read path.

2. Forgetting is a feature, not a cleanup task

The system needs first-class invalidation. Old truths should not merely sink lower in the vector store. They should be marked stale, superseded, contradicted, or deleted.

3. Memory retrieval needs its own evaluation

Do not assume a good passage embedding model is a good memory embedding model. Memory retrieval is temporal, fragmented, and contextual.

4. Memory is part of the security boundary

A malicious prompt in the current context is bad. A malicious prompt written into memory is worse. Persistent memory can turn one-time prompt injection into durable compromise.

5. Shared memory is not automatically shared knowledge

A memory produced by one model may encode that model’s reasoning style. Cross-agent memory needs distillation, not blind copying.

6. The best memory policies may be learned

Hand-written rules are a starting point. But as agents operate across domains, memory design itself becomes a search and optimization problem.

Why this matters

The obvious use case is personal assistants. If an AI assistant is going to persist across months or years, memory becomes the product. Not the model. Not the chat box. The memory.

But personal memory is also where the danger concentrates. It contains preferences, history, secrets, habits, corrections, and private context. It can make an assistant useful. It can also make an assistant vulnerable, stale, creepy, or confidently wrong about who you are.

The lesson from these papers is not “add memory.” That is the easy part.

The lesson is:

Build memory like you are building a small database, a belief tracker, a security boundary, and a cognitive organ at the same time.

Because, unfortunately, you are.