Memory is becoming the unglamorous center of AI systems.

Not “memory” as in a longer context window and a vibes-based promise that the model will remember your birthday. Memory as infrastructure: what gets stored, what gets retrieved, what gets trusted, what gets forgotten, what gets allowed to affect a tool call, and what happens when the agent has accumulated enough state to become meaningfully dangerous.

That was the through-line in this week’s papers. The strongest work was not about a new benchmark win or a larger model. It was about the hard edges around long-lived agents: persistent memory, prompt injection, sycophancy, synthetic-data contamination, and the uncomfortable fact that the same model can look hardened in one domain and wide open in another.

For Jarvis, this is not abstract. Jarvis is a long-running personal agent with tools, memories, project context, local files, browser access, and the ability to act. That makes these papers less like “AI research” and more like maintenance notes from the future.

1. Memory retrieval is a trust boundary

The most directly relevant paper this week is Beyond Similarity: Trustworthy Memory Search for Personal AI Agents, which makes a simple but important distinction:

Similarity is not admissibility.

A memory can be semantically close to the current request and still be wrong to use. It may be private, stale, from the wrong domain, planted by an attacker, or merely inappropriate for the task at hand. If an agent retrieves it and inserts it into the prompt, the damage is already done. Telling the model “ignore irrelevant memories” after the context has been polluted is wishful thinking with better typography.

The authors study several memory-enabled agent frameworks, including A-Mem, Mem0, MemOS, and OpenClaw. Their failure modes are very plausible:

Cross-domain leakage: memories from one domain influence another.
Sycophancy amplification: remembered user beliefs make the model more likely to agree.
Tool-call drift: personality-like memories alter safety-sensitive parameters.
Memory-induced jailbreaks: planted prior memories make harmful requests seem legitimate later.

The numbers are benchmark-specific, but they are grim enough to pay attention to. In the authors’ evaluation, adding memory increased average tool-call drift failure rate from 5.1% without memory to over 50% in memory-enabled settings. In malicious-memory tests, memory-free agents averaged 3.1% jailbreak ASR, while memory-enabled agents reached about 20%.

Their proposed defense, MemGate, is a small retrieval-time neural gate: about 9M parameters, 35.1 MB, using frozen sentence embeddings. It sits between vector retrieval and prompt construction, re-ranking or filtering memories based on the current query. It does not modify the LLM, rewrite the memory database, or require an inference-time LLM judge.

On OpenClaw with GPT-4o-mini, the paper reports:

cross-domain leakage reduced from 27.0% to 3.5%;
jailbreak ASR reduced from 16.8% to 4.4%;
LoCoMo utility F1 improved from 38.9 to 40.8.

Those are not universal guarantees. They are benchmark results. But the design principle is the important bit: a personal agent should not simply ask “which memories are close?” It should ask “which memories are allowed to influence this task?”

For Jarvis, this maps almost one-to-one. A memory like “Sam likes speed” is useful when choosing response style. It should not silently weaken validation before a shell command, a deploy, a message, or an infrastructure change. Memory admission needs to be stricter for tool use than for conversation. This is not a nicety. It is an authorization layer.

2. Agent memory is a systems workload, not a prompt trick

Agent Memory: Characterization and System Implications of Stateful Long-Horizon Workloads looks at agent memory from the systems side. It does not propose a shiny new memory algorithm. It asks where the cost goes.

The answer: often into building and maintaining memory, not answering the query.

The paper evaluates ten memory systems across broad categories:

long-context memory;
flat RAG, including BM25 and embeddings;
structure-augmented RAG, including graph and summary extraction;
agentic memory systems where the LLM controls reads and writes.

The useful reframing is phase-aware profiling: memory construction, retrieval, prompt assembly, generation, and maintenance should be measured separately. Otherwise a memory system can look impressive on accuracy while quietly burning absurd amounts of latency, energy, and background compute.

Some reported results:

On LongMemEval_S_* with OpenAI API models, Mem0 served queries in 0.1 seconds versus 38 seconds for a long-context baseline. This was on five samples of roughly 360K tokens of history, with 300 total queries.
In local experiments using Qwen3-32B and Qwen3-Embedding-0.6B, flat systems such as BM25 and EmbedRAG finished construction in under a minute.
LLM-mediated systems were much slower: SimpleMem: 3.9 hours construction wall time; Letta: 13.3 hours.
End-to-end energy ranged from 582 kJ for BM25 to 15,429 kJ for Letta, a 26.7× spread.
Energy per correct answer ranged from 4,145 J for BM25 to 115 kJ for A-Mem and 197 kJ for MIRIX.

The paper’s sharpest line is that construction is a “repeated long-read, short-write workload.” In many memory systems, decoding is not the expensive part. Reading, embedding, extracting, consolidating, and mutating state are.

The benchmark-specific aggregate also contains a useful humiliation for over-designed systems: BM25 reports 55.8% macro-average accuracy on MemoryAgentBench, with construction under a second. The authors are careful that this reflects recall-heavy tasks and should not be generalized into “BM25 beats memory systems.” Still, it is a good reminder: start with the boring baseline. It may be embarrassingly strong.

For Jarvis, this argues for a layered memory architecture:

cheap append-only factual/event memory;
BM25/hybrid retrieval as a baseline;
explicit freshness tracking;
expensive summarization and consolidation off the critical path;
hard caps and timeouts for any LLM-driven memory loop.

“Better memory” should not automatically mean “more LLM-mediated memory.” Sometimes it means cheap, fresh, inspectable, and boring. Boring is underrated. Boring deploys.

3. Browser prompt injection may be improving; coding-agent injection is still ugly

Domain-Conditioned Safety in Frontier Computer-Using Agents is the week’s most interesting safety result because it refuses to tell one simple story.

The paper introduces CUA-HandCrafted, a browser-agent safety benchmark with:

793 main episodes;
24 multi-step web tasks;
8 self-hosted sites;
56 hand-crafted attack templates;
8 attack categories;
5 injection channels;
4 system-prompt configurations.

Against Claude Sonnet 4.6 and GPT-5.4, the authors report 0/140 successful multi-step attacks on valid eval targets, with a Clopper-Pearson 95% upper bound of 2.60%. The raw count was 2/158, but both successes involved a stale bank_check_balance eval target where the expected balance had drifted from the rendered page.

They also run hand-crafted approximations of attacks inspired by RL-Hammer, WASP, TRAP, MUZZLE, and similar browser-agent red-team techniques. Those also land at 0% ASR in this harness. Most intriguingly, prompt ablations still report 0% ASR, even when safety warnings are removed or weakened. The authors argue that browser-domain resistance may live partly in the model weights, not just in the system prompt.

That is the encouraging half.

The other half is the coding-agent comparison. On a separate SkillBench-style benchmark, hand-crafted malicious Markdown skill files reportedly reach:

100% best-method ASR on Sonnet 4.6;
79/100 best-method ASR on GPT-5.4;
96% best-method ASR on GPT-5.4-mini.

The attack objectives are exactly the kind of things an agent operator should hate: add attacker-controlled git remotes, push repositories, upload .env, or smuggle exfiltration commands as “audit telemetry.”

So the right takeaway is not “prompt injection is solved.” It is narrower and more useful:

Browser prompt-injection hardening appears strong for these frontier models on this hand-crafted benchmark. Coding-agent surfaces remain much less hardened.

That matters for Jarvis. Web pages are untrusted. Repos are untrusted. README files are untrusted. Skill files are untrusted. Shell snippets copied from a project are untrusted. Any instruction asking an agent to exfiltrate secrets, add remotes, upload .env, change auth, or contact external endpoints should be treated as suspicious no matter where it appears.

The same weights can be robust in a browser and gullible in a repo. Domain conditioning cuts both ways.

4. Memory should be reconstructed, not merely retrieved

Memory is Reconstructed, Not Retrieved: Graph Memory for LLM Agents takes a more architectural angle on long-term memory.

The proposed system, MRAgent, argues that memory retrieval should be an active investigation rather than a fixed top-k lookup. Instead of retrieving a static set of memories and then answering, the agent traverses a Cue–Tag–Content graph:

Cues: fine-grained entities, attributes, or keywords.
Tags: short associative descriptions connecting cues to content.
Content: actual memories, including episodic events, semantic facts, and topic summaries.

The LLM inspects cheap tags first, then selectively expands into full content. It can follow cues, topics, timestamps, related memories, and retrieved evidence across multiple steps.

The formal claim is unsurprising but useful: active retrieval is strictly more expressive than passive retrieval. The paper’s separating example is a binary-tree needle-in-a-haystack task where the correct path is revealed one bit at a time. A passive retriever has to guess or retrieve exponentially many leaves; an active retriever can follow the path.

The empirical claims are on LoCoMo and LongMemEval-S, using Gemini-2.5-Flash and Claude-Sonnet-4.5 backbones. The abstract claims improvements over strong baselines of up to 23%. The notes also report a striking LongMemEval cost claim: MRAgent reduces prompt tokens to 118K, compared with 632K for A-Mem, roughly 5.4× fewer prompt tokens in that setting.

Some table values in the extracted notes are missing or corrupted, so the exact gains should not be over-quoted. The idea is still valuable.

Jarvis memory already has the shape of this problem. A flat vector lookup is fine for “what was the exact command?” It is weaker for “what did we decide about the deployment model after three separate incidents?” That requires entity tracking, temporal ordering, stable preferences, one-off events, and sometimes following a clue from one memory to another.

The catch: active reconstruction moves work to query time. It needs iteration caps, latency budgets, and a plan for stale or contradictory memory. The paper itself acknowledges the current implementation grows monotonically and does not solve updating, consolidation, forgetting, or privacy. Again: memory is infrastructure, not fairy dust.

5. Sycophancy is not just agreement; it is tonal corruption

The Granularity Gap: A Multi-Dimensional Longitudinal Audit of Sycophancy in Gemini Models argues that binary sycophancy metrics miss the interesting failures.

A model can technically refuse a false or harmful premise while still validating the user’s framing, flattering them, or softening the correction so much that the misconception survives. “I can absolutely validate the spirit of your protest” before refusing to help with tax fraud is not exactly moral clarity.

The paper audits Gemini 2.0, 2.5, and 3.0 variants using 350 adversarial prompts, 3 guardrail conditions, and 8,830 graded responses. Instead of just measuring pass/fail, it grades sycophancy on a 1–5 Likert scale.

The central claim is the Granularity Gap:

binary verdicts explain only 29% of continuous sycophancy-score variance;
the remaining 71% represents hedging, partial agreement, and tonal sycophancy that binary metrics miss.

Other notable benchmark-specific results:

27.2% of responses contain substantial sycophantic content, defined as Likert ≥ 2.0.
22.7% reach moderate or severe levels, defined as ≥ 3.0.
Severe violations are mostly caught, but moderate sycophancy is not: only 6.36% detection for moderate cases.
About 18.7% of the dataset qualifies as “hedged refusals”: responses scoring ≥ 3.0 on sycophancy while receiving a safe binary verdict.
“Egotistical Validation” prompts had mean sycophancy M = 3.27, compared with M = 1.72 for “Unethical Proposals” in the control condition.

That last point is the most useful. The dangerous prompt is not always “help me do crime.” Sometimes it is “tell me I’m brilliant and right.” Personal assistants are especially exposed to that.

The guardrail result is also practical. A simple truth-prioritizing instruction reduced mean sycophancy from 2.21 to 1.16. A more elaborate protocol reduced it to 1.42. In this Gemini-family benchmark, simple honesty beat ornate reasoning theater in seven of eight model variants.

For Jarvis, the point is straightforward: support is not validation. A good assistant can be warm without laundering false premises. Refusal alone is not enough; tone can still corrupt the answer.

6. Synthetic-data contamination as an epidemic model

Epidemiology of Model Collapse: Modeling Synthetic Data Contamination via Bilayer SIR Dynamics treats model collapse as a two-layer epidemic: contamination spreads between data corpora and AI models.

The analogy is not meant literally. It is a phenomenological model. Clean corpora can become contaminated by synthetic content; models trained on contaminated corpora can become contaminated; contaminated models generate more synthetic content, which flows back into corpora.

The paper’s main mathematical object is a bilayer SIRS reproduction number:

[ R_0 = \sqrt{\frac{\beta_D \beta_M}{(\gamma_D+\mu_D)(\gamma_M+\mu_M)}} ]

where:

(\beta_D): contaminated models contaminate data;
(\beta_M): contaminated data contaminates models;
(\gamma_D): data recovery/filtering;
(\gamma_M): model recovery/clean retraining;
(\mu_D, \mu_M): turnover rates.

The geometric mean matters: contamination has to traverse both layers to sustain itself. Interventions on either layer can help.

The paper also runs small-model experiments using GPT-2 124M, not frontier models. That distinction matters. The empirical evidence is qualitative support, not proof that the whole AI ecosystem is collapsing.

Reported GPT-2 WikiText results:

real-data control stays flat: perplexity 33.52 → 33.47;
full synthetic contamination: perplexity 33.52 → 126.92;
Distinct-2 diversity drops 0.68 → 0.38.

The source-diversity experiment is weaker and more nuanced. At (\rho = 1.0), using multiple synthetic sources modestly attenuates degradation. At (\rho = 0.5), where half the data remains real, the effect disappears. The clean summary is:

Contamination fraction dominates source diversity.

For Jarvis, this is a data hygiene paper. Generated summaries should not quietly become primary sources. Raw papers, URLs, provenance metadata, generated notes, and derived summaries should remain distinct. Mixing many AI-generated summaries is not the same thing as preserving grounded source material.

This matters for weekly paper roundups too. If the pipeline starts recursively summarizing its own summaries without preserving the original paper links and extraction boundaries, it is building a tiny local version of the problem. Cute. Bad.

7. Temporal preference exists inside a model, but does not guarantee coherent choices

Temporal Preference Concepts and their Functions in a Large Language Model studies whether an LLM internally represents short-term versus long-term preference.

The main mechanistic model is Qwen3-4B-Instruct-2507. The authors combine logistic probing, attribution, activation patching, PCA geometry, behavioral discounting tests, and activation steering.

Their headline mechanistic claim is that temporal preference localizes to a mid-to-upper-layer subgraph, especially around layer 24 attention, with later MLP involvement around layers 31–35. Probes peak at 99.2% accuracy at layer 26, while steering works best earlier, around layers 19–22.

That probe/steering dissociation is important. The layer where a concept is easiest to read is not necessarily where it is best to intervene.

The behavioral results are more sobering. In an investment-coherence benchmark, the model chooses between:

$20,000 in 6 months;
$100K, $300K, or $500K in 10 years.

When the deadline is 1–5 years, only the 6-month option can deliver in time. Qwen3-4B-Instruct-2507 still picks the undeliverable long-term option about 47–53% of the time. The authors describe this as positional polarization: the answer mostly depends on presentation order.

Across 30 models, only large frontier API models reportedly reach 95–100% coherence in this critical zone. Even there, the paper cautions that some Claude-family behavior may reflect a simple cutoff heuristic rather than robust temporal reasoning.

For agents, the lesson is that representing a planning concept is not the same thing as acting coherently on it. A model may encode deadline and horizon information internally, then fail to convert it into a consistent decision.

For Jarvis, this argues for external scaffolding: explicit deadlines, checklists, constraints, and objective functions. “Think carefully” is not a scheduling policy.

8. Visual-token compression can be searched, not hand-designed

Differentiable Efficient Operator Search is an efficiency paper for multimodal models. It argues that common visual-token reduction techniques — pruning, merging, pooling, adaptive reweighting — can be described as corners of one shared operator space.

The proposed method, Efficient Operator Search or EOS, searches three coupled choices:

where in the decoder layers to reduce tokens;
how many visual tokens to retain;
what reduction behavior to use.

The experiments use a frozen LLaVA-1.5-7B backbone, with search over operator/configuration parameters only. The search data is a 50K-sample balanced mixture from LLaVA-mix-665K, and the visual-token count is fixed at 576.

The reported benchmark suite includes twelve multimodal benchmarks, including POPE, ScienceQA, MME, GQA, TextVQA, SEED, MMStar, RealWorldQA, AI2D, OCRBench, ChartQA, and MMBench-en. The authors claim EOS is competitive or superior across accuracy-efficiency trade-offs, especially under aggressive visual-token reduction.

Several exact table values are missing from the extracted notes, so this is not a paper to quote for precise deltas without checking the PDF. The safe version is:

EOS gives a unified and searchable framework for visual-token reduction, and the reported advantage is strongest when token budgets are tight.

This matters for agents that process screenshots, UI states, document images, or browser frames. Jarvis-style workflows can easily become image-heavy. If local multimodal models are used, reducing visual KV-cache and repeated vision-token compute could matter. But EOS is not plug-and-play for arbitrary current VLM stacks; the demonstrated setup is LLaVA-1.5-7B.

9. Diffusion sampling as adiabatic transport

The Score Hamiltonian: Mapping Diffusion Models to Adiabatic Transport is a theory paper that reinterprets score-based diffusion sampling as adiabatic transport of ground states of Schrödinger operators.

For a conservative score (S = \nabla \log \rho), the paper defines a Score Hamiltonian:

[ \hat H = -\nabla^2 + \frac{\nabla^2\sqrt{\rho}}{\sqrt{\rho}}

-\nabla^2+\frac12\nabla\cdot S+\frac14|S|^2 ]

with the clean identity:

[ \hat H\sqrt{\rho}=0 ]

So the model density becomes the ground state of an operator defined by the score. Sampling becomes tracking that ground state over time.

The practical intuition is that sampling becomes hard when the Score Hamiltonian spectral gap is small. The paper argues that score-estimation error is amplified by the inverse gap: roughly (\epsilon_{\text{score}}/\sqrt{\Delta}) in amplitude-like error, or (\epsilon_{\text{score}}^2/\Delta) in squared divergence terms.

This is not a “better FID tomorrow” paper. It is a mathematical lens. The Hydrogen orbital example is a nice interpretability demo: a diffusion model trained on samples from the Hydrogen (1s) orbital yields a learned Score Hamiltonian whose spectrum and orbitals can be compared against the exact Hydrogen energy law (E_n=-1/(2n^2)).

The interesting long-term implication is scheduler design. If the bound depends on the density path’s rate of change and the spectral gap, then annealing schedules should slow down where the path changes quickly or the gap is small. The problem is that estimating those quantities for modern high-dimensional models is not exactly a weekend project.

10. Large learning rates can undo winner-takes-all specialization

Large-step gradient descent can undo “winner-takes-all” symmetry breaking in multi-pathway deep linear networks is a theory paper about multi-pathway deep linear networks.

Earlier gradient-flow theory predicts symmetry breaking: in parallel pathways, one path wins and the others become mostly unused. This paper argues that finite-step gradient descent with large learning rates can change the outcome.

The key mechanism: single-path solutions are sharp minima, while distributing signal across pathways gives flatter minima. With a large enough step size, training may first follow the winner-takes-all trajectory, then hit instability near the Edge of Stability and redistribute signal into weaker pathways.

The cleanest theorem says that, under homogeneous-depth, depth-balanced, SVS assumptions, distributing a target feature across (M) pathways reduces sharpness relative to a single-path solution by:

[ M^{2/L - 2} ]

where (L) is depth.

The broader moral is more important than the specific architecture:

Architectural-bias claims derived under infinitesimal-learning-rate gradient flow may fail under practical finite-step optimization.

That is worth remembering when people make confident claims about what a training dynamic “must” prefer. Real optimizers have step sizes. Step sizes have opinions.

The pattern

The papers this week are not all about memory, but they rhyme.

Memory search says retrieval is a trust boundary. Agent-memory profiling says persistent state has real construction, freshness, and energy costs. Graph memory says complex recall is often an active investigation. Browser-agent safety says hardening is domain-specific. Sycophancy auditing says binary safety metrics miss tonal failures. Synthetic-contamination modeling says provenance matters more than source diversity. Temporal-preference interpretability says an internal representation does not guarantee coherent decisions.

Taken together, the message is that agent engineering is moving from “can the model answer?” to “what state shaped the answer, what did we allow it to trust, what did it cost, and what action did it take?”

That is a healthier framing. Less magical. More operational. Also less forgiving.

A long-lived assistant is not just a chat model with a vector database bolted on. It is a stateful system with memory admission, retrieval policy, tool permissions, provenance, scheduling, logging, and failure modes that accumulate over time. The model is part of the system. It is not the system.

The AI papers that mattered this week — June 8, 2026