The AI papers that mattered this week — May 23, 2026

Memory is not storage. Planning is not clicking. Reasoning is not “think longer.”

That is the thread running through this week’s papers. The best ones are not trying to make agents more magical. They are trying to make them less wasteful: inject memory only when it helps, compile repeated web work instead of improvising, spend test-time compute only when the dynamics are aligned, and explore reality where a simulator is likely blind.

This is a useful turn. A lot of agent research still has the vibe of “add another loop and hope.” These papers are more disciplined. They ask where the bottleneck actually is: stale retrieval, LLM latency, token credit assignment, simulator reachability, Monte Carlo variance, hyperparameter brittleness, or expensive RL trajectories.

That is also why several of them matter directly to Jarvis. A personal agent is a little laboratory for these problems: persistent memory that can mislead, skills that should become procedures, browser tasks that should not be solved from scratch each time, and verification loops that need to stop before they become theatre.

1. Memory should learn when to shut up

Mem-π: Adaptive Memory through Learning When and What to Generate is the paper I would put at the top of the pile.

Most agent memory systems treat memory as retrieval: keep a bank of past experiences, retrieve the nearest few, paste them into context, hope the match is useful. That works until it doesn’t. Similar is not the same as relevant. Old traces can be stale, over-specific, or actively wrong for the current task.

Mem-π reframes memory as a policy. A separate memory model looks at the current task and decides either:

[GENERATE] a short task-specific hint; or
[ABSTAIN] and provide no memory.

That abstention bit is not cosmetic. It is the paper’s best idea. In real agents, irrelevant memory is not neutral; it drags the model into the wrong groove.

The system is trained in two stages. First, it distills hints from an offline experience bank. Then it refines the memory model with reinforcement learning using downstream agent success as the reward. The RL objective tries to separate two credit-assignment questions:

Was it useful to use memory at all?
If yes, was the generated memory content useful?

The authors use counterfactual rollouts: compare forced abstention against several generation branches. Decision tokens get decision-level credit; content tokens get content-level credit when generation was actually helpful.
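
The split between decision-level and content-level credit can be sketched in a few lines. This is a simplified reading of the description above, not the paper's objective: the function name, the use of a plain mean as the baseline, and the rule that content credit is zero unless generation beat abstention on average are all assumptions made for illustration.

```python
from statistics import mean

def counterfactual_credit(abstain_reward, generate_rewards):
    """Split credit between the decision to use memory and the hint content.

    abstain_reward: task reward from the forced-abstention rollout.
    generate_rewards: task rewards from rollouts that each used a sampled hint.
    """
    gen_mean = mean(generate_rewards)
    # Decision-level credit: did generating beat staying silent on average?
    decision_adv = gen_mean - abstain_reward
    # Content-level credit (assumed rule): compare each hint with its sibling
    # branches, but only when generation was helpful in the first place.
    if decision_adv > 0:
        content_adv = [r - gen_mean for r in generate_rewards]
    else:
        content_adv = [0.0 for _ in generate_rewards]
    return decision_adv, content_adv
```

With an abstention reward of 0 and branch rewards of 1, 0, 1, the decision gets positive credit and the one failing hint gets negative content credit. If abstention wins, the content tokens receive no signal at all.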

The reported results are strong, though still benchmark-bound. On WebArena, Stage 1 reaches 35.0% success, while full Mem-π reaches 43.1%, an 8.1 percentage-point gain attributable to the RL refinement stage. On WorkArena, it improves the base agent from 42.0% to 50.3%. On ALFWorld, it reaches 91.6%, a 6.3 point gain over the reported GPT-5.4-mini baseline. The authors also report that Mem-π uses 138 memory tokens per task on average, compared with 200 for Stage 1 and 225 for Memory-R1.

The abstention behavior is the most telling number. On WebArena’s easiest tasks, where the base agent succeeds 80–100% of the time, Mem-π abstains about 71% of the time. On the hardest tasks, abstention falls to around 13%.

That is exactly the shape you want.

For Jarvis, this is close to the center of the problem. Jarvis currently has explicit notes, recent memories, skills, and project-specific state. That is useful, but raw retrieval has the same failure mode: old context can be too loud. The near-term lesson is not “train a 7B memory model tomorrow.” It is simpler:

retrieve candidate memories;
synthesize the shortest task-specific hint;
preserve links back to sources;
allow “no useful memory found.”

Mem-π itself moves memory into model parameters, which creates an attribution problem. A personal assistant should not lose provenance. The better architecture is probably hybrid: explicit memories for auditability, plus a memory-policy layer that decides what, if anything, deserves to enter the working context.
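
A minimal sketch of that hybrid gate, assuming a retriever that already attaches a relevance score to each candidate. The `Memory` and `Hint` types, the `min_score` threshold, and the string-join "synthesis" are hypothetical placeholders; a real system would ask a model to write the hint and would tune the threshold against task outcomes.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Memory:
    source: str   # link back to the note or log the memory came from
    text: str
    score: float  # relevance score from a retriever (hypothetical, 0..1)

@dataclass
class Hint:
    text: str
    sources: list[str]

def memory_gate(candidates: list[Memory], min_score: float = 0.6,
                max_items: int = 2) -> Optional[Hint]:
    """Return a short hint with provenance, or None to abstain."""
    relevant = sorted(
        (m for m in candidates if m.score >= min_score),
        key=lambda m: m.score, reverse=True,
    )[:max_items]
    if not relevant:
        return None  # "no useful memory found" is a valid outcome
    # Placeholder synthesis: a real system would ask a model for a compact hint.
    text = " ".join(m.text for m in relevant)
    return Hint(text=text, sources=[m.source for m in relevant])
```

The important property is the return type: the caller has to handle `None`, so abstention is a first-class outcome, and every hint that does get through carries its sources.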

2. Web agents should compile repeated work

Agent JIT Compilation for Latency-Optimizing Web Agent Planning and Scheduling attacks a different waste pattern: web agents repeatedly asking an LLM what to click next.

That loop is painfully familiar:

observe page → ask model → click → observe page → ask model → type → observe page → ask model…

For unfamiliar websites, maybe that is unavoidable. For recurring tasks on known applications, it is absurd. A repeated web workflow should become a small program.

This paper’s JIT-Planner generates executable code plans that call cached higher-level tools, validates those plans against tool specifications, estimates their cost, and picks the cheapest valid one. Its tool protocol includes not just input/output schemas but state invariants: preconditions and postconditions over application state. A tool might require page_type = "store" and guarantee that the page remains in a valid post-action state.

That matters because many agent failures are not subtle reasoning failures. They are “you called the right tool in the wrong state.” Invariants catch that before execution.

The reported numbers are large, and should be read as specific to the authors’ setup. Across 37 tasks from 5 web applications, JIT-Planner reports a 10.4× speedup and a 28 percentage-point accuracy improvement over a Browser-Use-style sequential web agent. Mean latency in one comparison drops from 122.1s for Browser-Use to 11.7s for JIT-Planner. Browser-Use with cached tools improves to 80.1s, which is better but still stuck in the step-by-step LLM loop.

Cost-based selection also matters. Choosing the best-cost valid plan gives 11.7s mean latency; choosing the worst-cost valid plan gives 61.7s. Generating code is not enough. You need to choose between candidate programs.
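
Validation against invariants and cost-based selection fit together naturally. The sketch below is an illustration of the idea, not JIT-Planner's implementation: the `Tool` fields, the dictionary model of application state, and the additive cost model are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Tool:
    name: str
    pre: dict      # required state, e.g. {"page_type": "store"}
    post: dict     # state guaranteed after the call
    cost_s: float  # estimated latency in seconds (hypothetical)

def validate(plan, state):
    """Symbolically run a plan; return total cost, or None if a precondition fails."""
    state = dict(state)  # work on a copy so the caller's state is untouched
    total = 0.0
    for tool in plan:
        if any(state.get(k) != v for k, v in tool.pre.items()):
            return None  # right tool, wrong state: reject before execution
        state.update(tool.post)
        total += tool.cost_s
    return total

def cheapest_valid(plans, state):
    """Pick the lowest-cost plan among those that pass validation."""
    scored = [(validate(p, state), p) for p in plans]
    valid = [(c, p) for c, p in scored if c is not None]
    return min(valid, key=lambda cp: cp[0])[1] if valid else None
```

A plan that calls an add-to-cart tool from the home page is rejected without touching the browser, and among the plans that survive, the cheaper one wins.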

The paper also has a scheduler that chooses between serial execution, parallel execution, and hedged execution. On the three REAL applications used for scheduling, JIT-Scheduler with Gemini-2.5-Pro reports 109.9s latency and 86.4% accuracy, compared with OpenAI CUA at 258.7s and 77.8%. But it does not dominate every fixed strategy: for Gemini-2.5-Pro, hedge is faster at 98.4s, while parallel is slightly more accurate at 88.9%.

That nuance makes the result more credible, not less. The scheduler is not magic. It is an automatic trade-off mechanism.
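
Hedged execution is the least familiar of the three strategies, so here is a generic sketch of the pattern using `asyncio`. It is not the paper's scheduler: the function name, the fixed `hedge_after_s` delay, and the "first to finish wins" rule are assumptions.

```python
import asyncio

async def hedged(primary, backup, hedge_after_s):
    """Run primary; if it is still pending after hedge_after_s, race a backup."""
    first = asyncio.create_task(primary())
    done, _ = await asyncio.wait({first}, timeout=hedge_after_s)
    if done:
        return first.result()  # primary was fast enough, no hedge needed
    second = asyncio.create_task(backup())
    done, pending = await asyncio.wait(
        {first, second}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # stop paying for the loser
    return done.pop().result()
```

The trade-off is visible in the code: a short delay buys latency at the price of duplicate work, which is why a scheduler that picks the strategy per task can beat any fixed choice on some metrics and lose on others.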

For Jarvis, this paper is bluntly relevant. Skills like homelab, jobs, mail, browser automation, Discourse, widgets, and deployment workflows should not remain open-ended improvisations forever. Successful traces should harden into reusable procedures with:

preconditions;
postconditions;
health checks;
fallback paths;
cost estimates;
and versioning for brittle UI/API assumptions.

The paper’s offline setup cost is substantial: tool synthesis takes 25–90 minutes per application and scheduler trace collection 25–45 minutes, though the authors say parallelism can reduce both to roughly 20–30 minutes per app. So this is not for one-off browsing. It is for repeated work. Which is exactly where agents should stop pretending every task is new.

3. Sometimes the simulator is wrong where you never look

Mind the Sim-to-Real Gap & Think Like a Scientist is less flashy than the agent papers, but conceptually sharp.

The paper asks: if you have a simulator of a real-world sequential decision problem, when should you trust it, when should you update it passively, and when should you deliberately run costly experiments?

The key distinction is between two kinds of simulator error:

local error, in states your deployed policy already visits;
reachability error, in states your policy avoids.

Passive updating can eventually fix local errors. It cannot fix reachability errors, because you never collect data there. This is the sim-to-real version of a causal positivity problem: no support, no learning.

The proposed method, Fisher-SEP, uses the simulator not just to choose actions, but to choose experiments. It allocates real-world exploration to state-action pairs where reducing uncertainty would most reduce uncertainty about the value of a target policy.

The empirical evidence is constructed rather than deployed, and the paper is clear about that. The case studies are mechanism demonstrations.

In a vending-machine supply chain setup — the “local error” regime — passive updating works reasonably well. SOP degrades from about 75% of oracle cash at T=100 to 38% at T=1600. A-SOP and Thompson sampling reach about 70% of oracle at T=1600. Fisher-SEP-R pays an upfront exploration cost, then at T=1600 leads A-SOP by 5.1 percentage points with p = 0.020, and Thompson sampling by 4.8 points with p = 0.028.

In the HIV mobile-testing grid-world — the “reachability error” regime — the difference is bigger. Region B is under-surveilled, separated by a corridor, and underestimated by the simulator. A-SOP largely stays in Region A. Its corridor-crossing rate is below 2% over 400 days, and it plateaus at about 43% of oracle. Thompson sampling reaches about 68%. Fisher-SEP-T reaches about 78%, beating A-SOP by 35 percentage points at T=400 with p < 0.001, and Thompson sampling by 10 points with p = 0.017.

Again: stylized. Not a public-health deployment. But the lesson generalizes.

For agents, the right question is not “is my simulator accurate?” It is “where is my simulator likely wrong relative to the states my current policy actually visits?”

That applies to software agents too. If Jarvis only tests the paths its current automation already uses, it will never discover failure modes in avoided branches: alternate APIs, weird service states, neglected user workflows, stale browser sessions, rarely used cron jobs. Passive monitoring will look calm right up until reality punches through the wall.

The practical pattern is:

identify the policy’s current visitation support;
identify valuable regions outside that support;
run small, measurable probes there;
prioritize uncertainty that changes decisions, not uncertainty in general.

That is a better use of simulation than asking it to be an oracle.
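
The probe-selection pattern can be approximated with a very small heuristic. This is not Fisher-SEP: the scoring rule below (decision impact divided by visit count plus one) is a stand-in chosen for readability, and both input dictionaries are hypothetical.

```python
def rank_probes(visits, decision_value, budget):
    """Rank state-action pairs for deliberate real-world probes.

    visits: dict pair -> number of real observations collected so far.
    decision_value: dict pair -> how much the policy's estimated value would
        move if the simulator were wrong there (a hypothetical sensitivity score).
    budget: how many probes to return.
    """
    def score(pair):
        # Uncertainty shrinks roughly like 1/(n+1); weight it by decision impact.
        return decision_value[pair] / (visits.get(pair, 0) + 1)
    return sorted(decision_value, key=score, reverse=True)[:budget]
```

A heavily visited region scores low even if it matters a lot, because passive data already covers it. An unvisited corridor with moderate decision impact goes to the top of the list.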

4. Reasoning loops need attractors, not just more compute

Equilibrium Reasoners: Learning Attractors Enables Scalable Reasoning studies iterative neural reasoners: models that repeatedly update an internal latent state instead of producing an answer in one feedforward pass.

The authors frame these models as learned dynamical systems. A good reasoner should not merely iterate; it should move toward stable regions of state space that correspond to valid solutions. Extra test-time compute helps only if the dynamics are aligned. Otherwise you just converge harder into the wrong basin. Very philosophical. Also extremely practical.

EqR uses two lightweight training interventions:

randomize the initial latent state;
inject small noise along the iterative path.

At inference time it scales along two axes:

depth: more iterations for one trajectory;
breadth: multiple stochastic restarts.

The strongest result is on Sudoku-Extreme. A feedforward baseline gets 2.6% exact accuracy. A weight-tied iterative model gets 32.6%. Scaling depth with the right training/supervision schedule reaches 74.7%. Adding adaptive computation in the construction path reaches 84.8%. The abstract claims EqR reaches over 99% on Sudoku-Extreme by unrolling up to the equivalent of 40,000+ layers.

That last number needs careful accounting. The paper says one Sudoku outer iteration corresponds to 42 equivalent layer evaluations. A single trajectory with T = 1024 outer iterations is therefore 43,008 equivalent layer evaluations. The best breadth-scaled setting reportedly uses T = 1024 and K = 8 restarts, for 344,064 total equivalent layer evaluations across restarts. That is not a normal inference budget. It is a stress test of scaling behavior.
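
The accounting is plain multiplication, using only the figures quoted above:

```python
LAYERS_PER_OUTER_ITER = 42   # equivalent layer evaluations per Sudoku outer iteration
T = 1024                     # outer iterations in one trajectory
K = 8                        # stochastic restarts

per_trajectory = LAYERS_PER_OUTER_ITER * T
total = per_trajectory * K
print(per_trajectory, total)  # 43008 344064
```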

The paper also reports a seed stability check at 50k training steps: baseline exact accuracy 84.33%, 95% CI [83.59, 85.07]; EqR 86.18%, 95% CI [85.63, 86.72]. That is a modest same-budget gain, separate from the dramatic test-time scaling story.

The caveat is obvious: these are controlled structured-reasoning tasks such as Sudoku-Extreme, Maze-Unique, and Mini-ARC checks. This is not evidence that a chatbot becomes a theorem prover if it mutters internally for long enough.

Still, the conceptual point is valuable for agents. “Think longer” is not a strategy. More loops help when the update process is calibrated against success. Otherwise they become confidence amplification.

For Jarvis, the analogue is not a latent fixed-point residual. It is operational convergence:

independent attempts reaching the same answer;
tool results agreeing;
plans stabilizing for grounded reasons;
uncertainty shrinking because evidence accumulated, not because the model got tired.

Convergence is not correctness. But calibrated convergence can be a useful stopping signal.
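
One way to turn "independent attempts reaching the same answer" into a stopping signal is a simple agreement check. The thresholds here are arbitrary placeholders and would need calibrating against real outcomes; agreement alone says nothing about correctness.

```python
from collections import Counter

def converged_answer(answers, min_votes=3, min_share=0.6):
    """Return the majority answer if enough independent attempts agree, else None."""
    if not answers:
        return None
    answer, votes = Counter(answers).most_common(1)[0]
    if votes >= min_votes and votes / len(answers) >= min_share:
        return answer
    return None
```

Returning `None` means "keep going or escalate", which is the behavior you want when attempts disagree.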

5. RLVR may move along surprisingly simple weight paths

You Only Need Minimal RLVR Training: Extrapolating LLMs via Rank-1 Trajectories is one of those papers that is either pointing at a real structural simplification or at a narrow benchmark trick. Possibly both.

The paper studies reinforcement learning with verifiable rewards — RLVR — on math models. The claim is that useful RLVR weight changes often lie along a simple per-tensor rank-1 trajectory. Observe an early prefix of training, compute a rank-1 SVD direction per tensor, fit a line to the coefficient over time, and extrapolate a future checkpoint.

The method is called RELEX. It is deliberately boring:

save early checkpoints;
compute deltas from the base model;
run per-tensor rank-1 SVD over the observed trajectory;
fit a linear coefficient;
reconstruct an extrapolated future model.

No learned predictor. No nonlinear extrapolator. No extra RL.
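
The recipe above can be sketched for a single flattened tensor in pure Python. This is an illustration under assumptions, not the authors' code: it finds the rank-1 direction by power iteration instead of a library SVD, does not center the deltas, fits the coefficient with ordinary least squares, and assumes at least one non-zero delta. All function names are made up for the example.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def rank1_direction(deltas, iters=50):
    """Top right singular vector of the (checkpoints x params) delta matrix,
    found by power iteration on D^T D."""
    v = list(deltas[-1])  # start from the latest delta
    for _ in range(iters):
        coeffs = [dot(d, v) for d in deltas]                 # D v
        v = [sum(c * d[j] for c, d in zip(coeffs, deltas))   # D^T (D v)
             for j in range(len(v))]
        norm = math.sqrt(dot(v, v))
        v = [x / norm for x in v]
    return v

def extrapolate(base, checkpoints, steps, target_step):
    """Rank-1, linear-in-time extrapolation of one flattened tensor."""
    deltas = [[w - b for w, b in zip(ckpt, base)] for ckpt in checkpoints]
    v = rank1_direction(deltas)
    coeffs = [dot(d, v) for d in deltas]
    # Ordinary least-squares line: coeff ~ slope * step + intercept.
    n = len(steps)
    mean_t, mean_c = sum(steps) / n, sum(coeffs) / n
    slope = (sum((t - mean_t) * (c - mean_c) for t, c in zip(steps, coeffs))
             / sum((t - mean_t) ** 2 for t in steps))
    intercept = mean_c - slope * mean_t
    future = slope * target_step + intercept
    return [b + future * x for b, x in zip(base, v)]
```

In practice one would run this per tensor over real checkpoints with a numerical library. The point of the sketch is how little machinery the idea needs: one direction, one line fit, one multiplication.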

The main experiments use GRPO on MATH with three Qwen-family models: Qwen2.5-Math-1.5B, Qwen3-4B-Base, and Qwen3-8B-Base. Full RLVR is 500 optimization steps on 8× H200 GPUs.

The headline MATH numbers are:

Qwen2.5-Math-1.5B: full RLVR 71.5%, RELEX 71.6%;
Qwen3-4B-Base: full RLVR 85.5%, RELEX 85.6%;
Qwen3-8B-Base: full RLVR 88.5%, RELEX 87.4%.

The authors say RELEX can require as few as 15% of full RLVR training steps, though the body of the paper more often frames the method as using 15–20% of the trajectory. They also report examples like observing 50 steps and extrapolating to 1000 steps, or 20× beyond the observed prefix.

The denoising story is the most interesting part. The authors argue that the first rank-1 component captures the smooth task-relevant update, while later components mostly contain stochastic optimization noise. Increasing to rank-5 or rank-10 reportedly does not help.

The scope is narrow: math RLVR, Qwen-family models, GRPO, MATH training. It does not show that all RL fine-tuning is rank-1. It does suggest a practical experiment-tracking habit: save checkpoint trajectories, not just final weights.

For anyone running RLVR experiments, this is a cheap diagnostic to add. Watch the singular-value gap. Track whether early trajectories are stable. Try extrapolated candidates before burning the full run. If it works, excellent. If it fails, you learned that your task dynamics are less polite.

6. Token credit assignment beats rewarding every word equally

DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards attacks another RLVR weakness: answer-level rewards are smeared across all tokens.

If a math answer is correct, standard RLVR tends to push up the probability of every token in the response. But successful and failed responses share a lot of junk: formatting, boilerplate, repeated phrases, generic reasoning scaffolds. Rewarding all of it equally can dilute the sparse tokens that actually distinguish good trajectories from bad ones.

DelTA reweights token updates by how discriminative their gradient direction is between positive-advantage and negative-advantage responses. In plain English: it tries to learn more from the tokens that separate success from failure, and less from the tokens that appear in both.

The paper frames this as a local discriminator over token-gradient vectors. Full token gradients would be too expensive, so the implementation uses a layer-restricted LM-head gradient proxy to estimate coefficients, while the final weighted RLVR loss still updates the full model.
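
To make the reweighting idea concrete, here is a toy version that swaps in a much simpler discriminator than the paper describes. It takes the difference between the mean proxy vector of positive-advantage tokens and that of negative-advantage tokens, then weights each positive token by its projection on that direction. The proxy vectors, the mean-difference direction, the clipping at zero, and the normalization are all assumptions for illustration; the LM-head gradient proxy itself is not reproduced.

```python
import math

def column_means(rows):
    dim = len(rows[0])
    return [sum(r[j] for r in rows) / len(rows) for j in range(dim)]

def discriminative_weights(pos_tokens, neg_tokens):
    """Weight each positive-response token by how far its proxy vector points
    along the direction separating positive from negative responses.

    pos_tokens / neg_tokens: lists of per-token proxy vectors (lists of floats).
    """
    mu_pos, mu_neg = column_means(pos_tokens), column_means(neg_tokens)
    direction = [p - n for p, n in zip(mu_pos, mu_neg)]
    norm = math.sqrt(sum(x * x for x in direction)) or 1.0
    direction = [x / norm for x in direction]
    # Project, clip at zero, then normalize so the weights average to 1.
    raw = [max(0.0, sum(a * b for a, b in zip(tok, direction)))
           for tok in pos_tokens]
    total = sum(raw)
    if total == 0.0:
        return [1.0] * len(pos_tokens)  # nothing discriminative: stay uniform
    return [r * len(raw) / total for r in raw]
```

A token whose vector looks the same in successful and failed responses gets weight zero, and the token that differs absorbs the update.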

The main results are on seven contest-style math benchmarks using Qwen3-8B-Base and Qwen3-14B-Base trained on DeepMath-103K in VeRL. DelTA reportedly beats the strongest same-scale baseline by:

+3.26 average points on Qwen3-8B-Base;
+2.62 average points on Qwen3-14B-Base.

The authors also report code-generation transfer: across HumanEval+, MBPP+, and LiveCodeBench, DelTA improves the weighted average from 47.7 for DAPO to 49.5, a +1.8 point gain.

The caveats matter. The paper does not report independent multi-seed RL training runs. Its significance testing is based on repeated stochastic evaluations, using 16 evaluation scores per method and a one-sided Mann–Whitney test. That supports evaluation robustness more than training-run robustness.

Still, the diagnosis is broadly useful. For agent traces, many success and failure trajectories share boilerplate: JSON wrappers, tool-call syntax, “I’ll check,” generic plans, logging noise. If you train on whole-trajectory success without finer credit assignment, you may reward the wallpaper.

The general lesson is safe: token-level credit assignment matters. The unsafe version would be “DelTA is proven best for all RLHF or agent training.” It is not.

7. Variance reduction helps only when variance is the bottleneck

Variance Reduction for Expectations with Diffusion Teachers is a systems-method paper with an unusually useful negative result.

The setting: pipelines that use a frozen diffusion model as a teacher. Gradient estimates require sampling diffusion timesteps and noise. Those Monte Carlo estimates can be noisy and expensive.

The proposed framework, CARV, is compute-aware variance reduction. When some parts of the pipeline are expensive — rendering a 3D scene, running a generator, encoding video — and others are cheap — re-noising, sampling timesteps, denoising — reuse the expensive computation and spend extra samples on the cheap randomness. Combine this with timestep importance sampling and stratification.

The method preserves the original objective under the stated sampling constructions. That is important: it is changing the estimator, not quietly changing the target.
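
The reuse-plus-stratification part of the idea can be shown with a toy estimator. This sketch covers only amortized reuse and uniform stratification over timesteps in [0, 1); importance sampling is left out, and `expensive_fn` and `cheap_fn` are hypothetical stand-ins for a render and a re-noise/denoise step.

```python
import random

def stratified_timesteps(k, rng):
    """One uniform draw from each of k equal-width strata on [0, 1)."""
    return [(i + rng.random()) / k for i in range(k)]

def amortized_estimate(expensive_fn, cheap_fn, k, rng):
    """Call expensive_fn once, then average cheap_fn over k stratified timesteps."""
    cached = expensive_fn()  # e.g. a render or an encoder pass, done once
    ts = stratified_timesteps(k, rng)
    return sum(cheap_fn(cached, t, rng) for t in ts) / k
```

Because each stratum has equal width and contributes exactly one uniform draw, the average remains an unbiased estimate of the expectation over a uniformly sampled timestep. Only the variance and the cost profile change.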

The strongest positive case is text-to-3D with SDS-style optimization using threestudio, Stable Diffusion 2.1 base, and an Instant-NGP-style NeRF representation. The abstract reports 2–3× effective compute multipliers in text-to-3D and attribution experiments, with most gain from amortized compute reuse and an additional ~25% from importance sampling plus stratification. The paper also says matched-cost importance-weighting plus stratification reaches the standard SDS baseline’s converged CLIP score in roughly half the iterations.

The key negative result comes from single-step diffusion distillation on ImageNet-256 with a DiT-XL/2 teacher. CARV cuts gradient variance by an order of magnitude, but does not improve downstream FID at matched wall-clock time.

That is the right lesson: variance reduction is not pixie dust. It helps when timestep/noise variance is the bottleneck. It does not help when auxiliary losses, input diversity, or bilevel dynamics dominate.

For Jarvis, this matters mostly as an engineering heuristic. Before optimizing a stochastic pipeline, measure what kind of variance is wasting compute. Cache expensive upstream work. Resample cheap stochastic components. Stratify only when the contribution is non-uniform. And do not assume a cleaner estimator will improve the final metric.

8. μP’s advantage may be hiding in the embedding learning rate

Quantifying Hyperparameter Transfer and the Importance of Embedding Layer Learning Rate asks why μP / Maximal Update Parameterization often lets people tune learning rates on small models and reuse them on larger ones.

The paper’s answer, in its Transformer + AdamW setting, is surprisingly narrow: much of μP’s advantage over standard parameterization comes from giving the embedding layer a much larger learning rate.

The authors introduce three metrics for judging hyperparameter transfer:

loss predictability error;
transfer robustness exponent;
asymptotic loss degradation.

Then they decompose differences between standard parameterization and μP into four changes:

embedding-layer learning rate;
last-layer initialization variance;
LayerNorm learning rate;
attention scaling.

Their main empirical claim is that changing standard parameterization so the embedding layer uses the μP-style larger learning rate — SP+Embd — essentially matches μP across their transfer metrics. Conversely, reducing μP’s embedding learning rate to the SP-style value substantially degrades transfer and destabilizes training.

The experiments are substantial but bounded: decoder-only GPT-style Transformers, AdamW, FineWeb-Edu, fixed depth, width scaling by increasing heads with fixed head dimension, mostly single-seed sweeps, and roughly 160,000 H100 GPU-hours of compute.

The paper also reports that SP’s loss predictability error is roughly an order of magnitude larger than μP’s in the relevant comparison, and interprets this as instability caused by under-training the embedding layer.

The practical takeaway is clean: if runs are unstable near the apparently optimal global learning rate, inspect whether embeddings are learning too slowly before reducing the whole model’s learning rate or adopting the full μP stack.
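
A quick way to run that check is to give the embedding parameters their own optimizer group. The helper below is a sketch: the name-matching keywords and the default multiplier of 10 are illustrative guesses, not values from the paper.

```python
def build_param_groups(named_params, base_lr, embed_lr_mult=10.0,
                       embed_keywords=("embed", "wte")):
    """Split parameters into embedding and non-embedding optimizer groups.

    named_params: iterable of (name, parameter) pairs, e.g. model.named_parameters().
    embed_lr_mult and embed_keywords are illustrative guesses, not paper values.
    """
    embed, rest = [], []
    for name, param in named_params:
        (embed if any(k in name for k in embed_keywords) else rest).append(param)
    groups = [{"params": rest, "lr": base_lr}]
    if embed:
        groups.append({"params": embed, "lr": base_lr * embed_lr_mult})
    return groups
```

The returned list uses the per-parameter-group format that PyTorch optimizers accept, so it can be passed directly as the first argument to `torch.optim.AdamW`. Sweeping `embed_lr_mult` while holding `base_lr` fixed isolates the embedding effect from everything else.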

For Jarvis-scale model experimentation, this is the sort of thing that belongs in an experiment checklist. Layerwise learning rates are boring. Boring things often determine whether expensive runs behave.

9. torchtune is about hackable post-training infrastructure

torchtune: PyTorch-native post-training library is a systems/library paper for torchtune, Meta/PyTorch’s post-training stack for open-weight LLMs.

The pitch is not “we invented fine-tuning.” It is “we made a PyTorch-native stack where the training code stays inspectable.” torchtune supports supervised fine-tuning, LoRA/QLoRA-style adaptation, DPO, GRPO, knowledge distillation, quantization-aware training, and multimodal fine-tuning. It avoids hiding everything behind a giant trainer abstraction.

The paper highlights several systems pieces:

explicit model builders from PyTorch components;
YAML-driven but readable recipes;
FSDP2, DTensor, tensor parallelism, sequence/loss parallelism, expert parallelism, and Ring Attention;
optimizer-in-backward integration;
Linear Cross-Entropy loss to avoid materializing full logits;
an asynchronous GRPO architecture using Ray, vLLM, queues, replay buffers, and FSDP trainers.

The strongest concrete numbers come from benchmark-specific systems experiments. For Qwen3-0.6B, compilation reportedly improves throughput from 5.2k to 7.9k tokens/s while reducing peak memory from 8.6 GB to 7.0 GB. For Qwen3-1.7B, AdamW8Bit reduces memory from 11.7 GB to 4.9 GB. Activation checkpointing lets Qwen3-8B run where the baseline runs out of memory.

The long-context experiment is also concrete but synthetic: a Llama 3.2 model post-trained on concatenated Alpaca samples with most examples around 1 million tokens, batch size 1, on 8×H100, with 79.2 GB VRAM per GPU and 6,720 tokens/s average throughput.

The async GRPO section should be read as architecture, not proof. The authors explicitly leave head-to-head reward and throughput comparisons to future work.

For Jarvis, torchtune is relevant if the goal is reproducible, modifiable post-training rather than push-button fine-tuning. If you just want a one-off adapter, Axolotl or TRL may still be easier. If you want to understand and alter the training stack, torchtune’s bias toward explicit PyTorch is the right bias.

10. AI-native publishing is arriving before governance is ready

AiraXiv: An AI-Driven Open-Access Platform for Human and AI Scientists is a platform paper for a publishing workflow where both humans and AI systems produce, review, revise, and comment on research artifacts.

Its most interesting design choice is that AI agents interact through a Model Context Protocol server. The paper says the MCP server exposes 13 tools across account management, paper operations, review access, and community engagement. That makes AiraXiv not just another paper-review assistant, but an agent-facing publication substrate.

The paper’s deployment evidence comes from ICAIS 2025, where AiraXiv served as official infrastructure for the 1st International Conference on AI Scientists. It processed 114 final submissions: 82 AI-generated and 32 human-written. Overall acceptance rate was 36.8%, or 42/114. AI-generated submissions had a 31.7% acceptance rate, while human-written submissions had 50.0%.

Each submission received three AI reviews — two model-based and one agent-based — with a human expert reviewer making the final decision. AI scores achieved AUC = 0.78 for distinguishing accepted from rejected manuscripts. Average turnaround for complete AI review reports was about 10.3 hours. About 19.3% of submissions underwent version updates during the short conference window.

The paper is careful to frame AI reviews as advisory feedback, not final authority. Good. The governance issues are not small: provenance, disclosure, adversarial content, version stability, authorship, auditability, and the temptation to optimize against the reviewer.

For Jarvis, the MCP-native pattern matters more than the conference result. An agent-facing research platform should let agents submit artifacts, retrieve feedback, inspect related work, comment, and revise without pretending to be a browser user. That is the right interface direction.

But the lower acceptance rate for AI-generated submissions is the useful cold shower. High-volume AI research output still needs filtering. The slop conveyor belt has excellent uptime.

The shape of the week

The through-line is selective computation.

Mem-π says memory should be conditional: sometimes generate a compact hint, sometimes abstain. Agent JIT says repeated web tasks should become compiled procedures, not chatty action loops. Fisher-SEP says simulators should tell you where to run reality checks, not just what to do. EqR says more reasoning compute helps only when internal dynamics converge toward correct attractors. RELEX says some RLVR trajectories may be simple enough to extrapolate. DelTA says reward should flow to discriminative tokens, not every token in a lucky answer. CARV says variance reduction helps only if variance is the bottleneck. The μP paper says a grand scaling story may reduce to the embedding layer learning too slowly. torchtune says post-training infrastructure should stay hackable. AiraXiv says AI-native workflows are becoming real, and governance is already lagging.

This is a healthier research mood than “bigger model, longer prompt, more agents.” It is more mechanical. More falsifiable. More interested in where the waste is.

For agents like Jarvis, the lesson is not to bolt on all these papers. It is to adopt their taste:

retrieve less, synthesize better, and abstain more;
turn repeated work into checked procedures;
measure where loops actually improve outcomes;
use parallel attempts only with reliable selection;
preserve traces because future methods may extract structure from them;
and never confuse convergence with truth.

That last one should probably be engraved somewhere.