Seven papers this week share a quiet argument. Across training algorithms, video generation, vocabulary extension, and skill injection, each one is asking the same question in different terms: where does knowledge live? In the prompt? In the weights? In the architecture? The answer keeps coming back the same way — context is expensive and fragile, weights are durable and cheap at inference, and the hard work is getting knowledge from one to the other without breaking something along the way.


The model that ate its own cheat sheet

SKILL0: In-Context Agentic Reinforcement Learning for Skill Internalization is the paper that makes the week’s through-line most explicit. The standard recipe for LLM agents today is to store skills as text files and stuff them into the prompt before each action — exactly what Jarvis does when it activates a skill. SKILL0 identifies the flaw in this: a model following skill descriptions in its prompt is executing skills, not learning them. The competence lives in the context window, not in the weights, and it costs tokens every single turn.

The fix is a curriculum-based reinforcement learning loop that starts by handing the model its full skill library as in-context guidance, then progressively withdraws it as training proceeds. Each curriculum stage uses a validation subset to score how much each skill file actually helps the current policy; anything that no longer moves the needle gets dropped. By the end, the model runs inference with zero skill retrieval.
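The withdrawal loop can be sketched in a few lines. Everything here is illustrative (the function names, the `min_gain` threshold, and the stage count are assumptions, not the paper's interface): the point is that a skill survives a curriculum stage only if removing it from the prompt measurably hurts validation success.

```python
# Sketch of SKILL0-style skill withdrawal (names and threshold are illustrative).
# A skill survives a curriculum stage only if including its text in the prompt
# still improves validation success for the current policy.

def withdraw_skills(policy, skills, val_tasks, validate, min_gain=0.01, stages=4):
    """Progressively prune in-context skills that no longer help the policy.

    validate(policy, tasks, skill_texts) -> success rate in [0, 1]
    """
    active = dict(skills)  # skill name -> skill text
    for stage in range(stages):
        baseline = validate(policy, val_tasks, list(active.values()))
        gains = {}
        for name in list(active):
            without = [t for n, t in active.items() if n != name]
            gains[name] = baseline - validate(policy, val_tasks, without)
        # Drop skills whose removal costs (almost) nothing.
        for name, gain in gains.items():
            if gain < min_gain:
                del active[name]
        # ... RL fine-tuning on the training tasks would happen here,
        # with `active` injected as in-context guidance.
    return active
```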

The results on ALFWorld — a standard household task benchmark — are striking in the right direction: 87.9% average success for the 3B model, compared to 78.2% for the best skill-augmented baseline and 48% for GPT-4o. On token efficiency, the contrast is sharper: SKILL0 uses roughly 0.38k tokens per step where SkillRL uses 2.21k, a greater than 5× reduction.

The most interesting number in the ablation is a small one: when you remove the skill context at inference from a SKILL0-trained model, accuracy goes up by 1.6 percentage points compared to keeping it. The skill lookup has become noise.


The caveats are real. All training and evaluation happen on two simulated environments (ALFWorld and a retrieval QA suite). The initial skill library is inherited from a prior system, so the approach is upstream-dependent. But as a template for where fine-tuning attention should go, the direction is credible — and directly relevant to how Jarvis is architected. The current prompt-time skill injection model is Paradigm I; this is what Paradigm III looks like.


Thinking in batches

Batched Contextual Reinforcement: A Task-Scaling Law for Efficient Reasoning (Yang et al., UIUC + Tsinghua) approaches the cost problem from the training side. The observation is simple: if you concatenate N math problems into a single context window during training, the model has to be concise because it can’t afford to ramble on problem 1 and still have tokens for problems 2 and 3. The implicit budget pressure, applied with no explicit “be brief” reward signal, produces a model that generates shorter and more accurate reasoning than the baseline.

The key claim is that this transfers to single-problem inference. You don’t have to batch problems at test time to see the benefit — the conciseness has been baked in.

For the 4B model (BCR-Qwen3-4B versus Qwen3-4B-Thinking-2507), token reduction across five math benchmarks is 15.8–31.8%, while accuracy goes up on all five simultaneously. On AIME25 the gain is +13.3 percentage points — though AIME25 is a small, hard benchmark where a 13-point swing might represent a handful of problems, so that specific number deserves error-bar skepticism. The 1.5B model shows accuracy improvement on two of five benchmarks, with reduction on all five.

The paper also identifies N (number of concurrent problems at inference time) as a controllable efficiency axis. BCR-trained models at N=5 on MATH-500 use 839 tokens per problem at 79.0% accuracy; the baseline collapses to 63.2% accuracy at the same N, and uses 3,099 tokens per problem even at N=1. The “3.7× cost reduction” claim in the paper bundles the efficiency modes together; read it with that in mind.

The mechanism contrast with explicit length penalties is the most interesting finding: two specific explicit penalty configurations both drove accuracy to zero during training. The BCR implicit approach, where the budget is a hard constraint rather than a competing gradient, didn’t collapse. If you’re running reasoning-heavy scheduled jobs — like Jarvis’s nightly code review — this is worth a direct test even without retraining, since grouping problems at inference time is free.
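Since grouping problems at inference needs no retraining, a direct test costs little more than a prompt template. A minimal sketch (the tagging format is my own; the paper's exact template is not reproduced here):

```python
# Sketch of inference-time problem batching (format is illustrative).
import re

def batch_prompt(problems):
    """Concatenate N problems into one context, asking for N tagged answers."""
    parts = [f"Problem {i + 1}: {p}" for i, p in enumerate(problems)]
    instruction = (
        f"Solve all {len(problems)} problems. "
        "End each solution with a line 'Answer k: <result>' for problem k."
    )
    return instruction + "\n\n" + "\n\n".join(parts)

def parse_answers(completion, n):
    """Pull the tagged answers back out of a single batched completion."""
    answers = {}
    for m in re.finditer(r"Answer (\d+):\s*(.+)", completion):
        answers[int(m.group(1))] = m.group(2).strip()
    return [answers.get(k + 1) for k in range(n)]
```

Comparing per-problem token counts and accuracy at N=1 versus N=5 on a scheduled workload would replicate the paper's efficiency-axis measurement without any fine-tuning.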


Routing the signal

Sample-Routed Policy Optimization (SRPO) solves a more specific problem: what to do when two RL training methods each have failure modes the other doesn’t. GRPO is stable but slow to improve. SDPO (self-distillation policy optimization) improves fast early but collapses with extended training — because applying it to correct responses forces arbitrary token-level preferences between reward-equivalent answers, and because the teacher’s signal degrades as the student catches up.

The fix is routing rather than blending. Correct responses go to GRPO; incorrect responses with a correct sibling in the same rollout group go to SDPO. No mixing ratio, no per-task tuning. Inside the SDPO branch, token contributions are weighted inversely to the teacher’s entropy at that position — uncertain teacher tokens get down-weighted.
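The routing rule itself is a few lines of control flow. A sketch under stated assumptions (binary rewards, one rollout group per prompt; the entropy weighting function is one plausible form, not the paper's exact choice):

```python
# Sketch of SRPO's sample routing (data layout is illustrative). Each rollout
# group shares one prompt; rewards[i] is 1.0 for a correct response.
import math

def route_group(responses, rewards):
    """Split one rollout group into GRPO-bound and SDPO-bound samples."""
    has_correct = any(r >= 1.0 for r in rewards)
    grpo_batch, sdpo_batch = [], []
    for resp, r in zip(responses, rewards):
        if r >= 1.0:
            grpo_batch.append(resp)   # correct -> standard GRPO update
        elif has_correct:
            sdpo_batch.append(resp)   # incorrect, but a correct sibling
                                      # exists to distill from
        # incorrect with no correct sibling: no distillation target; the
        # paper's handling of this case is not reproduced in this sketch
    return grpo_batch, sdpo_batch

def token_weights(teacher_entropies):
    """Down-weight tokens where the teacher is uncertain. exp(-H) is one
    plausible inverse-entropy weighting, not the paper's exact formula."""
    return [math.exp(-h) for h in teacher_entropies]
```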

Across five science Q&A tasks and a tool-use task, Qwen3-8B under SRPO reaches 77.4% average (versus 74.0% for GRPO and 71.1% for SDPO at the 10-hour wall-clock budget). The ablation separates routing from naive advantage-level blending: a simple combined loss is 3.3 percentage points behind SRPO, confirming that the routing logic itself is doing meaningful work.

One practical note: SRPO becomes cheaper than standalone GRPO after about 5 hours of training, because standalone GRPO produces longer responses throughout. Shorter SDPO responses dominate SRPO’s batches early; as the policy improves, more samples route to the GRPO branch and per-step cost rises, yet at 10 hours SRPO is still 17.2% faster per step. The benchmarks are narrow (four SciKnowEval subsets and ToolAlpaca), so global generalization is unproven, but the mechanism is clean.


Seven players, one model

ActionParty: Multi-Subject Action Binding in Generative Video Games addresses a failure mode that turns out to be universal in video world models: they can’t track which action belongs to which character. Ask any existing model to simultaneously move the triangle down and the square left, and it will either blend them, apply the wrong action to the wrong entity, or drop one character entirely.

The fix uses subject state tokens — lightweight 2D coordinate vectors (six tokens per character) that travel alongside video tokens through a diffusion Transformer. Two attention masks enforce entity isolation: a cross-attention mask that hard-blocks subject token i from attending to any action embedding except action i, and a self-attention mask that prevents subject tokens from attending to each other. RoPE embeddings anchored to each subject’s previous spatial position bias attention toward nearby video tokens, reducing the binding problem from global disambiguation to local refinement.
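The two masks are simple to construct. A pure-Python sketch with illustrative shapes (`True` means attention is allowed; the paper uses six state tokens per character):

```python
# Sketch of the two binding masks. Shapes and layout are illustrative;
# the RoPE spatial bias is not reproduced here.

def cross_attn_mask(n_subjects, tokens_per_subject, n_actions):
    """Subject i's state tokens may attend only to action embedding i."""
    rows = n_subjects * tokens_per_subject
    mask = [[False] * n_actions for _ in range(rows)]
    for i in range(n_subjects):
        for t in range(tokens_per_subject):
            mask[i * tokens_per_subject + t][i] = True
    return mask

def subject_self_attn_mask(n_subjects, tokens_per_subject):
    """Subject tokens attend within their own subject, never across subjects."""
    rows = n_subjects * tokens_per_subject
    mask = [[False] * rows for _ in range(rows)]
    for i in range(n_subjects):
        base = i * tokens_per_subject
        for a in range(tokens_per_subject):
            for b in range(tokens_per_subject):
                mask[base + a][base + b] = True
    return mask
```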

On the Melting Pot benchmark (46 2D tile-grid games, 230 evaluation rollouts), ActionParty reaches 0.779 Movement Accuracy — approximately 5× better than the best text-only baseline at 0.158. Subject Preservation is 0.903 versus 0.668 for baselines that frequently lose characters from the scene.

The ablation is unusually clean. Removing the cross-attention mask drops Movement Accuracy from 87.2% to 5.2%; removing the RoPE bias drops it to 3.2%. These aren’t incremental components — they’re load-bearing.

Honest caveats: this is 2D tile-grid only, it requires coordinate initialization (identical-looking players can’t be bootstrapped from text alone), and it’s not real-time. The “first video world model to control 7 players” claim is specific to this benchmark setup and the comparison set is whatever the authors could construct from Wan2.1, since no open-source multi-agent world model baselines existed.


The initialization problem you didn’t know existed

Grounded Token Initialization for New Vocabulary in LMs for Generative Recommendation (Daiwei Chen et al., LinkedIn) addresses a quiet pathology in vocabulary extension: the standard approach initializes all new tokens to the same vector — the mean of the existing vocabulary — and assumes fine-tuning will sort it out.

The paper shows it doesn’t sort it out. SVD analysis on embedding matrices after fine-tuning reveals that mean-initialized embeddings remain in a low-rank degenerate subspace regardless of training duration. All new tokens start at the same point; the gradient landscape never fully unentangles them.

GTI (Grounded Token Initialization) adds a short pre-training stage before standard fine-tuning. The backbone is frozen; only the new token embeddings are trained, using paired (text description → token ID) and (token ID → text description) examples. The grounding is bidirectional because weight tying in the LM head means the same embedding matrix handles both reading and generation.

On LinkedIn’s industrial job-candidate retrieval dataset, GTI achieves a +21.63% relative gain in P@5 versus vanilla fine-tuning, compared to +6.38% for the competing LC-Rec approach. On the public Vibrent clothes rental dataset, the Recall@20 improvement is +26.02% versus +13.41% for LC-Rec. Important disclaimer: the industrial numbers are relative only — absolute performance is withheld for commercial reasons. All experiments use Qwen3-0.6B. One backbone, two datasets, both in the recommendation domain.

The pattern itself — freeze backbone, train only new embeddings, use paired text supervision, then continue to full fine-tuning — requires no architectural changes and is achievable with a standard SFT loop. If you’re ever adding domain-specific control tokens, tool tokens, or discrete codebook entries to a pretrained model, the argument for a grounding stage first is now formally supported.
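As a minimal illustration of the freezing pattern, here is the update rule in plain Python (the real stage would run inside a standard SFT loop on the paired text examples; names and the bare SGD step are illustrative):

```python
# Sketch of the GTI-style update rule: during the grounding stage the backbone
# and all pre-existing embeddings stay frozen; only rows belonging to the
# newly added tokens move.

def grounding_step(embeddings, grads, new_token_ids, lr=0.1):
    """Apply a gradient step to new-token embedding rows only."""
    new_ids = set(new_token_ids)
    for tok_id, (row, grad_row) in enumerate(zip(embeddings, grads)):
        if tok_id not in new_ids:
            continue  # frozen: original vocabulary is untouched
        for j in range(len(row)):
            row[j] -= lr * grad_row[j]
    return embeddings
```

Because of weight tying in the LM head, the same rows serve both reading and generation, which is why the paired supervision runs in both directions.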


The map of agentic tool use

Agentic Tool Use in Large Language Models (Hu et al., Harbin Institute of Technology / TikTok) is a survey, not an experiment. Its contribution is organizational: it segments five years of research into three paradigms — prompting frozen models to use tools, fine-tuning on tool-use data, and optimizing tool-use policies with RL — and uses that frame to make the progression legible.

A few specific findings from cited work are worth pulling out. Toolformer’s self-supervised criterion (retain a tool call only if inserting it reduces next-token prediction loss on subsequent tokens) is a concrete and reproducible threshold, not a heuristic. SearchR1 induced query reformulation and information synthesis from sparse outcome rewards alone, with no dense reward shaping. The Agentless paper’s finding that “carefully designed pipelines may outperform more complex long-horizon agents when task structure is clear” is a counterintuitive data point worth keeping around.
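Toolformer's criterion reduces to a delta-loss filter. A simplified sketch (the original also compares against inserting the call without its result, and tunes the threshold; `lm_loss` is a placeholder for a real cross-entropy computation):

```python
# Simplified sketch of Toolformer's self-supervised filter: a candidate tool
# call survives only if conditioning on the call (with its result) lowers the
# LM loss on the continuation by more than a margin.

def filter_tool_calls(candidates, lm_loss, margin=0.1):
    """candidates: (prefix, call_with_result, continuation) triples.

    lm_loss(context, continuation) -> float cross-entropy (placeholder).
    """
    kept = []
    for prefix, call, continuation in candidates:
        base = lm_loss(prefix, continuation)
        augmented = lm_loss(prefix + call, continuation)
        if augmented < base - margin:
            kept.append((prefix, call, continuation))
    return kept
```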

The survey’s section on MCP standardization (§7.1) notes that tool libraries are expected to become “dynamic protocol-compliant marketplaces,” transforming tool integration from O(n²) pairwise adapters to O(n) standard-compliant implementations. Jarvis’s current skill system is a hand-rolled approximation of this, which makes the maturing MCP ecosystem worth watching.

The safety section is more sobering. Indirect prompt injection via tool outputs — where hostile content in an external URL or email body manipulates the agent’s subsequent actions — is a documented attack class, with benchmarks like ToolSword, InjecAgent, and ToolEmu cataloguing it. Jarvis reads external URLs and email content into context. That content is untrusted and the attack surface is real.


Making regulations machine-readable

De Jure: Iterative LLM Self-Refinement for Structured Extraction of Regulatory Rules (Vanguard Group) proposes a four-stage pipeline: convert regulatory PDF to Markdown, extract typed rule units (actions, conditions, constraints, exceptions, penalties) via LLM, score quality across 19 explicit criteria via a second LLM acting as judge, and repair anything scoring below 90% — up to three times. No human annotation, no domain-specific prompting.

The pipeline runs unchanged on HIPAA, the SEC Advisers Act, and the EU AI Act — three structurally distinct regulatory corpora. Self-scored quality averages above 4.70/5.00 (94%) across all model and domain combinations, though these scores are generated by the pipeline’s own judge against its own criteria, which is a circularity worth naming. The downstream evaluation is more interesting: responses grounded in De Jure-extracted rules are preferred by a judge LLM over a prior system (Datla et al. 2025) in 73.8% of cases at single-rule retrieval depth, rising to 84.0% at ten-rule retrieval. Both the extraction and the preference evaluation are LLM-judged; no human annotation of extraction correctness is reported.

The engineering insight that transfers cleanly: hierarchical repair order matters. The pipeline repairs section metadata first, then term definitions, then rule units — in dependency order — and the ablation confirms that the definition stage captures the largest share of quality gain as the acceptance threshold tightens. Multi-stage LLM extraction pipelines that repair everything at once are leaving gains on the table.
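The hierarchical repair order is easy to encode. A sketch with placeholder judge/repair calls standing in for the LLM invocations (stage names follow the paper's hierarchy; the loop structure is my reading of the description, not released code):

```python
# Sketch of the score-and-repair loop: upstream stages are repaired before the
# stages that depend on them, each for at most `max_rounds` passes.

STAGES = ["section_metadata", "term_definitions", "rule_units"]  # dependency order

def refine(extraction, judge, repair, threshold=0.90, max_rounds=3):
    """Re-run repair on any stage scoring below threshold, hierarchically.

    judge(stage, extraction) -> score in [0, 1]   (LLM-as-judge placeholder)
    repair(stage, extraction) -> new extraction   (LLM repair placeholder)
    """
    for stage in STAGES:
        for _ in range(max_rounds):
            score = judge(stage, extraction)
            if score >= threshold:
                break
            extraction = repair(stage, extraction)
    return extraction
```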

This is a preprint from a single institution, not yet peer-reviewed, with no code or data released. Take it as a plausible approach with modest but directionally credible evidence.


Orthostochastic routing for residual streams

go-mHC: Direct Parameterization of Manifold-Constrained Hyper-Connections via Generalized Orthostochastic Matrices (Dandachi & Diggs-Galligan) addresses a narrow but real architectural problem. Models that use multiple residual streams instead of one need mixing matrices that are doubly stochastic (rows and columns each sum to 1) to keep gradients stable. Exact parameterization was previously either factorial in cost (mHC-lite) or expressivity-limited (KromHC). The paper proposes constructing these matrices via the Cayley transform applied to a skew-symmetric parameter matrix, projecting to doubly stochastic via block Frobenius, and introduces a tunable hyperparameter s that interpolates between the orthostochastic boundary and broader coverage of the mixing space.
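For intuition, one standard route from an unconstrained parameter to a doubly stochastic matrix runs through the Cayley transform: a skew-symmetric matrix maps to an orthogonal one, and the elementwise square of an orthogonal matrix is doubly stochastic. The sketch below covers only that orthostochastic boundary case; the paper's block-Frobenius projection and its interpolation parameter s are not reproduced here.

```python
# Sketch: unconstrained parameter -> doubly stochastic mixing matrix via the
# Cayley transform (orthostochastic boundary case only).
import numpy as np

def orthostochastic(theta):
    """theta: unconstrained (d, d) parameter array.

    1. Skew-symmetrize: A = theta - theta.T, so A.T == -A.
    2. Cayley transform: Q = (I - A) @ inv(I + A) is orthogonal
       (I + A is always invertible for skew-symmetric A).
    3. The elementwise square of an orthogonal matrix is doubly stochastic:
       entries are non-negative and every row and column sums to 1.
    """
    d = theta.shape[0]
    A = theta - theta.T
    I = np.eye(d)
    Q = (I - A) @ np.linalg.inv(I + A)
    return Q ** 2
```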

The core result on a synthetic stream-mixing task: go-mHC hits theoretical minimum MSE and converges approximately 10× faster in epoch count than mHC-lite. On a 30M-parameter language model (TinyStories, NanoGPT architecture), all three parameterizations reach comparable loss — the advantage is expected to appear at scale or larger stream counts (d), where mHC-lite becomes intractable (at d=8, its parameter count exceeds 1.4B for a 30M model).

The paper also proves formally that KromHC cannot expand its spectral reach by stacking Kronecker products — the eigenvalue space of a k-fold Kronecker product of doubly stochastic matrices is invariant to k. This closes off an obvious-seeming escape route.

Two-person team, single RTX 4090. Interesting math that may become relevant if the residual-stream-count scaling hypothesis materializes in production models.


A memory pruning framework without experiments

Novel Memory Forgetting Techniques for Autonomous AI Agents (Fofadiya & Tiwari) is primarily a literature synthesis. The mathematical framework — a relevance score combining recency, access frequency, and semantic similarity to current query, with exponential temporal decay, subject to a fixed budget constraint — is theoretically reasonable but untested. No model is trained, no system is implemented, and the “improved long-horizon F1 beyond 0.583 baseline levels” claim in the conclusion has no experimental table behind it.

The benchmarks being cited from prior work are genuine, though. The LOCCO degradation statistic — Openchat-3.5 memory scores declining from 0.455 to 0.05 across temporal stages, an 85% drop — is from Jia et al. 2025 and describes a real failure mode. The MultiWOZ False Memory Rate of 6.8% under write-time filtering is from Phadke et al. (NeurIPS 2025 workshop). These numbers are useful regardless of this paper’s specific proposal.

The three-factor scoring idea and the constrained-maximization framing are reasonable design targets for any memory pruning policy, and the decay parameter λ maps directly to the practical question of how long project-level memories should remain load-bearing before being verified or culled.
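A sketch of the three-factor score and the budget-constrained prune (the weights, the log-frequency term, and the exact decay form are illustrative; the paper gives no reference implementation):

```python
# Sketch of a three-factor memory relevance score with exponential temporal
# decay, plus a fixed-budget prune. All weights are illustrative.
import math

def relevance(age_s, access_count, similarity, lam=1e-5,
              w_recency=0.3, w_freq=0.2, w_sim=0.5):
    """Score one memory item; higher = keep."""
    recency = math.exp(-lam * age_s)   # exponential temporal decay, rate lambda
    freq = math.log1p(access_count)    # diminishing returns on repeated access
    return w_recency * recency + w_freq * freq + w_sim * similarity

def prune(memories, budget):
    """Keep the top-`budget` items by relevance (the budget constraint)."""
    ranked = sorted(memories, key=lambda m: relevance(*m["stats"]), reverse=True)
    return ranked[:budget]
```

The decay rate `lam` is the knob that maps to the practical question above: a small value keeps project-level memories load-bearing for months, a large one forces early verification or culling.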


Compact chemistry as attention bias

Crystalite: A Lightweight Transformer for Efficient Crystal Modeling (Hadži Veljković et al.) is materials science and sits at the edges of general relevance. The contribution is a diffusion Transformer for crystal structure prediction that replaces bulky element one-hot encodings with 8-dimensional PCA-compressed chemistry vectors derived from period, group, block, and valence-shell occupancies, and adds a per-layer attention bias computed from periodic pairwise distances.

The two ideas transfer beyond their domain. The Subatomic Tokenization approach — encode a discrete token space as compact continuous vectors where chemical similarity is geometrically meaningful, decode back to discrete via cosine-nearest-prototype matching — is a template for any structured-token diffusion problem with latent domain structure. The Geometry Enhancement Module pattern — compute domain-specific pairwise geometry, inject it as an additive attention bias, gate by noise level, skip equivariant message-passing entirely — is a clean modular technique for getting domain structure into a Transformer without coupling it to the backbone architecture.
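The bias-injection pattern is compact enough to sketch directly: compute pairwise geometry, subtract a distance-scaled, noise-gated term from the attention logits, then softmax as usual (single head, pure Python; the module's exact bias function is not reproduced here):

```python
# Sketch of the additive-attention-bias pattern: nearer pairs are biased
# upward, gated by the diffusion noise level. Single head, illustrative scale.
import math

def attention_with_bias(scores, pairwise_dist, noise_gate, scale=1.0):
    """scores: (n, n) raw attention logits; pairwise_dist: (n, n) geometry;
    noise_gate in [0, 1] controls how strongly geometry shapes attention."""
    n = len(scores)
    biased = [
        [scores[i][j] - noise_gate * scale * pairwise_dist[i][j] for j in range(n)]
        for i in range(n)
    ]
    out = []
    for row in biased:               # row-wise softmax
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        z = sum(exps)
        out.append([e / z for e in exps])
    return out
```

The appeal of the pattern is modularity: the geometry never touches the backbone's weights, only its logits, so the same Transformer runs with or without the domain structure.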

Results on Melting Point, MPTS-52, and Alex-MP-20 benchmarks include claimed state-of-the-art SUN (Stable, Unique, Novel) score for de novo crystal generation, though the numerical tables were in the truncated portion of the paper and exact figures aren’t reproducible from the notes alone.


The shape of the week

Read together, these papers describe a field in the middle of an architectural transition. The prompt-time skill injection model — context is cheap, retrieval is easy, model weights are fixed — is giving way to something more expensive to build and cheaper to run. SKILL0 trains until skills disappear from the prompt. BCR trains until verbosity disappears from reasoning. SRPO trains until the right gradient finds the right sample. GTI ensures new knowledge has a proper home in the embedding space before fine-tuning begins. ActionParty makes character identity a first-class architectural object rather than an implicit prompt-level hope.

The cost is borne upfront, in training. The benefit accrues at inference, repeatedly, at scale. That’s not a novel trade-off — it’s the same one that justifies fine-tuning over prompting generally — but the papers this week are sharper about the specific failure modes of the cheap path: the degenerate embedding subspace, the metacognitive loops, the character binding collapse, the skill context token budget. Knowing the failure modes precisely is how you know whether the expensive fix is worth it.


Reading list