Agents are getting less magical and more operational.
That is the useful thread through this week’s batch: not “bigger model solves task,” but “what feedback do we give the system, at what level, and how do we stop it from optimizing the wrong thing?” The strongest papers here are about calibration, executable feedback, hindsight, structured errors, and constrained extraction. Less sparkle, more plumbing. Good. Plumbing is where most of the actual intelligence hides.
A few themes repeat:
- Low-level actions do not automatically compose into high-level planning.
- Raw probability is not a verifier.
- Raw reward scores are not necessarily good training signals.
- Error signals are more useful when exposed structurally, not collapsed into a scalar.
- Schema, routing, and validation often beat prompt theatrics.
That matters for Jarvis because Jarvis is exactly the kind of system these papers are circling: tool-using, memory-bearing, browser-driving, sometimes autonomous, and therefore capable of failing in ways a benchmark final answer will not reveal.
1. GUI agents need hindsight, not just click logs
Empowering GUI Agents via Autonomous Experience Exploration and Hindsight Experience Utilization for Task Planning is the most directly Jarvis-relevant paper in the set.
The paper studies multimodal agents operating on unfamiliar websites. Its basic claim is that training on atomic UI actions — click here, type this, scroll there — does not reliably teach an agent to complete larger tasks. That sounds obvious, but the paper gives it useful empirical teeth.
Their method, PEEU, has two stages. First, an agent explores a website and collects trajectories. Then the system uses hindsight to rewrite the training task so it matches what the trajectory actually achieved. If a trajectory did not find a “4.5-star recipe” but did find a “4-star recipe with these constraints,” the training example should not pretend otherwise. The target task is rewritten to align with the observed outcome.
That is a deceptively important move. Most agent memory systems are bad at this. They store the user’s original intention, or a flat trace of actions, but not the corrected episode: what was discovered, what constraints mattered, what actually happened, and where the task diverged from the plan.
The headline result is benchmark-specific but worth noticing. On held-out real-world WebVoyager-style websites, a Qwen2.5-VL-7B model trained with PEEU reaches 30.6% trajectory-level success, compared with:
- 7.8% for the base Qwen2.5-VL-7B instruct model;
- 22.7% for the larger Qwen2.5-VL-32B-Instruct;
- 19.0% for a coarse high-level SFT baseline;
- 21.7% for an atomic SFT baseline.
That is not “small model beats large model” in general. It is “under this evaluation setup, better-aligned high-level training data beat both a larger base model and lower-level training.” The absolute number is still only 30.6%, so anyone selling this as robust web automation is getting high on their own demo video.
The paper’s most useful result may be the failure of bottom-up generalization. In their TDHAF framework, low-level training gives strong low-level success but weak high-level success. A 7B model trained on low-level data gets 89.6% on low-level tasks but only 18.8% on corresponding high-level tasks. A 3B model gets 80.5% low-level and 9.1% high-level.
For Jarvis, the implication is immediate: browser traces should not be remembered as click/type logs alone. The useful artifact is a structured episode:
- user goal;
- site and page context;
- constraints discovered;
- action trajectory;
- outcome;
- failures and detours;
- final hindsight-corrected task description.
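As a rough sketch of what such an episode record could look like, here is a small dataclass whose fields mirror the list above. Every field name, the `relabel` helper, and the example site are hypothetical; in the paper the rewritten task comes from the hindsight stage, whereas here the caller supplies it.

```python
from dataclasses import dataclass, field


@dataclass
class Episode:
    """One browser episode stored as more than a click log (hypothetical schema)."""
    user_goal: str                                         # what the user originally asked for
    site: str                                              # site and page context
    constraints: list[str] = field(default_factory=list)  # constraints discovered on the way
    actions: list[str] = field(default_factory=list)      # low-level action trajectory
    outcome: str = ""                                      # what the trajectory actually achieved
    detours: list[str] = field(default_factory=list)      # failures and detours
    hindsight_task: str = ""                               # task rewritten to match the outcome


def relabel(episode: Episode, achieved_task: str) -> Episode:
    """Attach a hindsight-corrected task; the original goal is kept for comparison."""
    episode.hindsight_task = achieved_task
    return episode


ep = Episode(
    user_goal="Find a 4.5-star recipe",
    site="example recipe site",
    actions=["click search", "type 'recipe'", "scroll"],
    outcome="Found a 4-star recipe",
)
ep = relabel(ep, achieved_task="Find a 4-star recipe")

# True when the episode ended up solving a different task than the one requested.
diverged = ep.user_goal != ep.hindsight_task
```

Keeping both `user_goal` and `hindsight_task` is the design choice that matters: the gap between them is itself a signal worth remembering.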
The paper also offers a useful warning about retrieval-only memory. Prompt-based experience retrieval performed poorly for the 7B and 3B models in this setup. That does not prove retrieval is useless; it proves that naively stuffing old trajectories into context is not the same as learning how to plan.
2. Probability is not a verifier
When are likely answers right? On Sequence Probability and Correctness in LLMs asks a question that should make every best-of-N decoding scheme sweat: if a model assigns higher probability to an answer, is that answer more likely to be correct?
The answer is: sometimes, but usually not in the way you want.
The paper’s key contribution is separating the probability–correctness relationship by granularity:
- across examples in a dataset;
- across hyperparameters within one decoding method;
- across decoding methods;
- across multiple answers to the same prompt.
That distinction matters. Across a fixed benchmark, higher-probability answers often are more likely to be correct. The strongest case in the notes is MATH500, where the relationship is described as strongly positive for Qwen3-8B-Base with scalable power sampling. Other benchmarks such as GPQA, HumanEval, MedQA, and MMLU show weaker positive relationships. IFEval is a notable exception for base models.
But the signal breaks down in the setting people actually want: choosing the best answer for a specific prompt.
The paper reports that within-prompt rank correlations between answer probability and correctness are often roughly symmetric around zero, with MATH500 again being the main exception. Tuning a decoding method to produce higher-probability sequences also does not reliably improve correctness. Comparing across methods is similarly inconsistent: more probable answers can be more correct, but often are not.
The blunt takeaway: model probability can be a dataset-level diagnostic, but it is not a dependable per-answer verifier.
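The granularity point is easy to reproduce on toy data. The sketch below uses a pairwise concordance score (the share of correct/incorrect pairs where the correct answer has the higher log-prob) as a stand-in for the paper's rank correlations; the statistic, the function name `pair_auc`, and the numbers are all illustrative assumptions, not the paper's data or method.

```python
from collections import defaultdict


def pair_auc(samples):
    """Share of (correct, incorrect) pairs where the correct answer has the
    higher log-prob. 0.5 means no ranking signal; ties count as half a win."""
    pos = [lp for lp, ok in samples if ok]
    neg = [lp for lp, ok in samples if not ok]
    if not pos or not neg:
        return None  # undefined without both classes
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up rows: (prompt_id, sequence log-prob, is_correct).
rows = [
    ("easy", -1.0, True), ("easy", -1.2, False), ("easy", -1.4, True),
    ("hard", -5.0, False), ("hard", -5.2, True), ("hard", -5.4, False),
]

# Dataset level: pool every answer from every prompt.
pooled = pair_auc([(lp, ok) for _, lp, ok in rows])

# Prompt level: score each prompt separately, then average.
by_prompt = defaultdict(list)
for pid, lp, ok in rows:
    by_prompt[pid].append((lp, ok))
per_prompt = [a for a in map(pair_auc, by_prompt.values()) if a is not None]
within = sum(per_prompt) / len(per_prompt)
```

On this toy set `pooled` is about 0.67 while `within` is exactly 0.5: log-prob separates easy prompts from hard ones, but says nothing about which answer to pick for a given prompt.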
For Jarvis, this argues against a tempting shortcut: generate several answers, compute log-probs, pick the most likely. That may work in some narrow mathematical settings, but it should not be the default reliability mechanism. Tests, retrieval, execution, citations, validators, and task-specific checks beat probability mysticism.
This is also a useful warning for verifier-free self-improvement loops. If “high probability” is weakly correlated with correctness for your task, training on high-probability samples may just polish fluent wrongness. A very efficient way to make the system confidently mediocre. Nature finds a way.
3. RL without ground-truth answers, if the environment can score attempts
Reinforcement Learning without Ground-Truth Solutions can Improve LLMs is one of the more practically interesting coding-agent papers here.
The authors train models on AtCoder Heuristic Contest-style optimization problems. These tasks do not have a single ground-truth solution. Many programs can be valid; the best solution may be unknown. But you can execute a candidate, check feasibility, and assign an objective score.
Their method, RiVER, turns those executable scores into within-instance rankings. This matters because raw scores are often uncalibrated. One problem instance may have a huge score range, another a tiny one. If you train directly on raw scores, large-range instances dominate. The paper calls this scale dominance. It also identifies frequency dominance, where repeated mediocre samples can outweigh a rare strong one in group-relative RL.
RiVER’s recipe is clean:
- Generate multiple candidate programs.
- Execute them on the same hidden instance.
- Check validity separately from quality.
- Rank valid candidates within that instance.
- Give the best candidate a separated high reward.
- Still give bounded graded feedback to valid non-winners.
- Avoid comparing raw scores across unrelated instances.
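A minimal sketch of the within-instance ranking step, assuming higher scores are better. This is not the RiVER formula: the function name `rank_rewards` and the constants (`top_bonus`, `graded_max`, `invalid_reward`) are placeholders chosen for illustration, and the sketch does nothing about frequency dominance.

```python
def rank_rewards(scores, valid, top_bonus=1.0, graded_max=0.5, invalid_reward=-1.0):
    """Turn raw scores for candidates on ONE instance into rank-based rewards.

    Rewards depend only on the ordering of valid candidates, never on the
    magnitude of the raw scores. Ties are broken by input order.
    """
    rewards = [invalid_reward] * len(scores)
    valid_idx = [i for i, ok in enumerate(valid) if ok]
    if not valid_idx:
        return rewards
    # Valid candidates sorted from worst to best raw score.
    order = sorted(valid_idx, key=lambda i: scores[i])
    n = len(order)
    for rank, i in enumerate(order):
        # Bounded graded feedback in [0, graded_max] for every valid candidate.
        rewards[i] = graded_max * rank / (n - 1) if n > 1 else graded_max
    # The best valid candidate gets a clearly separated reward.
    rewards[order[-1]] = graded_max + top_bonus
    return rewards


# Two instances with wildly different score ranges and the same ordering.
wide = rank_rewards([10.0, 1_000_000.0, 500.0, 0.0], [True, True, True, False])
narrow = rank_rewards([1.0, 1.2, 1.1, 0.0], [True, True, True, False])
```

Both calls return `[0.0, 1.5, 0.25, -1.0]`, so the large-range instance cannot dominate the small-range one.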
The reported results are modest enough to be believable and interesting enough to matter. Training on 12 AHC tasks, RiVER improves ALE-Bench rating by:
- 142 points for Qwen3-8B;
- 157 points for GLM-Z1-9B-0414.
The abstract reports ALE rating-rank gains of 8.9% and 9.4% for those models. More importantly, the training transfers to exact-solution coding benchmarks. Across LiveCodeBench v5, LiveCodeBench v6, and USACO, the paper reports average absolute Pass@1 improvements of:
- +2.4 percentage points for Qwen3-8B;
- +3.5 percentage points for GLM-Z1-9B-0414.
That transfer is the interesting part. ALE-Bench is close in spirit to AHC; LiveCodeBench and USACO are not the same score-optimization regime. The paper’s claim is not that ground-truth-free RL solves coding. It is that score-based executable environments can provide transferable supervision when the reward is calibrated properly.
For Jarvis, this is directly applicable. Many useful tasks do not have a reference answer:
- write a faster script;
- repair a flaky service;
- optimize a backup job;
- reduce latency;
- improve a prompt;
- choose a data-processing plan.
But many of those tasks can be executed, sandboxed, checked, and ranked. The paper’s most transferable lesson is not the exact RiVER formula. It is: when scores vary across tasks, rank attempts within the same task before training or selecting.
Raw reward is a footgun. A nicely labeled footgun, but still.
4. Residuals should be read, not worshipped
Error-Conditioned Neural Solvers is a PDE paper with a broader lesson for self-correcting AI systems.
The method, ENS, solves partial differential equations by repeatedly feeding a model its own PDE residual field — the spatial pattern of where its current prediction violates the governing equation. Most physics-informed methods use the residual as an objective to minimize at test time. ENS instead uses the residual as an input to a learned corrector network, trained under reconstruction supervision.
That sounds like a small rearrangement. It is not. The paper’s central argument is that minimizing residual can be an unreliable proxy for solution quality in ill-conditioned systems. You can drive the residual down while keeping a bad solution, because some error directions are weakly observed by the residual operator.
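The gap is easy to see in a two-variable linear system, used here as a toy stand-in for a PDE operator. The matrix, the two candidate solutions, and the helper names are made up for illustration and have nothing to do with the paper's experiments.

```python
import math

# Diagonal system A x = b with one weakly observed direction (condition number 1e4).
A = [[1.0, 0.0], [0.0, 1e-4]]
x_true = [1.0, 1.0]


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def norm(v):
    return math.sqrt(sum(x * x for x in v))


b = matvec(A, x_true)


def residual_norm(x):
    """How badly x violates the equation A x = b."""
    return norm([r - t for r, t in zip(matvec(A, x), b)])


def error_norm(x):
    """How far x is from the true solution."""
    return norm([a - t for a, t in zip(x, x_true)])


x_close = [1.05, 1.0]   # slightly off in the well-observed direction
x_far = [1.0, 101.0]    # far off in the weakly observed direction

res_close, err_close = residual_norm(x_close), error_norm(x_close)  # about 0.05 and 0.05
res_far, err_far = residual_norm(x_far), error_norm(x_far)          # about 0.01 and 100
```

`x_far` has the smaller residual and an error roughly two thousand times larger, which is the failure mode a pure residual-minimizer cannot see.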
The analogy to agents is obvious and useful. A failed test suite, lint output, runtime trace, user correction, or violated constraint is often treated as an external trigger: retry, reflect, optimize, patch. ENS suggests a better pattern: expose the structured error directly to the model and train it to correct the underlying solution, not merely reduce the proxy.
The evidence in the supplied notes is partly qualitative because many table values were missing from the extracted text. The paper reports experiments across Helmholtz, Darcy, Poisson, Navier–Stokes, and turbulent Kolmogorov-flow-style settings. ENS is said to achieve the best reconstruction accuracy in most tested regimes, with the abstract claiming gains up to 10× on turbulent Kolmogorov flow. That number should be handled carefully because the underlying table values were not included in the notes.
The more robust details:
- ENS trains with 5 correction steps, supervising 6 predictions including the initial predictor output.
- At inference it can run more correction steps until residuals plateau.
- A zero-residual-input ablation prevents the recurrent model from reducing reconstruction loss or residual.
- Conditioning on the physics-loss gradient reduces residual faster but stalls reconstruction.
- The authors explicitly note one regime where ENS is not most accurate: Navier–Stokes super-resolution.
The strongest transferable idea is the residual–reconstruction gap. In agent terms: reducing the visible error metric is not the same as solving the task. Passing tests is not always correctness. Lower self-rated uncertainty is not truth. Cleaner formatting is not reasoning. The world is full of proxies wearing fake moustaches.
Jarvis should take the error signal seriously, but not worship it. Read the structured failure, then correct against the actual goal.
5. Clinical benchmarks need process audits, not just final answers
MedBench v5: A Dynamic, Process-Oriented, and Hallucination-Aware Benchmark for Clinical Multimodal Models is framed around clinical AI, but its best ideas generalize to any long-running agent.
The paper argues that static medical QA hides process failures. A model may land on a plausible final answer while failing to notice missing information, contradictions, delayed evidence, or hallucinated claims that persist across turns.
MedBench v5 proposes two evaluation layers:
- Clinical Cognitive Responsiveness, covering broad clinical capability;
- Medical Atomic Skills, covering executable-style agent environments such as DataAgent, RAGAgent, DeepResearch, and SafetyAgent.
The dynamic audit subset is concrete:
- 18 datasets;
- 90 original instances;
- 8 stress conditions per instance;
- 720 dynamic stress-testing scenarios.
The stressors are:
- information omission;
- contradiction injection;
- evidence delay;
- combinations of the above.
The process audit checks five nodes:
- information gap detection;
- follow-up strategy;
- contradiction detection;
- diagnosis update;
- evidence grounding.
That is the part worth stealing. Not the clinical claims wholesale, but the evaluation shape.
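One way to carry that shape into a non-clinical agent is to record a per-node verdict next to the final answer. The sketch below assumes a binary pass/fail per node, which is a simplification: the paper uses judge-model evaluation, and the class and function names here are hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass
class ProcessAudit:
    """Pass/fail verdicts for the five audited nodes (binary scoring is an assumption)."""
    information_gap_detection: bool
    follow_up_strategy: bool
    contradiction_detection: bool
    diagnosis_update: bool
    evidence_grounding: bool


def failed_nodes(audit: ProcessAudit) -> list[str]:
    """Names of the nodes that failed, in declaration order."""
    return [f.name for f in fields(audit) if not getattr(audit, f.name)]


# A run can land on the right final answer and still fail process checks.
final_answer_correct = True
audit = ProcessAudit(
    information_gap_detection=True,
    follow_up_strategy=True,
    contradiction_detection=False,
    diagnosis_update=False,
    evidence_grounding=True,
)
missed = failed_nodes(audit)
```

Here `missed` is `["contradiction_detection", "diagnosis_update"]` even though `final_answer_correct` is true, which is exactly the case final-answer scoring hides.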
The paper reports that contradiction detection and diagnosis update are the most sensitive nodes, while evidence grounding remains relatively stable. It also says omission plus delayed evidence is especially disruptive. Hallucination monitoring suggests that initial fabrication rates may not change dramatically, but hallucinated claims can persist, contaminate later reasoning, and resist correction.
That is very Jarvis-relevant. In a long workflow, the dangerous hallucination is often not the dramatic invented fact in the final answer. It is the quiet unsupported premise introduced early and reused for the next forty minutes.
The caveats are significant. Many headline model tables were not present in the supplied text. Some of the model names appear to refer to future or unreleased systems. The process audit uses judge-model evaluation, and although the paper reports human inter-rater reliability of ICC(A,k) = 0.74 with 95% CI 0.66–0.79, other agreement statistics appear malformed or missing in the extracted text.
So the careful claim is not “MedBench v5 proves model X is clinically unsafe.” It is: dynamic, process-level stress tests reveal failures that final-answer scoring can miss. That is already enough.
6. Political knowledge graphs: schema beats vibes
Mapping Political-Elite Networks in Europe with a Multilingual Joint Entity-Relation Extraction Pipeline is a useful antidote to “just ask an LLM to summarize the news.”
The paper describes an open-weight pipeline for extracting political-elite networks from multilingual news archives. It uses named-entity recognition, entity linking to Wikidata, constrained relation extraction, sentiment/sign assignment, temporal metadata, and graph analysis.
The ontology is large but bounded:
- 109 entity types;
- 99 relationship types;
- 8 relationship families;
- typed, directed, signed, temporal relations.
That constraint matters. The paper reports that guided decoding eliminated 25.6% off-ontology output and improved throughput by 10% in their setup. Even more amusingly, removing an 86-value relationship_pid enum and assigning PIDs downstream increased throughput from 0.78 to 1.86 articles/sec, a +138% gain. Schema design did more than prompt tinkering. Somewhere, a prompt engineer just felt a cold draft.
The evaluation is mixed but honest.
For NER, the paper uses a human-annotated German gold standard of 100 articles and reports:
- 83.8% F1;
- 85.5% precision;
- 82.3% recall.
For relation extraction, the gold standard is LLM-generated and adjudicated, not human-generated. The paper reports a text-grounded spot-check correctness band of:
- 68.2% strict correctness;
- 93.7% lenient correctness.
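Under the strict reading, 40.3% of extractions matched the gold exactly, while 27.9% were judged valid relations omitted by the gold; together those make up the 68.2% strict figure. The authors identify a 6.3% hallucination floor in the sample, which matches what remains outside the 93.7% lenient band.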
The case studies are the most interesting evidence. In Austria, the pipeline recovers the lifecycle of the BZÖ party, including its split from the FPÖ, Jörg Haider’s death, the Hypo Alpe Adria crisis, and the FPK split. In Poland, it recovers the PO–PiS cleavage, with PO–PiS relations reported as 95–98% negative and symmetric. It also surfaces dense overlap between governance and economic layers, especially around state-owned or state-linked enterprises.
The paper is careful that extracted edges are not verified facts. The stronger claim is aggregate: over large corpora, provenance-backed, schema-constrained extraction can recover known political structures.
For Jarvis, this is a design pattern. If you want durable memory or research automation, do not rely on free-form summaries. Use schemas, candidate IDs, provenance, confidence, signed relations, and separate error accounting. Entity resolution is not clerical work; it is where knowledge graphs go to die quietly.
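A minimal sketch of that pattern, assuming a tiny made-up ontology. The three relation types stand in for the paper's 99, and the field names, IDs, and `parse_relation` helper are hypothetical. Validation happens after generation here, which is weaker than the guided decoding the paper uses but shows the same idea: off-ontology output is rejected and counted, not stored.

```python
from dataclasses import dataclass
from enum import Enum


class RelationType(Enum):
    # Three made-up relation types standing in for a real bounded ontology.
    MEMBER_OF = "member_of"
    ALLIED_WITH = "allied_with"
    OPPOSES = "opposes"


@dataclass(frozen=True)
class Relation:
    source_id: str          # candidate entity ID after linking
    target_id: str
    rel_type: RelationType
    sign: int               # -1 negative, 0 neutral, +1 positive
    date: str               # date of the source article
    provenance: str         # where the edge was extracted from
    confidence: float


def parse_relation(raw: dict) -> Relation:
    """Validate one extracted record; raise ValueError rather than store junk."""
    rel_type = RelationType(raw["rel_type"])  # raises ValueError if off-ontology
    if raw["sign"] not in (-1, 0, 1):
        raise ValueError(f"bad sign: {raw['sign']!r}")
    if not raw.get("provenance"):
        raise ValueError("missing provenance")
    if not 0.0 <= raw["confidence"] <= 1.0:
        raise ValueError("confidence out of range")
    return Relation(raw["source_id"], raw["target_id"], rel_type,
                    raw["sign"], raw["date"], raw["provenance"], raw["confidence"])


good = {"source_id": "party-a", "target_id": "party-b", "rel_type": "opposes",
        "sign": -1, "date": "2020-01-01", "provenance": "article-123", "confidence": 0.8}

accepted, rejected = [], []
for record in (good, {**good, "rel_type": "vaguely_dislikes"}):
    try:
        accepted.append(parse_relation(record))
    except ValueError as err:
        rejected.append((record, str(err)))  # separate error accounting
```

The second record is well-formed JSON and reads plausibly, but its relation type is not in the enum, so it lands in `rejected` with a reason attached.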
7. Capability composition in image models: route, don’t average
DanceOPD: On-Policy Generative Field Distillation is an image-model paper, but its best conceptual lesson travels.
The paper works in flow-matching image models and tries to combine multiple capabilities — text-to-image generation, local editing, global editing, realism improvement, classifier-free guidance behavior — without averaging them into mush.
DanceOPD treats each capability as a velocity field over a shared generative state space. Each training sample is hard-routed to one expert field. The teacher is queried on a state produced by the current student rollout, and the student learns with a simple velocity MSE loss.
Three ideas matter:
- Route by capability rather than averaging all supervision.
- Train on states the student actually visits.
- Use one high-information query rather than dense correlated supervision along the whole trajectory.
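The first two ideas can be shown in a one-dimensional toy. Everything below is an illustrative assumption, not DanceOPD's implementation: the two teacher fields, the linear student, the Euler rollout, and the time convention are all invented, and a real flow-matching model works in a high-dimensional latent space.

```python
def teacher_edit(x, t):
    return -x            # hypothetical expert field A


def teacher_t2i(x, t):
    return 1.0 - x       # hypothetical expert field B


TEACHERS = {"edit": teacher_edit, "t2i": teacher_t2i}


def student(x, t, w):
    """Toy linear student velocity field with weights w = [slope, offset]."""
    return w[0] * x + w[1]


def rollout_state(x0, t, w, steps=4):
    """Euler-integrate the student's own field from time 0 to t, so the
    query state is one the student actually visits."""
    x, dt = x0, t / steps
    for k in range(steps):
        x += dt * student(x, k * dt, w)
    return x


def routed_loss(x0, t, capability, w):
    """Squared velocity error against exactly one teacher, chosen by capability."""
    x = rollout_state(x0, t, w)
    target = TEACHERS[capability](x, t)
    return (student(x, t, w) - target) ** 2


def averaged_target(x, t):
    """What blended supervision would ask for: the mean of all teacher fields."""
    return sum(f(x, t) for f in TEACHERS.values()) / len(TEACHERS)


w = [-1.0, 0.0]                                   # student currently equals teacher_edit
loss_edit = routed_loss(0.8, 0.5, "edit", w)      # 0.0: already matches its routed teacher
loss_t2i = routed_loss(0.8, 0.5, "t2i", w)        # about 1.0: wrong field for this capability
mush = averaged_target(0.5, 0.5)                  # 0.0, although the teachers say -0.5 and +0.5
```

At `x = 0.5` the two teachers disagree in sign and the average cancels to zero, which is the "mush" that hard routing avoids.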
The reported benchmark results are strong within the paper’s Z-Image setup. In T2I + general editing composition, DanceOPD reaches:
- 5.347 GEditBench average;
- 0.849 GenEval overall.
The notes compare this with reproduced OPD baselines and with the source models, giving improvements such as +8.1% GEditBench over the best reproduced OPD baseline and +2.0% GenEval over the T2I source. In local + global edit composition, DanceOPD reaches:
- 5.498 GEditBench average;
- 0.848 GenEval overall.
Again, these are benchmark-specific results. The method assumes compatible fields: same model family, latent space, scheduler conventions, and velocity parameterization. It is not plug-and-play across arbitrary models.
For Jarvis, the direct implementation relevance is limited unless we are training a flow-matching image model. The architectural lesson is broader: when combining skills, do not blindly average conflicting supervision. Route examples by task identity, preserve skill boundaries, and compose later with discipline.
This applies to agents too. Coding, browser automation, homelab operations, research summarization, and social writing are not the same skill wearing different hats. If you train or evaluate them as one blended slurry, do not be shocked when the agent becomes a universal mediocrity engine.
8. Autoregressive transformers as physics proposal engines
Autoregressive Boltzmann Generators is a molecular sampling paper with a neat twist: it makes Boltzmann generation look more like language-model decoding.
Traditional Boltzmann Generators often use normalizing flows because they need tractable likelihoods. This paper replaces the flow with a GPT-style autoregressive transformer that generates molecular conformations coordinate by coordinate:
\[ p_\theta(x)=\prod_j p_\theta(x_j \mid x_{<j}) \]
Continuous coordinates are discretized into bins, predicted categorically, then sampled uniformly inside the chosen bin. This gives an exact likelihood under the discretized density, enabling importance weighting.
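A small sketch of that sampling step, assuming equal-width bins on a fixed range. The `toy_conditional` function is a stand-in for the transformer and its probabilities are invented; the function names are hypothetical. The point is only that the density inside a bin is the bin probability divided by the bin width, so the log-likelihood of a full sample is an exact sum over coordinates.

```python
import math
import random


def sample_coordinate(bin_probs, lo, hi, rng):
    """Pick a bin categorically, then a uniform point inside it.
    Returns (value, log_density) under the piecewise-constant density."""
    n = len(bin_probs)
    width = (hi - lo) / n
    k = rng.choices(range(n), weights=bin_probs)[0]
    value = lo + (k + rng.random()) * width
    log_density = math.log(bin_probs[k]) - math.log(width)
    return value, log_density


def toy_conditional(prefix):
    """Stand-in for the model: bin probabilities given earlier coordinates."""
    if prefix and prefix[-1] > 0:
        return [0.1, 0.2, 0.7]
    return [0.7, 0.2, 0.1]


def sample_sequence(n_coords, lo, hi, rng):
    """Generate coordinates one at a time and accumulate log p(x) by the chain rule."""
    xs, total_logp = [], 0.0
    for _ in range(n_coords):
        x, logp = sample_coordinate(toy_conditional(xs), lo, hi, rng)
        xs.append(x)
        total_logp += logp
    return xs, total_logp


xs, logp = sample_sequence(3, -1.0, 1.0, random.Random(0))
```

Because `logp` is exact for the discretized density, it can be plugged straight into an importance weight against the target energy, which is what makes the construction usable as a Boltzmann generator.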
The conceptual appeal is that the model avoids the topological constraints of invertible flows. Molecular equilibrium distributions can have separated metastable basins; a diffeomorphic map from a simple Gaussian can be a bad structural fit.
The most convincing single-system result in the notes is Chignolin. ArBG reports Chignolin energy Wasserstein error E-W₂ = 1.723 ± 0.075, much lower than several flow or autoregressive baselines listed in the notes.
The transferable model, Robin, is a 132M-parameter ArBG trained across ManyPeptidesMD. On 30 unseen 8-residue peptides, Robin improves E-W₂ from Prose’s 10.038 to 4.251, a reduction of (10.038 − 4.251) / 10.038 ≈ 57.7% using the displayed numbers. The abstract apparently says “over 60%,” so the careful thing is to quote the table values.
It is not uniformly better. On 4-residue zero-shot peptides, Prose remains stronger:
- Prose E-W₂: 0.932;
- Robin E-W₂: 1.168.
The relevance to Jarvis is mostly conceptual. This is another example of autoregressive transformers escaping language: next-token machinery applied to molecular coordinates, with decoding-time interventions resembling constrained generation. The sequential SMC idea — reject bad partial structures before completing the whole sample — maps cleanly to agentic search: score prefixes, prune early, do not wait until the full disaster has bloomed.
9. Top-k sparse autoencoders benefit from soft pressure
Beyond the Hard Budget: Sparsity Regularizers for More Interpretable Top-k Sparse Autoencoders is a targeted mechanistic-interpretability paper.
Top-k sparse autoencoders keep exactly the top k latents. The paper argues that this hard selection is not enough. It adds soft sparsity penalties before the Top-k operation, shaping the activations that the Top-k selector sees.
The two regularizers are:
- an off-support L1 penalty, discouraging batch-active units from weakly firing on unrelated samples;
- an L1/L2-ratio penalty, encouraging activation mass to concentrate into fewer leading units.
A key implementation detail: both penalties apply only to units selected by Top-k at least once in the batch. Without this batch-active-unit mask, the penalties can kill latents that receive no reconstruction-gradient counterforce.
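A rough sketch of how the two penalties and the mask might fit together, written with plain lists for readability. The exact formulas are not in the notes, so this is one plausible reading: the function names, the averaging over the batch, and the choice to compute the L1/L2 ratio over masked units only are all assumptions.

```python
import math


def topk_indices(row, k):
    """Indices of the k largest pre-Top-k activations in one sample."""
    return set(sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k])


def sparsity_penalties(acts, k, eps=1e-8):
    """acts: pre-Top-k activations, one list per sample.
    Returns (off_support_l1, l1_l2_ratio), each averaged over the batch."""
    supports = [topk_indices(row, k) for row in acts]
    # Mask: only units selected by Top-k at least once in this batch are penalized.
    batch_active = set().union(*supports)
    off_l1, ratio = 0.0, 0.0
    for row, support in zip(acts, supports):
        # Off-support L1: batch-active units firing outside this sample's top-k.
        off_l1 += sum(abs(row[j]) for j in batch_active if j not in support)
        # L1/L2 ratio: smaller when activation mass sits in fewer units.
        l1 = sum(abs(row[j]) for j in batch_active)
        l2 = math.sqrt(sum(row[j] ** 2 for j in batch_active))
        ratio += l1 / (l2 + eps)
    return off_l1 / len(acts), ratio / len(acts)


acts = [
    [3.0, 0.5, 0.1, 0.2],   # unit 0 wins Top-1
    [0.4, 2.0, 0.1, 0.3],   # unit 1 wins Top-1
]
off_l1, ratio = sparsity_penalties(acts, k=1)
```

Units 2 and 3 are never selected in this batch, so neither penalty touches them; `off_l1` comes out to 0.45, from unit 1 leaking into the first sample and unit 0 into the second.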
The paper reports experiments on ImageNet-1K and Open Images V7 embeddings from CLIP ViT-L/14, SigLIP2, and supervised ViT-L/16. The supplied notes do not include the numeric table values, so the careful summary is qualitative:
- both regularizers generally improve monosemanticity without hurting reconstruction;
- off-support L1 gives larger monosemanticity gains;
- L1/L2-ratio improves concentration and robustness to inference-time k;
- L1/L2-ratio tends to increase dead neurons;
- one marginal monosemanticity regression is reported for SigLIP on ImageNet under the L1/L2-ratio regularizer.
For Jarvis, this matters if we use SAEs for feature dashboards, memory analysis, or representation inspection. Standard Top-k may overfit to a chosen k. If we want feature codes that are cleaner, more truncatable, or more robust to different sparsity budgets, pre-selection regularization looks like a simple baseline worth testing.
Not every interpretability paper needs to declare a new paradigm. Sometimes “put the penalty in the right place and don’t kill units that get no gradient to push back” is the paper. Respectable.
10. Distribution-aware sampling: simple centroids still punch above their weight
Understanding Domain-Aware Distribution Alignment in Budgeted Entity Matching is narrower than the others, but useful for data curation.
The paper analyzes BEACON’s distribution-aware sampling for low-resource entity matching. Entity matching is deciding whether two records refer to the same real-world object — for example, two product listings for the same laptop. The question is how to select useful out-of-domain examples under a limited budget.
The core method, TVDF, selects samples that make the training data’s embedding distribution better match a proxy validation distribution. The paper tests whether label-aware variants or richer distribution summaries improve over the original centroid-based unsupervised version.
The answer: not clearly.
On the WDC Multi-Dimensional Entity Matching Benchmark, in a 50% corner-case / 50% seen-entities setting with RoBERTa, unsupervised TVDF has the best average reported performance:
- mean macro F1: 0.716;
- mean weighted F1: 0.719.
Label-aware variants do not beat it overall. TVDF (OOD) reaches 0.700 mean macro F1 and 0.715 mean weighted F1. The authors argue that splitting limited data into positive and negative class-specific distributions can hurt smaller or underrepresented domains.
Alternative domain representations also fail to clearly dominate. Centroid TVDF reaches 0.716 mean macro F1, compared with 0.711 for TVCoverage. At a 5k budget, centroid TVDF scores 0.739 macro F1 versus 0.721 for TVCoverage.
In domain-agnostic downsampling, full-data training usually wins. The notable exception is Amazon-Google, where TVDF scores 0.727 F1 versus 0.697 for BASE. The safer conclusion is not “downsampling improves training,” but “if you must downsample, distribution-aware selection can reduce the damage.”
For Jarvis, the analogy is memory and example selection. Nearest-neighbor similarity is not the only criterion. Sometimes you want examples that make the working set better match a target distribution. But the paper is also a nice anti-overengineering result: simple centroid alignment may beat fancier sampling machinery.
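For a feel of what centroid alignment means in practice, here is a greedy selector that picks pool items so the selected set's mean embedding moves toward a target centroid. It is a simplified stand-in, not the TVDF or BEACON procedure; the function names and the two-dimensional vectors are invented for illustration.

```python
def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def greedy_centroid_select(pool, target, budget):
    """Pick up to `budget` pool indices, each time adding the item that leaves
    the selected set's centroid closest to `target`."""
    chosen, remaining = [], list(range(len(pool)))
    total = [0.0] * len(target)          # running sum of selected vectors
    for _ in range(min(budget, len(pool))):
        size = len(chosen) + 1

        def dist_if_added(i):
            candidate = [(s + x) / size for s, x in zip(total, pool[i])]
            return sq_dist(candidate, target)

        best = min(remaining, key=dist_if_added)
        chosen.append(best)
        remaining.remove(best)
        total = [s + x for s, x in zip(total, pool[best])]
    return chosen


pool = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0], [10.0, 10.0]]
target = centroid([[1.0, 1.0], [1.0, 1.0]])      # proxy validation centroid
picked = greedy_centroid_select(pool, target, budget=2)
selected_centroid = centroid([pool[i] for i in picked])
```

The selector returns indices `[0, 3]`: neither point is especially close to the target, but together their mean lands exactly on it. That is the difference from per-item nearest-neighbor retrieval.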
The shape of the week
The papers do not agree on domain, but they agree on a deeper point: intelligence is not just generation. It is feedback design.
PEEU says GUI agents need high-level hindsight-aligned episodes, not just atomic action traces. RiVER says executable environments can train models without ground-truth answers, but raw scores must be calibrated. ENS says error signals should be structured inputs, not merely scalar objectives. The probability paper says likelihood is not a verifier. MedBench v5 says final answers hide process failures. The political-network paper says constrained schemas and provenance beat free-form extraction. DanceOPD says route capabilities rather than averaging them into mush.
That is the non-hype version of agent progress. Not a single glorious model that reasons its way through everything, but systems that:
- collect the right experiences;
- rewrite them honestly;
- expose the right errors;
- verify with the world where possible;
- avoid trusting their own probabilities too much;
- keep structure around knowledge.
For Jarvis, that is less glamorous than “autonomous AI assistant” and more useful. The way to make an agent safer and sharper is not to let it talk longer. It is to make its memory, feedback, evaluation, and correction loops less stupid.
A low bar, yes. But civilization is mostly improved by clearing low bars repeatedly.
Reading list
- Empowering GUI Agents via Autonomous Experience Exploration and Hindsight Experience Utilization for Task Planning
- When are likely answers right? On Sequence Probability and Correctness in LLMs
- Reinforcement Learning without Ground-Truth Solutions can Improve LLMs
- Error-Conditioned Neural Solvers
- MedBench v5: A Dynamic, Process-Oriented, and Hallucination-Aware Benchmark for Clinical Multimodal Models
- Mapping Political-Elite Networks in Europe with a Multilingual Joint Entity-Relation Extraction Pipeline
- DanceOPD: On-Policy Generative Field Distillation
- Autoregressive Boltzmann Generators
- Beyond the Hard Budget: Sparsity Regularizers for More Interpretable Top-k Sparse Autoencoders
- Understanding Domain-Aware Distribution Alignment in Budgeted Entity Matching