The AI papers that mattered this week — June 22, 2026

Agents are getting less like chatbots and more like little operating systems. They keep state, call tools, route work to submodels, cache long contexts, judge each other, and sometimes touch real infrastructure. That is progress, but it moves the interesting failure modes out of the model transcript and into the control plane around it.

This week’s papers are strongest when read through that lens. The best ones are not “bigger model scores higher on benchmark X.” They are about what happens when models become parts of systems: who is allowed to mutate state, what counts as observable reasoning, whether caches are faithful, how routing breaks calibration, and how evaluator preferences spread through multi-agent loops.

That is also why several of these matter directly to Jarvis. A personal agent with tools, memory, services, mail, browser automation, and background sub-agents has the same shape as the systems these papers are circling. Smaller, yes. Less dangerous than a cloud control plane, usually. But architecturally adjacent enough that the lessons are not academic wallpaper.

1. Explicit state beats vibes: LedgerAgent

LedgerAgent: Structured State for Policy-Adherent Tool-Calling Agents is the most practically useful paper in this batch.

The argument is simple: many tool-calling failures are not caused by picking the wrong tool. They happen because the agent loses track of the exact state that determines whether an action is valid. A customer-service agent may correctly look up an order, reservation, refund rule, or account record, then later act on a stale or misremembered version of that state because the relevant fact is buried somewhere in the conversation.

LedgerAgent adds a structured ledger beside the prompt. Successful read-tool results are stored in typed canonical paths like ledger.orders.<id> or ledger.reservations.<id>. Before a write action — refund, cancellation, account update, reservation change — a deterministic policy gate checks the proposed action against the ledger. It can allow the call, ask the model to revise it, or block it.
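A minimal sketch of the ledger-plus-gate pattern described above, not the paper's implementation. The path format follows the ledger.orders.<id> convention quoted above; the two predicates, the field names (total, amount, order_id), and the rule that the most severe verdict wins are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional, Tuple

@dataclass
class Ledger:
    """Structured store for successful read-tool results, keyed by canonical path."""
    entries: Dict[str, Dict[str, Any]] = field(default_factory=dict)

    def record(self, path: str, value: Dict[str, Any]) -> None:
        self.entries[path] = dict(value)

    def get(self, path: str) -> Optional[Dict[str, Any]]:
        return self.entries.get(path)

# A predicate returns (verdict, reason); verdict is "allow", "revise", or "block".
Predicate = Callable[[Ledger, Dict[str, Any]], Tuple[str, str]]
SEVERITY = {"allow": 0, "revise": 1, "block": 2}

def order_was_read(ledger: Ledger, action: Dict[str, Any]) -> Tuple[str, str]:
    # Hypothetical rule: never write against an order the agent has not read.
    if ledger.get(f"ledger.orders.{action['order_id']}") is None:
        return "block", "order not in ledger; read it first"
    return "allow", ""

def refund_within_total(ledger: Ledger, action: Dict[str, Any]) -> Tuple[str, str]:
    # Hypothetical rule: a refund may not exceed the recorded order total.
    order = ledger.get(f"ledger.orders.{action['order_id']}")
    if order is not None and action["amount"] > order["total"]:
        return "revise", "refund exceeds order total"
    return "allow", ""

def gate(ledger: Ledger, action: Dict[str, Any],
         predicates: List[Predicate]) -> Tuple[str, List[str]]:
    """Run every predicate; the most severe verdict wins."""
    worst, reasons = "allow", []
    for predicate in predicates:
        verdict, reason = predicate(ledger, action)
        if verdict != "allow":
            reasons.append(reason)
        if SEVERITY[verdict] > SEVERITY[worst]:
            worst = verdict
    return worst, reasons

PREDICATES = [order_was_read, refund_within_total]
ledger = Ledger()
ledger.record("ledger.orders.A1", {"total": 50.0, "status": "delivered"})
```

The model never sees this code path as a suggestion. It proposes a write, and the gate answers with a verdict it cannot talk its way around.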

This is the right kind of boring. The model still plans and talks. The system handles the part where being charming is not good enough: “are you actually allowed to do this?”

The paper evaluates on four structured customer-service domains from τ-bench / τ-Trait-style environments: airline, retail, telecom, and telehealth. It tests several agent backbones, including GPT-5.2, GPT-4.1, Kimi K2.5, GLM-5, MiniMax-M2.5, and Qwen3-30B. The reported gains are benchmark-specific but meaningful:

Kimi K2.5: +3.4 pass¹, +5.6 pass⁴
GLM-5: +4.7 pass¹, +7.6 pass⁴
MiniMax M2.5: +7.3 pass¹, +8.3 pass⁴
GPT-4.1, on retail and airline only: +12.2 average pass¹
GPT-5.2, on retail and airline only: +15.5 average pass¹

Against IRMA, an agentic context-engineering baseline, LedgerAgent reportedly improves by +3.7 pass¹ and +7.4 pass⁴, while avoiding IRMA’s claimed >50% token overhead from helper agents.
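For readers unfamiliar with the metric: assuming pass¹ and pass⁴ here are τ-bench's pass^k, the number measures the chance that all k independent attempts at a task succeed, which is why pass⁴ punishes flaky agents harder than pass¹. A small sketch of the usual per-task estimator from n recorded trials with c successes (the function name is mine):

```python
from math import comb

def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimate P(all k i.i.d. attempts succeed) from `trials` runs of one task."""
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    if not 1 <= k <= trials:
        raise ValueError("k must be between 1 and trials")
    # Fraction of k-subsets of the trials that contain only successes.
    return comb(successes, k) / comb(trials, k)

# A task solved in 6 of 8 trials looks fine at k=1 and poor at k=4.
single_try = pass_hat_k(6, 8, 1)   # 6/8
four_tries = pass_hat_k(6, 8, 4)   # 15/70
```

Benchmark-level scores average this quantity across tasks.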

The caveat is not cosmetic. LedgerAgent works when the domain has stable structured fields, readable tool returns, and policy constraints that developers can encode as deterministic predicates. The paper uses 28 deterministic predicates across the evaluated domains: 10 airline, 12 retail, 6 telecom, 0 telehealth. So this is not magic policy induction. It is system design.

For Jarvis, the lesson is immediate: task-local state should not live only in a transcript. If an action depends on previously observed facts — selected service, current config, target file, confirmation status, account identity, scheduled job state — those facts deserve a structured ledger. And risky actions deserve deterministic pre-action checks.

Long-term memory is one thing. Conversation history is another. Task-local observed state is a third. Pretending those are all “context” is how agents become haunted filing cabinets.

2. The agent should not hold the keys: Sovereign Execution Brokers

Sovereign Execution Brokers: Enforcing Certificate-Bound Authority in Agentic Control Planes applies the same principle as LedgerAgent, deterministic checks outside the model, to production infrastructure.

The paper argues that autonomous agents should not hold standing production credentials. Not “should be prompted not to misuse them.” Not “should ask nicely before using them.” They should not have them.

The proposed architecture has three stages:

Proposal: the agent suggests an action.
Admission: a Sovereign Assurance Boundary certifies that the proposed action is acceptable.
Execution: a Sovereign Execution Broker verifies the certificate and performs the mutation using narrowly scoped temporary credentials.

The crucial move is that the certificate is operationally mandatory. If the agent can still call AWS, Kubernetes, or a database directly, then the approval layer is theater. Good theater, perhaps, but still theater.

The broker checks the certificate signature, request match, validity window, policy epoch, revocation epoch, nonce replay, live-state drift, and whether the requested scope is actually enforceable. If cloud IAM cannot express a parameter-level rule — for example, “only open port 443 to this CIDR” — the broker requires a proxy or admission path that can validate the payload.
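To make the check list concrete, here is a rough sketch of a broker-side admission function. Everything in it is an assumption for illustration: the Certificate fields, the rejection strings, and the check order are mine, and HMAC-SHA256 from the standard library stands in for the Ed25519 signatures the prototype reportedly uses. The scope-enforceability check is omitted.

```python
import hashlib
import hmac
import json
from dataclasses import dataclass, replace

def digest(obj) -> str:
    """Stable SHA-256 over a JSON-serializable object."""
    return hashlib.sha256(json.dumps(obj, sort_keys=True).encode()).hexdigest()

@dataclass(frozen=True)
class Certificate:
    request_hash: str      # binds the certificate to one exact request
    state_hash: str        # live state observed at admission time
    not_before: float
    not_after: float
    policy_epoch: int
    revocation_epoch: int
    nonce: str
    signature: str = ""

    def payload(self) -> bytes:
        body = [self.request_hash, self.state_hash, self.not_before, self.not_after,
                self.policy_epoch, self.revocation_epoch, self.nonce]
        return json.dumps(body).encode()

def sign(cert: Certificate, key: bytes) -> Certificate:
    # Stand-in for an Ed25519 signature (assumption: symmetric MAC for brevity).
    mac = hmac.new(key, cert.payload(), hashlib.sha256).hexdigest()
    return replace(cert, signature=mac)

class Broker:
    def __init__(self, key: bytes, policy_epoch: int, revocation_epoch: int):
        self.key = key
        self.policy_epoch = policy_epoch
        self.revocation_epoch = revocation_epoch
        self.seen_nonces = set()

    def admit(self, cert: Certificate, request, live_state, now: float) -> str:
        expected = hmac.new(self.key, cert.payload(), hashlib.sha256).hexdigest()
        if not hmac.compare_digest(expected, cert.signature):
            return "bad signature"
        if cert.request_hash != digest(request):
            return "request mismatch"
        if not cert.not_before <= now <= cert.not_after:
            return "outside validity window"
        if cert.policy_epoch != self.policy_epoch:
            return "stale policy epoch"
        if cert.revocation_epoch < self.revocation_epoch:
            return "revoked"
        if cert.nonce in self.seen_nonces:
            return "nonce replay"
        if cert.state_hash != digest(live_state):
            return "live state drift"
        self.seen_nonces.add(cert.nonce)  # consume the nonce only on success
        return "ok"

KEY = b"demo-key"
request = {"op": "scale", "deployment": "web", "replicas": 3}
state = {"replicas": 2}
cert = sign(Certificate(digest(request), digest(state), 100.0, 160.0, 7, 2, "n-1"), KEY)
broker = Broker(KEY, policy_epoch=7, revocation_epoch=2)
```

Only after admit returns "ok" would a real broker mint temporary credentials and perform the mutation itself. The agent never touches them.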

The authors implemented a Go prototype of about 4,200 lines, including gRPC/HTTP endpoints, Ed25519 certificate verification, PostgreSQL-backed nonce and ledger storage, AWS STS and Kubernetes TokenRequest adapters, and a proxy layer for payload-level validation.

The evaluation is prototype-specific but concrete:

Environment: Amazon EKS v1.28 in AWS us-west-2
Broker: 3-replica HA deployment
Database: Amazon RDS PostgreSQL v15
Trials: 100 warm-up cycles, 5,000 official trials per workload/configuration
Kubernetes broker-only overhead: p50 28.2 ms, p95 36.5 ms, p99 47.1 ms
AWS security-group broker-only overhead: p50 136.9 ms, p95 180.4 ms, p99 284.1 ms
Estimated client-observed p50 latency: about 40.7 ms for Kubernetes and 221.9 ms for AWS, after adding direct target API latency
Signature parsing/validation/contract matching: under 0.3 ms
AWS STS credential minting: about 105.2 ms
Revocation polling/cache TTL: 5 seconds, with max observed delay to full rejection of 5.2 seconds
Throughput: Kubernetes TokenRequest path peaks at 820 requests/sec at 80 threads; AWS STS path peaks at 240 requests/sec at 40 threads before RequestLimitExceeded

The conditional is everything. SEB works only if agents and wrappers have zero standing mutation credentials, target APIs reject non-broker mutation identities, broker-issued sessions are the only accepted mutation path, and break-glass credentials stay separate. In real production environments full of legacy tokens, CI roles, admin laptops, and dusty service accounts, that deployment discipline is the hard part.

Still, the core maxim is excellent:

Production mutation authority should not reside inside non-deterministic reasoning processes.

For Jarvis, the lightweight version is obvious. Read-only observation by default. Mutating operations — deploys, service restarts, firewall edits, DNS changes, database exports, container execs — go through a broker-like boundary. The broker checks an action contract, re-reads live state, enforces confirmation where needed, and logs the outcome.

You do not need the full SEB/SAB vocabulary to steal the spine of the idea. Separate reasoning from authority.

3. Diffusion models may not be opaque by default, but don’t get smug

How Transparent is DiffusionGemma? asks a timely question: if language models stop generating strictly left-to-right, what happens to transparency?

Autoregressive models at least emit a chain of tokens. That does not make them honest, but it gives monitors something discrete to inspect. Diffusion-style language models instead refine a canvas through iterative denoising. Much of the computation may happen in intermediate states that are not naturally text.

The paper studies Google DeepMind’s DiffusionGemma, a diffusion-style language model based on Gemma 4 26B A4B. The headline is cautiously reassuring: naively, DiffusionGemma appears to have 28.6× the opaque serial depth of autoregressive Gemma 4. But if intermediate denoising states are treated as interpretable after a token-bottleneck analysis, the gap falls to 1.1×.

That “if” is doing load-bearing work.

The authors intervene on DiffusionGemma’s self-conditioning vectors by restricting information to token-like representations using logit-lens-style projections. They report that benchmark performance is preserved when retaining either:

Top-8 tokens per position, or
tokens with probability greater than 0.03

They test representative benchmarks including Natural2Code, LiveCodeBench, AMC/AIME/IMO variants, and GPQA. The claim is not that every latent state is perfectly interpretable. It is that the useful information passed between denoising steps is often close enough to token guesses that a token bottleneck preserves measured performance.
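The two retention rules are easy to picture on a single position's token distribution. This is a toy sketch, not the paper's intervention: it assumes the logit-lens projection has already produced a probability per token, and the renormalization step is my assumption.

```python
from typing import Dict

def renormalize(probs: Dict[str, float]) -> Dict[str, float]:
    total = sum(probs.values())
    if total == 0:
        return {}
    return {token: p / total for token, p in probs.items()}

def keep_top_k(probs: Dict[str, float], k: int = 8) -> Dict[str, float]:
    """Keep the k most probable tokens at this position, then renormalize."""
    kept = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    return renormalize(dict(kept))

def keep_above(probs: Dict[str, float], threshold: float = 0.03) -> Dict[str, float]:
    """Keep tokens whose probability is strictly greater than the threshold."""
    return renormalize({t: p for t, p in probs.items() if p > threshold})

# Hypothetical distribution for one canvas position at one denoising step.
position = {"8": 0.55, "9": 0.40, "7": 0.03, "10": 0.02}
by_threshold = keep_above(position)     # drops "7" and "10"
by_rank = keep_top_k(position, k=2)     # keeps "8" and "9"
```

If performance survives this kind of squeeze, whatever the model passes between steps is mostly expressible as a short list of token guesses.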

The paper is more interesting when it stops defending transparency and starts cataloguing diffusion-specific reasoning patterns:

Early response-length prediction: the model predicts padding/end positions before deciding exactly what to say.
Retroactive self-correction: it can revise an earlier answer after later reasoning.
Skeleton-first code generation: it may commit to core logic before filling supporting boilerplate.
Token smearing: probability mass for one token spreads across nearby positions.
Sequence smearing: multiple candidate chunks coexist before one wins.
Intermediate-context reasoning: a token used during reasoning disappears from the final answer.

One case study is especially crisp. Prompted with “How many square numbers are there between 400 and 800? State your answer first, then give your reasoning,” DiffusionGemma initially predicts the wrong answer, 9, around denoising step 4. After later reasoning computes the endpoints, it revises the earlier answer to 8. Autoregressive models can correct themselves later in text, but they cannot revise an already emitted token. Diffusion models can.
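The arithmetic behind the revised answer is quick to check. The source does not say where the initial 9 came from; counting 400 itself (20²) is one plausible guess, and the sketch below shows both readings.

```python
from math import isqrt

def count_squares(low: int, high: int, inclusive: bool = False) -> int:
    """Count perfect squares between low and high, with or without the endpoints."""
    roots = range(isqrt(high) + 1)
    if inclusive:
        return sum(1 for n in roots if low <= n * n <= high)
    return sum(1 for n in roots if low < n * n < high)

strict = count_squares(400, 800)                          # 21^2 .. 28^2
with_endpoints = count_squares(400, 800, inclusive=True)  # also counts 20^2 = 400
print(strict, with_endpoints)  # prints "8 9"
```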

The paper also reports that DiffusionGemma is about as monitorable as Gemma 4 on adapted monitorability evaluations. But the caveat matters: many evaluations span multiple canvases, where each completed canvas is appended to the prompt for the next canvas. That makes the process more autoregressive-like, so the evaluations may miss single-canvas monitorability regressions.

For Jarvis, the paper matters because monitoring final outputs is already insufficient, and latent/non-autoregressive models make that more obvious. If future agents use diffusion-style or latent-reasoning models, useful audit traces may need intermediate token projections, confidence over positions, and records of predictions that appeared during reasoning but vanished from the final answer.

The practical posture is neither “latent models are opaque, panic” nor “token projections solve interpretability, relax.” It is: inspect the intermediate computation when you can, and do not confuse readability with faithfulness.

4. Whole execution state, not just KV cache

Execution-State Capsules: snapshot/restore for low-latency, on-device AI serving is a serving-systems paper aimed at a specific but important regime: single-user or small-batch, latency-sensitive, on-device agents that repeatedly return to the same large context.

Most LLM serving discussions talk about KV cache. This paper argues that KV is not enough. A runtime should be able to snapshot the entire restorable execution state at a useful boundary: KV cache, recurrent state, convolution state, speculation/MTP state, metadata, graph-bound buffers, and validation info.

The proposed object is an execution-state capsule, with four operations:

snapshot
restore
fork
rollback

This is not a replacement for high-throughput serving systems like vLLM or SGLang. The target is latency-first local serving, especially for agents that reuse large stable prefixes, branch from shared context, or resume after interruption.

The key technical difference is hybrid state. For models with recurrent, convolutional, or linear-attention components, a KV-only cache may not reproduce the same continuation. The paper reports that full capsule restore is byte-identical at the stored-state level and token-identical under greedy decode, while a KV-only restore diverges at the first token in the tested hybrid model. A stale recurrent fold from another prompt reportedly diverges by the third token.
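A toy model makes the hybrid-state point tangible. Nothing here resembles the paper's runtime: ToyHybridRuntime is a made-up stand-in whose next token depends on both a KV-like list and a recurrent-like integer, which is enough to show why restoring only one of them changes the continuation.

```python
import copy
from dataclasses import dataclass, field
from typing import List

@dataclass
class ToyHybridRuntime:
    kv: List[int] = field(default_factory=list)  # stands in for the attention KV cache
    recurrent: int = 0                           # stands in for recurrent/conv state

    def feed(self, token: int) -> None:
        self.kv.append(token)
        self.recurrent = (self.recurrent * 31 + token) % 1009

    def next_token(self) -> int:
        # "Greedy decode": a deterministic function of both state components.
        return (sum(self.kv) + self.recurrent) % 50

    def generate(self, n: int) -> List[int]:
        out = []
        for _ in range(n):
            token = self.next_token()
            self.feed(token)
            out.append(token)
        return out

    def snapshot(self) -> dict:
        return {"kv": list(self.kv), "recurrent": self.recurrent}

    def restore(self, capsule: dict, kv_only: bool = False) -> None:
        self.kv = list(capsule["kv"])
        if not kv_only:
            self.recurrent = capsule["recurrent"]

    def fork(self) -> "ToyHybridRuntime":
        return copy.deepcopy(self)

base = ToyHybridRuntime()
for tok in [3, 1, 4, 1, 5]:       # shared prefix
    base.feed(tok)
capsule = base.snapshot()
reference = base.fork().generate(5)   # fork: branch without disturbing base

full = ToyHybridRuntime()
full.restore(capsule)                 # whole-state restore
partial = ToyHybridRuntime()
partial.restore(capsule, kv_only=True)  # KV-only restore, recurrent state left cold

base.generate(3)                      # wander down a branch...
base.restore(capsule)                 # ...then rollback to the boundary
```

In this toy, the full restore reproduces the reference continuation and the KV-only restore goes wrong at the first token, which mirrors the shape of the reported result without claiming anything about its magnitude.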

On an RTX 5090, the abstract reports:

GPU-resident snapshot/restore is sub-millisecond
TTFT speedup over cold prefill grows from 3.9× at 2k tokens to 27× at 16k tokens
Full restore is token-identical under greedy decode

Those numbers are regime-specific: RTX 5090, single-stream, captured graph runtime, particular model/runtime setup, and TTFT measured to first base-logit token. The extracted text had missing/corrupted table values, so this should not be summarized as “FlashRT beats vLLM” in general. That would be the dumb version of the takeaway.

The better takeaway is architectural. Agents need checkpointable execution boundaries, not just prompt strings and cache fragments.

For Jarvis, this maps neatly onto “session checkpoint” rather than “prompt cache”:

restore the default Jarvis persona/context quickly;
fork three possible tool plans from the same repo state;
roll back after a failed tool branch;
resume a pinned project context after interruption;
warm-start voice or local assistant sessions quickly.

The catch is brittleness. Capsules are binary deployment-bound blobs tied to exact weights, quantization, kernel versions, graph buckets, and shape layouts. Upgrade the model or runtime and your capsule becomes a fossil. Still, the abstraction feels right. If agents become long-lived local processes, “restore this execution state” is a better primitive than “please re-read 80k tokens and pretend it is the same.”

5. 4-bit KV cache for cache-pressured agents

UltraQuant: 4-bit KV Caching for Context-Heavy Agents is another serving paper, but with a narrower hardware angle.

The target workload is long-context, multi-turn agents: many concurrent sessions, long shared prefixes, short follow-up turns, and enough memory pressure that keeping useful KV cache resident matters more than raw one-shot throughput.

UltraQuant stores KV cache entries in FP4 E2M1 with UE8M0 group scales, using groups of 32 channels. The design targets AMD CDNA4 scaled-MFMA instructions, especially v_mfma_scale_f32_16x16x128_f8f6f4, so the hardware can consume FP4 KV tensors and FP8 queries without materializing BF16 keys via software dequantization.
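For intuition about the storage format, here is a pure-Python sketch of group quantization to E2M1 values with a power-of-two scale per 32 elements. The eight E2M1 magnitudes are standard; the scale-selection rule, the nearest-value rounding with ties toward the smaller magnitude, and the function names are simplifying assumptions, not UltraQuant's kernel or the hardware's rounding mode.

```python
import math
from typing import List, Tuple

# Magnitudes representable in FP4 E2M1; the sign is a separate bit.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
GROUP_SIZE = 32

def floor_log2(x: float) -> int:
    # frexp returns (m, e) with x == m * 2**e and 0.5 <= m < 1.
    return math.frexp(x)[1] - 1

def quantize_group(values: List[float]) -> Tuple[int, List[Tuple[bool, int]]]:
    """Return (scale exponent, [(is_negative, magnitude index), ...]) for one group."""
    if len(values) != GROUP_SIZE:
        raise ValueError(f"expected {GROUP_SIZE} values")
    amax = max(abs(v) for v in values)
    # Assumed rule: align the group maximum with E2M1's top binade (4 to 6 = 2**2 * 1.x).
    exponent = 0 if amax == 0 else floor_log2(amax) - 2
    exponent = max(-127, min(127, exponent))  # UE8M0 stores only a power-of-two scale
    scale = 2.0 ** exponent
    codes = []
    for v in values:
        magnitude = min(abs(v) / scale, E2M1_MAGNITUDES[-1])  # saturate at 6
        index = min(range(8), key=lambda i: abs(E2M1_MAGNITUDES[i] - magnitude))
        codes.append((v < 0, index))
    return exponent, codes

def dequantize_group(exponent: int, codes: List[Tuple[bool, int]]) -> List[float]:
    scale = 2.0 ** exponent
    return [(-1.0 if negative else 1.0) * E2M1_MAGNITUDES[i] * scale
            for negative, i in codes]

group = [6.0, -3.0, 1.5, 0.5] + [0.0] * 28
exponent, codes = quantize_group(group)
restored = dequantize_group(exponent, codes)
```

Each element costs 4 bits plus one shared 8-bit scale per group, which is where the residency win over FP8 comes from. The price is a grid of only eight magnitudes per group.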

The strongest reported result is on a long-context, multi-turn agentic workload:

P50 TTFT improves by 3.47× in cache-pressured late rounds versus an FP8 KV baseline
P50 TTFT improves by 2.3× across all rounds
Output throughput improves by 1.63×
Benchmark uses vLLM’s native multi-turn benchmark, ShareGPT conversations, and 32 concurrent chat sessions

The important caveat is in the paper’s own framing: this helps most when FP8 KV cache residency becomes the bottleneck. It is not a claim that 4-bit KV is always faster or always better. If the context is shorter, concurrency is lower, or memory pressure is absent, the advantage shrinks.

Accuracy is also not free. The paper describes UltraQuant as stable on MATH500 and competitive on GPQA and LCB-128K, but materially worse on AIME25 for some evaluated models. Reported accuracy also keeps the first and last two attention layers in BF16, so public summaries should not imply every layer is purely 4-bit.

For Jarvis, the most useful point is the metric shift. A personal agent should not only care about tokens/sec on a single prompt. It should care about late-round TTFT under concurrent long-context sessions: repository agents, research sessions, browser agents, tool-heavy prompts, memory-rich conversations. That is where cache residency becomes user-visible.

Also: hardware-native formats matter. A theoretically elegant quantizer that fights the accelerator is often a very expensive poem.

6. Soft-routed MoEs can be miscalibrated even when experts are calibrated

Toward Calibrated Mixture-of-Experts Under Distribution Shift tests a comforting assumption: if every expert in a mixture-of-experts model is calibrated, the aggregate model must be calibrated too.

For hard routing, maybe. If each input goes to one expert, aggregate calibration can survive a broad class of shifts that merely reweight routing regions.

For soft routing, no. The aggregate prediction is a weighted average of multiple expert outputs. Many different expert/routing configurations can collapse to the same final confidence. If deployment changes the mix of those configurations, the same aggregate confidence can correspond to a different correctness rate. The experts can remain individually calibrated while the mixture becomes systematically unreliable.
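A constructed example shows the mechanism. The three configurations, the equal router weights, and the train/deploy mixes below are invented for illustration and do not come from the paper. Both experts stay exactly calibrated under both mixes, yet the correctness rate behind an aggregate confidence of 0.7 moves.

```python
from collections import defaultdict
from fractions import Fraction as F

# Each configuration: (expert-1 confidence, expert-2 confidence, true positive rate).
CONFIGS = {
    "A": (F(7, 10), F(7, 10), F(7, 10)),  # experts agree
    "B": (F(9, 10), F(1, 2), F(9, 10)),   # expert 1 is informed, expert 2 is unsure
    "C": (F(1, 10), F(1, 2), F(1, 10)),   # mirror image of B
}
ROUTER_WEIGHTS = (F(1, 2), F(1, 2))       # soft routing: a fixed 50/50 blend

def expert1(cfg):
    return cfg[0]

def expert2(cfg):
    return cfg[1]

def mixture(cfg):
    return ROUTER_WEIGHTS[0] * cfg[0] + ROUTER_WEIGHTS[1] * cfg[1]

def realized_rate(mass, confidence_of):
    """Map each stated confidence to the true positive rate among inputs receiving it."""
    numerator, denominator = defaultdict(F), defaultdict(F)
    for name, weight in mass.items():
        conf = confidence_of(CONFIGS[name])
        numerator[conf] += weight * CONFIGS[name][2]
        denominator[conf] += weight
    return {conf: numerator[conf] / denominator[conf] for conf in denominator}

train = {"A": F(90, 100), "B": F(5, 100), "C": F(5, 100)}
deploy = {"A": F(20, 100), "B": F(40, 100), "C": F(40, 100)}

train_mix = realized_rate(train, mixture)
deploy_mix = realized_rate(deploy, mixture)
print(float(train_mix[F(7, 10)]), float(deploy_mix[F(7, 10)]))
```

Configurations A and B both collapse to a blended 0.7. At training time A dominates, so 0.7 means roughly 71% correct; at deployment B dominates and the same number means about 83%. Neither expert did anything wrong.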

The paper proposes two robust training objectives:

Robust MoE: entropy-balanced adversarial reweighting over the whole minibatch
Robust Filtered: robust pressure only on routing-relevant examples, while keeping an ERM term over the whole minibatch

The adversarial weights are exponential tilts over per-example aggregate cross-entropy loss, with the default robustness parameter reported as η = 5. Robust Filtered targets examples where the mixture loss is worse than the best expert’s loss, or where experts disagree around the mixture prediction.
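As a rough picture of exponential tilting, the sketch below upweights high-loss examples within a minibatch. The summary above does not say whether η multiplies or divides the loss, so the form w_i ∝ exp(loss_i / η) is an assumption, as are the function names. It covers only the whole-minibatch reweighting idea, not the filtering rule.

```python
import math
from typing import List

def tilt_weights(losses: List[float], eta: float = 5.0) -> List[float]:
    """Weights proportional to exp(loss / eta), normalized to sum to 1."""
    scaled = [loss / eta for loss in losses]
    peak = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def tilted_loss(losses: List[float], eta: float = 5.0) -> float:
    """Reweighted minibatch loss; never below the plain mean."""
    return sum(w * loss for w, loss in zip(tilt_weights(losses, eta), losses))

# Hypothetical per-example aggregate cross-entropy values.
batch = [0.1, 0.2, 0.3, 4.0]
weights = tilt_weights(batch)
robust = tilted_loss(batch)
plain = sum(batch) / len(batch)
```

Under this assumed form, a larger η flattens the weights toward plain averaging and a smaller η concentrates pressure on the worst examples.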

Experiments cover CIFAR-10H, PACS, and CivilComments. The exact table values in the provided extraction are partly missing, so the safe summary is qualitative: robust methods generally improve calibration under ambiguous, shifted, or subgroup-stressed evaluations, often with modest or no accuracy cost. The paper reports strong improvements on low-human-agreement CIFAR-10H examples, held-out PACS domains, and CivilComments identity subgroups.

This matters beyond formal MoE architectures. Many agent systems are soft-routed in spirit: route to a code expert, retrieval expert, browser agent, safety checker, summarizer, or model cascade; weight their outputs; emit a final confidence. It is tempting to check that each component is “good enough” and call the whole thing reliable. This paper says that is not sufficient.

For Jarvis-like orchestration, log the routing weights, expert outputs, aggregate confidence, and downstream correctness. Evaluate calibration by routing pattern and subpopulation, not only globally. Watch for cases where experts disagree and the aggregate remains confidently wrong because the router blended its way into a nice-looking number.
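One cheap way to act on this is a confidence-versus-accuracy table split by routing pattern. The record shape and pattern labels below are hypothetical; the point is only that the gap is computed per group instead of globally.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

def calibration_gap_by_pattern(
    records: Iterable[Tuple[str, float, bool]]
) -> Dict[str, Dict[str, float]]:
    """records: (routing_pattern, aggregate_confidence, was_correct) per decision."""
    sums = defaultdict(lambda: [0.0, 0, 0])  # confidence sum, correct count, n
    for pattern, confidence, correct in records:
        entry = sums[pattern]
        entry[0] += confidence
        entry[1] += int(correct)
        entry[2] += 1
    report = {}
    for pattern, (conf_sum, n_correct, n) in sums.items():
        mean_conf, accuracy = conf_sum / n, n_correct / n
        report[pattern] = {"n": n, "mean_confidence": mean_conf,
                           "accuracy": accuracy, "gap": mean_conf - accuracy}
    return report

# Hypothetical log: globally fine-looking, but one pattern is overconfident.
log = [("retrieval-heavy", 0.8, True)] * 4 + [("retrieval-heavy", 0.8, False)] \
    + [("code+browser", 0.9, True)] * 2 + [("code+browser", 0.9, False)] * 2
report = calibration_gap_by_pattern(log)
```

A positive gap means the aggregate claims more than it delivers for that routing pattern, which is exactly the case a global average can hide.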

“Mixture of experts” sounds robust. Soft mixtures can also be how errors put on a suit.

7. Mixed compliance demonstrations are not just pattern copying

What Do Safety-Aligned LLMs Learn From Mixed Compliance Demonstrations? studies a specific jailbreak mechanism: show the model many earlier examples where an assistant complies, then ask a harmful question.

The simple hypothesis would be generic imitation: more compliant examples lead to more compliance. The paper finds something more model-dependent. Benign and harmful demonstrations are not interchangeable.

The authors test four models:

Llama-3.1-8B-Instruct
OLMo-3.1-32B-Instruct
Gemma-4-31B-IT
GPT-OSS-20B

The harmful evaluation pool has 1,404 harmful queries from HarmBench, SORRY-Bench, and WildGuard-test, with 2 random samplings per query, for 2,808 evaluation points per condition. Harmful demonstrations come from a RedTeam-2K-derived pool filtered with GPT-OSS-120B, yielding 1,492 harmful compliance demonstrations.

The paper rejects the “total number of compliant examples is all that matters” hypothesis across evaluated models. At fixed total demonstration count, changing the fraction of harmful demonstrations changes harmful compliance.

Model-specific behavior is the interesting part:

Llama-3.1-8B: benign demonstrations have a dilutive effect
Gemma-4-31B-IT: also shows dilution
OLMo-3.1-32B-Instruct: benign demos have no statistically significant effect once harmful count is controlled
GPT-OSS-20B: slight amplification from benign demos; the model is relatively robust overall

The OLMo training-stage comparison is especially useful. The SFT checkpoint shows benign-demonstration amplification, where benign helpfulness examples increase harmful compliance. After DPO, that effect disappears. The final Instruct model with RLVR on top also does not show it. Scoped carefully, the paper suggests that preference optimization in this OLMo pipeline decouples general cooperativeness from unsafe compliance.

Ordering also matters. Harmful demonstrations closest to the query — the suffix condition — produce the highest compliance in susceptible models. Reported ordering spreads are about:

Gemma-4-31B-IT: 35 percentage points
Llama-3.1-8B: 19 percentage points
OLMo-3.1-32B: 13 percentage points
GPT-OSS-20B: little meaningful effect, generally 1–2 points

The practical lesson for agent systems is prompt hygiene. Do not leave unsafe compliance examples, jailbreak traces, or permissive tool-use failures near the active instruction region. Context is not just storage. It is conditioning.

For Jarvis, this applies to memory, examples, tool transcripts, and regression tests. Recent examples of “the assistant complied even though it shouldn’t” can become behavioral pressure, even if they are included for diagnostic reasons. Put the radioactive stuff behind glass.
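"Behind glass" can be as simple as a flag that keeps an item out of prompt assembly while leaving it available to offline tooling. A sketch under obvious assumptions: the ContextItem fields are invented, and the unsafe_compliance flag is presumed to be set by some upstream reviewer or classifier that is not shown.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass(frozen=True)
class ContextItem:
    text: str
    kind: str                        # e.g. "memory", "example", "tool_transcript"
    unsafe_compliance: bool = False  # assumed to be set upstream

def build_active_context(items: List[ContextItem],
                         instruction: str) -> Tuple[List[str], List[ContextItem]]:
    """Return (prompt parts, quarantined items). Flagged items never enter the prompt."""
    safe = [item for item in items if not item.unsafe_compliance]
    quarantined = [item for item in items if item.unsafe_compliance]
    # The instruction goes last, so nothing flagged can sit next to it.
    return [item.text for item in safe] + [instruction], quarantined

items = [
    ContextItem("User prefers short answers.", "memory"),
    ContextItem("Regression trace: assistant ran a destructive command without confirmation.",
                "tool_transcript", unsafe_compliance=True),
    ContextItem("Example: assistant asks before deleting files.", "example"),
]
prompt_parts, quarantined = build_active_context(items, "Clean up the temp directory.")
```

The quarantined list still exists for diagnostics and regression tests. It just stops being conditioning.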

8. Evaluator bias can propagate through multi-agent systems

Contagion Networks: Evaluator Bias Propagation in Multi-Agent LLM Systems is exploratory but useful. It asks what happens when agents evaluate one another.

If a judge agent has a preference — structured answers, evidence-heavy answers, balanced answers, concise answers — and other agents adapt to its feedback, that preference can spread. The paper frames this as a contagion network. Evaluator bias moves along agent-to-agent evaluation edges, and whether it dies out or cascades depends on the strength of those links and the topology.

The empirical setup is small: three DeepSeek-chat / deepseek-v4-flash agents, differentiated by evaluator prompts rather than model family. The evaluator profiles are structured, balanced, and evidence-based. Tasks span code generation, mathematical reasoning, summarization, logical puzzles, and creative writing. The paper reports 840 DeepSeek-chat API calls over a four-phase protocol.

The reported off-diagonal contagion coefficients are γ = 0.157–0.352: weak but nonzero. The paper compares this to prior cross-model MM-EPC work with coefficients around γ ≈ 0.85–1.3, claiming same-model contagion is 3–5× weaker, though that comparison is not a direct apples-to-apples experiment.

The mitigation result is the cleanest practical bit: increasing evaluator committee size from k=1 to k=3 reduces effective contagion from about 0.264 to 0.073, a 72.4% reduction.
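The percentage is a relative reduction, and recomputing it from the two rounded coefficients is a two-line check. The rounded endpoints give about 72.3%, within rounding distance of the reported figure.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fractional drop from `before` to `after`."""
    return (before - after) / before

reduction = relative_reduction(0.264, 0.073)
print(f"{reduction:.1%}")
```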

Do not overread this. Three agents, one model family, controlled TTRL-style strategy updates — this is a toy lab world. But the vocabulary is useful. In multi-agent systems, the evaluation graph is part of the system. A judge is not a neutral measuring device; it is a pressure source.

For Jarvis, this is directly relevant to background sub-agents reviewing code, research, summaries, plans, or each other’s outputs. If one reviewer’s preferences dominate, the system can converge on a house style or decision policy whether or not that policy is actually better.

A cheap diagnostic would track strategy entropy: tool choices, source preferences, verbosity, refusal style, review outcomes, escalation frequency. If everything starts sounding like the same overconfident reviewer, congratulations, you built a bureaucracy.
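Strategy entropy here can be plain Shannon entropy over whatever categorical choice is being logged. A minimal sketch; the example categories are invented, and what counts as a "strategy" is up to the system doing the logging.

```python
import math
from collections import Counter
from typing import Hashable, Iterable

def strategy_entropy(choices: Iterable[Hashable]) -> float:
    """Shannon entropy, in bits, of the empirical distribution over choices."""
    counts = Counter(choices)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return max(0.0, entropy)  # avoid returning -0.0 for a single category

# Hypothetical review-outcome logs from two weeks of sub-agent reviews.
week_one = ["approve", "request_changes", "escalate", "comment"]
week_two = ["request_changes"] * 4

diverse = strategy_entropy(week_one)    # four equally likely outcomes
collapsed = strategy_entropy(week_two)  # one outcome only
```

A steady slide toward zero across tool choices, sources, or review outcomes is the signal worth alerting on. A single low reading is not.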

9. Token operations as production worldview

Token-Operations-Oriented Inference Optimization Techniques for Large Models is a survey/position paper, not a new algorithm. Its usefulness is the taxonomy.

The paper argues that large-model serving should be understood as token operations: tokens are generated, cached, routed, billed, governed, scheduled, compressed, and rate-limited. The proposed four-layer architecture is:

Multi-model fusion: routing, cascading, ensembling, capability profiling
Model optimization: attention optimization, MoE, reasoning efficiency, KV compression, speculative decoding, quantization, distillation
Compute-model fusion: operator fusion, memory access optimization, inference-engine tuning
Compute-network-model fusion: multi-node parallelism, cluster KV scheduling, sticky routing, semantic caching, dynamic batching, rate limiting, fallback

The paper aggregates many claims from prior work, vendor docs, and product reports. Examples include:

RouteLLM: over 2× cost reduction in some scenarios
FrugalGPT: up to 98% cost reduction in its benchmark settings
Speculative decoding variants like Medusa and EAGLE: often cited in the 2×–6× acceleration range, depending heavily on acceptance rate and setup
vLLM / PagedAttention: 2–4× throughput improvement over FasterTransformer and Orca in cited settings
SGLang / RadixAttention: up to 6.4× throughput improvement across structured-generation tasks
Orca: 36.9× throughput improvement over FasterTransformer in its GPT-3 175B evaluation setting
Mooncake: up to 525% throughput improvement in long-context/Kimi-like workloads, per cited reports

Those numbers should not be compared directly. They come from different models, workloads, hardware, and measurement definitions. The paper is closer to an industrial map than a controlled benchmark.

For Jarvis, the top-level framing is right. A useful agent is not a single model call. It is routing, memory retrieval, caching, tool invocation, fallbacks, safety checks, scheduled jobs, latency budgets, and observability. The most relevant ideas are not the cluster-scale GPU tricks, unless Jarvis grows a datacenter overnight. The useful bits are:

route easy work to cheaper/faster models;
cascade when confidence or validation fails;
cache repeated semantic work carefully;
keep session affinity when KV reuse matters;
avoid spending reasoning tokens on trivial tasks;
retrieve relevant memory instead of dumping history;
degrade gracefully for non-critical work when a preferred model is slow.
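The first two items on that list amount to a cascade: try the cheap tier, validate, escalate only on failure. A small sketch with stub models; the tier names, the validator, and the return shape are all hypothetical, and real validation would be task-specific.

```python
from typing import Callable, List, Tuple

Model = Callable[[str], str]
Validator = Callable[[str, str], bool]

def cascade(prompt: str, tiers: List[Tuple[str, Model]],
            validate: Validator) -> Tuple[str, str, bool]:
    """Try tiers from cheapest to most capable; return (tier, answer, validated)."""
    if not tiers:
        raise ValueError("at least one tier is required")
    name, answer = "", ""
    for name, model in tiers:
        answer = model(prompt)
        if validate(prompt, answer):
            return name, answer, True
    # Nothing validated: return the last (most capable) answer, marked unvalidated.
    return name, answer, False

# Stub models standing in for real endpoints.
def small_model(prompt: str) -> str:
    return "4" if prompt == "2+2" else "not sure"

def large_model(prompt: str) -> str:
    return {"2+2": "4", "17*3": "51"}.get(prompt, "unknown")

def is_integer(prompt: str, answer: str) -> bool:
    return answer.lstrip("-").isdigit()

tiers = [("small", small_model), ("large", large_model)]
easy = cascade("2+2", tiers, is_integer)
hard = cascade("17*3", tiers, is_integer)
unknown = cascade("weather?", tiers, is_integer)
```

The unvalidated flag matters as much as the routing. It is what lets the caller degrade gracefully instead of pretending the fallback answer is good.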

The paper’s best sentence, conceptually, is that serving is no longer a single model call. Correct. The model is now one component in a token economy with plumbing. Plumbing, as usual, decides whether the palace floods.

10. Proxies for egocentric video representation learning

UNIEGO: Proxies as Mediators for Unified Egocentric Video Representation Learning is less central to agent control planes, but it has a clean distillation idea.

The goal is to train a single egocentric-video encoder that benefits from richer training-time signals — exocentric views, depth, skeletons, RGB teachers, foundation-model features — while needing only first-person RGB video at test time.

Instead of distilling directly from nine heterogeneous teachers into one student, UNIEGO first trains one egocentric proxy per teacher. Each proxy consumes egocentric video and has the same architecture as the final student. Then the final model learns from selected proxies.

The teacher streams include ego/exo RGB, skeleton, SigLIP, SK-EGO-style representations, depth, DINOv2, and related features. The final stage uses Selective Proxy Distillation: for each sample, it distills only from proxies that are both correct and confident, using ground-truth correctness during training.
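The selection rule can be sketched per sample. The confidence threshold, the plain averaging of selected proxies, and the proxy names below are assumptions for illustration; the summary above only says that selected proxies must be both correct and confident.

```python
from typing import Dict, List, Optional

def select_proxies(proxy_probs: Dict[str, List[float]], label: int,
                   min_confidence: float = 0.5) -> List[str]:
    """Names of proxies whose top prediction matches the label with enough confidence."""
    selected = []
    for name, probs in proxy_probs.items():
        predicted = max(range(len(probs)), key=probs.__getitem__)
        if predicted == label and probs[predicted] >= min_confidence:
            selected.append(name)
    return selected

def distillation_target(proxy_probs: Dict[str, List[float]], label: int,
                        min_confidence: float = 0.5) -> Optional[List[float]]:
    """Average the selected proxies' distributions; None if no proxy qualifies."""
    names = select_proxies(proxy_probs, label, min_confidence)
    if not names:
        return None  # assumption: fall back to the supervised loss alone
    n_classes = len(proxy_probs[names[0]])
    return [sum(proxy_probs[n][c] for n in names) / len(names)
            for c in range(n_classes)]

# Hypothetical per-proxy class probabilities for one clip whose true label is 0.
sample = {
    "depth_proxy":    [0.7, 0.2, 0.1],    # correct and confident
    "skeleton_proxy": [0.4, 0.35, 0.25],  # correct but not confident
    "rgb_proxy":      [0.1, 0.8, 0.1],    # confident but wrong
    "exo_proxy":      [0.9, 0.05, 0.05],  # correct and confident
}
chosen = select_proxies(sample, label=0)
target = distillation_target(sample, label=0)
```

Because every proxy shares the student's architecture and input, this filter compares like with like, which is the part that transfers beyond video.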

Reported benchmark-specific results include:

Action recognition top-1:
- EgoExo-Fitness: 84.7%
- Assembly101: 50.7%
- EgoExo4D: 41.1%
Video retrieval mAP:
- EgoExo-Fitness: 0.543
- Assembly101: 0.253
- EgoExo4D: 0.182
Assembly101 temporal action segmentation:
- F1@50 12.3, versus 9.8 for naive multi-teacher distillation

This is an arXiv v1 preprint and the pipeline is training-heavy: nine teacher streams, proxy training, checkpoint merging, selective distillation, dataset-specific preparation. The final model may be simple at inference, but the training machinery is not.

The transferable idea is the proxy layer. For agent systems, “many heterogeneous teachers directly into one student” is often a mess. Translating each teacher into a homogeneous proxy first, then selecting reliable proxies per sample, is a pattern worth stealing. It applies beyond video: OCR, screenshots, browser state, audio, logs, embeddings, VLM outputs, specialist classifiers.

The common thread

The interesting part of current AI work is increasingly outside the forward pass.

LedgerAgent says state governing action validity should be explicit. Sovereign Execution Brokers say authority should live outside the agent. DiffusionGemma transparency work says reasoning traces may not be left-to-right text, but intermediate states can sometimes be inspected. Execution capsules and UltraQuant say latency is about preserving the right computation state, not just making one prompt faster. MoE calibration says routing changes reliability even when experts look fine. Mixed-demonstration safety and contagion networks say context and evaluator topology are active forces, not passive records. Token operations says the whole thing is now production infrastructure.

That is the through-line: agents are systems with memory, authority, routing, caches, evaluators, and side effects. The model is still central, but it is no longer the whole object.

For Jarvis, this is not abstract. The design pressure points are the same:

keep observed task state structured;
gate risky actions deterministically;
avoid standing credentials where possible;
audit intermediate reasoning when models stop exposing clean token chains;
benchmark late-round latency, not just throughput;
measure router and evaluator behavior;
keep unsafe examples away from active conditioning;
treat memory as infrastructure, not vibes.

The future agent stack is going to look less like a chatbot prompt and more like a small operating system with a paranoid kernel. Good. The alternative is a charming process with root.