Interactive agent systems are usually decent at supervised interruption. If you're sitting there and hit Ctrl-C in something like Claude Code, you often lose at most the current turn and then continue. The worse failure mode shows up in autonomous runs: unattended jobs, scheduled tasks, deep research sessions, background agents, long tool-using workflows where there is no operator babysitting the process.
There, the runtime can spend ten or twenty minutes gathering evidence, exploring tools, and nearly converging — then time out and leave behind little more than traces. The logs show a flurry of work. The unattended run still hands you something close to a corpse. That's not a strong autonomy model. That's a theatrical collapse.
The feature I want in term-llm is progressive execution: a mode where an agent produces a usable answer early, keeps improving it as more time is available, emits structured progress pulses while it works, and on timeout returns the best answer so far rather than a shrug.
This is not mainly a fix for interactive use. It is a stronger contract for autonomy.
The core idea
A progressive run is an anytime algorithm for an LLM agent. In classical AI, an anytime algorithm is one that can be interrupted at any point and still return a valid result, with answer quality improving as more time is available. That old idea is the right ancestor here. But the interesting part is adapting it to a modern agent runtime rather than a tidy search paper.
For an agent, that means:
- produce a rough but useful answer quickly
- continue gathering evidence, verifying claims, and refining the answer
- maintain a durable best-so-far structured state
- materialize that state through an explicit progress-update tool
- force a final update when the run is cancelled or times out
- reuse the existing session store and resume model rather than inventing a parallel persistence system
Short version: progressive execution is not just "timeout with partial output." It is a runtime contract: be useful early, become more useful with time, and never make interruption equivalent to amnesia.
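As a minimal sketch of that contract (all names here are illustrative, not term-llm internals), the anytime loop looks like:

```python
import time

def run_progressive(task, improve, budget_seconds, now=time.monotonic):
    """Anytime-loop sketch: always hold a valid best-so-far answer,
    refine it while budget remains, and return it on interruption.
    `task` seeds a rough initial answer; `improve` refines it.
    Illustrative placeholders, not a term-llm API."""
    deadline = now() + budget_seconds
    best = task()                     # rough but useful answer, produced quickly
    try:
        while now() < deadline:
            candidate = improve(best)
            if candidate is None:     # model says the task is exhausted
                break
            best = candidate          # durable best-so-far state
    except KeyboardInterrupt:
        pass                          # interruption degrades quality, not state
    return best
```

The point of the shape is that `best` is valid at every moment of the run, so cancellation at any line still returns something useful.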
Why this is an interesting feature
Most current agent systems optimize for one clean ending. That works fine for short deterministic tasks and for interactive workflows where a human can recover from a bad interruption. The more interesting gap is in autonomous execution: research jobs, background runs, scheduled maintenance, long-horizon tool use, and any workflow where the operator is not present turn by turn.
The big reason progressive execution is interesting is budget. It lets you give an autonomous task a time budget and get a much stronger guarantee that the budget will produce something useful. Instead of "run this for up to twenty minutes and maybe I get a clean final answer," the promise becomes "run this for twenty minutes and I will at least get the best useful state the agent managed to save along the way."
A progressive mode changes the autonomy promise from:
"Maybe you'll get a final answer if the run finishes cleanly."
to:
"Give me a budget, and I'll turn it into the best useful artifact I can produce within that budget."
That is a much stronger and more honest product story for unattended runs.
The deeper reason this is interesting is that it forces the runtime to care about quality over time, not just final output. A lot of autonomous agent work is really bounded-compute reasoning in disguise. Once you remove the human supervisor as the recovery mechanism, the runtime itself needs a stronger partial-progress contract. You do not actually want a system that only knows how to look clever at the end. You want one that can turn a time budget into a useful result even when the run does not finish neatly.
Where the idea comes from
This proposal is not invented from nothing. It sits at the intersection of several threads of prior work that each get part of the picture right.
1. Anytime algorithms
The foundational idea is old AI work on anytime algorithms: valid result now, better result with more time, best-so-far returned if interrupted. That's the cleanest conceptual frame for progressive execution. The difference is that an agent runtime needs to make the internal state legible to humans and automation, not just maintain an implicit search score.
2. Self-Refine
Self-Refine is the obvious cognitive loop to borrow from: draft, critique, revise, repeat. It captures the right intuition that first answers are often good starts rather than good endings. What it does not provide is a runtime contract for checkpoints, interruption, or structured best-so-far state.
3. Reflexion
Reflexion matters because it treats verbal memory and feedback as part of the learning loop. It is more about improvement across attempts than within a single progressive run, but it points toward a useful future extension: resume a partially completed progressive run with memory of what was already learned.
4. LangGraph's durability model
LangGraph's architecture writing is important not because it already solves this problem, but because it gets the substrate right: checkpointing, interruption, resume, structured execution, streaming. That's the engineering base a progressive runtime needs. What it does not define for you is the semantic contract of progress: what counts as a useful partial result, how often to materialize it, and what should happen on timeout.
5. Budget-aware reasoning papers
The recent paper Aligning Tree-Search Policies with Fixed Token Budgets in Test-Time Scaling of LLMs (BG-MCTS) is especially relevant. Its core move is exactly right: explore broadly early, then shift toward refinement and answer completion as the budget depletes. That should inform progressive runs directly.
Likewise, DenoiseFlow is useful because it argues for allocating more compute where uncertainty remains high rather than spending the remaining budget uniformly. That matters a lot after the agent already has a decent answer. The goal is not endless polishing. The goal is to spend additional time where it can still buy real improvement.
6. Long-horizon search agents
Recent work like REDSearcher and benchmarks like EvoCodeBench matter because they treat inference-time improvement and long-horizon agent behavior as first-class concerns. They still stop short of specifying an interruptible best-so-far runtime contract, but they make clear that the field is moving in this direction.
The gap: the papers cover anytime behavior, iterative refinement, memory, durability, and budget-aware search in pieces. What still feels missing is a practical runtime that combines them into one coherent operator-facing feature.
What progressive execution should mean in term-llm
Grounding this in the current code matters. term-llm ask already has --max-turns and --resume, with resume wired to the existing session store in sessions.db. Jobs already have a real wall-clock timeout model via timeout_seconds and report timeout as an exit reason in the jobs runner. What is missing is not the idea of persistence or timeout in the abstract. What is missing is an ask-level progressive contract that preserves and surfaces best-so-far state under that timeout model.
At the CLI and API level, progressive execution should be explicit and boringly clear.
```
term-llm ask --agent researcher --progressive --timeout 20m "Investigate X"
term-llm ask --agent jarvis --progressive --stop-when timeout --continue-with "Verify the riskiest claims, explore alternatives, and strengthen the current best answer." "Debug Y"
term-llm ask --agent jarvis --progressive-schema ./spec.json "Plan Z"
```
The semantics should be:
- `--progressive` enables progressive execution mode
- the runtime adds a progressive-specific state update tool to the available tool list
- the runtime requires a durable progressive state throughout the run
- the model is prompted to answer early and refine continuously
- during the run, calling the update tool is optional and left to the model
- timeout or cancellation triggers a final forced update pass
- if finalization fails, the latest durable progress state is returned
That last point should not be optional. I cannot think of a good reason to disable timeout finalization; if you want the run to stop earlier, shorten the timeout. The whole point of progressive execution is to make deadline handling stronger rather than weaker.
This is the critical design choice: a progressive run should never be all-or-nothing.
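The never-all-or-nothing fallback can be sketched in a few lines; `run_final_update` and `load_latest_progress` are hypothetical placeholders for the runtime's forced-update pass and its durable store:

```python
def finalize_on_timeout(run_final_update, load_latest_progress):
    """Timeout contract sketch: attempt one forced final update pass;
    if it fails for any reason, fall back to the latest durable state.
    Placeholder names, not term-llm code."""
    try:
        return run_final_update()      # forced final update_progress pass
    except Exception:
        return load_latest_progress()  # never return empty-handed
```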
Start loose: the first version should not overcommit on schema
The state shape is where it is easy to get too clever too early. For a first version, I would keep it loose. The important guarantee is that update_progress gets called and stores a durable JSON object, not that the runtime has already solved the universal schema problem.
That means the first contract can be simple:
- the payload must be a JSON object
- the runtime stores it durably
- the latest saved object can be surfaced on timeout
- the latest saved object can be fed back into an ordinary `--resume` flow
This is still highly useful. You get timeout salvage, resumability, and observability without prematurely freezing a shape that real usage may prove wrong.
A future version may converge on a conventional core if certain fields turn out to be universally valuable. But that should be learned from use, not guessed up front.
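The entire v1 payload contract is small enough to sketch directly; this is a hypothetical check, not term-llm code:

```python
import json

def validate_progress_payload(raw):
    """First-version contract: the payload must parse as a JSON object.
    Nothing else is enforced; shape conventions come later, from usage."""
    state = json.loads(raw)
    if not isinstance(state, dict):
        raise ValueError("progress payload must be a JSON object")
    return state
```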
For brainstorming purposes, the operator-facing hook can stay:
```
term-llm ask --agent jarvis --progressive-schema ./spec.json "Plan Z"
```

But even that may be more rigid than necessary in v1. The simplest version is: loose JSON now, conventions later.
Pulses should be a first-class primitive
A progressive run needs a way to explicitly materialize state updates while the agent is still working. That should not be hidden magic. It should be an actual primitive in the runtime. In practice, the cleanest shape is: when progressive mode is enabled, the runtime adds a dedicated tool such as update_progress to the tool list.
I would model that tool with roughly these semantics:
```
update_progress({
  "state": { ... },
  "reason": "milestone|voluntary|finalize",
  "message": "optional short summary",
  "final": false
})
```

That tool is not just "one more tool call" like search or shell. It is the runtime's state-commit primitive. Ordinary tools interact with the world. `update_progress` materializes the agent's best-so-far state.
Why make this explicit?
- it makes traces legible
- it gives the runtime a durable object to save and return on timeout
- it gives the model a clean way to say "here is the best state so far" without forcing a rigid schema too early
- it separates durable progress materialization from ordinary token streaming
Streaming text and progressive state are not the same thing. One is for live UX. The other is for correctness, resumability, and partial-result guarantees.
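As an illustration, the tool could be declared in the generic JSON-schema style most tool-calling APIs use. The field names mirror the sketch above; none of this is a confirmed term-llm interface:

```python
# Hypothetical declaration of the update_progress tool.
# "state" stays a free-form object, matching the loose v1 contract.
UPDATE_PROGRESS_TOOL = {
    "name": "update_progress",
    "description": "Materialize the agent's best-so-far state durably.",
    "parameters": {
        "type": "object",
        "properties": {
            "state": {"type": "object"},
            "reason": {
                "type": "string",
                "enum": ["milestone", "voluntary", "finalize"],
            },
            "message": {"type": "string"},
            "final": {"type": "boolean"},
        },
        "required": ["state", "reason"],
    },
}
```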
Progress during the run should be model-chosen; finalization should be runtime-forced
The important control here is not a pile of fuzzy event knobs, and it is also not forcing a bookkeeping tool call at the beginning of every turn. That would be expensive: if a turn gets consumed by a mandatory update_progress round trip, the runtime is paying a stupid tax on autonomy.
The mechanism that makes sense is simpler. Progressive mode adds an update_progress tool to the tool list. During the run, the model can call it when it believes the current best-so-far state is worth materializing. At the deadline, the runtime attempts one final forced update. That gives you cheap voluntary updates during the run and a stronger contract only when time is actually up.
A reasonable config surface therefore looks like this:
```yaml
progressive:
  enabled: true
  update_tool: update_progress
  timeout_finalization: required
```

That model is clear:
- agent works normally
- agent may call `update_progress` if it reaches a meaningful milestone
- ordinary turns are not consumed by forced bookkeeping
- on timeout or cancel, runtime attempts one final forced update
- if that fails, return the latest durable progress state
This keeps the update tool there for the taking without turning every turn into paperwork.
"20 minutes" can mean two different things
There is an important ambiguity hiding inside any flag like --timeout 20m.
- Up to 20 minutes: twenty minutes is a ceiling. The run may stop earlier if it is done.
- Ensure 20 minutes of work: twenty minutes is a budget to be used. The run should keep pushing even if it has a plausible answer early.
Those are not the same contract. They should not be blurred together.
The clean split is:
```
--timeout 20m        # hard ceiling
--stop-when done     # stop early if task is complete
--stop-when timeout  # keep working until budget is exhausted unless blocked/fatal
```

With that split, progressive execution can serve two related but different purposes:
- salvage mode: preserve and return best-so-far state if interrupted
- budgeted deepening mode: keep improving the answer because more time was intentionally allocated
That second mode is where the feature becomes especially interesting.
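The two stop policies reduce to a single decision function; the names here are illustrative, not the actual runtime:

```python
def should_continue(stop_when, task_done, seconds_left):
    """Stop-policy sketch for the two contracts above. With `done`,
    completion ends the run; with `timeout`, remaining budget keeps
    the run alive even after a plausible answer."""
    if seconds_left <= 0:
        return False                  # hard ceiling always wins
    if stop_when == "done":
        return not task_done          # ceiling semantics: stop when satisfied
    if stop_when == "timeout":
        return True                   # budget semantics: keep deepening
    raise ValueError(f"unknown stop policy: {stop_when}")
```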
If the model finishes early, what gets injected next?
If the stop policy is done, nothing special happens. The run can end early.
If the stop policy is timeout, the runtime needs a continuation prompt after an apparently adequate answer. This should not be a vague "keep going" or some embarrassing "good job robot" nonsense. It needs to be deliberate and operator-shaped.
At the core, this is just post-turn user prompt injection. A turn ends, budget remains, the stop policy says continue, and the runtime injects another user message to elicit more work from the model.
The generic continuation prompt is something like:

```
Continue working on the same task.
Do not stop because you already have a plausible answer.
Use the remaining budget to improve the result by doing one or more of:
- verifying the riskiest claims
- exploring credible alternatives
- finding counterevidence
- strengthening evidence for the current best answer
- identifying failure modes or edge cases
Call update_progress if your best-so-far state materially improves.
Only stop early if the task is genuinely exhausted or blocked.
```

That continuation behavior should be user-controlled. Different operators will want different kinds of additional work from the remaining budget. Preset templates may be useful later, but the fundamental primitive is direct continuation-prompt injection.
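The post-turn injection decision is a small function; `continue_with` mirrors the `--continue-with` flag, and the default text paraphrases the generic prompt above:

```python
DEFAULT_CONTINUATION = (
    "Continue working on the same task. Use the remaining budget to verify "
    "risky claims, explore alternatives, and strengthen the current best "
    "answer. Call update_progress if your best-so-far state materially "
    "improves. Only stop early if the task is genuinely exhausted or blocked."
)

def next_user_message(stop_when, task_done, seconds_left, continue_with=None):
    """Sketch: when a turn ends with budget remaining and the stop policy
    says keep going, return the user message to inject; None ends the run.
    Illustrative names, not term-llm internals."""
    if seconds_left <= 0 or (stop_when == "done" and task_done):
        return None                       # no injection; the run ends
    return continue_with or DEFAULT_CONTINUATION
```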
The most interesting problem: how to keep pushing after a decent first answer
This is the real heart of the feature.
If the agent finds a decent answer in the first minute of a twenty-minute budget, what should it do for the next nineteen minutes? If the answer is "write prettier prose about the same conclusion," the runtime is dead on arrival.
Progressive execution only becomes interesting when the runtime has a notion of how to spend additional budget well after early adequacy.
The most useful mental model is a staged budget policy:
Phase 1 — bootstrap
- produce a rough plan quickly
- materialize some useful best-so-far state early rather than hoarding reasoning internally
- surface the first credible path rather than waiting for perfect completion
Phase 2 — challenge and expand
- seek alternative hypotheses
- verify the riskiest claims with tools
- look for counterevidence
- identify what could still flip the answer
Phase 3 — converge
- merge the strongest evidence
- resolve contradictions
- compress into the final or timeout-safe answer shape
- leave uncertainty explicit rather than smudged
That pattern is strongly aligned with BG-MCTS: broad early, refine late. It is also aligned with DenoiseFlow's argument that remaining compute should be allocated toward unresolved ambiguity rather than uniformly spread.
This is the subtle, important bit: progressive execution is not just about surfacing state during a long run. It is about making additional time economically meaningful.
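One way to sketch the staged policy is to key the phase off the fraction of budget consumed; the 20% and 70% cut points are illustrative, not tuned values:

```python
def current_phase(elapsed, budget):
    """Staged-budget sketch matching the three phases above:
    bootstrap early, challenge in the middle, converge late."""
    frac = elapsed / budget
    if frac < 0.2:
        return "bootstrap"   # rough plan, early best-so-far state
    if frac < 0.7:
        return "challenge"   # alternatives, verification, counterevidence
    return "converge"        # merge evidence, finalize timeout-safe answer
```

In practice the runtime could fold the current phase into each continuation prompt, so "keep pushing" means something different at minute two than at minute eighteen.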
Anti-coasting rules
A progressive run needs guardrails against the laziest possible failure mode: early medium-confidence answer, then nineteen minutes of self-congratulation.
That only matters when the stop policy is timeout or some equivalent full-budget mode. If the operator said "stop when done," then a decent answer can simply end the run. If the operator said "keep working until budget is exhausted," the runtime needs a strategy for getting more work out of the model.
After the agent has a plausible answer, each continuation should be able to answer four questions:
- What improved since the last saved progress state?
- What remains uncertain?
- What is the highest-upside next action?
- Did confidence increase because of new evidence, or just because the model got comfortable?
That last question matters more than it sounds. LLMs are very good at mistaking familiarity for certainty.
A stronger version would require each post-bootstrap progress update to include at least one of the following:
- new external evidence
- an alternative hypothesis check
- counterevidence evaluation
- verification of the most important unresolved claim
Without something like that, "progressive" risks becoming a branding exercise for periodic paraphrasing.
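A minimal version of that requirement can be sketched as a key check on the update payload; the key names are hypothetical conventions, not an enforced schema:

```python
# Signals that distinguish real progress from paraphrase; each maps to
# one of the four bullets above.
EVIDENCE_KEYS = {
    "new_evidence",         # new external evidence
    "alternative_checked",  # an alternative hypothesis check
    "counterevidence",      # counterevidence evaluation
    "claim_verified",       # verification of an unresolved claim
}

def is_real_progress(update_state):
    """Anti-coasting sketch: a post-bootstrap progress update should
    carry at least one evidence signal, not just restated confidence."""
    return bool(EVIDENCE_KEYS & set(update_state))
```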
Continuation prompt templates are sugar, not the core mechanism
Different tasks want different kinds of additional work, so the continuation prompt after an early answer should be explicit and user-controlled.
The primitive is simple: inject another user prompt when a turn ends and budget remains. Everything else is convenience. Preset templates may still be useful, but they should be understood as wrappers around prompt injection rather than a deeper runtime concept.
| Template | Injected continuation shape | Best for |
|---|---|---|
| `balanced` | Verify risky claims, probe alternatives, strengthen the current best answer | General use |
| `verify` | Prioritize external evidence, counterexamples, and failure modes | Research, debugging, compliance-like tasks |
| `explore` | Search alternative hypotheses or approaches before converging | Search, planning, design exploration |
| `implement` | Turn the current plan into concrete edits, tests, commands, and artifacts | Coding and execution-heavy tasks |
The important point is that the operator should be able to define what "keep pushing" means directly. Sometimes it means verify. Sometimes it means explore more branches. Sometimes it means implement and test. The runtime should not assume these are the same problem, and it should not pretend the preset names are the real mechanism.
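The table above reduces to a small mapping plus the injection primitive. The prompt strings are paraphrases of the table rows and the function is illustrative:

```python
# Templates as sugar over continuation-prompt injection; a custom
# --continue-with string always wins over a preset.
CONTINUATION_TEMPLATES = {
    "balanced": "Verify risky claims, probe alternatives, and strengthen "
                "the current best answer.",
    "verify": "Prioritize external evidence, counterexamples, and "
              "failure modes.",
    "explore": "Search alternative hypotheses or approaches before "
               "converging.",
    "implement": "Turn the current plan into concrete edits, tests, "
                 "commands, and artifacts.",
}

def continuation_prompt(template="balanced", custom=None):
    """Resolve the injected continuation text: operator text first,
    then the named preset."""
    return custom or CONTINUATION_TEMPLATES[template]
```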
Concrete term-llm spec
CLI
```
term-llm ask \
  --agent researcher \
  --progressive \
  --timeout 20m \
  --stop-when done \
  "Investigate the failure mode"

term-llm ask \
  --agent jarvis \
  --progressive \
  --timeout 20m \
  --stop-when timeout \
  --continue-with "Verify the riskiest claims, explore alternatives, and strengthen the current best answer." \
  --progressive-schema ./progressive-schema.json \
  "Debug the deployment"
```

Flags

- `--progressive` — enable progressive mode
- `--progressive-schema <file>` — override or extend the default state shape
- `--timeout <duration>` — hard wall-clock ceiling
- `--stop-when done` — stop early if the task is complete
- `--stop-when timeout` — keep working until budget is exhausted unless blocked/fatal
- `--continue-with <text>` — inject a user continuation prompt after a completed turn when budget remains
There should probably not be a separate --resume-progressive flag. term-llm ask already has --resume, and the source shows that it is wired into the existing session store and current-session marker. Progressive state should live in that same session model, so ordinary --resume should simply become progressive-aware.
What does a progressive ask actually output?
This part matters because term-llm normally shows the work: turns, tool calls, intermediate chatter, the whole little factory. Progressive mode introduces another artifact: the saved progress state. For unattended and machine-consumed runs, that progress state should become the primary output.
So the default behavior I want is:
- session trace remains intact — turns, tool calls, continuation prompts, and ordinary assistant messages still live in the session store for debugging and inspection
- final emitted result is the latest finalized progress object — not the raw transcript
- if finalization fails, emit the latest durable saved progress object along with exit metadata
That fits the philosophy of the feature. Normal asks are conversation-oriented. Progressive asks are artifact-oriented. The runtime still does all the work in public internally, but the thing you generally want at the end is the best saved state, expanded as the canonical result.
For example, the output shape could look like:
```json
{
  "exit_reason": "timeout",
  "finalized": true,
  "session_id": "sess_abc123",
  "progress": {
    "goal": "Determine why health checks are failing",
    "findings": [
      "service exits immediately after boot",
      "logs show DATABASE_URL is missing"
    ],
    "best_guess": "The runit service script is not loading the environment file.",
    "next": [
      "inspect /etc/sv/service/run",
      "patch env sourcing",
      "restart and re-check health"
    ],
    "final": true
  }
}
```

That is much better for jobs and automation than dumping a full transcript to stdout and hoping some later step reverse-engineers the actual result.
Stored state
Each progress commit should be durable, and in term-llm the obvious place to anchor it is the existing session store rather than a parallel progressive database. The current source already treats --resume as a session-store concern and even keeps a current-session marker, so progressive state should piggyback on that same persistence model.
At minimum, each saved progressive state should include:
- timestamp
- run id / session id
- progress sequence number
- full materialized state
- reason for save
- summary of deltas since prior state
- whether this save is final
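A sketch of anchoring progress commits in a SQLite session store follows; the table name and columns are assumptions, since term-llm's actual sessions.db layout is not specified here:

```python
import json
import sqlite3
import time

def save_progress(db, session_id, seq, state, reason, final=False):
    """Hypothetical durable commit: one row per progress save, keyed by
    session and sequence number, piggybacking on the session database."""
    db.execute("""CREATE TABLE IF NOT EXISTS progressive_state (
        session_id TEXT, seq INTEGER, ts REAL,
        state TEXT, reason TEXT, final INTEGER,
        PRIMARY KEY (session_id, seq))""")
    db.execute(
        "INSERT INTO progressive_state VALUES (?, ?, ?, ?, ?, ?)",
        (session_id, seq, time.time(), json.dumps(state), reason, int(final)),
    )
    db.commit()

def latest_progress(db, session_id):
    """Return the most recent saved state, or None if nothing was saved."""
    row = db.execute(
        "SELECT state FROM progressive_state WHERE session_id = ? "
        "ORDER BY seq DESC LIMIT 1", (session_id,)).fetchone()
    return json.loads(row[0]) if row else None
```

Because each save is an ordinary committed row, a crashed or timed-out run can always answer "what is the best state we have" with one query.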
Exit semantics
- natural completion: return the final saved progressive state
- timeout: always run timeout finalization; if that fails, return latest durable progress state with a timeout classification
- cancelled: same model as timeout, but with a cancellation classification if available
- tool failure: save updated risk state rather than discarding prior best answer
The key principle is simple: interruption should degrade quality, not destroy state.
Example loose progress objects
A debugging run might evolve like this, without assuming a mandatory schema:
Save 1 — 20 seconds
```json
{
  "goal": "Determine why health checks are failing",
  "plan": ["read logs", "check service status", "verify env injection"],
  "best_guess": "startup is failing before the web process binds",
  "next": ["inspect current logs"]
}
```

Save 2 — after log inspection

```json
{
  "goal": "Determine why health checks are failing",
  "findings": [
    "service exits immediately after boot",
    "logs show DATABASE_URL is missing"
  ],
  "best_guess": "DATABASE_URL is absent at runtime",
  "next": ["inspect runit env loading"],
  "confidence": "medium"
}
```

Save 3 — timeout finalization

```json
{
  "goal": "Determine why health checks are failing",
  "findings": [
    "service exits immediately after boot",
    "logs show DATABASE_URL is missing",
    "runit script likely does not source the env file"
  ],
  "best_guess": "Best current conclusion: the runit service script is not loading the expected environment file, leaving DATABASE_URL unset at runtime.",
  "next": ["inspect /etc/sv/service/run", "patch env sourcing", "restart and re-check health"],
  "final": true,
  "classification": "timed_out"
}
```

That final timeout result is already useful. It gives the operator a real answer, the evidence behind it, and the highest-upside next steps. The fact that the shape stayed loose did not make it worthless. It just means the runtime is, for now, saving arbitrary useful state rather than enforcing a universal contract.
What feels novel here
Not the existence of iterative refinement. Not the existence of checkpoints. Not even the existence of budget-aware search. Those are all real and well established.
What feels new is the combination:
- an anytime execution contract for LLM agents
- a guaranteed durable progress-update call
- loose JSON state in the first version, with room to standardize later if usage justifies it
- forced final update on timeout
- runtime strategy for spending post-adequacy budget well
That package is much more interesting than adding a timeout flag and calling it a day.
Why this belongs in term-llm
term-llm already lives in the awkward but fertile space between chat UX, tooling, jobs, and agent execution. That makes it a particularly good place to try a feature like this. The runtime already understands tools, turns, jobs, interruption, and operator control. Progressive execution is not a random bolt-on; it is a better contract for the sort of work agents actually do there.
A good agent runtime should not merely be able to run longer. It should be able to make longer runs worthwhile.
That is why progressive execution is interesting. It takes three things that are usually separate — streaming UX, checkpoint durability, and iterative refinement — and turns them into one coherent feature with a clear promise:
useful now, better later, never empty-handed on interruption
That is a runtime worth building.