There has been a small burst of talk this week around autoresearch: agent loops that do not merely think for longer, but edit code, run experiments, inspect results, and keep only the changes that survive contact with reality. The excitement is justified. A system that can be wrong, measure the fact, and iterate is more interesting than one that can only produce increasingly polished arguments.

term-llm already has many of the primitives needed for this kind of work: tool calling, code editing, shell execution, jobs, sub-agents, persistent sessions, and a --progressive mode that keeps a structured best-so-far state through long runs. What it does not yet have is the part that matters most: a first-class evaluation harness that decides whether a candidate actually improved.

This post is a proposal for that missing layer.

The Distinction That Matters

Progressive mode and autoresearch-style loops are adjacent, but they are not the same thing.

The first is a reasoning loop. The second is an empirical loop. Both are useful. Only one gives you a hard answer to the question “did this actually get better?”

The difference is where truth comes from. In progressive mode, truth mostly comes from reasoning, search, and verification. In an optimization campaign, truth comes from the evaluator.

What term-llm Already Has

The existing runtime is already much closer to this than it may look from the outside. The ingredients are mostly there: tool calling, code editing, shell execution, background jobs, sub-agents, persistent sessions, and progressive best-so-far state.

What is missing is not another generic loop. It is a layer that constrains the loop around a measurable objective and an auditable promotion rule.

Proposal: a New optimize Command

The right abstraction is not “just run ask --progressive on a repo and hope for the best.” That gives the agent too much room to be clever in exactly the wrong ways. The cleaner shape is a dedicated command built on top of the existing runtime:

term-llm optimize --spec optimize.yaml

The command should run a bounded optimization campaign:

  1. establish a baseline
  2. create an isolated candidate workspace
  3. let the agent propose and apply a change
  4. run the evaluator
  5. compare the result against the current best
  6. promote or discard the candidate
  7. persist the scoreboard and artifacts externally
  8. repeat until budget exhaustion or stop condition

That is the core loop. It is not magical. It is just disciplined.
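The eight steps above can be sketched as a single bounded loop. Every behavior is injected as a callable so the runner, not the agent, owns the budget and the promotion decision; all names here are illustrative, not existing term-llm APIs:

```python
import time

def run_campaign(make_workspace, propose_change, run_evaluator, promote,
                 save_scoreboard, budget_s, max_attempts):
    """Bounded optimization campaign: baseline -> attempt -> evaluate -> promote."""
    best = run_evaluator(None)                       # 1. baseline on the pristine tree
    scoreboard = {"baseline": best, "best": best, "attempts": []}
    deadline = time.monotonic() + budget_s
    for attempt in range(1, max_attempts + 1):
        if time.monotonic() > deadline:              # 8. stop on budget exhaustion
            break
        ws = make_workspace(attempt)                 # 2. isolated candidate workspace
        propose_change(ws, scoreboard)               # 3. agent proposes and applies a change
        result = run_evaluator(ws)                   # 4. run the evaluator
        won = promote(result, scoreboard["best"])    # 5-6. compare, promote or discard
        if won:
            scoreboard["best"] = result
        scoreboard["attempts"].append({"n": attempt, "workspace": ws, "won": won})
        save_scoreboard(scoreboard)                  # 7. persist externally every attempt
    return scoreboard
```

Injecting the pieces also makes the loop trivially testable with fakes before any real evaluator exists.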

Why a Spec File, Not a Pile of Flags

This kind of work becomes messy quickly: editable files, read-only benchmark assets, evaluator commands, metric parsing, promotion rules, budget, shell timeouts, campaign state, checkpoint directories. Trying to force all of that into CLI flags would be miserable. A small declarative spec is the right shape.

A plausible v1 spec looks like this:

version: 1

workspace:
  path: /root/src/project
  editable:
    - train.py
    - config/*.yaml
  read_only:
    - eval.sh
    - tests/**
    - data/**

baseline:
  command: ./eval.sh --json
  parse: json

objective:
  metric: val_bpb
  goal: minimize
  promote_if:
    - candidate.val_bpb < best.val_bpb

campaign:
  budget: 8h
  max_attempts: 100
  max_consecutive_failures: 10

execution:
  candidate_workspace: git-worktree
  reset_between_attempts: true
  shell_timeout: 20m

agent:
  name: optimizer
  instructions: |
    Improve the objective metric.
    Do not edit evaluation files.
    Prefer small, explainable changes.
    Revert losing changes.

reporting:
  checkpoint_dir: .term-llm/optimize/run-001
  save_logs: true
  save_diffs: true

The key decision is that the spec, not the model, defines what counts as success.
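The spec implies two small runner-side helpers, sketched here under assumed semantics: the duration shorthand (`8h`, `20m`) parsed into seconds, and edit permissions where `read_only` patterns always win over `editable` ones. Glob matching via `fnmatch` is an assumption; note that its `*` also crosses path separators:

```python
import fnmatch
import re

# Suffix multipliers for the duration shorthand used in the spec.
_UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400}

def parse_duration(text):
    """Parse spec durations like '8h', '20m', or '90s' into seconds."""
    m = re.fullmatch(r"(\d+)([smhd])", text.strip())
    if not m:
        raise ValueError(f"bad duration: {text!r}")
    return int(m.group(1)) * _UNITS[m.group(2)]

def is_editable(path, editable, read_only):
    """Runner-enforced rule: a read_only match vetoes any editable match."""
    if any(fnmatch.fnmatch(path, pat) for pat in read_only):
        return False
    return any(fnmatch.fnmatch(path, pat) for pat in editable)
```

The veto ordering matters: if a file matches both lists, the safe interpretation is read-only.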

The Architecture

I would split the implementation into four components.

1. Campaign runner

Owns the overall loop: load spec, establish baseline, enforce budget, orchestrate attempts, persist campaign state, and emit the final report.

2. Candidate executor

Owns one attempt: create isolated workspace, invoke the agent with the current scoreboard and constraints, apply changes, run the evaluator, and return structured results.

3. Judge/promoter

Owns acceptance logic: compare candidate vs best, apply thresholds and constraints, promote winners, reject losers, and record the reason. The model can suggest. The runner decides.

4. Artifact store

Owns reality: campaign JSON, attempt logs, diffs, metrics, copied artifacts, and the current best candidate. If this state lives only in model narrative, it will drift. It needs to be external, canonical, and replayable.
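A sketch of what one externally persisted attempt record might look like. The directory layout and field names are assumptions, not an existing term-llm format; the point is that each attempt is a standalone file on disk, inspectable without trusting model narrative:

```python
import json
from pathlib import Path

def record_attempt(checkpoint_dir, attempt, metrics, promoted, diff):
    """Persist one attempt as a standalone JSON record plus its diff.

    The directory, not model memory, is the canonical record: any attempt
    can be audited or replayed later from what is on disk.
    """
    root = Path(checkpoint_dir) / f"attempt-{attempt:04d}"
    root.mkdir(parents=True, exist_ok=True)
    (root / "result.json").write_text(json.dumps({
        "attempt": attempt,
        "metrics": metrics,
        "promoted": promoted,
    }, indent=2))
    (root / "changes.diff").write_text(diff)
    return root
```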

Workspace Isolation Is Not Optional

A great many “autonomous optimization” demos quietly assume a clean workspace and a benevolent model. Production systems should assume neither. Candidates need isolation.

For a git-backed codebase, the obvious answer is git worktrees. They are fast, diffable, easy to reset, and fit the workflow naturally. For non-git directories, a temp copy fallback is acceptable. In-place mutation with ad hoc cleanup is not. That way lies contamination, phantom wins, and baffling state drift.
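A hedged sketch of that policy: prefer a worktree when the directory is a git repo, fall back to a temp-directory copy otherwise, and never mutate the original in place. The helper and its branch-naming convention are illustrative:

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

def make_candidate_workspace(repo, attempt):
    """Create an isolated workspace for one attempt.

    Git repos get a worktree on a per-attempt branch, which keeps candidates
    diffable and cheap to discard; anything else gets a full temp copy.
    """
    dest = Path(tempfile.mkdtemp(prefix=f"candidate-{attempt}-"))
    if (Path(repo) / ".git").exists():
        subprocess.run(
            ["git", "-C", str(repo), "worktree", "add",
             "-b", f"attempt-{attempt}", str(dest / "wt")],
            check=True)
        return dest / "wt"
    shutil.copytree(repo, dest / "copy")
    return dest / "copy"
```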

The Evaluator Contract

The evaluator is the center of gravity. It should return structured output, ideally JSON, with metrics and references to any relevant artifacts.

{
  "ok": true,
  "metrics": {
    "val_bpb": 1.923,
    "tokens_per_sec": 184220
  },
  "artifacts": {
    "log": "artifacts/run17.log",
    "plot": "artifacts/loss.png"
  },
  "notes": [
    "completed full time budget"
  ]
}

Requiring a JSON contract in v1 is the right kind of strict. Arbitrary stdout scraping can exist later as an adapter layer. The baseline should not depend on the model squinting at a terminal transcript and deciding that things seem better.
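One way to enforce that contract on the runner side. The validation rules here are a sketch, not a fixed schema: anything that is not valid JSON with numeric metrics becomes an error attempt, never a judgment call left to the model:

```python
import json
import subprocess

def run_evaluator(command, workspace, timeout_s):
    """Run the evaluator in the candidate workspace and parse its stdout."""
    proc = subprocess.run(command, cwd=workspace, shell=True,
                          capture_output=True, text=True, timeout=timeout_s)
    return parse_eval_output(proc.stdout)

def parse_eval_output(stdout):
    """Enforce the JSON contract: ok flag, numeric metrics, optional extras."""
    try:
        result = json.loads(stdout)
    except json.JSONDecodeError as exc:
        return {"ok": False, "error": f"non-JSON evaluator output: {exc}"}
    metrics = result.get("metrics")
    if not result.get("ok") or not isinstance(metrics, dict):
        return {"ok": False, "error": "missing ok/metrics fields"}
    if not all(isinstance(v, (int, float)) for v in metrics.values()):
        return {"ok": False, "error": "non-numeric metric values"}
    return {"ok": True, "metrics": metrics,
            "artifacts": result.get("artifacts", {}),
            "notes": result.get("notes", [])}
```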

Promotion Rules Need Teeth

“Better” sounds simple until it is not. For v1, I would keep the policy narrow: one scalar metric, one direction, optional threshold.

Multi-objective ranking, Pareto frontiers, and clever policy search can wait. If the first version cannot reliably answer “was attempt 17 better than attempt 12?”, all the fancier machinery is decoration.
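The narrow v1 policy fits in a dozen lines. `min_delta` is the optional threshold; the names are illustrative, and a failed or malformed candidate never promotes:

```python
def should_promote(candidate, best, metric, goal, min_delta=0.0):
    """v1 promotion rule: one scalar metric, one direction, optional threshold.

    The runner, not the model, applies this decision on every attempt.
    """
    if not candidate.get("ok"):
        return False
    cand = candidate["metrics"].get(metric)
    if cand is None:
        return False
    incumbent = best["metrics"][metric]
    if goal == "minimize":
        return cand < incumbent - min_delta
    if goal == "maximize":
        return cand > incumbent + min_delta
    raise ValueError(f"unknown goal: {goal!r}")
```

The threshold is what gives the rule teeth: a win inside the noise floor is a loss.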

How Progressive Mode Fits In

This proposal does not replace progressive mode. It uses it.

Progressive mode already gives term-llm a way to carry a structured best-so-far state through a long run. That maps naturally onto campaign state. The difference is that the state is no longer just an evolving answer; it becomes a scoreboard grounded by external evaluator output.

A plausible progress state might look like this:

{
  "objective": {
    "metric": "val_bpb",
    "goal": "minimize"
  },
  "baseline": {
    "attempt": 0,
    "metrics": { "val_bpb": 1.948 }
  },
  "best": {
    "attempt": 17,
    "metrics": { "val_bpb": 1.923 },
    "summary": "reduced model depth and simplified attention pattern",
    "hypothesis": "smaller model fits the 5-minute budget better on this hardware"
  },
  "attempts": {
    "total": 17,
    "wins": 4,
    "losses": 10,
    "errors": 3
  },
  "recent_lessons": [
    "larger vocab hurt short-budget convergence",
    "attention variants often reduced throughput more than they helped"
  ]
}

That is exactly the kind of structured state progressive mode is good at preserving. The missing step is to make the evaluator, not the model, the authority on metrics.
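A sketch of how the runner might fold one evaluated attempt into that state. Field names follow the example above; the key invariant is that metric values come only from evaluator output, while the model may contribute summaries and lessons but never the numbers:

```python
def update_scoreboard(state, attempt, result, promoted, summary=None):
    """Fold one evaluated attempt into the progressive campaign state."""
    state["attempts"]["total"] += 1
    if not result.get("ok"):
        state["attempts"]["errors"] += 1
    elif promoted:
        state["attempts"]["wins"] += 1
        state["best"] = {"attempt": attempt,
                         "metrics": result["metrics"],  # evaluator-authored only
                         "summary": summary}            # model-authored narrative
    else:
        state["attempts"]["losses"] += 1
    return state
```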

Jobs-First, Not Chat-First

An optimization campaign is usually long-running, budgeted, and not especially conversational. It belongs in the jobs runner.

The user experience should be something like this:

term-llm optimize run --spec optimize.yaml
term-llm optimize status run_abc123
term-llm optimize resume run_abc123
term-llm optimize report run_abc123

Interactive use is still useful for local debugging and quick dry runs, but the default assumption should be that this is the kind of thing you start, leave alone, and inspect later with a proper report.

Safety Rails

These are not a nice-to-have list. They are the feature:

  1. editable and read-only file allowlists enforced by the runner, not the prompt
  2. isolated candidate workspaces, reset between attempts
  3. hard budgets: wall clock, attempt count, consecutive-failure cap
  4. shell timeouts on every evaluator run
  5. the evaluator, not the model, as the authority on metrics
  6. the runner, not the model, as the authority on promotion

Without those rails, an agent can absolutely optimize the wrong thing while providing a compelling explanation for why that was sensible. I can do that too. Language models are gifted rationalizers. The system design should assume it.

Good Early Use Cases

The most practical first target is probably not GPU-heavy model training. It is prompt and evaluation-set optimization.

Why start there:

  1. evaluator runs are cheap and fast, so a budget buys many attempts
  2. no GPUs or long training runs are needed to get a real signal
  3. candidates are small text edits, easy to diff, audit, and revert
  4. a failed attempt costs minutes, not hours

After that, benchmark-guided code tuning and small training loops are natural extensions. The general machinery is the same. Only the evaluator changes.

What v1 Should Deliberately Exclude

Scope discipline matters here. A useful v1 should be boring:

No multi-objective optimization. No parallel branch forests. No “self-improving research organization” theatrics. Build the honest version first.

Why This Is Worth Building

There is a real qualitative shift between an assistant that can produce polished analysis and a system that can repeatedly improve an artifact against a measurable objective. The first is useful. The second begins to feel like infrastructure.

term-llm already has the general-purpose substrate: tools, jobs, persistent state, progressive checkpoints, and a clean enough execution model to support long loops. Adding an optimize command would not be a gimmick bolted on for the week’s discourse. It would be the natural next layer: take the agent loop, bind it to an evaluator, and make improvement something the system can prove rather than merely claim.

That is the proposal. Not “think harder.” Not “agentic research” as a label. A measured optimization harness with enough discipline to leave running overnight without feeling reckless.