Every agent harness eventually discovers the same boring truth: the model is not the only unreliable thing in the loop.
The network drops. The browser tab sleeps. The provider sends half an answer then closes the stream. The WebSocket survives long enough to make you trust it, then dies immediately after a tool call. A reverse proxy decides silence means death. A mobile client reconnects with no idea what happened while it was gone. Somewhere in the middle of this the model may have asked to edit a file, send an email, or run a shell command.
So the naive version of resilience is easy to write and hard to live with:
var err error
for attempt := 0; attempt < 5; attempt++ {
    if err = run(); err == nil {
        return nil
    }
    time.Sleep(backoff(attempt)) // wait, then repeat the same request blindly
}
return err
That loop is not resilience. It is a photocopier aimed at side effects.
A real agent harness needs to know what kind of failure it saw, what already became visible, what already mutated the world, and whether the next action is retry, reconnect, resume, fallback, or stop. Those are not synonyms. Treating them as synonyms is how you get duplicate tool calls, phantom assistant messages, and the distinctive smell of an autonomous system quietly doing the same bad thing three times.
I spent time comparing how several current harnesses handle this: term-llm, pi, opencode, codex, and the decompiled npm bundle for Claude Code 2.1.71. This is not a shopping list of features to steal. It is a set of design distinctions harness builders should make before their agents become long-running enough to matter.
Start with the failure boundary
The most important question is not “is this error retryable?”
The most important question is: has the attempt committed?
A model request can fail before anything escapes. It can fail after invisible setup. It can fail after text was streamed to the user. It can fail after a tool call was emitted. It can fail after a tool actually ran. Those are different worlds.
In term-llm, the generic retry wrapper buffers provider events until the attempt becomes externally visible. The comment in internal/llm/retry.go is the contract:
Before any externally-visible side effects, events are buffered so retryable failures can be retried without leaking partial output into the outer stream.
Buffering stops once the attempt has visibly committed to the caller:
- assistant text deltas
- reasoning deltas
- warning-prefixed phase updates
- interjections injected into the conversation
- tool calls, which are durable model actions and may be followed by side effects
- synchronous tool requests
- provider-native tool execution
After that point the attempt has already escaped, so retrying would duplicate visible output or side effects.
That is the line I want more harnesses to draw explicitly. Before commit, retry is fine. After commit, automatic retry becomes a different operation with a different proof burden.
The implementation is not magical. forwardAttempt tracks whether it is still buffering or has gone live. Once live, a later stream error is wrapped as non-retryable:
if err != nil {
    if live {
        return &committedError{err}
    }
    return err
}
The error may still be transient. The network may genuinely have hiccuped. But the correct response is no longer “repeat the request.” It may be “ask the user to continue,” “resume from provider state,” “summarize partial progress,” or “fail loudly.” Replaying the same prompt after the model already asked to run a tool is not recovery. It is roulette with receipts.
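Here is a minimal sketch of that buffer-until-commit shape, with hypothetical Event and attemptEvent types rather than term-llm's actual ones: events are held back until one of them commits the attempt, and an error after commit gets wrapped so the outer retry loop refuses to replay it.

// Event is a hypothetical provider stream event.
type Event struct {
    Kind string // "text_delta", "reasoning_delta", "tool_call", ...
    Data string
}

// attemptEvent carries either a provider event or a terminal stream error.
type attemptEvent struct {
    Event Event
    Err   error
}

// committedError marks a failure that happened after output escaped to the caller.
type committedError struct{ err error }

func (e *committedError) Error() string { return "failed after commit: " + e.err.Error() }
func (e *committedError) Unwrap() error { return e.err }

// commits reports whether an event makes the attempt externally visible.
func commits(ev Event) bool {
    switch ev.Kind {
    case "text_delta", "reasoning_delta", "tool_call":
        return true
    }
    return false
}

// forwardAttempt buffers events until one of them commits, then goes live.
func forwardAttempt(stream <-chan attemptEvent, emit func(Event)) error {
    var buffered []Event
    live := false
    for item := range stream {
        if item.Err != nil {
            if live {
                // Output or a tool call already reached the caller; a blind
                // replay of the request would duplicate it.
                return &committedError{err: item.Err}
            }
            return item.Err // nothing escaped yet, so the outer loop may retry
        }
        ev := item.Event
        if !live && commits(ev) {
            live = true
            for _, b := range buffered {
                emit(b)
            }
            buffered = nil
        }
        if live {
            emit(ev)
        } else {
            buffered = append(buffered, ev)
        }
    }
    // Clean end of stream: deliver whatever was still buffered.
    for _, b := range buffered {
        emit(b)
    }
    return nil
}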
Retry, reconnect, resume, and fallback are different verbs
Harnesses get cleaner when these four verbs stay separate.
Retry means repeating an operation that did not commit. The same input goes again because nothing externally meaningful happened.
Reconnect means the operation may still be running somewhere else, and the client is only re-attaching to the event stream.
Resume means continuing from durable progress. You are not pretending the previous attempt never happened.
Fallback means changing transport or provider strategy because a path is unhealthy.
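One cheap way to keep the verbs separate is to make recovery a decision with named outcomes rather than a retry counter. A sketch, with hypothetical names and a deliberately simplified transient check:

import (
    "errors"
    "net"
)

// recoveryAction names the verbs instead of collapsing them into "retry".
type recoveryAction int

const (
    actionRetry     recoveryAction = iota // repeat an uncommitted operation
    actionReconnect                       // re-attach to a run that may still be going
    actionResume                          // continue from durable progress
    actionFallback                        // change transport or provider path
    actionStop                            // surface the failure and wait for a human
)

// classify is a function of what happened, not just of the error: the same
// network hiccup maps to different actions before and after commit.
func classify(err error, committed, transportHealthy bool) recoveryAction {
    switch {
    case !transportHealthy:
        return actionFallback
    case committed:
        return actionResume // or actionStop, depending on what durable progress exists
    case isTransient(err):
        return actionRetry
    default:
        return actionStop
    }
}

func isTransient(err error) bool {
    // Stand-in: real code would also look for connection resets, 5xx, 429, etc.
    var netErr net.Error
    return errors.As(err, &netErr) && netErr.Timeout()
}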
Codex is a good example of making the transport part explicit. Its model client has a session-scoped state bit for disabling WebSockets after they fail. The comments in core/src/client.rs describe the shape:
WebSocket fallback is session-scoped: once a turn activates the HTTP fallback, subsequent turns will also use HTTP for the remainder of the session.
The stream loop in core/src/session/turn.rs gives WebSockets a retry budget, then degrades the session to HTTP:
let max_retries = turn_context.provider.info().stream_max_retries();
if retries >= max_retries
    && client_session.try_switch_fallback_transport(
        &turn_context.session_telemetry,
        &turn_context.model_info,
    )
{
    sess.send_event(... "Falling back from WebSockets to HTTPS transport" ...).await;
    retries = 0;
    continue;
}
That is fallback, not retry. The harness has learned something about the transport path. It changes route.
term-llm has the same kind of WebSocket-to-HTTP fallback in its Responses client. I initially missed this because I was staring at the wrong layer. In ResponsesClient.Stream, WebSocket setup and pre-commit read failures are retried, then the client disables WebSocket and uses HTTP/SSE:
if c.UseWebSocket && !c.websocketDisabled {
    ... retry websocket ...
    c.websocketDisabled = true
    c.closeWebSocket()
}
return c.streamHTTPPrepared(ctx, httpPayload, buildFullInput, responseStateGeneration, debugRaw)
The small but important follow-up is to test that fallback is sticky. I opened term-llm PR #646 to harden that test: after the client exhausts WebSocket setup attempts and falls back to HTTP, the next stream should go straight to HTTP rather than probing WebSocket again. Otherwise the system “falls back” in name only, and every turn pays the same failure tax.
Client disconnect is not model cancellation
Web UIs make this especially easy to get wrong. If the HTTP request goes away, should the model run stop?
For a short autocomplete call, maybe. For a long-running agent, usually not.
term-llm detaches response runs from the request context. The comment in cmd/serve_response_runs.go is blunt:
Runs must survive client disconnects so that:
- SSE connections are fragile (network blips, mobile tab switches, etc.); killing a run on disconnect would waste partial work.
- Clients reconnect via GET /v1/responses/{id}/events?after=N and replay events they missed, which only works if the run kept going.
- Explicit cancellation is available via POST /v1/responses/{id}/cancel.
That is reconnect semantics. The browser tab is not the owner of the inference. It is a subscriber.
This distinction gets more important as agents become slower and more useful. A 45-second code search, a 5-minute refactor, a 20-minute research pass, a background job running sub-agents: these cannot be tied to the liveness of a single TCP connection from a single tab.
The usual pattern, sketched in code after the list, is:
- Create a durable or semi-durable run id.
- Append events to a run log with sequence numbers.
- Let clients subscribe from after=N.
- Treat disconnect as unsubscribe, not cancel.
- Provide an explicit cancel endpoint.
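A minimal in-memory sketch of that shape, with hypothetical names (live push elided; clients poll or long-poll the replay path):

import (
    "context"
    "sync"
)

// RunEvent is one entry in the append-only event log.
type RunEvent struct {
    Seq  int
    Data string
}

// Run owns its own context, so a subscriber going away does not cancel the work.
type Run struct {
    mu     sync.Mutex
    events []RunEvent
    cancel context.CancelFunc
}

// StartRun launches work detached from any incoming request context.
func StartRun(work func(ctx context.Context, r *Run)) *Run {
    ctx, cancel := context.WithCancel(context.Background())
    r := &Run{cancel: cancel}
    go work(ctx, r)
    return r
}

// Append adds the next event with a monotonically increasing sequence number.
func (r *Run) Append(data string) {
    r.mu.Lock()
    defer r.mu.Unlock()
    r.events = append(r.events, RunEvent{Seq: len(r.events) + 1, Data: data})
}

// EventsAfter serves the replay path (events?after=N): return what the client missed.
func (r *Run) EventsAfter(n int) []RunEvent {
    r.mu.Lock()
    defer r.mu.Unlock()
    if n >= len(r.events) {
        return nil
    }
    return append([]RunEvent(nil), r.events[n:]...)
}

// Cancel is the explicit cancel endpoint. Disconnects never call this.
func (r *Run) Cancel() { r.cancel() }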
The obvious next step is persistence across server restart. In-memory detached runs survive browser disconnect. They do not survive process death. If your harness wants “come back after deploy and keep watching,” the event log must be durable.
Backoff policy is product behavior
Backoff sounds like plumbing. It is actually how your product behaves under stress.
A linear backoff says “try again soon, then slightly less soon.” An exponential backoff says “the world may be on fire; stop adding kindling.” A Retry-After header says “the server knows more than you, at least this time.”
Claude Code’s bundled Anthropic SDK retry behavior is conventional and solid. In the decompiled 2.1.71 npm bundle, the SDK retries on 408, 409, 429, and >=500, with x-should-retry override support:
if (q === "true") return true;
if (q === "false") return false;
if (A.status === 408) return true;
if (A.status === 409) return true;
if (A.status === 429) return true;
if (A.status >= 500) return true;
It also honors retry-after-ms and retry-after, including HTTP dates, but only accepts server-requested delays under 60 seconds; anything longer falls back to its default exponential backoff. Reasonable. Not holy writ, but reasonable.
OpenCode’s retry helper is similarly careful with server hints. packages/opencode/src/session/retry.ts handles retry-after-ms, numeric retry-after, HTTP-date retry-after, and caps exponential delay to 30 seconds when there are no headers:
const retryAfter = headers["retry-after"]
if (retryAfter) {
  const parsedSeconds = Number.parseFloat(retryAfter)
  if (!Number.isNaN(parsedSeconds)) {
    return Math.ceil(parsedSeconds * 1000)
  }
  const parsed = Date.parse(retryAfter) - Date.now()
  if (!Number.isNaN(parsed) && parsed > 0) {
    return Math.ceil(parsed)
  }
}
return Math.min(
  RETRY_INITIAL_DELAY * Math.pow(RETRY_BACKOFF_FACTOR, attempt - 1),
  RETRY_MAX_DELAY_NO_HEADERS,
)
term-llm had generic provider retry, but its fallback delay was linear. PR #646 changes that to exponential jitter and expands Retry-After parsing. This is the new shape:
// Exponential backoff with jitter: base * 2^(attempt-1) * jitter
jitter := 0.5 + rand.Float64()
multiplier := 1 << max(attempt-1, 0)
delay := time.Duration(float64(r.config.BaseBackoff) * float64(multiplier) * jitter)
The detail that matters is not the exact multiplier. The detail is that retry policy is visible behavior. If your harness silently retries for 90 seconds, that is part of the UI. If it retries a rate limit immediately five times, that is part of your relationship with the provider. If it hides retry state, users experience it as freezing.
So emit retry events. Show the attempt count. Show the wait. Make cancellation work during sleep. This is not polish. It is operational honesty.
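A sketch of what that looks like in code, with hypothetical names: the wait is announced as an event the UI can render, and the sleep can be cut short by cancellation.

import (
    "context"
    "time"
)

// RetryEvent is what the user sees instead of an unexplained freeze.
type RetryEvent struct {
    Attempt int
    Max     int
    Wait    time.Duration
}

// backoffWait announces the retry, then sleeps, but returns early on cancel.
func backoffWait(ctx context.Context, ev RetryEvent, emit func(RetryEvent)) error {
    emit(ev) // surface attempt count and wait before going quiet
    t := time.NewTimer(ev.Wait)
    defer t.Stop()
    select {
    case <-t.C:
        return nil
    case <-ctx.Done():
        return ctx.Err() // user cancelled during backoff: stop immediately
    }
}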
Separate request retry from stream retry
Codex is the cleanest example of separate budgets. Its provider config exposes request retries, stream retries, and stream idle timeout as distinct values. In model-provider-info/src/lib.rs:
const DEFAULT_STREAM_IDLE_TIMEOUT_MS: u64 = 300_000;
const DEFAULT_STREAM_MAX_RETRIES: u64 = 5;
const DEFAULT_REQUEST_MAX_RETRIES: u64 = 4;
const MAX_STREAM_MAX_RETRIES: u64 = 100;
const MAX_REQUEST_MAX_RETRIES: u64 = 100;
The low-level request retry policy in codex-client/src/retry.rs is deliberately narrow:
pub struct RetryOn {
    pub retry_429: bool,
    pub retry_5xx: bool,
    pub retry_transport: bool,
}
Then the model-provider layer chooses:
retry_429: false,
retry_5xx: true,
retry_transport: true,
That retry_429: false may look surprising if you come from a generic HTTP client. But for an agent harness it can be the right boundary. A 429 may represent usage exhaustion or a long quota window, not merely “wait 200ms and try again.” Codex handles some usage-limit paths higher up, where it can update rate-limit state and surface a meaningful event.
A stream retry is a different animal. The initial POST may have succeeded. The provider may be generating. You may have seen some events. The stream can go idle or break. Codex wraps SSE polling with an idle timeout in codex-api/src/sse/responses.rs:
let response = timeout(idle_timeout, stream.next()).await;
That line is boring in the best possible way. Without an idle timeout, an agent can hang forever on a socket that is technically open and spiritually dead.
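The Go equivalent is similarly small. A sketch, assuming a next() that blocks on the provider stream; a real client would also close the underlying connection when the timer fires so the abandoned read does not linger.

import (
    "context"
    "fmt"
    "time"
)

// readWithIdleTimeout turns a socket that is technically open and spiritually
// dead into an error the stream-retry budget can act on.
func readWithIdleTimeout(ctx context.Context, idle time.Duration, next func() ([]byte, error)) ([]byte, error) {
    type result struct {
        chunk []byte
        err   error
    }
    ch := make(chan result, 1)
    go func() {
        chunk, err := next()
        ch <- result{chunk, err}
    }()
    select {
    case r := <-ch:
        return r.chunk, r.err
    case <-time.After(idle):
        return nil, fmt.Errorf("stream idle for %s", idle)
    case <-ctx.Done():
        return nil, ctx.Err()
    }
}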
Harnesses should usually have at least these knobs, even if hidden behind defaults:
- request creation retry budget
- stream disconnect retry budget
- stream idle timeout
- transport fallback budget
- maximum server-requested retry delay worth honoring automatically
One retry counter for all of that is too blunt.
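A config shape that keeps those budgets separate might look like the following. The names are illustrative, and the defaults loosely echo the Codex and Claude Code numbers quoted above rather than any harness's actual config.

import "time"

// StreamPolicy scopes each budget so it can be reasoned about and tested alone.
type StreamPolicy struct {
    RequestMaxRetries  int           // retries for creating the request, pre-commit
    StreamMaxRetries   int           // retries for a stream that drops after starting
    StreamIdleTimeout  time.Duration // how long a silent stream is tolerated
    TransportFallbacks int           // how many times to degrade transport (e.g. WS to HTTP)
    MaxServerDelay     time.Duration // largest Retry-After honored automatically
}

var defaultStreamPolicy = StreamPolicy{
    RequestMaxRetries:  4,
    StreamMaxRetries:   5,
    StreamIdleTimeout:  5 * time.Minute,
    TransportFallbacks: 1,
    MaxServerDelay:     60 * time.Second,
}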
High-level agent retry is a different layer again
Pi retries at the agent-session layer. When the agent ends with a retryable assistant error, AgentSession removes the final assistant error from active state, waits with exponential backoff, and calls agent.continue():
// Remove error message from agent state (keep in session for history)
const messages = this.agent.state.messages;
if (messages.length > 0 && messages[messages.length - 1].role === "assistant") {
  this.agent.replaceMessages(messages.slice(0, -1));
}
await sleep(delayMs, this._retryAbortController.signal);
setTimeout(() => {
  this.agent.continue().catch(() => {})
}, 0)
Its defaults are understandable: enabled, 3 retries, 2-second base delay. That is good product engineering for a local coding agent. The user sees retry events; the prompt waits for retry completion; abort cancels retry sleep.
But this layer should not be confused with transport retry. Pi is not retrying a failed TCP connection before commit. It is continuing the agent after an error message. That can be exactly what you want. It can also hide the difference between “nothing happened” and “the model partially acted then failed.”
OpenCode’s processor takes an even more aggressive approach. In packages/opencode/src/session/processor.ts, retryable errors update session status and continue the processing loop:
const retry = SessionRetry.retryable(error)
if (retry !== undefined) {
  attempt++
  const delay = SessionRetry.delay(attempt, ...)
  await SessionStatus.set(input.sessionID, {
    type: "retry",
    attempt,
    message: retry,
    next: Date.now() + delay,
  })
  await SessionRetry.sleep(delay, input.abort).catch(() => {})
  continue
}
The upside is liveness. The system tries very hard to keep the agent moving.
The risk is that “very hard” needs a ceiling, or at least a clearly intentional no-ceiling mode. Infinite retry can be right for a background worker waiting for a provider outage to clear. It is wrong for an interactive session silently burning time and ambiguity. If the loop is unbounded, make that visible as a policy, not an accident.
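One way to make the no-ceiling mode a policy rather than an accident is to force the caller to say so explicitly. A small sketch with hypothetical names:

// AgentRetryPolicy makes unbounded retry an explicit choice, not a missing check.
type AgentRetryPolicy struct {
    MaxAttempts int  // bounded budget for interactive sessions
    Unbounded   bool // daemon mode: keep trying until cancelled, and say so in the UI
}

func (p AgentRetryPolicy) shouldRetry(attempt int) bool {
    if p.Unbounded {
        return true
    }
    return attempt < p.MaxAttempts
}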
Preserve state only when you know whose state it is
Stateful provider APIs tempt harnesses into clever continuation. Previous response ids, connection-local deltas, sticky routing tokens, cached WebSocket sessions: all useful, all footguns.
Codex is careful about this. Its ModelClientSession is turn-scoped. The comment in core/src/client.rs says:
Create a fresh ModelClientSession for each Codex turn. Reusing it across turns would replay the previous turn’s sticky-routing token into the next turn, which violates the client/server contract and can cause routing bugs.
That is the right instinct. Continuation state is not a cache blob you smear across the app. It has a scope.
term-llm has similar caution around WebSocket previous-response state. prepareWebSocketContinuationLocked only uses previous_response_id when the non-input fields match and the new input is an extension of the previous request/output baseline. If tool schemas, model params, or comparable request fields changed, it sends full state instead.
That is a general rule: provider continuation state is valid only under the provider’s contract and your own invariants. If you cannot prove the next request is an extension of the previous one, do not use previous-state magic. Send full history or start a new chain.
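That rule translates into a guard that runs before any previous-response id is reused. A sketch with hypothetical fields: if anything outside the appended input changed, or the new input is not a strict extension of the old one, send full state instead.

// requestBaseline is a hypothetical fingerprint of the previous request.
type requestBaseline struct {
    Model      string
    ToolsHash  string   // hash of tool schemas
    ParamsHash string   // hash of sampling/model params
    Input      []string // prior request/output items, in order
}

// isExtension reports whether continuation state from prev is valid for next.
func isExtension(prev, next requestBaseline) bool {
    if prev.Model != next.Model || prev.ToolsHash != next.ToolsHash || prev.ParamsHash != next.ParamsHash {
        return false // non-input fields changed: start a new chain with full history
    }
    if len(next.Input) < len(prev.Input) {
        return false // history was edited or truncated: not an extension
    }
    for i, item := range prev.Input {
        if next.Input[i] != item {
            return false
        }
    }
    return true // safe to send only the new suffix against the previous response id
}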
Keep retry visible in the event model
Retries should not be log lines only.
term-llm emits EventRetry with attempt, max attempts, and wait seconds. Its CLI and JSON stream surface it. Pi emits auto_retry_start and auto_retry_end. OpenCode uses SessionStatus with type: "retry", attempt, message, and next retry time. Codex sends reconnect warnings like Reconnecting... {retries}/{max_retries}.
These are not merely nice UX. They are the observable state of a distributed system. If the user cannot distinguish “the model is thinking,” “a tool is running,” “we are backing off after a 503,” and “the stream died but the run is still alive,” they will invent a worse explanation.
Also: retry events are where cancellation can attach. A visible retry with a countdown can be stopped. An invisible sleep is just latency with an alibi.
A practical checklist
If you are building an agent harness, I would start with these questions.
1. What counts as commit?
Define it in code. Text delta? Reasoning delta? Tool call emitted? Tool execution started? External message sent? File edit written?
Before commit: retry may be safe.
After commit: retry should require a stronger mechanism than “same request again.”
2. Can the client disconnect without killing the run?
If not, decide whether that is genuinely what you want. For long-running agents, it usually is not.
Use run ids, sequence-numbered event logs, and explicit cancellation.
3. Do you distinguish retry from resume?
Retry repeats an uncommitted operation.
Resume continues from durable progress.
If you use provider response ids or a persisted transcript, call that resume and design it as resume.
4. Are transport failures routed differently from model failures?
A WebSocket dying is not the same as a model refusing a prompt. HTTP fallback should not be mixed into prompt repair logic.
5. Do stream reads have idle timeouts?
If a stream can be silent for legitimate reasons, set the timeout generously. But set one. A dead socket that never returns is not a philosophical problem; it is a missing timer.
6. Are retry budgets scoped?
Separate request retry from stream retry from agent-level continue. Make the defaults boring. Cap user configuration unless there is an intentional “daemon mode.”
7. Do you honor Retry-After sanely?
Parse numeric seconds. Parse HTTP-date. Decide how long is too long for automatic waiting. Surface the wait.
8. Is fallback sticky?
If you decide a transport is unhealthy, remember that decision for the appropriate scope. Otherwise every request rediscovers the same failure.
9. Can retry sleep be aborted?
If the user cancels while you are backing off, stop backing off. This sounds obvious until you find the goroutine still faithfully waiting to do the thing nobody wants anymore.
10. Do tests prove the semantics, not just the happy path?
Write tests for:
- failure before visible output retries
- failure after visible output does not retry
- failure after tool call does not duplicate the tool
- WebSocket connect failure falls back to HTTP
- fallback stays disabled for subsequent streams
- stream idle timeout releases the turn
- next user turn works after stream error
- Retry-After numeric and date parsing
These tests are cheaper than explaining to someone why their agent edited the same file twice.
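A sketch of the first two, written against the buffer-until-commit sketch from the failure-boundary section (hypothetical, not any of these projects' actual tests):

import (
    "errors"
    "testing"
)

func TestPreCommitFailureStaysRetryable(t *testing.T) {
    stream := make(chan attemptEvent, 1)
    stream <- attemptEvent{Err: errors.New("connection reset")}
    close(stream)

    err := forwardAttempt(stream, func(Event) { t.Fatal("nothing should be emitted") })
    var committed *committedError
    if errors.As(err, &committed) {
        t.Fatal("pre-commit failure must stay retryable")
    }
}

func TestPostCommitFailureIsNotRetried(t *testing.T) {
    stream := make(chan attemptEvent, 2)
    stream <- attemptEvent{Event: Event{Kind: "text_delta", Data: "Hel"}}
    stream <- attemptEvent{Err: errors.New("connection reset")}
    close(stream)

    var emitted int
    err := forwardAttempt(stream, func(Event) { emitted++ })
    var committed *committedError
    if !errors.As(err, &committed) {
        t.Fatal("post-commit failure must not be replayed blindly")
    }
    if emitted != 1 {
        t.Fatalf("expected the delta to reach the caller exactly once, got %d", emitted)
    }
}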
The database analogy
Database people learned this a long time ago. Transactions have commit points. Logs have sequence numbers. Clients can disconnect without necessarily aborting server work. Retrying a transaction after an unknown commit state is dangerous unless operations are idempotent or you have a recovery protocol.
Agent harnesses are walking into the same swamp, except the transaction body is natural language and the side effects include “send this message to a human.” Fun little industry we have here.
The right mental model is not “make the request reliable.” The right mental model is:
- know what committed
- know what can be replayed
- know what must be resumed
- know what must be abandoned
- tell the user which one is happening
A harness that does this will feel less magical in the short term. It will also be the one you trust with longer work.
Retries are easy. Recovery is accounting.