Overview / What We’re Building

Jarvis Voice is a minimal iOS app that gives Sam a natural two-way voice interface to his existing Jarvis AI agent running on a homelab server at 10.0.0.52:8081. The architecture is a thin-router pattern: a cloud-based voice LLM (OpenAI’s gpt-realtime) handles all speech input/output — microphone capture, voice activity detection, speech-to-speech audio generation — but contains zero reasoning logic itself. Every meaningful user request is routed via a single ask_jarvis() tool call over Tailscale VPN directly to term-llm’s OpenAI-compatible HTTP API (/v1/chat/completions or /v1/responses) on the Jarvis backend, which does the actual thinking. The voice LLM speaks Jarvis’s response back to the user. The iOS app is the WebRTC audio transport layer, the tool-call dispatcher, and nothing else. The result: Sam speaks naturally, hears Jarvis respond in a natural voice, with sub-7-second end-to-end latency for most requests, zero cloud storage of conversation content, and full access to all of Jarvis’s existing capabilities without re-implementing them.


Voice LLM API Options

Comparison Table

| Dimension | OpenAI gpt-realtime | xAI Grok Voice Agent | Gemini Live 2.5 Flash | ElevenLabs Conv. AI | Hume EVI 3 |
|---|---|---|---|---|---|
| Architecture | True S2S | True S2S | True S2S (Native Audio) | Pipeline (STT→LLM→TTS) | True S2S |
| Transport | WebRTC + WebSocket + SIP | WebSocket + LiveKit | WebSocket | WebSocket + WebRTC | WebSocket |
| Tool calling | ✅ Native, first-class | ✅ Native + built-in (web/X search) | ✅ Native + Google Search | ✅ Client-side + server-side | ✅ (requires external LLM) |
| Official iOS SDK | ❌ (community: m1guelpf) | ❌ (use LiveKit iOS SDK) | ❌ (DIY WebSocket or Pipecat) | ✅ Native Swift SDK (v2.1.0+) | ✅ HumeAI Swift SDK |
| Pricing | $32/$64 per 1M audio in/out tokens (~$0.10–0.20/min typical) | $0.05/min flat | ~$0.015–0.02/min (cheapest) | $0.04–0.10/min | $0.04–0.07/min |
| Latency (TTFA) | ~200–500ms (no tools) | <700ms avg | 150–400ms (variable) | ~300–600ms (pipeline) | ~200–400ms |
| Context window | 128K tokens (gpt-realtime GA) | S2S managed | 1M tokens | Depends on LLM | Depends on LLM |
| LLM flexibility | ❌ GPT-4o only | ❌ Grok only | ❌ Gemini only | ✅ BYO-LLM | ✅ External LLM (Claude/GPT) |
| Function calling reliability | ⭐⭐⭐⭐⭐ (66.5% on evals, full model) | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ (needs ext. LLM) |
| Voice quality | ⭐⭐⭐⭐ (marin, alloy, etc.) | ⭐⭐⭐⭐ (5 voices: Ara/Rex/Sal/Eve/Leo) | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ (5000+ voices) | ⭐⭐⭐⭐⭐ (clone support) |
| Semantic VAD | ✅ (semantic_vad mode) | | | | |
| Interruption handling | ✅ Auto (server_vad) | | | | |
| Prompt caching | ✅ 94% audio discount on cached | N/A | N/A | | |
| OpenAI API compat | ✅ native | | | | |
| Ecosystem maturity | ⭐⭐⭐⭐⭐ | ⭐⭐ (Dec 2025 launch) | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ |

Analysis

OpenAI gpt-realtime is the clear winner for this use case despite being the most expensive option. Reasons:

  1. Tool calling is the entire architecture — the ask_jarvis() pattern lives or dies on tool call reliability. OpenAI’s GA gpt-realtime model scored 66.5% on function calling evals vs 49.7% for the preview model. xAI Grok is competitive but newer/less proven. The mini model is explicitly worse at function calling — use the full model.

  2. Semantic VAD — gpt-realtime supports semantic_vad, which distinguishes natural mid-sentence pauses from actual sentence endings. For Sam’s use case (complex queries to Jarvis), this matters enormously — you don’t want the model cutting off mid-sentence.

  3. WebRTC as first-class transport — OpenAI explicitly recommends WebRTC for mobile/iOS. The ephemeral key flow is clean. The community Swift reference implementation (m1guelpf/swift-realtime-openai) is production-quality.

  4. Auto-waiting built in — the GA gpt-realtime model automatically says “I’m still waiting on that” if a tool call takes too long. No custom implementation needed.

  5. Prompt caching — 94% discount on cached audio input tokens means long sessions get dramatically cheaper after the first few turns.

Why not ElevenLabs? Despite the excellent native Swift SDK, ElevenLabs is pipeline-based (not true S2S), which adds latency. More importantly, it depends on your LLM choice for function calling — you’d be paying for both ElevenLabs and GPT-4o API costs, at higher combined latency.

Why not Grok? $0.05/min flat rate is appealing, but: (1) launched December 2025, tiny community, (2) no native iOS SDK, (3) LiveKit plugin was Python-only at launch, (4) function calling not as proven as OpenAI’s.

Why not Gemini Live? Cheapest option by far (~$0.015/min) and 1M token context window is incredible, but no native iOS SDK means 1–2 weeks of extra engineering to build the WebSocket layer or deploy a Pipecat relay server.

🏆 Recommendation: OpenAI gpt-realtime via WebRTC

Use gpt-realtime (full model, not mini) via WebRTC with ephemeral keys. Revisit gpt-4o-mini-realtime-preview only after validating that tool calling behavior is acceptable in testing — expect degraded function calling reliability on mini.


The Thin-Router Pattern

The voice LLM is not the brain. It is the mouth and ears of Jarvis. All reasoning, memory, tool use, and response generation happens on the homelab. The voice LLM’s only job is:

  1. Understand Sam’s speech directly as audio (native S2S — no separate transcription step)
  2. Decide: “this is a real request” → call ask_jarvis()
  3. Pass the result back as natural speech

This is better than making the voice LLM do everything, for the reasons compared under “Why Not Make the Voice LLM Do Everything?” below.

Full Architecture Diagram

┌──────────────────────────────────────────────────────────────────────────────┐
│                              iOS Voice App                                    │
│                                                                               │
│  ┌──────────────────────────────┐   ┌────────────────────────────────────┐   │
│  │      SwiftUI Layer           │   │     WebRTC Peer Connection          │   │
│  │  @Observable VoiceViewModel  │   │  RTCPeerConnection (audio track)    │   │
│  │  Pulsating orb / waveform    │   │  RTCDataChannel (JSON events)       │   │
│  │  .idle → .listening →        │   │  Ephemeral key auth                 │   │
│  │  .processing → .speaking     │   │  ICE/DTLS/SRTP encrypted           │   │
│  └──────────────┬───────────────┘   └──────────────────┬─────────────────┘   │
│                 │ @MainActor                            │                     │
│  ┌──────────────▼───────────────────────────────────┐  │                     │
│  │           RealtimeEventHandler (actor)           │  │                     │
│  │  • Registers ask_jarvis() tool in session.update │  │                     │
│  │  • Watches response.output_item.done events      │  │                     │
│  │  • Fires async Task → JarvisClient               │  │                     │
│  │  • Submits conversation.item.create (tool result)│  │                     │
│  │  • Sends response.create to trigger speech       │  │                     │
│  └──────────────┬───────────────────────────────────┘  │                     │
│                 │ async/await                           │                     │
│  ┌──────────────▼───────────────┐                      │                     │
│  │    AVAudioEngine (actor)     │                       │                     │
│  │  AVAudioSession.voiceChat    │                       │                     │
│  │  Hardware AEC + AGC          │                       │                     │
│  │  48kHz Float32 tap           │                       │                     │
│  │  AVAudioConverter → 24kHz    │                       │                     │
│  │  Int16 PCM for WebRTC        │                       │                     │
│  │  AVAudioPlayerNode (TTS out) │                       │                     │
│  └──────────────────────────────┘                      │                     │
│                                                         │                     │
│  ┌──────────────────────────────┐                       │                     │
│  │     JarvisClient (actor)     │                       │                     │
│  │  URLSession with 10s timeout │                       │                     │
│  │  Bearer token from Keychain  │                       │                     │
│  │  Circuit breaker pattern     │                       │                     │
│  │  session_id UUID tracking    │                       │                     │
│  └──────────────┬───────────────┘                       │                     │
└─────────────────│─────────────────────────────────────│─────────────────────┘
                  │                                       │
     Tailscale WireGuard VPN                    WebRTC + DTLS/SRTP
     (System VPN, On-Demand rules)              (UDP, direct path)
                  │                                       │
                  ▼                                       ▼
     ┌────────────────────────────┐        ┌──────────────────────────────┐
     │   Homelab 10.0.0.52        │        │   OpenAI Realtime API        │
     │                            │        │   model: gpt-realtime        │
     │  ┌──────────────────────┐  │        │                              │
     │  │  term-llm HTTP API   │  │        │   Session config:            │
     │  │  /v1/chat/completions│  │        │   - Semantic VAD             │
     │  │  /v1/responses       │  │        │   - Tool: ask_jarvis()       │
     │  │  Bearer + session_id │  │        │   - Voice: marin             │
     │  └──────────┬───────────┘  │        │   - 128K context             │
     │             │              │        └──────────────────────────────┘
     │  ┌──────────▼───────────┐  │
     │  │  Jarvis Agent        │  │
     │  │  (term-llm)          │  │
     │  │  memory/tools/search │  │
     │  └──────────────────────┘  │
     └────────────────────────────┘

Why Not Make the Voice LLM Do Everything?

| Approach | Voice LLM Does Everything | Thin-Router (Recommended) |
|---|---|---|
| Jarvis memory/tools | Duplicated or lost | Fully preserved |
| Cost per complex query | High (audio tokens for reasoning) | Low (audio tokens only for routing) |
| Jarvis upgrades | Require app update | Transparent |
| Context window burn | Fast (audio tokens expensive) | Slow (minimal turns) |
| Session length | Short (~15 min before context fills) | Longer (Jarvis manages its own context) |

The Tool Calling Mechanism

The ask_jarvis() Tool Definition

```json
{
  "type": "function",
  "name": "ask_jarvis",
  "description": "Routes any meaningful user request to the Jarvis AI agent backend running on Sam's homelab. Use this tool for EVERY real question, task, or request. Do not attempt to answer from your own knowledge.",
  "parameters": {
    "type": "object",
    "properties": {
      "query": {
        "type": "string",
        "description": "The user's complete request in natural language, including all relevant context they stated (names, dates, quantities, locations). Do not abbreviate or reframe."
      }
    },
    "required": ["query"]
  }
}
```
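This definition is registered once per session inside a `session.update` client event on the data channel. A minimal sketch follows; the wrapper function name, the abbreviated `instructions` string, and the trimmed-down tool description are illustrative, not the full session config:

```swift
import Foundation

// Sketch: build the session.update event that registers ask_jarvis().
// Sent once over the RTCDataChannel after `session.created` arrives.
func registerJarvisToolEvent() throws -> Data {
    let sessionUpdate: [String: Any] = [
        "type": "session.update",
        "session": [
            "instructions": "Route every real request through ask_jarvis().",
            "tools": [[
                "type": "function",
                "name": "ask_jarvis",
                "description": "Routes any meaningful user request to the Jarvis AI agent backend.",
                "parameters": [
                    "type": "object",
                    "properties": ["query": ["type": "string"]],
                    "required": ["query"]
                ]
            ]],
            "tool_choice": "auto"   // let the model decide when to call it
        ]
    ]
    return try JSONSerialization.data(withJSONObject: sessionUpdate)
}
```

The returned bytes go straight onto the data channel; the same pattern covers any later mid-session tool changes.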

Complete Event Flow (Source-Verified from OpenAI Docs)

┌─────────────────────────────────────────────────────────────────────────────┐
│ PHASE 1: USER SPEAKS                                                         │
│                                                                               │
│  Sam: "Hey, what's on my calendar tomorrow?"                                 │
│                                                                               │
│  iOS mic → PCM16 24kHz → WebRTC audio track → OpenAI Realtime API           │
│                                                                               │
│  Server events (client receives):                                             │
│    ← input_audio_buffer.speech_started                                       │
│    ← input_audio_buffer.speech_stopped    (semantic_vad fires)               │
│    ← input_audio_buffer.committed                                            │
│                                                                               │
├─────────────────────────────────────────────────────────────────────────────┤
│ PHASE 2: MODEL DECIDES TO CALL TOOL                                          │
│                                                                               │
│  ← response.created                                                          │
│  ← response.output_item.added  { type: "function_call", name: "ask_jarvis" }│
│  ← response.function_call_arguments.delta  × N  (streaming JSON)            │
│     e.g. delta: '{"q'  →  '{"query'  →  '{"query":"What'  → ...            │
│  ← response.function_call_arguments.done                                     │
│     final: { "query": "What is on my calendar tomorrow?" }                   │
│  ← response.output_item.done  {                                              │
│       type: "function_call",                                                 │
│       name: "ask_jarvis",                                                    │
│       call_id: "call_abc123",                                                │
│       arguments: "{\"query\":\"What is on my calendar tomorrow?\"}"          │
│     }                                                                        │
│  ← response.done  (status: "completed" — model spoke filler, stopped)       │
│                                                                               │
│  [CONCURRENTLY: model has already spoken filler phrase audio]                │
│  [e.g. "Let me check with Jarvis." plays while HTTP is in flight]            │
│                                                                               │
├─────────────────────────────────────────────────────────────────────────────┤
│ PHASE 3: iOS APP CALLS JARVIS                                                │
│                                                                               │
│  Task {                                                                       │
│    POST http://10.0.0.52:8081/v1/chat/completions                            │
│    Authorization: Bearer <token-from-keychain>                               │
│    Content-Type: application/json                                            │
│    session_id: <jarvis-session-uuid>                                         │
│    { "messages":[{"role":"user","content":"What is on my calendar tomorrow?"}],│
│      "stream": false }                                                      │
│  }                                                                            │
│                                                                               │
│  ← HTTP 200 { "choices":[{"message":{"content":"You have a dentist..."}}], ... }│
│                                                                               │
├─────────────────────────────────────────────────────────────────────────────┤
│ PHASE 4: APP SUBMITS TOOL RESULT                                             │
│                                                                               │
│  → conversation.item.create {                                                │
│      "type": "conversation.item.create",                                     │
│      "item": {                                                               │
│        "type": "function_call_output",                                       │
│        "call_id": "call_abc123",                                             │
│        "output": "You have a dentist at 10am and team standup at 2pm."      │
│      }                                                                       │
│    }                                                                         │
│                                                                               │
│  → response.create {}                                                        │
│                                                                               │
├─────────────────────────────────────────────────────────────────────────────┤
│ PHASE 5: MODEL SPEAKS THE ANSWER                                             │
│                                                                               │
│  ← response.created                                                          │
│  ← response.output_audio.delta × N   (streamed PCM16 audio chunks)          │
│  ← response.output_audio_transcript.delta × N  (streamed transcript text)   │
│  ← response.done                                                             │
│                                                                               │
│  [iOS plays audio chunks via AVAudioPlayerNode as they arrive]               │
└─────────────────────────────────────────────────────────────────────────────┘

Key Implementation Detail (Swift)

```swift
// Decoded arguments for the ask_jarvis() tool call
struct JarvisArgs: Decodable {
    let query: String
}

func handleRealtimeEvent(_ event: RealtimeServerEvent) {
    guard case .responseOutputItemDone(let item) = event,
          item.type == "function_call",
          item.name == "ask_jarvis",
          let callId = item.callId,
          let argsData = item.arguments?.data(using: .utf8),
          let args = try? JSONDecoder().decode(JarvisArgs.self, from: argsData)
    else { return }

    Task {
        do {
            let result = try await jarvisClient.ask(
                query: args.query,
                sessionId: currentSessionId,
                timeout: 10.0
            )
            // Truncate long responses for voice
            let voiceOutput = result.response.truncatedForVoice(maxWords: 150)
            await submitToolResult(callId: callId, output: voiceOutput)
            await sendResponseCreate()
        } catch {
            // Errors are spoken back to the user as natural language
            let errorMsg = errorMessage(for: error)
            await submitToolResult(callId: callId, output: errorMsg)
            await sendResponseCreate()
        }
    }
}

func submitToolResult(callId: String, output: String) async {
    let event: [String: Any] = [
        "type": "conversation.item.create",
        "item": [
            "type": "function_call_output",
            "call_id": callId,
            "output": output
        ]
    ]
    guard let data = try? JSONSerialization.data(withJSONObject: event) else { return }
    await dataChannel.send(data)   // thin wrapper over RTCDataChannel
}
```
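The `sendResponseCreate()` call above is never defined. A minimal sketch, which also gates on the `conversation_already_has_active_response` error described in the gotchas table; `isResponseActive` is a hypothetical flag you would set on `response.created` and clear on `response.done` / `response.cancelled`:

```swift
import Foundation

// Sketch: trigger the model to speak once the tool result is submitted.
// `isResponseActive` and `dataChannel` are assumed surrounding state.
func sendResponseCreate() async {
    // Never issue response.create while a response is still in flight
    guard !isResponseActive else { return }
    let event: [String: Any] = ["type": "response.create"]
    if let data = try? JSONSerialization.data(withJSONObject: event) {
        await dataChannel.send(data)
    }
}
```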

Latency Breakdown Table

| Stage | Latency (LAN) | Latency (4G/5G Cellular) | Notes |
|---|---|---|---|
| Semantic VAD fires after speech ends | 200–600ms | 200–600ms | Semantic VAD more accurate, slightly slower than server_vad |
| Model processes audio + decides to call tool | 150–400ms | 150–400ms | Included in response initiation |
| Model speaks filler phrase | 800ms–1.5s | 800ms–1.5s | Concurrent with HTTP call below |
| HTTP to Jarvis via Tailscale WireGuard | 5–30ms | 80–300ms | LAN: nearly instant. Cellular: DERP relay may add latency |
| Jarvis LLM inference (Claude Sonnet) | 1,000–4,000ms | 1,000–4,000ms | Dominant cost for complex queries |
| Model generates first audio byte after tool result | 200–500ms | 200–500ms | After response.create sent |
| **Total perceived gap** | **~2–5s** | **~3–7s** | Filler phrase masks Jarvis inference time |

Key insight: The filler phrase (1–1.5s) buys you almost all the time you need for Jarvis to respond on LAN. The user hears “Let me check with Jarvis” and then almost immediately hears the answer. The silence gap that needs to be hidden is usually under 2 seconds.
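The LAN numbers can be sanity-checked by summing the serial stages and subtracting the filler phrase that plays concurrently with the HTTP call and Jarvis inference. A back-of-the-envelope sketch; all range constants come from the table above:

```swift
// Back-of-the-envelope check of the LAN best/worst case, in milliseconds
let vad = 200...600           // semantic VAD fires after speech ends
let decide = 150...400        // model decides to call the tool
let filler = 800...1500       // filler phrase, concurrent with HTTP + inference
let http = 5...30             // Tailscale hop on LAN
let jarvis = 1000...4000      // Jarvis LLM inference
let firstAudio = 200...500    // first audio byte after response.create

// HTTP + inference run while the filler plays; only the overhang is silence
let silenceBest  = max(0, http.lowerBound + jarvis.lowerBound - filler.upperBound)
let silenceWorst = max(0, http.upperBound + jarvis.upperBound - filler.lowerBound)

// Perceived gap = VAD + decision + filler + residual silence + first audio
let best  = vad.lowerBound + decide.lowerBound + filler.lowerBound + silenceBest + firstAudio.lowerBound
let worst = vad.upperBound + decide.upperBound + filler.upperBound + silenceWorst + firstAudio.upperBound
// best = 1350ms, worst = 6230ms; the table's "~2–5s" typical sits inside this envelope
```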


Filler Phrase Strategy

The Problem

When ask_jarvis() is called, the voice LLM stops speaking and waits for the tool result. Without intervention, Sam hears dead silence for 2–7 seconds. This feels broken.

The Solution: Pre-Call Verbal Acknowledgment

Instruct the model in the system prompt to say one short phrase before calling the tool. This speech happens in the response.output_audio.delta stream that accompanies the function call. When response.done arrives (marking the end of the filler + the tool call), you fire the HTTP request to Jarvis.

System Prompt Language

```
# Tool Usage
Before calling ask_jarvis(), always speak one short, natural acknowledgment.
These must be varied — never use the same phrase twice in a row:
- "Let me ask Jarvis."
- "One moment."
- "Checking now."
- "On it."
- "Let me look that up."
- "Sure, give me a second."

Then call ask_jarvis() immediately. Do not say anything else before calling.
```

The Auto-Waiting Feature (Built Into gpt-realtime)

From OpenAI’s official prompting docs (confirmed source):

> “If you ask the model for the results of a function call, it’ll say something like ‘I’m still waiting on that.’ This feature is automatically enabled for new models — no changes necessary.”

This means: if Jarvis takes longer than expected (e.g., complex multi-step query), the model will naturally fill the silence with “I’m still waiting on that…” without any code on your end. This is a gpt-realtime GA feature, not available in the preview models.

What NOT To Do


iOS Audio Pipeline

AVAudioEngine vs AVAudioSession — The Mental Model

These are not alternatives — they are two layers of the same stack.

| Layer | Role | What It Controls |
|---|---|---|
| AVAudioSession | OS contract — tells iOS how you intend to use audio | Routing, interruption policy, AEC mode, category |
| AVAudioEngine | Signal graph — the actual audio processing pipeline | Nodes, taps, converters, players |

Configure AVAudioSession first, then build AVAudioEngine on top.

The Correct Setup for Full-Duplex Voice AI

```swift
// Step 1: Configure the session
let session = AVAudioSession.sharedInstance()
try session.setCategory(.playAndRecord,
                        mode: .voiceChat,        // ← KEY: enables hardware AEC + AGC
                        options: [.defaultToSpeaker, .allowBluetooth])
try session.setPreferredSampleRate(24000)         // Request 24kHz (hardware may ignore)
try session.setPreferredIOBufferDuration(0.01)    // 10ms buffer = ~240 samples
try session.setActive(true)

// Step 2: Build the engine
let engine = AVAudioEngine()

// Step 3: Enable voice processing (AEC) on the input node
// MUST be called while engine is STOPPED
try engine.inputNode.setVoiceProcessingEnabled(true)

// Step 4: Tap at NATIVE hardware format (48kHz Float32 — DO NOT try to force 24kHz here)
let nativeFormat = engine.inputNode.inputFormat(forBus: 0)  // 48kHz Float32 mono

// Step 5: Set up converter to OpenAI's required format
let targetFormat = AVAudioFormat(commonFormat: .pcmFormatInt16,
                                  sampleRate: 24000,
                                  channels: 1,
                                  interleaved: true)!
let converter = AVAudioConverter(from: nativeFormat, to: targetFormat)!

// Step 6: Install tap and stream to WebRTC
engine.inputNode.installTap(onBus: 0, bufferSize: 4800, format: nativeFormat) { buffer, _ in
    let frameCount = AVAudioFrameCount(24000) * buffer.frameLength /
                     AVAudioFrameCount(nativeFormat.sampleRate)
    let convertedBuffer = AVAudioPCMBuffer(pcmFormat: targetFormat,
                                           frameCapacity: frameCount)!
    var error: NSError?
    var consumed = false
    converter.convert(to: convertedBuffer, error: &error) { _, outStatus in
        if !consumed { outStatus.pointee = .haveData; consumed = true; return buffer }
        outStatus.pointee = .noDataNow; return nil
    }
    // convertedBuffer.int16ChannelData![0] = raw Int16 PCM at 24kHz
    // Base64-encode and send over WebSocket, OR feed directly to WebRTC audio track
    let audioData = Data(bytes: convertedBuffer.int16ChannelData![0],
                         count: Int(convertedBuffer.frameLength) * 2)
    realtimeSession.sendAudio(audioData)
}

// Step 7: Attach playback node (for TTS output from OpenAI)
let playerNode = AVAudioPlayerNode()
engine.attach(playerNode)
engine.connect(playerNode, to: engine.mainMixerNode, format: targetFormat)

try engine.start()
```

PCM16 at 24kHz — The Numbers

```
10ms frame  =  240 samples × 2 bytes =   480 bytes
20ms frame  =  480 samples × 2 bytes =   960 bytes  (good for VAD processing)
100ms chunk = 2400 samples × 2 bytes = 4,800 bytes  (good WebSocket granularity)

Uplink bandwidth:   ~48 KB/s (24kHz mono PCM16)
Downlink bandwidth: ~48 KB/s (AI voice response, same format)
```
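These sizes fall straight out of sampleRate × duration × bytesPerSample; a quick sketch:

```swift
import Foundation

// PCM16 mono frame sizing at 24 kHz
func pcm16Bytes(sampleRate: Int = 24_000, ms: Int) -> Int {
    let samples = sampleRate * ms / 1000
    return samples * MemoryLayout<Int16>.size   // 2 bytes per sample
}

// pcm16Bytes(ms: 10)  == 480
// pcm16Bytes(ms: 20)  == 960
// pcm16Bytes(ms: 100) == 4_800
// Sustained bandwidth: 24_000 samples/s × 2 bytes = 48_000 bytes/s ≈ 48 KB/s each way
```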

AEC — The Right Configuration

Use .voiceChat mode and setVoiceProcessingEnabled(true) together. This is the path of least resistance and handles 95% of echo cancellation needs.

Do NOT use setPrefersEchoCancelledInput(true) — this is iOS 18.2+ only, hardware-gated to 2024 iPhones, and cannot be combined with Voice Processing IO APIs. It’s designed for music apps, not voice AI.

Key Gotchas

| Gotcha | Detail | Fix |
|---|---|---|
| The disconnect bug (WebSocket code 1000) | URLSessionWebSocketTask closes immediately after first audio packet | Audio is not properly formatted as PCM16 Int16. Ensure base64 encodes raw Int16 (little-endian) bytes, not Float32. Confirm session.created received before sending audio. |
| Volume drop | VoiceProcessingIO reduces playback volume ~3–6dB | This is by design (headroom for AEC). Adjust playerNode.volume upward, or use engine.mainMixerNode.outputVolume. |
| Cannot force tap format | Setting custom format on installTap silently fails or produces zero buffers | Always tap at native 48kHz Float32, use AVAudioConverter to resample. |
| Route change resets AEC | Headphone insertion/removal requires engine restart | Listen to AVAudioSession.routeChangeNotification, pause engine, call try? session.setActive(true), restart engine. |
| Engine config change | Hardware change (USB mic, headphones) auto-stops engine | Listen to AVAudioEngineConfigurationChangeNotification, rewire graph and restart. |
| Media services reset | Rare but possible — iOS kills audio server | Listen to AVAudioSession.mediaServicesWereResetNotification, full teardown + rebuild. |
| Bluetooth A2DP → HFP | .allowBluetooth forces BT into HFP (narrowband) for AEC | Expected behavior. HFP = 8kHz or 16kHz voice profile. A2DP = high quality but no AEC. |
| conversation_already_has_active_response | Sending response.create while response in flight | Always gate response.create on response.done or response.cancelled |
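The three notification-driven gotchas share one recovery shape. A hedged sketch; `restartEngine()` and `rebuildAudioStack()` are hypothetical helpers standing in for your app's actual teardown/restart logic:

```swift
import AVFoundation

// Sketch: one observer per notification-driven recovery path.
func installAudioRecoveryObservers(engine: AVAudioEngine) {
    let nc = NotificationCenter.default

    // Route change (headphones in/out): pause, reactivate session, restart
    nc.addObserver(forName: AVAudioSession.routeChangeNotification,
                   object: nil, queue: .main) { _ in
        engine.pause()
        try? AVAudioSession.sharedInstance().setActive(true)
        restartEngine()
    }

    // Hardware configuration change (USB mic, etc.): rewire graph and restart
    nc.addObserver(forName: .AVAudioEngineConfigurationChange,
                   object: engine, queue: .main) { _ in
        restartEngine()
    }

    // Media services reset: full teardown and rebuild
    nc.addObserver(forName: AVAudioSession.mediaServicesWereResetNotification,
                   object: nil, queue: .main) { _ in
        rebuildAudioStack()
    }
}
```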

WebRTC vs WebSocket for iOS

Why WebRTC Wins

| Dimension | WebRTC | WebSocket |
|---|---|---|
| Packet loss handling | Built-in FEC (Opus codec), can drop late packets | TCP: retransmits, causes jitter/delay |
| Head-of-line blocking | None (UDP-based) | Yes (TCP) — a dropped packet stalls all subsequent audio |
| AEC integration | Framework-level AEC built in to WebRTC iOS SDK | Manual (must implement via VoiceProcessingIO) |
| Network transitions | ICE restart handles wifi→cellular gracefully | URLSessionWebSocketTask often drops on network change |
| Jitter buffer | Built in (adaptive) | Must implement manually |
| Latency | Lower (UDP, adaptive bitrate) | Higher (TCP overhead) |
| OpenAI recommendation | ✅ Explicitly recommended for mobile/iOS | “Server-to-server tool” per OpenAI docs |

The OpenAI docs describe WebSocket as a “server-to-server” transport; it remains viable on mobile only if you control the full audio pipeline yourself.

Translated: use WebRTC. WebSocket is for your server talking to OpenAI, not your iOS app.

The Ephemeral Key Flow

```
1. Your backend server:
   POST https://api.openai.com/v1/realtime/client_secrets
   Authorization: Bearer <OPENAI_API_KEY>
   → Returns: { "client_secret": { "value": "ek_xxx...", "expires_at": ... } }

2. iOS app fetches ephemeral key from YOUR backend (never store raw OpenAI key on device)

3. iOS app:
   POST https://api.openai.com/v1/realtime/calls
   Authorization: Bearer ek_xxx
   Content-Type: application/sdp
   Body: <SDP offer from RTCPeerConnection>
   → Returns: SDP answer

4. Set RTCPeerConnection remote description with SDP answer
5. ICE negotiation completes → audio stream is live
6. Tool call events arrive on RTCDataChannel
```

Note: Ephemeral keys have a short TTL (minutes). Generate a new one per session start. Store the raw OpenAI API key server-side (your backend or Keychain-protected relay), never in the iOS app bundle.
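Steps 2–4 of the flow can be sketched as follows. The `https://your-backend.example/realtime-key` endpoint and its `{"value": ...}` response shape are assumptions: a hypothetical backend route that proxies the `client_secrets` call with the real API key:

```swift
import Foundation

// Sketch of the ephemeral-key + SDP exchange (steps 2–4 above).
func connectRealtime(sdpOffer: String) async throws -> String {
    // Fetch a short-lived ephemeral key from YOUR backend (hypothetical endpoint)
    let keyURL = URL(string: "https://your-backend.example/realtime-key")!
    let (keyData, _) = try await URLSession.shared.data(from: keyURL)
    struct KeyResponse: Decodable { let value: String }   // assumed shape
    let ephemeralKey = try JSONDecoder().decode(KeyResponse.self, from: keyData).value

    // Exchange the local SDP offer for OpenAI's SDP answer
    var request = URLRequest(url: URL(string: "https://api.openai.com/v1/realtime/calls")!)
    request.httpMethod = "POST"
    request.setValue("Bearer \(ephemeralKey)", forHTTPHeaderField: "Authorization")
    request.setValue("application/sdp", forHTTPHeaderField: "Content-Type")
    request.httpBody = sdpOffer.data(using: .utf8)
    let (answerData, _) = try await URLSession.shared.data(for: request)

    // Caller sets this string as the RTCPeerConnection remote description
    return String(decoding: answerData, as: UTF8.self)
}
```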

Reference Implementations


Networking: Reaching Jarvis from Outside

The Right Approach: Tailscale VPN On Demand

Tailscale’s VPN On Demand feature (available since Tailscale iOS 1.48, verified January 2026) allows iOS to automatically activate the WireGuard VPN tunnel whenever a DNS query for *.ts.net domains is made. This means:

  1. Sam opens the voice app
  2. App makes HTTP request to jarvis.tail-xxxx.ts.net
  3. iOS VPN On Demand kicks in, activates Tailscale WireGuard tunnel
  4. Request reaches 10.0.0.52:8081 on the homelab
  5. No manual VPN management required

Why tsnet Doesn’t Work on iOS

The tsnet package — Tailscale’s embeddable Go library — allows embedding Tailscale directly into a Go binary so it acts as its own Tailscale node without a separate install. However, tsnet targets server-side Go programs: there is no supported path for embedding it in an iOS app, and iOS requires VPN functionality to go through the NetworkExtension framework in any case.

The correct approach: require the Tailscale iOS app to be installed separately, and use VPN On Demand rules.

MagicDNS Configuration

In your Tailscale admin console, enable MagicDNS. Your homelab server gets a stable DNS name like jarvis.tail-xxxx.ts.net. Configure the iOS VPN On Demand rule to trigger for *.ts.net or *.tail-xxxx.ts.net.

```swift
// iOS: how to call Jarvis via Tailscale — same endpoint and port as on LAN
let jarvisURL = URL(string: "http://jarvis.tail-xxxx.ts.net:8081/v1/chat/completions")!

// The Tailscale VPN On Demand activates automatically
// when this DNS name is resolved. No extra code needed.
```

Bearer Token Management

Store the Jarvis API bearer token in iOS Keychain, not in UserDefaults or app bundle:

```swift
import Security

func storeJarvisToken(_ token: String) {
    let base: [String: Any] = [
        kSecClass as String: kSecClassGenericPassword,
        kSecAttrAccount as String: "jarvis-api-token"
    ]
    // Remove any existing token first — SecItemAdd fails with
    // errSecDuplicateItem if the item already exists
    SecItemDelete(base as CFDictionary)

    var attributes = base
    attributes[kSecValueData as String] = Data(token.utf8)
    attributes[kSecAttrAccessible as String] = kSecAttrAccessibleWhenUnlockedThisDeviceOnly
    SecItemAdd(attributes as CFDictionary, nil)
}
```
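The matching read, used wherever the Authorization header is attached; a minimal sketch:

```swift
import Security

// Sketch: fetch the stored token back out of the Keychain
func loadJarvisToken() -> String? {
    let query: [String: Any] = [
        kSecClass as String: kSecClassGenericPassword,
        kSecAttrAccount as String: "jarvis-api-token",
        kSecReturnData as String: true,
        kSecMatchLimit as String: kSecMatchLimitOne
    ]
    var result: AnyObject?
    guard SecItemCopyMatching(query as CFDictionary, &result) == errSecSuccess,
          let data = result as? Data
    else { return nil }
    return String(decoding: data, as: UTF8.self)
}
```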

Fallback When Homelab Is Unreachable

When Jarvis is down (server off, VPN unreachable, timeout), the tool result must still return something sensible:

```swift
enum JarvisError: Error {
    case timeout
    case serverDown
    case authFailed
    case unknownError(Int)
}

func errorMessage(for error: Error) -> String {
    switch error {
    case JarvisError.timeout:
        return "I wasn't able to reach Jarvis — the request timed out. The homelab may be busy."
    case JarvisError.serverDown:
        return "Jarvis appears to be offline right now. I can't reach the homelab."
    case JarvisError.authFailed:
        return "Authentication to Jarvis failed. You may need to update the API token in settings."
    default:
        return "Something went wrong reaching Jarvis: \(error.localizedDescription)"
    }
}
```

The voice LLM will speak these error messages naturally.
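The `jarvisClient.ask(query:sessionId:timeout:)` call used by the tool handler is never shown. A minimal sketch of the actor, mapping transport failures onto `JarvisError`; the `JarvisResponse` shape and the response decoding are assumptions based on the OpenAI-compatible `/v1/chat/completions` example in the next section:

```swift
import Foundation

// Sketch of the JarvisClient actor used by the ask_jarvis() tool handler.
actor JarvisClient {
    struct JarvisResponse { let response: String }

    private let baseURL: URL
    private let token: String

    init(baseURL: URL, token: String) {
        self.baseURL = baseURL
        self.token = token
    }

    func ask(query: String, sessionId: String, timeout: TimeInterval) async throws -> JarvisResponse {
        var request = URLRequest(url: baseURL.appendingPathComponent("v1/chat/completions"))
        request.httpMethod = "POST"
        request.timeoutInterval = timeout
        request.setValue("Bearer \(token)", forHTTPHeaderField: "Authorization")
        request.setValue("application/json", forHTTPHeaderField: "Content-Type")
        request.setValue(sessionId, forHTTPHeaderField: "session_id")
        let body: [String: Any] = [
            "messages": [["role": "user", "content": query]],
            "stream": false
        ]
        request.httpBody = try JSONSerialization.data(withJSONObject: body)

        do {
            let (data, response) = try await URLSession.shared.data(for: request)
            guard let http = response as? HTTPURLResponse else { throw JarvisError.serverDown }
            switch http.statusCode {
            case 200: break
            case 401, 403: throw JarvisError.authFailed
            default: throw JarvisError.unknownError(http.statusCode)
            }
            // Pull choices[0].message.content out of the OpenAI-compatible payload
            struct Completion: Decodable {
                struct Choice: Decodable {
                    struct Message: Decodable { let content: String }
                    let message: Message
                }
                let choices: [Choice]
            }
            let completion = try JSONDecoder().decode(Completion.self, from: data)
            guard let content = completion.choices.first?.message.content else {
                throw JarvisError.unknownError(http.statusCode)
            }
            return JarvisResponse(response: content)
        } catch let urlError as URLError where urlError.code == .timedOut {
            throw JarvisError.timeout
        } catch is URLError {
            throw JarvisError.serverDown
        }
    }
}
```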


Jarvis Backend API

✅ Correct Finding: term-llm Already Exposes a Full HTTP API

term-llm serve --platform web exposes a production HTTP API that is OpenAI-compatible. No custom HTTP wrapper is required for Jarvis Voice.

Live homelab instance: http://10.0.0.52:8081 (reachable over Tailscale)

Available Endpoints

Auth + Session Continuity

iOS Integration Pattern (Tool Handler)

From the ask_jarvis() tool handler, call term-llm directly at /v1/chat/completions (or /v1/responses) and set session_id to the voice-session UUID tracked in VoiceViewModel.

```swift
let jarvisSessionId = currentVoiceSession.jarvisSessionId
request.setValue("Bearer \(token)", forHTTPHeaderField: "Authorization")
request.setValue(jarvisSessionId, forHTTPHeaderField: "session_id")
```

Example Request/Response (/v1/chat/completions)

```http
POST http://10.0.0.52:8081/v1/chat/completions
Authorization: Bearer <jarvis-token>
Content-Type: application/json
session_id: my-voice-session-uuid

{
  "messages": [
    { "role": "user", "content": "What's on my calendar tomorrow?" }
  ],
  "stream": false
}
```

Response:

```json
{
  "id": "chatcmpl_abc123",
  "object": "chat.completion",
  "model": "jarvis",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "You have a dentist appointment at 10 AM and team standup at 2 PM."
      },
      "finish_reason": "stop"
    }
  ]
}
```

"stream": true is also supported (SSE).

/v1/responses as the Newer Alternative

POST /v1/responses is available and follows OpenAI’s newer Responses API model. Either endpoint works for Jarvis Voice.

Capability Inheritance (Why This Is Great)

The Jarvis agent behind this API already has full memory, tools, web search, and orchestration. By routing voice requests into term-llm, the iOS voice app inherits all of those capabilities immediately — no separate mobile-side reimplementation required.


Conversation State Design

Two-Layer Model

There are two distinct conversation contexts that must be managed independently:

Layer 1: Realtime API Context (Voice Layer)

Layer 2: Jarvis Session (Reasoning Layer)

// VoiceViewModel holds both IDs
struct SessionState {
    let realtimeSessionId: String   // from session.created event
    let jarvisSessionId: String     // UUID sent as session_id header on each /v1/chat/completions call
    let startedAt: Date
}

Session Keying

// Generate at each new voice session start
let jarvisSessionId = UUID().uuidString

// Include in every Jarvis HTTP call as a request header
var request = URLRequest(url: URL(string: "\(jarvisBaseURL)/v1/chat/completions")!)
request.setValue(jarvisSessionId, forHTTPHeaderField: "session_id")

Session Reset/Timeout

When to create a new Jarvis session (new jarvisSessionId):

When NOT to reset:

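The reset policy can be sketched as a single predicate. The 30-minute idle threshold here is an assumed value, not something fixed by the backend — tune to taste.

```swift
import Foundation

// Sketch: decide whether to mint a fresh jarvisSessionId.
// The 30-minute idle threshold is an assumption.
func shouldResetJarvisSession(lastActivity: Date,
                              userRequestedReset: Bool,
                              now: Date = Date(),
                              idleTimeout: TimeInterval = 30 * 60) -> Bool {
    if userRequestedReset { return true }                     // explicit "start over"
    return now.timeIntervalSince(lastActivity) > idleTimeout  // long gap → new topic, new session
}
```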

Response Handling for Voice

The Problem: Jarvis Returns Long Text

Jarvis’s responses are optimized for reading (markdown, lists, long explanations). Voice needs short, natural-sounding prose. A 500-word markdown response read aloud verbatim is terrible UX.

The 150-Word Heuristic

Truncate Jarvis responses at ~150 words for voice output. This is roughly 45–60 seconds of speech at natural speaking pace — enough to convey rich information without making Sam’s arm go numb holding his phone.

extension String {
    func truncatedForVoice(maxWords: Int = 150) -> String {
        // Split on any whitespace (spaces and newlines) so multi-line
        // markdown responses are counted correctly
        let words = self.split(whereSeparator: \.isWhitespace)
        if words.count <= maxWords { return self }

        let truncated = words.prefix(maxWords).joined(separator: " ")
        return truncated + "… I have more details if you want them."
    }
}

Asking Jarvis to Be Concise

Add a system-level instruction to the Jarvis backend prompt/profile:

When called from the voice interface, keep responses under 100 words.
Use plain prose, not markdown. No bullet points, no headers, no code blocks.
If the answer requires more detail, summarize it and offer to elaborate.

Signal this via a request field:

{
  "session_id": "...",
  "query": "...",
  "context": "voice"
}

The backend uses "context": "voice" to prepend a conciseness instruction to the system prompt.

Stripping Markdown

extension String {
    func strippedMarkdown() -> String {
        var result = self
        // Remove code blocks first, before the bold/italic pass can
        // mangle asterisks inside them
        result = result.replacingOccurrences(of: #"```[\s\S]*?```"#, with: "[code block]", options: .regularExpression)
        // Remove markdown headers — (?m) makes ^ match at every line start
        result = result.replacingOccurrences(of: #"(?m)^#{1,6}\s"#, with: "", options: .regularExpression)
        // Remove bold/italic
        result = result.replacingOccurrences(of: #"\*{1,3}(.+?)\*{1,3}"#, with: "$1", options: .regularExpression)
        // Remove bullet points (also anchored per line)
        result = result.replacingOccurrences(of: #"(?m)^\s*[-*+]\s"#, with: "", options: .regularExpression)
        return result.trimmingCharacters(in: .whitespacesAndNewlines)
    }
}

Apply .strippedMarkdown().truncatedForVoice() before submitting as tool result.


State Machine

The Core States

enum VoiceState: Equatable {
    case idle                    // App open, no active listening
    case connecting              // Establishing WebRTC session
    case listening               // Mic active, VAD waiting for speech
    case userSpeaking            // VAD detected speech start
    case processing              // VAD fired, waiting for tool call / response
    case fillerSpeaking          // Model speaking filler phrase (concurrent with HTTP call)
    case waitingForJarvis        // HTTP call in flight, filler done
    case aiSpeaking              // Model speaking final response
    case error(VoiceError)       // Something went wrong
}

State Transitions

idle
  → [user taps mic / opens app] → connecting
  → [session.created received] → listening

listening
  → [input_audio_buffer.speech_started] → userSpeaking

userSpeaking
  → [input_audio_buffer.speech_stopped] → processing
  → [user taps interrupt] → listening (send response.cancel)

processing
  → [response.output_audio.delta starts] → fillerSpeaking
  → [response.output_item.done (function_call)] → start HTTP Task

fillerSpeaking
  → [response.done AND HTTP call complete] → aiSpeaking (response.create sent)
  → [response.done AND HTTP still in flight] → waitingForJarvis

waitingForJarvis
  → [HTTP call returns] → aiSpeaking (response.create sent)
  → [HTTP call fails] → aiSpeaking (error message submitted as tool result)

aiSpeaking
  → [response.done] → listening
  → [input_audio_buffer.speech_started] → userSpeaking (model interrupted)

error(*)
  → [retry] → connecting
  → [give up] → idle

Interruption Handling

When Sam starts speaking while the AI is speaking:

  1. Server sends input_audio_buffer.speech_started
  2. Server auto-cancels the in-progress response (with server_vad/semantic_vad)
  3. response.done arrives with status: "cancelled"
  4. Client stops playing buffered audio immediately
  5. Use conversation.item.truncate to sync the server’s understanding of what was actually heard

// On speech_started while in .aiSpeaking state:
case .inputAudioBufferSpeechStarted:
    if currentState == .aiSpeaking {
        playerNode.stop()           // Stop playing immediately
        playerNode.reset()          // Clear buffer queue
        state = .userSpeaking
        // Server handles response cancellation automatically with semantic_vad
    }

SwiftUI Implementation

@Observable
class VoiceViewModel {
    var state: VoiceState = .idle
    var audioLevel: Float = 0.0       // Drives orb animation
    var transcript: String = ""        // Optional display

    // Actor-based components
    private let audioEngine: AudioEngineActor
    private let realtimeSession: RealtimeSessionActor
    private let jarvisClient: JarvisClientActor

    @MainActor
    func startSession() async {
        state = .connecting
        do {
            let ephemeralKey = try await fetchEphemeralKey()
            try await realtimeSession.connect(with: ephemeralKey)
            try await audioEngine.start()
            state = .listening
        } catch {
            state = .error(.connectionFailed(error))
        }
    }

    @MainActor
    func handleServerEvent(_ event: RealtimeServerEvent) {
        switch event {
        case .speechStarted:
            // Both branches land in .userSpeaking; only stop playback
            // when the model was actually interrupted
            if state == .aiSpeaking { audioEngine.stopPlayback() }
            state = .userSpeaking
        case .speechStopped:
            state = .processing
        case .fillerAudioStarted:
            state = .fillerSpeaking
        case .functionCallReady(let call):
            handleToolCall(call)
        case .responseAudioStarted:
            state = .aiSpeaking
        case .responseDone:
            state = .listening
        default: break
        }
    }
}

UI/UX

Design Philosophy: Radical Minimalism

This is a personal tool for Sam, not a consumer app. No chrome. No tutorial overlays. Just: voice in, voice out.

The Orb

Full-screen pulsating circle that reflects audio state:

struct VoiceOrbView: View {
    @Bindable var vm: VoiceViewModel

    var body: some View {
        TimelineView(.animation) { _ in
            Canvas { ctx, size in
                let center = CGPoint(x: size.width/2, y: size.height/2)
                let baseRadius = min(size.width, size.height) * 0.25

                // Outer glow (breathing animation)
                let breathRadius = baseRadius + CGFloat(vm.audioLevel) * 60

                // Color shifts by state
                let orbColor: Color = switch vm.state {
                case .idle:          .gray.opacity(0.4)
                case .listening:     .blue.opacity(0.6)
                case .userSpeaking:  .green
                case .processing, .fillerSpeaking, .waitingForJarvis: .orange
                case .aiSpeaking:    .purple
                case .error:         .red
                case .connecting:    .gray.opacity(0.6)
                }

                // Draw outer glow
                ctx.fill(
                    Path(ellipseIn: CGRect(x: center.x - breathRadius,
                                          y: center.y - breathRadius,
                                          width: breathRadius * 2,
                                          height: breathRadius * 2)),
                    with: .color(orbColor.opacity(0.3))
                )

                // Draw core orb
                ctx.fill(
                    Path(ellipseIn: CGRect(x: center.x - baseRadius,
                                          y: center.y - baseRadius,
                                          width: baseRadius * 2,
                                          height: baseRadius * 2)),
                    with: .color(orbColor)
                )
            }
        }
        .background(.black)
        .ignoresSafeArea()
    }
}

Transcript (Optional)

Show transcript in a ScrollView below the orb. Only the last 3–4 exchanges. Use ScrollViewReader to auto-scroll to latest. Toggleable with a long-press gesture.
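A sketch of that transcript overlay, assuming a simple `TranscriptEntry` model (illustrative, not part of the view model above):

```swift
import SwiftUI

// Illustrative model for one transcript line.
struct TranscriptEntry: Identifiable {
    let id = UUID()
    let speaker: String   // "You" or "Jarvis"
    let text: String
}

struct TranscriptView: View {
    let entries: [TranscriptEntry]

    var body: some View {
        ScrollViewReader { proxy in
            ScrollView {
                // Only the last few exchanges — this is a debug aid, not a chat log
                ForEach(entries.suffix(4)) { entry in
                    Text("\(entry.speaker): \(entry.text)")
                        .font(.footnote)
                        .frame(maxWidth: .infinity,
                               alignment: entry.speaker == "You" ? .trailing : .leading)
                        .id(entry.id)
                }
            }
            .onChange(of: entries.count) {
                // Auto-scroll to the newest entry
                if let last = entries.last {
                    proxy.scrollTo(last.id, anchor: .bottom)
                }
            }
        }
    }
}
```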

No Complex Navigation

The entire app is:

SwiftUI + @Observable Pattern

Use @Observable macro (iOS 17+) for the view model. No ObservableObject, no @Published everywhere. Cleaner and more performant:

@Observable class VoiceViewModel { ... }  // iOS 17+
// In view:
@Environment(VoiceViewModel.self) var vm
// or
@State private var vm = VoiceViewModel()

Open Source References

| Repo | Why It’s Relevant | Rating |
|---|---|---|
| m1guelpf/swift-realtime-openai | ⭐ Top pick. Full OpenAI Realtime API client in clean Swift 5.9 async/await. Supports both WebSocket and WebRTC connectors. Session management, conversation history, audio capture + playback. Production-quality code. | ⭐⭐⭐⭐⭐ |
| PallavAg/VoiceModeWebRTCSwift | WebRTC-specific OpenAI Realtime implementation. Shows interruption handling, system message config, voice selection. Good reference for the WebRTC data channel event handling pattern. | ⭐⭐⭐⭐ |
| kasimok/AECAudioStream | Drop-in Swift Package for hardware AEC via VoiceProcessingIO. Use this if the setVoiceProcessingEnabled approach has issues. Core Audio wrapper. | ⭐⭐⭐⭐ |
| twilio/voice-quickstart-ios AudioDeviceExample | Production-grade, battle-tested VoiceProcessingIO + AVAudioEngine manual rendering. ObjC, but the most complete AEC reference that exists. Twilio uses this in production for millions of calls. | ⭐⭐⭐⭐⭐ |
| baochuquan/ios-vad | iOS VAD toolkit: WebRTC GMM, Silero DNN, Yamnet DNN models. Useful if you want client-side VAD (fallback or supplement to OpenAI’s server VAD). | ⭐⭐⭐⭐ |
| dmrschmidt/DSWaveformImage | Best waveform rendering library for SwiftUI and UIKit. Real-time waveform from audio buffers. Use for transcript view or orb alternative. | ⭐⭐⭐⭐ |
| lzell/AIProxySwift | Realtime API with ephemeral key pattern — shows how to protect the API key via a proxy. Good security pattern reference if you don’t want to run your own backend for the ephemeral key. | ⭐⭐⭐ |

Start with m1guelpf/swift-realtime-openai. Fork it, strip what you don’t need, add the Jarvis tool call handler. This saves 2–3 weeks of audio pipeline work.


Media Playback — A First-Class Use Case

This app should not feel like a voice-only ChatGPT wrapper. One of the highest-leverage interactions is:

“Play me something interesting.”

That single prompt turns Jarvis from an assistant into a companion. It curates. It surprises. It understands context. And critically: playback happens on-device, in high quality, with proper ducking when Jarvis speaks.

A Deliberate Exception to the Thin-Router Rule

The core architecture is still right: ask_jarvis() handles reasoning. But media control is one of the rare places where local tools should be first-class.

In short: Jarvis curates, iPhone performs.

The Flow

User: "Play something good"
Realtime API → ask_jarvis("recommend something to play — music, podcast, or ambient audio")
Jarvis reasons: time of day, Sam's recent activity, mood cues from conversation, taste history
Returns: {
  "type": "podcast",
  "title": "Darknet Diaries ep 147",
  "url": "https://...",
  "reason": "you haven't listened to this one and you're clearly in a technical mood"
}
iOS app executes client-side tool: play_audio(url, title, type)
AVPlayer / AVAudioEngine streams audio on device
Voice LLM: "Playing Darknet Diaries episode 147. You haven't heard this one."
Media plays. Jarvis goes quiet until spoken to.

Client-Side Media Tools

These run entirely on-device. No homelab round-trip required.

| Tool | Action |
|---|---|
| play_audio(url, title, type) | Stream media URL via AVPlayer |
| pause_playback() | Pause current media |
| resume_playback() | Resume paused media |
| stop_playback() | Stop and clear now playing |
| skip_track() | Advance to next queued item |
| get_now_playing() | Return current media metadata to voice LLM |
| set_volume(level) | Set output volume (0.0–1.0) |
| duck_audio(level) | Lower media level while Jarvis speaks |
| enqueue(url, title) | Append item to local queue |
| clear_queue() | Remove all queued items |

If you want this to feel magical, support at least: play_audio, pause_playback, resume_playback, get_now_playing, and duck_audio in V1.
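A sketch of that V1 subset, assuming a single shared AVPlayer; the class and method names are illustrative, not a fixed API:

```swift
import AVFoundation

// Sketch of the on-device media tools. Each method maps to one
// client-side tool from the table above.
final class MediaPlayer {
    private let player = AVPlayer()
    private(set) var nowPlayingTitle: String?

    func playAudio(url: URL, title: String) {
        player.replaceCurrentItem(with: AVPlayerItem(url: url))
        nowPlayingTitle = title
        player.play()
    }

    func pausePlayback()  { player.pause() }
    func resumePlayback() { player.play() }

    func stopPlayback() {
        player.pause()
        player.replaceCurrentItem(with: nil)
        nowPlayingTitle = nil
    }

    // duck_audio: drop media level while Jarvis speaks, restore after
    func duckAudio(to level: Float = 0.2) { player.volume = level }
    func unduck()                         { player.volume = 1.0 }

    // get_now_playing: metadata handed back to the voice LLM as a tool result
    func getNowPlaying() -> String {
        nowPlayingTitle.map { "Now playing: \($0)" } ?? "Nothing is playing"
    }
}
```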

Audio Ducking (Non-Negotiable)

Ducking is what makes voice + playback feel polished instead of chaotic. Jarvis should never shout over music.

Use one shared audio policy:

// AVAudioSession setup for duplex voice + media
let session = AVAudioSession.sharedInstance()
try session.setCategory(.playAndRecord,
                        mode: .voiceChat,
                        options: [.defaultToSpeaker, .allowBluetooth, .duckOthers])
try session.setActive(true)

.duckOthers also helps when external audio apps are active (Spotify, Podcasts, etc.). For your own internal media player, still apply explicit gain automation so duck timing feels intentional.
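One way to sketch that gain automation: compute a short linear fade instead of snapping the volume. The step count and target level here are arbitrary choices, not derived values.

```swift
import Foundation

// Sketch: intermediate volume levels for a smooth duck.
// Apply each level to player.volume on a short timer
// (e.g. 15 ms apart, so the full duck takes ~150 ms).
func duckLevels(from start: Float, to target: Float, steps: Int = 10) -> [Float] {
    (1...steps).map { start + (target - start) * Float($0) / Float(steps) }
}
```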

What Jarvis Can Pick

The real product value is not playback mechanics. It is selection intelligence.

Podcasts

Music

Ambient / Focus Audio

Creative Modes (This Is Where It Becomes Memorable)

Jarvis should not only obey literal commands. It should program experiences.

The DJ Pattern

The strongest version of this feature is Contextual DJ Jarvis:

  1. Jarvis introduces a pick
  2. App plays it
  3. App detects playback end (AVPlayerItemDidPlayToEndTime)
  4. iOS sends event back to Realtime session
  5. Jarvis picks and tees up the next item with commentary

Example voice transition:

“That was Floating Points. Next up: something with similar texture but more drive — from a Warp compilation in 2019.”

This loop creates a living, personalized station rather than one-off playback commands.
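Steps 3–4 of the loop hinge on a single notification. A sketch, where `notifyTrackEnded` stands in for whatever mechanism injects the end-of-track event into the Realtime session:

```swift
import AVFoundation

// Sketch: detect playback end and hand control back to Jarvis.
final class PlaybackEndObserver {
    private var token: NSObjectProtocol?

    func observe(item: AVPlayerItem, notifyTrackEnded: @escaping () -> Void) {
        token = NotificationCenter.default.addObserver(
            forName: .AVPlayerItemDidPlayToEndTime,
            object: item,
            queue: .main
        ) { _ in
            // e.g. create a conversation item + response.create so
            // Jarvis introduces the next pick
            notifyTrackEnded()
        }
    }

    deinit {
        if let token { NotificationCenter.default.removeObserver(token) }
    }
}
```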

Sources Without Auth (Zero-Config MVP)

For day-one implementation with no OAuth headaches:

This is enough to ship a compelling first version quickly.

Opinionated Build Order

  1. Ship zero-auth playback first (SomaFM + podcasts + queue + ducking)
  2. Add taste memory + novelty scoring (avoid repeats, explain picks)
  3. Implement DJ loop (track end events → next selection)
  4. Only then add OAuth providers (Spotify/YouTube Music)

If you get step 1 and step 3 right, the app already feels special.


Open Questions / Decisions Needed

Sam needs to decide the following before starting:

1. Voice API Choice (High Priority)

The recommendation is gpt-realtime, but verify the cost first: are you comfortable with ~$0.15/min at typical usage? For a personal tool used 30 min/day, that’s ~$4.50/day, or ~$135/month. If that’s too high, Grok at $0.05/min is ~$45/month.

2. Integrate Existing term-llm HTTP API (Critical Path)

This is the most important integration decision, but not a wrapper-building project. term-llm already exposes the required HTTP API on 10.0.0.52:8081.

3. Tailscale vs Other Networking

Tailscale VPN On Demand is the cleanest solution, but it requires the Tailscale app to be installed. Alternatives:

Recommendation: Tailscale. You presumably already use it, and VPN On Demand is well-documented.

4. Session Length Strategy

How long should a voice session last before forcing a reset? Options:

Recommendation: Per-conversation with Jarvis continuity — each Realtime session is fresh, but the jarvis_session_id persists across the app session.

5. Transcript Display: Yes or No?

A persistent transcript is useful for debugging and for accessibility. But it adds UI complexity and storage considerations. Recommendation: implement it as an optional debug overlay, off by default.

6. gpt-realtime vs gpt-4o-mini-realtime

Mini is 8x cheaper for audio tokens but notably weaker at function calling. For a single-tool routing pattern (ask_jarvis always), this might be acceptable. Test mini first, see if tool call reliability is sufficient. If yes, the cost savings are significant ($17/month vs ~$135/month for 30 min/day usage).


| Layer | Technology | Justification |
|---|---|---|
| Voice LLM | OpenAI gpt-realtime (full model) | Best tool calling reliability, semantic VAD, auto-waiting, WebRTC native, ephemeral keys |
| Transport | WebRTC (via WebRTC iOS framework) | Packet loss resilience, built-in jitter buffer, no TCP head-of-line blocking, OpenAI’s own recommendation |
| iOS Audio | AVAudioEngine + .voiceChat mode + setVoiceProcessingEnabled(true) | Hardware AEC, AGC, noise suppression; correct path for duplex voice AI |
| Sample Rate Conversion | AVAudioConverter (48kHz Float32 → 24kHz Int16) | Required — iOS hardware always runs at 48kHz; OpenAI requires 24kHz PCM16 |