Overview / What We’re Building

Jarvis Voice is a minimal iOS app that gives Sam a natural two-way voice interface to his existing Jarvis AI agent running on a homelab server at 10.0.0.52:8081. The architecture is a thin-router pattern: a cloud-based voice LLM (OpenAI’s gpt-realtime) handles all speech input/output — microphone capture, voice activity detection, speech-to-speech audio generation — but contains zero reasoning logic itself. Every meaningful user request is routed via a single ask_jarvis() tool call over Tailscale VPN directly to term-llm’s OpenAI-compatible HTTP API (/v1/chat/completions or /v1/responses) on the Jarvis backend, which does the actual thinking. The voice LLM speaks Jarvis’s response back to the user. The iOS app is the WebRTC audio transport layer, the tool-call dispatcher, and nothing else. The result: Sam speaks naturally, hears Jarvis respond in a natural voice, with sub-7-second end-to-end latency for most requests, zero cloud storage of conversation content, and full access to all of Jarvis’s existing capabilities without re-implementing them.


Voice LLM API Options

Comparison Table

| Dimension | OpenAI gpt-realtime | xAI Grok Voice Agent | Gemini Live 2.5 Flash | ElevenLabs Conv. AI | Hume EVI 3 |
|---|---|---|---|---|---|
| Architecture | True S2S | True S2S | True S2S (Native Audio) | Pipeline (STT→LLM→TTS) | True S2S |
| Transport | WebRTC + WebSocket + SIP | WebSocket + LiveKit | WebSocket | WebSocket + WebRTC | WebSocket |
| Tool calling | ✅ Native, first-class | ✅ Native + built-in (web/X search) | ✅ Native + Google Search | ✅ Client-side + server-side | ✅ (requires external LLM) |
| Official iOS SDK | ❌ (community: m1guelpf) | ❌ (use LiveKit iOS SDK) | ❌ (DIY WebSocket or Pipecat) | ✅ Native Swift SDK (v2.1.0+) | ✅ HumeAI Swift SDK |
| Pricing | $32/$64 per 1M audio in/out tokens (~$0.10–0.20/min typical) | $0.05/min flat | ~$0.015–0.02/min (cheapest) | $0.04–0.10/min | $0.04–0.07/min |
| Latency (TTFA) | ~200–500ms (no tools) | <700ms avg | 150–400ms (variable) | ~300–600ms (pipeline) | ~200–400ms |
| Context window | 128K tokens (gpt-realtime GA) | S2S managed | 1M tokens | Depends on LLM | Depends on LLM |
| LLM flexibility | ❌ GPT-4o only | ❌ Grok only | ❌ Gemini only | ✅ BYO-LLM | ✅ External LLM (Claude/GPT) |
| Function calling reliability | ⭐⭐⭐⭐⭐ (66.5% on evals, full model) | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ (needs ext. LLM) |
| Voice quality | ⭐⭐⭐⭐ (marin, alloy, etc.) | ⭐⭐⭐⭐ (5 voices: Ara/Rex/Sal/Eve/Leo) | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ (5000+ voices) | ⭐⭐⭐⭐⭐ (clone support) |
| Semantic VAD | ✅ (semantic_vad mode) | | | | |
| Interruption handling | ✅ Auto (server_vad) | | | | |
| Prompt caching | ✅ 94% audio discount on cached | N/A | N/A | | |
| OpenAI API compat | ✅ native | | | | |
| Ecosystem maturity | ⭐⭐⭐⭐⭐ | ⭐⭐ (Dec 2025 launch) | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ |

Analysis

OpenAI gpt-realtime is the clear winner for this use case despite being the most expensive option. Reasons:

  1. Tool calling is the entire architecture — the ask_jarvis() pattern lives or dies on tool call reliability. OpenAI’s GA gpt-realtime model scored 66.5% on function calling evals vs 49.7% for the preview model. xAI Grok is competitive but newer/less proven. The mini model is explicitly worse at function calling — use the full model.

  2. Semantic VAD — gpt-realtime supports semantic_vad, which distinguishes natural mid-sentence pauses from actual sentence endings. For Sam’s use case (complex queries to Jarvis), this matters enormously — you don’t want the model cutting off mid-sentence.

  3. WebRTC as first-class transport — OpenAI explicitly recommends WebRTC for mobile/iOS. The ephemeral key flow is clean. The community Swift reference implementation (m1guelpf/swift-realtime-openai) is production-quality.

  4. Auto-waiting built in — the GA gpt-realtime model automatically says “I’m still waiting on that” if a tool call takes too long. No custom implementation needed.

  5. Prompt caching — 94% discount on cached audio input tokens means long sessions get dramatically cheaper after the first few turns.

Why not ElevenLabs? Despite the excellent native Swift SDK, ElevenLabs is pipeline-based (not true S2S), which adds latency. More importantly, it depends on your LLM choice for function calling — you’d be paying for both ElevenLabs and GPT-4o API costs, at higher combined latency.

Why not Grok? $0.05/min flat rate is appealing, but: (1) launched December 2025, tiny community, (2) no native iOS SDK, (3) LiveKit plugin was Python-only at launch, (4) function calling not as proven as OpenAI’s.

Why not Gemini Live? Cheapest option by far (~$0.015/min) and 1M token context window is incredible, but no native iOS SDK means 1–2 weeks of extra engineering to build the WebSocket layer or deploy a Pipecat relay server.

🏆 Recommendation: OpenAI gpt-realtime via WebRTC

Use gpt-realtime (full model, not mini) via WebRTC with ephemeral keys. Revisit gpt-4o-mini-realtime-preview only after validating that tool calling behavior is acceptable in testing — expect degraded function calling reliability on mini.


The Thin-Router Pattern

The voice LLM is not the brain. It is the mouth and ears of Jarvis. All reasoning, memory, tool use, and response generation happens on the homelab. The voice LLM’s only job is:

  1. Understand Sam’s speech directly as audio (native S2S — no separate transcription step)
  2. Decide: “this is a real request” → call ask_jarvis()
  3. Pass the result back as natural speech

This is better than making the voice LLM do everything, for the reasons compared under “Why Not Make the Voice LLM Do Everything?” below.

Full Architecture Diagram

┌──────────────────────────────────────────────────────────────────────────────┐
│                              iOS Voice App                                    │
│                                                                               │
│  ┌──────────────────────────────┐   ┌────────────────────────────────────┐   │
│  │      SwiftUI Layer           │   │     WebRTC Peer Connection          │   │
│  │  @Observable VoiceViewModel  │   │  RTCPeerConnection (audio track)    │   │
│  │  Pulsating orb / waveform    │   │  RTCDataChannel (JSON events)       │   │
│  │  .idle → .listening →        │   │  Ephemeral key auth                 │   │
│  │  .processing → .speaking     │   │  ICE/DTLS/SRTP encrypted           │   │
│  └──────────────┬───────────────┘   └──────────────────┬─────────────────┘   │
│                 │ @MainActor                            │                     │
│  ┌──────────────▼───────────────────────────────────┐  │                     │
│  │           RealtimeEventHandler (actor)           │  │                     │
│  │  • Registers ask_jarvis() tool in session.update │  │                     │
│  │  • Watches response.output_item.done events      │  │                     │
│  │  • Fires async Task → JarvisClient               │  │                     │
│  │  • Submits conversation.item.create (tool result)│  │                     │
│  │  • Sends response.create to trigger speech       │  │                     │
│  └──────────────┬───────────────────────────────────┘  │                     │
│                 │ async/await                           │                     │
│  ┌──────────────▼───────────────┐                      │                     │
│  │    AVAudioEngine (actor)     │                       │                     │
│  │  AVAudioSession.voiceChat    │                       │                     │
│  │  Hardware AEC + AGC          │                       │                     │
│  │  48kHz Float32 tap           │                       │                     │
│  │  AVAudioConverter → 24kHz    │                       │                     │
│  │  Int16 PCM for WebRTC        │                       │                     │
│  │  AVAudioPlayerNode (TTS out) │                       │                     │
│  └──────────────────────────────┘                      │                     │
│                                                         │                     │
│  ┌──────────────────────────────┐                       │                     │
│  │     JarvisClient (actor)     │                       │                     │
│  │  URLSession with 10s timeout │                       │                     │
│  │  Bearer token from Keychain  │                       │                     │
│  │  Circuit breaker pattern     │                       │                     │
│  │  session_id UUID tracking    │                       │                     │
│  └──────────────┬───────────────┘                       │                     │
└─────────────────│─────────────────────────────────────│─────────────────────┘
                  │                                       │
     Tailscale WireGuard VPN                    WebRTC + DTLS/SRTP
     (System VPN, On-Demand rules)              (UDP, direct path)
                  │                                       │
                  ▼                                       ▼
     ┌────────────────────────────┐        ┌──────────────────────────────┐
     │   Homelab 10.0.0.52        │        │   OpenAI Realtime API        │
     │                            │        │   model: gpt-realtime        │
     │  ┌──────────────────────┐  │        │                              │
     │  │  term-llm HTTP API   │  │        │   Session config:            │
     │  │  /v1/chat/completions│  │        │   - Semantic VAD             │
     │  │  /v1/responses       │  │        │   - Tool: ask_jarvis()       │
     │  │  Bearer + session_id │  │        │   - Voice: marin             │
     │  └──────────┬───────────┘  │        │   - 128K context             │
     │             │              │        └──────────────────────────────┘
     │  ┌──────────▼───────────┐  │
     │  │  Jarvis Agent        │  │
     │  │  (term-llm)          │  │
     │  │  memory/tools/search │  │
     │  └──────────────────────┘  │
     └────────────────────────────┘

Why Not Make the Voice LLM Do Everything?

| Approach | Voice LLM Does Everything | Thin-Router (Recommended) |
|---|---|---|
| Jarvis memory/tools | Duplicated or lost | Fully preserved |
| Cost per complex query | High (audio tokens for reasoning) | Low (audio tokens only for routing) |
| Jarvis upgrades | Require app update | Transparent |
| Context window burn | Fast (audio tokens expensive) | Slow (minimal turns) |
| Session length | Short (~15 min before context fills) | Longer (Jarvis manages its own context) |

The Tool Calling Mechanism

The ask_jarvis() Tool Definition

```json
{
  "type": "function",
  "name": "ask_jarvis",
  "description": "Routes any meaningful user request to the Jarvis AI agent backend running on Sam's homelab. Use this tool for EVERY real question, task, or request. Do not attempt to answer from your own knowledge.",
  "parameters": {
    "type": "object",
    "properties": {
      "query": {
        "type": "string",
        "description": "The user's complete request in natural language, including all relevant context they stated (names, dates, quantities, locations). Do not abbreviate or reframe."
      }
    },
    "required": ["query"]
  }
}
```
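This definition is registered once per session inside a `session.update` client event on the data channel. A minimal sketch follows; the wrapper function name, the abbreviated `instructions` string, and the trimmed-down tool description are illustrative, not the full session config:

```swift
import Foundation

// Sketch: build the session.update event that registers ask_jarvis().
// Sent once over the RTCDataChannel after `session.created` arrives.
func registerJarvisToolEvent() throws -> Data {
    let sessionUpdate: [String: Any] = [
        "type": "session.update",
        "session": [
            "instructions": "Route every real request through ask_jarvis().",
            "tools": [[
                "type": "function",
                "name": "ask_jarvis",
                "description": "Routes any meaningful user request to the Jarvis AI agent backend.",
                "parameters": [
                    "type": "object",
                    "properties": ["query": ["type": "string"]],
                    "required": ["query"]
                ]
            ]],
            "tool_choice": "auto"   // let the model decide when to call it
        ]
    ]
    return try JSONSerialization.data(withJSONObject: sessionUpdate)
}
```

The returned bytes go straight onto the data channel; the same pattern covers any later mid-session tool changes.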

Complete Event Flow (Source-Verified from OpenAI Docs)

┌─────────────────────────────────────────────────────────────────────────────┐
│ PHASE 1: USER SPEAKS                                                         │
│                                                                               │
│  Sam: "Hey, what's on my calendar tomorrow?"                                 │
│                                                                               │
│  iOS mic → PCM16 24kHz → WebRTC audio track → OpenAI Realtime API           │
│                                                                               │
│  Server events (client receives):                                             │
│    ← input_audio_buffer.speech_started                                       │
│    ← input_audio_buffer.speech_stopped    (semantic_vad fires)               │
│    ← input_audio_buffer.committed                                            │
│                                                                               │
├─────────────────────────────────────────────────────────────────────────────┤
│ PHASE 2: MODEL DECIDES TO CALL TOOL                                          │
│                                                                               │
│  ← response.created                                                          │
│  ← response.output_item.added  { type: "function_call", name: "ask_jarvis" }│
│  ← response.function_call_arguments.delta  × N  (streaming JSON)            │
│     e.g. delta: '{"q'  →  '{"query'  →  '{"query":"What'  → ...            │
│  ← response.function_call_arguments.done                                     │
│     final: { "query": "What is on my calendar tomorrow?" }                   │
│  ← response.output_item.done  {                                              │
│       type: "function_call",                                                 │
│       name: "ask_jarvis",                                                    │
│       call_id: "call_abc123",                                                │
│       arguments: "{\"query\":\"What is on my calendar tomorrow?\"}"          │
│     }                                                                        │
│  ← response.done  (status: "completed" — model spoke filler, stopped)       │
│                                                                               │
│  [CONCURRENTLY: model has already spoken filler phrase audio]                │
│  [e.g. "Let me check with Jarvis." plays while HTTP is in flight]            │
│                                                                               │
├─────────────────────────────────────────────────────────────────────────────┤
│ PHASE 3: iOS APP CALLS JARVIS                                                │
│                                                                               │
│  Task {                                                                       │
│    POST http://10.0.0.52:8081/v1/chat/completions                            │
│    Authorization: Bearer <token-from-keychain>                               │
│    Content-Type: application/json                                            │
│    session_id: <jarvis-session-uuid>                                         │
│    { "messages":[{"role":"user","content":"What is on my calendar tomorrow?"}],│
│      "stream": false }                                                      │
│  }                                                                            │
│                                                                               │
│  ← HTTP 200 { "choices":[{"message":{"content":"You have a dentist..."}}], ... }│
│                                                                               │
├─────────────────────────────────────────────────────────────────────────────┤
│ PHASE 4: APP SUBMITS TOOL RESULT                                             │
│                                                                               │
│  → conversation.item.create {                                                │
│      "type": "conversation.item.create",                                     │
│      "item": {                                                               │
│        "type": "function_call_output",                                       │
│        "call_id": "call_abc123",                                             │
│        "output": "You have a dentist at 10am and team standup at 2pm."      │
│      }                                                                       │
│    }                                                                         │
│                                                                               │
│  → response.create {}                                                        │
│                                                                               │
├─────────────────────────────────────────────────────────────────────────────┤
│ PHASE 5: MODEL SPEAKS THE ANSWER                                             │
│                                                                               │
│  ← response.created                                                          │
│  ← response.output_audio.delta × N   (streamed PCM16 audio chunks)          │
│  ← response.output_audio_transcript.delta × N  (streamed transcript text)   │
│  ← response.done                                                             │
│                                                                               │
│  [iOS plays audio chunks via AVAudioPlayerNode as they arrive]               │
└─────────────────────────────────────────────────────────────────────────────┘

Key Implementation Detail (Swift)

```swift
// Decoded arguments for the ask_jarvis() tool call
struct JarvisArgs: Decodable {
    let query: String
}

func handleRealtimeEvent(_ event: RealtimeServerEvent) {
    guard case .responseOutputItemDone(let item) = event,
          item.type == "function_call",
          item.name == "ask_jarvis",
          let callId = item.callId,
          let argsData = item.arguments?.data(using: .utf8),
          let args = try? JSONDecoder().decode(JarvisArgs.self, from: argsData)
    else { return }

    Task {
        do {
            let result = try await jarvisClient.ask(
                query: args.query,
                sessionId: currentSessionId,
                timeout: 10.0
            )
            // Truncate long responses for voice
            let voiceOutput = result.response.truncatedForVoice(maxWords: 150)
            await submitToolResult(callId: callId, output: voiceOutput)
            await sendResponseCreate()
        } catch {
            // Errors are spoken back to the user as natural language
            let errorMsg = errorMessage(for: error)
            await submitToolResult(callId: callId, output: errorMsg)
            await sendResponseCreate()
        }
    }
}

func submitToolResult(callId: String, output: String) async {
    let event: [String: Any] = [
        "type": "conversation.item.create",
        "item": [
            "type": "function_call_output",
            "call_id": callId,
            "output": output
        ]
    ]
    guard let data = try? JSONSerialization.data(withJSONObject: event) else { return }
    await dataChannel.send(data)   // thin wrapper over RTCDataChannel
}
```
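The `sendResponseCreate()` call above is never defined. A minimal sketch, which also gates on the `conversation_already_has_active_response` error described in the gotchas table; `isResponseActive` is a hypothetical flag you would set on `response.created` and clear on `response.done` / `response.cancelled`:

```swift
import Foundation

// Sketch: trigger the model to speak once the tool result is submitted.
// `isResponseActive` and `dataChannel` are assumed surrounding state.
func sendResponseCreate() async {
    // Never issue response.create while a response is still in flight
    guard !isResponseActive else { return }
    let event: [String: Any] = ["type": "response.create"]
    if let data = try? JSONSerialization.data(withJSONObject: event) {
        await dataChannel.send(data)
    }
}
```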

Latency Breakdown Table

| Stage | Latency (LAN) | Latency (4G/5G Cellular) | Notes |
|---|---|---|---|
| Semantic VAD fires after speech ends | 200–600ms | 200–600ms | Semantic VAD more accurate, slightly slower than server_vad |
| Model processes audio + decides to call tool | 150–400ms | 150–400ms | Included in response initiation |
| Model speaks filler phrase | 800ms–1.5s | 800ms–1.5s | Concurrent with HTTP call below |
| HTTP to Jarvis via Tailscale WireGuard | 5–30ms | 80–300ms | LAN: nearly instant. Cellular: DERP relay may add latency |
| Jarvis LLM inference (Claude Sonnet) | 1,000–4,000ms | 1,000–4,000ms | Dominant cost for complex queries |
| Model generates first audio byte after tool result | 200–500ms | 200–500ms | After response.create sent |
| **Total perceived gap** | **~2–5s** | **~3–7s** | Filler phrase masks Jarvis inference time |

Key insight: The filler phrase (1–1.5s) buys you almost all the time you need for Jarvis to respond on LAN. The user hears “Let me check with Jarvis” and then almost immediately hears the answer. The silence gap that needs to be hidden is usually under 2 seconds.
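The LAN numbers can be sanity-checked by summing the serial stages and subtracting the filler phrase that plays concurrently with the HTTP call and Jarvis inference. A back-of-the-envelope sketch; all range constants come from the table above:

```swift
// Back-of-the-envelope check of the LAN best/worst case, in milliseconds
let vad = 200...600           // semantic VAD fires after speech ends
let decide = 150...400        // model decides to call the tool
let filler = 800...1500       // filler phrase, concurrent with HTTP + inference
let http = 5...30             // Tailscale hop on LAN
let jarvis = 1000...4000      // Jarvis LLM inference
let firstAudio = 200...500    // first audio byte after response.create

// HTTP + inference run while the filler plays; only the overhang is silence
let silenceBest  = max(0, http.lowerBound + jarvis.lowerBound - filler.upperBound)
let silenceWorst = max(0, http.upperBound + jarvis.upperBound - filler.lowerBound)

// Perceived gap = VAD + decision + filler + residual silence + first audio
let best  = vad.lowerBound + decide.lowerBound + filler.lowerBound + silenceBest + firstAudio.lowerBound
let worst = vad.upperBound + decide.upperBound + filler.upperBound + silenceWorst + firstAudio.upperBound
// best = 1350ms, worst = 6230ms; the table's "~2–5s" typical sits inside this envelope
```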


Filler Phrase Strategy

The Problem

When ask_jarvis() is called, the voice LLM stops speaking and waits for the tool result. Without intervention, Sam hears dead silence for 2–7 seconds. This feels broken.

The Solution: Pre-Call Verbal Acknowledgment

Instruct the model in the system prompt to say one short phrase before calling the tool. This speech happens in the response.output_audio.delta stream that accompanies the function call. When response.done arrives (marking the end of the filler + the tool call), you fire the HTTP request to Jarvis.

System Prompt Language

```
# Tool Usage
Before calling ask_jarvis(), always speak one short, natural acknowledgment.
These must be varied — never use the same phrase twice in a row:
- "Let me ask Jarvis."
- "One moment."
- "Checking now."
- "On it."
- "Let me look that up."
- "Sure, give me a second."

Then call ask_jarvis() immediately. Do not say anything else before calling.
```

The Auto-Waiting Feature (Built Into gpt-realtime)

From OpenAI’s official prompting docs (confirmed source):

> “If you ask the model for the results of a function call, it’ll say something like ‘I’m still waiting on that.’ This feature is automatically enabled for new models — no changes necessary.”

This means: if Jarvis takes longer than expected (e.g., complex multi-step query), the model will naturally fill the silence with “I’m still waiting on that…” without any code on your end. This is a gpt-realtime GA feature, not available in the preview models.

What NOT To Do


iOS Audio Pipeline

AVAudioEngine vs AVAudioSession — The Mental Model

These are not alternatives — they are two layers of the same stack.

| Layer | Role | What It Controls |
|---|---|---|
| AVAudioSession | OS contract — tells iOS how you intend to use audio | Routing, interruption policy, AEC mode, category |
| AVAudioEngine | Signal graph — the actual audio processing pipeline | Nodes, taps, converters, players |

Configure AVAudioSession first, then build AVAudioEngine on top.

The Correct Setup for Full-Duplex Voice AI

```swift
// Step 1: Configure the session
let session = AVAudioSession.sharedInstance()
try session.setCategory(.playAndRecord,
                        mode: .voiceChat,        // ← KEY: enables hardware AEC + AGC
                        options: [.defaultToSpeaker, .allowBluetooth])
try session.setPreferredSampleRate(24000)         // Request 24kHz (hardware may ignore)
try session.setPreferredIOBufferDuration(0.01)    // 10ms buffer = ~240 samples
try session.setActive(true)

// Step 2: Build the engine
let engine = AVAudioEngine()

// Step 3: Enable voice processing (AEC) on the input node
// MUST be called while engine is STOPPED
try engine.inputNode.setVoiceProcessingEnabled(true)

// Step 4: Tap at NATIVE hardware format (48kHz Float32 — DO NOT try to force 24kHz here)
let nativeFormat = engine.inputNode.inputFormat(forBus: 0)  // 48kHz Float32 mono

// Step 5: Set up converter to OpenAI's required format
let targetFormat = AVAudioFormat(commonFormat: .pcmFormatInt16,
                                  sampleRate: 24000,
                                  channels: 1,
                                  interleaved: true)!
let converter = AVAudioConverter(from: nativeFormat, to: targetFormat)!

// Step 6: Install tap and stream to WebRTC
engine.inputNode.installTap(onBus: 0, bufferSize: 4800, format: nativeFormat) { buffer, _ in
    let frameCount = AVAudioFrameCount(24000) * buffer.frameLength /
                     AVAudioFrameCount(nativeFormat.sampleRate)
    let convertedBuffer = AVAudioPCMBuffer(pcmFormat: targetFormat,
                                           frameCapacity: frameCount)!
    var error: NSError?
    var consumed = false
    converter.convert(to: convertedBuffer, error: &error) { _, outStatus in
        if !consumed { outStatus.pointee = .haveData; consumed = true; return buffer }
        outStatus.pointee = .noDataNow; return nil
    }
    // convertedBuffer.int16ChannelData![0] = raw Int16 PCM at 24kHz
    // Base64-encode and send over WebSocket, OR feed directly to WebRTC audio track
    let audioData = Data(bytes: convertedBuffer.int16ChannelData![0],
                         count: Int(convertedBuffer.frameLength) * 2)
    realtimeSession.sendAudio(audioData)
}

// Step 7: Attach playback node (for TTS output from OpenAI)
let playerNode = AVAudioPlayerNode()
engine.attach(playerNode)
engine.connect(playerNode, to: engine.mainMixerNode, format: targetFormat)

try engine.start()
```

PCM16 at 24kHz — The Numbers

```
10ms frame  =  240 samples × 2 bytes =   480 bytes
20ms frame  =  480 samples × 2 bytes =   960 bytes  (good for VAD processing)
100ms chunk = 2400 samples × 2 bytes = 4,800 bytes  (good WebSocket granularity)

Uplink bandwidth:   ~48 KB/s (24kHz mono PCM16)
Downlink bandwidth: ~48 KB/s (AI voice response, same format)
```
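These sizes fall straight out of sampleRate × duration × bytesPerSample; a quick sketch:

```swift
import Foundation

// PCM16 mono frame sizing at 24 kHz
func pcm16Bytes(sampleRate: Int = 24_000, ms: Int) -> Int {
    let samples = sampleRate * ms / 1000
    return samples * MemoryLayout<Int16>.size   // 2 bytes per sample
}

// pcm16Bytes(ms: 10)  == 480
// pcm16Bytes(ms: 20)  == 960
// pcm16Bytes(ms: 100) == 4_800
// Sustained bandwidth: 24_000 samples/s × 2 bytes = 48_000 bytes/s ≈ 48 KB/s each way
```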

AEC — The Right Configuration

Use .voiceChat mode and setVoiceProcessingEnabled(true) together. This is the path of least resistance and handles 95% of echo cancellation needs.

Do NOT use setPrefersEchoCancelledInput(true) — this is iOS 18.2+ only, hardware-gated to 2024 iPhones, and cannot be combined with Voice Processing IO APIs. It’s designed for music apps, not voice AI.

Key Gotchas

| Gotcha | Detail | Fix |
|---|---|---|
| The disconnect bug (WebSocket code 1000) | URLSessionWebSocketTask closes immediately after first audio packet | Audio is not properly formatted as PCM16 Int16. Ensure base64 encodes raw Int16 (little-endian) bytes, not Float32. Confirm session.created received before sending audio. |
| Volume drop | VoiceProcessingIO reduces playback volume ~3–6dB | This is by design (headroom for AEC). Adjust playerNode.volume upward, or use engine.mainMixerNode.outputVolume. |
| Cannot force tap format | Setting custom format on installTap silently fails or produces zero buffers | Always tap at native 48kHz Float32, use AVAudioConverter to resample. |
| Route change resets AEC | Headphone insertion/removal requires engine restart | Listen to AVAudioSession.routeChangeNotification, pause engine, call try? session.setActive(true), restart engine. |
| Engine config change | Hardware change (USB mic, headphones) auto-stops engine | Listen to AVAudioEngineConfigurationChangeNotification, rewire graph and restart. |
| Media services reset | Rare but possible — iOS kills audio server | Listen to AVAudioSession.mediaServicesWereResetNotification, full teardown + rebuild. |
| Bluetooth A2DP → HFP | .allowBluetooth forces BT into HFP (narrowband) for AEC | Expected behavior. HFP = 8kHz or 16kHz voice profile. A2DP = high quality but no AEC. |
| conversation_already_has_active_response | Sending response.create while response in flight | Always gate response.create on response.done or response.cancelled |
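The three notification-driven gotchas share one recovery shape. A hedged sketch; `restartEngine()` and `rebuildAudioStack()` are hypothetical helpers standing in for your app's actual teardown/restart logic:

```swift
import AVFoundation

// Sketch: one observer per notification-driven recovery path.
func installAudioRecoveryObservers(engine: AVAudioEngine) {
    let nc = NotificationCenter.default

    // Route change (headphones in/out): pause, reactivate session, restart
    nc.addObserver(forName: AVAudioSession.routeChangeNotification,
                   object: nil, queue: .main) { _ in
        engine.pause()
        try? AVAudioSession.sharedInstance().setActive(true)
        restartEngine()
    }

    // Hardware configuration change (USB mic, etc.): rewire graph and restart
    nc.addObserver(forName: .AVAudioEngineConfigurationChange,
                   object: engine, queue: .main) { _ in
        restartEngine()
    }

    // Media services reset: full teardown and rebuild
    nc.addObserver(forName: AVAudioSession.mediaServicesWereResetNotification,
                   object: nil, queue: .main) { _ in
        rebuildAudioStack()
    }
}
```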

WebRTC vs WebSocket for iOS

Why WebRTC Wins

| Dimension | WebRTC | WebSocket |
|---|---|---|
| Packet loss handling | Built-in FEC (Opus codec), can drop late packets | TCP: retransmits, causes jitter/delay |
| Head-of-line blocking | None (UDP-based) | Yes (TCP) — a dropped packet stalls all subsequent audio |
| AEC integration | Framework-level AEC built in to WebRTC iOS SDK | Manual (must implement via VoiceProcessingIO) |
| Network transitions | ICE restart handles wifi→cellular gracefully | URLSessionWebSocketTask often drops on network change |
| Jitter buffer | Built in (adaptive) | Must implement manually |
| Latency | Lower (UDP, adaptive bitrate) | Higher (TCP overhead) |
| OpenAI recommendation | ✅ Explicitly recommended for mobile/iOS | “Server-to-server tool” per OpenAI docs |

The OpenAI docs describe WebSocket as a “server-to-server” transport; it remains viable on mobile only if you control the full audio pipeline yourself.

Translated: use WebRTC. WebSocket is for your server talking to OpenAI, not your iOS app.

The Ephemeral Key Flow

```
1. Your backend server:
   POST https://api.openai.com/v1/realtime/client_secrets
   Authorization: Bearer <OPENAI_API_KEY>
   → Returns: { "client_secret": { "value": "ek_xxx...", "expires_at": ... } }

2. iOS app fetches ephemeral key from YOUR backend (never store raw OpenAI key on device)

3. iOS app:
   POST https://api.openai.com/v1/realtime/calls
   Authorization: Bearer ek_xxx
   Content-Type: application/sdp
   Body: <SDP offer from RTCPeerConnection>
   → Returns: SDP answer

4. Set RTCPeerConnection remote description with SDP answer
5. ICE negotiation completes → audio stream is live
6. Tool call events arrive on RTCDataChannel
```

Note: Ephemeral keys have a short TTL (minutes). Generate a new one per session start. Store the raw OpenAI API key server-side (your backend or Keychain-protected relay), never in the iOS app bundle.
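Steps 2–4 of the flow can be sketched as follows. The `https://your-backend.example/realtime-key` endpoint and its `{"value": ...}` response shape are assumptions: a hypothetical backend route that proxies the `client_secrets` call with the real API key:

```swift
import Foundation

// Sketch of the ephemeral-key + SDP exchange (steps 2–4 above).
func connectRealtime(sdpOffer: String) async throws -> String {
    // Fetch a short-lived ephemeral key from YOUR backend (hypothetical endpoint)
    let keyURL = URL(string: "https://your-backend.example/realtime-key")!
    let (keyData, _) = try await URLSession.shared.data(from: keyURL)
    struct KeyResponse: Decodable { let value: String }   // assumed shape
    let ephemeralKey = try JSONDecoder().decode(KeyResponse.self, from: keyData).value

    // Exchange the local SDP offer for OpenAI's SDP answer
    var request = URLRequest(url: URL(string: "https://api.openai.com/v1/realtime/calls")!)
    request.httpMethod = "POST"
    request.setValue("Bearer \(ephemeralKey)", forHTTPHeaderField: "Authorization")
    request.setValue("application/sdp", forHTTPHeaderField: "Content-Type")
    request.httpBody = sdpOffer.data(using: .utf8)
    let (answerData, _) = try await URLSession.shared.data(for: request)

    // Caller sets this string as the RTCPeerConnection remote description
    return String(decoding: answerData, as: UTF8.self)
}
```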

Reference Implementations


Networking: Reaching Jarvis from Outside

The Right Approach: Tailscale VPN On Demand

Tailscale’s VPN On Demand feature (available since Tailscale iOS 1.48, verified January 2026) allows iOS to automatically activate the WireGuard VPN tunnel whenever a DNS query for *.ts.net domains is made. This means:

  1. Sam opens the voice app
  2. App makes HTTP request to jarvis.tail-xxxx.ts.net
  3. iOS VPN On Demand kicks in, activates Tailscale WireGuard tunnel
  4. Request reaches 10.0.0.52:8081 on the homelab
  5. No manual VPN management required

Why tsnet Doesn’t Work on iOS

The tsnet package — Tailscale’s embeddable Go library — allows embedding Tailscale directly into a Go binary so it acts as its own Tailscale node without a separate install. However, tsnet targets server-side Go programs: there is no supported path for embedding it in an iOS app, and iOS requires VPN functionality to go through the NetworkExtension framework in any case.

The correct approach: require the Tailscale iOS app to be installed separately, and use VPN On Demand rules.

MagicDNS Configuration

In your Tailscale admin console, enable MagicDNS. Your homelab server gets a stable DNS name like jarvis.tail-xxxx.ts.net. Configure the iOS VPN On Demand rule to trigger for *.ts.net or *.tail-xxxx.ts.net.

```swift
// iOS: how to call Jarvis via Tailscale — same endpoint and port as on LAN
let jarvisURL = URL(string: "http://jarvis.tail-xxxx.ts.net:8081/v1/chat/completions")!

// The Tailscale VPN On Demand activates automatically
// when this DNS name is resolved. No extra code needed.
```

Bearer Token Management

Store the Jarvis API bearer token in iOS Keychain, not in UserDefaults or app bundle:

```swift
import Security

func storeJarvisToken(_ token: String) {
    let base: [String: Any] = [
        kSecClass as String: kSecClassGenericPassword,
        kSecAttrAccount as String: "jarvis-api-token"
    ]
    // Remove any existing token first — SecItemAdd fails with
    // errSecDuplicateItem if the item already exists
    SecItemDelete(base as CFDictionary)

    var attributes = base
    attributes[kSecValueData as String] = Data(token.utf8)
    attributes[kSecAttrAccessible as String] = kSecAttrAccessibleWhenUnlockedThisDeviceOnly
    SecItemAdd(attributes as CFDictionary, nil)
}
```
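The matching read, used wherever the Authorization header is attached; a minimal sketch:

```swift
import Security

// Sketch: fetch the stored token back out of the Keychain
func loadJarvisToken() -> String? {
    let query: [String: Any] = [
        kSecClass as String: kSecClassGenericPassword,
        kSecAttrAccount as String: "jarvis-api-token",
        kSecReturnData as String: true,
        kSecMatchLimit as String: kSecMatchLimitOne
    ]
    var result: AnyObject?
    guard SecItemCopyMatching(query as CFDictionary, &result) == errSecSuccess,
          let data = result as? Data
    else { return nil }
    return String(decoding: data, as: UTF8.self)
}
```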

Fallback When Homelab Is Unreachable

When Jarvis is down (server off, VPN unreachable, timeout), the tool result must still return something sensible:

```swift
enum JarvisError: Error {
    case timeout
    case serverDown
    case authFailed
    case unknownError(Int)
}

func errorMessage(for error: Error) -> String {
    switch error {
    case JarvisError.timeout:
        return "I wasn't able to reach Jarvis — the request timed out. The homelab may be busy."
    case JarvisError.serverDown:
        return "Jarvis appears to be offline right now. I can't reach the homelab."
    case JarvisError.authFailed:
        return "Authentication to Jarvis failed. You may need to update the API token in settings."
    default:
        return "Something went wrong reaching Jarvis: \(error.localizedDescription)"
    }
}
```

The voice LLM will speak these error messages naturally.
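The `jarvisClient.ask(query:sessionId:timeout:)` call used by the tool handler is never shown. A minimal sketch of the actor, mapping transport failures onto `JarvisError`; the `JarvisResponse` shape and the response decoding are assumptions based on the OpenAI-compatible `/v1/chat/completions` example in the next section:

```swift
import Foundation

// Sketch of the JarvisClient actor used by the ask_jarvis() tool handler.
actor JarvisClient {
    struct JarvisResponse { let response: String }

    private let baseURL: URL
    private let token: String

    init(baseURL: URL, token: String) {
        self.baseURL = baseURL
        self.token = token
    }

    func ask(query: String, sessionId: String, timeout: TimeInterval) async throws -> JarvisResponse {
        var request = URLRequest(url: baseURL.appendingPathComponent("v1/chat/completions"))
        request.httpMethod = "POST"
        request.timeoutInterval = timeout
        request.setValue("Bearer \(token)", forHTTPHeaderField: "Authorization")
        request.setValue("application/json", forHTTPHeaderField: "Content-Type")
        request.setValue(sessionId, forHTTPHeaderField: "session_id")
        let body: [String: Any] = [
            "messages": [["role": "user", "content": query]],
            "stream": false
        ]
        request.httpBody = try JSONSerialization.data(withJSONObject: body)

        do {
            let (data, response) = try await URLSession.shared.data(for: request)
            guard let http = response as? HTTPURLResponse else { throw JarvisError.serverDown }
            switch http.statusCode {
            case 200: break
            case 401, 403: throw JarvisError.authFailed
            default: throw JarvisError.unknownError(http.statusCode)
            }
            // Pull choices[0].message.content out of the OpenAI-compatible payload
            struct Completion: Decodable {
                struct Choice: Decodable {
                    struct Message: Decodable { let content: String }
                    let message: Message
                }
                let choices: [Choice]
            }
            let completion = try JSONDecoder().decode(Completion.self, from: data)
            guard let content = completion.choices.first?.message.content else {
                throw JarvisError.unknownError(http.statusCode)
            }
            return JarvisResponse(response: content)
        } catch let urlError as URLError where urlError.code == .timedOut {
            throw JarvisError.timeout
        } catch is URLError {
            throw JarvisError.serverDown
        }
    }
}
```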


Jarvis Backend API

✅ Correct Finding: term-llm Already Exposes a Full HTTP API

term-llm serve --platform web exposes a production HTTP API that is OpenAI-compatible. No custom HTTP wrapper is required for Jarvis Voice.

Live homelab instance: http://10.0.0.52:8081 (reachable over Tailscale)

Available Endpoints

Auth + Session Continuity

iOS Integration Pattern (Tool Handler)

From the ask_jarvis() tool handler, call term-llm directly at /v1/chat/completions (or /v1/responses) and set session_id to the voice-session UUID tracked in VoiceViewModel.

```swift
let jarvisSessionId = currentVoiceSession.jarvisSessionId
request.setValue("Bearer \(token)", forHTTPHeaderField: "Authorization")
request.setValue(jarvisSessionId, forHTTPHeaderField: "session_id")
```

Example Request/Response (/v1/chat/completions)

```http
POST http://10.0.0.52:8081/v1/chat/completions
Authorization: Bearer <jarvis-token>
Content-Type: application/json
session_id: my-voice-session-uuid

{
  "messages": [
    { "role": "user", "content": "What's on my calendar tomorrow?" }
  ],
  "stream": false
}
```

Response:

```json
{
  "id": "chatcmpl_abc123",
  "object": "chat.completion",
  "model": "jarvis",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "You have a dentist appointment at 10 AM and team standup at 2 PM."
      },
      "finish_reason": "stop"
    }
  ]
}
```

"stream": true is also supported (SSE).

/v1/responses as the Newer Alternative

POST /v1/responses is available and follows OpenAI’s newer Responses API model. Either endpoint works for Jarvis Voice.

Capability Inheritance (Why This Is Great)

The Jarvis agent behind this API already has full memory, tools, web search, and orchestration. By routing voice requests into term-llm, the iOS voice app inherits all of those capabilities immediately — no separate mobile-side reimplementation required.


Conversation State Design

Two-Layer Model

There are two distinct conversation contexts that must be managed independently:

Layer 1: Realtime API Context (Voice Layer)

Layer 2: Jarvis Session (Reasoning Layer)

// VoiceViewModel holds both IDs
struct SessionState {
    let realtimeSessionId: String   // from session.created event
    let jarvisSessionId: String     // UUID sent as session_id header on each /v1/chat/completions call
    let startedAt: Date
}

Session Keying

// Generate at each new voice session start
let jarvisSessionId = UUID().uuidString

// Include in every Jarvis HTTP call as a request header
var request = URLRequest(url: URL(string: "\(jarvisBaseURL)/v1/chat/completions")!)
request.setValue(jarvisSessionId, forHTTPHeaderField: "session_id")

Session Reset/Timeout

When to create a new Jarvis session (new jarvisSessionId):

When NOT to reset:

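The reset policy can be sketched as a single predicate. The 30-minute idle threshold here is an assumed value, not something fixed by the backend — tune to taste.

```swift
import Foundation

// Sketch: decide whether to mint a fresh jarvisSessionId.
// The 30-minute idle threshold is an assumption.
func shouldResetJarvisSession(lastActivity: Date,
                              userRequestedReset: Bool,
                              now: Date = Date(),
                              idleTimeout: TimeInterval = 30 * 60) -> Bool {
    if userRequestedReset { return true }                     // explicit "start over"
    return now.timeIntervalSince(lastActivity) > idleTimeout  // long gap → new topic, new session
}
```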

Response Handling for Voice

The Problem: Jarvis Returns Long Text

Jarvis’s responses are optimized for reading (markdown, lists, long explanations). Voice needs short, natural-sounding prose. A 500-word markdown response read aloud verbatim is terrible UX.

The 150-Word Heuristic

Truncate Jarvis responses at ~150 words for voice output. This is roughly 45–60 seconds of speech at natural speaking pace — enough to convey rich information without making Sam’s arm go numb holding his phone.

extension String {
    func truncatedForVoice(maxWords: Int = 150) -> String {
        // Split on any whitespace (spaces and newlines) so multi-line
        // markdown responses are counted correctly
        let words = self.split(whereSeparator: \.isWhitespace)
        if words.count <= maxWords { return self }

        let truncated = words.prefix(maxWords).joined(separator: " ")
        return truncated + "… I have more details if you want them."
    }
}

Asking Jarvis to Be Concise

Add a system-level instruction to the Jarvis backend prompt/profile:

When called from the voice interface, keep responses under 100 words.
Use plain prose, not markdown. No bullet points, no headers, no code blocks.
If the answer requires more detail, summarize it and offer to elaborate.

Signal this via a request field:

{
  "session_id": "...",
  "query": "...",
  "context": "voice"
}

The backend uses "context": "voice" to prepend a conciseness instruction to the system prompt.

Stripping Markdown

extension String {
    func strippedMarkdown() -> String {
        var result = self
        // Remove code blocks first, before the bold/italic pass can
        // mangle asterisks inside them
        result = result.replacingOccurrences(of: #"```[\s\S]*?```"#, with: "[code block]", options: .regularExpression)
        // Remove markdown headers — (?m) makes ^ match at every line start
        result = result.replacingOccurrences(of: #"(?m)^#{1,6}\s"#, with: "", options: .regularExpression)
        // Remove bold/italic
        result = result.replacingOccurrences(of: #"\*{1,3}(.+?)\*{1,3}"#, with: "$1", options: .regularExpression)
        // Remove bullet points (also anchored per line)
        result = result.replacingOccurrences(of: #"(?m)^\s*[-*+]\s"#, with: "", options: .regularExpression)
        return result.trimmingCharacters(in: .whitespacesAndNewlines)
    }
}

Apply .strippedMarkdown().truncatedForVoice() before submitting as tool result.


State Machine

The Core States

enum VoiceState: Equatable {
    case idle                    // App open, no active listening
    case connecting              // Establishing WebRTC session
    case listening               // Mic active, VAD waiting for speech
    case userSpeaking            // VAD detected speech start
    case processing              // VAD fired, waiting for tool call / response
    case fillerSpeaking          // Model speaking filler phrase (concurrent with HTTP call)
    case waitingForJarvis        // HTTP call in flight, filler done
    case aiSpeaking              // Model speaking final response
    case error(VoiceError)       // Something went wrong
}

State Transitions

idle
  → [user taps mic / opens app] → connecting
  → [session.created received] → listening

listening
  → [input_audio_buffer.speech_started] → userSpeaking

userSpeaking
  → [input_audio_buffer.speech_stopped] → processing
  → [user taps interrupt] → listening (send response.cancel)

processing
  → [response.output_audio.delta starts] → fillerSpeaking
  → [response.output_item.done (function_call)] → start HTTP Task

fillerSpeaking
  → [response.done AND HTTP call complete] → aiSpeaking (response.create sent)
  → [response.done AND HTTP still in flight] → waitingForJarvis

waitingForJarvis
  → [HTTP call returns] → aiSpeaking (response.create sent)
  → [HTTP call fails] → aiSpeaking (error message submitted as tool result)

aiSpeaking
  → [response.done] → listening
  → [input_audio_buffer.speech_started] → userSpeaking (model interrupted)

error(*)
  → [retry] → connecting
  → [give up] → idle

Interruption Handling

When Sam starts speaking while the AI is speaking:

  1. Server sends input_audio_buffer.speech_started
  2. Server auto-cancels the in-progress response (with server_vad/semantic_vad)
  3. response.done arrives with status: "cancelled"
  4. Client stops playing buffered audio immediately
  5. Use conversation.item.truncate to sync the server’s understanding of what was actually heard

// On speech_started while in .aiSpeaking state:
case .inputAudioBufferSpeechStarted:
    if currentState == .aiSpeaking {
        playerNode.stop()           // Stop playing immediately
        playerNode.reset()          // Clear buffer queue
        state = .userSpeaking
        // Server handles response cancellation automatically with semantic_vad
    }

SwiftUI Implementation

@Observable
class VoiceViewModel {
    var state: VoiceState = .idle
    var audioLevel: Float = 0.0       // Drives orb animation
    var transcript: String = ""        // Optional display

    // Actor-based components
    private let audioEngine: AudioEngineActor
    private let realtimeSession: RealtimeSessionActor
    private let jarvisClient: JarvisClientActor

    @MainActor
    func startSession() async {
        state = .connecting
        do {
            let ephemeralKey = try await fetchEphemeralKey()
            try await realtimeSession.connect(with: ephemeralKey)
            try await audioEngine.start()
            state = .listening
        } catch {
            state = .error(.connectionFailed(error))
        }
    }

    @MainActor
    func handleServerEvent(_ event: RealtimeServerEvent) {
        switch event {
        case .speechStarted:
            // Both branches land in .userSpeaking; only stop playback
            // when the model was actually interrupted
            if state == .aiSpeaking { audioEngine.stopPlayback() }
            state = .userSpeaking
        case .speechStopped:
            state = .processing
        case .fillerAudioStarted:
            state = .fillerSpeaking
        case .functionCallReady(let call):
            handleToolCall(call)
        case .responseAudioStarted:
            state = .aiSpeaking
        case .responseDone:
            state = .listening
        default: break
        }
    }
}

UI/UX

Design Philosophy: Radical Minimalism

This is a personal tool for Sam, not a consumer app. No chrome. No tutorial overlays. Just: voice in, voice out.

The Orb

Full-screen pulsating circle that reflects audio state:

struct VoiceOrbView: View {
    @Bindable var vm: VoiceViewModel

    var body: some View {
        TimelineView(.animation) { _ in
            Canvas { ctx, size in
                let center = CGPoint(x: size.width/2, y: size.height/2)
                let baseRadius = min(size.width, size.height) * 0.25

                // Outer glow (breathing animation)
                let breathRadius = baseRadius + CGFloat(vm.audioLevel) * 60

                // Color shifts by state
                let orbColor: Color = switch vm.state {
                case .idle:          .gray.opacity(0.4)
                case .listening:     .blue.opacity(0.6)
                case .userSpeaking:  .green
                case .processing, .fillerSpeaking, .waitingForJarvis: .orange
                case .aiSpeaking:    .purple
                case .error:         .red
                case .connecting:    .gray.opacity(0.6)
                }

                // Draw outer glow
                ctx.fill(
                    Path(ellipseIn: CGRect(x: center.x - breathRadius,
                                          y: center.y - breathRadius,
                                          width: breathRadius * 2,
                                          height: breathRadius * 2)),
                    with: .color(orbColor.opacity(0.3))
                )

                // Draw core orb
                ctx.fill(
                    Path(ellipseIn: CGRect(x: center.x - baseRadius,
                                          y: center.y - baseRadius,
                                          width: baseRadius * 2,
                                          height: baseRadius * 2)),
                    with: .color(orbColor)
                )
            }
        }
        .background(.black)
        .ignoresSafeArea()
    }
}

Transcript (Optional)

Show transcript in a ScrollView below the orb. Only the last 3–4 exchanges. Use ScrollViewReader to auto-scroll to latest. Toggleable with a long-press gesture.
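A sketch of that transcript overlay, assuming a simple `TranscriptEntry` model (illustrative, not part of the view model above):

```swift
import SwiftUI

// Illustrative model for one transcript line.
struct TranscriptEntry: Identifiable {
    let id = UUID()
    let speaker: String   // "You" or "Jarvis"
    let text: String
}

struct TranscriptView: View {
    let entries: [TranscriptEntry]

    var body: some View {
        ScrollViewReader { proxy in
            ScrollView {
                // Only the last few exchanges — this is a debug aid, not a chat log
                ForEach(entries.suffix(4)) { entry in
                    Text("\(entry.speaker): \(entry.text)")
                        .font(.footnote)
                        .frame(maxWidth: .infinity,
                               alignment: entry.speaker == "You" ? .trailing : .leading)
                        .id(entry.id)
                }
            }
            .onChange(of: entries.count) {
                // Auto-scroll to the newest entry
                if let last = entries.last {
                    proxy.scrollTo(last.id, anchor: .bottom)
                }
            }
        }
    }
}
```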

No Complex Navigation

The entire app is:

SwiftUI + @Observable Pattern

Use @Observable macro (iOS 17+) for the view model. No ObservableObject, no @Published everywhere. Cleaner and more performant:

@Observable class VoiceViewModel { ... }  // iOS 17+
// In view:
@Environment(VoiceViewModel.self) var vm
// or
@State private var vm = VoiceViewModel()

Open Source References

| Repo | Why It’s Relevant | Rating |
|---|---|---|
| m1guelpf/swift-realtime-openai | ⭐ Top pick. Full OpenAI Realtime API client in clean Swift 5.9 async/await. Supports both WebSocket and WebRTC connectors. Session management, conversation history, audio capture + playback. Production-quality code. | ⭐⭐⭐⭐⭐ |
| PallavAg/VoiceModeWebRTCSwift | WebRTC-specific OpenAI Realtime implementation. Shows interruption handling, system message config, voice selection. Good reference for the WebRTC data channel event handling pattern. | ⭐⭐⭐⭐ |
| kasimok/AECAudioStream | Drop-in Swift Package for hardware AEC via VoiceProcessingIO. Use this if the setVoiceProcessingEnabled approach has issues. Core Audio wrapper. | ⭐⭐⭐⭐ |
| twilio/voice-quickstart-ios AudioDeviceExample | Production-grade, battle-tested VoiceProcessingIO + AVAudioEngine manual rendering. ObjC, but the most complete AEC reference that exists. Twilio uses this in production for millions of calls. | ⭐⭐⭐⭐⭐ |
| baochuquan/ios-vad | iOS VAD toolkit: WebRTC GMM, Silero DNN, Yamnet DNN models. Useful if you want client-side VAD (fallback or supplement to OpenAI’s server VAD). | ⭐⭐⭐⭐ |
| dmrschmidt/DSWaveformImage | Best waveform rendering library for SwiftUI and UIKit. Real-time waveform from audio buffers. Use for transcript view or orb alternative. | ⭐⭐⭐⭐ |
| lzell/AIProxySwift | Realtime API with ephemeral key pattern — shows how to protect the API key via a proxy. Good security pattern reference if you don’t want to run your own backend for the ephemeral key. | ⭐⭐⭐ |

Start with m1guelpf/swift-realtime-openai. Fork it, strip what you don’t need, add the Jarvis tool call handler. This saves 2–3 weeks of audio pipeline work.


Media Playback — A First-Class Use Case

This app should not feel like a voice-only ChatGPT wrapper. One of the highest-leverage interactions is:

“Play me something interesting.”

That single prompt turns Jarvis from an assistant into a companion. It curates. It surprises. It understands context. And critically: playback happens on-device, in high quality, with proper ducking when Jarvis speaks.

A Deliberate Exception to the Thin-Router Rule

The core architecture is still right: ask_jarvis() handles reasoning. But media control is one of the rare places where local tools should be first-class.

In short: Jarvis curates, iPhone performs.

The Flow

User: "Play something good"
Realtime API → ask_jarvis("recommend something to play — music, podcast, or ambient audio")
Jarvis reasons: time of day, Sam's recent activity, mood cues from conversation, taste history
Returns: {
  "type": "podcast",
  "title": "Darknet Diaries ep 147",
  "url": "https://...",
  "reason": "you haven't listened to this one and you're clearly in a technical mood"
}
iOS app executes client-side tool: play_audio(url, title, type)
AVPlayer / AVAudioEngine streams audio on device
Voice LLM: "Playing Darknet Diaries episode 147. You haven't heard this one."
Media plays. Jarvis goes quiet until spoken to.

Client-Side Media Tools

These run entirely on-device. No homelab round-trip required.

| Tool | Action |
|---|---|
| play_audio(url, title, type) | Stream media URL via AVPlayer |
| pause_playback() | Pause current media |
| resume_playback() | Resume paused media |
| stop_playback() | Stop and clear now playing |
| skip_track() | Advance to next queued item |
| get_now_playing() | Return current media metadata to voice LLM |
| set_volume(level) | Set output volume (0.0–1.0) |
| duck_audio(level) | Lower media level while Jarvis speaks |
| enqueue(url, title) | Append item to local queue |
| clear_queue() | Remove all queued items |

If you want this to feel magical, support at least: play_audio, pause_playback, resume_playback, get_now_playing, and duck_audio in V1.
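A sketch of that V1 subset, assuming a single shared AVPlayer; the class and method names are illustrative, not a fixed API:

```swift
import AVFoundation

// Sketch of the on-device media tools. Each method maps to one
// client-side tool from the table above.
final class MediaPlayer {
    private let player = AVPlayer()
    private(set) var nowPlayingTitle: String?

    func playAudio(url: URL, title: String) {
        player.replaceCurrentItem(with: AVPlayerItem(url: url))
        nowPlayingTitle = title
        player.play()
    }

    func pausePlayback()  { player.pause() }
    func resumePlayback() { player.play() }

    func stopPlayback() {
        player.pause()
        player.replaceCurrentItem(with: nil)
        nowPlayingTitle = nil
    }

    // duck_audio: drop media level while Jarvis speaks, restore after
    func duckAudio(to level: Float = 0.2) { player.volume = level }
    func unduck()                         { player.volume = 1.0 }

    // get_now_playing: metadata handed back to the voice LLM as a tool result
    func getNowPlaying() -> String {
        nowPlayingTitle.map { "Now playing: \($0)" } ?? "Nothing is playing"
    }
}
```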

Audio Ducking (Non-Negotiable)

Ducking is what makes voice + playback feel polished instead of chaotic. Jarvis should never shout over music.

Use one shared audio policy:

// AVAudioSession setup for duplex voice + media
let session = AVAudioSession.sharedInstance()
try session.setCategory(.playAndRecord,
                        mode: .voiceChat,
                        options: [.defaultToSpeaker, .allowBluetooth, .duckOthers])
try session.setActive(true)

.duckOthers also helps when external audio apps are active (Spotify, Podcasts, etc.). For your own internal media player, still apply explicit gain automation so duck timing feels intentional.
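One way to sketch that gain automation: compute a short linear fade instead of snapping the volume. The step count and target level here are arbitrary choices, not derived values.

```swift
import Foundation

// Sketch: intermediate volume levels for a smooth duck.
// Apply each level to player.volume on a short timer
// (e.g. 15 ms apart, so the full duck takes ~150 ms).
func duckLevels(from start: Float, to target: Float, steps: Int = 10) -> [Float] {
    (1...steps).map { start + (target - start) * Float($0) / Float(steps) }
}
```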

What Jarvis Can Pick

The real product value is not playback mechanics. It is selection intelligence.

Podcasts

Music

Ambient / Focus Audio

Creative Modes (This Is Where It Becomes Memorable)

Jarvis should not only obey literal commands. It should program experiences.

The DJ Pattern

The strongest version of this feature is Contextual DJ Jarvis:

  1. Jarvis introduces a pick
  2. App plays it
  3. App detects playback end (AVPlayerItemDidPlayToEndTime)
  4. iOS sends event back to Realtime session
  5. Jarvis picks and tees up the next item with commentary

Example voice transition:

“That was Floating Points. Next up: something with similar texture but more drive — from a Warp compilation in 2019.”

This loop creates a living, personalized station rather than one-off playback commands.
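Steps 3–4 of the loop hinge on a single notification. A sketch, where `notifyTrackEnded` stands in for whatever mechanism injects the end-of-track event into the Realtime session:

```swift
import AVFoundation

// Sketch: detect playback end and hand control back to Jarvis.
final class PlaybackEndObserver {
    private var token: NSObjectProtocol?

    func observe(item: AVPlayerItem, notifyTrackEnded: @escaping () -> Void) {
        token = NotificationCenter.default.addObserver(
            forName: .AVPlayerItemDidPlayToEndTime,
            object: item,
            queue: .main
        ) { _ in
            // e.g. create a conversation item + response.create so
            // Jarvis introduces the next pick
            notifyTrackEnded()
        }
    }

    deinit {
        if let token { NotificationCenter.default.removeObserver(token) }
    }
}
```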

Sources Without Auth (Zero-Config MVP)

For day-one implementation with no OAuth headaches:

This is enough to ship a compelling first version quickly.

Opinionated Build Order

  1. Ship zero-auth playback first (SomaFM + podcasts + queue + ducking)
  2. Add taste memory + novelty scoring (avoid repeats, explain picks)
  3. Implement DJ loop (track end events → next selection)
  4. Only then add OAuth providers (Spotify/YouTube Music)

If you get step 1 and step 3 right, the app already feels special.


Open Questions / Decisions Needed

Sam needs to decide the following before starting:

1. Voice API Choice (High Priority)

The recommendation is gpt-realtime, but verify the cost first: are you comfortable with ~$0.15/min at typical usage? For a personal tool used 30 min/day, that’s ~$4.50/day, or ~$135/month. If that’s too high, Grok at $0.05/min is ~$45/month.

2. Integrate Existing term-llm HTTP API (Critical Path)

This is the most important integration decision, but not a wrapper-building project. term-llm already exposes the required HTTP API on 10.0.0.52:8081.

3. Tailscale vs Other Networking

Tailscale VPN On Demand is the cleanest solution, but it requires the Tailscale app to be installed. Alternatives:

Recommendation: Tailscale. You presumably already use it, and VPN On Demand is well-documented.

4. Session Length Strategy

How long should a voice session last before forcing a reset? Options:

Recommendation: Per-conversation with Jarvis continuity — each Realtime session is fresh, but the jarvis_session_id persists across the app session.

5. Transcript Display: Yes or No?

A persistent transcript is useful for debugging and for accessibility. But it adds UI complexity and storage considerations. Recommendation: implement it as an optional debug overlay, off by default.

6. gpt-realtime vs gpt-4o-mini-realtime

Mini is 8x cheaper for audio tokens but notably weaker at function calling. For a single-tool routing pattern (ask_jarvis always), this might be acceptable. Test mini first, see if tool call reliability is sufficient. If yes, the cost savings are significant ($17/month vs ~$135/month for 30 min/day usage).


| Layer | Technology | Justification |
|---|---|---|
| Voice LLM | OpenAI gpt-realtime (full model) | Best tool calling reliability, semantic VAD, auto-waiting, WebRTC native, ephemeral keys |
| Transport | WebRTC (via WebRTC iOS framework) | Packet loss resilience, built-in jitter buffer, no TCP head-of-line blocking, OpenAI’s own recommendation |
| iOS Audio | AVAudioEngine + .voiceChat mode + setVoiceProcessingEnabled(true) | Hardware AEC, AGC, noise suppression; correct path for duplex voice AI |
| Sample Rate Conversion | AVAudioConverter (48kHz Float32 → 24kHz Int16) | Required — iOS hardware always runs at 48kHz; OpenAI requires 24kHz PCM16 |