Overview / What We’re Building
Jarvis Voice is a minimal iOS app that gives Sam a natural two-way voice interface to his existing Jarvis AI agent running on a homelab server at 10.0.0.52:8081. The architecture is a thin-router pattern: a cloud-based voice LLM (OpenAI’s gpt-realtime) handles all speech input/output — microphone capture, voice activity detection, speech-to-speech audio generation — but contains zero reasoning logic itself. Every meaningful user request is routed via a single ask_jarvis() tool call over Tailscale VPN directly to term-llm’s OpenAI-compatible HTTP API (/v1/chat/completions or /v1/responses) on the Jarvis backend, which does the actual thinking. The voice LLM speaks Jarvis’s response back to the user. The iOS app is the WebRTC audio transport layer, the tool-call dispatcher, and nothing else. The result: Sam speaks naturally, hears Jarvis respond in a natural voice, with sub-7-second end-to-end latency for most requests, zero cloud storage of conversation content, and full access to all of Jarvis’s existing capabilities without re-implementing them.
Voice LLM API Options
Comparison Table
| Dimension | OpenAI gpt-realtime | xAI Grok Voice Agent | Gemini Live 2.5 Flash | ElevenLabs Conv. AI | Hume EVI 3 |
|---|---|---|---|---|---|
| Architecture | True S2S | True S2S | True S2S (Native Audio) | Pipeline (STT→LLM→TTS) | True S2S |
| Transport | WebRTC + WebSocket + SIP | WebSocket + LiveKit | WebSocket | WebSocket + WebRTC | WebSocket |
| Tool calling | ✅ Native, first-class | ✅ Native + built-in (web/X search) | ✅ Native + Google Search | ✅ Client-side + server-side | ✅ Requires external LLM |
| Official iOS SDK | ❌ (community: m1guelpf) | ❌ (use LiveKit iOS SDK) | ❌ (DIY WebSocket or Pipecat) | ✅ Native Swift SDK (v2.1.0+) | ✅ HumeAI Swift SDK |
| Pricing | $32/$64 per 1M audio in/out tokens (~$0.10–0.20/min typical) | $0.05/min flat | ~$0.015–0.02/min (cheapest) | $0.04–0.10/min | $0.04–0.07/min |
| Latency (TTFA) | ~200–500ms (no tools) | <700ms avg | 150–400ms (variable) | ~300–600ms (pipeline) | ~200–400ms |
| Context window | 128K tokens (gpt-realtime GA) | S2S managed | 1M tokens | Depends on LLM | Depends on LLM |
| LLM flexibility | ❌ GPT-4o only | ❌ Grok only | ❌ Gemini only | ✅ BYO-LLM | ✅ External LLM (Claude/GPT) |
| Function calling reliability | ⭐⭐⭐⭐⭐ (66.5% on evals, full model) | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ (needs ext. LLM) |
| Voice quality | ⭐⭐⭐⭐ (marin, alloy, etc.) | ⭐⭐⭐⭐ (5 voices: Ara/Rex/Sal/Eve/Leo) | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ (5000+ voices) | ⭐⭐⭐⭐⭐ (clone support) |
| Semantic VAD | ✅ (semantic_vad mode) | ✅ | ✅ | ✅ | ✅ |
| Interruption handling | ✅ Auto (server_vad) | ✅ | ✅ | ✅ | ✅ |
| Prompt caching | ✅ 94% audio discount on cached | ❌ | ✅ | N/A | N/A |
| OpenAI API compat | ✅ native | ✅ | ❌ | ❌ | ❌ |
| Ecosystem maturity | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ (Dec 2025 launch) | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
Analysis
OpenAI gpt-realtime is the clear winner for this use case despite being the most expensive option. Reasons:
1. **Tool calling is the entire architecture** — the `ask_jarvis()` pattern lives or dies on tool-call reliability. OpenAI's GA `gpt-realtime` model scored 66.5% on function-calling evals vs 49.7% for the preview model. xAI Grok is competitive but newer and less proven. The mini model is explicitly worse at function calling — use the full model.
2. **Semantic VAD** — `gpt-realtime` supports `semantic_vad`, which understands natural speech pauses vs sentence endings. For Sam's use case (complex queries to Jarvis), this matters enormously — you don't want the model cutting off mid-sentence.
3. **WebRTC as first-class transport** — OpenAI explicitly recommends WebRTC for mobile/iOS. The ephemeral key flow is clean, and the community Swift reference implementation (`m1guelpf/swift-realtime-openai`) is production-quality.
4. **Auto-waiting built in** — the GA `gpt-realtime` model automatically says "I'm still waiting on that" if a tool call takes too long. No custom implementation needed.
5. **Prompt caching** — a 94% discount on cached audio input tokens means long sessions get dramatically cheaper after the first few turns.
Why not ElevenLabs? Despite the excellent native Swift SDK, ElevenLabs is pipeline-based (not true S2S), which adds latency. More importantly, it depends on your LLM choice for function calling — you’d be paying for both ElevenLabs and GPT-4o API costs, at higher combined latency.
Why not Grok? $0.05/min flat rate is appealing, but: (1) launched December 2025, tiny community, (2) no native iOS SDK, (3) LiveKit plugin was Python-only at launch, (4) function calling not as proven as OpenAI’s.
Why not Gemini Live? Cheapest option by far (~$0.015/min) and 1M token context window is incredible, but no native iOS SDK means 1–2 weeks of extra engineering to build the WebSocket layer or deploy a Pipecat relay server.
🏆 Recommendation: OpenAI gpt-realtime via WebRTC
Use gpt-realtime (full model, not mini) via WebRTC with ephemeral keys. Revisit gpt-4o-mini-realtime-preview only after validating that tool calling behavior is acceptable in testing — expect degraded function calling reliability on mini.
Recommended Architecture
The Thin-Router Pattern
The voice LLM is not the brain. It is the mouth and ears of Jarvis. All reasoning, memory, tool use, and response generation happens on the homelab. The voice LLM’s only job is:
- Convert Sam's speech to text (S2S)
- Decide "this is a real request" → call `ask_jarvis()`
- Pass the result back as natural speech
This is better than making the voice LLM do everything because:
- Jarvis already works — it has tools, memory, context, personality. Don’t duplicate this.
- Context window efficiency — keeping voice context short (just routing events) means 32K tokens lasts much longer
- Cost — voice LLM audio tokens are expensive. Complex reasoning on audio tokens is very expensive. Let Jarvis do reasoning in text on a cheaper model.
- Upgradeability — swap Jarvis’s backend LLM (Claude → Gemini → whatever) without touching the voice layer
Full Architecture Diagram
┌──────────────────────────────────────────────────────────────────────────────┐
│ iOS Voice App │
│ │
│ ┌──────────────────────────────┐ ┌────────────────────────────────────┐ │
│ │ SwiftUI Layer │ │ WebRTC Peer Connection │ │
│ │ @Observable VoiceViewModel │ │ RTCPeerConnection (audio track) │ │
│ │ Pulsating orb / waveform │ │ RTCDataChannel (JSON events) │ │
│ │ .idle → .listening → │ │ Ephemeral key auth │ │
│ │ .processing → .speaking │ │ ICE/DTLS/SRTP encrypted │ │
│ └──────────────┬───────────────┘ └──────────────────┬─────────────────┘ │
│ │ @MainActor │ │
│ ┌──────────────▼───────────────────────────────────┐ │ │
│ │ RealtimeEventHandler (actor) │ │ │
│ │ • Registers ask_jarvis() tool in session.update │ │ │
│ │ • Watches response.output_item.done events │ │ │
│ │ • Fires async Task → JarvisClient │ │ │
│ │ • Submits conversation.item.create (tool result)│ │ │
│ │ • Sends response.create to trigger speech │ │ │
│ └──────────────┬───────────────────────────────────┘ │ │
│ │ async/await │ │
│ ┌──────────────▼───────────────┐ │ │
│ │ AVAudioEngine (actor) │ │ │
│ │ AVAudioSession.voiceChat │ │ │
│ │ Hardware AEC + AGC │ │ │
│ │ 48kHz Float32 tap │ │ │
│ │ AVAudioConverter → 24kHz │ │ │
│ │ Int16 PCM for WebRTC │ │ │
│ │ AVAudioPlayerNode (TTS out) │ │ │
│ └──────────────────────────────┘ │ │
│ │ │
│ ┌──────────────────────────────┐ │ │
│ │ JarvisClient (actor) │ │ │
│ │ URLSession with 10s timeout │ │ │
│ │ Bearer token from Keychain │ │ │
│ │ Circuit breaker pattern │ │ │
│ │ session_id UUID tracking │ │ │
│ └──────────────┬───────────────┘ │ │
└─────────────────│─────────────────────────────────────│─────────────────────┘
│ │
Tailscale WireGuard VPN WebRTC + DTLS/SRTP
(System VPN, On-Demand rules) (UDP, direct path)
│ │
▼ ▼
┌────────────────────────┐ ┌──────────────────────────────┐
│ Homelab 10.0.0.52 │ │ OpenAI Realtime API │
│ │ │ model: gpt-realtime │
│ ┌────────────────────────┐ │ │ │
│ │ term-llm HTTP API │ │ │ Session config: │
│ │ /v1/chat/completions│ │ │ - Semantic VAD │
│ │ /v1/responses │ │ │ - Tool: ask_jarvis() │
│ │ Bearer + session_id │ │ │ - Voice: marin │
│ └────────┬─────────────┘ │ │ - 128K context │
│ │ │ └──────────────────────────────┘
│ ┌────────▼─────────────┐ │
│ │ Jarvis Agent │ │
│ │ (term-llm) │ │
│ │ memory/tools/search │ │
│ └──────────────────────┘ │
└────────────────────────┘
Why Not Make the Voice LLM Do Everything?
| Approach | Voice LLM Does Everything | Thin-Router (Recommended) |
|---|---|---|
| Jarvis memory/tools | Duplicated or lost | Fully preserved |
| Cost per complex query | High (audio tokens for reasoning) | Low (audio tokens only for routing) |
| Jarvis upgrades | Require app update | Transparent |
| Context window burn | Fast (audio tokens expensive) | Slow (minimal turns) |
| Session length | Short (~15 min before context fills) | Longer (Jarvis manages its own context) |
The Tool Calling Mechanism
The ask_jarvis() Tool Definition
{
"type": "function",
"name": "ask_jarvis",
"description": "Routes any meaningful user request to the Jarvis AI agent backend running on Sam's homelab. Use this tool for EVERY real question, task, or request. Do not attempt to answer from your own knowledge.",
"parameters": {
"type": "object",
"properties": {
"query": {
"type": "string",
"description": "The user's complete request in natural language, including all relevant context they stated (names, dates, quantities, locations). Do not abbreviate or reframe."
}
},
"required": ["query"]
}
}
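At session start, this definition is registered via a `session.update` event on the data channel. A sketch of the surrounding payload — the `voice` and `turn_detection` values follow the choices made above, but exact session field names should be checked against the current Realtime API reference:

```json
{
  "type": "session.update",
  "session": {
    "voice": "marin",
    "turn_detection": { "type": "semantic_vad" },
    "tool_choice": "auto",
    "tools": [
      {
        "type": "function",
        "name": "ask_jarvis",
        "description": "Routes any meaningful user request to the Jarvis AI agent backend.",
        "parameters": {
          "type": "object",
          "properties": {
            "query": { "type": "string" }
          },
          "required": ["query"]
        }
      }
    ]
  }
}
```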
Complete Event Flow (Source-Verified from OpenAI Docs)
┌─────────────────────────────────────────────────────────────────────────────┐
│ PHASE 1: USER SPEAKS │
│ │
│ Sam: "Hey, what's on my calendar tomorrow?" │
│ │
│ iOS mic → PCM16 24kHz → WebRTC audio track → OpenAI Realtime API │
│ │
│ Server events (client receives): │
│ ← input_audio_buffer.speech_started │
│ ← input_audio_buffer.speech_stopped (semantic_vad fires) │
│ ← input_audio_buffer.committed │
│ │
├─────────────────────────────────────────────────────────────────────────────┤
│ PHASE 2: MODEL DECIDES TO CALL TOOL │
│ │
│ ← response.created │
│ ← response.output_item.added { type: "function_call", name: "ask_jarvis" }│
│ ← response.function_call_arguments.delta × N (streaming JSON) │
│ e.g. delta: '{"q' → '{"query' → '{"query":"What' → ... │
│ ← response.function_call_arguments.done │
│ final: { "query": "What is on my calendar tomorrow?" } │
│ ← response.output_item.done { │
│ type: "function_call", │
│ name: "ask_jarvis", │
│ call_id: "call_abc123", │
│ arguments: "{\"query\":\"What is on my calendar tomorrow?\"}" │
│ } │
│ ← response.done (status: "completed" — model spoke filler, stopped) │
│ │
│ [CONCURRENTLY: model has already spoken filler phrase audio] │
│ [e.g. "Let me check with Jarvis." plays while HTTP is in flight] │
│ │
├─────────────────────────────────────────────────────────────────────────────┤
│ PHASE 3: iOS APP CALLS JARVIS │
│ │
│ Task { │
│ POST http://10.0.0.52:8081/v1/chat/completions │
│ Authorization: Bearer <token-from-keychain> │
│ Content-Type: application/json │
│ session_id: <jarvis-session-uuid> │
│ { "messages":[{"role":"user","content":"What is on my calendar tomorrow?"}],│
│ "stream": false } │
│ } │
│ │
│ ← HTTP 200 { "choices":[{"message":{"content":"You have a dentist..."}}], ... }│
│ │
├─────────────────────────────────────────────────────────────────────────────┤
│ PHASE 4: APP SUBMITS TOOL RESULT │
│ │
│ → conversation.item.create { │
│ "type": "conversation.item.create", │
│ "item": { │
│ "type": "function_call_output", │
│ "call_id": "call_abc123", │
│ "output": "You have a dentist at 10am and team standup at 2pm." │
│ } │
│ } │
│ │
│ → response.create {} │
│ │
├─────────────────────────────────────────────────────────────────────────────┤
│ PHASE 5: MODEL SPEAKS THE ANSWER │
│ │
│ ← response.created │
│ ← response.output_audio.delta × N (streamed PCM16 audio chunks) │
│ ← response.output_audio_transcript.delta × N (streamed transcript text) │
│ ← response.done │
│ │
│ [iOS plays audio chunks via AVAudioPlayerNode as they arrive] │
└─────────────────────────────────────────────────────────────────────────────┘
Key Implementation Detail (Swift)
// Arguments payload for the ask_jarvis tool call
struct JarvisArgs: Decodable { let query: String }

func handleRealtimeEvent(_ event: RealtimeServerEvent) {
  guard case .responseOutputItemDone(let item) = event,
        item.type == "function_call",
        item.name == "ask_jarvis",
        let callId = item.callId,
        let argsStr = item.arguments,
        let argsData = argsStr.data(using: .utf8),
        let args = try? JSONDecoder().decode(JarvisArgs.self, from: argsData)
  else { return }
Task {
do {
let result = try await jarvisClient.ask(
query: args.query,
sessionId: currentSessionId,
timeout: 10.0
)
// Truncate long responses for voice
let voiceOutput = result.response.truncatedForVoice(maxWords: 150)
await submitToolResult(callId: callId, output: voiceOutput)
await sendResponseCreate()
} catch {
let errorMsg = errorMessage(for: error)
await submitToolResult(callId: callId, output: errorMsg)
await sendResponseCreate()
}
}
}
func submitToolResult(callId: String, output: String) async {
  let event: [String: Any] = [
    "type": "conversation.item.create",
    "item": [
      "type": "function_call_output",
      "call_id": callId,
      "output": output
    ]
  ]
  // Serialize and send as a Realtime event over the WebRTC data channel
  guard let data = try? JSONSerialization.data(withJSONObject: event) else { return }
  _ = dataChannel.sendData(RTCDataBuffer(data: data, isBinary: false))
}
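For completeness, the `sendResponseCreate()` companion referenced in the handler is a one-field event. A sketch assuming the GoogleWebRTC `RTCDataChannel` API:

```swift
import WebRTC

// Ask the model to generate (and speak) a new response, consuming
// the tool result submitted just before.
func sendResponseCreate(on dataChannel: RTCDataChannel) {
    let json = #"{"type":"response.create"}"#
    let buffer = RTCDataBuffer(data: Data(json.utf8), isBinary: false)
    _ = dataChannel.sendData(buffer)
}
```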
Latency Breakdown Table
| Stage | Latency (LAN) | Latency (4G/5G Cellular) | Notes |
|---|---|---|---|
| Semantic VAD fires after speech ends | 200–600ms | 200–600ms | Semantic VAD more accurate, slightly slower than server_vad |
| Model processes audio + decides to call tool | 150–400ms | 150–400ms | Included in response initiation |
| Model speaks filler phrase | 800ms–1.5s | 800ms–1.5s | Concurrent with HTTP call below |
| HTTP to Jarvis via Tailscale WireGuard | 5–30ms | 80–300ms | LAN: nearly instant. Cellular: DERP relay may add latency |
| Jarvis LLM inference (Claude Sonnet) | 1,000–4,000ms | 1,000–4,000ms | Dominant cost for complex queries |
| Model generates first audio byte after tool result | 200–500ms | 200–500ms | After response.create sent |
| Total perceived gap | ~2–5s | ~3–7s | Filler phrase masks Jarvis inference time |
Key insight: The filler phrase (1–1.5s) buys you almost all the time you need for Jarvis to respond on LAN. The user hears “Let me check with Jarvis” and then almost immediately hears the answer. The silence gap that needs to be hidden is usually under 2 seconds.
Filler Phrase Strategy
The Problem
When ask_jarvis() is called, the voice LLM stops speaking and waits for the tool result. Without intervention, Sam hears dead silence for 2–7 seconds. This feels broken.
The Solution: Pre-Call Verbal Acknowledgment
Instruct the model in the system prompt to say one short phrase before calling the tool. This speech happens in the response.output_audio.delta stream that accompanies the function call. When response.done arrives (marking the end of the filler + the tool call), you fire the HTTP request to Jarvis.
System Prompt Language
# Tool Usage
Before calling ask_jarvis(), always speak one short, natural acknowledgment.
These must be varied — never use the same phrase twice in a row:
- "Let me ask Jarvis."
- "One moment."
- "Checking now."
- "On it."
- "Let me look that up."
- "Sure, give me a second."
Then call ask_jarvis() immediately. Do not say anything else before calling.
The Auto-Waiting Feature (Built Into gpt-realtime)
From OpenAI’s official prompting docs (confirmed source):
“If you ask the model for the results of a function call, it’ll say something like ‘I’m still waiting on that.’ This feature is automatically enabled for new models — no changes necessary.”
This means: if Jarvis takes longer than expected (e.g., complex multi-step query), the model will naturally fill the silence with “I’m still waiting on that…” without any code on your end. This is a gpt-realtime GA feature, not available in the preview models.
What NOT To Do
- ❌ Do not try to stream Jarvis’s response in real-time into the tool result — tool outputs are submitted as complete strings; there is no mid-tool-call streaming into the model
- ❌ Do not disable VAD waiting for the tool call to complete — the model handles this state automatically
- ❌ Do not set `silence_duration_ms` very low (< 200ms) — this causes premature VAD firing on natural pauses mid-sentence
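Related: the VAD timing itself is configured in `session.update`. A sketch — the `eagerness` field is how current OpenAI docs tune semantic VAD responsiveness, but verify against the live API reference:

```json
{
  "type": "session.update",
  "session": {
    "turn_detection": {
      "type": "semantic_vad",
      "eagerness": "auto"
    }
  }
}
```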
iOS Audio Pipeline
AVAudioEngine vs AVAudioSession — The Mental Model
These are not alternatives — they are two layers of the same stack.
| Layer | Role | What It Controls |
|---|---|---|
| `AVAudioSession` | OS contract — tells iOS how you intend to use audio | Routing, interruption policy, AEC mode, category |
| `AVAudioEngine` | Signal graph — the actual audio processing pipeline | Nodes, taps, converters, players |
Configure AVAudioSession first, then build AVAudioEngine on top.
The Correct Setup for Full-Duplex Voice AI
// Step 1: Configure the session
let session = AVAudioSession.sharedInstance()
try session.setCategory(.playAndRecord,
mode: .voiceChat, // ← KEY: enables hardware AEC + AGC
options: [.defaultToSpeaker, .allowBluetooth])
try session.setPreferredSampleRate(24000) // Request 24kHz (hardware may ignore)
try session.setPreferredIOBufferDuration(0.01) // 10ms buffer = ~240 samples
try session.setActive(true)
// Step 2: Build the engine
let engine = AVAudioEngine()
// Step 3: Enable voice processing (AEC) on the input node
// MUST be called while engine is STOPPED
try engine.inputNode.setVoiceProcessingEnabled(true)
// Step 4: Tap at NATIVE hardware format (48kHz Float32 — DO NOT try to force 24kHz here)
let nativeFormat = engine.inputNode.inputFormat(forBus: 0) // 48kHz Float32 mono
// Step 5: Set up converter to OpenAI's required format
let targetFormat = AVAudioFormat(commonFormat: .pcmFormatInt16,
sampleRate: 24000,
channels: 1,
interleaved: true)!
let converter = AVAudioConverter(from: nativeFormat, to: targetFormat)!
// Step 6: Install tap and stream to WebRTC
engine.inputNode.installTap(onBus: 0, bufferSize: 4800, format: nativeFormat) { buffer, _ in
let frameCount = AVAudioFrameCount(24000) * buffer.frameLength /
AVAudioFrameCount(nativeFormat.sampleRate)
let convertedBuffer = AVAudioPCMBuffer(pcmFormat: targetFormat,
frameCapacity: frameCount)!
var error: NSError?
var consumed = false
converter.convert(to: convertedBuffer, error: &error) { _, outStatus in
if !consumed { outStatus.pointee = .haveData; consumed = true; return buffer }
outStatus.pointee = .noDataNow; return nil
}
// convertedBuffer.int16ChannelData![0] = raw Int16 PCM at 24kHz
// Base64-encode and send over WebSocket, OR feed directly to WebRTC audio track
let audioData = Data(bytes: convertedBuffer.int16ChannelData![0],
count: Int(convertedBuffer.frameLength) * 2)
realtimeSession.sendAudio(audioData)
}
// Step 7: Attach playback node (for TTS output from OpenAI)
let playerNode = AVAudioPlayerNode()
engine.attach(playerNode)
engine.connect(playerNode, to: engine.mainMixerNode, format: targetFormat)
try engine.start()
PCM16 at 24kHz — The Numbers
10ms frame = 240 samples × 2 bytes = 480 bytes
20ms frame = 480 samples × 2 bytes = 960 bytes (good for VAD processing)
100ms chunk = 2400 samples × 2 bytes = 4,800 bytes (good WebSocket granularity)
Uplink bandwidth: ~48 KB/s (24kHz mono PCM16)
Downlink bandwidth: ~48 KB/s (AI voice response, same format)
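The arithmetic above reduces to one helper, useful for sizing buffers consistently (illustrative only, not part of any SDK):

```swift
// Bytes in one mono PCM16 frame of the given duration.
func pcm16FrameBytes(milliseconds: Int, sampleRate: Int = 24_000) -> Int {
    (sampleRate * milliseconds / 1_000) * 2  // samples × 2 bytes/sample
}

// pcm16FrameBytes(milliseconds: 10)  == 480
// pcm16FrameBytes(milliseconds: 100) == 4_800
```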
AEC — The Right Configuration
Use Option A (.voiceChat mode) + Option B (setVoiceProcessingEnabled(true)) together. This is the path of least resistance and handles 95% of echo cancellation needs.
Do NOT use setPrefersEchoCancelledInput(true) — this is iOS 18.2+ only, hardware-gated to 2024 iPhones, and cannot be combined with Voice Processing IO APIs. It’s designed for music apps, not voice AI.
Key Gotchas
| Gotcha | Detail | Fix |
|---|---|---|
| The disconnect bug (WebSocket code 1000) | URLSessionWebSocketTask closes immediately after first audio packet | Audio is not properly formatted as PCM16 Int16. Ensure base64 encodes raw Int16 (little-endian) bytes, not Float32. Confirm session.created received before sending audio. |
| Volume drop | VoiceProcessingIO reduces playback volume ~3–6dB | This is by design (headroom for AEC). Adjust playerNode.volume upward, or use engine.mainMixerNode.outputVolume. |
| Cannot force tap format | Setting custom format on installTap silently fails or produces zero buffers | Always tap at native 48kHz Float32, use AVAudioConverter to resample. |
| Route change resets AEC | Headphone insertion/removal requires engine restart | Listen to AVAudioSession.routeChangeNotification, pause engine, call try? session.setActive(true), restart engine. |
| Engine config change | Hardware change (USB mic, headphones) auto-stops engine | Listen to AVAudioEngineConfigurationChangeNotification, rewire graph and restart. |
| Media services reset | Rare but possible — iOS kills audio server | Listen to AVAudioSession.mediaServicesWereResetNotification, full teardown + rebuild. |
| Bluetooth A2DP → HFP | .allowBluetooth forces BT into HFP (narrowband) for AEC | Expected behavior. HFP = 8kHz or 16kHz voice profile. A2DP = high quality but no AEC. |
| `conversation_already_has_active_response` | Sending `response.create` while a response is in flight | Always gate `response.create` on `response.done` or `response.cancelled` |
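The last gotcha can be enforced with a small gate that tracks whether a response is in flight (names here are illustrative, not from any SDK):

```swift
// Serializes access to the "response in flight" flag so response.create
// is never sent while a previous response is still active.
actor ResponseGate {
    private var responseInFlight = false

    // Call when response.created arrives.
    func responseStarted() { responseInFlight = true }

    // Call on response.done / response.cancelled.
    func responseFinished() { responseInFlight = false }

    // Returns true (and claims the slot) only when it is safe
    // to send response.create now.
    func tryBeginResponse() -> Bool {
        guard !responseInFlight else { return false }
        responseInFlight = true
        return true
    }
}
```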
WebRTC vs WebSocket for iOS
Why WebRTC Wins
| Dimension | WebRTC | WebSocket |
|---|---|---|
| Packet loss handling | Built-in FEC (Opus codec), can drop late packets | TCP: retransmits, causes jitter/delay |
| Head-of-line blocking | None (UDP-based) | Yes (TCP) — a dropped packet stalls all subsequent audio |
| AEC integration | Framework-level AEC built in to WebRTC iOS SDK | Manual (must implement via VoiceProcessingIO) |
| Network transitions | ICE restart handles wifi→cellular gracefully | URLSessionWebSocketTask often drops on network change |
| Jitter buffer | Built in (adaptive) | Must implement manually |
| Latency | Lower (UDP, adaptive bitrate) | Higher (TCP overhead) |
| OpenAI recommendation | ✅ Explicitly recommended for mobile/iOS | “Server-to-server tool” per OpenAI docs |
From the OpenAI docs (verified source):
“WebSocket is explicitly described as a ‘server-to-server’ tool. However, WebSocket is still viable on mobile when you control the full audio pipeline.”
Translated: use WebRTC. WebSocket is for your server talking to OpenAI, not your iOS app.
The Ephemeral Key Flow
1. Your backend server:
POST https://api.openai.com/v1/realtime/client_secrets
Authorization: Bearer <OPENAI_API_KEY>
→ Returns: { "client_secret": { "value": "ek_xxx...", "expires_at": ... } }
2. iOS app fetches ephemeral key from YOUR backend (never store raw OpenAI key on device)
3. iOS app:
POST https://api.openai.com/v1/realtime/calls
Authorization: Bearer ek_xxx
Content-Type: application/sdp
Body: <SDP offer from RTCPeerConnection>
→ Returns: SDP answer
4. Set RTCPeerConnection remote description with SDP answer
5. ICE negotiation completes → audio stream is live
6. Tool call events arrive on RTCDataChannel
Note: Ephemeral keys have a short TTL (minutes). Generate a new one per session start. Store the raw OpenAI API key server-side (your backend or Keychain-protected relay), never in the iOS app bundle.
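Step 2 might look like this on the iOS side — assuming a hypothetical `/realtime-token` route on your own backend that wraps the `client_secrets` call and returns the inner `client_secret` object:

```swift
import Foundation

struct EphemeralKey: Decodable {
    let value: String      // "ek_xxx…"
    let expiresAt: Double?

    enum CodingKeys: String, CodingKey {
        case value
        case expiresAt = "expires_at"
    }
}

// Fetch a short-lived key from *your* backend — the raw OpenAI key
// never ships in the app. "/realtime-token" is a hypothetical route name.
func fetchEphemeralKey(backend: URL) async throws -> EphemeralKey {
    var request = URLRequest(url: backend.appendingPathComponent("realtime-token"))
    request.httpMethod = "POST"
    let (data, _) = try await URLSession.shared.data(for: request)
    return try JSONDecoder().decode(EphemeralKey.self, from: data)
}
```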
Reference Implementations
- m1guelpf/swift-realtime-openai — supports both WebSocket and WebRTC connectors. Clean Swift 5.9 async/await. Start here.
- PallavAg/VoiceModeWebRTCSwift — WebRTC-specific implementation, shows interruption handling and voice selection
Networking: Reaching Jarvis from Outside
The Right Approach: Tailscale VPN On Demand
Tailscale’s VPN On Demand feature (available since Tailscale iOS 1.48, verified January 2026) allows iOS to automatically activate the WireGuard VPN tunnel whenever a DNS query for *.ts.net domains is made. This means:
- Sam opens the voice app
- App makes an HTTP request to `jarvis.tail-xxxx.ts.net`
- iOS VPN On Demand kicks in and activates the Tailscale WireGuard tunnel
- The request reaches `10.0.0.52:8081` on the homelab
- No manual VPN management required
Why tsnet Doesn’t Work on iOS
The tsnet package — Tailscale’s embeddable Go library — allows embedding Tailscale directly into a Go binary so it acts as its own Tailscale node without a separate install. However:
- tsnet is Go-only — it’s a Go package, not a framework that can be linked into a Swift iOS app
- iOS app sandboxing prevents the kind of system-level network access tsnet requires
- This was confirmed as not feasible in GitHub issue tailscale/tailscale#7240
The correct approach: Require the Tailscale iOS app to be installed separately. Use VPN On Demand rules.
MagicDNS Configuration
In your Tailscale admin console, enable MagicDNS. Your homelab server gets a stable DNS name like jarvis.tail-xxxx.ts.net. Configure the iOS VPN On Demand rule to trigger for *.ts.net or *.tail-xxxx.ts.net.
// iOS: How to call Jarvis via its Tailscale MagicDNS name
let jarvisURL = URL(string: "http://jarvis.tail-xxxx.ts.net:8081/v1/chat/completions")!
// The Tailscale VPN On Demand rule activates automatically
// when this DNS name is resolved. No extra code needed.
Bearer Token Management
Store the Jarvis API bearer token in iOS Keychain, not in UserDefaults or app bundle:
import Security
func storeJarvisToken(_ token: String) {
    let base: [String: Any] = [
        kSecClass as String: kSecClassGenericPassword,
        kSecAttrAccount as String: "jarvis-api-token"
    ]
    // Remove any existing token first — SecItemAdd fails on duplicates
    SecItemDelete(base as CFDictionary)

    var attrs = base
    attrs[kSecValueData as String] = Data(token.utf8)
    attrs[kSecAttrAccessible as String] = kSecAttrAccessibleWhenUnlockedThisDeviceOnly
    SecItemAdd(attrs as CFDictionary, nil)
}
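A matching read helper for app launch (same account key as above):

```swift
import Foundation
import Security

// Returns the stored Jarvis token, or nil if none exists
// or the device is locked.
func loadJarvisToken() -> String? {
    let query: [String: Any] = [
        kSecClass as String: kSecClassGenericPassword,
        kSecAttrAccount as String: "jarvis-api-token",
        kSecReturnData as String: true,
        kSecMatchLimit as String: kSecMatchLimitOne
    ]
    var item: CFTypeRef?
    guard SecItemCopyMatching(query as CFDictionary, &item) == errSecSuccess,
          let data = item as? Data else { return nil }
    return String(data: data, encoding: .utf8)
}
```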
Fallback When Homelab Is Unreachable
When Jarvis is down (server off, VPN unreachable, timeout), the tool result must still return something sensible:
enum JarvisError: Error {
case timeout
case serverDown
case authFailed
case unknownError(Int)
}
func errorMessage(for error: Error) -> String {
  switch error as? JarvisError {
  case .timeout:
    return "I wasn't able to reach Jarvis — the request timed out. The homelab may be busy."
  case .serverDown:
    return "Jarvis appears to be offline right now. I can't reach the homelab."
  case .authFailed:
    return "Authentication to Jarvis failed. You may need to update the API token in settings."
  default:
    return "Something went wrong reaching Jarvis: \(error.localizedDescription)"
  }
}
The voice LLM will speak these error messages naturally.
Jarvis Backend API
✅ Correct Finding: term-llm Already Exposes a Full HTTP API
term-llm serve --platform web exposes a production HTTP API that is OpenAI-compatible. No custom HTTP wrapper is required for Jarvis Voice.
Live homelab instance: http://10.0.0.52:8081 (reachable over Tailscale)
Available Endpoints
- `POST /v1/chat/completions` — OpenAI Chat Completions compatible
- `POST /v1/responses` — OpenAI Responses API compatible
- `GET /v1/models`
- `GET /healthz`
- `GET /v1/sessions`
- `GET /v1/sessions/{id}`
Auth + Session Continuity
- Auth: `Authorization: Bearer <token>`
- Pass `session_id: <uuid>` as an HTTP request header to preserve context across turns
- Reuse that same `session_id` for the lifetime of a voice conversation
- If omitted, term-llm auto-creates a session and returns it in the `x-session-id` response header
- Default idle session TTL is 30 minutes (configurable)
iOS Integration Pattern (Tool Handler)
From the ask_jarvis() tool handler, call term-llm directly at /v1/chat/completions (or /v1/responses) and set session_id to the voice-session UUID tracked in VoiceViewModel.
let jarvisSessionId = currentVoiceSession.jarvisSessionId
request.setValue("Bearer \(token)", forHTTPHeaderField: "Authorization")
request.setValue(jarvisSessionId, forHTTPHeaderField: "session_id")
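Putting the headers and body together, a minimal `JarvisClient.ask` sketch — the response types mirror the example payload below; mapping HTTP failures onto `JarvisError` is elided:

```swift
import Foundation

actor JarvisClient {
    let baseURL: URL   // e.g. http://jarvis.tail-xxxx.ts.net:8081
    let token: String  // loaded from Keychain

    init(baseURL: URL, token: String) {
        self.baseURL = baseURL
        self.token = token
    }

    private struct ChatResponse: Decodable {
        struct Choice: Decodable {
            struct Message: Decodable { let content: String }
            let message: Message
        }
        let choices: [Choice]
    }

    func ask(query: String, sessionId: String,
             timeout: TimeInterval = 10) async throws -> String {
        var request = URLRequest(
            url: baseURL.appendingPathComponent("v1/chat/completions"))
        request.httpMethod = "POST"
        request.timeoutInterval = timeout
        request.setValue("Bearer \(token)", forHTTPHeaderField: "Authorization")
        request.setValue("application/json", forHTTPHeaderField: "Content-Type")
        request.setValue(sessionId, forHTTPHeaderField: "session_id")
        request.httpBody = try JSONSerialization.data(withJSONObject: [
            "messages": [["role": "user", "content": query]],
            "stream": false
        ])
        let (data, _) = try await URLSession.shared.data(for: request)
        let reply = try JSONDecoder().decode(ChatResponse.self, from: data)
        return reply.choices.first?.message.content ?? ""
    }
}
```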
Example Request/Response (/v1/chat/completions)
POST http://10.0.0.52:8081/v1/chat/completions
Authorization: Bearer <jarvis-token>
Content-Type: application/json
session_id: my-voice-session-uuid
{
"messages": [
{ "role": "user", "content": "What's on my calendar tomorrow?" }
],
"stream": false
}
{
"id": "chatcmpl_abc123",
"object": "chat.completion",
"model": "jarvis",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "You have a dentist appointment at 10 AM and team standup at 2 PM."
},
"finish_reason": "stop"
}
]
}
"stream": true is also supported (SSE).
/v1/responses as the Newer Alternative
POST /v1/responses is available and follows OpenAI’s newer Responses API model. Either endpoint works for Jarvis Voice.
Capability Inheritance (Why This Is Great)
The Jarvis agent behind this API already has full memory, tools, web search, and orchestration. By routing voice requests into term-llm, the iOS voice app inherits all of those capabilities immediately — no separate mobile-side reimplementation required.
Conversation State Design
Two-Layer Model
There are two distinct conversation contexts that must be managed independently:
Layer 1: Realtime API Context (Voice Layer)
- Managed by OpenAI's `gpt-realtime` model
- Contains: session instructions, tool definitions, voice-layer turns (user speech transcripts, filler phrases, brief spoken confirmations)
- Does NOT contain: Jarvis's full reasoning, long responses (those are summarized/truncated for voice)
- Context limit: 128K tokens (gpt-realtime GA)
- Managed via: `conversation.item.delete` for pruning, `session.truncation` for automatic management
- Reset trigger: session timeout or explicit new session
Layer 2: Jarvis Session (Reasoning Layer)
- Managed by term-llm’s server-side session store on homelab
- Contains: full conversation history, tool results, reasoning context
- NOT subject to OpenAI’s context limits
- Keyed by: `jarvis_session_id` UUID (separate from any OpenAI session ID)
- Reset trigger: user explicitly says "start fresh" or session TTL (30 min default)
// VoiceViewModel holds both IDs
struct SessionState {
let realtimeSessionId: String // from session.created event
let jarvisSessionId: String // UUID sent as session_id header on each /v1/chat/completions call
let startedAt: Date
}
Session Keying
// Generate at each new voice session start
let jarvisSessionId = UUID().uuidString
// Include in every Jarvis HTTP call as a request header
var request = URLRequest(url: URL(string: "\(jarvisBaseURL)/v1/chat/completions")!)
request.setValue(jarvisSessionId, forHTTPHeaderField: "session_id")
Session Reset/Timeout
When to create a new Jarvis session (new jarvisSessionId):
- User explicitly says “start a new conversation”
- Voice session idle > 30 minutes
- User taps a “Reset” button in the UI
When NOT to reset:
- Realtime API session reconnect (network hiccup) — keep the same `jarvisSessionId` to preserve reasoning context across reconnects
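These rules reduce to a small predicate (the 30-minute threshold mirrors term-llm's default idle TTL above; the function name is illustrative):

```swift
import Foundation

// Should a fresh jarvisSessionId be generated before the next call?
// Note: network reconnects deliberately do NOT reset the session.
func shouldResetJarvisSession(lastActivity: Date,
                              userRequestedReset: Bool,
                              now: Date = Date()) -> Bool {
    userRequestedReset || now.timeIntervalSince(lastActivity) > 30 * 60
}
```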
Response Handling for Voice
The Problem: Jarvis Returns Long Text
Jarvis’s responses are optimized for reading (markdown, lists, long explanations). Voice needs short, natural-sounding prose. A 500-word markdown response read aloud verbatim is terrible UX.
The 150-Word Heuristic
Truncate Jarvis responses at ~150 words for voice output. This is roughly 45–60 seconds of speech at natural speaking pace — enough to convey rich information without making Sam’s arm go numb holding his phone.
extension String {
func truncatedForVoice(maxWords: Int = 150) -> String {
let words = self.split(separator: " ")
if words.count <= maxWords { return self }
let truncated = words.prefix(maxWords).joined(separator: " ")
return truncated + "… I have more details if you want them."
}
}
Asking Jarvis to Be Concise
Add a system-level instruction to the Jarvis backend prompt/profile:
When called from the voice interface, keep responses under 100 words.
Use plain prose, not markdown. No bullet points, no headers, no code blocks.
If the answer requires more detail, summarize it and offer to elaborate.
Signal this via an extra request field — a custom extension the term-llm backend must be configured to recognize (standard Chat Completions servers ignore unknown fields):
{
  "messages": [
    { "role": "user", "content": "What's on my calendar tomorrow?" }
  ],
  "context": "voice"
}
The backend uses "context": "voice" to prepend a conciseness instruction to the system prompt.
Stripping Markdown
extension String {
    func strippedMarkdown() -> String {
        var result = self
        // Remove code blocks first, so their contents don't match the later patterns
        result = result.replacingOccurrences(of: #"```[\s\S]*?```"#, with: "[code block]", options: .regularExpression)
        // Remove markdown headers — (?m) makes ^ match at every line start,
        // not just the start of the string
        result = result.replacingOccurrences(of: #"(?m)^#{1,6}\s"#, with: "", options: .regularExpression)
        // Remove bold/italic markers, keeping the inner text
        result = result.replacingOccurrences(of: #"\*{1,3}(.+?)\*{1,3}"#, with: "$1", options: .regularExpression)
        // Remove bullet points at line starts
        result = result.replacingOccurrences(of: #"(?m)^\s*[-*+]\s"#, with: "", options: .regularExpression)
        return result.trimmingCharacters(in: .whitespacesAndNewlines)
    }
}
Apply .strippedMarkdown().truncatedForVoice() before submitting as tool result.
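The cleaned text then goes back to the model as a tool result. A sketch of the event construction, where the shape (a conversation.item.create carrying a function_call_output item, followed by response.create) follows OpenAI's Realtime API; the caller is assumed to have already applied .strippedMarkdown().truncatedForVoice() to `output`:

```swift
/// Builds the Realtime API client event that submits a tool result.
/// `callId` comes from the function_call item the model emitted.
func toolResultEvent(callId: String, output: String) -> [String: Any] {
    [
        "type": "conversation.item.create",
        "item": [
            "type": "function_call_output",
            "call_id": callId,
            "output": output
        ]
    ]
}

// After the item is submitted, ask the model to speak the answer:
let followUp: [String: Any] = ["type": "response.create"]
```

Both dictionaries would be JSON-serialized and sent down the WebRTC data channel.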
State Machine
The Core States
enum VoiceState: Equatable {
case idle // App open, no active listening
case connecting // Establishing WebRTC session
case listening // Mic active, VAD waiting for speech
case userSpeaking // VAD detected speech start
case processing // VAD fired, waiting for tool call / response
case fillerSpeaking // Model speaking filler phrase (concurrent with HTTP call)
case waitingForJarvis // HTTP call in flight, filler done
case aiSpeaking // Model speaking final response
case error(VoiceError) // Something went wrong
}
State Transitions
idle
→ [user taps mic / opens app] → connecting
→ [session.created received] → listening
listening
→ [input_audio_buffer.speech_started] → userSpeaking
userSpeaking
→ [input_audio_buffer.speech_stopped] → processing
→ [user taps interrupt] → listening (send response.cancel)
processing
→ [response.output_audio.delta starts] → fillerSpeaking
→ [response.output_item.done (function_call)] → start HTTP Task
fillerSpeaking
→ [response.done AND HTTP call complete] → aiSpeaking (response.create sent)
→ [response.done AND HTTP still in flight] → waitingForJarvis
waitingForJarvis
→ [HTTP call returns] → aiSpeaking (response.create sent)
→ [HTTP call fails] → aiSpeaking (error message submitted as tool result)
aiSpeaking
→ [response.done] → listening
→ [input_audio_buffer.speech_started] → userSpeaking (model interrupted)
error(*)
→ [retry] → connecting
→ [give up] → idle
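The transition diagram above can be expressed as one pure function. A sketch: `VoiceState` here is a simplified copy of the enum defined earlier (error case without its payload) so the block is self-contained, and the `VoiceEvent` names are illustrative labels for the server events:

```swift
/// Simplified mirror of the VoiceState enum above.
enum VoiceState: Equatable {
    case idle, connecting, listening, userSpeaking, processing
    case fillerSpeaking, waitingForJarvis, aiSpeaking, error
}

/// Illustrative event labels for the Realtime server events in the diagram.
enum VoiceEvent {
    case micTapped, sessionCreated
    case speechStarted, speechStopped
    case fillerAudioStarted
    case responseDone(httpInFlight: Bool)
    case jarvisReturned, jarvisFailed
}

/// Pure transition function mirroring the diagram; unknown
/// (state, event) pairs leave the state unchanged.
func transition(_ state: VoiceState, _ event: VoiceEvent) -> VoiceState {
    switch (state, event) {
    case (.idle, .micTapped):                return .connecting
    case (.connecting, .sessionCreated):     return .listening
    case (.listening, .speechStarted):       return .userSpeaking
    case (.userSpeaking, .speechStopped):    return .processing
    case (.processing, .fillerAudioStarted): return .fillerSpeaking
    case (.fillerSpeaking, .responseDone(let inFlight)):
        return inFlight ? .waitingForJarvis : .aiSpeaking
    case (.waitingForJarvis, .jarvisReturned),
         (.waitingForJarvis, .jarvisFailed): return .aiSpeaking
    case (.aiSpeaking, .responseDone):       return .listening
    case (.aiSpeaking, .speechStarted):      return .userSpeaking  // barge-in
    default:                                 return state
    }
}
```

Centralizing transitions in one function keeps illegal moves (e.g. listening → aiSpeaking) impossible by construction and makes the diagram directly testable.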
Interruption Handling
When Sam starts speaking while the AI is speaking:
- Server sends input_audio_buffer.speech_started
- Server auto-cancels the in-progress response (with server_vad/semantic_vad); response.done arrives with status: "cancelled"
- Client stops playing buffered audio immediately
- Use conversation.item.truncate to sync the server's understanding of what was actually heard
// On speech_started while in .aiSpeaking state:
case .inputAudioBufferSpeechStarted:
if state == .aiSpeaking {
playerNode.stop() // Stop playing immediately
playerNode.reset() // Clear buffer queue
state = .userSpeaking
// Server handles response cancellation automatically with semantic_vad
}
SwiftUI Implementation
@Observable
class VoiceViewModel {
var state: VoiceState = .idle
var audioLevel: Float = 0.0 // Drives orb animation
var transcript: String = "" // Optional display
// Actor-based components
private let audioEngine: AudioEngineActor
private let realtimeSession: RealtimeSessionActor
private let jarvisClient: JarvisClientActor
@MainActor
func startSession() async {
state = .connecting
do {
let ephemeralKey = try await fetchEphemeralKey()
try await realtimeSession.connect(with: ephemeralKey)
try await audioEngine.start()
state = .listening
} catch {
state = .error(.connectionFailed(error))
}
}
@MainActor
func handleServerEvent(_ event: RealtimeServerEvent) {
switch event {
case .speechStarted:
if state == .aiSpeaking { Task { await audioEngine.stopPlayback() } } // barge-in: stop buffered playback
state = .userSpeaking
case .speechStopped:
state = .processing
case .fillerAudioStarted:
state = .fillerSpeaking
case .functionCallReady(let call):
handleToolCall(call)
case .responseAudioStarted:
state = .aiSpeaking
case .responseDone:
state = .listening
default: break
}
}
}
UI/UX
Design Philosophy: Radical Minimalism
This is a personal tool for Sam, not a consumer app. No chrome. No tutorial overlays. Just: voice in, voice out.
The Orb
Full-screen pulsating circle that reflects audio state:
struct VoiceOrbView: View {
@Bindable var vm: VoiceViewModel
var body: some View {
TimelineView(.animation) { _ in
Canvas { ctx, size in
let center = CGPoint(x: size.width/2, y: size.height/2)
let baseRadius = min(size.width, size.height) * 0.25
// Outer glow (breathing animation)
let breathRadius = baseRadius + CGFloat(vm.audioLevel) * 60
// Color shifts by state
let orbColor: Color = switch vm.state {
case .idle: .gray.opacity(0.4)
case .listening: .blue.opacity(0.6)
case .userSpeaking: .green
case .processing, .fillerSpeaking, .waitingForJarvis: .orange
case .aiSpeaking: .purple
case .error: .red
default: .gray
}
// Draw outer glow
ctx.fill(
Path(ellipseIn: CGRect(x: center.x - breathRadius,
y: center.y - breathRadius,
width: breathRadius * 2,
height: breathRadius * 2)),
with: .color(orbColor.opacity(0.3))
)
// Draw core orb
ctx.fill(
Path(ellipseIn: CGRect(x: center.x - baseRadius,
y: center.y - baseRadius,
width: baseRadius * 2,
height: baseRadius * 2)),
with: .color(orbColor)
)
}
}
.background(.black)
.ignoresSafeArea()
}
}
Transcript (Optional)
Show transcript in a ScrollView below the orb. Only the last 3–4 exchanges. Use ScrollViewReader to auto-scroll to latest. Toggleable with a long-press gesture.
No Complex Navigation
The entire app is:
- ContentView → VoiceOrbView + optional TranscriptView
- A settings sheet (gear icon) for: Jarvis URL, token entry, voice selection
- No tabs, no navigation stack
SwiftUI + @Observable Pattern
Use @Observable macro (iOS 17+) for the view model. No ObservableObject, no @Published everywhere. Cleaner and more performant:
@Observable class VoiceViewModel { ... } // iOS 17+
// In view:
@Environment(VoiceViewModel.self) var vm
// or
@State private var vm = VoiceViewModel()
Open Source References
| Repo | Why It’s Relevant | Rating |
|---|---|---|
| m1guelpf/swift-realtime-openai | ⭐ Top pick. Full OpenAI Realtime API client in clean Swift 5.9 async/await. Supports both WebSocket and WebRTC connectors. Session management, conversation history, audio capture + playback. Production-quality code. | ⭐⭐⭐⭐⭐ |
| PallavAg/VoiceModeWebRTCSwift | WebRTC-specific OpenAI Realtime implementation. Shows interruption handling, system message config, voice selection. Good reference for the WebRTC data channel event handling pattern. | ⭐⭐⭐⭐ |
| kasimok/AECAudioStream | Drop-in Swift Package for hardware AEC via VoiceProcessingIO. Use this if the setVoiceProcessingEnabled approach has issues. Core Audio wrapper. | ⭐⭐⭐⭐ |
| twilio/voice-quickstart-ios AudioDeviceExample | Production-grade, battle-tested VoiceProcessingIO + AVAudioEngine manual rendering. ObjC but the most complete AEC reference that exists. Twilio uses this in production for millions of calls. | ⭐⭐⭐⭐⭐ |
| baochuquan/ios-vad | iOS VAD toolkit: WebRTC GMM, Silero DNN, Yamnet DNN models. Useful if you want client-side VAD (fallback or supplement to OpenAI’s server VAD). | ⭐⭐⭐⭐ |
| dmrschmidt/DSWaveformImage | Best waveform rendering library for SwiftUI and UIKit. Real-time waveform from audio buffers. Use for transcript view or orb alternative. | ⭐⭐⭐⭐ |
| lzell/AIProxySwift | Realtime API with ephemeral key pattern — shows how to protect API key via a proxy. Good security pattern reference if you don’t want to run your own backend for the ephemeral key. | ⭐⭐⭐ |
Start with m1guelpf/swift-realtime-openai. Fork it, strip what you don’t need, add the Jarvis tool call handler. This saves 2–3 weeks of audio pipeline work.
Media Playback — A First-Class Use Case
This app should not feel like a voice-only ChatGPT wrapper. One of the highest-leverage interactions is:
“Play me something interesting.”
That single prompt turns Jarvis from an assistant into a companion. It curates. It surprises. It understands context. And critically: playback happens on-device, in high quality, with proper ducking when Jarvis speaks.
A Deliberate Exception to the Thin-Router Rule
The core architecture is still right: ask_jarvis() handles reasoning. But media control is one of the rare places where local tools should be first-class.
- Jarvis decides what to play (taste, context, novelty, mood)
- The iOS client executes playback immediately (no extra server hops)
- Jarvis narrates and coordinates (introductions, transitions, voice controls)
In short: Jarvis curates, iPhone performs.
The Flow
User: "Play something good"
↓
Realtime API → ask_jarvis("recommend something to play — music, podcast, or ambient audio")
↓
Jarvis reasons: time of day, Sam's recent activity, mood cues from conversation, taste history
↓
Returns: {
"type": "podcast",
"title": "Darknet Diaries ep 147",
"url": "https://...",
"reason": "you haven't listened to this one and you're clearly in a technical mood"
}
↓
iOS app executes client-side tool: play_audio(url, title, type)
↓
AVPlayer / AVAudioEngine streams audio on device
↓
Voice LLM: "Playing Darknet Diaries episode 147. You haven't heard this one."
↓
Media plays. Jarvis goes quiet until spoken to.
Client-Side Media Tools
These run entirely on-device. No homelab round-trip required.
| Tool | Action |
|---|---|
| play_audio(url, title, type) | Stream media URL via AVPlayer |
| pause_playback() | Pause current media |
| resume_playback() | Resume paused media |
| stop_playback() | Stop and clear now playing |
| skip_track() | Advance to next queued item |
| get_now_playing() | Return current media metadata to voice LLM |
| set_volume(level) | Set output volume (0.0–1.0) |
| duck_audio(level) | Lower media level while Jarvis speaks |
| enqueue(url, title) | Append item to local queue |
| clear_queue() | Remove all queued items |
If you want this to feel magical, support at least: play_audio, pause_playback, resume_playback, get_now_playing, and duck_audio in V1.
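Dispatching these tools on the client is a switch over the tool name, parsing the JSON arguments the Realtime API delivers with each function call. A sketch covering the V1 subset; `MediaAction` and `parseMediaTool` are illustrative names, and the actual playback side (AVPlayer) is left to the caller:

```swift
import Foundation

/// Typed actions for the client-side media tools in the table above.
enum MediaAction: Equatable {
    case play(url: String, title: String)
    case pause, resume, stop
    case setVolume(Float)
    case unknown(String)
}

/// Parses a function call (name + JSON-encoded arguments, as the
/// Realtime API delivers them) into a MediaAction.
func parseMediaTool(name: String, argumentsJSON: String) -> MediaAction {
    let args = (try? JSONSerialization.jsonObject(
        with: Data(argumentsJSON.utf8))) as? [String: Any] ?? [:]
    switch name {
    case "play_audio":
        return .play(url: args["url"] as? String ?? "",
                     title: args["title"] as? String ?? "")
    case "pause_playback":  return .pause
    case "resume_playback": return .resume
    case "stop_playback":   return .stop
    case "set_volume":
        return .setVolume(Float(args["level"] as? Double ?? 1.0))
    default:
        return .unknown(name)  // anything else routes to ask_jarvis
    }
}
```

The `.unknown` case is the thin-router escape hatch: any tool name the client doesn't own falls through to the normal ask_jarvis path.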
Audio Ducking (Non-Negotiable)
Ducking is what makes voice + playback feel polished instead of chaotic. Jarvis should never shout over music.
Use one shared audio policy:
- Jarvis TTS starts → media ducks to ~60%
- Jarvis TTS ends → fade media back to 100% over ~250–400ms
- Interruption by user speech → optionally duck further or briefly pause for intelligibility
// AVAudioSession setup for duplex voice + media
let session = AVAudioSession.sharedInstance()
try session.setCategory(.playAndRecord,
mode: .voiceChat,
options: [.defaultToSpeaker, .allowBluetooth, .duckOthers])
try session.setActive(true)
.duckOthers also helps when external audio apps are active (Spotify, Podcasts, etc.). For your own internal media player, still apply explicit gain automation so duck timing feels intentional.
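For your own player, the explicit gain automation can be a short linear ramp applied to AVPlayer.volume on a timer. The ramp generator is pure, so the curve is testable off-device; `volumeRamp` is an illustrative helper, and the 15 × 20ms stepping (≈300ms, inside the 250–400ms window above) is one reasonable choice, not a requirement:

```swift
/// Generates a linear gain ramp for the duck/unduck fades described above.
/// The app applies each step to AVPlayer.volume on a short repeating timer.
func volumeRamp(from start: Float, to end: Float, steps: Int) -> [Float] {
    guard steps > 1 else { return [end] }
    return (0..<steps).map { i in
        start + (end - start) * Float(i) / Float(steps - 1)
    }
}

// Duck to 60% when Jarvis starts speaking; restore over ~300ms (15 x 20ms).
let duck = volumeRamp(from: 1.0, to: 0.6, steps: 15)
let restore = volumeRamp(from: 0.6, to: 1.0, steps: 15)
```

An ease-in curve could replace the linear map later; the point is that duck timing is deliberate rather than an abrupt volume cut.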
What Jarvis Can Pick
The real product value is not playback mechanics. It is selection intelligence.
Podcasts
- Any public RSS feed → extract enclosure URL → immediate stream
- Jarvis can avoid repeats by checking play history
- Over time: integrate Pocket Casts / Overcast APIs for personal library awareness
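Extracting the enclosure URL from a feed is the only parsing V1 needs. A deliberately naive sketch: a regex pass is enough for well-formed feeds, though production code should use XMLParser, and the feed snippet in the usage example below is fabricated for illustration:

```swift
import Foundation

/// Pulls the first <enclosure url="..."> out of a podcast RSS feed.
/// Naive regex approach: fine for well-formed feeds, XMLParser is the
/// robust path.
func firstEnclosureURL(in rssXML: String) -> String? {
    let pattern = #"<enclosure[^>]*\burl="([^"]+)""#
    guard let regex = try? NSRegularExpression(pattern: pattern),
          let match = regex.firstMatch(
              in: rssXML,
              range: NSRange(rssXML.startIndex..., in: rssXML)),
          let range = Range(match.range(at: 1), in: rssXML)
    else { return nil }
    return String(rssXML[range])
}

// Example with a fabricated feed fragment:
let feed = """
<rss><channel><item>
<enclosure url="https://example.com/ep147.mp3" length="1" type="audio/mpeg"/>
</item></channel></rss>
"""
let streamURL = firstEnclosureURL(in: feed)
```

The resulting URL goes straight into play_audio — no auth, no API keys.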
Music
- Internet radio (SomaFM, di.fm, Radio Paradise): perfect zero-auth starting point
- Apple Music via MusicKit: native iOS path for full-catalog quality if subscription exists
- Spotify / YouTube Music: high upside, but OAuth/SDK complexity best deferred to phase two
- Bandcamp streams: great for discovery and curation flavor
Ambient / Focus Audio
- Contextual picks: coding, reading, late-night wind-down, deep work sprint
- Sources: myNoise, Coffitivity, A Soft Murmur, public ambience streams
- Optional API integrations later: Endel, Brain.fm
Creative Modes (This Is Where It Becomes Memorable)
Jarvis should not only obey literal commands. It should program experiences.
- “Surprise me” → choose outside normal taste, then justify the leap
- “Match my mood” → infer emotional state from recent dialog
- “Something I haven’t heard” → optimize for novelty with confidence
- “20-minute focus set” → build a timed queue, not a single track
- “Soundtrack this task” → use current coding/work context as selection input
- “Discover mode” → web search emerging tracks/shows in a preferred genre
- “Radio mode” → continuous queue with periodic voice interludes
The DJ Pattern
The strongest version of this feature is Contextual DJ Jarvis:
- Jarvis introduces a pick
- App plays it
- App detects playback end (AVPlayerItemDidPlayToEndTime)
- iOS sends event back to Realtime session
- Jarvis picks and tees up the next item with commentary
Example voice transition:
“That was Floating Points. Next up: something with similar texture but more drive — from a Warp compilation in 2019.”
This loop creates a living, personalized station rather than one-off playback commands.
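The track-end hook is a NotificationCenter observation. A sketch where the notification name is injectable so the loop logic can be exercised off-device; on iOS you would pass AVPlayerItem.didPlayToEndTimeNotification, and `DJLoop` is an illustrative name:

```swift
import Foundation

/// Drives the DJ loop: when a track finishes, fire a callback that sends
/// a synthetic "track ended" event into the Realtime session so Jarvis
/// can tee up the next pick.
final class DJLoop {
    private let center: NotificationCenter
    private var token: NSObjectProtocol?
    private(set) var tracksFinished = 0

    init(center: NotificationCenter = .default,
         trackEnded: Notification.Name,
         onTrackEnd: @escaping () -> Void) {
        self.center = center
        token = center.addObserver(forName: trackEnded, object: nil,
                                   queue: nil) { [weak self] _ in
            self?.tracksFinished += 1
            onTrackEnd()
        }
    }

    deinit {
        if let token { center.removeObserver(token) }
    }
}
```

On device, `onTrackEnd` would submit a conversation item ("the track just ended") and a response.create, prompting Jarvis's commentary-plus-next-pick turn.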
Sources Without Auth (Zero-Config MVP)
For day-one implementation with no OAuth headaches:
- SomaFM — 30+ human-curated stations, stable MP3 streams
- Radio Paradise — curated radio with high-quality streams
- Public podcast RSS — near-universal compatibility
- Archive.org — huge public-domain catalog with direct URLs
This is enough to ship a compelling first version quickly.
Opinionated Build Order
- Ship zero-auth playback first (SomaFM + podcasts + queue + ducking)
- Add taste memory + novelty scoring (avoid repeats, explain picks)
- Implement DJ loop (track end events → next selection)
- Only then add OAuth providers (Spotify/YouTube Music)
If you get step 1 and step 3 right, the app already feels special.
Open Questions / Decisions Needed
Sam needs to decide the following before starting:
1. Voice API Choice (High Priority)
Recommendation is gpt-realtime, but verify: Are you comfortable with the cost (~$0.15/min typical usage)? For a personal tool used 30 min/day, that's about $4.50/day, or roughly $135/month. If that's too high, Grok at $0.05/min is about $45/month.
2. Integrate Existing term-llm HTTP API (Critical Path)
This is the most important integration decision, but not a wrapper-building project. term-llm already exposes the required HTTP API on 10.0.0.52:8081.
- A: Start with POST /v1/chat/completions (fastest path, OpenAI-compatible shape)
- B: Use POST /v1/responses (newer OpenAI API style)
- C: Optionally add a tiny proxy later only for logging/policy/rate-limit concerns (not required for core functionality)
3. Tailscale vs Other Networking
Tailscale VPN On Demand is the cleanest solution, but it requires the Tailscale app to be installed. Alternatives:
- Cloudflare Tunnel: Zero trust, no VPN app required, but more complex to set up
- Nginx reverse proxy with SSL + auth: Expose Jarvis directly over internet with auth, no VPN
- mTLS: Mutual TLS client certificate auth — very secure, harder to set up
Recommendation: Tailscale. You presumably already use it, and VPN On Demand is well-documented.
4. Session Length Strategy
How long should a voice session last before forcing a reset? Options:
- 15 minutes max: Conservative, prevents context runaway, forces natural breaks
- Until user ends it: More natural UX, but requires careful context truncation config
- Per-conversation: Each tap of the mic is a new session (simplest, but loses continuity)
Recommendation: Per-conversation with Jarvis continuity — each Realtime session is fresh, but the jarvis_session_id persists across the app session.
5. Transcript Display: Yes or No?
A persistent transcript is useful for debugging and for accessibility. But it adds UI complexity and storage considerations. Recommendation: implement it as an optional debug overlay, off by default.
6. gpt-realtime vs gpt-4o-mini-realtime
Mini is 8x cheaper for audio tokens but notably weaker at function calling. For a single-tool routing pattern (ask_jarvis always), this might be acceptable. Test mini first and see if tool call reliability is sufficient. If yes, the cost savings are significant (~$17/month vs ~$135/month for 30 min/day usage).
Recommended Stack
| Layer | Technology | Justification |
|---|---|---|
| Voice LLM | OpenAI gpt-realtime (full model) | Best tool calling reliability, semantic VAD, auto-waiting, WebRTC native, ephemeral keys |
| Transport | WebRTC (via WebRTC iOS framework) | Packet loss resilience, built-in jitter buffer, no TCP head-of-line blocking, OpenAI’s own recommendation |
| iOS Audio | AVAudioEngine + .voiceChat mode + setVoiceProcessingEnabled(true) | Hardware AEC, AGC, noise suppression; correct path for duplex voice AI |
| Sample Rate Conversion | AVAudioConverter (48kHz Float32 → 24kHz Int16) | Required — iOS hardware always runs at 48kHz; OpenAI requires 24kHz PCM16 |
| ** |