Voice Agent Demo
Real-time voice conversation powered by OpenAI's Realtime API. It uses the same system prompt as the CLI agent: speak naturally and get instant responses.
Implementation
How It Works
// API Route: /api/realtime/session/route.ts
export async function GET() {
  const response = await fetch(
    "https://api.openai.com/v1/realtime/sessions",
    {
      method: "POST",
      headers: {
        Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        model: "gpt-4o-realtime-preview-2024-12-17",
        voice: "alloy",
        instructions: systemPrompt, // Same prompt as CLI agent
      }),
    }
  )
  const data = await response.json()
  return Response.json(data)
}

// WebRTC Connection Setup
const pc = new RTCPeerConnection()

// Handle incoming audio stream
pc.ontrack = (e) => {
  audioElement.srcObject = e.streams[0]
}

// Add microphone track
const stream = await navigator.mediaDevices.getUserMedia({ audio: true })
pc.addTrack(stream.getTracks()[0])

// Create data channel for events
const dc = pc.createDataChannel("oai-events")
dc.onmessage = (e) => {
  const event = JSON.parse(e.data)
  // Handle: speech_started, speech_stopped,
  // response.audio.delta, response.done
}

// Connect to OpenAI
const offer = await pc.createOffer()
await pc.setLocalDescription(offer)
const response = await fetch(
  "https://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview",
  {
    method: "POST",
    body: offer.sdp,
    headers: {
      Authorization: `Bearer ${ephemeralKey}`,
      "Content-Type": "application/sdp",
    },
  }
)
await pc.setRemoteDescription({
  type: "answer",
  sdp: await response.text(),
})

// Agent State Machine
type AgentState = "idle" | "connecting" | "listening" | "thinking" | "speaking"

dc.onmessage = (e) => {
  const event = JSON.parse(e.data)
  switch (event.type) {
    case "input_audio_buffer.speech_started":
      setAgentState("listening")
      break
    case "input_audio_buffer.speech_stopped":
      setAgentState("thinking")
      break
    case "response.audio.delta":
      setAgentState("speaking")
      break
    case "response.done":
      setAgentState("listening")
      break
  }
}

Why This Matters
Multimodal Unlocks Velocity
Speed of Thought
Speech runs at roughly 150 words per minute versus 40-60 for typing, making voice input three to four times faster. When working with complex systems, the ability to articulate thoughts verbally while keeping hands on code eliminates context-switching overhead.
Natural Iteration
Conversational refinement feels natural with voice. The back-and-forth rhythm of voice conversation matches how we naturally explore ideas and refine requirements.
Same Powerful Stack
The voice interface connects to the same LLM, tools, and harness as the text-based agent. Structured outputs and tool calling are all accessible through natural speech.
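To make tool calling concrete, the session body from the API route above can carry function-tool definitions, which the Realtime API accepts as a flat `tools` array. This is a sketch; the `read_file` tool and its schema are illustrative assumptions, not part of the demo:

```typescript
// Sketch: session body extended with a function tool.
// The tool name and parameter schema are hypothetical examples.
const sessionConfig = {
  model: "gpt-4o-realtime-preview-2024-12-17",
  voice: "alloy",
  tools: [
    {
      type: "function",
      name: "read_file", // hypothetical tool shared with the CLI agent
      description: "Read a file from the workspace",
      parameters: {
        type: "object",
        properties: { path: { type: "string" } },
        required: ["path"],
      },
    },
  ],
}
```

When the model decides to call a tool, the data channel delivers the arguments as events, so the same handlers that serve the text agent can serve voice.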
Ambient Computing
Voice agents enable hands-free interaction while walking, driving, or when screens aren't practical. The agent becomes truly ambient—available whenever you need to think out loud.
The key insight: Pair a voice agent with the same LLM, tooling, and harness you use for text-based interaction. The modality changes, but the capability compounds. You get speed-of-thought interaction with production-grade agent infrastructure.
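As a closing sketch, the event-to-state mapping in the state machine above can be factored into a pure transition function, which keeps the data-channel handler thin and makes the mapping easy to unit-test. `nextAgentState` is an illustrative name, not part of the demo:

```typescript
type AgentState = "idle" | "connecting" | "listening" | "thinking" | "speaking"

// Pure transition function: the same mapping as the switch statement above.
// Unknown event types leave the current state unchanged.
function nextAgentState(current: AgentState, eventType: string): AgentState {
  switch (eventType) {
    case "input_audio_buffer.speech_started":
      return "listening"
    case "input_audio_buffer.speech_stopped":
      return "thinking"
    case "response.audio.delta":
      return "speaking"
    case "response.done":
      return "listening"
    default:
      return current
  }
}
```

In the handler this becomes `setAgentState(nextAgentState(agentState, event.type))`, separating the state logic from the WebRTC plumbing.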
