Voice Agent Demo
Real-time voice conversation powered by OpenAI's Realtime API. It uses the same system prompt as the CLI agent: speak naturally and get instant responses.
Implementation
How It Works
// API Route: /api/realtime/session/route.ts
export async function GET() {
  const response = await fetch(
    "https://api.openai.com/v1/realtime/sessions",
    {
      method: "POST",
      headers: {
        Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        model: "gpt-4o-realtime-preview-2024-12-17",
        voice: "alloy",
        instructions: systemPrompt, // Same prompt as CLI agent
      }),
    }
  )
  const data = await response.json()
  return Response.json(data)
}

// WebRTC Connection Setup
const pc = new RTCPeerConnection()

// Handle incoming audio stream
pc.ontrack = (e) => {
  audioElement.srcObject = e.streams[0]
}

// Add microphone track
const stream = await navigator.mediaDevices.getUserMedia({ audio: true })
pc.addTrack(stream.getTracks()[0])

// Create data channel for events
const dc = pc.createDataChannel("oai-events")
dc.onmessage = (e) => {
  const event = JSON.parse(e.data)
  // Handle: speech_started, speech_stopped,
  // response.audio.delta, response.done
}

// Connect to OpenAI
const offer = await pc.createOffer()
await pc.setLocalDescription(offer)
const response = await fetch(
  "https://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview",
  {
    method: "POST",
    body: offer.sdp,
    headers: {
      Authorization: `Bearer ${ephemeralKey}`,
      "Content-Type": "application/sdp",
    },
  }
)
await pc.setRemoteDescription({
  type: "answer",
  sdp: await response.text(),
})

// Agent State Machine
type AgentState = "idle" | "connecting" | "listening" | "thinking" | "speaking"

dc.onmessage = (e) => {
  const event = JSON.parse(e.data)
  switch (event.type) {
    case "input_audio_buffer.speech_started":
      setAgentState("listening")
      break
    case "input_audio_buffer.speech_stopped":
      setAgentState("thinking")
      break
    case "response.audio.delta":
      setAgentState("speaking")
      break
    case "response.done":
      setAgentState("listening")
      break
  }
}

Why This Matters
Multimodal Unlocks Velocity
Speed of Thought
Speech runs at roughly 150 words per minute versus 40-60 for typing, making voice input three to four times faster. When working with complex systems, the ability to articulate thoughts verbally while keeping hands on code eliminates context-switching overhead.
Natural Iteration
Conversational refinement feels natural with voice. The back-and-forth rhythm of voice conversation matches how we naturally explore ideas and refine requirements.
Same Powerful Stack
The voice interface connects to the same LLM, tools, and harness as the text-based agent. Structured outputs and tool calling are all accessible through natural speech.
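To make tool calling concrete, the session body from the API route above can carry function-tool definitions, which the Realtime API accepts as a flat `tools` array. This is a sketch; the `read_file` tool and its schema are illustrative assumptions, not part of the demo:

```typescript
// Sketch: session body extended with a function tool.
// The tool name and parameter schema are hypothetical examples.
const sessionConfig = {
  model: "gpt-4o-realtime-preview-2024-12-17",
  voice: "alloy",
  tools: [
    {
      type: "function",
      name: "read_file", // hypothetical tool shared with the CLI agent
      description: "Read a file from the workspace",
      parameters: {
        type: "object",
        properties: { path: { type: "string" } },
        required: ["path"],
      },
    },
  ],
}
```

When the model decides to call a tool, the data channel delivers the arguments as events, so the same handlers that serve the text agent can serve voice.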
Ambient Computing
Voice agents enable hands-free interaction while walking, driving, or when screens aren't practical. The agent becomes truly ambient—available whenever you need to think out loud.
The key insight: Pair a voice agent with the same LLM, tooling, and harness you use for text-based interaction. The modality changes, but the capability compounds. You get speed-of-thought interaction with production-grade agent infrastructure.
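As a closing sketch, the event-to-state mapping in the state machine above can be factored into a pure transition function, which keeps the data-channel handler thin and makes the mapping easy to unit-test. `nextAgentState` is an illustrative name, not part of the demo:

```typescript
type AgentState = "idle" | "connecting" | "listening" | "thinking" | "speaking"

// Pure transition function: the same mapping as the switch statement above.
// Unknown event types leave the current state unchanged.
function nextAgentState(current: AgentState, eventType: string): AgentState {
  switch (eventType) {
    case "input_audio_buffer.speech_started":
      return "listening"
    case "input_audio_buffer.speech_stopped":
      return "thinking"
    case "response.audio.delta":
      return "speaking"
    case "response.done":
      return "listening"
    default:
      return current
  }
}
```

In the handler this becomes `setAgentState(nextAgentState(agentState, event.type))`, separating the state logic from the WebRTC plumbing.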
