Building Your First Voice Agent With The gpt-realtime API

You can tell a lot about a platform from the first minute you talk to it. With the gpt-realtime API, that first minute feels like you are chatting with an engineer who also happens to sing in tune. It hears you, answers quickly, and does not make you wait while a daisy chain of services argues over who owns the waveform. That is the shift. Speech is no longer a bolt-on. It is native.

What follows is a practical, clear path to build a voice agent that you would be happy to ship. We will keep it grounded in real constraints, performance, and the craft of product quality. We will also use the OpenAI Agents SDK, since it removes a lot of the plumbing that usually sends weekend projects into orbit.

1. Why Voice, Why Now

The classic stack for voice was a relay race. One model listened, another wrote, a third one spoke. Each hop added latency and shaved nuance off the edges. The gpt-realtime API collapses that pipeline into a single, speech-native model that handles listening and speaking in one continuous loop. Fewer moving parts. More fidelity. Lower latency. Stronger instruction following.

For builders, this changes the surface area. You can build a voice agent that understands laughter, pauses, accents, code words, and alphanumerics. You can keep the audio thread flowing while tools run in the background. You can add an image mid-call and ask, “What do you see?” And you can take the whole thing to production without duct tape.

2. What The gpt-realtime API Gives You

Here is the quick map of capabilities and what they mean for a team that wants to build a voice agent without shipping a ball of yarn.

OpenAI Realtime & Voice Capabilities: Quick Start Map
| Capability | What You Get | Where It Lives | First Move |
| --- | --- | --- | --- |
| Speech-to-speech core | Single model that hears and speaks with natural prosody | gpt-realtime API | Start with a default voice, then tune pace and clarity |
| WebRTC in the browser | Very low latency round trip, mic in and audio out | OpenAI Agents SDK | Use the SDK’s RealtimeSession to attach mic and speaker |
| WebSocket on servers | Stable, low jitter connection for middle tiers | gpt-realtime API | Use WS for server bots or transcription bridges |
| SIP calling | Route the same agent to phone numbers and PBX systems | OpenAI Realtime voice API | Prototype IVR handoffs before you scale |
| Image input | Ground the conversation in what the user sees | gpt-realtime API | Add a screenshot mid-call and ask for details |
| Function calling | Let the model call tools with strong argument accuracy | OpenAI Agents SDK | Define a few high-value tools and add rules for when to call them |
| MCP servers | Plug external capabilities in a clean, discoverable way | gpt-realtime API | Point the session at a remote MCP server to unlock actions |
| Prompt reuse | Versioned prompts you can pin per session | OpenAI Realtime API documentation | Store one prompt for consistency across teams |
| Async tool waits | Keep talking while functions run | gpt-realtime API | Do not freeze the conversation during long calls |
| Data residency and privacy controls | Enterprise-grade guarantees | OpenAI Realtime API documentation | Pick the residency you need and log responsibly |

You will notice one theme. The gpt-realtime API does not ask you to “integrate twelve things to get sound.” It gives you a tight loop, then a clean path to extend it with tools when you are ready.

3. The Minimal Architecture That Scales

[Figure: Minimal WebRTC to cloud flow for a gpt-realtime API voice agent]

You only need two pieces.

  1. A tiny server that mints a short-lived client secret for the browser. Never ship a long-lived API key to the client. Your server calls the platform, receives a time-boxed secret, and hands it to the browser.
  2. A simple web page that uses the OpenAI Agents SDK. The SDK speaks WebRTC to the gpt-realtime API, connects your microphone, and streams audio back to the speakers.

With this split, you get strong security, the snappy feel of WebRTC, and the ability to add tooling on the server side without exposing anything you should not.
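
Here is a minimal sketch of that token-minting server, assuming Node 18+ (for built-in fetch) and Express. The `/v1/realtime/client_secrets` endpoint path and payload shape are assumptions here; verify both against the current Realtime API documentation before relying on them.

```js
// server.js — mints short-lived client secrets for the browser (sketch).
// The REST endpoint and payload below are assumptions; check the Realtime
// API documentation for the exact request shape.
import express from "express";

const app = express();

app.get("/client-secret", async (_req, res) => {
  const r = await fetch("https://api.openai.com/v1/realtime/client_secrets", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`, // never leaves the server
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ session: { type: "realtime", model: "gpt-realtime" } }),
  });
  const secret = await r.json();
  // Shape the response to match what the browser page in section 5 expects.
  res.json({ client_secret: { value: secret.value } });
});

app.listen(3000, () => console.log("Token server on :3000"));
```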

4. A Ten Minute gpt-realtime Tutorial In The Playground

[Figure: Turn taking and latency pipeline with semantic VAD and Opus in a gpt-realtime voice agent]

If you want a no-infrastructure start, the Playground is perfect. Pick the gpt-realtime API model, enable the mic, select a voice like Marin or Cedar, then paste a tight system prompt:

```text
# Role & Objective
You are a helpful voice assistant. Keep answers under 3 sentences.

# Tone
Warm, concise, confident.

# Language
Match the user’s language.

# Variety
Avoid repeating the same sentence.

# Turn Taking
Only speak when you detect the user has finished.
```

Say “Hello, what can you do?” You will hear a reply and see live transcripts. Attach an image and ask a question about it. When the flow feels right, click View code. The Playground will export a snippet that matches your exact settings. That is your seed for a gpt-realtime tutorial or demo repo.

5. Ship It, A Tiny Working Example With The OpenAI Agents SDK

Here is a compact example that readers can drop into a single file to build a voice agent. The page connects your mic, streams to the gpt-realtime API over WebRTC, and plays the response. The logic is intentionally small, so you can see the shape clearly.

```html
<!doctype html>
<html>
<head>
  <meta charset="utf-8" />
  <title>gpt-realtime Voice Agent</title>
  <style>
    body { font-family: system-ui, sans-serif; max-width: 720px; margin: 40px auto; }
    button { padding: 10px 14px; font-size: 16px; }
    #log { white-space: pre-wrap; border: 1px solid #ddd; border-radius: 8px; padding: 12px; margin-top: 12px; }
  </style>
</head>
<body>
  <h1>gpt-realtime Voice Agent</h1>
  <p><button id="connect">Connect mic</button> <span id="status"></span></p>
  <audio id="speaker" autoplay></audio>
  <div id="log" aria-live="polite"></div>

  <script type="module">
    import { RealtimeAgent, RealtimeSession } from "https://cdn.skypack.dev/@openai/agents/realtime";

    const btn = document.getElementById("connect");
    const statusEl = document.getElementById("status");
    const speaker = document.getElementById("speaker");
    const logEl = document.getElementById("log");
    const log = t => { logEl.textContent += t + "\n"; };

    btn.onclick = async () => {
      try {
        btn.disabled = true;
        statusEl.textContent = "Mic…";
        const mic = await navigator.mediaDevices.getUserMedia({ audio: true });

        statusEl.textContent = "Client secret…";
        // In production, fetch a short-lived client secret from your server.
        const r = await fetch("/client-secret");
        const { client_secret } = await r.json();

        const agent = new RealtimeAgent({
          name: "Assistant",
          instructions: "Helpful, concise, confirm understanding before actions."
        });

        const session = new RealtimeSession(agent, {
          model: "gpt-realtime",
          audioElement: speaker,
          config: {
            outputModalities: ["audio", "text"],
            voice: "marin",
            turnDetection: { type: "semantic_vad", createResponse: true }
          }
        });

        await session.connect({ clientSecret: client_secret.value });
        session.addInputAudioStream(mic);
        statusEl.textContent = "Connected, speak now";

        session.on("transcript", evt => {
          log(`${evt.role === "user" ? "You" : "Agent"}: ${evt.text}`);
        });
      } catch (e) {
        statusEl.textContent = "Failed";
        log(e.message || String(e));
        btn.disabled = false;
      }
    };
  </script>
</body>
</html>
```

Swap the voice to Cedar if that better matches your brand. Add outputModalities: ["audio"] if you want a voice-only vibe. Keep this page behind an internal login while you test. Then move the short-lived token endpoint behind your production auth.

This is the smallest possible loop that still feels like magic. It is also the right baseline for a serious gpt-realtime tutorial.

6. Prompting That Works For Voice

[Figure: Security, privacy, and trust considerations for gpt-realtime API voice agents]

Voice agents live or die by the first five words they say. The gpt-realtime API follows short, precise rules. Treat your system prompt like a micro-style guide.

  • Keep instructions in bullets. Two to four lines is plenty.
  • Lock language behavior. Match the user’s language by default, or pin English for a support line.
  • Add a variety rule. It prevents robotic repetition during long sessions.
  • Define a turn taking policy. Use semantic VAD in the session config and remind the agent to wait for the user to finish.
  • Include a short escalation line. If the user asks for a human or seems frustrated, say your handoff phrase, then trigger the tool.

The goal is not poetry. It is consistent behavior that you can repeat across teams. If a sentence is ambiguous, rewrite it. If two rules fight, delete one. Your future self will thank you.
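
Put together, a prompt that follows these bullets might look like the sketch below. The company name, handoff phrase, and tool name are placeholders; swap in your own.

```text
# Role & Objective
You are a support voice agent for Acme. Keep answers under 3 sentences.

# Language
Match the user’s language. Default to English if unsure.

# Variety
Do not repeat the same sentence twice in one session.

# Turn Taking
Wait until the user has clearly finished before speaking.

# Escalation
If the user asks for a human or sounds frustrated, say
"Let me connect you with a colleague," then call transfer_to_human.
```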

7. Tools, Function Calls, And MCP

A great voice agent does not just talk. It checks, fetches, and acts. The OpenAI Agents SDK lets you define functions that the model can call with arguments that usually match what you would have written yourself. Think of them as verbs you are willing to perform on the user’s behalf.

A clean pattern is a one-line preamble before each tool call. For example, “I am checking that now.” Then call the tool. This keeps the conversation transparent. With MCP servers in the loop, you can plug in capabilities without hard wiring every integration. Point the gpt-realtime API session at a remote server, approve the tools, and you are live.

Do not start with twenty tools. Start with two that unlock real value. Eligibility checks. A knowledge lookup. A safe write action behind a confirmation. Expand once the logs show real demand.
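
A minimal sketch of that starting point, assuming the tool() helper and zod-based parameters from the OpenAI Agents SDK. The tool name, backend URL, and eligibility logic are illustrative only:

```js
import { RealtimeAgent, tool } from "@openai/agents/realtime";
import { z } from "zod";

// One read tool. The endpoint below is a placeholder, not a real service.
const checkEligibility = tool({
  name: "check_eligibility",
  description: "Look up whether an account qualifies for the current offer.",
  parameters: z.object({ accountId: z.string() }),
  execute: async ({ accountId }) => {
    const r = await fetch(`https://internal.example.com/eligibility/${accountId}`);
    return await r.json();
  },
});

const agent = new RealtimeAgent({
  name: "Assistant",
  // The one-line preamble pattern, baked into the instructions.
  instructions: "Before any tool call, say: 'I am checking that now.'",
  tools: [checkEligibility],
});
```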

8. Latency, Turn Taking, And Audio Quality

People forgive the occasional hiccup. They do not forgive sluggishness. The gpt-realtime API gives you three high-leverage controls.

  1. WebRTC in the browser. This is the fastest loop for mic in and audio out. Use it when the user is present and wants to talk.
  2. Semantic VAD. Configure turn detection to stop interrupting people. Use a medium eagerness to start, then tune it for your audience. Sales teams often prefer slightly quicker interjections. Support lines prefer patience.
  3. Audio formats. G.711 for phone, PCM for web, Opus if you want better compression at high quality. Log what you ship so you can compare apples to apples.

A small change here does more for perceived intelligence than any total rewrite. You are sculpting the rhythm of the conversation. That matters.
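
In code, the second and third controls live in the session configuration. A sketch that mirrors the page from section 5, assuming the same config field names; the eagerness values and audio format options should be checked against the Realtime API documentation:

```js
const session = new RealtimeSession(agent, {
  model: "gpt-realtime",
  config: {
    voice: "marin",
    // Semantic VAD with tunable eagerness: raise it for quicker interjections,
    // lower it for patient support-line pacing.
    turnDetection: { type: "semantic_vad", eagerness: "medium", createResponse: true },
  },
});
// Control #1 comes free in the browser, where WebRTC is the default transport.
// On a server you would connect over WebSocket and pin pcm16 or g711_ulaw.
```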

9. Cost That You Can Explain To Finance

You can build a great experience without burning a hole in the budget. The phrase to search for is OpenAI Realtime API pricing. Set up a budget cap before you invite the whole company. Then apply a few practical habits.

  • Keep sessions short. Idle minutes are invisible expenses.
  • Cache reusable prompts and context. Cached input costs less than fresh tokens.
  • Ship text only while you iterate. Switch to voice when the copy feels tight.
  • Downshift sample rates for telephony. It reduces cost and fits the medium.
  • Preflight tool calls. Avoid long back-and-forth when a single call would do.

Write your cost policy in the same repo as the code. New teammates will follow the rules you document.
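
To make the habits measurable, here is a tiny estimator using the audio token rates quoted in the FAQ below. Feed it token counts from your own usage dashboard, not guesses.

```js
// Rates per 1M audio tokens as listed on the pricing page at the time of
// writing: $32 input, $64 output, $0.40 cached input. Verify before budgeting.
function estimateAudioCostUSD({ inputTokens, outputTokens, cachedInputTokens = 0 }) {
  return (
    (inputTokens / 1e6) * 32 +
    (outputTokens / 1e6) * 64 +
    (cachedInputTokens / 1e6) * 0.4
  );
}

// Example month: 2M input, 1M output, 5M cached input audio tokens.
console.log(estimateAudioCostUSD({
  inputTokens: 2_000_000,
  outputTokens: 1_000_000,
  cachedInputTokens: 5_000_000,
})); // 2*32 + 1*64 + 5*0.4 = 130 (USD)
```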

10. Security, Privacy, And The Social Contract

You are building an agent that sounds like a person. Treat that as a responsibility.

  • Never expose long-lived API keys in the browser. Use short-lived client secrets only.
  • Make it clear that users are talking to an AI. Do not simulate a real person by name.
  • Log with purpose. Keep transcripts for debugging and training, but strip identifiers when you can.
  • Set a privacy floor. If you store audio, say so. If you do not, say that too.
  • Respect escalation. If a user asks for a human, connect one. Do not argue.

Check the OpenAI Realtime API documentation for the controls your industry needs, including data residency. Keep those settings explicit in code.

11. Your First Production Ready Flow

Here is a proven path to go from hello world to something your team can rely on.

  1. Playground to first demo. Use the gpt-realtime API model, test voices, attach an image, and export code.
  2. Minimal web app. Drop the OpenAI Agents SDK example into a page behind login. Mint short-lived client secrets on a tiny server endpoint.
  3. Prompt hardening. Add the bullets that define tone, language, variety, and escalation. Remove ambiguities.
  4. Two tools only. One read tool for a lookup. One write tool behind confirmation. Add a single preamble line before tool calls.
  5. Latency passes. Tune VAD and audio formats. Measure end-to-end time from user speech end to first audio byte out; a measurement sketch follows this list.
  6. Pricing pass. Review OpenAI Realtime API pricing with your finance partner. Set a budget cap. Add cached input where it helps.
  7. Ship, then expand. Add SIP calling if your users live on phones. Add MCP once you know the tools that matter. Keep a change log users can read.
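
For the latency pass in step 5, here is a sketch of the measurement. The event names are illustrative placeholders, not confirmed SDK events; map them to whatever your Agents SDK version actually emits.

```js
// Measure speech-end to first-audio-out. Both event names below are
// hypothetical; substitute the real events from your SDK version.
let speechEndedAt = 0;

session.on("input_audio_speech_stopped", () => {
  speechEndedAt = performance.now();
});

session.on("output_audio_started", () => {
  if (!speechEndedAt) return;
  const ms = performance.now() - speechEndedAt;
  console.log(`speech-end → first audio: ${ms.toFixed(0)} ms`);
  speechEndedAt = 0;
});
```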

This is a thoughtful way to build a voice agent that respects users and scales with your roadmap.

12. Closing, Build Something People Will Talk To Twice

The first time someone talks to your product, they are judging the personality, not the stack. The gpt-realtime API gives you the technical foundation to sound human, to reason, to act, and to do it fast. The OpenAI Agents SDK lets you move from a prototype to a real system without a hallway of glue code.

So pick a tiny, meaningful problem. Build a voice agent that solves it with grace. Share a quick gpt-realtime tutorial with your team. Link to the OpenAI Realtime API documentation in your repo. Add one tool that makes a real decision easier. Then invite users to try it and tell you where it falls short.

If you want help, start by opening the Playground, selecting the gpt-realtime API model, and pressing Connect. Ten minutes later you will have a conversation worth shipping. Then you can add your own twist, publish your demo, and keep going.

Glossary

  • gpt-realtime API: OpenAI model family and API for speech in, speech out, low-latency interactions.
  • Realtime API: Interface that streams audio, text, and tool calls in real time.
  • WebRTC: Browser standard for bidirectional low-latency media with built-in echo control.
  • SIP: Session Initiation Protocol for placing and managing phone calls.
  • MCP: Model Context Protocol used to expose tools and data sources remotely.
  • Agents SDK: JavaScript SDK that scaffolds agent behaviors and transports.
  • Semantic VAD: Learned endpointing to detect end of user speech for faster replies.
  • Opus: Audio codec optimized for high quality at low bitrates.
  • G.711: Legacy telephony codec using PCM, higher bandwidth than Opus.
  • Turn taking: Control logic for when the agent listens versus speaks.
  • Prompt caching: Caching static prompt segments to cut cost and latency.
  • Tool calling: Model-initiated function calls to retrieve data or take actions.
  • Data channel: WebRTC channel for structured JSON messages alongside audio.

What is the difference between the gpt-realtime API and the standard ChatGPT API?

The gpt-realtime API is built for live, low-latency, multimodal conversations, including speech in and speech out. It runs natively on audio, supports image input, and can connect over WebRTC in the browser or WebSocket on servers. It also adds production features like SIP phone calling and MCP tool access. The standard ChatGPT-style APIs are text-first and do not provide the same real-time audio pipeline. If you want to build a voice agent, the gpt-realtime API is the purpose-built path.

Is the gpt-realtime API free to use?

No, usage is billed. Playground sessions count the same as normal API calls. Some accounts can enable complimentary daily tokens by opting in to data sharing in organization settings; eligibility varies by tier, and the offer can change. Check your dashboard to confirm enrollment before testing.

How much does the gpt-realtime API cost?

Pricing is per token, with different rates for text, audio, and image tokens. For audio, gpt-realtime is listed at about 32 dollars per 1M input audio tokens, 64 dollars per 1M output audio tokens, and 0.40 dollars per 1M cached input tokens. Text and image token rates for gpt-realtime are also published on the pricing page, so review that table to estimate your specific mix.

What is the OpenAI Agents SDK and why is it recommended for voice agents?

The OpenAI Agents SDK is a lightweight toolkit that wires your agent’s instructions, tools, and state to OpenAI models with minimal glue code. For a browser voice agent, it pairs cleanly with the gpt-realtime API over WebRTC, which keeps latency low and makes microphone and audio output straightforward. You get a fast path from a Playground experiment to a working web app without building a custom signaling stack.

What do I need to get started with this tutorial?

You need an OpenAI API account with an API key, a modern browser with a microphone, and access to the Playground to try the model without code. If you export code, plan on a recent Node.js environment and HTTPS for local testing; the Agents SDK handles the WebRTC session and audio plumbing for you. Review the Realtime and Agents SDK docs for exact setup steps and any rate limits.
