RealtimeAgent
Event-driven voice agent using the OpenAI Realtime API.
Manages the full lifecycle of a real-time voice session: audio I/O, WebSocket connection, tool calling, optional subagent handoffs, MCP server integration, and inactivity timeouts.
Call prewarm() before run() to establish connections ahead of time and avoid startup delays.
Example
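A minimal session sketch. The import paths and top-level exports (`RealtimeAgent`, `AssistantVoice` from `rtvoice`) are assumptions; adjust to the actual package layout.

```python
import asyncio

# Hypothetical imports; the actual module paths may differ.
from rtvoice import RealtimeAgent, AssistantVoice

async def main() -> None:
    agent = RealtimeAgent(
        instructions="You are a concise, friendly voice assistant.",
        voice=AssistantVoice.MARIN,
    )
    await agent.prewarm()       # optional: connect MCP servers and subagents early
    result = await agent.run()  # blocks until stop() or inactivity timeout
    print(result.turns)

asyncio.run(main())
```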
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `instructions` | `str` | System prompt defining the assistant's personality and behavior. | `''` |
| `model` | `RealtimeModel` | Realtime model variant to use. | `GPT_REALTIME_MINI` |
| `voice` | `AssistantVoice` | TTS voice used for assistant responses. | `MARIN` |
| `speech_speed` | `float` | Playback speed of the assistant's voice. Automatically clamped to `[0.25, 1.5]`. | `1.0` |
| `transcription_model` | `TranscriptionModel \| None` | STT model used to produce user transcript events. | `WHISPER_1` |
| `output_modalities` | `list[OutputModality] \| None` | Assistant response output modalities sent to the Realtime API. | `None` |
| `noise_reduction` | `NoiseReduction` | Microphone noise reduction profile. | `FAR_FIELD` |
| `turn_detection` | `TurnDetection \| None` | Voice activity detection strategy. | `None` |
| `tools` | `Tools \| None` | Pre-registered tool set exposed to the model. Tools receive the shared `context`. | `None` |
| `subagents` | `list[SubAgent] \| None` | Optional sub-agents reachable via auto-registered handoff tools. Prefer attaching MCP servers to subagents rather than the agent. | `None` |
| `mcp_servers` | `list[MCPServer] \| None` | MCP servers connected during `prewarm()` or at session start. | `None` |
| `audio_input` | `AudioInputDevice \| None` | Audio input device. | `None` |
| `audio_output` | `AudioOutputDevice \| None` | Audio output device. | `None` |
| `context` | `T \| None` | Shared context object forwarded to all tool handlers and all subagents. | `None` |
| `event_bus` | `EventBus \| None` | Event bus for session event dispatch. If omitted, a new bus is created automatically. | `None` |
| `listener` | `AgentListener \| None` | Callback interface for session lifecycle events (transcripts, speaking state, errors, …). | `None` |
| `inactivity_timeout_seconds` | `float \| None` | Seconds of user silence before the agent stops automatically. Has no effect unless `inactivity_timeout_enabled` is `True`. | `None` |
| `inactivity_timeout_enabled` | `bool` | Activates the inactivity timeout watchdog. Requires `inactivity_timeout_seconds` to be set. | `False` |
| `recording_path` | `str \| Path \| None` | If provided, the full session audio is recorded to this path. | `None` |
| `provider` | `RealtimeProvider \| None` | Realtime API provider. | `None` |
| `api_key` | `str \| None` | OpenAI API key. | `None` |
prewarm
async
Prewarm MCP and subagent connections before run().
Calling this explicitly avoids a cold-start delay when the session begins. Safe to call multiple times — subsequent calls are no-ops for MCP servers that are already connected.
Returns:

| Type | Description |
|---|---|
| `Self` | Returns `self`. |
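Because prewarm() returns `Self` and repeat calls are no-ops, the construct-connect step can be written as a single chained expression. A standalone sketch of that pattern (not rtvoice's implementation; the class and attribute names here are illustrative):

```python
import asyncio

class PrewarmableAgent:
    """Standalone sketch of an idempotent, chainable prewarm()."""

    def __init__(self) -> None:
        self.connections = 0

    async def prewarm(self) -> "PrewarmableAgent":
        if self.connections == 0:   # already-connected servers are skipped
            self.connections = 1
        return self                 # returning self enables chaining

async def demo() -> int:
    # construct and prewarm in one expression
    agent = await PrewarmableAgent().prewarm()
    await agent.prewarm()           # safe: second call is a no-op
    return agent.connections

print(asyncio.run(demo()))  # prints 1
```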
run
async
run() -> AgentResult
Start the agent and block until the session ends.
Dispatches a StartAgentCommand to kick off audio I/O and the WebSocket
connection, then waits until stop() is called — either manually, via
inactivity timeout, or through an error watchdog.
Returns:

| Type | Description |
|---|---|
| `AgentResult` | Conversation history and recording path after the session ends. |
set_speech_speed
async
Update the assistant's speech speed mid-session.
Clamps the value to [0.25, 1.5] before applying. The change takes
effect on the next response — audio that is already playing is unaffected.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `speed` | `float` | Target playback speed. Automatically clamped to `[0.25, 1.5]`. | *required* |
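The clamping described above is a plain min/max bound. An equivalent standalone sketch (the helper name is illustrative, not part of the API):

```python
def clamp_speed(speed: float, lo: float = 0.25, hi: float = 1.5) -> float:
    """Clamp a requested playback speed into the supported [0.25, 1.5] range."""
    return max(lo, min(hi, speed))

print(clamp_speed(3.0))  # 1.5 — too fast, clamped down
print(clamp_speed(0.1))  # 0.25 — too slow, clamped up
print(clamp_speed(1.0))  # 1.0 — already in range, unchanged
```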
Configuration
Configuration types, enums, and data classes used across rtvoice.
Models
RealtimeModel
Bases: StrEnum
Available OpenAI Realtime API model variants.
Attributes:

| Name | Type | Description |
|---|---|---|
| `GPT_REALTIME` | | Full-sized model with higher capability. |
| `GPT_REALTIME_MINI` | | Smaller, faster, and cheaper variant. Recommended default for most use-cases. |
TranscriptionModel
Bases: StrEnum
STT models used to produce user transcript events.
Attributes:

| Name | Type | Description |
|---|---|---|
| `WHISPER_1` | | OpenAI Whisper v1. Currently the only supported model. |
Note
Pass transcription_model=None to RealtimeAgent to disable
transcription entirely. Note that subagents require
transcription to be enabled.
Voice
AssistantVoice
Bases: StrEnum
TTS voices available for the OpenAI Realtime API.
Attributes:

| Name | Type | Description |
|---|---|---|
| `ALLOY` | | Neutral and balanced; clean output suitable for general use. |
| `ASH` | | Clear and precise; described as a male baritone with a slightly scratchy yet upbeat quality. May have limited performance with accents. |
| `BALLAD` | | Melodic and gentle; community notes suggest a male-sounding voice. |
| `CORAL` | | Warm and friendly; good for approachable or empathetic tones. |
| `ECHO` | | Resonant and deep; strong presence in delivery. |
| `FABLE` | | Narrative-like and expressive; fitting for storytelling contexts. |
| `ONYX` | | Darker, strong, and confident in tone. |
| `NOVA` | | Bright, youthful, and energetic. |
| `SAGE` | | Calm and thoughtful; measured pacing with a reflective quality. |
| `SHIMMER` | | Bright and energetic; dynamic expression with high clarity. |
| `VERSE` | | Versatile and expressive; adapts well across different contexts. |
| `CEDAR` | | Realtime-only voice. No official description available. |
| `MARIN` | | Realtime-only voice. No official description available. |
Audio input
NoiseReduction
Bases: StrEnum
Microphone noise reduction profile applied to audio input.
Attributes:

| Name | Type | Description |
|---|---|---|
| `NEAR_FIELD` | | Optimised for close-range audio, e.g. a headset microphone. |
| `FAR_FIELD` | | Optimised for distant audio sources, e.g. a desktop or room mic. |
Turn detection
Bases: BaseModel
Semantic voice-activity detection strategy.
The model waits until it understands the speaker has finished a thought, producing more natural turn-taking with fewer false cut-offs compared to energy-based detection.
Attributes:

| Name | Type | Description |
|---|---|---|
| `eagerness` | `SemanticEagerness` | How aggressively the model cuts off the user. Defaults to `AUTO`. |
eagerness
class-attribute
instance-attribute
eagerness: SemanticEagerness = AUTO
How quickly the model decides the user has stopped speaking.
SemanticEagerness
Bases: StrEnum
Controls how quickly semantic VAD decides the user has finished speaking.
Higher eagerness means the model cuts off sooner; lower eagerness waits longer to ensure the user has truly finished their thought.
Attributes:

| Name | Type | Description |
|---|---|---|
| `LOW` | | Waits longest before committing to end-of-turn. |
| `MEDIUM` | | Balanced cut-off timing. |
| `HIGH` | | Cuts off quickly; may interrupt longer pauses mid-thought. |
| `AUTO` | | Let the model decide based on context. Recommended default. |
Bases: BaseModel
Energy- and silence-based voice-activity detection strategy.
Triggers end-of-turn based on audio energy thresholds and silence duration rather than semantic understanding. Useful when latency is critical or semantic VAD produces undesirable behaviour.
Attributes:

| Name | Type | Description |
|---|---|---|
| `threshold` | `float` | Energy threshold in the range `[0.0, 1.0]`. |
| `prefix_padding_ms` | `int` | Milliseconds of audio to include before the detected speech onset. |
| `silence_duration_ms` | `int` | Milliseconds of silence required to commit an end-of-turn. |
prefix_padding_ms
class-attribute
instance-attribute
Milliseconds of audio prepended before the detected speech onset.
silence_duration_ms
class-attribute
instance-attribute
Milliseconds of silence required to commit an end-of-turn.
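To make the mechanism concrete, here is a toy end-of-turn detector over per-frame energy values. This is illustrative only — the real detection runs server-side in the Realtime API — and the function and frame size are assumptions:

```python
def end_of_turn(energies: list[float], threshold: float = 0.5,
                frame_ms: int = 10, silence_duration_ms: int = 500) -> bool:
    """Return True once enough consecutive sub-threshold frames accumulate."""
    silent_ms = 0
    for energy in energies:
        # a frame above the energy threshold counts as speech and resets the timer
        silent_ms = silent_ms + frame_ms if energy < threshold else 0
        if silent_ms >= silence_duration_ms:
            return True
    return False

speech = [0.8] * 30             # 300 ms of speech
pause = [0.1] * 50              # 500 ms of silence commits the turn
print(end_of_turn(speech + pause))       # True
print(end_of_turn(speech + pause[:40]))  # False: only 400 ms of silence
```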
Results
AgentResult
Bases: BaseModel
Return value of RealtimeAgent.run() after the session ends.
Attributes:

| Name | Type | Description |
|---|---|---|
| `turns` | `list[ConversationTurn]` | Ordered list of conversation turns recorded during the session. |
| `recording_path` | `Path \| None` | Path to the recorded session audio file, or `None` if no recording was made. |