RealtimeAgent

Event-driven voice agent using the OpenAI Realtime API.

Manages the full lifecycle of a real-time voice session: audio I/O, WebSocket connection, tool calling, optional subagent handoffs, MCP server integration, and inactivity timeouts.

Call prewarm() before run() to open MCP and subagent connections ahead of time and avoid cold-start delays when the session begins.

Example
agent = RealtimeAgent(
    instructions="You are Jarvis, a helpful home assistant.",
    voice=AssistantVoice.MARIN,
    inactivity_timeout_seconds=30,
    inactivity_timeout_enabled=True,
)
result = await agent.run()

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `instructions` | `str` | System prompt defining the assistant's personality and behavior. | `''` |
| `model` | `RealtimeModel` | Realtime model variant to use. | `GPT_REALTIME_MINI` |
| `voice` | `AssistantVoice` | TTS voice used for assistant responses. | `MARIN` |
| `speech_speed` | `float` | Playback speed of the assistant's voice. Automatically clamped to [0.5, 1.5]. | `1.0` |
| `transcription_model` | `TranscriptionModel \| None` | STT model used to produce `UserTranscriptCompletedEvent` transcripts. Pass `None` to disable transcription entirely. | `WHISPER_1` |
| `output_modalities` | `list[OutputModality] \| None` | Assistant response output modalities sent to the Realtime API. Defaults to `["audio"]`. Include `"text"` to receive streamed text events. | `None` |
| `noise_reduction` | `NoiseReduction` | Microphone noise reduction profile. Use `FAR_FIELD` for desktop mics. | `FAR_FIELD` |
| `turn_detection` | `TurnDetection \| None` | Voice activity detection strategy. Defaults to `SemanticVAD` when `None`. | `None` |
| `tools` | `Tools \| None` | Pre-registered tool set exposed to the model. Tools receive the shared context and `event_bus` automatically. | `None` |
| `subagents` | `list[SubAgent] \| None` | Optional sub-agents reachable via auto-registered handoff tools. Prefer attaching MCP servers to subagents rather than to the agent. | `None` |
| `mcp_servers` | `list[MCPServer] \| None` | MCP servers connected during `prewarm()`. Their tools are registered and forwarded to the model. | `None` |
| `audio_input` | `AudioInputDevice \| None` | Audio input device. Defaults to `MicrophoneInput`. | `None` |
| `audio_output` | `AudioOutputDevice \| None` | Audio output device. Defaults to `SpeakerOutput`. | `None` |
| `context` | `T \| None` | Shared context object forwarded to all tool handlers and all subagents. | `None` |
| `event_bus` | `EventBus \| None` | Event bus for session event dispatch. If omitted, a new bus is created automatically. | `None` |
| `listener` | `AgentListener \| None` | Callback interface for session lifecycle events (transcripts, speaking state, errors, …). | `None` |
| `inactivity_timeout_seconds` | `float \| None` | Seconds of user silence before the agent stops automatically. Has no effect unless `inactivity_timeout_enabled=True`. | `None` |
| `inactivity_timeout_enabled` | `bool` | Activates the inactivity-timeout watchdog. Requires `inactivity_timeout_seconds` to be set. | `False` |
| `recording_path` | `str \| Path \| None` | If provided, the full session audio is recorded to this path via `AudioRecordingWatchdog`. | `None` |
| `provider` | `RealtimeProvider \| None` | Realtime API provider. Defaults to `OpenAIProvider`. Pass an `AzureOpenAIProvider` instance to use Azure OpenAI. | `None` |
| `api_key` | `str \| None` | OpenAI API key. Shortcut for `OpenAIProvider(api_key=...)`. Deprecated: prefer passing `provider=OpenAIProvider(api_key=...)` explicitly. | `None` |

prewarm async

prewarm() -> Self

Prewarm MCP and subagent connections before run().

Calling this explicitly avoids a cold-start delay when the session begins. Safe to call multiple times — subsequent calls are no-ops for MCP servers that are already connected.

Returns:

| Type | Description |
| --- | --- |
| `Self` | `self`, for optional chaining with `run()`. |
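Because prewarm() returns self, the two calls can be chained. A minimal sketch of that contract, using a stand-in class rather than the real RealtimeAgent:

```python
import asyncio


class StubAgent:
    """Stand-in illustrating the prewarm()/run() contract; not the real RealtimeAgent."""

    def __init__(self) -> None:
        self.prewarmed = False

    async def prewarm(self) -> "StubAgent":
        # Idempotent: a second call is a no-op for already-connected servers.
        self.prewarmed = True
        return self  # returning self is what enables chaining with run()

    async def run(self) -> str:
        return "agent-result"


async def main() -> str:
    agent = StubAgent()
    # Chained form: prewarm first, then run on the same instance.
    return await (await agent.prewarm()).run()
```

With the real agent the same shape applies: `result = await (await agent.prewarm()).run()`.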

run async

run() -> AgentResult

Start the agent and block until the session ends.

Dispatches a StartAgentCommand to kick off audio I/O and the WebSocket connection, then waits until stop() is called — either manually, via inactivity timeout, or through an error watchdog.

Returns:

| Type | Description |
| --- | --- |
| `AgentResult` | Conversation history and recording path after the session ends. |

set_speech_speed async

set_speech_speed(speed: float) -> None

Update the assistant's speech speed mid-session.

Clamps the value to [0.25, 1.5] before applying. The change takes effect on the next response — audio that is already playing is unaffected.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `speed` | `float` | Target playback speed. Automatically clamped to [0.25, 1.5]. | *required* |
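The clamping described above is equivalent to this sketch (clamp_speed is an illustrative helper, not part of the rtvoice API):

```python
def clamp_speed(speed: float) -> float:
    """Pin a requested speed to the supported [0.25, 1.5] range."""
    return max(0.25, min(1.5, speed))


clamp_speed(2.0)  # -> 1.5, too fast: pinned to the upper bound
clamp_speed(0.1)  # -> 0.25, too slow: pinned to the lower bound
clamp_speed(1.2)  # -> 1.2, in range: unchanged
```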

stop async

stop() -> None

Gracefully shut down the agent.

Cleans up all MCP server connections, dispatches AgentStoppedEvent, and signals the run() coroutine to return. Idempotent — safe to call multiple times.


Configuration

Configuration types, enums, and data classes used across rtvoice.

Models

RealtimeModel

Bases: StrEnum

Available OpenAI Realtime API model variants.

Attributes:

| Name | Description |
| --- | --- |
| `GPT_REALTIME` | Full-sized model with higher capability. |
| `GPT_REALTIME_MINI` | Smaller, faster, and cheaper variant. Recommended default for most use cases. |

TranscriptionModel

Bases: StrEnum

STT models used to produce user transcript events.

Attributes:

| Name | Description |
| --- | --- |
| `WHISPER_1` | OpenAI Whisper v1. Currently the only supported model. |

Note

Pass transcription_model=None to RealtimeAgent to disable transcription entirely. Note that subagents require transcription to be enabled.
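For instance, an audio-only session with transcription switched off (and therefore without subagents) might be configured like this sketch; the instructions string is a placeholder:

```python
agent = RealtimeAgent(
    instructions="You are a concise voice assistant.",
    transcription_model=None,  # no UserTranscriptCompletedEvent transcripts
)
```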

Voice

AssistantVoice

Bases: StrEnum

TTS voices available for the OpenAI Realtime API.

Attributes:

| Name | Description |
| --- | --- |
| `ALLOY` | Neutral and balanced; clean output suitable for general use. |
| `ASH` | Clear and precise; described as a male baritone with a slightly scratchy yet upbeat quality. May have limited performance with accents. |
| `BALLAD` | Melodic and gentle; community notes suggest a male-sounding voice. |
| `CORAL` | Warm and friendly; good for approachable or empathetic tones. |
| `ECHO` | Resonant and deep; strong presence in delivery. |
| `FABLE` | Narrative-like and expressive; fitting for storytelling contexts. |
| `ONYX` | Darker, strong, and confident in tone. |
| `NOVA` | Bright, youthful, and energetic. |
| `SAGE` | Calm and thoughtful; measured pacing with a reflective quality. |
| `SHIMMER` | Bright and energetic; dynamic expression with high clarity. |
| `VERSE` | Versatile and expressive; adapts well across different contexts. |
| `CEDAR` | Realtime-only voice. No official description available. |
| `MARIN` | Realtime-only voice. No official description available. |

Example
agent = RealtimeAgent(
    voice=AssistantVoice.CORAL,
)

Audio input

NoiseReduction

Bases: StrEnum

Microphone noise reduction profile applied to audio input.

Attributes:

| Name | Description |
| --- | --- |
| `NEAR_FIELD` | Optimised for close-range audio, e.g. a headset microphone. |
| `FAR_FIELD` | Optimised for distant audio sources, e.g. a desktop or room mic. |

Example
agent = RealtimeAgent(
    noise_reduction=NoiseReduction.NEAR_FIELD,
)

Turn detection

SemanticVAD

Bases: BaseModel

Semantic voice-activity detection strategy.

The model waits until it understands that the speaker has finished a thought, producing more natural turn-taking with fewer false cut-offs than energy-based detection.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `eagerness` | `SemanticEagerness` | How aggressively the model cuts off the user. Defaults to `SemanticEagerness.AUTO`. |

Example
agent = RealtimeAgent(
    turn_detection=SemanticVAD(eagerness=SemanticEagerness.LOW),
)

eagerness class-attribute instance-attribute

eagerness: SemanticEagerness = AUTO

How quickly the model decides the user has stopped speaking.

SemanticEagerness

Bases: StrEnum

Controls how quickly semantic VAD decides the user has finished speaking.

Higher eagerness means the model cuts off sooner; lower eagerness waits longer to ensure the user has truly finished their thought.

Attributes:

| Name | Description |
| --- | --- |
| `LOW` | Waits longest before committing to end-of-turn. |
| `MEDIUM` | Balanced cut-off timing. |
| `HIGH` | Cuts off quickly; may interrupt longer pauses mid-thought. |
| `AUTO` | Lets the model decide based on context. Recommended default. |

ServerVAD

Bases: BaseModel

Energy- and silence-based voice-activity detection strategy.

Triggers end-of-turn based on audio energy thresholds and silence duration rather than semantic understanding. Useful when latency is critical or semantic VAD produces undesirable behaviour.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `threshold` | `float` | Energy threshold in the range [0, 1] above which audio is considered speech. Defaults to 0.5. |
| `prefix_padding_ms` | `int` | Milliseconds of audio to include before the detected speech onset. Defaults to 300. |
| `silence_duration_ms` | `int` | Milliseconds of silence required to commit an end-of-turn. Defaults to 500. |

Example
agent = RealtimeAgent(
    turn_detection=ServerVAD(silence_duration_ms=800),
)

prefix_padding_ms class-attribute instance-attribute

prefix_padding_ms: int = 300

Milliseconds of audio prepended before the detected speech onset.

silence_duration_ms class-attribute instance-attribute

silence_duration_ms: int = 500

Milliseconds of silence required to commit an end-of-turn.

threshold class-attribute instance-attribute

threshold: float = 0.5

Energy threshold above which audio is considered speech.
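To build intuition for how threshold and silence_duration_ms interact, here is a toy, frame-based sketch of the idea. This is an illustration only; the actual detection runs server-side and is not this algorithm:

```python
def end_of_turn(energies, threshold=0.5, silence_duration_ms=500, frame_ms=100):
    """Toy end-of-turn detector over per-frame energies in [0, 1].

    Commits end-of-turn once `silence_duration_ms` worth of consecutive
    frames fall below `threshold`.
    """
    needed = silence_duration_ms // frame_ms  # consecutive silent frames required
    silent = 0
    for i, energy in enumerate(energies):
        silent = silent + 1 if energy < threshold else 0
        if silent >= needed:
            return i  # frame index at which end-of-turn is committed
    return None  # speaker never paused long enough


# Speech (0.8) followed by sustained silence (0.1): turn ends at frame 6.
end_of_turn([0.8, 0.8, 0.1, 0.1, 0.1, 0.1, 0.1])  # -> 6
```

Raising silence_duration_ms makes the detector wait through longer pauses before committing, which is why `ServerVAD(silence_duration_ms=800)` in the example above tolerates slower speakers.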

Results

AgentResult

Bases: BaseModel

Return value of RealtimeAgent.run() after the session ends.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `turns` | `list[ConversationTurn]` | Ordered list of conversation turns recorded during the session. |
| `recording_path` | `Path \| None` | Path to the recorded session audio file, or `None` if recording was not enabled. |

Example
result = await agent.run()

for turn in result.turns:
    print(turn)

if result.recording_path:
    print(f"Recording saved to: {result.recording_path}")

recording_path class-attribute instance-attribute

recording_path: Path | None = None

Path to the recorded session audio, or None if recording was disabled.

turns instance-attribute

turns: list[ConversationTurn]

Ordered list of conversation turns recorded during the session.