RealtimeAgent
Event-driven voice agent using the OpenAI Realtime API.
Manages the full lifecycle of a real-time voice session: audio I/O, WebSocket connection, tool calling, optional subagent handoffs, MCP server integration, and inactivity timeouts.
Call prewarm() before run() to establish connections ahead of time and avoid startup delays.
Example
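A minimal session sketch. The import paths and top-level exports (`RealtimeAgent`, `AssistantVoice` from `rtvoice`) are assumptions; adjust to the actual package layout.

```python
import asyncio

# Hypothetical imports; the actual module paths may differ.
from rtvoice import RealtimeAgent, AssistantVoice

async def main() -> None:
    agent = RealtimeAgent(
        instructions="You are a concise, friendly voice assistant.",
        voice=AssistantVoice.MARIN,
    )
    await agent.prewarm()       # optional: connect MCP servers and subagents early
    result = await agent.run()  # blocks until stop() or inactivity timeout
    print(result.turns)

asyncio.run(main())
```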
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `instructions` | `str` | System prompt defining the assistant's personality and behavior. | `''` |
| `model` | `RealtimeModel` | Realtime model variant to use. | `GPT_REALTIME_MINI` |
| `voice` | `AssistantVoice` | TTS voice used for assistant responses. | `MARIN` |
| `speech_speed` | `float` | Playback speed of the assistant's voice. Automatically clamped to `[0.25, 1.5]`. | `1.0` |
| `transcription_model` | `TranscriptionModel \| None` | STT model used to produce user transcript events. | `WHISPER_1` |
| `output_modalities` | `list[OutputModality] \| None` | Assistant response output modalities sent to the Realtime API. | `None` |
| `noise_reduction` | `NoiseReduction` | Microphone noise reduction profile. | `FAR_FIELD` |
| `turn_detection` | `TurnDetection \| None` | Voice activity detection strategy. | `None` |
| `tools` | `Tools \| None` | Pre-registered tool set exposed to the model. Tools receive the shared `context`. | `None` |
| `subagents` | `list[SubAgent] \| None` | Optional sub-agents reachable via auto-registered handoff tools. Prefer attaching MCP servers to subagents rather than the agent. | `None` |
| `mcp_servers` | `list[MCPServer] \| None` | MCP servers connected during `prewarm()` or at session start. | `None` |
| `audio_input` | `AudioInputDevice \| None` | Audio input device. | `None` |
| `audio_output` | `AudioOutputDevice \| None` | Audio output device. | `None` |
| `context` | `T \| None` | Shared context object forwarded to all tool handlers and all subagents. | `None` |
| `event_bus` | `EventBus \| None` | Event bus for session event dispatch. If omitted, a new bus is created automatically. | `None` |
| `listener` | `AgentListener \| None` | Callback interface for session lifecycle events (transcripts, speaking state, errors, …). | `None` |
| `inactivity_timeout_seconds` | `float \| None` | Seconds of user silence before the agent stops automatically. Has no effect unless `inactivity_timeout_enabled` is `True`. | `None` |
| `inactivity_timeout_enabled` | `bool` | Activates the inactivity timeout watchdog. Requires `inactivity_timeout_seconds` to be set. | `False` |
| `recording_path` | `str \| Path \| None` | If provided, the full session audio is recorded to this path. | `None` |
| `provider` | `RealtimeProvider \| None` | Realtime API provider. | `None` |
| `api_key` | `str \| None` | OpenAI API key. | `None` |
prewarm
async
Prewarm MCP and subagent connections before run().
Calling this explicitly avoids a cold-start delay when the session begins. Safe to call multiple times — subsequent calls are no-ops for MCP servers that are already connected.
Returns:

| Type | Description |
|---|---|
| `Self` | Returns `self`. |
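Because prewarm() returns `Self` and repeat calls are no-ops, the construct-connect step can be written as a single chained expression. A standalone sketch of that pattern (not rtvoice's implementation; the class and attribute names here are illustrative):

```python
import asyncio

class PrewarmableAgent:
    """Standalone sketch of an idempotent, chainable prewarm()."""

    def __init__(self) -> None:
        self.connections = 0

    async def prewarm(self) -> "PrewarmableAgent":
        if self.connections == 0:   # already-connected servers are skipped
            self.connections = 1
        return self                 # returning self enables chaining

async def demo() -> int:
    # construct and prewarm in one expression
    agent = await PrewarmableAgent().prewarm()
    await agent.prewarm()           # safe: second call is a no-op
    return agent.connections

print(asyncio.run(demo()))  # prints 1
```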
run
async
run() -> AgentResult
Start the agent and block until the session ends.
Dispatches a StartAgentCommand to kick off audio I/O and the WebSocket
connection, then waits until stop() is called — either manually, via
inactivity timeout, or through an error watchdog.
Returns:

| Type | Description |
|---|---|
| `AgentResult` | Conversation history and recording path after the session ends. |
set_speech_speed
async
Update the assistant's speech speed mid-session.
Clamps the value to [0.25, 1.5] before applying. The change takes
effect on the next response — audio that is already playing is unaffected.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `speed` | `float` | Target playback speed. Automatically clamped to `[0.25, 1.5]`. | *required* |
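The clamping described above is a plain min/max bound. An equivalent standalone sketch (the helper name is illustrative, not part of the API):

```python
def clamp_speed(speed: float, lo: float = 0.25, hi: float = 1.5) -> float:
    """Clamp a requested playback speed into the supported [0.25, 1.5] range."""
    return max(lo, min(hi, speed))

print(clamp_speed(3.0))  # 1.5 — too fast, clamped down
print(clamp_speed(0.1))  # 0.25 — too slow, clamped up
print(clamp_speed(1.0))  # 1.0 — already in range, unchanged
```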
Configuration
Configuration types, enums, and data classes used across rtvoice.
Models
RealtimeModel
Bases: StrEnum
Available OpenAI Realtime API model variants.
Attributes:

| Name | Type | Description |
|---|---|---|
| `GPT_REALTIME` | | Full-sized model with higher capability. |
| `GPT_REALTIME_MINI` | | Smaller, faster, and cheaper variant. Recommended default for most use-cases. |
TranscriptionModel
Bases: StrEnum
STT models used to produce user transcript events.
Attributes:

| Name | Type | Description |
|---|---|---|
| `WHISPER_1` | | OpenAI Whisper v1. Currently the only supported model. |
Note
Pass transcription_model=None to RealtimeAgent to disable
transcription entirely. Note that subagents require
transcription to be enabled.
Voice
AssistantVoice
Bases: StrEnum
TTS voices available for the OpenAI Realtime API.
Attributes:

| Name | Type | Description |
|---|---|---|
| `ALLOY` | | Neutral and balanced; clean output suitable for general use. |
| `ASH` | | Clear and precise; described as a male baritone with a slightly scratchy yet upbeat quality. May have limited performance with accents. |
| `BALLAD` | | Melodic and gentle; community notes suggest a male-sounding voice. |
| `CORAL` | | Warm and friendly; good for approachable or empathetic tones. |
| `ECHO` | | Resonant and deep; strong presence in delivery. |
| `FABLE` | | Narrative-like and expressive; fitting for storytelling contexts. |
| `ONYX` | | Darker, strong, and confident in tone. |
| `NOVA` | | Bright, youthful, and energetic. |
| `SAGE` | | Calm and thoughtful; measured pacing with a reflective quality. |
| `SHIMMER` | | Bright and energetic; dynamic expression with high clarity. |
| `VERSE` | | Versatile and expressive; adapts well across different contexts. |
| `CEDAR` | | Realtime-only voice. No official description available. |
| `MARIN` | | Realtime-only voice. No official description available. |
Audio input
NoiseReduction
Bases: StrEnum
Microphone noise reduction profile applied to audio input.
Attributes:

| Name | Type | Description |
|---|---|---|
| `NEAR_FIELD` | | Optimised for close-range audio, e.g. a headset microphone. |
| `FAR_FIELD` | | Optimised for distant audio sources, e.g. a desktop or room mic. |
Turn detection
Bases: BaseModel
Semantic voice-activity detection strategy.
The model waits until it understands the speaker has finished a thought, producing more natural turn-taking with fewer false cut-offs compared to energy-based detection.
Attributes:

| Name | Type | Description |
|---|---|---|
| `eagerness` | `SemanticEagerness` | How aggressively the model cuts off the user. Defaults to `AUTO`. |
eagerness
class-attribute
instance-attribute
eagerness: SemanticEagerness = AUTO
How quickly the model decides the user has stopped speaking.
SemanticEagerness
Bases: StrEnum
Controls how quickly semantic VAD decides the user has finished speaking.
Higher eagerness means the model cuts off sooner; lower eagerness waits longer to ensure the user has truly finished their thought.
Attributes:

| Name | Type | Description |
|---|---|---|
| `LOW` | | Waits longest before committing to end-of-turn. |
| `MEDIUM` | | Balanced cut-off timing. |
| `HIGH` | | Cuts off quickly; may interrupt longer pauses mid-thought. |
| `AUTO` | | Let the model decide based on context. Recommended default. |
Bases: BaseModel
Energy- and silence-based voice-activity detection strategy.
Triggers end-of-turn based on audio energy thresholds and silence duration rather than semantic understanding. Useful when latency is critical or semantic VAD produces undesirable behaviour.
Attributes:

| Name | Type | Description |
|---|---|---|
| `threshold` | `float` | Energy threshold in the range `[0.0, 1.0]`. |
| `prefix_padding_ms` | `int` | Milliseconds of audio to include before the detected speech onset. |
| `silence_duration_ms` | `int` | Milliseconds of silence required to commit an end-of-turn. |
prefix_padding_ms
class-attribute
instance-attribute
Milliseconds of audio prepended before the detected speech onset.
silence_duration_ms
class-attribute
instance-attribute
Milliseconds of silence required to commit an end-of-turn.
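To make the mechanism concrete, here is a toy end-of-turn detector over per-frame energy values. This is illustrative only — the real detection runs server-side in the Realtime API — and the function and frame size are assumptions:

```python
def end_of_turn(energies: list[float], threshold: float = 0.5,
                frame_ms: int = 10, silence_duration_ms: int = 500) -> bool:
    """Return True once enough consecutive sub-threshold frames accumulate."""
    silent_ms = 0
    for energy in energies:
        # a frame above the energy threshold counts as speech and resets the timer
        silent_ms = silent_ms + frame_ms if energy < threshold else 0
        if silent_ms >= silence_duration_ms:
            return True
    return False

speech = [0.8] * 30             # 300 ms of speech
pause = [0.1] * 50              # 500 ms of silence commits the turn
print(end_of_turn(speech + pause))       # True
print(end_of_turn(speech + pause[:40]))  # False: only 400 ms of silence
```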
Results
AgentResult
Bases: BaseModel
Return value of RealtimeAgent.run() after the session ends.
Attributes:

| Name | Type | Description |
|---|---|---|
| `turns` | `list[ConversationTurn]` | Ordered list of conversation turns recorded during the session. |
| `recording_path` | `Path \| None` | Path to the recorded session audio file, or `None` if no recording was made. |