feat: #1614 gpt-realtime migration (Realtime API GA) #1646
examples/realtime/app/server.py
Outdated
# Disable server-side interrupt_response to avoid truncating assistant audio
session_context = await runner.run(
    model_config={
        "initial_model_settings": {
            "turn_detection": {"type": "semantic_vad", "interrupt_response": False}
        }
    }
)
do we need to do this by default? why?
I explored some changes to improve the audio output quality, but they're not related to the gpt-realtime migration, so I've reverted all of them. I'll keep looking into improvements for this example app, but that can be done in a separate pull request.
I was testing a switch to the new voices. This is taken from the examples (examples/realtime/app):
model_settings: RealtimeSessionModelSettings = {
    "model_name": "gpt-realtime",
    "modalities": ["text", "audio"],
    "voice": "marin",
    "speed": 1.0,
    "input_audio_format": "pcm16",
    "output_audio_format": "pcm16",
    "input_audio_transcription": {
        "model": "gpt-4o-mini-transcribe",
    },
    "turn_detection": {"type": "semantic_vad", "threshold": 0.5},
    # "instructions": "…", # optional
    # "prompt": "…", # optional
    # "tool_choice": "auto", # optional
    # "tools": [], # optional
    # "handoffs": [], # optional
    # "tracing": {"enabled": False}, # optional
}
config = RealtimeRunConfig(model_settings=model_settings)
runner = RealtimeRunner(starting_agent=get_starting_agent())
I noticed that the voice changed, but I lost all agent handoffs, tools, etc.
I set the config via RealtimeRunConfig and via RealtimeModelConfig. The same thing happened in both cases.
examples/realtime/app/server.py
Outdated
    base_event["output"] = str(event.output)
elif event.type == "audio":
    base_event["audio"] = base64.b64encode(event.audio.data).decode("utf-8")
    # Coalesce raw PCM and flush on a steady timer for smoother playback.
is this just a quality improvement? would be nice to make it be a separate PR if so
yeah, same as above (I won't repeat this for the rest)
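For readers wondering what "coalesce raw PCM and flush on a steady timer" could look like, here is a minimal sketch. It is not the code from this PR (those changes were reverted); the class name `PcmCoalescer` and its methods are hypothetical. The idea is to buffer small incoming chunks and hand out fixed-size frames, so that whatever drives playback sees evenly sized pieces.

```python
class PcmCoalescer:
    """Buffers small PCM chunks and releases them as fixed-size frames.

    Hypothetical helper for illustration; not part of the Agents SDK.
    """

    def __init__(self, frame_bytes: int) -> None:
        if frame_bytes <= 0:
            raise ValueError("frame_bytes must be positive")
        self.frame_bytes = frame_bytes
        self._buffer = bytearray()

    def add(self, chunk: bytes) -> None:
        # Append an incoming chunk of any size to the pending buffer.
        self._buffer.extend(chunk)

    def pop_frames(self) -> list:
        # Return every complete frame currently buffered, oldest first.
        frames = []
        while len(self._buffer) >= self.frame_bytes:
            frames.append(bytes(self._buffer[: self.frame_bytes]))
            del self._buffer[: self.frame_bytes]
        return frames

    def drain(self) -> bytes:
        # Return whatever is left (a partial frame) and empty the buffer.
        rest = bytes(self._buffer)
        self._buffer.clear()
        return rest
```

A periodic task would call `pop_frames()` on each tick and send the result to the client, then call `drain()` once when the response ends so the tail is not lost.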
a4333dd to f02b096
Hello, any ETA on this one? I could be using it right now. :) Cheers, Thomas
Hi @seratch, do you know if this PR is going to be merged this week? No pressure, I'd just like to know the ETA in this case. Thank you very much! By the way, class OpenAIRealtimeWebSocketModel(RealtimeModel) has "gpt-4o-realtime-preview" by default (and you can't change it). It would be nice to set it to "gpt-realtime".
Not to speak for @seratch, but this probably depends mostly on the review from @rm-openai.
@seratch: FYI, with OpenAI 1.107.0 I get this import error using your branch:
  File "\.venv\Lib\site-packages\agents\realtime\__init__.py", line 84, in <module>
    from .openai_realtime import (
    ...<3 lines>...
    )
  File "\.venv\Lib\site-packages\agents\realtime\openai_realtime.py", line 32, in <module>
    from openai.types.realtime.realtime_audio_config import (
    ...<3 lines>...
    )
ImportError: cannot import name 'Input' from 'openai.types.realtime.realtime_audio_config' (\.venv\Lib\site-packages\openai\types\realtime\realtime_audio_config.py)
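An ImportError like this one can be diagnosed before importing the branch by checking which names a module actually exposes. The sketch below uses only the standard library; the helper name `missing_names` is hypothetical, and the demonstration runs against the stdlib `json` module so it works without any particular `openai` version installed.

```python
import importlib


def missing_names(module_name: str, names: list) -> list:
    """Return the entries of `names` that `module_name` does not expose.

    Hypothetical diagnostic helper; it imports the module, so it raises
    ModuleNotFoundError if the module itself is absent.
    """
    module = importlib.import_module(module_name)
    return [name for name in names if not hasattr(module, name)]


# Demonstration against the standard library.
print(missing_names("json", ["dumps", "NotARealName"]))
```

With the affected `openai` release installed, calling `missing_names("openai.types.realtime.realtime_audio_config", ["Input"])` would be expected to report `Input` as missing, matching the traceback above.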
@KelSolaar Thanks for letting me know! I will resolve the conflicts.
You are very welcome! The new model has also mostly solved the issue I reported here: #1681
@rm-openai @seratch What about changing the OpenAIRealtimeWebSocketModel(RealtimeModel) model from "gpt-4o-realtime-preview" to "gpt-realtime"? It would be nice to have it as the default or, better, to make it possible to select which realtime model to use.
@na-proyectran This pull request already makes that change. Once this is released, the default model will be changed. Right now, we're waiting for the underlying
That's not the only one; openai-python (release 1.107.0) also removed other things, such as: from openai.types.realtime.realtime_tools_config_union import ( from openai.types.realtime.realtime_audio_config import (
Sounds great! Do you have an idea when that will be? Should I think in terms of days, weeks, months? Thanks!
The pull request is essentially functional as is and can be tested; just make sure that you pin your requirements:
Hello, I'm looking for image input, and unless I'm missing something, it is not supported at the moment, right? From:

@classmethod
def convert_user_input_to_conversation_item(
    cls, event: RealtimeModelSendUserInput
) -> OpenAIConversationItem:
    user_input = event.user_input
    if isinstance(user_input, dict):
        return RealtimeConversationItemUserMessage(
            type="message",
            role="user",
            content=[
                Content(
                    type="input_text",
                    text=item.get("text"),
                )
                for item in user_input.get("content", [])
            ],
        )
    else:
        return RealtimeConversationItemUserMessage(
            type="message",
            role="user",
            content=[Content(type="input_text", text=user_input)],
        )

The API should look like this:

{
    "type": "conversation.item.create",
    "previous_item_id": null,
    "item": {
        "type": "message",
        "role": "user",
        "content": [
            {
                "type": "input_image",
                "image_url": "data:image/{format(example: png)};base64,{some_base64_image_bytes}"
            }
        ]
    }
}
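To make the request concrete, here is a sketch of a converter that passes image parts through instead of flattening everything to `input_text`. It works on plain dicts rather than the SDK's typed classes, and the function names `image_data_url` and `to_conversation_item_create` are hypothetical; only the payload shape is taken from the JSON above. It is not the implementation that was later added to the PR.

```python
import base64
from typing import Any, Dict, Union


def image_data_url(image_bytes: bytes, image_format: str = "png") -> str:
    """Encode raw image bytes as a base64 data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:image/{image_format};base64,{encoded}"


def to_conversation_item_create(user_input: Union[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Build a conversation.item.create payload from text or structured input."""
    if isinstance(user_input, str):
        content = [{"type": "input_text", "text": user_input}]
    else:
        content = []
        for part in user_input.get("content", []):
            if part.get("type") == "input_image":
                # Pass the image through as a data URL (or any URL already given).
                content.append({"type": "input_image", "image_url": part.get("image_url")})
            elif part.get("type") == "input_text":
                content.append({"type": "input_text", "text": part.get("text")})
            else:
                raise ValueError(f"unsupported content type: {part.get('type')!r}")
    return {
        "type": "conversation.item.create",
        "previous_item_id": None,
        "item": {"type": "message", "role": "user", "content": content},
    }
```

Raising on unknown part types is a design choice for this sketch; silently dropping them would hide exactly the kind of gap reported here.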
@KelSolaar Thanks for pointing out the gap. Image input should be supported, but it's missing here right now. I will update the code to cover that use case too.
Thanks a ton and sorry for making this PR harder to push through!
I was just pointing out the new openai release.
I mean, it would be nice to sync with the latest openai release.
30bbd8d to 7afde98
    enable-cache: true
- name: Install dependencies
  run: make sync
- name: Install Python 3.9 dependencies
moved to makefile
# Environments
.env
.python-version
.env*
for local python 3.9 tests
this.playbackAudioContext = null;
this.currentAudioSource = null;

this.currentAudioGain = null; // per-chunk gain for smooth fades
adjusted internals of this JS code to more smoothly play the audio chunks (less gain noise)
    this.toggleMute();
});

// Image upload
for image file inputs
 def calculate_audio_length_ms(format: RealtimeAudioFormat | None, audio_bytes: bytes) -> float:
-    if format and format.startswith("g711"):
+    if format and isinstance(format, str) and format.startswith("g711"):
now the format data can be either a str or a dict/class
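A sketch of how a duration helper might tolerate both shapes of `format`: a plain string, or a dict/object carrying a `type` field. The helper names `format_name` and `audio_length_ms` are hypothetical. The sample rates are assumptions stated in the comments (8 kHz, one byte per sample for G.711; 24 kHz, 16-bit mono for PCM), and the type names `audio/pcmu` and `audio/pcma` for the dict/object form are likewise assumed rather than taken from this thread.

```python
from typing import Any, Optional


def format_name(fmt: Any) -> Optional[str]:
    """Extract a format name from a str, a dict with "type", or an object with .type."""
    if fmt is None:
        return None
    if isinstance(fmt, str):
        return fmt
    if isinstance(fmt, dict):
        return fmt.get("type")
    return getattr(fmt, "type", None)


def audio_length_ms(fmt: Any, audio_bytes: bytes) -> float:
    """Estimate the playback duration of `audio_bytes` in milliseconds."""
    name = format_name(fmt) or ""
    if name.startswith("g711") or name in ("audio/pcmu", "audio/pcma"):
        # Assumption: G.711 at 8 kHz, one byte per sample.
        return len(audio_bytes) * 1000 / 8000
    # Assumption: 16-bit mono PCM at 24 kHz, two bytes per sample.
    return len(audio_bytes) * 1000 / (24000 * 2)
```

For example, `audio_length_ms("g711_ulaw", b"\x00" * 8000)` and `audio_length_ms({"type": "audio/pcm"}, b"\x00" * 48000)` both describe one second of audio under these assumptions.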
from ..logger import logger


def to_realtime_audio_format(
TS SDK does the same
    RealtimeModelSendUserInput,
)

# Avoid direct imports of non-exported names by referencing via module
just for mypy warnings
DEFAULT_MODEL_SETTINGS: RealtimeSessionModelSettings = {
    "voice": "ash",
    "modalities": ["text", "audio"],
The initial release of gpt-realtime does not support having both, so I changed this default setting; you can still receive the transcript in addition to audio chunks
can you change the default voice to one of the newer ones, for better quality?
async def _handle_ws_event(self, event: dict[str, Any]):
    await self._emit_event(RealtimeModelRawServerEvent(data=event))
    # The public interface defined on this Agents SDK side (e.g., RealtimeMessageItem)
As mentioned here, this SDK's public interface matched the beta API's data structure, and the GA structures are slightly different, so the data is converted here to bridge the gap.
     await self._emit_event(RealtimeModelItemDeletedEvent(item_id=parsed.item_id))
 elif (
-    parsed.type == "conversation.item.created"
+    parsed.type == "conversation.item.added"
this is necessary to detect the user input item addition
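The hunk above shows the check moving from one event name to another. If a handler needed to tolerate both names during a migration, one option is a small membership test, sketched below. Accepting both names is an assumption made for illustration and not something this hunk does; `ITEM_ADDED_EVENT_TYPES` and `is_item_added_event` are hypothetical names.

```python
# Event names taken from the diff hunk above; treating them as equivalent
# is an assumption made for this sketch.
ITEM_ADDED_EVENT_TYPES = frozenset(
    {"conversation.item.created", "conversation.item.added"}
)


def is_item_added_event(event: dict) -> bool:
    """Return True if the raw server event announces a new conversation item."""
    return event.get("type") in ITEM_ADDED_EVENT_TYPES
```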
Thank you @seratch!
Amazing work, thanks a lot to everyone involved! I've just noticed one small thing - the Twilio example in /examples/realtime/twilio produces only noise right now. Is there a chance to update it as well?
Thanks for the feedback. I'll update the Twilio example sometime soon!
// Smoothly ramp down before stopping to avoid clicks
if (this.currentAudioSource && this.playbackAudioContext) {
    try {
        this.currentAudioSource.stop();
        this.currentAudioSource = null;
        const now = this.playbackAudioContext.currentTime;
        const fade = Math.max(0.01, this.playbackFadeSec);
        if (this.currentAudioGain) {
            try {
                this.currentAudioGain.gain.cancelScheduledValues(now);
                // Capture current value to ramp from it
                const current = this.currentAudioGain.gain.value ?? 1.0;
                this.currentAudioGain.gain.setValueAtTime(current, now);
                this.currentAudioGain.gain.linearRampToValueAtTime(0.0001, now + fade);
            } catch {}
        }
        // Stop after the fade completes
        setTimeout(() => {
            try { this.currentAudioSource && this.currentAudioSource.stop(); } catch {}
            this.currentAudioSource = null;
            this.currentAudioGain = null;
        }, Math.ceil(fade * 1000));
@seratch : Why are we getting those clicks in the first place, is this a scheduling issue?
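The thread does not establish the cause in this app, but one general source of clicks is cutting a waveform off while its amplitude is non-zero, which produces an abrupt step. The sketch below shows the same remedy as the JavaScript gain ramp, applied offline in Python: linearly scale the last few samples of a little-endian PCM16 buffer down to zero. The function name `fade_out_pcm16` is hypothetical.

```python
import array
import sys


def fade_out_pcm16(pcm: bytes, fade_samples: int) -> bytes:
    """Linearly ramp the last `fade_samples` samples of little-endian PCM16 to zero."""
    if len(pcm) % 2 != 0:
        raise ValueError("PCM16 data must contain an even number of bytes")
    samples = array.array("h")  # signed 16-bit integers
    samples.frombytes(pcm)
    if sys.byteorder == "big":
        samples.byteswap()  # input is little-endian; convert to native order
    total = len(samples)
    fade = max(0, min(fade_samples, total))
    for i in range(fade):
        # Gain falls in equal steps and reaches exactly 0.0 on the final sample.
        gain = (fade - 1 - i) / fade
        index = total - fade + i
        samples[index] = int(samples[index] * gain)
    if sys.byteorder == "big":
        samples.byteswap()  # convert back to little-endian for output
    return samples.tobytes()
```

At 24 kHz, a 10 ms fade corresponds to `fade_samples=240`, which is in the same range as the `Math.max(0.01, ...)` floor used in the hunk above.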
this is still in progress but will resolve #1614