[Platform][Agent] Introduce Speech support #943

Guikingone wants to merge 1 commit into symfony:main

Conversation
To me, maybe we should introduce capabilities on platforms as well, rather than having a voice component. As far as I understand, I cannot use the Voice component standalone, right? I don't think a dedicated component is the way to go here.

We can introduce it via the Platform, which could be easier; the voice can be used without agents, but it will require the … Will update the PR to match this approach 👍🏻

I agree, Agent scope is not needed 👍🏻
Title changed to: VoiceProviders and VoiceListeners
Hi @Guikingone, I agree that we lack some kind of guidance on how voices work, but the same goes for other binary stuff like creating images or videos. So, two things I would like to understand. Btw, "speech" is more common than "voice", isn't it?
The main goal is to give an agent/platform the capacity to "listen" and answer to voice/speech inputs ("voice" is used as sugar here and could be renamed to "speech"), creating a workflow where you can submit voice, call the platform that transforms it to speech or text (depending on the situation you're in), and return the result to the user without friction.
It is now part of the Platform.

Agreed, could be renamed to "speech".
Yes, the goal is to ease it with a "built-in" approach / API that stays transparent for the user.
Title changed: VoiceProviders and VoiceListeners → Speech support via Platform
Just realized we should rename the "audio" demo to "speech" as well, and I'm definitely not really happy with that solution there. Can we make it as easy as the structured output, like with a listener? I like that starting point:

```php
$result = $platform->invoke('eleven_multilingual_v2', new Text('Hello world'), [
    'voice' => 'Dslrhjl3ZpzrctukrQSN', // Brad (https://elevenlabs.io/app/voice-library?voiceId=Dslrhjl3ZpzrctukrQSN)
]);

echo $result->asVoice();
```

What would be the return type here? Would it be the same as …?
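To make the open question about the return type concrete, here is one possible shape, sketched as a small value object. This is a hypothetical illustration only: neither `SpeechResult` nor any of its methods exist in the actual codebase, and it assumes the provider returns raw audio bytes.

```php
<?php

// Hypothetical sketch, not part of the symfony/ai API: one possible answer
// to "what would be the return type of asVoice()?" is a value object that
// wraps the raw binary audio returned by the provider.
final class SpeechResult
{
    public function __construct(
        private readonly string $audioContent, // raw binary audio (e.g. MP3 bytes)
        private readonly string $mimeType = 'audio/mpeg',
    ) {
    }

    /** Returns the raw binary audio content, ready to be echoed or streamed. */
    public function asVoice(): string
    {
        return $this->audioContent;
    }

    public function getMimeType(): string
    {
        return $this->mimeType;
    }

    /** Convenience helper to persist the audio to disk. */
    public function saveTo(string $path): void
    {
        file_put_contents($path, $this->audioContent);
    }
}
```

Under that assumption, `echo $result->asVoice();` would emit the binary audio, mirroring how a binary result for generated images could behave.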
Could be something to explore; the API is not locked for now.

My first approach was to do the same thing as …
Title changed: Speech support via Platform → Speech support
Well, it might seem weird but here we go, …
@chr-hertel will have a look soon; not sure it will land in 0.3, let's keep it for now.
Hi @OskarStark @chr-hertel, yes, I know, again 😄 I think that this time, that's the one. While thinking about #1572 and the comment from Chris, I thought about this PR, and the listener approach didn't look like "THE" solution, especially since we have the processors. So I asked Claude (yes, sometimes asking for an external opinion might lead to a solution) for a reworked implementation that could ease the user experience and the maintenance of it; it submitted a solution close to the processors and I did the final tweaking.

So, what changed? The speech configuration is now moved where it needs to be, at the agent level. I updated the examples and reworked the documentation; it makes more sense IMHO to be like that. I'll let you take a look at it and review it if you think it deserves to be reviewed.

Is #1572 needed anymore? Tough question: if this PR is merged, probably not. At least, I don't see a use case (for now) except for the validation/evaluation part that could require it (as speech is now at the agent level). Probably another topic for another day 😄
This PR introduces:

- TTS, STT and STS for agents
- TTS or STT can be used independently
- A SpeechConfiguration object handles the speech configuration
- A SpeechProcessor handles the input/output

Example for an OpenAI-based STS agent: …
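The points above could be wired together roughly like this. Only the `SpeechConfiguration` and `SpeechProcessor` names come from this PR; every namespace, constructor signature, model name, and option below is an assumption for illustration, not the final API.

```php
<?php

// Hypothetical wiring sketch: only SpeechConfiguration and SpeechProcessor
// are named in this PR; namespaces, signatures, and option names below are
// assumptions for illustration purposes.

// 1. The SpeechConfiguration object handles the speech configuration
//    (which models to use for STT and TTS, the voice id, ...).
$configuration = new SpeechConfiguration(
    transcriptionModel: 'whisper-1', // STT (speech-to-text), assumed model name
    synthesisModel: 'tts-1',         // TTS (text-to-speech), assumed model name
    voice: 'alloy',
);

// 2. The SpeechProcessor handles the input/output: it transcribes incoming
//    audio before the model call and synthesizes the textual answer back
//    to audio afterwards.
$processor = new SpeechProcessor($platform, $configuration);

// 3. An OpenAI-based STS agent: audio in, audio out, with the processor
//    registered on both the input and the output side.
$agent = new Agent(
    $platform,
    'gpt-4o-mini',
    inputProcessors: [$processor],
    outputProcessors: [$processor],
);
```

Registering the same processor on both sides is what turns a TTS + STT pair into an STS flow; dropping one of the two registrations would give TTS or STT independently.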