Enhancement: The ability to configure speech-to-text within the front end. #5253
danielrosehill
started this conversation in
Feature Requests & Suggestions
Replies: 1 comment 1 reply
-
@berry-13 I agree that we should add |
-
What features would you like to see added?
As a heavy user of speech-to-text services, which I find invaluable for capturing prompts, I would very much like to be able to configure speech-to-text within the app itself rather than through the YAML configuration files.
I tried the YAML approach and can't seem to get it to hold my details. Either way, I think it's much more sustainable to make this functionality accessible through the UI.
More details
I've been using speech-to-text tools almost full-time for the past few months, so in case my personal assessment of their capabilities is of any use, I'll offer it here:
In this particular context, I find locally hosted Whisper models less helpful than simply using Whisper via the OpenAI API. Most users aren't deploying their instance on hardware that would really do STT justice (a high-spec GPU, etc.), so they're better served by a cloud API, and in my experience the costs of prompting via Whisper are not that significant. I think offering users all the options is absolutely the right approach, and I'd be happy to draft some documentation on working with each of them if I can get it working on my own instance.
Beyond Whisper, there are other speech-to-text providers accessible via API that would be nice to offer as additional options, since relatively few tools do this to date. I'd point to Amazon, Deepgram, and Speechmatics, among others, which offer high-quality ASR tools that generally far exceed the performance of Google or whatever else the user has available in the browser.
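To put a rough number on the cost point about hosted Whisper above, here is a minimal Python sketch. The per-minute price (USD 0.006) and the usage figures are assumptions for illustration only, not quoted pricing, and `estimate_monthly_cost` is a hypothetical helper rather than anything in the app.

```python
# Assumed price per minute of transcribed audio, in USD.
# This is an illustrative assumption; check the provider's current pricing.
ASSUMED_USD_PER_MINUTE = 0.006


def estimate_monthly_cost(
    prompts_per_day: int,
    seconds_per_prompt: float,
    days: int = 30,
    usd_per_minute: float = ASSUMED_USD_PER_MINUTE,
) -> float:
    """Rough monthly cost of dictating prompts through a per-minute STT API."""
    total_minutes = prompts_per_day * seconds_per_prompt * days / 60
    return round(total_minutes * usd_per_minute, 2)


if __name__ == "__main__":
    # Hypothetical usage: 50 dictated prompts a day, about 30 seconds each.
    print(estimate_monthly_cost(prompts_per_day=50, seconds_per_prompt=30))
```

Under those assumed numbers, even fairly heavy daily dictation comes to a few dollars a month, which is the sense in which I'd call the cost "not that significant". Actual billing rules (rounding, minimum durations) may differ by provider.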
Which components are impacted by your request?
General
Pictures
No response