
Commit 0a4244b

feat(tools): added more tts providers, added stt and videogen models, fixed search modal keyboard nav (#2094)
* feat(tools): added more tts providers, added stt and videogen models, fixed search modal keyboard nav
* fixed icons
* cleaned up
* added falai
* improvement: icons
* fixed build

Co-authored-by: Emir Karabeg <[email protected]>
1 parent 3be57af commit 0a4244b

File tree: 39 files changed, +6352 −171 lines changed


apps/docs/components/icons.tsx

Lines changed: 43 additions & 1 deletion

```diff
@@ -4085,7 +4085,29 @@ export function CalendlyIcon(props: SVGProps<SVGSVGElement>) {
   )
 }
 
-export function AudioWaveformIcon(props: SVGProps<SVGSVGElement>) {
+export function STTIcon(props: SVGProps<SVGSVGElement>) {
+  return (
+    <svg
+      {...props}
+      xmlns='http://www.w3.org/2000/svg'
+      width='24'
+      height='24'
+      viewBox='0 0 24 24'
+      fill='none'
+      stroke='currentColor'
+      strokeWidth='2'
+      strokeLinecap='round'
+      strokeLinejoin='round'
+    >
+      <path d='m15 16 2.536-7.328a1.02 1.02 1 0 1 1.928 0L22 16' />
+      <path d='M15.697 14h5.606' />
+      <path d='m2 16 4.039-9.69a.5.5 0 0 1 .923 0L11 16' />
+      <path d='M3.304 13h6.392' />
+    </svg>
+  )
+}
+
+export function TTSIcon(props: SVGProps<SVGSVGElement>) {
   return (
     <svg
       {...props}
@@ -4108,3 +4130,23 @@ export function AudioWaveformIcon(props: SVGProps<SVGSVGElement>) {
     </svg>
   )
 }
+
+export function VideoIcon(props: SVGProps<SVGSVGElement>) {
+  return (
+    <svg
+      {...props}
+      xmlns='http://www.w3.org/2000/svg'
+      width='24'
+      height='24'
+      viewBox='0 0 24 24'
+      fill='none'
+      stroke='currentColor'
+      strokeWidth='2'
+      strokeLinecap='round'
+      strokeLinejoin='round'
+    >
+      <path d='m16 13 5.223 3.482a.5.5 0 0 0 .777-.416V7.87a.5.5 0 0 0-.752-.432L16 10.5' />
+      <rect x='2' y='6' width='14' height='12' rx='2' />
+    </svg>
+  )
+}
```
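All three new icon components spread their props onto the root `<svg>` and share the same baseline attribute set (24×24 viewBox, `currentColor` stroke, rounded caps and joins). As an illustration only, not code from the repo, that shared convention can be captured as a plain object; the `commonSvgAttrs` helper below is hypothetical, since the real components inline these attributes directly.

```typescript
// Shared attribute set that STTIcon, TTSIcon, and VideoIcon each place on
// their root <svg> element in the diff above.
interface CommonSvgAttrs {
  xmlns: string
  width: string
  height: string
  viewBox: string
  fill: string
  stroke: string
  strokeWidth: string
  strokeLinecap: 'round'
  strokeLinejoin: 'round'
}

// Hypothetical helper for illustration; not present in the codebase.
function commonSvgAttrs(): CommonSvgAttrs {
  return {
    xmlns: 'http://www.w3.org/2000/svg',
    width: '24',
    height: '24',
    viewBox: '0 0 24 24',
    fill: 'none',
    stroke: 'currentColor',
    strokeWidth: '2',
    strokeLinecap: 'round',
    strokeLinejoin: 'round',
  }
}
```

Keeping the stroke at `currentColor` is what lets the icons inherit their color from the surrounding text.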

apps/docs/components/ui/icon-mapping.ts

Lines changed: 6 additions & 2 deletions

```diff
@@ -8,7 +8,6 @@ import {
   ApolloIcon,
   ArxivIcon,
   AsanaIcon,
-  AudioWaveformIcon,
   BrainIcon,
   BrowserUseIcon,
   CalendlyIcon,
@@ -63,15 +62,18 @@ import {
   SalesforceIcon,
   SerperIcon,
   SlackIcon,
+  STTIcon,
   StagehandIcon,
   StripeIcon,
   SupabaseIcon,
   TavilyIcon,
   TelegramIcon,
   TranslateIcon,
   TrelloIcon,
+  TTSIcon,
   TwilioIcon,
   TypeformIcon,
+  VideoIcon,
   WealthboxIcon,
   WebflowIcon,
   WhatsAppIcon,
@@ -92,16 +94,18 @@ export const blockTypeToIconMap: Record<string, IconComponent> = {
   webflow: WebflowIcon,
   wealthbox: WealthboxIcon,
   vision: EyeIcon,
+  video_generator: VideoIcon,
   typeform: TypeformIcon,
   twilio_voice: TwilioIcon,
   twilio_sms: TwilioIcon,
+  tts: TTSIcon,
   trello: TrelloIcon,
   translate: TranslateIcon,
   thinking: BrainIcon,
   telegram: TelegramIcon,
   tavily: TavilyIcon,
   supabase: SupabaseIcon,
-  stt: AudioWaveformIcon,
+  stt: STTIcon,
   stripe: StripeIcon,
   stagehand_agent: StagehandIcon,
   stagehand: StagehandIcon,
```
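The mapping change means the `stt` block now resolves to the new `STTIcon` instead of the generic `AudioWaveformIcon`, and the new `tts` and `video_generator` block types gain dedicated icons. A minimal standalone sketch of that lookup (using icon names as strings and a hypothetical fallback, rather than the repo's React components):

```typescript
// Illustrative lookup table keyed by block type, mirroring the entries
// added in this commit. Values are icon names for demonstration only.
const blockTypeToIconName: Record<string, string> = {
  video_generator: 'VideoIcon',
  tts: 'TTSIcon',
  stt: 'STTIcon', // previously mapped to AudioWaveformIcon
}

// Hypothetical resolver: unknown block types fall back to a default icon.
function iconFor(blockType: string, fallback = 'DefaultIcon'): string {
  return blockTypeToIconName[blockType] ?? fallback
}
```

Example: `iconFor('stt')` yields `'STTIcon'`, while an unregistered type falls through to the default.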

apps/docs/content/docs/en/tools/meta.json

Lines changed: 2 additions & 0 deletions

```diff
@@ -68,9 +68,11 @@
     "thinking",
     "translate",
     "trello",
+    "tts",
     "twilio_sms",
     "twilio_voice",
     "typeform",
+    "video_generator",
     "vision",
     "wealthbox",
     "webflow",
```
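The pages list in meta.json is kept alphabetical, which is why `"tts"` lands before `"twilio_sms"` and `"video_generator"` between `"typeform"` and `"vision"`. A small check for that ordering invariant (illustrative only, not part of the repo):

```typescript
// Returns true when every entry is >= its predecessor in lexicographic order.
function isAlphabetical(pages: string[]): boolean {
  return pages.every((p, i) => i === 0 || pages[i - 1] <= p)
}
```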

apps/docs/content/docs/en/tools/stt.mdx

Lines changed: 87 additions & 7 deletions

```diff
@@ -11,15 +11,32 @@ import { BlockInfoCard } from "@/components/ui/block-info-card"
 />
 
 {/* MANUAL-CONTENT-START:intro */}
-Transcribe speech to text using state-of-the-art AI models from leading providers. The Sim Speech-to-Text (STT) tools allow you to convert audio and video files into accurate transcripts, supporting multiple languages, timestamps, and optional translation.
+Transcribe speech to text using the latest AI models from leading providers. Sim's Speech-to-Text (STT) tools turn audio and video into accurate, timestamped, and optionally translated transcripts, with support for many languages and advanced features such as diarization and speaker identification.
 
-Supported providers:
+**Supported Providers & Models:**
 
-- **[OpenAI Whisper](https://platform.openai.com/docs/guides/speech-to-text/overview)**: Advanced open-source STT model from OpenAI. Supports models such as `whisper-1` and handles a wide variety of languages and audio formats.
-- **[Deepgram](https://deepgram.com/)**: Real-time and batch STT API with deep learning models like `nova-3`, `nova-2`, and `whisper-large`. Offers features like diarization, intent recognition, and industry-specific tuning.
-- **[ElevenLabs](https://elevenlabs.io/)**: Known for high-quality speech AI, ElevenLabs provides STT models focused on accuracy and natural language understanding for numerous languages and dialects.
+- **[OpenAI Whisper](https://platform.openai.com/docs/guides/speech-to-text/overview)** (OpenAI):
+  Whisper is an open-source deep learning model known for its robustness across languages and audio conditions. It supports models such as `whisper-1` and excels at transcription, translation, and tasks that demand strong generalization; it is widely used in research and as a baseline for comparative evaluation.
 
-Choose the provider and model best suited to your task—whether fast, production-grade transcription (Deepgram), highly accurate multi-language capability (Whisper), or advanced understanding and language coverage (ElevenLabs).
+- **[Deepgram](https://deepgram.com/)** (Deepgram Inc.):
+  Deepgram offers scalable, production-grade speech recognition APIs for developers and enterprises. Its models, including `nova-3`, `nova-2`, and `whisper-large`, provide real-time and batch transcription with multi-language support, automatic punctuation, diarization, and call analytics for use cases ranging from telephony to media production.
+
+- **[ElevenLabs](https://elevenlabs.io/)** (ElevenLabs):
+  Best known for premium voice synthesis, ElevenLabs also provides high-accuracy STT with natural understanding of many languages, dialects, and accents. Its recent STT models are optimized for clarity and speaker distinction, making them suitable for both creative and accessibility scenarios.
+
+- **[AssemblyAI](https://www.assemblyai.com/)** (AssemblyAI Inc.):
+  AssemblyAI provides API-driven, highly accurate speech recognition with features such as auto chaptering, topic detection, summarization, sentiment analysis, and content moderation alongside transcription. Its proprietary models, including `Conformer-2`, power large media, call center, and compliance applications.
+
+- **[Google Cloud Speech-to-Text](https://cloud.google.com/speech-to-text)** (Google Cloud):
+  Google's enterprise-grade Speech-to-Text API supports over 125 languages and variants, with real-time streaming, word-level confidence, speaker diarization, automatic punctuation, custom vocabulary, and domain-specific tuning via models such as `latest_long` and `video`.
+
+- **[AWS Transcribe](https://aws.amazon.com/transcribe/)** (Amazon Web Services):
+  AWS Transcribe delivers robust speech recognition as an API, with support for multiple languages and features such as speaker identification, custom vocabulary, channel identification for call center audio, and medical-specific transcription. It is a natural fit for organizations already on Amazon's cloud.
+
+**How to Choose:**
+Select the provider and model that fits your application: fast, enterprise-ready transcription with extra analytics (Deepgram, AssemblyAI, Google, AWS), versatile open-source transcription (OpenAI Whisper), or advanced speaker and contextual understanding (ElevenLabs). Weigh pricing, language coverage, accuracy, and any special features (such as summarization, chaptering, or sentiment analysis) you need.
+
+For more details on capabilities, pricing, and fine-tuning options, see each provider's official documentation via the links above.
 {/* MANUAL-CONTENT-END */}
 
 
@@ -48,6 +65,8 @@ Transcribe audio to text using OpenAI Whisper
 | `language` | string | No | Language code \(e.g., "en", "es", "fr"\) or "auto" for auto-detection |
 | `timestamps` | string | No | Timestamp granularity: none, sentence, or word |
 | `translateToEnglish` | boolean | No | Translate audio to English |
+| `prompt` | string | No | Optional text to guide the model's style or continue a previous audio segment. Helps with proper nouns and context. |
+| `temperature` | number | No | Sampling temperature between 0 and 1. Higher values make output more random, lower values more focused and deterministic. |
 
 #### Output
 
@@ -57,7 +76,6 @@ Transcribe audio to text using OpenAI Whisper
 | `segments` | array | Timestamped segments |
 | `language` | string | Detected or specified language |
 | `duration` | number | Audio duration in seconds |
-| `confidence` | number | Overall confidence score |
 
 ### `stt_deepgram`
 
@@ -114,6 +132,68 @@ Transcribe audio to text using ElevenLabs
 | `duration` | number | Audio duration in seconds |
 | `confidence` | number | Overall confidence score |
 
+### `stt_assemblyai`
+
+Transcribe audio to text using AssemblyAI with advanced NLP features
+
+#### Input
+
+| Parameter | Type | Required | Description |
+| --------- | ---- | -------- | ----------- |
+| `provider` | string | Yes | STT provider \(assemblyai\) |
+| `apiKey` | string | Yes | AssemblyAI API key |
+| `model` | string | No | AssemblyAI model to use \(default: best\) |
+| `audioFile` | file | No | Audio or video file to transcribe |
+| `audioFileReference` | file | No | Reference to audio/video file from previous blocks |
+| `audioUrl` | string | No | URL to audio or video file |
+| `language` | string | No | Language code \(e.g., "en", "es", "fr"\) or "auto" for auto-detection |
+| `timestamps` | string | No | Timestamp granularity: none, sentence, or word |
+| `diarization` | boolean | No | Enable speaker diarization |
+| `sentiment` | boolean | No | Enable sentiment analysis |
+| `entityDetection` | boolean | No | Enable entity detection |
+| `piiRedaction` | boolean | No | Enable PII redaction |
+| `summarization` | boolean | No | Enable automatic summarization |
+
+#### Output
+
+| Parameter | Type | Description |
+| --------- | ---- | ----------- |
+| `transcript` | string | Full transcribed text |
+| `segments` | array | Timestamped segments with speaker labels |
+| `language` | string | Detected or specified language |
+| `duration` | number | Audio duration in seconds |
+| `confidence` | number | Overall confidence score |
+| `sentiment` | array | Sentiment analysis results |
+| `entities` | array | Detected entities |
+| `summary` | string | Auto-generated summary |
+
+### `stt_gemini`
+
+Transcribe audio to text using Google Gemini with multimodal capabilities
+
+#### Input
+
+| Parameter | Type | Required | Description |
+| --------- | ---- | -------- | ----------- |
+| `provider` | string | Yes | STT provider \(gemini\) |
+| `apiKey` | string | Yes | Google API key |
+| `model` | string | No | Gemini model to use \(default: gemini-2.5-flash\) |
+| `audioFile` | file | No | Audio or video file to transcribe |
+| `audioFileReference` | file | No | Reference to audio/video file from previous blocks |
+| `audioUrl` | string | No | URL to audio or video file |
+| `language` | string | No | Language code \(e.g., "en", "es", "fr"\) or "auto" for auto-detection |
+| `timestamps` | string | No | Timestamp granularity: none, sentence, or word |
+
+#### Output
+
+| Parameter | Type | Description |
+| --------- | ---- | ----------- |
+| `transcript` | string | Full transcribed text |
+| `segments` | array | Timestamped segments |
+| `language` | string | Detected or specified language |
+| `duration` | number | Audio duration in seconds |
+| `confidence` | number | Overall confidence score |
+
 
 
 ## Notes
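The input tables above imply a few constraints on an STT request: an API key is required, one audio source must be supplied, and the new Whisper `temperature` parameter is bounded between 0 and 1. A hedged sketch of that validation, using the field names from the `stt_whisper` table (the audio-source fields and the validation logic itself are assumptions for illustration, not Sim's actual implementation; file inputs are modeled as strings here):

```typescript
// Shape of an stt_whisper input, following the parameter table above.
interface WhisperSttInput {
  provider: 'whisper'
  apiKey: string
  audioFile?: string
  audioFileReference?: string
  audioUrl?: string
  language?: string
  timestamps?: 'none' | 'sentence' | 'word'
  translateToEnglish?: boolean
  prompt?: string
  temperature?: number
}

// Collects human-readable validation errors; an empty array means valid.
function validateWhisperInput(input: WhisperSttInput): string[] {
  const errors: string[] = []
  if (!input.apiKey) errors.push('apiKey is required')
  if (!input.audioFile && !input.audioFileReference && !input.audioUrl) {
    errors.push('provide audioFile, audioFileReference, or audioUrl')
  }
  if (
    input.temperature !== undefined &&
    (input.temperature < 0 || input.temperature > 1)
  ) {
    errors.push('temperature must be between 0 and 1')
  }
  return errors
}
```

For example, a request with `temperature: 2` would be rejected, while `temperature: 0.2` with an `audioUrl` passes.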
