---
outline: deep
---

# Automatic Speech Recognition <Badge type="tip" text="^0.5.0" />

Automatic Speech Recognition (ASR), also known as Speech to Text (STT), is the task of transcribing audio into text. It
has various applications, such as voice user interfaces, caption generation, and virtual assistants.

## Task ID

- `automatic-speech-recognition`
- `asr`

## Default Model

- `Xenova/whisper-tiny.en`

## Use Cases

Automatic Speech Recognition is widely used in several domains, including:

- **Caption Generation:** Automatically generates captions for live-streamed or recorded videos, enhancing accessibility
  and aiding in content interpretation for non-native language speakers.
- **Virtual Speech Assistants:** Embedded in devices to recognize voice commands, facilitating tasks like dialing a
  phone number, answering general questions, or scheduling meetings.
- **Multilingual ASR:** Converts audio inputs in multiple languages into transcripts, often with language identification
  for improved performance. Examples include models like Whisper.

## Running an Inference Session

Here's how to perform automatic speech recognition using the pipeline:

```php
use function Codewithkyrian\Transformers\Pipelines\pipeline;

$transcriber = pipeline('automatic-speech-recognition', 'onnx-community/whisper-tiny.en');

$audioUrl = __DIR__ . '/preamble.wav';
$output = $transcriber($audioUrl, maxNewTokens: 256);

echo $output['text'];
```

## Pipeline Input Options

When running the `automatic-speech-recognition` pipeline, you can use the following options:

- ### `inputs` *(string)*

  The audio file to transcribe. It can be a local file path, a file resource, or a URL to an audio file (local or
  remote). It's the first argument, so there's no need to pass it as a named argument.

  ```php
  $output = $transcriber('https://example.com/audio.wav');
  ```

- ### `returnTimestamps` *(bool|string)*

  Determines whether to return timestamps with the transcribed text.
  - If set to `true`, the model will return the start and end timestamps for each chunk of text, with the chunks
    determined by the model itself.
  - If set to `'word'`, the model will return timestamps for individual words. Note that word-level timestamps require
    models exported with `output_attentions=True`.

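  For example, reusing the `$transcriber` and `$audioUrl` from the session above:

  ```php
  // Chunk-level timestamps, with chunk boundaries chosen by the model
  $output = $transcriber($audioUrl, returnTimestamps: true);

  // Word-level timestamps (requires a model exported with output_attentions=True)
  $output = $transcriber($audioUrl, returnTimestamps: 'word');
  ```
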
- ### `chunkLengthSecs` *(int)*

  The length of audio chunks to process in seconds. This is essential for models like Whisper that can only process a
  maximum of 30 seconds at a time. Setting this option will chunk the audio, process each chunk individually, and then
  merge the results into a single output.

- ### `strideLengthSecs` *(int)*

  The length of overlap between consecutive audio chunks in seconds. If not provided, this defaults
  to `chunkLengthSecs / 6`. Overlapping ensures smoother transitions and more accurate transcriptions, especially for
  longer audio segments.

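  For long recordings, the two options are typically set together. A sketch, reusing the `$transcriber` and `$audioUrl` from the session above (the specific values here are illustrative, not recommendations):

  ```php
  // Split the audio into 30-second windows with 5 seconds of overlap
  // between consecutive windows, then merge the partial transcripts.
  $output = $transcriber(
      $audioUrl,
      chunkLengthSecs: 30,
      strideLengthSecs: 5,
  );
  ```
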
- ### `forceFullSequences` *(bool)*

  Whether to force the output to be in full sequences. This is set to `false` by default.

- ### `language` *(string)*

  The source language of the audio. By default, this is `null`, meaning the language will be auto-detected. Specifying
  the language can improve performance if the source language is known.

- ### `task` *(string)*

  The specific task to perform. By default, this is `null`, meaning it will be auto-detected. Possible values
  are `'transcribe'` for transcription and `'translate'` for translating the audio content.

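  For example, to transcribe French audio with a multilingual checkpoint (a sketch: `onnx-community/whisper-tiny` is the multilingual counterpart of the default model, and `audio-fr.wav` is a hypothetical file):

  ```php
  // The default whisper-tiny.en model is English-only, so a multilingual
  // model is needed when specifying a non-English source language.
  $transcriber = pipeline('automatic-speech-recognition', 'onnx-community/whisper-tiny');

  $output = $transcriber('audio-fr.wav', language: 'french', task: 'transcribe');
  ```
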
Please note that using the `streamer` option with this task is not yet supported.

## Pipeline Outputs

The output of the pipeline is an array containing the transcribed text and, optionally, the timestamps. The timestamps
can be provided either at the chunk level or word level, depending on the `returnTimestamps` setting.

- **Default Output (without timestamps):**

  ```php
  [
      "text" => "We, the people of the United States, in order to form a more perfect union, establish justice, ensure domestic tranquility, provide for the common defense, promote the general welfare, and secure the blessings of liberty to ourselves and our posterity, to ordain and establish this constitution for the United States of America."
  ]
  ```

- **Output with Chunk-Level Timestamps:**

  ```php
  [
      "text" => "We, the people of the United States, in order to form a more perfect union...",
      "chunks" => [
          [
              "timestamp" => [0.0, 5.12],
              "text" => "We, the people of the United States, in order to form a more perfect union, establish"
          ],
          [
              "timestamp" => [5.12, 10.4],
              "text" => " justice, ensure domestic tranquility, provide for the common defense, promote the general"
          ],
          [
              "timestamp" => [10.4, 15.2],
              "text" => " welfare, and secure the blessings of liberty to ourselves and our posterity, to ordain"
          ],
          // ...
      ]
  ]
  ```

- **Output with Word-Level Timestamps:**

  ```php
  [
      "text" => "...",
      "chunks" => [
          ["text" => "We,", "timestamp" => [0.6, 0.94]],
          ["text" => "the", "timestamp" => [0.94, 1.3]],
          ["text" => "people", "timestamp" => [1.3, 1.52]],
          ["text" => "of", "timestamp" => [1.52, 1.62]],
          ["text" => "the", "timestamp" => [1.62, 1.82]],
          ["text" => "United", "timestamp" => [1.82, 2.52]],
          ["text" => "States", "timestamp" => [2.52, 2.72]],
          ["text" => "in", "timestamp" => [2.72, 2.88]],
          ["text" => "order", "timestamp" => [2.88, 3.1]],
          // ...
      ]
  ]
  ```
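
The timestamped output shapes above lend themselves to simple caption generation. A sketch that walks the chunk-level output and prints one caption line per chunk:

```php
// Print each chunk as a "[start - end] text" caption line,
// assuming the chunk-level output shape shown above.
foreach ($output['chunks'] as $chunk) {
    [$start, $end] = $chunk['timestamp'];
    printf("[%.2f - %.2f]%s\n", $start, $end, $chunk['text']);
}
```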