# XTTS Streaming

This repository packages [TTS](https://github.com/coqui-ai/TTS) as a [Truss](https://truss.baseten.co/) with streaming output.

TTS is a generative audio model for text-to-speech. It takes text and a speaker's voice as input and renders the text as speech in that speaker's voice.

## Deploying XTTS

First, clone this repository and change into the example directory:

```sh
git clone https://github.com/basetenlabs/truss-examples/
cd truss-examples/xtts-streaming
```

Before deployment:

1. Make sure you have a [Baseten account](https://app.baseten.co/signup) and [API key](https://app.baseten.co/settings/account/api_keys).
2. Install the latest version of Truss: `pip install --upgrade truss`

With `xtts-streaming` as your working directory, deploy the model with:

```sh
truss push
```

Paste your Baseten API key if prompted.

For more information, see the [Truss documentation](https://truss.baseten.co).

## Invoking the model

The model accepts the following inputs:

1. `text`: The text to convert into speech
2. `language`: The language of the text
3. `chunk_size`: Integer size of each streamed chunk

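
The streaming examples in this section send only `text`. A request body that sets all three inputs might look like the sketch below. The helper name `build_payload` and the values `"en"` and `20` are illustrative assumptions, not documented defaults; check the model code for the values it actually accepts.

```python
def build_payload(text, language="en", chunk_size=20):
    """Build the JSON body for the predict endpoint.

    The "en" and 20 defaults are placeholders chosen for illustration.
    """
    if not text:
        raise ValueError("text must be a non-empty string")
    return {"text": text, "language": language, "chunk_size": int(chunk_size)}


# Pass the result as the `json=` argument of requests.post(...)
payload = build_payload("Hello from XTTS.")
print(payload)
```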
Here are two examples of streaming the audio. The first writes all of the streamed chunks to an audio file.

```python
import wave

import requests

channels = 1  # mono=1, stereo=2
sampwidth = 2  # sample width in bytes: 2 for 16-bit audio, 1 for 8-bit audio
framerate = 24000  # sampling rate, in samples per second (Hz)

resp = requests.post(
    "https://model-<model-id>.api.baseten.co/development/predict",
    headers={"Authorization": "Api-Key BASETEN-API-KEY"},
    json={"text": "Kurt watched the incoming Pelicans. The blocky jet-powered craft were so distant they were only specks against the setting sun. He hit the magnification on his faceplate and saw lines of fire tracing their reentry vectors. They would touch down in three minutes."},
    stream=True,
)
resp.raise_for_status()  # fail early on an HTTP error instead of writing it to the file

with wave.open("dat2-wav.wav", "wb") as wav_file:
    wav_file.setnchannels(channels)
    wav_file.setsampwidth(sampwidth)
    wav_file.setframerate(framerate)

    # Iterate through the streamed content and write audio chunks directly
    for chunk in resp.iter_content(chunk_size=None):  # use the server's chunk size
        if chunk:
            wav_file.writeframes(chunk)
```
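
The file-writing loop can be checked without a deployed model. The sketch below assumes, as the parameters above imply, that the endpoint streams raw 16-bit mono PCM at 24 kHz. It feeds stand-in chunks through the same `wave` calls and reads the result back. `write_pcm_chunks` is a hypothetical helper name, and the silent chunks stand in for `resp.iter_content(...)`.

```python
import io
import wave


def write_pcm_chunks(chunks, target, channels=1, sampwidth=2, framerate=24000):
    """Write an iterable of raw PCM byte chunks to `target` as a WAV file.

    `target` may be a path string or a writable, seekable file object.
    Returns the total number of audio bytes written.
    """
    total = 0
    with wave.open(target, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(sampwidth)
        wav_file.setframerate(framerate)
        for chunk in chunks:
            if chunk:  # skip empty chunks
                wav_file.writeframes(chunk)
                total += len(chunk)
    return total


# Stand-in for the HTTP stream: three chunks of silence and one empty chunk.
fake_chunks = [b"\x00\x00" * 2400, b"", b"\x00\x00" * 2400, b"\x00\x00" * 2400]
buf = io.BytesIO()
written = write_pcm_chunks(fake_chunks, buf)

# Read the WAV back to confirm the header matches the data.
buf.seek(0)
with wave.open(buf, "rb") as check:
    seconds = check.getnframes() / check.getframerate()
print(written, seconds)
```

With a real response, pass `resp.iter_content(chunk_size=None)` as `chunks` and a file name as `target`.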

To play the audio as it is generated instead of saving it, here is another option using PyAudio:

```python
import pyaudio
import requests

FORMAT = pyaudio.paInt16  # audio format: 16-bit PCM
CHANNELS = 1  # number of audio channels
RATE = 24000  # sample rate in Hz

# Initialize PyAudio
p = pyaudio.PyAudio()

# Open a stream for audio playback
stream = p.open(format=FORMAT, channels=CHANNELS, rate=RATE, output=True)

# Make a streaming HTTP request to the server
original_text = "Kurt watched the incoming Pelicans. The blocky jet-powered craft were so distant they were only specks against the setting sun. He hit the magnification on his faceplate and saw lines of fire tracing their reentry vectors. They would touch down in three minutes."

resp = requests.post(
    "https://model-<model-id>.api.baseten.co/development/predict",
    headers={"Authorization": "Api-Key BASETEN-API-KEY"},
    json={"text": original_text},
    stream=True,
)

# Create a buffer to hold multiple chunks
buffer = b""
buffer_size_threshold = 2**20  # bytes to accumulate before each playback write

# Stream and play the audio data as it's received
for chunk in resp.iter_content(chunk_size=4096):
    if chunk:
        buffer += chunk
        if len(buffer) >= buffer_size_threshold:
            print(f"Writing buffer of size: {len(buffer)}")
            stream.write(buffer)
            buffer = b""  # clear the buffer

# Play whatever is left once the response ends
if buffer:
    print(f"Writing final buffer of size: {len(buffer)}")
    stream.write(buffer)

# Close and terminate the stream and PyAudio
stream.stop_stream()
stream.close()
p.terminate()
```
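
The value of `buffer_size_threshold` decides how much audio accumulates before playback starts. The sketch below converts between bytes and seconds for the format used in the example (16-bit mono at 24 kHz). The helper names `buffer_bytes` and `buffered_seconds` are hypothetical, and the half-second target is only an illustration, not a recommended setting.

```python
def buffer_bytes(seconds, rate=24000, sampwidth=2, channels=1):
    """Return the byte count for `seconds` of PCM audio, rounded down to whole frames."""
    frame_size = sampwidth * channels  # bytes per frame
    frames = int(seconds * rate)
    return frames * frame_size


def buffered_seconds(num_bytes, rate=24000, sampwidth=2, channels=1):
    """Return how many seconds of PCM audio `num_bytes` holds."""
    return num_bytes / (rate * sampwidth * channels)


# Half a second of audio in this format
print(buffer_bytes(0.5))  # 24000

# Seconds of audio held by the 2**20-byte threshold
print(round(buffered_seconds(2**20), 2))  # 21.85
```

A smaller threshold starts playback sooner but writes to the audio device more often, which may cause gaps if chunks arrive more slowly than they play.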