# llama.cpp/example/tts
This example demonstrates the Text To Speech feature. It uses a
[model](https://www.outeai.com/blog/outetts-0.2-500m) from
[outeai](https://www.outeai.com/).

## Quickstart
If you have built llama.cpp with `-DLLAMA_CURL=ON` you can simply run the
following command and the required models will be downloaded automatically:
```console
$ build/bin/llama-tts --tts-oute-default -p "Hello world" && aplay output.wav
```
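If your build does not have curl support enabled yet, a configuration along
these lines should work (a sketch; adjust the generator and build options to
your setup):
```console
$ cmake -B build -DLLAMA_CURL=ON
$ cmake --build build --config Release
```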
For details about the models and how to convert them to the required format
see the following sections.

### Model conversion
Check out or download the model that contains the LLM model:
```console
$ pushd models
$ git clone --branch main --single-branch --depth 1 https://huggingface.co/OuteAI/OuteTTS-0.2-500M
$ cd OuteTTS-0.2-500M && git lfs install && git lfs pull
$ popd
```
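To sanity-check that the large weight files were actually fetched and are not
just LFS pointer stubs, something like the following can be run (a suggested
check, not part of the original steps):
```console
$ git -C models/OuteTTS-0.2-500M lfs ls-files
```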
Convert the model to .gguf format:
```console
(venv) python convert_hf_to_gguf.py models/OuteTTS-0.2-500M \
    --outfile models/outetts-0.2-0.5B-f16.gguf --outtype f16
```
The generated model will be `models/outetts-0.2-0.5B-f16.gguf`.
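The metadata of the generated file can optionally be inspected with the
`gguf_dump.py` script that ships with llama.cpp (a sketch, assuming the
`gguf-py` requirements are installed in the virtual environment):
```console
(venv) python gguf-py/scripts/gguf_dump.py models/outetts-0.2-0.5B-f16.gguf
```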

We can optionally quantize this to Q8_0 using the following command:
```console
$ build/bin/llama-quantize models/outetts-0.2-0.5B-f16.gguf \
    models/outetts-0.2-0.5B-q8_0.gguf q8_0
```
The quantized model will be `models/outetts-0.2-0.5B-q8_0.gguf`.

Next we do something similar for the audio decoder. First download or check out
the model for the voice decoder:
```console
$ pushd models
$ git clone --branch main --single-branch --depth 1 https://huggingface.co/novateur/WavTokenizer-large-speech-75token
$ cd WavTokenizer-large-speech-75token && git lfs install && git lfs pull
$ popd
```
This model file is a PyTorch checkpoint (.ckpt) and we first need to convert it
to Hugging Face format:
```console
(venv) python examples/tts/convert_pt_to_hf.py \
    models/WavTokenizer-large-speech-75token/wavtokenizer_large_speech_320_24k.ckpt
...
Model has been successfully converted and saved to models/WavTokenizer-large-speech-75token/model.safetensors
Metadata has been saved to models/WavTokenizer-large-speech-75token/index.json
Config has been saved to models/WavTokenizer-large-speech-75token/config.json
```
Then we can convert the Hugging Face format to gguf:
```console
(venv) python convert_hf_to_gguf.py models/WavTokenizer-large-speech-75token \
    --outfile models/wavtokenizer-large-75-f16.gguf --outtype f16
...
INFO:hf-to-gguf:Model successfully exported to models/wavtokenizer-large-75-f16.gguf
```

### Running the example

With both of the models generated, the LLM model and the voice decoder model,
we can run the example:
```console
$ build/bin/llama-tts -m ./models/outetts-0.2-0.5B-q8_0.gguf \
    -mv ./models/wavtokenizer-large-75-f16.gguf \
    -p "Hello world"
...
main: audio written to file 'output.wav'
```
The output.wav file will contain the audio of the prompt. This can be heard
by playing the file with a media player. On Linux the following command will
play the audio:
```console
$ aplay output.wav
```
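On macOS there is no `aplay`, but the bundled `afplay` utility serves the same
purpose (assuming a default macOS install):
```console
$ afplay output.wav
```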