39 changes: 39 additions & 0 deletions .github/workflows/python-package-conda.yml
@@ -0,0 +1,39 @@
name: Python Package using Conda

on: [push]

jobs:
build-linux:
runs-on: ubuntu-latest
strategy:
max-parallel: 5

steps:
- uses: actions/checkout@v4
- name: Set up Python 3.10
uses: actions/setup-python@v3
with:
python-version: '3.10'
- name: Add conda to system path
run: |
# $CONDA is an environment variable pointing to the root of the miniconda directory
echo $CONDA/bin >> $GITHUB_PATH
- name: Install dependencies
run: |
conda env update --file environment.yml --name base
- name: Lint with flake8
run: |
conda install flake8
# stop the build if there are Python syntax errors or undefined names
flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
# exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
- name: Test with pytest
run: |
conda install pytest
pytest

- name: Security Scan
run: |
pip install pip-audit
pip-audit
62 changes: 62 additions & 0 deletions COLAB_FIX_DETAILS.md
@@ -0,0 +1,62 @@
# Google Colab Compatibility Fixes

This document details the changes made to the `chatterbox` project to ensure compatibility with the Google Colab environment.

## Overview of Changes

The primary issue preventing installation on Google Colab was strict version pinning in `pyproject.toml`. Google Colab environments come with pre-installed versions of major libraries (like PyTorch, NumPy, Transformers) that are updated frequently. Strict pinning (e.g., `==2.6.0`) causes conflicts with these pre-installed versions or forces unnecessary and time-consuming reinstallations that may break the environment.

## File: `pyproject.toml`

The following dependencies were modified:

| Package | Original Version | New Version | Reason |
| :--- | :--- | :--- | :--- |
| `numpy` | `>=1.24.0,<1.26.0` | `>=1.26.0` | Colab ships newer NumPy versions; the `<1.26.0` upper bound was removed and the lower bound raised. |
| `librosa` | `==0.11.0` | `>=0.10.0` | Relaxed strict pin to allow compatible newer or slightly older versions. |
| `torch` | `==2.6.0` | `>=2.0.0` | **CRITICAL**: Colab has pre-installed PyTorch. Strict pinning forces a reinstall that can break CUDA compatibility or time out. Relaxed to any major 2.x version. |
| `torchaudio` | `==2.6.0` | `>=2.0.0` | Matched `torch` relaxation. |
| `transformers` | `==4.46.3` | `>=4.46.0` | Relaxed strict pin. Colab often has recent transformers; exact match is unnecessary. |
| `diffusers` | `==0.29.0` | `>=0.29.0` | Relaxed strict pin to allow updates. |
| `resemble-perth` | `==1.0.1` | `>=1.0.1` | Relaxed pin. |
| `conformer` | `==0.3.2` | `>=0.3.2` | Relaxed pin. |
| `safetensors` | `==0.5.3` | `>=0.5.0` | Relaxed pin. |
| `pykakasi` | `==2.3.0` | `>=2.3.0` | Relaxed pin. |
| `gradio` | `==5.44.1` | `>=4.0.0` | Relaxed substantially: Gradio 5.x is recent, but 4.x is often sufficient, and `>=4.0.0` gives maximum flexibility. |
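
The practical effect of relaxing a pin can be illustrated with the `packaging` library (assumed available; pip vendors it). The version `2.8.0` below is a hypothetical pre-installed Colab build used only for illustration:

```python
from packaging.specifiers import SpecifierSet
from packaging.version import Version

pinned = SpecifierSet("==2.6.0")   # original pyproject.toml pin
relaxed = SpecifierSet(">=2.0.0")  # relaxed requirement

colab_torch = Version("2.8.0")     # hypothetical pre-installed version
assert colab_torch not in pinned   # strict pin: pip must reinstall torch
assert colab_torch in relaxed      # relaxed range: existing install is reused
```

With the pin, pip uninstalls Colab's torch and downloads `2.6.0` (risking CUDA breakage); with the range, the pre-installed build already satisfies the requirement and nothing is touched.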

## File: `src/chatterbox/mtl_tts.py`

**Issue:** The project uses `torch.load` to load model checkpoints (`ve.pt`, `s3gen.pt`). These checkpoints were saved on a CUDA device.
**Fix:** Added `map_location=torch.device('cpu')` logic when the current device is CPU or MPS. This prevents `RuntimeError: Attempting to deserialize object on a CUDA device...` when running on CPU-only Colab instances.

```python
# Added to from_local method:
if device in ["cpu", "mps"]:
map_location = torch.device('cpu')
else:
map_location = None

# Applied 'map_location=map_location' to torch.load calls
```
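
A minimal, self-contained sketch of the pattern (the `safe_load` helper name is illustrative, not the project's actual API):

```python
import torch

def safe_load(path: str, device: str):
    # Checkpoints saved on CUDA embed CUDA storage locations; remap them to
    # CPU when running on CPU or MPS. On a CUDA device, map_location=None
    # keeps the placement recorded in the checkpoint.
    map_location = torch.device("cpu") if device in ("cpu", "mps") else None
    return torch.load(path, map_location=map_location)
```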

## File: `src/chatterbox/tts_turbo.py`

**Issue:** `snapshot_download` was forcing `token=True`, causing `LocalTokenNotFoundError` for users without a configured Hugging Face token.
**Fix:** Changed to `token=os.getenv("HF_TOKEN")` to make authentication optional for public models.
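
A tiny sketch of the change (`hf_token` is an illustrative helper, not the project's API; `snapshot_download` comes from `huggingface_hub`):

```python
import os

def hf_token():
    # Returns the user's token if HF_TOKEN is set, else None. Passing None
    # makes huggingface_hub attempt anonymous access, which suffices for
    # public model repos and avoids LocalTokenNotFoundError.
    return os.getenv("HF_TOKEN")

# Usage sketch:
#   snapshot_download(repo_id=..., token=hf_token())
```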

## File: `example_tts.py`

**Issue:** The script crashed with `FileNotFoundError` if the optional `YOUR_FILE.wav` audio prompt didn't exist.
**Fix:** Added an existence check `if os.path.exists(AUDIO_PROMPT_PATH):` to skip the voice cloning example gracefully if the file is missing.

## How to Install in Colab

In a Google Colab notebook cell, running the following should now work without errors:

```python
!git clone https://github.com/resemble-ai/chatterbox.git
%cd chatterbox
!pip install -e .
```
24 changes: 24 additions & 0 deletions environment.yml
@@ -0,0 +1,24 @@
name: chatterbox
channels:
- defaults
dependencies:
- python>=3.10
- pip
- pip:
- numpy>=1.26.0
- librosa>=0.10.0
- s3tokenizer
- torch>=2.0.0
- torchaudio>=2.0.0
- transformers>=4.46.0
- diffusers>=0.29.0
- resemble-perth>=1.0.1
- conformer>=0.3.2
- safetensors>=0.5.0
- spacy-pkuseg
- pykakasi>=2.3.0
- gradio>=4.0.0
- pyloudnorm
- omegaconf
- gTTS
- soundfile
9 changes: 7 additions & 2 deletions example_tts.py
@@ -25,7 +25,12 @@
ta.save("test-2.wav", wav, multilingual_model.sr)


# If you want to synthesize with a different voice, specify the audio prompt
# If you want to synthesize with a different voice, specify the audio prompt
AUDIO_PROMPT_PATH = "YOUR_FILE.wav"
wav = model.generate(text, audio_prompt_path=AUDIO_PROMPT_PATH)
ta.save("test-3.wav", wav, model.sr)
import os
if os.path.exists(AUDIO_PROMPT_PATH):
wav = model.generate(text, audio_prompt_path=AUDIO_PROMPT_PATH)
ta.save("test-3.wav", wav, model.sr)
else:
print(f"Skipping voice cloning example: '{AUDIO_PROMPT_PATH}' not found.")
7 changes: 7 additions & 0 deletions locales/ml_IN.json
@@ -0,0 +1,7 @@
{
"_comment": "Malayalam Translation File - Created by Ahmed Shajahan",
"settings": "ക്രമീകരണങ്ങൾ",
"start": "തുടങ്ങുക",
"language": "ഭാഷ",
"microphone": "മൈക്രോഫോൺ"
}
127 changes: 115 additions & 12 deletions multilingual_app.py
@@ -63,6 +63,10 @@
"audio": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/ko_f.flac",
"text": "지난달 우리는 유튜브 채널에서 이십억 조회수라는 새로운 이정표에 도달했습니다."
},
"ml": { # Added Malayalam support configuration - Contributed by Ahmed Shajahan
"audio": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/en_f1.flac",
"text": "കഴിഞ്ഞ മാസം, ഞങ്ങളുടെ YouTube ചാനലിൽ രണ്ട് ബില്യൺ കാഴ്‌ചകൾ എന്ന പുതിയ നാഴികക്കല്ല് ഞങ്ങൾ പിന്നിട്ടു."
},
"ms": {
"audio": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/ms_f.flac",
"text": "Bulan lepas, kami mencapai pencapaian baru dengan dua bilion tontonan di saluran YouTube kami."
@@ -236,6 +240,48 @@ def generate_tts_audio(
print("Audio generation complete.")
return (current_model.sr, wav.squeeze(0).numpy())

import json
from pathlib import Path
from chatterbox.asr import SpeechRecognizer

# --- STT Initialization ---
STT_MODEL = None
try:
STT_MODEL = SpeechRecognizer()
print("STT Model initialized.")
except Exception as e:
print(f"Warning: STT Model failed to initialize: {e}")

# --- Localization ---
DEFAULT_LOCALE = {
"settings": "More options",
"start": "Generate",
"language": "Language",
"microphone": "Microphone (Speech to Text)",
"footer": "Malayalam support added by Ahmed Shajahan"
}

def load_locale(lang_code):
"""Load locale data, falling back to English defaults."""
if lang_code == "ml":
try:
with open("locales/ml_IN.json", "r", encoding="utf-8") as f:
data = json.load(f)
# Map keys to UI element expectations if needed, or use directly
return data
except Exception as e:
print(f"Error loading locale for {lang_code}: {e}")
return DEFAULT_LOCALE

def transcribe_audio(audio_path, language_id):
"""Wrapper for STT transcription."""
if not audio_path:
return ""
if STT_MODEL:
return STT_MODEL.transcribe(audio_path, language_id)
return "STT Model not available."


with gr.Blocks() as demo:
gr.Markdown(
"""
@@ -246,22 +292,38 @@

# Display supported languages
gr.Markdown(get_supported_languages_display())

# Shared State
current_locale = gr.State(DEFAULT_LOCALE)

with gr.Row():
with gr.Column():
initial_lang = "fr"
text = gr.Textbox(
value=default_text_for_ui(initial_lang),
label="Text to synthesize (max chars 300)",
max_lines=5
)

# LANGUAGE SELECTOR
language_id = gr.Dropdown(
choices=list(ChatterboxMultilingualTTS.get_supported_languages().keys()),
value=initial_lang,
label="Language",
label=DEFAULT_LOCALE["language"],
info="Select the language for text-to-speech synthesis"
)

# TEXT INPUT
text = gr.Textbox(
value=default_text_for_ui(initial_lang),
label="Text to synthesize (max chars 300)",
max_lines=5
)

# STT INPUT (Microphone)
# "Microphone" label requested by user
stt_input = gr.Audio(
sources=["microphone"],
type="filepath",
label=DEFAULT_LOCALE["microphone"]
)

# REFERENCE AUDIO
ref_wav = gr.Audio(
sources=["upload", "microphone"],
type="filepath",
Expand All @@ -281,24 +343,65 @@ def generate_tts_audio(
0.2, 1, step=.05, label="CFG/Pace", value=0.5
)

with gr.Accordion("More options", open=False):
# SETTINGS (Accordion)
with gr.Accordion(DEFAULT_LOCALE["settings"], open=False) as settings_acc:
seed_num = gr.Number(value=0, label="Random seed (0 for random)")
temp = gr.Slider(0.05, 5, step=.05, label="Temperature", value=.8)

run_btn = gr.Button("Generate", variant="primary")
# START BUTTON
run_btn = gr.Button(DEFAULT_LOCALE["start"], variant="primary")

# FOOTER
footer_text = gr.Markdown("")

with gr.Column():
audio_output = gr.Audio(label="Output Audio")

def on_language_change(lang, current_ref, current_text):
return default_audio_for_ui(lang), default_text_for_ui(lang)
def on_language_change(lang, current_text):
# 1. Get default text/audio for the language
new_text = default_text_for_ui(lang)
new_audio_prompt = default_audio_for_ui(lang)

# 2. Update Localization
loc = load_locale(lang)

# 3. Prepare updates for UI components
# Note: We update labels using the translation

# Footer update logic
footer_msg = f"**{loc.get('footer', '')}**" if lang == "ml" else ""

return (
new_audio_prompt, # ref_wav value
new_text, # text value
gr.update(label=loc.get("language", "Language")), # language_id label
gr.update(label=loc.get("start", "Generate")), # run_btn label
gr.update(label=loc.get("settings", "Settings")), # settings_acc label
gr.update(label=loc.get("microphone", "Microphone")), # stt_input label
footer_msg # footer_text value
)

language_id.change(
fn=on_language_change,
inputs=[language_id, ref_wav, text],
outputs=[ref_wav, text],
inputs=[language_id, text],
outputs=[
ref_wav,
text,
language_id,
run_btn,
settings_acc,
stt_input,
footer_text
],
show_progress=False
)

# Link STT to Textbox
stt_input.change(
fn=transcribe_audio,
inputs=[stt_input, language_id],
outputs=[text]
)

run_btn.click(
fn=generate_tts_audio,
26 changes: 14 additions & 12 deletions pyproject.toml
@@ -9,21 +9,23 @@ authors = [
{name = "resemble-ai", email = "engineering@resemble.ai"}
]
dependencies = [
"numpy>=1.24.0,<1.26.0",
"librosa==0.11.0",
"numpy>=1.26.0", # Modified for Google Colab compatibility (Relaxed from <1.26.0)
"librosa>=0.10.0", # Modified for Google Colab compatibility (Relaxed from ==0.11.0)
"s3tokenizer",
"torch==2.6.0",
"torchaudio==2.6.0",
"transformers==4.46.3",
"diffusers==0.29.0",
"resemble-perth==1.0.1",
"conformer==0.3.2",
"safetensors==0.5.3",
"torch>=2.0.0", # Modified for Google Colab compatibility (Relaxed from ==2.6.0 to avoid conflicts with pre-installed)
"torchaudio>=2.0.0", # Modified for Google Colab compatibility (Relaxed from ==2.6.0)
"transformers>=4.46.0", # Modified for Google Colab compatibility (Relaxed from ==4.46.3)
"diffusers>=0.29.0", # Modified for Google Colab compatibility (Relaxed from ==0.29.0)
"resemble-perth>=1.0.1", # Modified for Google Colab compatibility
"conformer>=0.3.2", # Modified for Google Colab compatibility
"safetensors>=0.5.0", # Modified for Google Colab compatibility
"spacy-pkuseg",
"pykakasi==2.3.0",
"gradio==5.44.1",
"pykakasi>=2.3.0", # Modified for Google Colab compatibility
"gradio>=4.0.0", # Modified for Google Colab compatibility (Relaxed from ==5.44.1 to allow broader range)
"pyloudnorm",
"omegaconf"
"omegaconf",
"gTTS",
"soundfile"
]

[project.urls]
31 changes: 31 additions & 0 deletions reproduce_turbo_issue.py
@@ -0,0 +1,31 @@

import torchaudio as ta
import torch
from chatterbox.tts_turbo import ChatterboxTurboTTS

# Long text (> 350 chars)
LONG_TEXT = """
In the heart of the bustling city, where neon lights flickered like distant stars, lived a detective named Jack.
Jack wasn't your ordinary investigator; he specialized in the peculiar, the unexplained, and the downright weird.
One rainy Tuesday, a woman walked into his office, her coat dripping water onto his already stained rug.
She claimed her cat had started reciting Shakespeare in perfect iambic pentameter.
Intrigued, Jack grabbed his fedora and followed her into the storm, unaware that this case would lead him to a secret society of literary felines plotting world domination through sonnets.
"""

def reproduce():
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

try:
model = ChatterboxTurboTTS.from_pretrained(device=device)
print("Generating audio for long text (approx {} chars)...".format(len(LONG_TEXT)))

wav = model.generate(LONG_TEXT)
ta.save("turbo_long_test.wav", wav, model.sr)
print("Saved 'turbo_long_test.wav'. Check for hallucinations.")

except Exception as e:
print(f"Error: {e}")

if __name__ == "__main__":
reproduce()