39 changes: 39 additions & 0 deletions .github/workflows/python-package-conda.yml
@@ -0,0 +1,39 @@
name: Python Package using Conda

on: [push]

jobs:
build-linux:
runs-on: ubuntu-latest
strategy:
max-parallel: 5

steps:
- uses: actions/checkout@v4
- name: Set up Python 3.10
uses: actions/setup-python@v3
with:
python-version: '3.10'
- name: Add conda to system path
run: |
# $CONDA is an environment variable pointing to the root of the miniconda directory
echo $CONDA/bin >> $GITHUB_PATH
- name: Install dependencies
run: |
conda env update --file environment.yml --name base
- name: Lint with flake8
run: |
conda install flake8
# stop the build if there are Python syntax errors or undefined names
flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
# exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
- name: Test with pytest
run: |
conda install pytest
pytest

- name: Security Scan
run: |
pip install pip-audit
pip-audit
62 changes: 62 additions & 0 deletions COLAB_FIX_DETAILS.md
@@ -0,0 +1,62 @@
# Google Colab Compatibility Fixes

This document details the changes made to the `chatterbox` project to ensure compatibility with the Google Colab environment.

## Overview of Changes

The primary issue preventing installation on Google Colab was strict version pinning in `pyproject.toml`. Google Colab environments come with pre-installed versions of major libraries (like PyTorch, NumPy, Transformers) that are updated frequently. Strict pinning (e.g., `==2.6.0`) causes conflicts with these pre-installed versions or forces unnecessary and time-consuming reinstallations that may break the environment.

## File: `pyproject.toml`

The following dependencies were modified:

| Package | Original Version | New Version | Reason |
| :--- | :--- | :--- | :--- |
| `numpy` | `>=1.24.0,<1.26.0` | `>=1.26.0` | Colab ships newer NumPy versions; the `<1.26.0` upper bound was removed and the lower bound raised. |
| `librosa` | `==0.11.0` | `>=0.10.0` | Relaxed strict pin to allow compatible newer or slightly older versions. |
| `torch` | `==2.6.0` | `>=2.0.0` | **CRITICAL**: Colab has pre-installed PyTorch. Strict pinning forces a reinstall that can break CUDA compatibility or time out. Relaxed to any major 2.x version. |
| `torchaudio` | `==2.6.0` | `>=2.0.0` | Matched `torch` relaxation. |
| `transformers` | `==4.46.3` | `>=4.46.0` | Relaxed strict pin. Colab often has recent transformers; exact match is unnecessary. |
| `diffusers` | `==0.29.0` | `>=0.29.0` | Relaxed strict pin to allow updates. |
| `resemble-perth` | `==1.0.1` | `>=1.0.1` | Relaxed pin. |
| `conformer` | `==0.3.2` | `>=0.3.2` | Relaxed pin. |
| `safetensors` | `==0.5.3` | `>=0.5.0` | Relaxed pin. |
| `pykakasi` | `==2.3.0` | `>=2.3.0` | Relaxed pin. |
| `gradio` | `==5.44.1` | `>=4.0.0` | Relaxed substantially: Gradio 5.x is recent, but 4.x is often sufficient, and `>=4.0.0` gives maximum flexibility. |
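
The practical effect of relaxing a pin can be illustrated with the `packaging` library (assumed available; pip vendors it). The version `2.8.0` below is a hypothetical pre-installed Colab build used only for illustration:

```python
from packaging.specifiers import SpecifierSet
from packaging.version import Version

pinned = SpecifierSet("==2.6.0")   # original pyproject.toml pin
relaxed = SpecifierSet(">=2.0.0")  # relaxed requirement

colab_torch = Version("2.8.0")     # hypothetical pre-installed version
assert colab_torch not in pinned   # strict pin: pip must reinstall torch
assert colab_torch in relaxed      # relaxed range: existing install is reused
```

With the pin, pip uninstalls Colab's torch and downloads `2.6.0` (risking CUDA breakage); with the range, the pre-installed build already satisfies the requirement and nothing is touched.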

## File: `src/chatterbox/mtl_tts.py`

**Issue:** The project uses `torch.load` to load model checkpoints (`ve.pt`, `s3gen.pt`). These checkpoints were saved on a CUDA device.
**Fix:** Added `map_location=torch.device('cpu')` logic when the current device is CPU or MPS. This prevents `RuntimeError: Attempting to deserialize object on a CUDA device...` when running on CPU-only Colab instances.

```python
# Added to from_local method:
if device in ["cpu", "mps"]:
map_location = torch.device('cpu')
else:
map_location = None

# Applied 'map_location=map_location' to torch.load calls
```
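
A minimal, self-contained sketch of the pattern (the `safe_load` helper name is illustrative, not the project's actual API):

```python
import torch

def safe_load(path: str, device: str):
    # Checkpoints saved on CUDA embed CUDA storage locations; remap them to
    # CPU when running on CPU or MPS. On a CUDA device, map_location=None
    # keeps the placement recorded in the checkpoint.
    map_location = torch.device("cpu") if device in ("cpu", "mps") else None
    return torch.load(path, map_location=map_location)
```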

## File: `src/chatterbox/tts_turbo.py`

**Issue:** `snapshot_download` was forcing `token=True`, causing `LocalTokenNotFoundError` for users without a configured Hugging Face token.
**Fix:** Changed to `token=os.getenv("HF_TOKEN")` to make authentication optional for public models.
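
A tiny sketch of the change (`hf_token` is an illustrative helper, not the project's API; `snapshot_download` comes from `huggingface_hub`):

```python
import os

def hf_token():
    # Returns the user's token if HF_TOKEN is set, else None. Passing None
    # makes huggingface_hub attempt anonymous access, which suffices for
    # public model repos and avoids LocalTokenNotFoundError.
    return os.getenv("HF_TOKEN")

# Usage sketch:
#   snapshot_download(repo_id=..., token=hf_token())
```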

## File: `example_tts.py`

**Issue:** The script crashed with `FileNotFoundError` if the optional `YOUR_FILE.wav` audio prompt didn't exist.
**Fix:** Added an existence check `if os.path.exists(AUDIO_PROMPT_PATH):` to skip the voice cloning example gracefully if the file is missing.

## How to Install in Colab

In a Google Colab notebook cell, running the following should now work without errors:

```python
!git clone https://github.com/resemble-ai/chatterbox.git
%cd chatterbox
!pip install -e .
```
24 changes: 24 additions & 0 deletions environment.yml
@@ -0,0 +1,24 @@
name: chatterbox
channels:
- defaults
dependencies:
- python>=3.10
- pip
- pip:
- numpy>=1.26.0
- librosa>=0.10.0
- s3tokenizer
- torch>=2.0.0
- torchaudio>=2.0.0
- transformers>=4.46.0
- diffusers>=0.29.0
- resemble-perth>=1.0.1
- conformer>=0.3.2
- safetensors>=0.5.0
- spacy-pkuseg
- pykakasi>=2.3.0
- gradio>=4.0.0
- pyloudnorm
- omegaconf
- gTTS
- soundfile
9 changes: 7 additions & 2 deletions example_tts.py
@@ -25,7 +25,12 @@
ta.save("test-2.wav", wav, multilingual_model.sr)


# If you want to synthesize with a different voice, specify the audio prompt
# If you want to synthesize with a different voice, specify the audio prompt
AUDIO_PROMPT_PATH = "YOUR_FILE.wav"
wav = model.generate(text, audio_prompt_path=AUDIO_PROMPT_PATH)
ta.save("test-3.wav", wav, model.sr)
import os
if os.path.exists(AUDIO_PROMPT_PATH):
wav = model.generate(text, audio_prompt_path=AUDIO_PROMPT_PATH)
ta.save("test-3.wav", wav, model.sr)
else:
print(f"Skipping voice cloning example: '{AUDIO_PROMPT_PATH}' not found.")
7 changes: 7 additions & 0 deletions locales/ml_IN.json
@@ -0,0 +1,7 @@
{
"_comment": "Malayalam Translation File - Created by Ahmed Shajahan",
"settings": "ക്രമീകരണങ്ങൾ",
"start": "തുടങ്ങുക",
"language": "ഭാഷ",
"microphone": "മൈക്രോഫോൺ"
}
127 changes: 115 additions & 12 deletions multilingual_app.py
@@ -63,6 +63,10 @@
"audio": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/ko_f.flac",
"text": "지난달 우리는 유튜브 채널에서 이십억 조회수라는 새로운 이정표에 도달했습니다."
},
"ml": { # Added Malayalam support configuration - Contributed by Ahmed Shajahan
"audio": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/en_f1.flac",
"text": "കഴിഞ്ഞ മാസം, ഞങ്ങളുടെ YouTube ചാനലിൽ രണ്ട് ബില്യൺ കാഴ്‌ചകൾ എന്ന പുതിയ നാഴികക്കല്ല് ഞങ്ങൾ പിന്നിട്ടു."
},
"ms": {
"audio": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/ms_f.flac",
"text": "Bulan lepas, kami mencapai pencapaian baru dengan dua bilion tontonan di saluran YouTube kami."
@@ -236,6 +240,48 @@ def generate_tts_audio(
print("Audio generation complete.")
return (current_model.sr, wav.squeeze(0).numpy())

import json
from pathlib import Path
from chatterbox.asr import SpeechRecognizer

# --- STT Initialization ---
STT_MODEL = None
try:
STT_MODEL = SpeechRecognizer()
print("STT Model initialized.")
except Exception as e:
print(f"Warning: STT Model failed to initialize: {e}")

# --- Localization ---
DEFAULT_LOCALE = {
"settings": "More options",
"start": "Generate",
"language": "Language",
"microphone": "Microphone (Speech to Text)",
"footer": "Malayalam support added by Ahmed Shajahan"
}

def load_locale(lang_code):
"""Load locale data, falling back to English defaults."""
if lang_code == "ml":
try:
with open("locales/ml_IN.json", "r", encoding="utf-8") as f:
data = json.load(f)
# Map keys to UI element expectations if needed, or use directly
return data
except Exception as e:
print(f"Error loading locale for {lang_code}: {e}")
return DEFAULT_LOCALE

def transcribe_audio(audio_path, language_id):
"""Wrapper for STT transcription."""
if not audio_path:
return ""
if STT_MODEL:
return STT_MODEL.transcribe(audio_path, language_id)
return "STT Model not available."


with gr.Blocks() as demo:
gr.Markdown(
"""
@@ -246,22 +292,38 @@

# Display supported languages
gr.Markdown(get_supported_languages_display())

# Shared State
current_locale = gr.State(DEFAULT_LOCALE)

with gr.Row():
with gr.Column():
initial_lang = "fr"
text = gr.Textbox(
value=default_text_for_ui(initial_lang),
label="Text to synthesize (max chars 300)",
max_lines=5
)

# LANGUAGE SELECTOR
language_id = gr.Dropdown(
choices=list(ChatterboxMultilingualTTS.get_supported_languages().keys()),
value=initial_lang,
label="Language",
label=DEFAULT_LOCALE["language"],
info="Select the language for text-to-speech synthesis"
)

# TEXT INPUT
text = gr.Textbox(
value=default_text_for_ui(initial_lang),
label="Text to synthesize (max chars 300)",
max_lines=5
)

# STT INPUT (Microphone)
# "Microphone" label requested by user
stt_input = gr.Audio(
sources=["microphone"],
type="filepath",
label=DEFAULT_LOCALE["microphone"]
)

# REFERENCE AUDIO
ref_wav = gr.Audio(
sources=["upload", "microphone"],
type="filepath",
Expand All @@ -281,24 +343,65 @@ def generate_tts_audio(
0.2, 1, step=.05, label="CFG/Pace", value=0.5
)

with gr.Accordion("More options", open=False):
# SETTINGS (Accordion)
with gr.Accordion(DEFAULT_LOCALE["settings"], open=False) as settings_acc:
seed_num = gr.Number(value=0, label="Random seed (0 for random)")
temp = gr.Slider(0.05, 5, step=.05, label="Temperature", value=.8)

run_btn = gr.Button("Generate", variant="primary")
# START BUTTON
run_btn = gr.Button(DEFAULT_LOCALE["start"], variant="primary")

# FOOTER
footer_text = gr.Markdown("")

with gr.Column():
audio_output = gr.Audio(label="Output Audio")

def on_language_change(lang, current_ref, current_text):
return default_audio_for_ui(lang), default_text_for_ui(lang)
def on_language_change(lang, current_text):
# 1. Get default text/audio for the language
new_text = default_text_for_ui(lang)
new_audio_prompt = default_audio_for_ui(lang)

# 2. Update Localization
loc = load_locale(lang)

# 3. Prepare updates for UI components
# Note: We update labels using the translation

# Footer update logic
footer_msg = f"**{loc.get('footer', '')}**" if lang == "ml" else ""

return (
new_audio_prompt, # ref_wav value
new_text, # text value
gr.update(label=loc.get("language", "Language")), # language_id label
gr.update(label=loc.get("start", "Generate")), # run_btn label
gr.update(label=loc.get("settings", "Settings")), # settings_acc label
gr.update(label=loc.get("microphone", "Microphone")), # stt_input label
footer_msg # footer_text value
)

language_id.change(
fn=on_language_change,
inputs=[language_id, ref_wav, text],
outputs=[ref_wav, text],
inputs=[language_id, text],
outputs=[
ref_wav,
text,
language_id,
run_btn,
settings_acc,
stt_input,
footer_text
],
show_progress=False
)

# Link STT to Textbox
stt_input.change(
fn=transcribe_audio,
inputs=[stt_input, language_id],
outputs=[text]
)

run_btn.click(
fn=generate_tts_audio,
26 changes: 14 additions & 12 deletions pyproject.toml
@@ -9,21 +9,23 @@ authors = [
{name = "resemble-ai", email = "engineering@resemble.ai"}
]
dependencies = [
"numpy>=1.24.0,<1.26.0",
"librosa==0.11.0",
"numpy>=1.26.0", # Modified for Google Colab compatibility (Relaxed from <1.26.0)
"librosa>=0.10.0", # Modified for Google Colab compatibility (Relaxed from ==0.11.0)
"s3tokenizer",
"torch==2.6.0",
"torchaudio==2.6.0",
"transformers==4.46.3",
"diffusers==0.29.0",
"resemble-perth==1.0.1",
"conformer==0.3.2",
"safetensors==0.5.3",
"torch>=2.0.0", # Modified for Google Colab compatibility (Relaxed from ==2.6.0 to avoid conflicts with pre-installed)
"torchaudio>=2.0.0", # Modified for Google Colab compatibility (Relaxed from ==2.6.0)
"transformers>=4.46.0", # Modified for Google Colab compatibility (Relaxed from ==4.46.3)
"diffusers>=0.29.0", # Modified for Google Colab compatibility (Relaxed from ==0.29.0)
"resemble-perth>=1.0.1", # Modified for Google Colab compatibility
"conformer>=0.3.2", # Modified for Google Colab compatibility
"safetensors>=0.5.0", # Modified for Google Colab compatibility
"spacy-pkuseg",
"pykakasi==2.3.0",
"gradio==5.44.1",
"pykakasi>=2.3.0", # Modified for Google Colab compatibility
"gradio>=4.0.0", # Modified for Google Colab compatibility (Relaxed from ==5.44.1 to allow broader range)
"pyloudnorm",
"omegaconf"
"omegaconf",
"gTTS",
"soundfile"
]

[project.urls]
31 changes: 31 additions & 0 deletions reproduce_turbo_issue.py
@@ -0,0 +1,31 @@

import torchaudio as ta
import torch
from chatterbox.tts_turbo import ChatterboxTurboTTS

# Long text (> 350 chars)
LONG_TEXT = """
In the heart of the bustling city, where neon lights flickered like distant stars, lived a detective named Jack.
Jack wasn't your ordinary investigator; he specialized in the peculiar, the unexplained, and the downright weird.
One rainy Tuesday, a woman walked into his office, her coat dripping water onto his already stained rug.
She claimed her cat had started reciting Shakespeare in perfect iambic pentameter.
Intrigued, Jack grabbed his fedora and followed her into the storm, unaware that this case would lead him to a secret society of literary felines plotting world domination through sonnets.
"""

def reproduce():
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

try:
model = ChatterboxTurboTTS.from_pretrained(device=device)
print("Generating audio for long text (approx {} chars)...".format(len(LONG_TEXT)))

wav = model.generate(LONG_TEXT)
ta.save("turbo_long_test.wav", wav, model.sr)
print("Saved 'turbo_long_test.wav'. Check for hallucinations.")

except Exception as e:
print(f"Error: {e}")

if __name__ == "__main__":
reproduce()