Merge pull request anthropics#263 from anthropics/adriaan/elevenlabs-voice-assistant

Adriaan-ANT · web-flow · commit 272550a9ebc7 · 2025-11-01T15:36:21.000Z
Update ElevenLabs Voice Assistant: Improve documentation and error handling
diff --git a/third_party/ElevenLabs/README.md b/third_party/ElevenLabs/README.md
@@ -16,18 +16,45 @@ We recommend following this sequence to get the most out of this cookbook:
 
 ### Step 1: Set Up Your Environment
 
-1. **Get your API keys:**
-   - ElevenLabs API key: [elevenlabs.io/app/developers/api-keys](https://elevenlabs.io/app/developers/api-keys)
-   - Anthropic API key: [console.anthropic.com/settings/keys](https://console.anthropic.com/settings/keys)
+1. **Create a virtual environment:**
+   ```bash
+   # Navigate to the ElevenLabs directory
+   cd /path/to/claude-cookbooks/third_party/ElevenLabs
+
+   # Create virtual environment
+   python -m venv venv
+
+   # Activate it
+   source venv/bin/activate  # On macOS/Linux
+   # OR
+   venv\Scripts\activate     # On Windows
+   ```
+
+2. **Get your API keys:**
+   - **ElevenLabs API key:** [elevenlabs.io/app/developers/api-keys](https://elevenlabs.io/app/developers/api-keys)
+
+     When creating your API key, ensure it has the following minimum permissions:
+     - Text to speech
+     - Speech to text
+     - Read access on voices
+     - Read access on models
+
+   - **Anthropic API key:** [console.anthropic.com/settings/keys](https://console.anthropic.com/settings/keys)
 
-2. **Configure your environment:**
+3. **Configure your environment:**
    ```bash
    cp .env.example .env
-   # Edit .env and add your API keys
    ```
 
-3. **Install dependencies:**
+   Edit `.env` and add your API keys:
+   ```
+   ELEVENLABS_API_KEY=your_elevenlabs_api_key_here
+   ANTHROPIC_API_KEY=sk-ant-api03-...
+   ```
+
+4. **Install dependencies:**
    ```bash
+   # With venv activated
    pip install -r requirements.txt
    ```
 
@@ -65,6 +92,112 @@ The script demonstrates production-ready implementations of:
 - WebSocket-based streaming for minimal latency
 - Custom audio queue for seamless playback
 
+## Troubleshooting
+
+### Audio Popping or Crackling
+
+**Symptom:** You may occasionally hear brief pops, clicks, or audio dropouts during playback.
+
+**Explanation:**
+
+This occurs because the script uses MP3 format audio, which is required for the ElevenLabs free tier. When streaming MP3 data in real-time chunks, FFmpeg occasionally receives incomplete frames that cannot be decoded. This typically happens:
+- At the start of streaming (first chunk may be too small)
+- During brief network delays
+- At the end of audio generation (final chunk may be partial)
+
+The script automatically handles these failed chunks by skipping them (using a try-except pattern in the audio decoding logic), which prevents errors from appearing in the console but may result in brief audio gaps that manifest as pops or clicks.
+
+**Impact:**
+- Audio playback continues normally
+- Brief pops or clicks are usually imperceptible or minor
+- The WebSocket connection remains stable
+- No functionality is lost
+
+**Solution:**
+
+This is expected behavior when using MP3 format on the free tier. To eliminate audio popping entirely:
+1. Upgrade to a paid ElevenLabs tier
+2. Modify the script to use `pcm_44100` format instead of MP3
+3. Play the PCM data without the MP3 decoding step; PCM streams cleanly because it has no frames to decode
+
+### API Key Issues
+
+**Symptom:** `AssertionError: ELEVENLABS_API_KEY is not set` or `AssertionError: ANTHROPIC_API_KEY is not set`
+
+**Solution:**
+1. Verify you've copied `.env.example` to `.env`: `cp .env.example .env`
+2. Edit `.env` and ensure both API keys are set correctly
+3. Check for typos or extra spaces in your API keys
+4. Confirm your ElevenLabs key has the required permissions (see Step 1)
+
+### Dependency Issues
+
+**Symptom:** Errors like `OSError: PortAudio library not found` or audio playback failures
+
+**Solution:**
+
+**macOS:**
+```bash
+brew install portaudio ffmpeg
+```
+
+**Ubuntu/Debian:**
+```bash
+sudo apt-get install portaudio19-dev ffmpeg
+```
+
+**Windows:**
+- Install FFmpeg from [ffmpeg.org](https://ffmpeg.org/download.html)
+- Add FFmpeg to your system PATH
+- PortAudio typically installs automatically with sounddevice on Windows
+
+Then reinstall Python dependencies:
+```bash
+pip install -r requirements.txt
+```
+
+### Microphone Permissions
+
+**Symptom:** `OSError: [Errno -9999] Unanticipated host error` or microphone not accessible
+
+**Solution:**
+- **macOS:** Go to System Settings → Privacy & Security → Microphone (on macOS 12 and earlier: System Preferences → Security & Privacy → Privacy → Microphone), and enable Terminal (or your Python IDE)
+- **Windows:** Go to Settings → Privacy → Microphone, and enable microphone access for Python/Terminal
+- **Linux:** Add your user to the `audio` group: `sudo usermod -a -G audio $USER` (then log out and back in)
+
+Test your microphone setup:
+```bash
+python -c "import sounddevice as sd; print(sd.query_devices())"
+```
+
+### WebSocket Connection Failures
+
+**Symptom:** Connection errors, timeouts, or stream interruptions
+
+**Solution:**
+1. Check that your internet connection is stable
+2. Verify that your firewall isn't blocking WebSocket connections (port 443)
+3. Try disabling VPN or proxy temporarily
+4. Ensure you're not exceeding API rate limits (see ElevenLabs dashboard for usage)
+
+If you continue to experience issues, check [ElevenLabs Status](https://status.elevenlabs.io/) for service updates.
+
+## Project Ideas
+
+Once you're comfortable with the voice assistant, here are some inspiring projects you can build:
+
+- **Meeting Note-Taker** - Record and transcribe meetings in real-time, then use Claude to generate summaries, action items, and key takeaways from the conversation.
+
+- **Language Learning Tutor** - Practice conversations in any language with real-time feedback. Claude can correct pronunciation, suggest better phrasing, and adapt difficulty to your skill level.
+
+- **Interactive Storyteller** - Create choose-your-own-adventure games where Claude narrates the story and responds to your spoken choices, with different voice characters for each role.
+
+- **Hands-Free Coding Assistant** - Describe code changes, bugs, or features verbally while keeping your hands on the keyboard. Perfect for rubber duck debugging or pair programming solo.
+
+- **Voice-Activated Smart Home** - Build natural conversation interfaces for controlling home devices. Ask complex questions like "Is it cold enough to turn on the heater?" instead of simple on/off commands.
+
+- **Personal Voice Journal** - Keep a daily journal by speaking your thoughts. Claude can organize entries by theme, track your mood over time, and surface relevant past entries when you need them.
+
 ## More About ElevenLabs
 
 Here are some helpful resources to deepen your understanding:
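
Note on the README's `pcm_44100` suggestion: the sketch below illustrates why raw PCM avoids the decode failures described under "Audio Popping or Crackling". It assumes the stream delivers 16-bit signed little-endian samples, which should be checked against the ElevenLabs output format documentation. `PcmChunkConverter` is a hypothetical helper, not part of this PR. The only chunk-boundary problem left is a sample split across two chunks, handled by carrying one byte over.

```python
import array
import sys
from typing import List


class PcmChunkConverter:
    """Converts streamed 16-bit PCM bytes to floats in [-1.0, 1.0).

    Assumption: samples are signed 16-bit little-endian.
    """

    def __init__(self) -> None:
        # At most one byte is carried over when a chunk splits a sample.
        self._leftover = b""

    def feed(self, chunk: bytes) -> List[float]:
        data = self._leftover + chunk
        usable = len(data) - (len(data) % 2)  # whole 2-byte samples only
        self._leftover = data[usable:]

        samples = array.array("h")  # signed 16-bit integers
        samples.frombytes(data[:usable])
        if sys.byteorder == "big":
            samples.byteswap()  # stream is assumed little-endian

        # Same scaling as the existing MP3 path (int16 / 32768.0).
        return [s / 32768.0 for s in samples]
```

Unlike the MP3 path, no chunk is ever dropped here: an odd trailing byte simply waits for the next chunk.
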
diff --git a/third_party/ElevenLabs/stream_voice_assistant_websocket.py b/third_party/ElevenLabs/stream_voice_assistant_websocket.py
@@ -109,29 +109,37 @@ def add(self, audio_data):
         Args:
             audio_data: Raw MP3 audio bytes
         """
-        # Decode MP3 to PCM
-        audio_segment = AudioSegment.from_mp3(io.BytesIO(audio_data))
+        try:
+            # Decode MP3 to PCM
+            audio_segment = AudioSegment.from_mp3(io.BytesIO(audio_data))
 
-        # Convert to numpy array
-        samples = np.array(audio_segment.get_array_of_samples(), dtype=np.int16)
-        samples = samples.astype(np.float32) / 32768.0
+            # Convert to numpy array
+            samples = np.array(audio_segment.get_array_of_samples(), dtype=np.int16)
+            samples = samples.astype(np.float32) / 32768.0
 
-        if not self.playing:
-            self.sample_rate = audio_segment.frame_rate
-            self.channels = audio_segment.channels
+            if not self.playing:
+                self.sample_rate = audio_segment.frame_rate
+                self.channels = audio_segment.channels
 
-        # Reshape based on number of channels
-        if self.channels > 1:
-            samples = samples.reshape((-1, self.channels))
-        else:
-            samples = samples.reshape((-1, 1))
-
-        with self.buffer_lock:
-            self.buffer.extend(samples.tobytes())
+            # Reshape based on number of channels
+            if self.channels > 1:
+                samples = samples.reshape((-1, self.channels))
+            else:
+                samples = samples.reshape((-1, 1))
 
-        # Start playback after pre-buffering
-        if not self.playing and len(self.buffer) >= self.PRE_BUFFER_SIZE:
-            self.start_playback()
+            with self.buffer_lock:
+                self.buffer.extend(samples.tobytes())
+
+            # Start playback after pre-buffering
+            if not self.playing and len(self.buffer) >= self.PRE_BUFFER_SIZE:
+                self.start_playback()
+        except Exception:
+            # Silently skip invalid MP3 chunks that fail to decode
+            # This is common when streaming MP3 data in real-time, as chunks may contain
+            # incomplete frames. Skipping these prevents console errors but may cause
+            # brief audio pops. To eliminate popping, upgrade to a paid ElevenLabs tier
+            # and use pcm_44100 format instead of MP3.
+            pass
 
     def start_playback(self):
         """Start the audio output stream."""
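
Note on the new try/except in `add()`: skipping undecodable chunks silently makes it hard to tell how often it happens. The sketch below shows one possible variation that catches `Exception` (so `KeyboardInterrupt` still propagates), counts skipped chunks, and logs them at debug level. `decode_or_skip`, `DecodeStats`, and `fake_decode` are hypothetical names used only for illustration; in the script, the decode callable would wrap the existing `AudioSegment.from_mp3` call.

```python
import logging
from dataclasses import dataclass
from typing import Callable, Optional

logger = logging.getLogger("audio_queue")


@dataclass
class DecodeStats:
    decoded: int = 0
    skipped: int = 0


def decode_or_skip(
    decode: Callable[[bytes], bytes], chunk: bytes, stats: DecodeStats
) -> Optional[bytes]:
    """Return decoded audio, or None if the chunk could not be decoded."""
    try:
        pcm = decode(chunk)
    except Exception as exc:  # does not swallow KeyboardInterrupt/SystemExit
        stats.skipped += 1
        logger.debug("skipped undecodable chunk (%d bytes): %s", len(chunk), exc)
        return None
    stats.decoded += 1
    return pcm


def fake_decode(chunk: bytes) -> bytes:
    """Stand-in decoder for illustration: rejects very short chunks."""
    if len(chunk) < 4:
        raise ValueError("incomplete frame")
    return chunk.upper()
```

With the counters in place, a ratio such as `stats.skipped / (stats.decoded + stats.skipped)` could be logged at the end of a response to judge whether the pops are worth a move to PCM.
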