An intelligent audio processing system that automatically enhances audio and video content by adding contextually relevant sound effects based on speech and video content analysis.
This project implements an AI-driven pipeline that analyzes narrated speech, understands the semantic context, and intelligently inserts sound effects, drawn from a library of 2,120+ sounds or generated by ElevenLabs models, at the most appropriate moments. The system combines speech recognition, natural language understanding, vector embeddings, and audio/video processing to create dynamic and engaging audio experiences.
Key Capabilities:
- Speech-to-text transcription with word-level timing precision
- Video context understanding
- Semantic embedding of speech content and sound metadata in a shared vector space
- Vector similarity matching between speech context and sound effects (sketched after this list)
- Score-based filtering to select optimal sound placements
- Automated audio mixing and synchronization
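As a minimal illustration of the matching capability, the sketch below scores speech segments against sound metadata with cosine similarity. The embedding model, the example texts, and the use of sentence-transformers are assumptions for illustration, not the project's actual implementation.

```python
# Sketch: semantic matching between speech segments and sound metadata.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in embedding model, chosen arbitrarily

speech_segments = ["The storm rolled in over the harbor", "She slammed the door behind her"]
sound_metadata = ["thunder rumbling in the distance", "door slam, wooden, indoor", "seagulls on a beach"]

speech_emb = model.encode(speech_segments, normalize_embeddings=True)
sound_emb = model.encode(sound_metadata, normalize_embeddings=True)

# Cosine similarity reduces to a dot product on normalized vectors.
scores = speech_emb @ sound_emb.T  # shape: (n_segments, n_sounds)

for i, segment in enumerate(speech_segments):
    best = int(np.argmax(scores[i]))
    print(f"{segment!r} -> {sound_metadata[best]!r} (score={scores[i, best]:.2f})")
```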
The pipeline follows a microservice-oriented architecture with the following flow:
Input (Audio/Video)
↓
[Audio Extraction] → Extract audio track from video
↓
[Speech-to-Text] → Transcribe speech with word-level timing
↓
[Speech Embedding] → Generate embeddings for speech segments
↓
[Sound Embedding] → Pre-computed embeddings for sound metadata
↓
[Vector Matching] → Calculate similarity scores between speech and sounds
↓
[Score-based Filtering] → Select best matches based on similarity thresholds
↓
[Audio Mixing] → Combine original audio with selected sound effects
↓
Output (Enhanced Audio/Video)
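As an example of the transcription stage, the sketch below uses the open-source `openai-whisper` package as a stand-in for the project's speech-to-text backend; the model size and file name are placeholders.

```python
# Sketch: word-level transcription, assuming the open-source openai-whisper package.
import whisper

model = whisper.load_model("base")  # model size chosen arbitrarily for the example
result = model.transcribe("narration.wav", word_timestamps=True)

for segment in result["segments"]:
    for word in segment["words"]:
        # Each word carries its start/end time in seconds, which later anchors sound placement.
        print(f"{word['word']!r}: {word['start']:.2f}s -> {word['end']:.2f}s")
```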
Example sample rates:
- 16000 Hz: Standard for speech recognition (Google STT, Whisper)
- 22050 Hz: Acceptable quality for speech
- 44100 Hz: CD quality, good for music
- 48000 Hz: Professional audio/video standard
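For example, an input track can be resampled to the 16 kHz speech-recognition rate at load time; librosa and soundfile are used here as one common option, not necessarily the project's loader.

```python
# Sketch: resample arbitrary input audio to 16 kHz mono for the speech-to-text stage.
import librosa
import soundfile as sf

samples, sr = librosa.load("input_audio.wav", sr=16000, mono=True)  # resampling happens on load
sf.write("speech_16k.wav", samples, sr)
print(f"Wrote {len(samples)} samples at {sr} Hz")
```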
By default, sounds from the open-source sound library SoundBible are used.
The library contains mainstream sounds but is not comprehensive. For situations that require more subtlety, the text-to-sound generation API from ElevenLabs can be leveraged to further enhance the video experience.
Generated sounds are then stored in the sound bank and their embeddings are cached, so they can be reused in comparable contexts; this guarantees consistent sound additions and leverages repetition effects.
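A minimal sketch of that caching idea is shown below: the generated audio path and its embedding are stored under the sound's description, so a comparable context reuses the same sound. The file layout, the JSON cache format, and the `generate_sound` / `embed_text` helpers are assumptions for illustration, not the project's actual code.

```python
# Sketch: cache generated sounds and their embeddings, keyed by the sound description.
import json
from pathlib import Path

SOUND_BANK = Path("sounds_bank")                   # hypothetical location of the sound bank
CACHE_FILE = SOUND_BANK / "embeddings_cache.json"  # hypothetical cache file

def get_or_create_sound(description: str, generate_sound, embed_text):
    """Return (audio_path, embedding) for a description, generating audio only on a cache miss."""
    SOUND_BANK.mkdir(exist_ok=True)
    cache = json.loads(CACHE_FILE.read_text()) if CACHE_FILE.exists() else {}
    if description in cache:
        entry = cache[description]
        return Path(entry["path"]), entry["embedding"]

    audio_path = generate_sound(description, out_dir=SOUND_BANK)  # e.g. an ElevenLabs text-to-sound call
    embedding = embed_text(description)                           # same embedding space as the speech
    cache[description] = {"path": str(audio_path), "embedding": list(embedding)}
    CACHE_FILE.write_text(json.dumps(cache))
    return audio_path, embedding
```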
To select which sounds are eventually included in the video, an LLM is asked to rank the relevance of each candidate sound for a given part of the speech, according to the user input.
This key step can be tailored with the following parameters:
- The `max_sounds` parameter guides the LLM to prioritize the top N most impactful sentences. All sentences are still ranked with unique relevance scores (1, 2, 3...), but the LLM focuses on selecting the best N.
The motivation for using an LLM at this step is that it is expected to align better with the user prompt and to adapt more readily than taking the top-k results from the similarity search (which would always return the most straightforward match).
The current LLM is Gemini 2.5 Flash Lite (see the ranking sketch below).
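A rough sketch of that ranking call, assuming the `google-generativeai` SDK is used to reach Gemini; the prompt wording, the model identifier string, and the candidate format are illustrative assumptions rather than the project's exact implementation.

```python
# Sketch: ask the LLM to rank candidate (sentence, sound) pairs against the user prompt.
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-2.5-flash-lite")  # model id assumed from the notes above

def rank_candidates(user_prompt: str, candidates: list[dict], max_sounds: int = 5) -> str:
    lines = [
        f"- sentence: {c['sentence']!r}, sound: {c['sound']!r}, similarity: {c['score']:.2f}"
        for c in candidates
    ]
    prompt = (
        f"User intent: {user_prompt}\n"
        f"Rank every candidate below with a unique relevance score (1 = most relevant), "
        f"prioritizing the {max_sounds} most impactful sentences:\n" + "\n".join(lines)
    )
    return model.generate_content(prompt).text
```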
Add to .env file:
GOOGLE_API_KEY=your_google_api_key_here
ELEVENLABS_API_KEY=your_api_key_here
Get your keys from the respective providers (Google and ElevenLabs).
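The keys can then be read at runtime, for instance with `python-dotenv` (assumed here; adapt to however the project loads its configuration):

```python
import os
from dotenv import load_dotenv

load_dotenv()  # reads the .env file from the working directory
google_api_key = os.getenv("GOOGLE_API_KEY")
elevenlabs_api_key = os.getenv("ELEVENLABS_API_KEY")
```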
Install ffmpeg on your system:
# Ubuntu/Debian
sudo apt-get install ffmpeg
# macOS
brew install ffmpeg
# Windows
# Download from https://ffmpeg.org/download.html
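Once ffmpeg is installed, the pipeline's audio-extraction stage can shell out to it. The sketch below is one way to do that from Python; the output format (mono WAV at 16 kHz) follows the speech-recognition defaults listed above and may differ from the project's own settings.

```python
# Sketch: extract the audio track from a video with ffmpeg, resampled for speech recognition.
import subprocess

def extract_audio(video_path: str, out_path: str = "extracted_audio.wav") -> str:
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path,
         "-vn",            # drop the video stream
         "-ac", "1",       # mono
         "-ar", "16000",   # 16 kHz sample rate
         out_path],
        check=True,
    )
    return out_path
```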