Clementhec/audio-noise-effects

Agentic Sound Editing Pipeline

An intelligent audio processing system that automatically enhances audio and video content by adding contextually relevant sound effects based on speech and video content analysis.

Overview

This project implements an AI-driven pipeline that analyzes narrated speech, understands its semantic context, and intelligently inserts sound effects — drawn from a library of 2,120+ sounds or generated by ElevenLabs models — at the most appropriate moments. The system combines speech recognition, natural language understanding, vector embeddings, and audio/video processing to create dynamic, engaging audio experiences.

Key Capabilities:

  • Speech-to-text transcription with word-level timing precision
  • Video context understanding
  • Semantic embedding of speech content and sound metadata in compatible vector space
  • Vector similarity matching between speech context and sound effects
  • Score-based filtering to select optimal sound placements
  • Automated audio mixing and synchronization

Architecture

The pipeline follows a microservice-oriented architecture with the following flow:

Input (Audio/Video)
    ↓
[Audio Extraction] → Extract audio track from video
    ↓
[Speech-to-Text] → Transcribe speech with word-level timing
    ↓
[Speech Embedding] → Generate embeddings for speech segments
    ↓
[Sound Embedding] → Pre-computed embeddings for sound metadata
    ↓
[Vector Matching] → Calculate similarity scores between speech and sounds
    ↓
[Score-based Filtering] → Select best matches based on similarity thresholds
    ↓
[Audio Mixing] → Combine original audio with selected sound effects
    ↓
Output (Enhanced Audio/Video)
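The flow above can be sketched end to end with stub stages standing in for the real transcription, embedding, and mixing calls. All names and the toy word-overlap scoring here are illustrative, not the project's API:

```python
from dataclasses import dataclass

@dataclass
class Segment:
    text: str
    start: float  # seconds
    end: float

def transcribe(_audio: bytes) -> list[Segment]:
    # Stub for the Speech-to-Text stage (the real pipeline keeps
    # word-level timings; one segment is enough for a sketch).
    return [Segment("thunder rolled in the distance", 2.0, 4.5)]

def match(seg: Segment, sound_names: list[str]) -> list[tuple[str, float]]:
    # Toy scoring: word overlap stands in for embedding similarity.
    return sorted(((s, float(s in seg.text)) for s in sound_names),
                  key=lambda pair: pair[1], reverse=True)

def run_pipeline(audio: bytes) -> list[tuple[float, str]]:
    """Return (timestamp, sound) placements for the mixing stage."""
    placements = []
    for seg in transcribe(audio):
        best, score = match(seg, ["thunder", "doorbell"])[0]
        if score >= 0.5:  # score-based filtering
            placements.append((seg.start, best))
    return placements
```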

Audio Extraction

Example sample rates:

  • 16000 Hz: Standard for speech recognition (Google STT, Whisper)
  • 22050 Hz: Acceptable quality for speech
  • 44100 Hz: CD quality, good for music
  • 48000 Hz: Professional audio/video standard
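Extraction itself is typically a single ffmpeg invocation. A sketch that builds and runs the command — the 16000 Hz default follows the table above, since it is what most speech-recognition models expect:

```python
import subprocess

def build_extract_cmd(video_path: str, out_path: str,
                      sample_rate: int = 16000) -> list[str]:
    """Build the ffmpeg command to pull a mono WAV track out of a video."""
    return ["ffmpeg", "-y", "-i", video_path,
            "-vn",                    # drop the video stream
            "-ac", "1",               # downmix to mono
            "-ar", str(sample_rate),  # resample for the STT model
            out_path]

def extract_audio(video_path: str, out_path: str,
                  sample_rate: int = 16000) -> None:
    subprocess.run(build_extract_cmd(video_path, out_path, sample_rate),
                   check=True)
```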

Sounds

By default, sounds from the open-source sound library SoundBible are used.

The library covers mainstream sounds but is not comprehensive. For situations that require more subtlety, the pipeline can fall back on the text-to-sound generation API from ElevenLabs, which can further enhance the video experience.

Generated sounds are then stored in the sound bank, and their embeddings are cached so they can be reused in comparable contexts — guaranteeing consistent sound additions and leveraging repetition effects.

Sound selection

To select which sounds are eventually included in the video, an LLM is asked to rank the relevance of each candidate sound, for a given part of the speech, according to the user input.

This key step can be tailored with the following parameters:

  • The max_sounds parameter guides the LLM to prioritize the top N most impactful sentences. All sentences are still ranked with unique relevance scores (1, 2, 3...), but the LLM focuses on selecting the best N.

The motivation for using an LLM at this step is that it is expected to align better with the user prompt and to be more adaptable than taking the top-k from the similarity search (which would always return the most straightforward match).

The current LLM is Gemini 2.5 Flash Lite.
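Once the LLM has returned its unique relevance ranks, the max_sounds cut-off reduces to a sort-and-slice. A sketch (the LLM call that produces the ranks is elided; the triples below are made-up example output):

```python
def select_top_sounds(ranked: list[tuple[str, str, int]],
                      max_sounds: int = 3) -> list[tuple[str, str, int]]:
    """Keep the `max_sounds` best placements.

    `ranked` holds (sentence, sound, rank) triples, where rank 1 is
    the most relevant — the shape the LLM is asked to return.
    """
    return sorted(ranked, key=lambda triple: triple[2])[:max_sounds]

# Hypothetical LLM output for three candidate placements:
ranked = [("the storm finally broke", "thunder", 1),
          ("she knocked twice", "knock", 3),
          ("the glass shattered", "glass-break", 2)]
selected = select_top_sounds(ranked, max_sounds=2)
```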

Configuration

API Key Setup

Add to .env file:

GOOGLE_API_KEY=your_google_api_key_here
ELEVENLABS_API_KEY=your_api_key_here

Get your keys from Google AI Studio (for the Gemini API) and your ElevenLabs account dashboard.
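Reading these keys at startup can be done with the python-dotenv package, or with a tiny stdlib parser like the sketch below (which handles only simple KEY=value lines):

```python
import os

def load_env(path: str = ".env") -> dict[str, str]:
    """Parse simple KEY=value lines from a .env file.

    A minimal sketch; real projects usually use python-dotenv,
    which also handles quoting and variable expansion.
    """
    env = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                env[key.strip()] = value.strip()
    return env
```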

Requirements

Audio extraction and merging

Install ffmpeg on your system:

# Ubuntu/Debian
sudo apt-get install ffmpeg

# macOS
brew install ffmpeg

# Windows
# Download from https://ffmpeg.org/download.html
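Once the tracks are available, the merging step is conceptually a per-sample overlay at the chosen timestamp. A toy sketch on raw float samples — real code would more likely use pydub's `overlay` or ffmpeg's `amix` filter:

```python
def mix_at(base: list[float], effect: list[float],
           offset: int, gain: float = 0.5) -> list[float]:
    """Overlay `effect` onto `base` starting at sample `offset`,
    clamping to [-1, 1] to avoid clipping artifacts."""
    out = list(base)
    for i, sample in enumerate(effect):
        j = offset + i
        if j >= len(out):
            break  # effect runs past the end of the base track
        out[j] = max(-1.0, min(1.0, out[j] + gain * sample))
    return out
```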
