An intelligent audio processing system that automatically enhances audio and video content by adding contextually relevant sound effects based on speech and video content analysis.
This project implements an AI-driven pipeline that analyzes narrated speech, understands the semantic context, and intelligently inserts sound effects, drawn from a library of 2,120+ sounds or generated by ElevenLabs models, at the most appropriate moments. The system combines speech recognition, natural language understanding, vector embeddings, and audio/video processing to create dynamic and engaging audio experiences.
Key Capabilities:
- Speech-to-text transcription with word-level timing precision
- Video context understanding
- Semantic embedding of speech content and sound metadata in a shared vector space
- Vector similarity matching between speech context and sound effects (sketched after this list)
- Score-based filtering to select optimal sound placements
- Automated audio mixing and synchronization
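As a minimal illustration of the matching capability, the sketch below scores speech segments against sound metadata with cosine similarity. The embedding model, the example texts, and the use of sentence-transformers are assumptions for illustration, not the project's actual implementation.

```python
# Sketch: semantic matching between speech segments and sound metadata.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in embedding model, chosen arbitrarily

speech_segments = ["The storm rolled in over the harbor", "She slammed the door behind her"]
sound_metadata = ["thunder rumbling in the distance", "door slam, wooden, indoor", "seagulls on a beach"]

speech_emb = model.encode(speech_segments, normalize_embeddings=True)
sound_emb = model.encode(sound_metadata, normalize_embeddings=True)

# Cosine similarity reduces to a dot product on normalized vectors.
scores = speech_emb @ sound_emb.T  # shape: (n_segments, n_sounds)

for i, segment in enumerate(speech_segments):
    best = int(np.argmax(scores[i]))
    print(f"{segment!r} -> {sound_metadata[best]!r} (score={scores[i, best]:.2f})")
```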
The pipeline follows a microservice-oriented architecture with the following flow:
Input (Audio/Video)
↓
[Audio Extraction] → Extract audio track from video
↓
[Speech-to-Text] → Transcribe speech with word-level timing
↓
[Speech Embedding] → Generate embeddings for speech segments
↓
[Sound Embedding] → Pre-computed embeddings for sound metadata
↓
[Vector Matching] → Calculate similarity scores between speech and sounds
↓
[Score-based Filtering] → Select best matches based on similarity thresholds
↓
[Audio Mixing] → Combine original audio with selected sound effects
↓
Output (Enhanced Audio/Video)
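As an example of the transcription stage, the sketch below uses the open-source `openai-whisper` package as a stand-in for the project's speech-to-text backend; the model size and file name are placeholders.

```python
# Sketch: word-level transcription, assuming the open-source openai-whisper package.
import whisper

model = whisper.load_model("base")  # model size chosen arbitrarily for the example
result = model.transcribe("narration.wav", word_timestamps=True)

for segment in result["segments"]:
    for word in segment["words"]:
        # Each word carries its start/end time in seconds, which later anchors sound placement.
        print(f"{word['word']!r}: {word['start']:.2f}s -> {word['end']:.2f}s")
```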
Example sample rates:
- 16000 Hz: Standard for speech recognition (Google STT, Whisper)
- 22050 Hz: Acceptable quality for speech
- 44100 Hz: CD quality, good for music
- 48000 Hz: Professional audio/video standard
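For example, an input track can be resampled to the 16 kHz speech-recognition rate at load time; librosa and soundfile are used here as one common option, not necessarily the project's loader.

```python
# Sketch: resample arbitrary input audio to 16 kHz mono for the speech-to-text stage.
import librosa
import soundfile as sf

samples, sr = librosa.load("input_audio.wav", sr=16000, mono=True)  # resampling happens on load
sf.write("speech_16k.wav", samples, sr)
print(f"Wrote {len(samples)} samples at {sr} Hz")
```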
By default, sounds from the open-source sound library SoundBible are used.
The library contains mainstream sounds but is not comprehensive. For situations that require more subtlety, the text-to-sound generation API from ElevenLabs can be leveraged to further enhance the video experience.
Generated sounds are then stored in the sound bank and their embeddings are cached, so they can be reused in comparable contexts; this guarantees consistent sound additions and leverages repetition effects.
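A minimal sketch of that caching idea is shown below: the generated audio path and its embedding are stored under the sound's description, so a comparable context reuses the same sound. The file layout, the JSON cache format, and the `generate_sound` / `embed_text` helpers are assumptions for illustration, not the project's actual code.

```python
# Sketch: cache generated sounds and their embeddings, keyed by the sound description.
import json
from pathlib import Path

SOUND_BANK = Path("sounds_bank")                   # hypothetical location of the sound bank
CACHE_FILE = SOUND_BANK / "embeddings_cache.json"  # hypothetical cache file

def get_or_create_sound(description: str, generate_sound, embed_text):
    """Return (audio_path, embedding) for a description, generating audio only on a cache miss."""
    SOUND_BANK.mkdir(exist_ok=True)
    cache = json.loads(CACHE_FILE.read_text()) if CACHE_FILE.exists() else {}
    if description in cache:
        entry = cache[description]
        return Path(entry["path"]), entry["embedding"]

    audio_path = generate_sound(description, out_dir=SOUND_BANK)  # e.g. an ElevenLabs text-to-sound call
    embedding = embed_text(description)                           # same embedding space as the speech
    cache[description] = {"path": str(audio_path), "embedding": list(embedding)}
    CACHE_FILE.write_text(json.dumps(cache))
    return audio_path, embedding
```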
To select which sounds are eventually included in the video, an LLM is asked to rank the relevance of each candidate sound for a given part of the speech, according to the user input.
This key step can be tailored with the following parameters:
- The `max_sounds` parameter guides the LLM to prioritize the top N most impactful sentences. All sentences are still ranked with unique relevance scores (1, 2, 3...), but the LLM focuses on selecting the best N.
The motivation for using an LLM at this step is that it is expected to align better with the user prompt and to adapt more readily than taking the top-k results from the similarity search (which would always return the most straightforward match).
The current LLM is Gemini 2.5 Flash Lite (see the ranking sketch below).
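A rough sketch of that ranking call, assuming the `google-generativeai` SDK is used to reach Gemini; the prompt wording, the model identifier string, and the candidate format are illustrative assumptions rather than the project's exact implementation.

```python
# Sketch: ask the LLM to rank candidate (sentence, sound) pairs against the user prompt.
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-2.5-flash-lite")  # model id assumed from the notes above

def rank_candidates(user_prompt: str, candidates: list[dict], max_sounds: int = 5) -> str:
    lines = [
        f"- sentence: {c['sentence']!r}, sound: {c['sound']!r}, similarity: {c['score']:.2f}"
        for c in candidates
    ]
    prompt = (
        f"User intent: {user_prompt}\n"
        f"Rank every candidate below with a unique relevance score (1 = most relevant), "
        f"prioritizing the {max_sounds} most impactful sentences:\n" + "\n".join(lines)
    )
    return model.generate_content(prompt).text
```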
Add to .env file:
GOOGLE_API_KEY=your_google_api_key_here
ELEVENLABS_API_KEY=your_api_key_here
Get your keys from the respective providers (Google and ElevenLabs).
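The keys can then be read at runtime, for instance with `python-dotenv` (assumed here; adapt to however the project loads its configuration):

```python
import os
from dotenv import load_dotenv

load_dotenv()  # reads the .env file from the working directory
google_api_key = os.getenv("GOOGLE_API_KEY")
elevenlabs_api_key = os.getenv("ELEVENLABS_API_KEY")
```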
Install ffmpeg on your system:
# Ubuntu/Debian
sudo apt-get install ffmpeg
# macOS
brew install ffmpeg
# Windows
# Download from https://ffmpeg.org/download.html
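Once ffmpeg is installed, the pipeline's audio-extraction stage can shell out to it. The sketch below is one way to do that from Python; the output format (mono WAV at 16 kHz) follows the speech-recognition defaults listed above and may differ from the project's own settings.

```python
# Sketch: extract the audio track from a video with ffmpeg, resampled for speech recognition.
import subprocess

def extract_audio(video_path: str, out_path: str = "extracted_audio.wav") -> str:
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path,
         "-vn",            # drop the video stream
         "-ac", "1",       # mono
         "-ar", "16000",   # 16 kHz sample rate
         out_path],
        check=True,
    )
    return out_path
```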