Everything Audio: From real-time AI transcription to additive synthesis and music analysis.
This repository is a comprehensive playground for audio experiments, exploring the intersection of modern AI and classic sound synthesis.
Located in /experiments/additive_synth
A series of educational scripts exploring how to build complex waveforms from scratch using mathematics and pure sine waves.
- Wave Types & Aliasing: Contrasts mathematically ideal waveforms, whose harmonics above the Nyquist frequency alias and can sound harsh once sampled, with band-limited waves built via additive synthesis.
- Envelopes: Implementation of ADSR (Attack, Decay, Sustain, Release) and other amplitude envelopes to shape sound.
- Noise Generation: Custom implementations for generating White, Pink, and Brown noise.
- Intervals: Exploration of musical intervals and tuning.
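The additive approach and the ADSR envelope described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the repository's own scripts: the function names, the 44.1 kHz sample rate, and the linear envelope segments are all chosen for the example. A square wave is approximated by summing odd sine harmonics at amplitude 1/k, stopping below the Nyquist frequency so that nothing aliases.

```python
import math

SAMPLE_RATE = 44_100  # samples per second (assumed for this sketch)


def band_limited_square(freq_hz, duration_s, sample_rate=SAMPLE_RATE):
    """Approximate a square wave by summing odd sine harmonics below Nyquist."""
    nyquist = sample_rate / 2
    # Odd harmonics 1, 3, 5, ... kept only while k * freq_hz stays below Nyquist.
    harmonics = [k for k in range(1, int(nyquist // freq_hz) + 1, 2)
                 if k * freq_hz < nyquist]
    samples = []
    for n in range(int(duration_s * sample_rate)):
        t = n / sample_rate
        partial_sum = sum(math.sin(2 * math.pi * k * freq_hz * t) / k
                          for k in harmonics)
        samples.append(4 / math.pi * partial_sum)
    return samples


def adsr(n_samples, attack, decay, sustain_level, release,
         sample_rate=SAMPLE_RATE):
    """Linear ADSR envelope; times are in seconds, sustain fills the remainder."""
    a = int(attack * sample_rate)
    d = int(decay * sample_rate)
    r = int(release * sample_rate)
    s = max(n_samples - a - d - r, 0)
    envelope = [i / a for i in range(a)]                                # 0 -> 1
    envelope += [1 - (1 - sustain_level) * i / d for i in range(d)]     # 1 -> sustain
    envelope += [sustain_level] * s                                     # hold
    envelope += [sustain_level * (1 - i / r) for i in range(r)]         # sustain -> ~0
    return envelope[:n_samples]


# Shape a short 440 Hz tone with the envelope.
tone = band_limited_square(440.0, 0.05)
envelope = adsr(len(tone), attack=0.01, decay=0.01, sustain_level=0.6, release=0.02)
shaped = [x * e for x, e in zip(tone, envelope)]
```

A naive square wave (`1.0 if phase < 0.5 else -1.0`) contains harmonics at every odd multiple of the fundamental with no upper limit, which is where the aliasing comes from; the additive version trades that for a small ripple near each edge.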
Located in /experiments/text_to_speech
Experiments with modern, open-source TTS models for generating high-quality speech.
- Coqui TTS: Scripts for generating speech using VITS and XTTS v2 models.
- Kokoro 82M: Experiments with this lightweight, high-quality model, including voice blending and weighted blending techniques.
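Weighted voice blending can be thought of as a weighted average of per-voice style vectors. The sketch below shows only that arithmetic in plain Python. It assumes each voice is available as a fixed-length list of numbers; `blend_voices` and the toy vectors are hypothetical, and how the repository's scripts actually load and blend Kokoro voices is not shown here.

```python
def blend_voices(voices, weights):
    """Return the weighted average of equal-length voice style vectors."""
    if not voices or len(voices) != len(weights):
        raise ValueError("need one weight per voice")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    dim = len(voices[0])
    if any(len(v) != dim for v in voices):
        raise ValueError("all voice vectors must have the same length")
    # Normalise by the total so the weights behave like proportions.
    return [sum(w * v[i] for v, w in zip(voices, weights)) / total
            for i in range(dim)]


# Toy 3-dimensional "voices"; real style vectors are much larger.
voice_a = [0.0, 1.0, 2.0]
voice_b = [1.0, 1.0, 0.0]

even_blend = blend_voices([voice_a, voice_b], [1, 1])      # plain 50/50 blend
weighted_blend = blend_voices([voice_a, voice_b], [3, 1])  # 75% A, 25% B
```

Normalising by the weight total means `[3, 1]` and `[0.75, 0.25]` give the same result, which keeps the blend on the same scale as the input voices.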
Located in /experiments/basic-pitch
Audio-to-MIDI conversion experiments using Spotify's Basic Pitch.
- Inference: Converting raw audio files into MIDI sequences.
- Playback: Scripts to play back the analyzed results.
- Comparison: Includes references to original audio and the resulting MIDI analysis for quality comparison.
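When comparing the MIDI output against the original audio, it helps to know how a pitch in hertz maps to a MIDI note number. The sketch below shows only the standard equal-temperament mapping (A4 = 440 Hz = note 69). It is not how Basic Pitch detects notes, which is done by its neural network, and the helper names are hypothetical.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def hz_to_midi(freq_hz, a4_hz=440.0):
    """Fractional MIDI note number for a frequency (12-tone equal temperament)."""
    return 69 + 12 * math.log2(freq_hz / a4_hz)


def midi_to_name(note_number):
    """Note name with octave, using the convention that MIDI 60 is C4."""
    return f"{NOTE_NAMES[note_number % 12]}{note_number // 12 - 1}"


def nearest_note(freq_hz):
    """Nearest MIDI note, its name, and the deviation from it in cents."""
    exact = hz_to_midi(freq_hz)
    note = round(exact)
    cents_off = (exact - note) * 100  # 100 cents per semitone
    return note, midi_to_name(note), cents_off


a4 = nearest_note(440.0)
middle_c = nearest_note(261.63)
```

The cents value is a quick way to judge whether a transcribed note is a clean match or sits between two semitones.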
Located in /server and /client
The project's original core functionality.
- Browser-to-Server Streaming: WebSocket-based audio capture from the browser.
- gRPC Backend: Python server handling the audio stream and performing transcription using OpenAI's Whisper model.
- Bidirectional Streaming: Real-time feedback loop sending transcription results back to the client as they happen.
The original core of this project demonstrates real-time audio streaming from a browser to a gRPC server for transcription.
Recorded Demo: Watch on YouTube
The system bridges browser audio (WebSocket) to a backend gRPC stream.
graph LR
A[Browser] -- "WebSocket (Raw PCM)" --> B[Client Container - FastAPI]
B -- "BiDi: AudioChunk" --> C[Server Container - gRPC]
C -- "BiDi: TranscriptionResult" --> B
B -- "WebSocket (JSON)" --> A
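The flow in the diagram can be sketched with the standard library alone. This is a simplified stand-in, not the project's implementation: `asyncio.Queue` objects play the role of the WebSocket, `fake_transcriber` replaces the gRPC call to Whisper, and the fields on `AudioChunk` and `TranscriptionResult` are assumptions (only the two message names come from the diagram).

```python
import asyncio
from dataclasses import dataclass


@dataclass
class AudioChunk:            # name from the diagram; fields are hypothetical
    pcm: bytes
    sequence: int


@dataclass
class TranscriptionResult:   # name from the diagram; fields are hypothetical
    text: str
    sequence: int


async def fake_transcriber(chunks):
    """Stand-in for the gRPC server: yields one result per incoming chunk."""
    async for chunk in chunks:
        n_samples = len(chunk.pcm) // 2  # assumes 16-bit PCM, 2 bytes per sample
        yield TranscriptionResult(text=f"<{n_samples} samples>",
                                  sequence=chunk.sequence)


async def bridge(websocket_in, websocket_out):
    """Forward browser audio upstream and relay results back as they arrive."""

    async def chunk_stream():
        sequence = 0
        while True:
            data = await websocket_in.get()
            if data is None:  # sentinel: the browser closed the socket
                break
            yield AudioChunk(pcm=data, sequence=sequence)
            sequence += 1

    async for result in fake_transcriber(chunk_stream()):
        # JSON-like payload, as on the WebSocket leg back to the browser.
        await websocket_out.put({"text": result.text, "seq": result.sequence})
    await websocket_out.put(None)  # signal end of results


async def demo():
    inbound, outbound = asyncio.Queue(), asyncio.Queue()
    for payload in (b"\x00\x00" * 160, b"\x01\x00" * 320, None):
        inbound.put_nowait(payload)
    await bridge(inbound, outbound)
    messages = []
    while (message := await outbound.get()) is not None:
        messages.append(message)
    return messages


messages = asyncio.run(demo())
```

Results are pushed to the outbound side as soon as each one is produced rather than after the whole stream ends, which is the property that makes the feedback loop feel real-time.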
Helpful commands for debugging audio input/output issues on Linux.
# Check input devices
arecord -l
# Update and install audio utilities
sudo apt update
sudo apt install pavucontrol alsa-utils
# Check PulseAudio/PipeWire status
systemctl --user status pipewire pipewire-pulse wireplumber
Python Audio & Signal Processing
- Pydub - Simple and easy high-level interface for manipulating audio.
- Librosa - A widely used library for audio and music analysis in Python.
- Spotify Pedalboard - High-quality, studio-grade audio effects in Python.
AI Models & Frameworks
- OpenAI Whisper - General-purpose speech recognition model.
- Coqui TTS - Deep learning toolkit for Text-to-Speech, offering high-performance models.
- Kokoro ONNX & HuggingFace Reference - Lightweight and fast TTS model implementations.
- Magenta - Research project exploring the role of machine learning in the process of creating art and music.
Music Information Retrieval
- Basic Pitch - A lightweight yet robust audio-to-MIDI converter.
- Signal MIDI - A capable web-based MIDI editor for visualizing and editing MIDI files.
Learning
- The Sound of AI (Valerio Velardo) - Excellent channel for learning AI audio, DSP, and Deep Learning for audio.
- Mike Russell - Professional audio production tips and tutorials.
- Audacity - Audio editing and analysis.
- 20 - 20,000 Hz Audio Sweep - Test frequency response for web audio using a spectrogram.
- Online Tone Generator - Generate reference tones.
- Musicca Virtual Instruments - Test MIDI and synthesis.
