
Visual-Mic


A Python implementation of the Visual Microphone algorithm, which recovers sound from high-speed video by analyzing sub-pixel surface vibrations. When sound hits an object, it causes tiny vibrations on the surface—far too small to see with the naked eye, but detectable in the phase of complex wavelet coefficients. This tool extracts those vibrations and reconstructs an audible signal, effectively turning everyday objects into microphones.

The original work by Davis et al. (MIT CSAIL, SIGGRAPH 2014) used Complex Steerable Pyramids for the video decomposition. This project uses the 2D Dual-Tree Complex Wavelet Transform (DTCWT) instead, which is ~5x more computationally efficient while still providing reliable phase information for motion estimation. We test against the same high-speed videos provided by MIT CSAIL.

The sample videos can be downloaded from here. For example, Chips1-2200Hz-Mary_Had-input.avi is a high-speed video of a bag of chips vibrating to "Mary Had A Little Lamb" (704x704, 22,859 frames captured at 2200 fps, ~14 GB). Note: the AVI container reports ~30 fps, but the actual capture rate is 2200 Hz — use --fps 2200 when running visualmic.py to get the correct audio sample rate.



Setup

A. Local Setup

git clone https://github.com/joeljose/Visual-Mic.git
cd Visual-Mic
pip install -r requirements.txt
python visualmic.py -i testvid.avi -o recovered_audio.wav

Requirements: Python 3.8+

B. Docker (CPU)

# Build
./docker-build.sh

# Run
docker run --rm -v /path/to/videos:/data \
    visual-mic:latest \
    -i /data/testvid.avi -o /data/sound.wav

C. Docker (GPU)

Requires nvidia-container-toolkit.

# Build
./docker-build-gpu.sh

# Run
docker run --rm --gpus all -v /path/to/videos:/data \
    visual-mic-gpu:latest \
    --gpu -i /data/Chips1-2200Hz-Mary_Had-input.avi \
    -o /data/sound.wav --fps 2200 --batch-size 32

The --batch-size flag controls how many frames are processed per GPU batch (default: 16). Larger batches are faster but use more GPU memory. At 704x704, each frame uses ~10 MB of GPU memory, so --batch-size 32 needs ~660 MB including overhead.

Note: GPU mode produces very similar but not bit-identical output compared to CPU mode, due to float32 vs float64 precision differences and different DTCWT implementations (pytorch_wavelets vs dtcwt).


Usage

CLI Tool

# Basic usage
python visualmic.py -i testvid.avi -o recovered_audio.wav

# With temporal bandpass filter
python visualmic.py -i testvid.avi -fl 80 -fh 1000

# With ROI (focus on vibrating object)
python visualmic.py -i testvid.avi --roi 100,50,200,150

# Override frame rate for high-speed video
python visualmic.py -i Chips1-2200Hz-Mary_Had-input.avi --fps 2200

# GPU acceleration
python visualmic.py -i testvid.avi --gpu --batch-size 32

# Custom wavelet filters
python visualmic.py -i testvid.avi --biort near_sym_a --qshift qshift_a

Flag | Default | Description
--- | --- | ---
-i / --input | (required) | Input video path
-o / --output | sound.wav | Output audio path
-fl / --freq-low | | Lower cutoff frequency (Hz) for temporal bandpass filter
-fh / --freq-high | | Upper cutoff frequency (Hz) for temporal bandpass filter
--fps | | Override video frame rate (Hz) for audio sample rate
--roi | | Region of interest as x,y,w,h
--gpu | off | Use GPU-accelerated DTCWT (requires CUDA + pytorch_wavelets)
--batch-size | 16 | Frames per GPU batch (GPU mode only)
--nlevels | 3 | Number of DTCWT decomposition levels
--biort | near_sym_b | Biorthogonal wavelet filter for DTCWT level 1
--qshift | qshift_b | Quarter-shift wavelet filter for DTCWT levels 2+
--version | | Show program version and exit

Available wavelet filters:

  • --biort: antonini, legall, near_sym_a, near_sym_b
  • --qshift: qshift_06, qshift_a, qshift_b, qshift_c, qshift_d

When -fl and/or -fh are specified, a Butterworth filter is applied to the phase signals before audio reconstruction, rejecting low-frequency drift and high-frequency noise.

When --fps is specified, the given value is used as the audio sample rate instead of the frame rate reported by the video container. This is necessary for high-speed camera footage where the container frame rate does not reflect the actual capture rate.

When --roi is specified, each frame is cropped to the given rectangle before the DTCWT decomposition. This reduces computation and can improve SNR by focusing on the vibrating object.

Tips

  • Start with default settings and adjust from there.
  • Use --roi to focus on the vibrating object — improves SNR and reduces computation.
  • For MIT CSAIL videos, always use --fps 2200 (the container reports ~30 fps incorrectly).
  • Use --gpu for large videos — the DTCWT forward pass is the main bottleneck.
  • If GPU runs out of memory, reduce --batch-size.

GPU Acceleration

The --gpu flag enables GPU-accelerated processing via PyTorch and pytorch_wavelets. The GPU path replaces the CPU DTCWT forward transform with a CUDA-accelerated batched equivalent while keeping the same algorithmic pipeline. Temporal postprocessing (bandpass filtering, cross-correlation, sub-band summation) remains on CPU/NumPy since it operates on the small phase signal array, not full wavelet coefficients.

Batched DTCWT Architecture

The GPU path processes frames in configurable batches rather than one at a time:

  1. Batch accumulation: Grayscale frames are collected into batches of --batch-size frames (default: 16)
  2. GPU transfer: The batch is stacked into a (B, 1, H, W) float32 tensor and sent to GPU
  3. Batched forward DTCWT: pytorch_wavelets.DTCWTForward processes all frames in the batch simultaneously, producing Yh[level] with shape (B, 1, 6, H_l, W_l, 2) where the last dimension is real/imaginary
  4. Phase extraction: Performed on-GPU for the entire batch (see below)
  5. Transfer back: Only the small phase signal array (B, nlevels, 6) is transferred to CPU — the full wavelet coefficients are discarded
  6. Memory cleanup: GPU tensors are explicitly deleted (del batch_tensor, Yl, Yh) after each batch

This streaming architecture means GPU memory usage is proportional to batch_size, not frame_count — enabling processing of arbitrarily long videos.
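
A minimal sketch of steps 1–3, assuming the pytorch_wavelets DTCWTForward API (forward_batch and the surrounding shape handling are illustrative helpers, not the actual visualmic.py code):

# Illustrative sketch of steps 1-3 above; not the actual visualmic.py internals.
import numpy as np
import torch
from pytorch_wavelets import DTCWTForward

xfm = DTCWTForward(J=3, biort='near_sym_b', qshift='qshift_b').cuda()

def forward_batch(gray_frames):
    """gray_frames: list of HxW uint8 arrays accumulated from cv2."""
    batch = torch.from_numpy(np.stack(gray_frames)).float().unsqueeze(1).cuda()  # (B, 1, H, W)
    Yl, Yh = xfm(batch)   # Yh[level]: (B, 1, 6, H_l, W_l, 2), last axis = real/imag
    return batch, Yl, Yh  # caller extracts phases on-GPU, then deletes these tensors (step 6)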

GPU Phase Extraction

The CPU path uses NumPy's complex number support (np.angle(coeffs * ref_conj)). The GPU path must handle complex arithmetic manually because pytorch_wavelets represents coefficients as real/imaginary pairs in the last dimension:

Yh[level] shape: (B, 1, 6, H, W, 2)
                                   └── [0]=real, [1]=imag

Conjugate multiplication (phase difference from reference):

# (c + id)(a - ib) = (ca + db) + i(da - cb)
prod_real = c_real * r_real + c_imag * r_imag
prod_imag = c_imag * r_real - c_real * r_imag

Phase and amplitude-squared weighting (vectorized over entire batch):

phase_diff = torch.atan2(prod_imag, prod_real)
amp_sq = c_real * c_real + c_imag * c_imag
weighted = (amp_sq * phase_diff).sum(dim=(-2, -1))  # sum over H, W

This produces the same $A^2$-weighted spatial average as the CPU path, but computed entirely on GPU for the full batch at once.

Reference frame handling: The reference frame's coefficients are extracted when the batch containing ref_index is processed, then retained as a small GPU tensor for subsequent batches. Frames before the reference produce zero phase signals.
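
A hedged sketch of that bookkeeping (extract_reference is an illustrative name; the real code may organize this differently):

def extract_reference(Yh, ref_index, batch_start, nlevels=3):
    """If ref_index falls in this batch, keep that frame's (real, imag) tensors per level."""
    k = ref_index - batch_start
    if not (0 <= k < Yh[0].shape[0]):
        return None  # reference frame is not in this batch
    return [(Yh[level][k:k+1, ..., 0].clone(), Yh[level][k:k+1, ..., 1].clone())
            for level in range(nlevels)]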

Memory Management

Pre-flight VRAM estimation: Before processing, estimate_vram() calculates peak VRAM usage based on batch size and frame dimensions. The DTCWT forward transform requires ~15x the input frame size in working memory (filter banks, intermediate convolutions). If estimated usage exceeds 70% of available VRAM, a warning is printed with suggestions to reduce --batch-size, use --roi, or switch to CPU mode.
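
In rough terms the check works out like the sketch below, using the ~15x working-memory factor and 70% threshold quoted above (the real estimate_vram() may account for things differently):

# Back-of-envelope version of the pre-flight check; not the actual estimate_vram()
import torch

def rough_vram_check(batch_size, height, width, working_factor=15):
    """Estimate peak usage as float32 frames times a DTCWT working-memory factor."""
    estimate = batch_size * height * width * 4 * working_factor   # bytes
    free, total = torch.cuda.mem_get_info()                       # free / total VRAM in bytes
    if estimate > 0.7 * free:
        print(f"Warning: ~{estimate / 1e6:.0f} MB estimated vs {free / 1e6:.0f} MB free; "
              f"consider a smaller --batch-size, an --roi crop, or CPU mode.")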

OOM handling: If a CUDA out-of-memory error occurs during processing, the tool catches it and exits with an actionable error message rather than a raw PyTorch traceback.
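
The usual pattern for that looks roughly like this (a sketch reusing names from the batching sketch above; the actual handler may be structured differently):

try:
    Yl, Yh = xfm(batch_tensor)   # the batched forward transform
except RuntimeError as err:
    if "out of memory" not in str(err).lower():
        raise
    raise SystemExit("CUDA out of memory: reduce --batch-size, crop with --roi, "
                     "or rerun without --gpu.") from err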

CPU/GPU output differences: The GPU path uses float32 (PyTorch/CUDA standard) while the CPU path uses float64. Combined with different DTCWT implementations (pytorch_wavelets vs dtcwt), outputs are similar but not bit-identical. Both produce valid audio recovery.

Performance

Benchmarked on Chips2-2200Hz-Mary_MIDI-input.avi (704x400, 38,083 frames, 2200 fps) with an RTX 4050 (6 GB VRAM):

Configuration | Time | Notes
--- | --- | ---
GPU, default settings | 3m 50s | --batch-size 32, near_sym_b/qshift_b
GPU, --nlevels 2 | 3m 49s | Fewer decomposition levels
GPU, old filters | 3m 30s | --biort near_sym_a --qshift qshift_a
GPU, with ROI | 2m 46s | --roi 100,50,400,300 (smaller region)

Hardware requirements (GPU path):

  • NVIDIA GPU with CUDA 12.1+ support
  • Minimum ~1 GB VRAM for typical videos (scales with --batch-size and resolution)
  • nvidia-container-toolkit for Docker GPU support

Part 1: The Original Work (Davis et al., SIGGRAPH 2014)

Paper: "The Visual Microphone: Passive Recovery of Sound from Video" Authors: Abe Davis, Michael Rubinstein, Neal Wadhwa, Gautham J. Mysore, Frédo Durand, William T. Freeman Venue: ACM Transactions on Graphics (SIGGRAPH 2014), Vol 33, No 4 Institutions: MIT CSAIL, Stanford, Adobe Research

1.1 The Physical Phenomenon

When sound travels through air, it creates pressure waves. When these waves hit an object's surface, they cause tiny vibrations — displacements on the order of micrometers or less. These vibrations are far too small to see with the naked eye, but a high-speed camera recording thousands of frames per second can capture them as subtle pixel-level changes.

Key insight: If we can measure those sub-pixel surface displacements over time, we effectively have a recording of the sound pressure wave — we've turned the object into a microphone.

Example: Playing "Mary Had A Little Lamb" near a bag of chips causes the bag's surface to vibrate at the frequencies of the music. A high-speed camera (2000–6000 fps) pointed at the bag captures these vibrations as tiny frame-to-frame changes.

1.2 Why Not Just Track Pixels?

You might think: "Just compute optical flow between frames and track the motion." The problem:

  1. The motions are sub-pixel — typically $\frac{1}{100}$ to $\frac{1}{1000}$ of a pixel. Standard optical flow fails at this scale.
  2. Noise dominates — sensor noise, quantization noise, and lighting fluctuations are all larger than the actual vibration signal.
  3. You need temporal precision — to recover audio at meaningful frequencies, you need to track motion at every single frame with high temporal fidelity.

Solution: Instead of tracking pixels in the spatial domain, work in the frequency domain using the phase of complex wavelet/pyramid coefficients. Phase is far more sensitive to small motions than amplitude.

1.3 The Key Insight: Phase = Motion

Consider a 1D signal shifted by a small displacement $\delta$:

$$f(x) \rightarrow f(x + \delta)$$

In the Fourier domain, a spatial shift becomes a phase shift:

$$F(\omega) \rightarrow F(\omega) \cdot e^{i\omega\delta}$$

So if you decompose an image into frequency bands and track how the phase of each band changes over time, you're directly measuring local displacement at that spatial frequency.

For a band-pass filtered signal at spatial frequency $\omega_0$:

$$\Delta\phi \approx \omega_0 \cdot \delta$$

where $\delta$ is the local displacement. This is the foundation of the entire method.
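
A quick numerical sanity check of this relation (an illustration, not taken from the paper): shift a sinusoidal test pattern by 1/100 of a pixel and compare the phase change of its Fourier component against $\omega_0 \cdot \delta$.

# Numerical check that the phase change equals omega_0 * delta for a sub-pixel shift
import numpy as np

n = 256
x = np.arange(n)
omega0 = 2 * np.pi * 8 / n            # spatial frequency of an 8-cycle test pattern
delta = 0.01                          # shift of 1/100 of a pixel

f0 = np.cos(omega0 * x)               # original pattern
f1 = np.cos(omega0 * (x + delta))     # same pattern shifted by delta

F0, F1 = np.fft.rfft(f0), np.fft.rfft(f1)
dphi = np.angle(F1[8] * np.conj(F0[8]))   # phase change of the 8-cycle bin

print(dphi, omega0 * delta)           # both are about 1.96e-3 rad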

Why phase is better than amplitude

Property | Amplitude $A$ | Phase $\phi$
--- | --- | ---
Physical meaning | "How much texture is here" | "Where exactly is this texture positioned"
Response to small motion | Relatively stable | Shifts linearly with displacement
Sub-pixel sensitivity | Poor | Excellent — detects fractions of a pixel

1.4 Complex Steerable Pyramid

The original paper uses a Complex Steerable Pyramid to decompose each video frame.

What is a steerable pyramid?

A multi-scale, multi-orientation filter bank that decomposes an image into:

  • Multiple scales (frequency bands): coarse $\rightarrow$ fine detail
  • Multiple orientations at each scale: e.g., $0°, 30°, 60°, 90°, 120°, 150°$ (for 6 orientations)
  • A lowpass residual (the blurry base image)
  • A highpass residual (the finest details)

Why "complex"?

Each sub-band produces complex-valued coefficients. At each spatial location $(x, y)$, for scale $s$ and orientation $\theta$, you get:

$$C(s, \theta, x, y) = A(s, \theta, x, y) \cdot e^{i \cdot \phi(s, \theta, x, y)}$$

where:

  • $A$ = amplitude (how strong the texture is at this location/scale/orientation)
  • $\phi$ = phase (the precise position of the texture pattern)

Why "steerable"?

The filters can be analytically rotated to any orientation without recomputing — this gives fine directional control and avoids aliasing artifacts.

Key Properties

  • Translation equivariant: shifting the input shifts the coefficients predictably (phase changes linearly)
  • Overcomplete (~21x for 8 orientations): more coefficients than pixels $\rightarrow$ redundancy helps with noise
  • Shift-invariant: no downsampling artifacts that would corrupt phase measurements

1.5 The Original Algorithm — Step by Step

Input

  • High-speed video $V(x, y, t)$ with $N$ frames at $F$ fps
  • Grayscale frames (color not needed for vibration)

Step 1: Decompose Every Frame

For each frame $t = 0, 1, \ldots, N-1$:

$${C(s, \theta, x, y, t)} = \text{ComplexSteerablePyramid}(V(:,:,t))$$

This gives complex coefficients at $S$ scales and $K$ orientations.

Step 2: Extract Amplitude and Phase

For each coefficient:

$$A(s, \theta, x, y, t) = |C(s, \theta, x, y, t)|$$

$$\phi(s, \theta, x, y, t) = \angle C(s, \theta, x, y, t)$$

Step 3: Compute Phase Variation (Local Motion Signal)

Choose a reference frame $t_0$ (usually frame 0). For each subsequent frame:

$$\phi_v(s, \theta, x, y, t) = \phi(s, \theta, x, y, t) - \phi(s, \theta, x, y, t_0)$$

This phase difference is proportional to how much the texture at location $(x, y)$ has moved since the reference frame, at that particular scale and orientation.

Why subtract the reference? The absolute phase values encode the texture pattern itself (which we don't care about). By subtracting the reference, we isolate the change — which is the vibration.

Step 4: Compute Global Motion Signal (Amplitude-Weighted Spatial Average)

For each scale $s$ and orientation $\theta$, collapse the spatial dimensions:

$$\Phi(s, \theta, t) = \sum_{x,y} A(s, \theta, x, y, t)^2 \cdot \phi_v(s, \theta, x, y, t)$$

Why weight by $A^2$?

  • Regions with strong texture (high amplitude) give reliable phase measurements
  • Regions with weak/no texture (low amplitude) have noisy/random phase — we want to suppress these
  • $A^2$ weighting is effectively a "reliability-weighted average" that emphasizes trustworthy measurements

This produces one 1D time signal per $(s, \theta)$ pair.

Step 5: Temporal Alignment Across Sub-bands

Different scales and orientations may have phase offsets relative to each other. Align them using cross-correlation:

  1. Pick a reference sub-band (e.g., scale 0, orientation 0): $\text{ref} = \Phi(0, 0, t)$
  2. For each other $(s, \theta)$, find the time shift that maximizes correlation:

$$\tau(s, \theta) = \arg\max_{\tau} \sum_t \text{ref}(t) \cdot \Phi(s, \theta, t - \tau)$$

  3. Shift each sub-band signal by its optimal lag $\tau$.

Step 6: Average Across Scales and Orientations

$$\hat{s}(t) = \sum_{s, \theta} \Phi(s, \theta, t - \tau(s, \theta))$$

This averaging acts as denoising — the vibration signal is coherent across sub-bands (adds constructively) while noise is incoherent (partially cancels).

Step 7: Normalize

$$\hat{s}_{\text{norm}}(t) = \frac{2 \cdot \hat{s}(t) - (\max + \min)}{\max - \min}$$

Maps the signal to $[-1, 1]$ range.

Step 8: Output Audio

Write as WAV file with sampling rate = video FPS.

Critical: If the video is 2200 fps, the audio is sampled at 2200 Hz. By the Nyquist theorem, this captures frequencies up to 1100 Hz — covering most speech fundamental frequencies and low musical tones.

1.6 Rolling Shutter Trick (Consumer Cameras)

High-speed cameras are expensive. But most consumer cameras have rolling shutter — the sensor reads rows sequentially, not all at once. Each row is exposed at a slightly different time.

For a 60 fps camera with 480 rows:

  • Each row is a separate temporal sample
  • Effective sampling rate: $60 \times 480 / 60 \approx 480$ Hz (8x boost)
  • Sufficient to capture speech fundamentals

The algorithm adapts by:

  1. Treating each row as a separate temporal sample
  2. Computing 1D transforms along rows instead of 2D pyramids
  3. Stitching the temporal information together

This allowed recovering intelligible speech from a standard 60 fps consumer camera.

1.7 Limitations of the Original

  1. Requires high-speed video for good quality (2000+ fps ideal; 60 fps with rolling shutter is limited)
  2. Object must have visible texture — smooth featureless surfaces give poor results
  3. Sound-to-noise ratio depends on object material, distance, and sound volume
  4. Computationally expensive — complex steerable pyramids are ~21x overcomplete
  5. Global averaging loses spatial information — all vibrations are mixed together

Part 2: Our Implementation (2D DTCWT)

2.1 What is the Dual-Tree Complex Wavelet Transform?

The DTCWT was developed by Nick Kingsbury (Cambridge, late 1990s) as an improvement over the standard Discrete Wavelet Transform (DWT).

The problem with standard DWT

  • Not shift-invariant: shifting input by 1 pixel completely changes the coefficients
  • Poor directional selectivity: only separates horizontal, vertical, diagonal — no fine orientations
  • Oscillating coefficients: makes phase extraction unreliable

How DTCWT works

Run two parallel DWT filter banks (two "trees"):

  • Tree $a$: uses one set of filters $\rightarrow$ produces real part
  • Tree $b$: uses a slightly different (quarter-sample shifted) set of filters $\rightarrow$ produces imaginary part

The filters are designed so that Tree $b$'s wavelet is approximately the Hilbert transform of Tree $a$'s wavelet. Combining them gives:

$$\psi_{\text{complex}}(x) = \psi_a(x) + i \cdot \psi_b(x)$$

This complex wavelet is approximately analytic (has energy only on one side of the frequency spectrum), which provides:

  • Approximate shift invariance (2x oversampling eliminates most aliasing)
  • Clean phase information (no oscillation artifacts)

2D DTCWT Specifically

For 2D images, the DTCWT produces 6 complex sub-bands per scale, oriented at approximately:

$$\pm 15°, \quad \pm 45°, \quad \pm 75°$$

This is fewer orientations than a typical steerable pyramid (which might use 8+), but:

  • Only ~4x overcomplete (vs ~21x for steerable pyramid)
  • Much faster to compute
  • Still provides good directional selectivity
  • Phase information is reliable for motion estimation
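
To make the structure concrete, the dtcwt package used by this repo's CPU path returns exactly this layout; the following prints the sub-band shapes for a 3-level transform of a 256x256 image (the random input is just a stand-in):

# Inspect the 2D DTCWT output structure with the dtcwt package
import numpy as np
import dtcwt

image = np.random.rand(256, 256)
pyramid = dtcwt.Transform2d().forward(image, nlevels=3)

for level, hp in enumerate(pyramid.highpasses):
    print(level, hp.shape, hp.dtype)   # (128, 128, 6), (64, 64, 6), (32, 32, 6), complex values
print(pyramid.lowpass.shape)           # coarse lowpass residual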

2.2 DTCWT vs Complex Steerable Pyramid

Property | Complex Steerable Pyramid | 2D DTCWT
--- | --- | ---
Shift invariant | Yes (exactly) | Approximately
Orientations per scale | Configurable (typically 8) | 6 (fixed)
Overcompleteness | ~21x (8 orientations) | ~4x
Computation speed | Slow (frequency domain) | Fast (filter banks)
Phase quality | Excellent | Good
Python library | pyrtools | dtcwt
Reconstruction | Perfect | Near-perfect

Trade-off: DTCWT is ~5x more computationally efficient at the cost of slightly fewer orientations and approximate (rather than exact) shift invariance. For the visual microphone application, this is a favorable trade-off — the phase information is still good enough to detect sub-pixel vibrations.

2.3 Our Algorithm — Mapped to Code

Here's how visualmic.py implements the pipeline.

Steps 1–3: Stream Video, ROI Crop, DTCWT, and Phase Extraction

Frames are streamed directly from the video file — each frame is read, transformed, and discarded immediately, so only one raw frame is in memory at a time. This enables processing of arbitrarily long videos without running out of memory. If an ROI is specified, each frame is cropped before the DTCWT decomposition, reducing computation and focusing on the vibrating object.

def extract_audio(cap, frame_count, nlevels, n_orient, ref_index, ref_orient, ref_level, ..., roi=None):
    transform = dtcwt.Transform2d()
    ref_conj = None
    phase_signals = []

    for fc in range(frame_count):
        ret, raw_frame = cap.read()
        if not ret or raw_frame is None:
            break
        gray = cv2.cvtColor(raw_frame, cv2.COLOR_BGR2GRAY)
        if roi is not None:
            rx, ry, rw, rh = roi
            gray = gray[ry:ry+rh, rx:rx+rw]
        dtcwt_frame = transform.forward(gray, nlevels=nlevels)

        if fc == ref_index:
            ref_conj = [np.conj(dtcwt_frame.highpasses[level]) for level in range(nlevels)]

        frame_phases = np.zeros((nlevels, n_orient))
        for level in range(nlevels):
            coeffs = dtcwt_frame.highpasses[level]
            amp = np.abs(coeffs)
            phase_diff = np.angle(coeffs * ref_conj[level])
            frame_phases[level, :] = np.sum(amp * amp * phase_diff, axis=(0, 1))
        phase_signals.append(frame_phases)

    phase_signals = np.array(phase_signals)  # shape: (frame_count, nlevels, n_orient)

Vectorized NumPy operations on entire 2D spatial slices:

Operation | Code | Corresponds to
--- | --- | ---
Extract amplitude $A$ | np.abs(coeffs) | Step 2 of original
Phase variation $\phi_v$ (wrapped to $[-\pi, \pi]$) | np.angle(coeffs * ref_conj[level]) | Step 3 of original
$A^2$-weighted accumulation | amp * amp * phase_diff | Step 4 of original
Spatial sum $\sum_{x,y}$ | np.sum(..., axis=(0, 1)) | Step 4 of original

The conjugate multiplication coeffs * conj(ref) computes the phase difference directly: angle(z * conj(w)) = angle(z) - angle(w), automatically wrapped to $[-\pi, \pi]$. This is more numerically stable than computing phases separately and subtracting.
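
A two-line illustration of the wrapping issue (not from the repo): if the reference phase sits near $+\pi$ and the current phase near $-\pi$, naive subtraction jumps by almost $2\pi$, while the conjugate form stays correctly wrapped.

import numpy as np

w = np.exp(1j * 3.1)                 # reference coefficient, phase near +pi
z = np.exp(1j * -3.1)                # current coefficient, phase near -pi

print(np.angle(z) - np.angle(w))     # -6.2, a spurious near-2*pi jump
print(np.angle(z * np.conj(w)))      # ~0.083 rad, the true small difference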

Result: phase_signals[fc, level, angle] $= \Phi(\text{level}, \text{angle}, fc)$ — one scalar per frame per sub-band.

Step 3.5: Temporal Bandpass Filtering (optional)

When -fl and/or -fh are specified, a 4th-order Butterworth filter is applied to each of the 18 phase signals before cross-correlation:

nyquist = fps / 2.0
sos = signal.butter(4, [freq_low / nyquist, freq_high_clamped / nyquist],
                    btype='bandpass', output='sos')
for i in range(nlevels):
    for j in range(n_orient):
        phase_signals[:, i, j] = signal.sosfiltfilt(sos, phase_signals[:, i, j])

  • sosfiltfilt applies the filter forward and backward (zero-phase), so no time delay is introduced
  • The filter rejects low-frequency drift (camera shake, thermal effects) and high-frequency noise
  • Upper cutoff is automatically clamped to 99% of Nyquist to avoid instability
  • Skipped if video has fewer than 13 frames (minimum required for filtfilt)
  • If only -fl is given, acts as highpass; if only -fh, acts as lowpass
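
The one-sided cases in the last bullet could be handled along these lines (a sketch of plausible filter design, not necessarily the exact branch in visualmic.py):

from scipy import signal

def design_filter(freq_low, freq_high, fps, order=4):
    """Bandpass when both cutoffs are given, otherwise highpass / lowpass only."""
    nyq = fps / 2.0
    if freq_low and freq_high:
        return signal.butter(order, [freq_low / nyq, min(freq_high, 0.99 * nyq) / nyq],
                             btype='bandpass', output='sos')
    if freq_low:
        return signal.butter(order, freq_low / nyq, btype='highpass', output='sos')
    return signal.butter(order, min(freq_high, 0.99 * nyq) / nyq, btype='lowpass', output='sos')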

Step 4: Temporal Alignment via Cross-Correlation

ref_vector = phase_signals[:, ref_level, ref_orient].reshape(-1)
for i in range(nlevels):
    for j in range(n_orient):
        shift_matrix[i, j] = find_best_shift(ref_vector, phase_signals[:, i, j].reshape(-1))

The find_best_shift function uses scipy.signal.correlate for $O(n \log n)$ cross-correlation:

def find_best_shift(a, b):
    correlation = signal.correlate(a, b, mode='full')
    return np.argmax(correlation) - (len(b) - 1)
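
A quick synthetic check of the lag convention (an illustration, not part of the test suite):

import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(500)
b = np.roll(a, -7)             # b is a advanced by 7 samples

print(find_best_shift(a, b))   # 7 -- rolling b forward by 7 (as in Step 5) realigns it with a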

Step 5: Sum Across Sub-bands with Temporal Shifts

sound_raw = np.zeros(frame_count)
for i in range(nlevels):
    for j in range(n_orient):
        sound_raw += np.roll(phase_signals[:, i, j], int(shift_matrix[i, j]))

Step 6: Normalize to $[-1, 1]$

p_min = np.min(sound_raw)
p_max = np.max(sound_raw)
if p_max == p_min:
    sound_data = np.zeros_like(sound_raw)  # silent output if no motion
else:
    sound_data = ((2 * sound_raw) - (p_min + p_max)) / (p_max - p_min)

Step 7: Output WAV

def save_wav(samples, output_name, sample_rate):
    waveform_integers = np.int16(samples * 32767)
    write(output_name, sample_rate, waveform_integers)

The sample_rate is set to the video's FPS, ensuring the output audio matches the temporal resolution of the input video.

2.4 Parameters Used

Parameter | Default | Meaning
--- | --- | ---
nlevels | 3 | Number of wavelet decomposition scales
n_orient | 6 | Number of orientations per scale (fixed by DTCWT)
ref_index | 0 | Reference frame index (first frame)
ref_level | 0 | Reference sub-band: finest scale
ref_orient | 0 | Reference sub-band: first orientation (~$+15°$)
biort | near_sym_b | Biorthogonal filter for level 1
qshift | qshift_b | Quarter-shift filter for levels 2+

2.5 What Each Scale Captures

With 3 levels of DTCWT decomposition:

Level | Spatial Frequency | What It Captures | Spatial Resolution
--- | --- | --- | ---
0 (finest) | High | Fine textures, edges, sharp details | $H/2 \times W/2$
1 (middle) | Medium | Medium-scale patterns | $H/4 \times W/4$
2 (coarsest) | Low | Broad structures, large features | $H/8 \times W/8$

The vibration signal is present across all scales (the whole surface moves), but the signal-to-noise ratio varies:

  • Fine scales: more spatial locations to average $\rightarrow$ better denoising
  • Coarse scales: fewer locations but stronger phase response to motion

Part 3: Literature Survey

Foundational Work

Year | Paper | Key Contribution
--- | --- | ---
2012 | Eulerian Video Magnification (SIGGRAPH) | Predecessor: amplifies color/intensity changes to visualize motion
2013 | Phase-Based Video Motion Processing (SIGGRAPH) | Established that phase of complex steerable pyramid coefficients = local motion
2014 | Riesz Pyramids (ICCP) | Compact 4x overcomplete pyramid for real-time phase-based processing
2014 | The Visual Microphone (SIGGRAPH) | Recovered sound from video using phase-based motion analysis
2016 | Visual Vibration Analysis (PhD Thesis, Abe Davis) | Extended to modal analysis, material properties, damping estimation

Follow-Up Research

Year | Work | Advance
--- | --- | ---
2018 | Local Visual Microphones | Local vibration aggregation (not global averaging), 100–1000x speedup, sound direction estimation
2022 | Effect of Video Resolution | Studies resolution impact on recovery quality; frame-wise denoising preprocessing
2023 | Event-Based Visual Microphone (CVPR Workshop) | Neuromorphic event cameras for cheap, efficient vibration capture
2024 | PSO-CNN Hybrid | Particle Swarm Optimization + CNN for enhanced sound restoration
2025 | Single-Pixel Visual Microphone (Optica) | Single-pixel imaging with spatial light modulator — no expensive high-speed camera needed

Alternative Implementations

Implementation | Technique | Language
--- | --- | ---
MIT Original | Complex Steerable Pyramid | MATLAB
dsforza96/visual-mic | Complex Steerable Pyramid (pyrtools) | Python
This repo (Visual-Mic) | 2D DTCWT (dtcwt) | Python

Future Work

  • Multiprocessing across frames: Frame processing is independent after the reference frame is computed. Reading frames remains sequential (VideoCapture limitation), but the DTCWT + phase extraction can be parallelized across CPU cores using batch processing with multiprocessing.Pool, giving ~Nx speedup on an N-core machine. A sketch follows this list.

  • Better post-processing / signal recovery: The current algorithm uses properly wrapped phase differences (np.angle(coeffs * conj(ref)), bounded to [-π, π]), which is mathematically correct but produces lower-amplitude signals for very small vibrations. Exploring better post-processing — such as phase unwrapping, adaptive Wiener filtering, or learned denoising — could recover signal strength without reintroducing the phase wrapping artifacts.
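
Picking up the multiprocessing idea from the first bullet, a minimal sketch of how per-frame work could be farmed out with multiprocessing.Pool (illustrative only; not implemented in visualmic.py):

import multiprocessing as mp
import numpy as np
import dtcwt

def frame_phases(args):
    """A^2-weighted phase signal of one grayscale frame against precomputed ref_conj."""
    gray, ref_conj, nlevels = args
    pyramid = dtcwt.Transform2d().forward(gray, nlevels=nlevels)
    out = np.zeros((nlevels, 6))
    for level in range(nlevels):
        c = pyramid.highpasses[level]
        out[level, :] = np.sum(np.abs(c) ** 2 * np.angle(c * ref_conj[level]), axis=(0, 1))
    return out

def phases_parallel(gray_frames, ref_conj, nlevels=3, workers=None):
    # Needs the usual `if __name__ == "__main__":` guard on spawn-based platforms.
    with mp.Pool(workers) as pool:
        return np.array(pool.map(frame_phases, [(g, ref_conj, nlevels) for g in gray_frames]))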


Development

Running Tests

All tests run inside Docker — no local Python dependencies needed:

# CPU: lint + unit tests (builds image automatically if not found)
./test.sh

# GPU: lint + unit tests (requires nvidia-container-toolkit)
./test.sh gpu

# Force rebuild before testing
./test.sh --build
./test.sh gpu --build

CPU tests (tests/test_visualmic.py) cover:

  • Utility functions (format_duration, find_best_shift, save_wav)
  • Phase signal postprocessing (cross-correlation, normalization, Butterworth filter)
  • VRAM estimation arithmetic
  • Full extract_audio pipeline on synthetic 256x256 video (shape, finiteness)
  • All CLI validation error paths

GPU tests (tests/test_visualmic_gpu.py) cover:

  • DTCWTForward shapes, finiteness, and custom filter selection
  • Full extract_audio_gpu pipeline on synthetic 256x256 video
  • All GPU tests skip automatically on systems without CUDA

Versioning

Version is tracked in a VERSION file at the project root. visualmic.py has __version__ baked into the source (updated at release time).

To cut a release:

  1. Update VERSION with the new version number
  2. Update __version__ in visualmic.py
  3. Update CHANGELOG.md — move items from [Unreleased] to [X.Y.Z] - YYYY-MM-DD
  4. Commit: Release vX.Y.Z
  5. Tag: git tag -a vX.Y.Z -m "Release vX.Y.Z"
  6. Push: git push && git push origin vX.Y.Z
  7. Rebuild Docker images: ./docker-build.sh && ./docker-build-gpu.sh

Project Structure

visualmic.py               # CLI tool (CPU + GPU paths)
Dockerfile                 # CPU Docker image (python:3.11-slim)
Dockerfile.gpu             # GPU Docker image (pytorch:2.1.2-cuda12.1)
docker-build.sh            # Build + tag CPU image
docker-build-gpu.sh        # Build + tag GPU image
test.sh                    # Run lint + tests (Docker, supports cpu/gpu mode)
requirements.txt           # CPU runtime dependencies
requirements-gpu.txt       # GPU runtime dependencies
requirements-dev.txt       # Dev dependencies (pytest, ruff)
tests/
  test_visualmic.py        # CPU unit tests
  test_visualmic_gpu.py    # GPU unit tests (CUDA-only, skip on CPU)
docs/design/               # Architecture decision records
  visualmic-hardening.md   # Hardening design doc
VERSION                    # Single source of truth for version
CHANGELOG.md               # Release history
CONTRIBUTING.md            # Contribution guidelines

References

  1. Davis, A., Rubinstein, M., Wadhwa, N., Mysore, G.J., Durand, F., & Freeman, W.T. (2014). The Visual Microphone: Passive Recovery of Sound from Video. ACM Transactions on Graphics (SIGGRAPH), 33(4). Paper PDF | Project Page

  2. Wadhwa, N., Rubinstein, M., Durand, F., & Freeman, W.T. (2013). Phase-Based Video Motion Processing. ACM Transactions on Graphics (SIGGRAPH). Project Page

  3. Wadhwa, N., Rubinstein, M., Durand, F., & Freeman, W.T. (2014). Riesz Pyramids for Fast Phase-Based Video Magnification. IEEE ICCP. Project Page

  4. Selesnick, I.W., Baraniuk, R.G., & Kingsbury, N.G. (2005). The Dual-Tree Complex Wavelet Transform. IEEE Signal Processing Magazine, 22(6), 123–151. Tutorial PDF

  5. Davis, A. (2016). Visual Vibration Analysis. PhD Thesis, MIT. Thesis PDF

  6. Shen, M. & Bhatt, S. (2018). Local Visual Microphones: Improved Sound Extraction from Silent Video. arXiv:1801.09436

  7. Niwa, T., Fushimi, T., Yamamoto, S., & Ochiai, Y. (2023). Live Demonstration: Event-based Visual Microphone. CVPR Workshop on Event-based Vision. Paper PDF

  8. dtcwt Python library. Documentation | GitHub

  9. MIT CSAIL Visual Microphone Dataset. Download


Follow Me

     

Show your support by starring the repository 🙂
