A Python implementation of the Visual Microphone algorithm, which recovers sound from high-speed video by analyzing sub-pixel surface vibrations. When sound hits an object, it causes tiny vibrations on the surface—far too small to see with the naked eye, but detectable in the phase of complex wavelet coefficients. This tool extracts those vibrations and reconstructs an audible signal, effectively turning everyday objects into microphones.
The original work by Davis et al. (MIT CSAIL, SIGGRAPH 2014) used Complex Steerable Pyramids for the video decomposition. This project uses 2D Dual-Tree Complex Wavelet Transform (DTCWT) instead, which is ~5x more computationally efficient while still providing reliable phase information for motion estimation. We test against the same high-speed videos provided by MIT CSAIL.
The sample videos can be downloaded from here. For example, Chips1-2200Hz-Mary_Had-input.avi is a high-speed video of a bag of chips vibrating to "Mary Had A Little Lamb" (704x704, 22,859 frames captured at 2200 fps, ~14 GB). Note: the AVI container reports ~30 fps, but the actual capture rate is 2200 Hz — use --fps 2200 when running visualmic.py to get the correct audio sample rate.
- Setup
- Usage
- GPU Acceleration
- Part 1: The Original Work (Davis et al., SIGGRAPH 2014)
- Part 2: Our Implementation (2D DTCWT)
- Part 3: Literature Survey
- Future Work
- Development
- References
```shell
git clone https://github.com/joeljose/Visual-Mic.git
cd Visual-Mic
pip install -r requirements.txt
python visualmic.py -i testvid.avi -o recovered_audio.wav
```

Requirements: Python 3.8+
```shell
# Build
./docker-build.sh

# Run
docker run --rm -v /path/to/videos:/data \
    visual-mic:latest \
    -i /data/testvid.avi -o /data/sound.wav
```

Requires `nvidia-container-toolkit`.

```shell
# Build
./docker-build-gpu.sh

# Run
docker run --rm --gpus all -v /path/to/videos:/data \
    visual-mic-gpu:latest \
    --gpu -i /data/Chips1-2200Hz-Mary_Had-input.avi \
    -o /data/sound.wav --fps 2200 --batch-size 32
```

The `--batch-size` flag controls how many frames are processed per GPU batch (default: 16). Larger batches are faster but use more GPU memory. At 704x704, each frame uses ~10 MB of GPU memory, so `--batch-size 32` needs ~660 MB including overhead.
Note: GPU mode produces very similar but not bit-identical output compared to CPU mode, due to float32 vs float64 precision differences and different DTCWT implementations (pytorch_wavelets vs dtcwt).
```shell
# Basic usage
python visualmic.py -i testvid.avi -o recovered_audio.wav

# With temporal bandpass filter
python visualmic.py -i testvid.avi -fl 80 -fh 1000

# With ROI (focus on vibrating object)
python visualmic.py -i testvid.avi --roi 100,50,200,150

# Override frame rate for high-speed video
python visualmic.py -i Chips1-2200Hz-Mary_Had-input.avi --fps 2200

# GPU acceleration
python visualmic.py -i testvid.avi --gpu --batch-size 32

# Custom wavelet filters
python visualmic.py -i testvid.avi --biort near_sym_a --qshift qshift_a
```

| Flag | Default | Description |
|---|---|---|
| `-i` / `--input` | (required) | Input video path |
| `-o` / `--output` | `sound.wav` | Output audio path |
| `-fl` / `--freq-low` | — | Lower cutoff frequency (Hz) for temporal bandpass filter |
| `-fh` / `--freq-high` | — | Upper cutoff frequency (Hz) for temporal bandpass filter |
| `--fps` | — | Override video frame rate (Hz) for audio sample rate |
| `--roi` | — | Region of interest as `x,y,w,h` |
| `--gpu` | off | Use GPU-accelerated DTCWT (requires CUDA + pytorch_wavelets) |
| `--batch-size` | 16 | Frames per GPU batch (GPU mode only) |
| `--nlevels` | 3 | Number of DTCWT decomposition levels |
| `--biort` | `near_sym_b` | Biorthogonal wavelet filter for DTCWT level 1 |
| `--qshift` | `qshift_b` | Quarter-shift wavelet filter for DTCWT levels 2+ |
| `--version` | — | Show program version and exit |
Available wavelet filters:

- `--biort`: `antonini`, `legall`, `near_sym_a`, `near_sym_b`
- `--qshift`: `qshift_06`, `qshift_a`, `qshift_b`, `qshift_c`, `qshift_d`
When -fl and/or -fh are specified, a Butterworth filter is applied to the phase signals before audio reconstruction, rejecting low-frequency drift and high-frequency noise.
When --fps is specified, the given value is used as the audio sample rate instead of the frame rate reported by the video container. This is necessary for high-speed camera footage where the container frame rate does not reflect the actual capture rate.
When --roi is specified, each frame is cropped to the given rectangle before the DTCWT decomposition. This reduces computation and can improve SNR by focusing on the vibrating object.
- Start with default settings and adjust from there.
- Use `--roi` to focus on the vibrating object — improves SNR and reduces computation.
- For MIT CSAIL videos, always use `--fps 2200` (the container reports ~30 fps incorrectly).
- Use `--gpu` for large videos — the DTCWT forward pass is the main bottleneck.
- If GPU runs out of memory, reduce `--batch-size`.
The --gpu flag enables GPU-accelerated processing via PyTorch and pytorch_wavelets. The GPU path replaces the CPU DTCWT forward transform with a CUDA-accelerated batched equivalent while keeping the same algorithmic pipeline. Temporal postprocessing (bandpass filtering, cross-correlation, sub-band summation) remains on CPU/NumPy since it operates on the small phase signal array, not full wavelet coefficients.
The GPU path processes frames in configurable batches rather than one at a time:

- Batch accumulation: Grayscale frames are collected into batches of `--batch-size` frames (default: 16)
- GPU transfer: The batch is stacked into a `(B, 1, H, W)` float32 tensor and sent to GPU
- Batched forward DTCWT: `pytorch_wavelets.DTCWTForward` processes all frames in the batch simultaneously, producing `Yh[level]` with shape `(B, 1, 6, H_l, W_l, 2)` where the last dimension is real/imaginary
- Phase extraction: Performed on-GPU for the entire batch (see below)
- Transfer back: Only the small phase signal array `(B, nlevels, 6)` is transferred to CPU — the full wavelet coefficients are discarded
- Memory cleanup: GPU tensors are explicitly deleted (`del batch_tensor, Yl, Yh`) after each batch
This streaming architecture means GPU memory usage is proportional to batch_size, not frame_count — enabling processing of arbitrarily long videos.
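The streaming idea can be sketched independently of the GPU code. A minimal generator (a hypothetical helper, not from `visualmic.py`) materializes only one batch of raw frames at a time:

```python
import numpy as np

def batched_frames(frame_iter, batch_size):
    """Group a frame iterator into (B, H, W) batches.

    Hypothetical helper (not from visualmic.py) illustrating the
    streaming idea: only one batch of raw frames is materialized
    at a time, so memory is O(batch_size), not O(frame_count).
    """
    batch = []
    for frame in frame_iter:
        batch.append(frame)
        if len(batch) == batch_size:
            yield np.stack(batch)
            batch = []
    if batch:                      # final partial batch
        yield np.stack(batch)

frames = (np.zeros((8, 8), dtype=np.float32) for _ in range(10))
print([b.shape for b in batched_frames(frames, 4)])
# [(4, 8, 8), (4, 8, 8), (2, 8, 8)]
```

The last batch may be smaller than `batch_size`; the real pipeline handles this the same way, since the DTCWT forward pass accepts any batch dimension.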
The CPU path uses NumPy's complex number support (`np.angle(coeffs * ref_conj)`). The GPU path must handle complex arithmetic manually because pytorch_wavelets represents coefficients as real/imaginary pairs in the last dimension:

```
Yh[level] shape: (B, 1, 6, H, W, 2)
                                 └── [0]=real, [1]=imag
```

Conjugate multiplication (phase difference from reference):

```python
# (c + id)(a - ib) = (ca + db) + i(da - cb)
prod_real = c_real * r_real + c_imag * r_imag
prod_imag = c_imag * r_real - c_real * r_imag
```

Phase and amplitude-squared weighting (vectorized over the entire batch):

```python
phase_diff = torch.atan2(prod_imag, prod_real)
amp_sq = c_real * c_real + c_imag * c_imag
weighted = (amp_sq * phase_diff).sum(dim=(-2, -1))  # sum over H, W
```

This produces the same per-frame `(nlevels, 6)` phase signals as the CPU path.
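As a sanity check, the split real/imaginary arithmetic should agree with the complex conjugate multiply used on the CPU path. A small NumPy sketch with random values standing in for wavelet coefficients:

```python
import numpy as np

# Sanity check: the split real/imaginary arithmetic used on the GPU
# matches NumPy's complex conjugate-multiply on the CPU path.
rng = np.random.default_rng(0)
z = rng.normal(size=8) + 1j * rng.normal(size=8)   # "current" coefficients
w = rng.normal(size=8) + 1j * rng.normal(size=8)   # "reference" coefficients

cpu = np.angle(z * np.conj(w))                     # CPU path

c_real, c_imag = z.real, z.imag                    # GPU-style path:
r_real, r_imag = w.real, w.imag                    # (c + id)(a - ib)
prod_real = c_real * r_real + c_imag * r_imag
prod_imag = c_imag * r_real - c_real * r_imag
gpu = np.arctan2(prod_imag, prod_real)

print(np.allclose(cpu, gpu))  # True
```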
Reference frame handling: The reference frame's coefficients are extracted when the batch containing ref_index is processed, then retained as a small GPU tensor for subsequent batches. Frames before the reference produce zero phase signals.
Pre-flight VRAM estimation: Before processing, estimate_vram() calculates peak VRAM usage based on batch size and frame dimensions. The DTCWT forward transform requires ~15x the input frame size in working memory (filter banks, intermediate convolutions). If estimated usage exceeds 70% of available VRAM, a warning is printed with suggestions to reduce --batch-size, use --roi, or switch to CPU mode.
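The estimate itself is simple arithmetic. A minimal sketch assuming the ~15x workspace factor mentioned above (the function name, defaults, and exact formula are hypothetical, not the actual `estimate_vram()`):

```python
def estimate_vram_mb(height, width, batch_size,
                     workspace_factor=15, bytes_per_px=4):
    """Rough pre-flight VRAM estimate in MB.

    Hypothetical sketch of the arithmetic (not the actual
    estimate_vram() from visualmic.py): one float32 frame is
    H * W * 4 bytes, and the DTCWT forward pass needs roughly
    workspace_factor times that per frame in the batch.
    """
    frame_mb = height * width * bytes_per_px / (1024 ** 2)
    return batch_size * frame_mb * workspace_factor

print(round(estimate_vram_mb(704, 704, 32)))
```

Comparing such an estimate against 70% of the free VRAM reported by the driver gives the warning threshold described above.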
OOM handling: If a CUDA out-of-memory error occurs during processing, the tool catches it and exits with an actionable error message rather than a raw PyTorch traceback.
CPU/GPU output differences: The GPU path uses float32 (PyTorch/CUDA standard) while the CPU path uses float64. Combined with different DTCWT implementations (pytorch_wavelets vs dtcwt), outputs are similar but not bit-identical. Both produce valid audio recovery.
Benchmarked on Chips2-2200Hz-Mary_MIDI-input.avi (704x400, 38,083 frames, 2200 fps) with an RTX 4050 (6 GB VRAM):
| Configuration | Time | Notes |
|---|---|---|
| GPU, default settings | 3m 50s | `--batch-size 32`, `near_sym_b`/`qshift_b` |
| GPU, `--nlevels 2` | 3m 49s | Fewer decomposition levels |
| GPU, old filters | 3m 30s | `--biort near_sym_a --qshift qshift_a` |
| GPU, with ROI | 2m 46s | `--roi 100,50,400,300` (smaller region) |
Hardware requirements (GPU path):

- NVIDIA GPU with CUDA 12.1+ support
- Minimum ~1 GB VRAM for typical videos (scales with `--batch-size` and resolution)
- `nvidia-container-toolkit` for Docker GPU support
**Paper:** "The Visual Microphone: Passive Recovery of Sound from Video"
**Authors:** Abe Davis, Michael Rubinstein, Neal Wadhwa, Gautham J. Mysore, Frédo Durand, William T. Freeman
**Venue:** ACM Transactions on Graphics (SIGGRAPH 2014), Vol. 33, No. 4
**Institutions:** MIT CSAIL, Stanford, Adobe Research
When sound travels through air, it creates pressure waves. When these waves hit an object's surface, they cause tiny vibrations — displacements on the order of micrometers or less. These vibrations are far too small to see with the naked eye, but a high-speed camera recording thousands of frames per second can capture them as subtle pixel-level changes.
Key insight: If we can measure those sub-pixel surface displacements over time, we effectively have a recording of the sound pressure wave — we've turned the object into a microphone.
Example: Playing "Mary Had A Little Lamb" near a bag of chips causes the bag's surface to vibrate at the frequencies of the music. A high-speed camera (2000–6000 fps) pointed at the bag captures these vibrations as tiny frame-to-frame changes.
You might think: "Just compute optical flow between frames and track the motion." The problem:

- The motions are sub-pixel — typically $\frac{1}{100}$ to $\frac{1}{1000}$ of a pixel. Standard optical flow fails at this scale.
- Noise dominates — sensor noise, quantization noise, and lighting fluctuations are all larger than the actual vibration signal.
- You need temporal precision — to recover audio at meaningful frequencies, you need to track motion at every single frame with high temporal fidelity.
Solution: Instead of tracking pixels in the spatial domain, work in the frequency domain using the phase of complex wavelet/pyramid coefficients. Phase is far more sensitive to small motions than amplitude.
Consider a 1D signal $f(x)$ shifted by a small displacement $\delta$, giving $f(x - \delta)$.

In the Fourier domain, a spatial shift becomes a phase shift:

$$\mathcal{F}\{f(x - \delta)\}(\omega) = F(\omega)\, e^{-i\omega\delta}$$

So if you decompose an image into frequency bands and track how the phase of each band changes over time, you're directly measuring local displacement at that spatial frequency.

For a band-pass filtered signal at spatial frequency $\omega$, the local phase change is approximately

$$\Delta\phi(t) \approx \omega\, \delta(t)$$

where $\delta(t)$ is the local displacement at time $t$.
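The shift theorem is easy to verify numerically. A small NumPy sketch (a temporal shift of a 1D tone, directly analogous to the spatial case): shifting by $\delta$ samples rotates the phase of the tone's Fourier bin by $-\omega\delta$ while leaving the magnitude unchanged.

```python
import numpy as np

# Verify the shift theorem numerically: shifting a tone by delta
# samples rotates the phase of its Fourier bin by -omega * delta
# and leaves the magnitude unchanged.
N, delta = 256, 3
x = np.arange(N)
k = 5                                      # tone sits in bin 5
f = np.cos(2 * np.pi * k * x / N)
f_shift = np.cos(2 * np.pi * k * (x - delta) / N)

F = np.fft.fft(f)
Fs = np.fft.fft(f_shift)
omega = 2 * np.pi * k / N
phase_change = np.angle(Fs[k]) - np.angle(F[k])
print(np.isclose(phase_change, -omega * delta))   # True
print(np.isclose(abs(F[k]), abs(Fs[k])))          # True
```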
| Property | Amplitude $A$ | Phase $\phi$ |
|---|---|---|
| Physical meaning | "How much texture is here" | "Where exactly is this texture positioned" |
| Response to small motion | Relatively stable | Shifts linearly with displacement |
| Sub-pixel sensitivity | Poor | Excellent — detects fractions of a pixel |
The original paper uses a Complex Steerable Pyramid to decompose each video frame.
A multi-scale, multi-orientation filter bank that decomposes an image into:

- Multiple scales (frequency bands): coarse $\rightarrow$ fine detail
- Multiple orientations at each scale: e.g., $0°, 30°, 60°, 90°, 120°, 150°$ (for 6 orientations)
- A lowpass residual (the blurry base image)
- A highpass residual (the finest details)
Each sub-band produces complex-valued coefficients. At each spatial location $(x, y)$, the coefficient can be written as

$$C(x, y) = A\, e^{i\phi}$$

where:

- $A$ = amplitude (how strong the texture is at this location/scale/orientation)
- $\phi$ = phase (the precise position of the texture pattern)
The filters can be analytically rotated to any orientation without recomputing — this gives fine directional control and avoids aliasing artifacts.
- Translation equivariant: shifting the input shifts the coefficients predictably (phase changes linearly)
- Overcomplete (~21x for 8 orientations): more coefficients than pixels $\rightarrow$ redundancy helps with noise
- Shift-invariant: no downsampling artifacts that would corrupt phase measurements
- High-speed video $V(x, y, t)$ with $N$ frames at $F$ fps
- Grayscale frames (color not needed for vibration)
For each frame $t$, compute the complex steerable pyramid decomposition.

This gives complex coefficients at every location $(x, y)$, scale $s$, and orientation $\theta$.
For each coefficient, extract the phase $\phi(x, y, s, \theta, t)$.

Choose a reference frame $t_0$ (typically the first frame) and compute the phase difference:

$$\Delta\phi(x, y, s, \theta, t) = \phi(x, y, s, \theta, t) - \phi(x, y, s, \theta, t_0)$$

This phase difference is proportional to how much the texture at location $(x, y)$ has shifted between frames.

Why subtract the reference? The absolute phase values encode the texture pattern itself (which we don't care about). By subtracting the reference, we isolate the change — which is the vibration.
For each scale $s$ and orientation $\theta$, collapse the spatial dimensions with an amplitude-squared weighted sum:

$$\Phi(s, \theta, t) = \sum_{x, y} A(x, y, s, \theta)^2 \,\Delta\phi(x, y, s, \theta, t)$$

Why weight by $A^2$?

- Regions with strong texture (high amplitude) give reliable phase measurements
- Regions with weak/no texture (low amplitude) have noisy/random phase — we want to suppress these
- $A^2$ weighting is effectively a "reliability-weighted average" that emphasizes trustworthy measurements
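A toy NumPy experiment (synthetic amplitudes and phases, not real wavelet coefficients) shows why the weighting helps: pixels with strong amplitude carry the true phase shift, while weak-texture pixels contribute essentially random phase.

```python
import numpy as np

# Toy experiment: pixels with strong amplitude carry the true phase
# shift; weak-texture pixels have essentially random phase. The
# A^2-weighted average recovers the true shift almost exactly.
rng = np.random.default_rng(1)
true_shift = 0.01                                    # radians
amp = np.where(rng.random(10_000) < 0.2, 5.0, 0.05)  # 20% textured pixels
phase = np.where(amp > 1.0, true_shift,
                 rng.uniform(-np.pi, np.pi, 10_000))

weighted = (amp**2 * phase).sum() / (amp**2).sum()
print(abs(weighted - true_shift) < 1e-3)  # True
```

Here the weighted average is normalized for readability; the pipeline keeps the raw weighted sum, since the final signal is normalized at the end anyway.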
This produces one 1D time signal $\Phi(s, \theta, t)$ per $(s, \theta)$ sub-band.
Different scales and orientations may have phase offsets relative to each other. Align them using cross-correlation:

- Pick a reference sub-band (e.g., scale 0, orientation 0): $\text{ref} = \Phi(0, 0, t)$
- For each other $(s, \theta)$, find the time shift $\tau$ that maximizes correlation:

$$\tau(s, \theta) = \arg\max_{\tau} \sum_t \Phi(0, 0, t)\, \Phi(s, \theta, t - \tau)$$

- Shift each sub-band signal by its optimal lag $\tau$, then sum the aligned signals.
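The lag-recovery idea can be illustrated with `scipy.signal.correlate` on a synthetic pulse (not real phase data): the argmax of the full cross-correlation yields a lag such that `np.roll(b, lag)` re-aligns the two signals.

```python
import numpy as np
from scipy import signal

# Recover a known lag via the argmax of the full cross-correlation:
# np.roll(b, lag) then re-aligns b with a.
n = 500
a = np.exp(-0.5 * ((np.arange(n) - 250) / 5.0) ** 2)  # narrow pulse
b = np.roll(a, -4)                                    # b leads a by 4 samples

corr = signal.correlate(a, b, mode='full')
lag = np.argmax(corr) - (len(b) - 1)
print(lag)  # 4
```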
This averaging acts as denoising — the vibration signal is coherent across sub-bands (adds constructively) while noise is incoherent (partially cancels).
Normalization maps the signal to $[-1, 1]$.
Write as WAV file with sampling rate = video FPS.
Critical: If the video is 2200 fps, the audio is sampled at 2200 Hz. By the Nyquist theorem, this captures frequencies up to 1100 Hz — covering most speech fundamental frequencies and low musical tones.
High-speed cameras are expensive. But most consumer cameras have rolling shutter — the sensor reads rows sequentially, not all at once. Each row is exposed at a slightly different time.
For a 60 fps camera with 480 rows:
- Each row is a separate temporal sample
- Effective sampling rate: roughly $480$ Hz (an $8\times$ boost over the 60 Hz frame rate)
- Sufficient to capture speech fundamentals
The algorithm adapts by:
- Treating each row as a separate temporal sample
- Computing 1D transforms along rows instead of 2D pyramids
- Stitching the temporal information together
This allowed recovering intelligible speech from a standard 60 fps consumer camera.
- Requires high-speed video for good quality (2000+ fps ideal; 60 fps with rolling shutter is limited)
- Object must have visible texture — smooth featureless surfaces give poor results
- Signal-to-noise ratio depends on object material, distance, and sound volume
- Computationally expensive — complex steerable pyramids are ~21x overcomplete
- Global averaging loses spatial information — all vibrations are mixed together
The DTCWT was developed by Nick Kingsbury (Cambridge, late 1990s) as an improvement over the standard Discrete Wavelet Transform (DWT).
- Not shift-invariant: shifting input by 1 pixel completely changes the coefficients
- Poor directional selectivity: only separates horizontal, vertical, diagonal — no fine orientations
- Oscillating coefficients: makes phase extraction unreliable
Run two parallel DWT filter banks (two "trees"):

- Tree $a$: uses one set of filters $\rightarrow$ produces the real part
- Tree $b$: uses a slightly different (quarter-sample shifted) set of filters $\rightarrow$ produces the imaginary part
The filters are designed so that Tree $b$'s wavelet is approximately the Hilbert transform of Tree $a$'s wavelet.
This complex wavelet is approximately analytic (has energy only on one side of the frequency spectrum), which provides:
- Approximate shift invariance (2x oversampling eliminates most aliasing)
- Clean phase information (no oscillation artifacts)
For 2D images, the DTCWT produces 6 complex sub-bands per scale, oriented at approximately $\pm 15°$, $\pm 45°$, and $\pm 75°$.
This is fewer orientations than a typical steerable pyramid (which might use 8+), but:
- Only ~4x overcomplete (vs ~21x for steerable pyramid)
- Much faster to compute
- Still provides good directional selectivity
- Phase information is reliable for motion estimation
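The "approximately analytic" property can be illustrated with `scipy.signal.hilbert`, which constructs an exactly analytic signal; the DTCWT only approximates this, but the one-sided spectrum is the same idea (toy 1D signal, not actual DTCWT output).

```python
import numpy as np
from scipy import signal

# An analytic signal has a one-sided spectrum. scipy.signal.hilbert
# builds x + i * Hilbert{x}, the exact version of the property the
# DTCWT's two trees approximate.
n = 256
env = signal.windows.gaussian(n, std=16)
x = env * np.cos(2 * np.pi * 0.1 * np.arange(n))

z = signal.hilbert(x)                # analytic signal
spec = np.abs(np.fft.fft(z))
pos = spec[1:n // 2].sum()           # positive-frequency half
neg = spec[n // 2 + 1:].sum()        # negative-frequency half
print(neg < 0.01 * pos)              # True: energy is one-sided
```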
| Property | Complex Steerable Pyramid | 2D DTCWT |
|---|---|---|
| Shift invariant | Yes (exactly) | Approximately |
| Orientations per scale | Configurable (typically 8) | 6 (fixed) |
| Overcompleteness | ~21x (8 orientations) | ~4x |
| Computation speed | Slow (frequency domain) | Fast (filter banks) |
| Phase quality | Excellent | Good |
| Python library | `pyrtools` | `dtcwt` |
| Reconstruction | Perfect | Near-perfect |
Trade-off: DTCWT is ~5x more computationally efficient at the cost of slightly fewer orientations and approximate (rather than exact) shift invariance. For the visual microphone application, this is a favorable trade-off — the phase information is still good enough to detect sub-pixel vibrations.
Here's how visualmic.py implements the pipeline, step by step.
Frames are streamed directly from the video file — each frame is read, transformed, and discarded immediately, so only one raw frame is in memory at a time. This enables processing of arbitrarily long videos without running out of memory. If an ROI is specified, each frame is cropped before the DTCWT decomposition, reducing computation and focusing on the vibrating object.
```python
def extract_audio(cap, frame_count, nlevels, n_orient, ref_index, ref_orient, ref_level, ..., roi=None):
    transform = dtcwt.Transform2d()
    ref_conj = None
    phase_signals = []

    for fc in range(frame_count):
        ret, raw_frame = cap.read()
        if not ret or raw_frame is None:
            break
        gray = cv2.cvtColor(raw_frame, cv2.COLOR_BGR2GRAY)
        if roi is not None:
            rx, ry, rw, rh = roi
            gray = gray[ry:ry+rh, rx:rx+rw]

        dtcwt_frame = transform.forward(gray, nlevels=nlevels)
        if fc == ref_index:
            ref_conj = [np.conj(dtcwt_frame.highpasses[level]) for level in range(nlevels)]

        frame_phases = np.zeros((nlevels, n_orient))
        for level in range(nlevels):
            coeffs = dtcwt_frame.highpasses[level]
            amp = np.abs(coeffs)
            phase_diff = np.angle(coeffs * ref_conj[level])
            frame_phases[level, :] = np.sum(amp * amp * phase_diff, axis=(0, 1))
        phase_signals.append(frame_phases)

    phase_signals = np.array(phase_signals)  # shape: (frame_count, nlevels, n_orient)
```

Vectorized NumPy operations on entire 2D spatial slices:
| Operation | Code | Corresponds to |
|---|---|---|
| Extract amplitude | `np.abs(coeffs)` | Step 2 of original |
| Phase variation | `np.angle(coeffs * ref_conj[level])` | Step 3 of original |
| $A^2$ weighting | `amp * amp * phase_diff` | Step 4 of original |
| Spatial sum | `np.sum(..., axis=(0, 1))` | Step 4 of original |
The conjugate multiplication `coeffs * conj(ref)` computes the phase difference directly: `angle(z * conj(w)) = angle(z) - angle(w)`, automatically wrapped to $[-\pi, \pi]$.
Result: `phase_signals[fc, level, angle]` holds one weighted phase value per frame, scale, and orientation.
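A two-line NumPy example shows why the conjugate multiply matters: naive angle subtraction can jump by $2\pi$, while `angle(z * conj(w))` stays wrapped.

```python
import numpy as np

# angle(z * conj(w)) returns the phase difference already wrapped to
# (-pi, pi]; naive subtraction of angles can jump by 2*pi.
w = np.exp(1j * 3.0)      # reference phase:  3.0 rad
z = np.exp(1j * -3.0)     # current phase:   -3.0 rad

naive = np.angle(z) - np.angle(w)      # -6.0: spurious jump
wrapped = np.angle(z * np.conj(w))     # ~0.283: true small step
print(round(naive, 3), round(wrapped, 3))  # -6.0 0.283
```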
When -fl and/or -fh are specified, a 4th-order Butterworth filter is applied to each of the 18 phase signals before cross-correlation:
```python
nyquist = fps / 2.0
sos = signal.butter(4, [freq_low / nyquist, freq_high_clamped / nyquist],
                    btype='bandpass', output='sos')
for i in range(nlevels):
    for j in range(n_orient):
        phase_signals[:, i, j] = signal.sosfiltfilt(sos, phase_signals[:, i, j])
```

- `sosfiltfilt` applies the filter forward and backward (zero-phase), so no time delay is introduced
- The filter rejects low-frequency drift (camera shake, thermal effects) and high-frequency noise
- Upper cutoff is automatically clamped to 99% of Nyquist to avoid instability
- Skipped if video has fewer than 13 frames (minimum required for `filtfilt`)
- If only `-fl` is given, acts as highpass; if only `-fh`, acts as lowpass
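A self-contained sketch of the same filter recipe on a synthetic signal (assuming fps = 2200 and 80/1000 Hz cutoffs, as in the usage examples): a 5 Hz drift is removed while a 200 Hz tone passes.

```python
import numpy as np
from scipy import signal

# Synthetic "phase signal": a 200 Hz tone (the vibration we want)
# plus a large 5 Hz drift (camera/thermal drift).
fps = 2200.0
t = np.arange(2200) / fps                     # one second of signal
x = np.sin(2 * np.pi * 200 * t) + 3 * np.sin(2 * np.pi * 5 * t)

nyquist = fps / 2.0
sos = signal.butter(4, [80 / nyquist, 1000 / nyquist],
                    btype='bandpass', output='sos')
y = signal.sosfiltfilt(sos, x)                # zero-phase bandpass

# With 1 s of signal, FFT bin k sits at k Hz: bin 5 = drift, bin 200 = tone
X, Y = np.abs(np.fft.rfft(x)), np.abs(np.fft.rfft(y))
print(Y[5] < 0.05 * X[5], Y[200] > 0.8 * X[200])  # True True
```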
Each sub-band signal is aligned to a reference sub-band via cross-correlation:

```python
ref_vector = phase_signals[:, ref_level, ref_orient].reshape(-1)
for i in range(nlevels):
    for j in range(n_orient):
        shift_matrix[i, j] = find_best_shift(ref_vector, phase_signals[:, i, j].reshape(-1))
```

The `find_best_shift` function uses `scipy.signal.correlate` to find the lag that maximizes the full cross-correlation:

```python
def find_best_shift(a, b):
    correlation = signal.correlate(a, b, mode='full')
    return np.argmax(correlation) - (len(b) - 1)
```

The aligned sub-band signals are then summed:

```python
sound_raw = np.zeros(frame_count)
for i in range(nlevels):
    for j in range(n_orient):
        sound_raw += np.roll(phase_signals[:, i, j], int(shift_matrix[i, j]))
```

The summed signal is normalized to $[-1, 1]$:

```python
p_min = np.min(sound_raw)
p_max = np.max(sound_raw)
if p_max == p_min:
    sound_data = np.zeros_like(sound_raw)  # silent output if no motion
else:
    sound_data = ((2 * sound_raw) - (p_min + p_max)) / (p_max - p_min)
```

Finally, the audio is written as 16-bit PCM:

```python
def save_wav(samples, output_name, sample_rate):
    waveform_integers = np.int16(samples * 32767)
    write(output_name, sample_rate, waveform_integers)
```

The `sample_rate` is set to the video's FPS, ensuring the output audio matches the temporal resolution of the input video.
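The int16 scaling maps normalized samples onto the full 16-bit range; a quick check:

```python
import numpy as np

# Normalized float samples in [-1, 1] map onto the full int16 range
# (the cast truncates toward zero).
samples = np.array([-1.0, 0.0, 0.5, 1.0])
ints = np.int16(samples * 32767)
print(ints.tolist())  # [-32767, 0, 16383, 32767]
```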
| Parameter | Default | Meaning |
|---|---|---|
| `nlevels` | 3 | Number of wavelet decomposition scales |
| `n_orient` | 6 | Number of orientations per scale (fixed by DTCWT) |
| `ref_index` | 0 | Reference frame index (first frame) |
| `ref_level` | 0 | Reference sub-band: finest scale |
| `ref_orient` | 0 | Reference sub-band: first orientation (~$+15°$) |
| `biort` | `near_sym_b` | Biorthogonal filter for level 1 |
| `qshift` | `qshift_b` | Quarter-shift filter for levels 2+ |
With 3 levels of DTCWT decomposition:
| Level | Spatial Frequency | What It Captures | Spatial Resolution |
|---|---|---|---|
| 0 (finest) | High | Fine textures, edges, sharp details | $H/2 \times W/2$ |
| 1 (middle) | Medium | Medium-scale patterns | $H/4 \times W/4$ |
| 2 (coarsest) | Low | Broad structures, large features | $H/8 \times W/8$ |
The vibration signal is present across all scales (the whole surface moves), but the signal-to-noise ratio varies:

- Fine scales: more spatial locations to average $\rightarrow$ better denoising
- Coarse scales: fewer locations but stronger phase response to motion
| Year | Paper | Key Contribution |
|---|---|---|
| 2012 | Eulerian Video Magnification (SIGGRAPH) | Predecessor: amplifies color/intensity changes to visualize motion |
| 2013 | Phase-Based Video Motion Processing (SIGGRAPH) | Established that phase of complex steerable pyramid coefficients = local motion |
| 2014 | Riesz Pyramids (ICCP) | Compact 4x overcomplete pyramid for real-time phase-based processing |
| 2014 | The Visual Microphone (SIGGRAPH) | Recovered sound from video using phase-based motion analysis |
| 2016 | Visual Vibration Analysis (PhD Thesis, Abe Davis) | Extended to modal analysis, material properties, damping estimation |
| Year | Work | Advance |
|---|---|---|
| 2018 | Local Visual Microphones | Local vibration aggregation (not global averaging), 100–1000x speedup, sound direction estimation |
| 2022 | Effect of Video Resolution | Studies resolution impact on recovery quality; frame-wise denoising preprocessing |
| 2023 | Event-Based Visual Microphone (CVPR Workshop) | Neuromorphic event cameras for cheap, efficient vibration capture |
| 2024 | PSO-CNN Hybrid | Particle Swarm Optimization + CNN for enhanced sound restoration |
| 2025 | Single-Pixel Visual Microphone (Optica) | Single-pixel imaging with spatial light modulator — no expensive high-speed camera needed |
| Implementation | Technique | Language |
|---|---|---|
| MIT Original | Complex Steerable Pyramid | MATLAB |
| dsforza96/visual-mic | Complex Steerable Pyramid (`pyrtools`) | Python |
| This repo (Visual-Mic) | 2D DTCWT (`dtcwt`) | Python |
- Multiprocessing across frames: Frame processing is independent after the reference frame is computed. Reading frames remains sequential (VideoCapture limitation), but the DTCWT + phase extraction can be parallelized across CPU cores using batch processing with `multiprocessing.Pool`, giving ~Nx speedup on an N-core machine.
- Better post-processing / signal recovery: The current algorithm uses properly wrapped phase differences (`np.angle(coeffs * conj(ref))`, bounded to $[-\pi, \pi]$), which is mathematically correct but produces lower-amplitude signals for very small vibrations. Exploring better post-processing — such as phase unwrapping, adaptive Wiener filtering, or learned denoising — could recover signal strength without reintroducing phase wrapping artifacts.
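Phase unwrapping, the first idea above, is already available as `np.unwrap`. A toy example of recovering a steadily growing phase from its wrapped version:

```python
import numpy as np

# np.unwrap removes the 2*pi jumps introduced when a steadily growing
# phase is folded into (-pi, pi], one candidate post-processing step.
true_phase = np.linspace(0, 4 * np.pi, 100)      # monotonically growing
wrapped = np.angle(np.exp(1j * true_phase))      # folded to (-pi, pi]
unwrapped = np.unwrap(wrapped)
print(np.allclose(unwrapped, true_phase))        # True
```

Unwrapping is only reliable when the phase changes by less than $\pi$ between samples, which holds for the slowly drifting phase signals in question.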
All tests run inside Docker — no local Python dependencies needed:

```shell
# CPU: lint + unit tests (builds image automatically if not found)
./test.sh

# GPU: lint + unit tests (requires nvidia-container-toolkit)
./test.sh gpu

# Force rebuild before testing
./test.sh --build
./test.sh gpu --build
```

CPU tests (`tests/test_visualmic.py`) cover:

- Utility functions (`format_duration`, `find_best_shift`, `save_wav`)
- Phase signal postprocessing (cross-correlation, normalization, Butterworth filter)
- VRAM estimation arithmetic
- Full `extract_audio` pipeline on synthetic 256x256 video (shape, finiteness)
- All CLI validation error paths
GPU tests (`tests/test_visualmic_gpu.py`) cover:

- `DTCWTForward` shapes, finiteness, and custom filter selection
- Full `extract_audio_gpu` pipeline on synthetic 256x256 video
- All GPU tests skip automatically on systems without CUDA
Version is tracked in a `VERSION` file at the project root. `visualmic.py` has `__version__` baked into the source (updated at release time).
To cut a release:
1. Update `VERSION` with the new version number
2. Update `__version__` in `visualmic.py`
3. Update `CHANGELOG.md` — move items from `[Unreleased]` to `[X.Y.Z] - YYYY-MM-DD`
4. Commit: `Release vX.Y.Z`
5. Tag: `git tag -a vX.Y.Z -m "Release vX.Y.Z"`
6. Push: `git push && git push origin vX.Y.Z`
7. Rebuild Docker images: `./docker-build.sh && ./docker-build-gpu.sh`
```
visualmic.py                # CLI tool (CPU + GPU paths)
Dockerfile                  # CPU Docker image (python:3.11-slim)
Dockerfile.gpu              # GPU Docker image (pytorch:2.1.2-cuda12.1)
docker-build.sh             # Build + tag CPU image
docker-build-gpu.sh         # Build + tag GPU image
test.sh                     # Run lint + tests (Docker, supports cpu/gpu mode)
requirements.txt            # CPU runtime dependencies
requirements-gpu.txt        # GPU runtime dependencies
requirements-dev.txt        # Dev dependencies (pytest, ruff)
tests/
  test_visualmic.py         # CPU unit tests
  test_visualmic_gpu.py     # GPU unit tests (CUDA-only, skip on CPU)
docs/design/                # Architecture decision records
  visualmic-hardening.md    # Hardening design doc
VERSION                     # Single source of truth for version
CHANGELOG.md                # Release history
CONTRIBUTING.md             # Contribution guidelines
```
- Davis, A., Rubinstein, M., Wadhwa, N., Mysore, G.J., Durand, F., & Freeman, W.T. (2014). The Visual Microphone: Passive Recovery of Sound from Video. *ACM Transactions on Graphics (SIGGRAPH)*, 33(4). Paper PDF | Project Page
- Wadhwa, N., Rubinstein, M., Durand, F., & Freeman, W.T. (2013). Phase-Based Video Motion Processing. *ACM Transactions on Graphics (SIGGRAPH)*. Project Page
- Wadhwa, N., Rubinstein, M., Durand, F., & Freeman, W.T. (2014). Riesz Pyramids for Fast Phase-Based Video Magnification. *IEEE ICCP*. Project Page
- Selesnick, I.W., Baraniuk, R.G., & Kingsbury, N.G. (2005). The Dual-Tree Complex Wavelet Transform. *IEEE Signal Processing Magazine*, 22(6), 123–151. Tutorial PDF
- Davis, A. (2016). Visual Vibration Analysis. PhD Thesis, MIT. Thesis PDF
- Shen, M. & Bhatt, S. (2018). Local Visual Microphones: Improved Sound Extraction from Silent Video. arXiv:1801.09436
- Niwa, T., Fushimi, T., Yamamoto, S., & Ochiai, Y. (2023). Live Demonstration: Event-based Visual Microphone. *CVPR Workshop on Event-based Vision*. Paper PDF
- dtcwt Python library. Documentation | GitHub
- MIT CSAIL Visual Microphone Dataset. Download
