━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
- Architecture Overview
- Phase 1: Baseline F0 Analysis
- Phase 2: Vocal Tract Analysis
- Phase 3: Artifact Detection
- Phase 4: Report Synthesis
- Verification System
- Visualization System
- Algorithms & Mathematics
- Performance Optimization
- Security Considerations
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
The pipeline follows a 4-phase sequential processing model where each phase builds upon the previous:
Audio Input → PHASE 1 → PHASE 2 → PHASE 3 → PHASE 4 → Verified Report
(F0) (Formants) (Artifacts) (Synthesis)
┌─────────────────────────────────────────────────────────┐
│ VoiceManipulationDetector │
│ │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ │
│ │ Phase1 │ │ Phase2 │ │ Phase3 │ │
│ │ Baseline │→ │ Formants │→ │ Artifacts │ │
│ │ Analyzer │ │ Analyzer │ │ Analyzer │ │
│ └────────────┘ └────────────┘ └────────────┘ │
│ │ │ │ │
│ └───────────────┴───────────────┘ │
│ ↓ │
│ ┌────────────┐ │
│ │ Phase4 │ │
│ │ Report │ │
│ │ Synthesis │ │
│ └────────────┘ │
│ ↓ │
│ ┌──────────────┴──────────────┐ │
│ ↓ ↓ │
│ ┌────────────┐ ┌────────────┐ │
│ │ Output │ │ Visualizer│ │
│ │ Verifier │ │ │ │
│ └────────────┘ └────────────┘ │
└─────────────────────────────────────────────────────────┘
- Input: Audio file (WAV, MP3, FLAC, etc.)
- Loading:
librosa.load()→ normalized waveform - Analysis: Sequential 4-phase processing
- Synthesis: Report generation with verification metadata
- Output: JSON report + Markdown + Visualizations
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Extract the fundamental frequency (F0) to determine the presented pitch of the audio.
class BaselineAnalyzer:
def analyze(self, y, sr):
# Extract pitch using piptrack
pitches, magnitudes = librosa.piptrack(
y=y, sr=sr,
fmin=75, fmax=400,
threshold=0.1
)
# Select pitch with highest magnitude per frame
f0_values = []
for t in range(pitches.shape[1]):
index = magnitudes[:, t].argmax()
pitch = pitches[index, t]
if pitch > 0:
f0_values.append(pitch)
f0_median = np.median(f0_values)
return f0_medianUses probabilistic YIN (pYIN) algorithm:
- Autocorrelation: Computes autocorrelation of audio signal
- Difference Function: Calculates cumulative mean normalized difference
- Peak Picking: Identifies periodic peaks corresponding to F0
- Probabilistic Selection: Uses Viterbi algorithm to select most likely F0 path
| Parameter | Value | Rationale |
|---|---|---|
fmin |
75 Hz | Below typical male speech (~85 Hz) |
fmax |
400 Hz | Above typical female speech (~255 Hz) |
threshold |
0.1 | Minimum magnitude for voiced detection |
hop_length |
512 | ~23ms frames at 22050 Hz |
if f0_median > 165 Hz:
presented_sex = 'Female'
else:
presented_sex = 'Male'Rationale: 165 Hz is the empirically-determined boundary where male/female pitch ranges overlap minimally.
{
'f0_median': 221.5,
'f0_mean': 220.8,
'f0_std': 12.3,
'f0_values': [...], # Time series
'presented_sex': 'Female'
}━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Extract formant frequencies (F1, F2, F3) which reveal the physical characteristics of the vocal tract, independent of pitch.
class VocalTractAnalyzer:
def analyze(self, audio_path, sr):
# Load with Praat
sound = parselmouth.Sound(audio_path)
# Extract formants using Burg algorithm
formant = sound.to_formant_burg(
time_step=0.01,
max_number_of_formants=5,
maximum_formant=5500.0,
window_length=0.025
)
# Collect formants over time
f1_values, f2_values, f3_values = [], [], []
for time in np.arange(formant.start_time, formant.end_time, 0.01):
f1 = formant.get_value_at_time(1, time)
f2 = formant.get_value_at_time(2, time)
f3 = formant.get_value_at_time(3, time)
if not np.isnan(f1): f1_values.append(f1)
if not np.isnan(f2): f2_values.append(f2)
if not np.isnan(f3): f3_values.append(f3)
return {
'f1_median': np.median(f1_values),
'f2_median': np.median(f2_values),
'f3_median': np.median(f3_values)
}Linear Predictive Coding (LPC) approach:
- Windowing: Apply Gaussian window to audio frames
- LPC Analysis: Fit autoregressive model to predict signal
- Root Finding: Solve for poles of transfer function
- Formant Extraction: Convert poles to frequency peaks
Advantages:
- High frequency resolution
- Robust to noise
- Computationally efficient
| Formant | Male Range | Female Range |
|---|---|---|
| F1 | 400-800 Hz | 600-1000 Hz |
| F2 | 1000-1500 Hz | 1400-2200 Hz |
| F3 | 2000-3000 Hz | 2300-3500 Hz |
def _classify_sex(self, f1, f2):
if f1 < 550:
return 'Male'
elif f1 > 900:
return 'Female'
else:
# Use F2 as secondary discriminator
return 'Male' if f2 < 1350 else 'Female'Physical basis: Formants depend on vocal tract length and shape:
Formant Frequency ∝ Speed of Sound / (4 × Vocal Tract Length)
To authentically change formants, you would need to physically alter vocal tract geometry, which pitch-shifting algorithms cannot do.
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Detect evidence of manipulation using three independent methods.
Core Principle: Pitch and formants must be coherent in natural speech.
def _analyze_pitch_formant_incoherence(self, phase1, phase2):
presented_sex = phase1['presented_sex']
probable_sex = phase2['probable_sex']
# SMOKING GUN: Direct contradiction
if presented_sex != probable_sex:
return {
'incoherence_detected': True,
'confidence': 0.95 # Very high
}Example:
F0 = 220 Hz → Female presentation
F1 = 500 Hz → Male vocal tract
VERDICT: INCOHERENCE → Pitch-shifted male voice
Detects:
- Unnatural harmonic structures
- Consistent computational noise floor
- Spectral discontinuities
def _analyze_mel_spectrogram(self, y, sr):
# Compute Mel spectrogram
mel_spec = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
mel_spec_db = librosa.power_to_db(mel_spec)
# Check 1: Noise floor consistency
noise_floor = np.percentile(mel_spec_db, 10, axis=1)
noise_floor_std = np.std(noise_floor)
consistent_noise_floor = noise_floor_std < 3.0
# Check 2: Spectral smoothness (harmonic ringing)
spectral_envelope = np.mean(mel_spec_db, axis=1)
spectral_gradient = np.gradient(spectral_envelope)
spectral_smoothness = np.std(spectral_gradient)
unnatural_harmonics = spectral_smoothness > 2.5Rationale:
- Natural audio: Variable noise floor, smooth harmonics
- Processed audio: Consistent noise floor, artificial harmonics from phase vocoder
Detects time-stretching artifacts.
Theory: Time-stretching works by:
- Slicing audio into short frames
- Discarding frames (to speed up) or duplicating frames (to slow down)
- Re-stitching frames with phase adjustment
Problem: This damages transients (sharp, percussive sounds like 'p', 't', 'k').
def _analyze_phase_coherence(self, y, sr):
# STFT
D = librosa.stft(y, n_fft=2048, hop_length=512)
magnitude = np.abs(D)
phase = np.angle(D)
# Phase coherence metric
phase_diff = np.diff(phase, axis=1)
phase_diff = np.angle(np.exp(1j * phase_diff)) # Wrap to [-π, π]
phase_variance = np.var(phase_diff)
# Transient analysis
onset_env = librosa.onset.onset_strength(y=y, sr=sr)
onset_sharpness = np.mean(np.abs(np.diff(onset_env)))
# Detection
time_stretch_detected = (phase_variance > 2.5) or (onset_sharpness < 0.5)Indicators:
- High phase variance: Scrambled phase information
- Low onset sharpness: Smeared transients
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Consolidate all findings into a comprehensive, verifiable report.
def _calculate_confidence(self, incoherence, mel, phase):
evidence_count = sum([
incoherence['incoherence_detected'],
mel['artifacts_detected'],
phase['transient_smearing_detected']
])
if evidence_count == 0:
return 0.0
elif evidence_count == 1:
return max(incoherence.get('confidence', 0), 0.60)
elif evidence_count == 2:
return 0.85
else: # All 3
return 0.99Logic: Multiple independent confirmations dramatically increase confidence.
{
"ASSET_ID": "...",
"ANALYSIS_TIMESTAMP": "2025-10-29T23:05:08Z",
"DECEPTION_BASELINE_F0": "221.5 Hz",
"PRESENTED_AS": "Female",
"PHYSICAL_BASELINE_FORMANTS": "F1: 498 Hz, F2: 1510 Hz",
"PROBABLE_SEX": "Male",
"ALTERATION_DETECTED": true,
"CONFIDENCE": "99% (Very High)",
"EVIDENCE_VECTOR_1_PITCH": "...",
"EVIDENCE_VECTOR_2_TIME": "...",
"EVIDENCE_VECTOR_3_SPECTRAL": "...",
"DETAILED_FINDINGS": {...},
"VERIFICATION": {...}
}━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Every report includes:
def sign_report(self, report, audio_path):
# 1. Compute audio file hash
sha256_hash = hashlib.sha256()
with open(audio_path, "rb") as f:
for block in iter(lambda: f.read(4096), b""):
sha256_hash.update(block)
file_hash = sha256_hash.hexdigest()
# 2. Compute audio waveform hash
y, sr = librosa.load(audio_path)
audio_hash = hashlib.sha256(y.tobytes()).hexdigest()
# 3. Compute report hash
report_json = json.dumps(sanitize_for_json(report), sort_keys=True)
report_hash = hashlib.sha256(report_json.encode()).hexdigest()
# 4. Add verification block
report['VERIFICATION'] = {
'file_hash_sha256': file_hash,
'audio_hash_sha256': audio_hash,
'report_hash_sha256': report_hash,
'timestamp_utc': datetime.utcnow().isoformat() + 'Z'
}def verify_report(self, report_path):
# Load report
with open(report_path) as f:
report = json.load(f)
# Recompute hashes
current_hash = compute_hash(report)
stored_hash = report['VERIFICATION']['report_hash_sha256']
# Compare
if current_hash != stored_hash:
return {'valid': False, 'error': 'Report has been tampered with'}
return {'valid': True}━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
-
Overview Dashboard (
_overview.png)- Waveform
- F0 over time
- Mel spectrogram
- Formant bar chart
- Detection summary
- Verdict panel
-
Mel Spectrogram (
_mel_spectrogram.png)- High-resolution spectrogram
- Artifact annotations
- Frequency/time axes
-
Phase Analysis (
_phase_analysis.png)- STFT magnitude
- Phase plot
- Transient markers
-
Pitch-Formant Comparison (
_pitch_formant_comparison.png)- F0 vs male/female ranges
- Formants vs expected ranges
- Incoherence visualization
Uses matplotlib with custom styling:
plt.style.use('dark_background')
fig, axes = plt.subplots(3, 2, figsize=(14, 10), dpi=100)━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
R(τ) = Σ x(t) · x(t + τ)
where:
R(τ) = autocorrelation function
τ = lag (delay)
x(t) = signal at time t
F0 = 1 / τ_peak (period corresponding to peak autocorrelation)
x(n) = Σ a_k · x(n-k) + e(n)
where:
x(n) = signal sample at time n
a_k = LPC coefficients
e(n) = prediction error
Transfer Function:
H(z) = 1 / (1 - Σ a_k · z^-k)
Formants = frequencies where |H(f)| has local maxima
φ_var = Var(∠STFT(y))
where:
∠ = angle (phase extraction)
STFT = Short-Time Fourier Transform
Var = variance
High φ_var indicates phase discontinuities from time-stretching
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# Limit audio duration to prevent memory exhaustion
y, sr = librosa.load(path, duration=30.0) # First 30 seconds only
# Use generators for batch processing
for audio_file in audio_files:
report = detector.analyze(audio_file)
yield report # Don't store all in memory- Disable visualizations for production:
save_visualizations=False - Reduce hop length for faster F0 extraction:
hop_length=1024 - Lower mel bands for faster spectrogram:
n_mels=64
from multiprocessing import Pool
with Pool(processes=4) as pool:
reports = pool.map(detector.analyze, audio_files)━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# Validate file size to prevent DoS
MAX_FILE_SIZE = 100 * 1024 * 1024 # 100 MB
if audio_path.stat().st_size > MAX_FILE_SIZE:
raise ValueError("File too large")
# Validate duration
y, sr = librosa.load(path, duration=10*60) # Max 10 minutes# Run in Docker container
docker run --rm --network none -v /audio:/data voice-detector analyze /data/sample.wav
# Run with restricted user
sudo -u nobody python pipeline.py sample.wav- All processing is 100% offline
- No external API calls
- No telemetry or data collection
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
-
Pitch Detection:
- de Cheveigné, A., & Kawahara, H. (2002). "YIN, a fundamental frequency estimator for speech and music." JASA
-
Formant Analysis:
- Burg, J. P. (1975). "Maximum entropy spectral analysis." PhD thesis, Stanford University
-
Voice Manipulation Detection:
- Wu, Z., et al. (2015). "ASVspoof 2015: the first automatic speaker verification spoofing and countermeasures challenge." Interspeech
-
Phase Vocoder Artifacts:
- Laroche, J., & Dolson, M. (1999). "Improved phase vocoder time-scale modification of audio." IEEE Trans. Speech Audio Process.
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━