Technical Documentation

Voice Manipulation Detection Pipeline

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Table of Contents

  1. Architecture Overview
  2. Phase 1: Baseline F0 Analysis
  3. Phase 2: Vocal Tract Analysis
  4. Phase 3: Artifact Detection
  5. Phase 4: Report Synthesis
  6. Verification System
  7. Visualization System
  8. Algorithms & Mathematics
  9. Performance Optimization
  10. Security Considerations

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Architecture Overview

System Design

The pipeline follows a 4-phase sequential processing model where each phase builds upon the previous:

Audio Input → PHASE 1 → PHASE 2 → PHASE 3 → PHASE 4 → Verified Report
              (F0)      (Formants) (Artifacts) (Synthesis)

Component Diagram

┌─────────────────────────────────────────────────────────┐
│                   VoiceManipulationDetector             │
│                                                         │
│  ┌────────────┐  ┌────────────┐  ┌────────────┐       │
│  │  Phase1    │  │  Phase2    │  │  Phase3    │       │
│  │  Baseline  │→ │  Formants  │→ │  Artifacts │       │
│  │  Analyzer  │  │  Analyzer  │  │  Analyzer  │       │
│  └────────────┘  └────────────┘  └────────────┘       │
│         │               │               │              │
│         └───────────────┴───────────────┘              │
│                        ↓                               │
│                 ┌────────────┐                         │
│                 │  Phase4    │                         │
│                 │  Report    │                         │
│                 │  Synthesis │                         │
│                 └────────────┘                         │
│                        ↓                               │
│         ┌──────────────┴──────────────┐               │
│         ↓                              ↓               │
│  ┌────────────┐                 ┌────────────┐        │
│  │  Output    │                 │  Visualizer│        │
│  │  Verifier  │                 │            │        │
│  └────────────┘                 └────────────┘        │
└─────────────────────────────────────────────────────────┘

Data Flow

  1. Input: Audio file (WAV, MP3, FLAC, etc.)
  2. Loading: librosa.load() → normalized waveform
  3. Analysis: Sequential 4-phase processing
  4. Synthesis: Report generation with verification metadata
  5. Output: JSON report + Markdown + Visualizations
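
A minimal sketch of this flow, wiring the analyzers from the component diagram together (the orchestration function and the ArtifactsAnalyzer/ReportSynthesizer names are illustrative, not the project's actual API):

import librosa

def run_pipeline(audio_path):
    # 2. Loading: decode to a normalized mono waveform
    y, sr = librosa.load(audio_path, sr=22050, mono=True)

    # 3. Analysis: sequential 4-phase processing
    phase1 = BaselineAnalyzer().analyze(y, sr)                     # F0
    phase2 = VocalTractAnalyzer().analyze(audio_path)              # formants
    phase3 = ArtifactsAnalyzer().analyze(y, sr, phase1, phase2)    # artifacts (hypothetical API)

    # 4-5. Synthesis: consolidated report with verification metadata
    return ReportSynthesizer().synthesize(phase1, phase2, phase3)  # hypothetical API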

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Phase 1: Baseline F0 Analysis

Objective

Extract the fundamental frequency (F0) to determine the presented pitch of the audio.

Implementation: phase1_baseline.py

import librosa
import numpy as np

class BaselineAnalyzer:
    def analyze(self, y, sr):
        # Extract per-frame pitch candidates and their magnitudes
        pitches, magnitudes = librosa.piptrack(
            y=y, sr=sr,
            fmin=75, fmax=400,
            threshold=0.1
        )

        # Select the pitch with the highest magnitude in each frame
        f0_values = []
        for t in range(pitches.shape[1]):
            index = magnitudes[:, t].argmax()
            pitch = pitches[index, t]
            if pitch > 0:  # skip unvoiced frames
                f0_values.append(pitch)

        f0_median = np.median(f0_values)
        return f0_median  # the full pipeline also reports mean, std, and the time series (see Output)

Algorithm: librosa.piptrack()

piptrack is an STFT-based pitch tracker (it is not pYIN):

  1. STFT: Compute the short-time Fourier transform of the signal
  2. Thresholding: Keep spectral peaks whose magnitude exceeds threshold × the frame maximum
  3. Parabolic Interpolation: Refine each peak's frequency estimate between FFT bins
  4. Candidate Selection: Emit per-frame pitch candidates with magnitudes; the caller picks one (here, the strongest per frame)

librosa also ships probabilistic YIN (pYIN) as librosa.pyin, which combines the cumulative mean normalized difference function with Viterbi decoding to select the most likely F0 path; a sketch follows.
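
For a smoother F0 track, the pYIN implementation can be swapped in; a minimal sketch, mirroring the piptrack frequency bounds above:

import librosa
import numpy as np

y, sr = librosa.load('sample.wav', duration=30.0)

# pYIN returns a per-frame F0 track plus voicing decisions and probabilities
f0, voiced_flag, voiced_probs = librosa.pyin(
    y, fmin=75, fmax=400, sr=sr, hop_length=512
)
f0_median = np.nanmedian(f0[voiced_flag])  # unvoiced frames are NaN; keep voiced only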

Parameters

Parameter    Value     Rationale
fmin         75 Hz     Below typical male speech (~85 Hz)
fmax         400 Hz    Above typical female speech (~255 Hz)
threshold    0.1       Minimum magnitude for voiced detection
hop_length   512       ~23 ms frames at 22050 Hz

Sex Classification

if f0_median > 165:  # Hz
    presented_sex = 'Female'
else:
    presented_sex = 'Male'

Rationale: 165 Hz is an empirically determined boundary where the male and female pitch ranges overlap least.

Output

{
    'f0_median': 221.5,
    'f0_mean': 220.8,
    'f0_std': 12.3,
    'f0_values': [...],  # Time series
    'presented_sex': 'Female'
}

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Phase 2: Vocal Tract Analysis

Objective

Extract formant frequencies (F1, F2, F3) which reveal the physical characteristics of the vocal tract, independent of pitch.

Implementation: phase2_formants.py

import numpy as np
import parselmouth

class VocalTractAnalyzer:
    def analyze(self, audio_path):
        # Load with Praat (parselmouth reads directly from the file)
        sound = parselmouth.Sound(audio_path)

        # Extract formants using the Burg algorithm
        formant = sound.to_formant_burg(
            time_step=0.01,
            max_number_of_formants=5,
            maximum_formant=5500.0,
            window_length=0.025
        )

        # Collect formant tracks over time, skipping undefined (NaN) frames
        f1_values, f2_values, f3_values = [], [], []
        for time in np.arange(formant.start_time, formant.end_time, 0.01):
            f1 = formant.get_value_at_time(1, time)
            f2 = formant.get_value_at_time(2, time)
            f3 = formant.get_value_at_time(3, time)
            if not np.isnan(f1): f1_values.append(f1)
            if not np.isnan(f2): f2_values.append(f2)
            if not np.isnan(f3): f3_values.append(f3)

        f1_median = np.median(f1_values)
        f2_median = np.median(f2_values)
        return {
            'f1_median': f1_median,
            'f2_median': f2_median,
            'f3_median': np.median(f3_values),
            'probable_sex': self._classify_sex(f1_median, f2_median)  # see Sex Classification below
        }

Algorithm: Burg's Method

Linear Predictive Coding (LPC) approach:

  1. Windowing: Apply Gaussian window to audio frames
  2. LPC Analysis: Fit autoregressive model to predict signal
  3. Root Finding: Solve for poles of transfer function
  4. Formant Extraction: Convert poles to frequency peaks

Advantages:

  • High frequency resolution
  • Robust to noise
  • Computationally efficient

Formant Ranges

Formant   Male Range      Female Range
F1        400-800 Hz      600-1000 Hz
F2        1000-1500 Hz    1400-2200 Hz
F3        2000-3000 Hz    2300-3500 Hz

Sex Classification

def _classify_sex(self, f1, f2):
    if f1 < 550:
        return 'Male'
    elif f1 > 900:
        return 'Female'
    else:
        # Use F2 as secondary discriminator
        return 'Male' if f2 < 1350 else 'Female'

Why Formants Cannot Be Easily Manipulated

Physical basis: Formants depend on vocal tract length and shape. Modeling the tract as a uniform tube closed at the glottis (a quarter-wave resonator):

F_n = (2n − 1) × c / (4 × L)

where c is the speed of sound and L is the vocal tract length.

To authentically change formants, you would need to physically alter vocal tract geometry, which pitch-shifting algorithms cannot do.
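
A back-of-the-envelope check of the quarter-wave model, assuming c ≈ 350 m/s in the warm, humid vocal tract and average tract lengths of ~17.5 cm (male) and ~14.5 cm (female):

Male:   F1 = 350 / (4 × 0.175) = 500 Hz
Female: F1 = 350 / (4 × 0.145) ≈ 603 Hz

The ~20% upward shift for the shorter tract matches the formant ranges tabulated above, and it is why a 500 Hz F1 under a 220 Hz "female" F0 is physically suspicious.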

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Phase 3: Artifact Detection

Objective

Detect evidence of manipulation using three independent methods.

3.1: Pitch-Formant Incoherence Detection

Core Principle: Pitch and formants must be coherent in natural speech.

def _analyze_pitch_formant_incoherence(self, phase1, phase2):
    presented_sex = phase1['presented_sex']
    probable_sex = phase2['probable_sex']

    # SMOKING GUN: pitch presents one sex, vocal tract indicates the other
    if presented_sex != probable_sex:
        return {
            'incoherence_detected': True,
            'confidence': 0.95  # Very high
        }

    return {'incoherence_detected': False, 'confidence': 0.0}

Example:

F0 = 220 Hz → Female presentation
F1 = 500 Hz → Male vocal tract
VERDICT: INCOHERENCE → Pitch-shifted male voice

3.2: Mel Spectrogram Artifact Analysis

Detects:

  1. Unnatural harmonic structures
  2. Consistent computational noise floor
  3. Spectral discontinuities

def _analyze_mel_spectrogram(self, y, sr):
    # Compute the Mel spectrogram in decibels
    mel_spec = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
    mel_spec_db = librosa.power_to_db(mel_spec)

    # Check 1: Noise floor consistency (10th percentile of each Mel band over time)
    noise_floor = np.percentile(mel_spec_db, 10, axis=1)
    noise_floor_std = np.std(noise_floor)
    consistent_noise_floor = noise_floor_std < 3.0

    # Check 2: Spectral smoothness (harmonic ringing)
    spectral_envelope = np.mean(mel_spec_db, axis=1)
    spectral_gradient = np.gradient(spectral_envelope)
    spectral_smoothness = np.std(spectral_gradient)
    unnatural_harmonics = spectral_smoothness > 2.5

    return {
        'artifacts_detected': consistent_noise_floor or unnatural_harmonics,
        'consistent_noise_floor': consistent_noise_floor,
        'unnatural_harmonics': unnatural_harmonics
    }

Rationale:

  • Natural audio: Variable noise floor, smooth harmonics
  • Processed audio: Consistent noise floor, artificial harmonics from phase vocoder

3.3: Phase Decoherence / Transient Smearing

Detects time-stretching artifacts.

Theory: Time-stretching works by:

  1. Slicing audio into short frames
  2. Discarding frames (to speed up) or duplicating frames (to slow down)
  3. Re-stitching frames with phase adjustment

Problem: This damages transients (sharp, percussive sounds like 'p', 't', 'k').

def _analyze_phase_coherence(self, y, sr):
    # STFT magnitude and phase
    D = librosa.stft(y, n_fft=2048, hop_length=512)
    magnitude = np.abs(D)
    phase = np.angle(D)

    # Phase coherence metric: variance of frame-to-frame phase change
    phase_diff = np.diff(phase, axis=1)
    phase_diff = np.angle(np.exp(1j * phase_diff))  # Wrap to [-π, π]
    phase_variance = np.var(phase_diff)

    # Transient analysis: how sharp are the onsets?
    onset_env = librosa.onset.onset_strength(y=y, sr=sr)
    onset_sharpness = np.mean(np.abs(np.diff(onset_env)))

    # Detection: scrambled phase or smeared transients
    time_stretch_detected = (phase_variance > 2.5) or (onset_sharpness < 0.5)

    return {
        'transient_smearing_detected': time_stretch_detected,
        'phase_variance': phase_variance,
        'onset_sharpness': onset_sharpness
    }

Indicators:

  • High phase variance: Scrambled phase information
  • Low onset sharpness: Smeared transients

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Phase 4: Report Synthesis

Objective

Consolidate all findings into a comprehensive, verifiable report.

Confidence Calculation

def _calculate_confidence(self, incoherence, mel, phase):
    evidence_count = sum([
        incoherence['incoherence_detected'],
        mel['artifacts_detected'],
        phase['transient_smearing_detected']
    ])

    if evidence_count == 0:
        return 0.0
    elif evidence_count == 1:
        return max(incoherence.get('confidence', 0), 0.60)
    elif evidence_count == 2:
        return 0.85
    else:  # All 3
        return 0.99

Logic: Multiple independent confirmations dramatically increase confidence.
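
A quick sanity check of the tiering, using made-up evidence dicts whose keys match the Phase 3 outputs (synthesizer is an illustrative instance holding the method above):

incoherence = {'incoherence_detected': False, 'confidence': 0.0}
mel = {'artifacts_detected': True}
phase = {'transient_smearing_detected': True}

# Two independent confirmations → 0.85
print(synthesizer._calculate_confidence(incoherence, mel, phase))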

Report Structure

{
  "ASSET_ID": "...",
  "ANALYSIS_TIMESTAMP": "2025-10-29T23:05:08Z",

  "DECEPTION_BASELINE_F0": "221.5 Hz",
  "PRESENTED_AS": "Female",

  "PHYSICAL_BASELINE_FORMANTS": "F1: 498 Hz, F2: 1510 Hz",
  "PROBABLE_SEX": "Male",

  "ALTERATION_DETECTED": true,
  "CONFIDENCE": "99% (Very High)",

  "EVIDENCE_VECTOR_1_PITCH": "...",
  "EVIDENCE_VECTOR_2_TIME": "...",
  "EVIDENCE_VECTOR_3_SPECTRAL": "...",

  "DETAILED_FINDINGS": {...},
  "VERIFICATION": {...}
}

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Verification System

Cryptographic Integrity

Every report includes:

import hashlib
import json
from datetime import datetime

import librosa

def sign_report(self, report, audio_path):
    # 1. Compute audio file hash (raw bytes on disk)
    sha256_hash = hashlib.sha256()
    with open(audio_path, "rb") as f:
        for block in iter(lambda: f.read(4096), b""):
            sha256_hash.update(block)
    file_hash = sha256_hash.hexdigest()

    # 2. Compute audio waveform hash (decoded samples, robust to container changes)
    y, sr = librosa.load(audio_path)
    audio_hash = hashlib.sha256(y.tobytes()).hexdigest()

    # 3. Compute report hash (canonical JSON, before the verification block is added)
    report_json = json.dumps(sanitize_for_json(report), sort_keys=True)
    report_hash = hashlib.sha256(report_json.encode()).hexdigest()

    # 4. Add verification block
    report['VERIFICATION'] = {
        'file_hash_sha256': file_hash,
        'audio_hash_sha256': audio_hash,
        'report_hash_sha256': report_hash,
        'timestamp_utc': datetime.utcnow().isoformat() + 'Z'
    }
    return report

Tamper Detection

def verify_report(self, report_path):
    # Load report
    with open(report_path) as f:
        report = json.load(f)

    # Remove the verification block: the stored hash was computed before it was added
    verification = report.pop('VERIFICATION')
    stored_hash = verification['report_hash_sha256']

    # Recompute the report hash the same way sign_report did
    report_json = json.dumps(report, sort_keys=True)
    current_hash = hashlib.sha256(report_json.encode()).hexdigest()

    # Compare
    if current_hash != stored_hash:
        return {'valid': False, 'error': 'Report has been tampered with'}

    return {'valid': True}
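
A hypothetical round trip through both methods (detector and verifier are illustrative objects; sanitize_for_json is the helper used in sign_report), assuming the report is saved with the same sort_keys canonicalization:

import json

report = detector.analyze('sample.wav')
signed = verifier.sign_report(report, 'sample.wav')

with open('report.json', 'w') as f:
    json.dump(sanitize_for_json(signed), f, sort_keys=True, indent=2)

print(verifier.verify_report('report.json'))  # expected: {'valid': True}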

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Visualization System

Generated Plots

  1. Overview Dashboard (_overview.png)

    • Waveform
    • F0 over time
    • Mel spectrogram
    • Formant bar chart
    • Detection summary
    • Verdict panel

  2. Mel Spectrogram (_mel_spectrogram.png)

    • High-resolution spectrogram
    • Artifact annotations
    • Frequency/time axes

  3. Phase Analysis (_phase_analysis.png)

    • STFT magnitude
    • Phase plot
    • Transient markers

  4. Pitch-Formant Comparison (_pitch_formant_comparison.png)

    • F0 vs male/female ranges
    • Formants vs expected ranges
    • Incoherence visualization

Rendering Engine

Uses matplotlib with custom styling:

import matplotlib.pyplot as plt

plt.style.use('dark_background')
fig, axes = plt.subplots(3, 2, figsize=(14, 10), dpi=100)
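
A minimal two-panel sketch in that style (waveform plus Mel spectrogram; the file name and layout are illustrative, the full dashboard adds four more panels):

import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

plt.style.use('dark_background')
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(14, 7), dpi=100)

y, sr = librosa.load('sample.wav', duration=30.0)

# Panel 1: waveform
ax1.plot(np.arange(len(y)) / sr, y, linewidth=0.5)
ax1.set(title='Waveform', xlabel='Time (s)', ylabel='Amplitude')

# Panel 2: Mel spectrogram in dB
mel_db = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128))
img = librosa.display.specshow(mel_db, sr=sr, x_axis='time', y_axis='mel', ax=ax2)
fig.colorbar(img, ax=ax2, format='%+2.0f dB')

fig.tight_layout()
fig.savefig('sample_overview.png')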

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Algorithms & Mathematics

F0 Estimation via Autocorrelation

R(τ) = Σ x(t) · x(t + τ)

where:
  R(τ) = autocorrelation function
  τ = lag (delay)
  x(t) = signal at time t

F0 = 1 / τ_peak  (period corresponding to peak autocorrelation)
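
A minimal numpy illustration of this estimator on a single voiced frame, with the search restricted to the 75-400 Hz band used in Phase 1:

import numpy as np

def f0_autocorrelation(frame, sr, fmin=75, fmax=400):
    frame = frame - np.mean(frame)
    # R(τ) for all non-negative lags
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    # Restrict τ to plausible pitch periods: sr/fmax ≤ τ ≤ sr/fmin
    lag_min, lag_max = int(sr / fmax), int(sr / fmin)
    tau_peak = lag_min + np.argmax(r[lag_min:lag_max])
    return sr / tau_peak  # F0 = 1/τ_peak, with τ in samples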

Formant Estimation via LPC

x(n) = Σ_{k=1..p} a_k · x(n−k) + e(n)

where:
  x(n) = signal sample at time n
  a_k  = LPC coefficients, k = 1 … p (model order p)
  e(n) = prediction error

Transfer Function:
  H(z) = 1 / (1 − Σ_{k=1..p} a_k · z^(−k))

Formants = frequencies where |H(f)| has local maxima
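
A simplified sketch of the pole-to-formant step (librosa.lpc also uses Burg's method; this version omits pre-emphasis, windowing, and bandwidth filtering, so it is rougher than Praat's implementation):

import numpy as np
import librosa

def formants_from_lpc(frame, sr, order=10):
    a = librosa.lpc(frame, order=order)  # coefficients of A(z), with a[0] == 1
    roots = np.roots(a)                  # poles of H(z) = 1 / A(z)
    roots = roots[np.imag(roots) > 0]    # keep one root of each conjugate pair
    freqs = np.sort(np.angle(roots) * sr / (2 * np.pi))
    return freqs                         # candidate formant frequencies in Hz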

Phase Variance Metric

φ_var = Var( wrap( Δ_t ∠STFT(y) ) )

where:
  ∠    = angle (phase extraction)
  Δ_t  = frame-to-frame phase difference along the time axis
  wrap = wrapping of the difference to [−π, π]
  Var  = variance

High φ_var indicates phase discontinuities from time-stretching

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Performance Optimization

Memory Management

# Limit audio duration to prevent memory exhaustion
y, sr = librosa.load(path, duration=30.0)  # First 30 seconds only

# Use a generator for batch processing so reports stream out one at a time
def analyze_batch(audio_files):
    for audio_file in audio_files:
        report = detector.analyze(audio_file)
        yield report  # Don't store all reports in memory

Computational Optimization

  1. Disable visualizations for production: save_visualizations=False
  2. Increase the hop length (fewer frames to analyze) for faster F0 extraction: hop_length=1024
  3. Lower the Mel band count for a faster spectrogram: n_mels=64

Parallelization

from multiprocessing import Pool

# detector.analyze must be picklable for Pool.map (e.g., a method on a picklable object)
with Pool(processes=4) as pool:
    reports = pool.map(detector.analyze, audio_files)

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Security Considerations

Input Validation

from pathlib import Path

# Validate file size to prevent DoS
MAX_FILE_SIZE = 100 * 1024 * 1024  # 100 MB
if Path(audio_path).stat().st_size > MAX_FILE_SIZE:
    raise ValueError("File too large")

# Validate duration by capping it at load time
y, sr = librosa.load(path, duration=10*60)  # Max 10 minutes

Sandboxing Recommendations

# Run in Docker container
docker run --rm --network none -v /audio:/data voice-detector analyze /data/sample.wav

# Run with restricted user
sudo -u nobody python pipeline.py sample.wav

No Network Access Required

  • All processing is 100% offline
  • No external API calls
  • No telemetry or data collection

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

References

  1. Pitch Detection:

    • de Cheveigné, A., & Kawahara, H. (2002). "YIN, a fundamental frequency estimator for speech and music." JASA.

  2. Formant Analysis:

    • Burg, J. P. (1975). "Maximum entropy spectral analysis." PhD thesis, Stanford University.

  3. Voice Manipulation Detection:

    • Wu, Z., et al. (2015). "ASVspoof 2015: the first automatic speaker verification spoofing and countermeasures challenge." Interspeech.

  4. Phase Vocoder Artifacts:

    • Laroche, J., & Dolson, M. (1999). "Improved phase vocoder time-scale modification of audio." IEEE Trans. Speech Audio Process.

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━