Ashly1991/ecapa-ig-speaker-anonymization


ECAPA‑TDNN + Integrated Gradients for Speaker Anonymization (+4 st)

This repository evaluates how a +4 semitone pitch‑shift (as a simple anonymization) affects speaker verification and explains the model’s decisions using Integrated Gradients (Captum). We use a pretrained ECAPA‑TDNN (SpeechBrain, VoxCeleb‑trained) on LibriSpeech and produce both quantitative metrics (cosine scores, EER) and qualitative time–frequency attributions (IG heatmaps).

Model: speechbrain/spkrec-ecapa-voxceleb · Data: LibriSpeech (16 kHz) · XAI: Captum Integrated Gradients (+ optional NoiseTunnel)


Highlights

  • Verification baseline — ECAPA embeddings + cosine scoring on LibriSpeech.
  • Anonymization — apply +4 semitone pitch shift to the test utterance only.
  • Two independent runs — Run A and Run B, each on a different (disjoint) set of speakers, for a robust comparison.
  • Explainability — log‑Mel + Integrated Gradients heatmaps (signed and |IG|) showing which time–frequency regions support or oppose “same speaker.”
  • Reproducible outputs — each run writes its own folder under results/<run_name>/ with config, log, pairs, metrics, and figures.

Project structure (suggested)

repo/
├── scripts/
│   └── LibriSpeech_ECAPA_TDNN.ipynb     # main notebook
├── data/                                # small text artifacts
│   ├── ig_run1_pairs.txt
│   └── ig_run2_pairs.txt
├── results/                             # per‑run outputs
│   ├── ig_run1/
│   │   ├── run_config.json
│   │   ├── run.log
│   │   ├── pairs.txt
│   │   └── ig_explanations/
│   │       ├── same_orig_1.png
│   │       ├── same_orig_1.abs.png
│   │       ├── same_orig_1.attr.npy
│   │       └── same_orig_1.fbank.npy
│   └── ig_run2/ ... (same layout)
└── README.md

Note: Large, derived artifacts (e.g., results/**) are typically excluded from Git; keep run_config.json, run.log, and small text files (pairs) for reproducibility.


Requirements

  • Python ≥ 3.10 (tested with 3.11)
  • PyTorch + torchaudio (CUDA optional)
  • SpeechBrain, Captum, scikit‑learn, tqdm, matplotlib, soundfile
  • LibriSpeech accessible locally (e.g., /speechdat/LibriSpeech/LibriSpeech)

Create the environment:

conda create -n xai_lrp python=3.11 -y
conda activate xai_lrp

# Pick the correct PyTorch/torchaudio wheels for your system (CPU or CUDA)
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu121

# Core libs
pip install speechbrain==0.5.15 captum==0.7.0 librosa==0.10.2.post1 scikit-learn tqdm matplotlib soundfile

How it works (high level)

  1. Index LibriSpeech (e.g., test-clean) and gather speakers with ≥3 utterances.
  2. Build trials for each run:
    • same‑speaker pairs (enroll vs two other utterances of the same speaker)
    • different‑speaker pairs (enroll vs a different speaker)
    • for both original and anonymized (+4 st) versions (anonymization applied to test only)
  3. Score with ECAPA embeddings + cosine; compute EER split by original vs anonymized.
  4. Explain selected trials with Integrated Gradients on the model’s log‑Mel features pipeline.
  5. Save per‑run artifacts under results/<run_name>/.
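
Captum's IntegratedGradients implements step 4 for PyTorch models. As a dependency-free illustration of the quantity being computed, here is the midpoint Riemann-sum approximation of IG on a toy differentiable score — the function, weights, and shapes are stand-ins, not the ECAPA pipeline:

```python
import numpy as np

def integrated_gradients(grad_f, x, baseline, n_steps=64):
    """Riemann-sum approximation of Integrated Gradients:
    (x - baseline) * mean over alpha of grad f(baseline + alpha * (x - baseline))."""
    alphas = (np.arange(n_steps) + 0.5) / n_steps        # midpoint rule
    diff = x - baseline
    grads = np.mean([grad_f(baseline + a * diff) for a in alphas], axis=0)
    return diff * grads

# Toy "score": f(x) = sum(w * x^2), with analytic gradient 2 * w * x.
rng = np.random.default_rng(0)
w = rng.normal(size=5)
x = rng.normal(size=5)
f = lambda z: float(np.sum(w * z**2))
grad_f = lambda z: 2 * w * z

attr = integrated_gradients(grad_f, x, baseline=np.zeros_like(x), n_steps=256)

# Completeness axiom: attributions sum to f(x) - f(baseline).
print(np.allclose(attr.sum(), f(x) - f(np.zeros_like(x)), atol=1e-3))  # True
```

In the repo, the same integral runs over log-Mel features, with the forward function returning the cosine score against the enroll embedding.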

Running the two disjoint‑speaker runs

Open scripts/LibriSpeech_ECAPA_TDNN.ipynb, set the parameters near the top (adjust paths as needed):

DATA_ROOT = "/speechdat/LibriSpeech/LibriSpeech"
SUBSET = "test-clean"
DEVICE = "auto"            # or "cuda" / "cpu"
MAX_SPEAKERS = 40
IG_STEPS = 64
EXPLAIN_PER_SPLIT = 2

Then run the multi‑run cell, which is preconfigured to produce two runs with disjoint speaker cohorts, e.g.:

# Example: two runs with disjoint speakers, both using +4 semitone anonymization
import torch

# Resolve DEVICE = "auto" into the concrete device used below
device = ("cuda" if torch.cuda.is_available() else "cpu") if DEVICE == "auto" else DEVICE

spk2utt = index_librispeech(DATA_ROOT, SUBSET, min_utts_per_spk=3)
classifier = load_ecapa(device)

# Split into two non‑overlapping sets
all_spks = sorted(spk2utt.keys())
mid = len(all_spks) // 2
spk2utt_1 = {s: spk2utt[s] for s in all_spks[:mid]}
spk2utt_2 = {s: spk2utt[s] for s in all_spks[mid:]}

run_experiment(name="ig_run1", spk2utt=spk2utt_1, classifier=classifier,
               data_root=DATA_ROOT, subset=SUBSET, device=device,
               max_speakers=min(MAX_SPEAKERS, len(spk2utt_1)), seed=101,
               pitch_steps=4, ig_steps=IG_STEPS, internal_bs=16,
               explain_per_split=EXPLAIN_PER_SPLIT, smooth=True,
               nt_type="smoothgrad_sq", stdevs=0.02, nt_samples=8)

run_experiment(name="ig_run2", spk2utt=spk2utt_2, classifier=classifier,
               data_root=DATA_ROOT, subset=SUBSET, device=device,
               max_speakers=min(MAX_SPEAKERS, len(spk2utt_2)), seed=202,
               pitch_steps=4, ig_steps=IG_STEPS, internal_bs=16,
               explain_per_split=EXPLAIN_PER_SPLIT, smooth=True,
               nt_type="smoothgrad_sq", stdevs=0.02, nt_samples=8)

The runs write to results/ig_run1/ and results/ig_run2/, respectively.


Outputs

results/<run>/run.log

Contains only the key summaries, for example:

[ig_run1] speakers=40  trials=120
[ig_run1] Original: N=120 pos=80 neg=40
[ig_run1] Anonymized: N=120 pos=80 neg=40
[ig_run1] Original EER: 3.12% @ thr=0.301
[ig_run1] Anonymized (+4 st) EER: 30.63% @ thr=0.107
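
EER figures like those above come from sweeping a decision threshold over the cosine scores until the false-accept and false-reject rates meet. A minimal NumPy sketch (the synthetic scores are stand-ins, not the repo's actual helper):

```python
import numpy as np

def compute_eer(scores, labels):
    """Equal Error Rate: sweep a threshold until false-accept ~= false-reject.
    scores: higher = more likely same speaker; labels: 1 = same, 0 = different."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best_gap, eer, eer_thr = np.inf, 1.0, 0.0
    for thr in np.sort(np.unique(scores)):
        far = np.mean(scores[labels == 0] >= thr)   # impostors accepted
        frr = np.mean(scores[labels == 1] < thr)    # targets rejected
        if abs(far - frr) < best_gap:
            best_gap, eer, eer_thr = abs(far - frr), (far + frr) / 2, thr
    return eer, eer_thr

# Synthetic cosine scores standing in for real trials (80 same / 40 different)
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(0.6, 0.1, 80), rng.normal(0.1, 0.1, 40)])
labels = np.concatenate([np.ones(80, int), np.zeros(40, int)])

eer, thr = compute_eer(scores, labels)
print(f"EER: {eer:.2%} @ thr={thr:.3f}")
```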

results/<run>/pairs.txt

Tab‑separated trials (paths are relative to DATA_ROOT):

enroll_path    test_path    label    anonymized
test-clean/1089/.../0000.flac   test-clean/1089/.../0001.flac   1   0   # same speaker, original
test-clean/1089/.../0000.flac   test-clean/1089/.../0001.flac   1   1   # same speaker, anonymized (test)
test-clean/1089/.../0000.flac   test-clean/1188/.../0001.flac   0   0   # different speakers, original
test-clean/1089/.../0000.flac   test-clean/1188/.../0001.flac   0   1   # different speakers, anonymized (test)
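
A pairs file in this layout can be read back with the standard library alone. A hypothetical reader sketch — the helper name and sample paths below are illustrative, not from the repo:

```python
import tempfile
from pathlib import Path

def read_pairs(path):
    """Parse trials: enroll_path, test_path, label, anonymized (tab/whitespace separated)."""
    trials = []
    for line in Path(path).read_text().splitlines():
        line = line.split("#", 1)[0].strip()           # drop trailing comments
        if not line or line.startswith("enroll_path"):
            continue                                   # skip blanks and the header row
        enroll, test, label, anon = line.split()[:4]
        trials.append((enroll, test, int(label), int(anon)))
    return trials

# Hypothetical mini pairs file (placeholder paths, not real LibriSpeech entries)
sample = (
    "enroll_path\ttest_path\tlabel\tanonymized\n"
    "enroll.flac\ttest.flac\t1\t0   # same speaker, original\n"
    "enroll.flac\ttest.flac\t1\t1   # same speaker, anonymized (test)\n"
)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as fh:
    fh.write(sample)

trials = read_pairs(fh.name)
print(trials)
```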

IG heatmaps (results/<run>/ig_explanations/*.png)

  • Top panel — log‑Mel spectrogram of the test utterance (what ECAPA sees).
  • Bottom panel — Integrated Gradients attribution for the cosine score vs the enroll embedding:
    warmer (positive) supports “same speaker,” cooler (negative) supports “different.”
    Companion *.abs.png files show |IG| (importance magnitude only).

Plots can optionally be standardized to a fixed width for comparison; underlying arrays remain full‑length.
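
The saved *.attr.npy / *.fbank.npy pairs can be re-plotted offline. A minimal sketch, assuming (mel_bins, frames) arrays and using a symmetric color scale so zero attribution lands at the colormap midpoint — random arrays stand in for np.load(...) of the saved files here:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")              # headless backend so savefig works without a display
import matplotlib.pyplot as plt

# Stand-ins for np.load("same_orig_1.fbank.npy") / np.load("same_orig_1.attr.npy")
rng = np.random.default_rng(0)
fbank = rng.normal(size=(80, 300))  # (mel_bins, frames)
attr = rng.normal(size=(80, 300))

fig, (ax0, ax1) = plt.subplots(2, 1, figsize=(8, 5), sharex=True)
ax0.imshow(fbank, origin="lower", aspect="auto", cmap="magma")
ax0.set_ylabel("Mel bin"); ax0.set_title("log-Mel (test utterance)")

vmax = np.abs(attr).max()          # symmetric limits so 0 sits at the colormap center
im = ax1.imshow(attr, origin="lower", aspect="auto", cmap="coolwarm",
                vmin=-vmax, vmax=vmax)
ax1.set_ylabel("Mel bin"); ax1.set_xlabel("Frame")
ax1.set_title("Signed IG (warm = supports same speaker)")
fig.colorbar(im, ax=ax1)
fig.savefig("ig_replot.png", dpi=120)
```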


Reproducibility & logging

  • Each run saves a run_config.json with parameters (paths, seeds, IG settings).
  • A file‑only logger writes run.log with summary lines (INFO). Noisy per‑image messages are written at DEBUG and omitted from the file (but still printed to console).

Notes

  • The +4 semitone anonymization is applied to the test side only when anonymized=1.
  • IG is computed on the model’s log‑Mel feature representation for faithful attributions.
  • Expect some negative attributions in voiced regions (due to normalization/attention coupling).
  • EER will vary slightly across runs because speaker cohorts differ by design.

Acknowledgments

  • SpeechBrain — Mirco Ravanelli et al., SpeechBrain: A General-Purpose Speech Toolkit.
  • ECAPA‑TDNN — Desplanques et al., Emphasized Channel Attention, Propagation and Aggregation in TDNN based Speaker Verification.
  • Captum — Kokhlikyan et al., Captum: A Model Interpretability Library for PyTorch.
