This repository evaluates how a +4 semitone pitch‑shift (as a simple anonymization) affects speaker verification and explains the model’s decisions using Integrated Gradients (Captum). We use a pretrained ECAPA‑TDNN (SpeechBrain, VoxCeleb‑trained) on LibriSpeech and produce both quantitative metrics (cosine scores, EER) and qualitative time–frequency attributions (IG heatmaps).
Model: `speechbrain/spkrec-ecapa-voxceleb` · Data: LibriSpeech (16 kHz) · XAI: Captum Integrated Gradients (+ optional NoiseTunnel)
- Verification baseline — ECAPA embeddings + cosine scoring on LibriSpeech.
- Anonymization — apply +4 semitone pitch shift to the test utterance only.
- Two independent runs — Run A and Run B, each on a different (disjoint) set of speakers, for robust comparison.
- Explainability — log‑Mel + Integrated Gradients heatmaps (signed and |IG|) showing which time–frequency regions support or oppose “same speaker.”
- Reproducible outputs — each run writes its own folder under `results/<run_name>/` with config, log, pairs, metrics, and figures.
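The cosine scoring behind the verification baseline can be sketched as follows (plain NumPy; the toy vectors stand in for ECAPA embeddings, which in the notebook come from SpeechBrain):

```python
import numpy as np

def cosine_score(emb_enroll: np.ndarray, emb_test: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings, in [-1, 1]."""
    num = float(np.dot(emb_enroll, emb_test))
    den = float(np.linalg.norm(emb_enroll) * np.linalg.norm(emb_test))
    return num / den

# Toy embeddings: collinear vectors score ~1.0, orthogonal vectors score 0.0
a = np.array([1.0, 2.0, 3.0])
score_same = cosine_score(a, 2 * a)
score_orth = cosine_score(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
```

Trials whose score exceeds a chosen threshold are accepted as "same speaker"; the EER threshold reported per run is one such operating point.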
repo/
├── scripts/
│ └── LibriSpeech_ECAPA_TDNN.ipynb # main notebook
├── data/ # small text artifacts
│ ├── ig_run1_pairs.txt
│ └── ig_run2_pairs.txt
├── results/ # per‑run outputs
│ ├── ig_run1/
│ │ ├── run_config.json
│ │ ├── run.log
│ │ ├── pairs.txt
│ │ └── ig_explanations/
│ │ ├── same_orig_1.png
│ │ ├── same_orig_1.abs.png
│ │ ├── same_orig_1.attr.npy
│ │ └── same_orig_1.fbank.npy
│ └── ig_run2/ ... (same layout)
└── README.md
Note: Large, derived artifacts (e.g., `results/**`) are typically excluded from Git; keep `run_config.json`, `run.log`, and small text files (pairs) for reproducibility.
- Python ≥ 3.10 (tested with 3.11)
- PyTorch + torchaudio (CUDA optional)
- SpeechBrain, Captum, scikit‑learn, tqdm, matplotlib, soundfile
- LibriSpeech accessible locally (e.g., `/speechdat/LibriSpeech/LibriSpeech`)
Create the environment:
conda create -n xai_lrp python=3.11 -y
conda activate xai_lrp
# Pick the correct PyTorch/torchaudio wheels for your system (CPU or CUDA)
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu121
# Core libs
pip install speechbrain==0.5.15 captum==0.7.0 librosa==0.10.2.post1 scikit-learn tqdm matplotlib soundfile

- Index LibriSpeech (e.g., `test-clean`) and gather speakers with ≥3 utterances.
- Build trials for each run:
- same‑speaker pairs (enroll vs two other utterances of the same speaker)
- different‑speaker pairs (enroll vs a different speaker)
- for both original and anonymized (+4 st) versions (anonymization applied to test only)
- Score with ECAPA embeddings + cosine; compute EER split by original vs anonymized.
- Explain selected trials with Integrated Gradients on the model's log‑Mel feature pipeline.
- Save per‑run artifacts under `results/<run_name>/`.
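The EER step above can be sketched with scikit-learn's ROC utilities (this midpoint convention at the FAR ≈ FRR crossing is one common choice, not necessarily the notebook's exact implementation):

```python
import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(labels, scores):
    """Equal Error Rate: operating point where false-accept rate equals
    false-reject rate. labels: 1 = same speaker, 0 = different speaker;
    scores: cosine similarities."""
    fpr, tpr, thresholds = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))     # point closest to FAR == FRR
    eer = (fpr[idx] + fnr[idx]) / 2
    return eer, thresholds[idx]

# Toy trials: perfectly separated scores give EER = 0
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.2, 0.1, 0.0]
eer, thr = compute_eer(labels, scores)
print(f"EER = {eer:.2%} @ thr = {thr:.3f}")
```

Running this separately for the original and anonymized trial subsets yields the two EER lines shown in the example output.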
Open `scripts/LibriSpeech_ECAPA_TDNN.ipynb` and set the parameters near the top (adjust paths as needed):
DATA_ROOT = "/speechdat/LibriSpeech/LibriSpeech"
SUBSET = "test-clean"
DEVICE = "auto" # or "cuda" / "cpu"
MAX_SPEAKERS = 40
IG_STEPS = 64
EXPLAIN_PER_SPLIT = 2

Then run the multi‑run cell, preconfigured to produce two runs with different speakers (disjoint cohorts), e.g.:
# Example: two runs with disjoint speakers, both using +4 semitone anonymization
spk2utt = index_librispeech(DATA_ROOT, SUBSET, min_utts_per_spk=3)
classifier = load_ecapa(device)
# Split into two non‑overlapping sets
all_spks = sorted(spk2utt.keys())
mid = len(all_spks) // 2
spk2utt_1 = {s: spk2utt[s] for s in all_spks[:mid]}
spk2utt_2 = {s: spk2utt[s] for s in all_spks[mid:]}
run_experiment(name="ig_run1", spk2utt=spk2utt_1, classifier=classifier,
data_root=DATA_ROOT, subset=SUBSET, device=device,
max_speakers=min(MAX_SPEAKERS, len(spk2utt_1)), seed=101,
pitch_steps=4, ig_steps=IG_STEPS, internal_bs=16,
explain_per_split=EXPLAIN_PER_SPLIT, smooth=True,
nt_type="smoothgrad_sq", stdevs=0.02, nt_samples=8)
run_experiment(name="ig_run2", spk2utt=spk2utt_2, classifier=classifier,
data_root=DATA_ROOT, subset=SUBSET, device=device,
max_speakers=min(MAX_SPEAKERS, len(spk2utt_2)), seed=202,
pitch_steps=4, ig_steps=IG_STEPS, internal_bs=16,
explain_per_split=EXPLAIN_PER_SPLIT, smooth=True,
nt_type="smoothgrad_sq", stdevs=0.02, nt_samples=8)

Each run writes to `results/ig_run1/` and `results/ig_run2/` respectively.
Console and log output is limited to the key summaries, for example:
[ig_run1] speakers=40 trials=120
[ig_run1] Original: N=120 pos=80 neg=40
[ig_run1] Anonymized: N=120 pos=80 neg=40
[ig_run1] Original EER: 3.12% @ thr=0.301
[ig_run1] Anonymized (+4 st) EER: 30.63% @ thr=0.107
Tab‑separated trials (paths are relative to DATA_ROOT):
enroll_path test_path label anonymized
test-clean/1089/.../0000.flac test-clean/1089/.../0001.flac 1 0 # same speaker, original
test-clean/1089/.../0000.flac test-clean/1089/.../0001.flac 1 1 # same speaker, anonymized (test)
test-clean/1089/.../0000.flac test-clean/1188/.../0001.flac 0 0 # different speakers, original
test-clean/1089/.../0000.flac test-clean/1188/.../0001.flac 0 1 # different speakers, anonymized (test)
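Reading such a pairs file back can be sketched as follows (the field order matches the header above; `parse_pairs` is a hypothetical helper, not part of the notebook):

```python
def parse_pairs(lines):
    """Parse tab-separated trial lines into dicts, skipping the header,
    blank lines, and trailing '#' comments."""
    trials = []
    for line in lines:
        line = line.split("#")[0].strip()          # drop trailing comments
        if not line or line.startswith("enroll_path"):
            continue                               # skip header and blanks
        enroll, test, label, anon = line.split("\t")
        trials.append({"enroll": enroll, "test": test,
                       "label": int(label), "anonymized": int(anon)})
    return trials

rows = [
    "enroll_path\ttest_path\tlabel\tanonymized",
    "test-clean/1089/a/0000.flac\ttest-clean/1089/a/0001.flac\t1\t0",
    "test-clean/1089/a/0000.flac\ttest-clean/1188/b/0001.flac\t0\t1",
]
trials = parse_pairs(rows)
```

Splitting the parsed trials by the `anonymized` flag reproduces the original-vs-anonymized EER comparison.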
- Top panel — log‑Mel spectrogram of the test utterance (what ECAPA sees).
- Bottom panel — Integrated Gradients attribution for the cosine score vs the enroll embedding:
warmer (positive) supports “same speaker,” cooler (negative) supports “different.”
Companion `*.abs.png` files show |IG| (importance magnitude only).
Plots can optionally be standardized to a fixed width for comparison; underlying arrays remain full‑length.
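The Integrated Gradients idea behind these heatmaps can be illustrated on a toy function with a known gradient (pure NumPy; the notebook itself uses Captum's implementation on the log‑Mel pipeline):

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline, steps=64):
    """Riemann-sum approximation of Integrated Gradients:
    IG_i = (x_i - x'_i) * mean_k grad_i(x' + alpha_k * (x - x'))."""
    alphas = (np.arange(steps) + 0.5) / steps      # midpoint rule on [0, 1]
    total = np.zeros_like(x)
    for a in alphas:
        total += grad_fn(baseline + a * (x - baseline))
    return (x - baseline) * total / steps

# Toy "score" f(x) = x0 * x1 with analytic gradient [x1, x0]
f = lambda x: x[0] * x[1]
grad = lambda x: np.array([x[1], x[0]])

x, x0 = np.array([2.0, 3.0]), np.zeros(2)
attr = integrated_gradients(grad, x, x0)

# Completeness axiom: attributions sum to f(x) - f(baseline)
print(attr, attr.sum(), f(x) - f(x0))
```

In the notebook the role of `f` is played by the cosine score against the enroll embedding, the inputs are log‑Mel frames, and the signed attributions are what the heatmaps visualize.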
- Each run saves a `run_config.json` with parameters (paths, seeds, IG settings).
- A file‑only logger writes `run.log` with summary lines (INFO). Noisy per‑image messages are logged at DEBUG and omitted from the file (but still printed to the console).
- The +4 semitone anonymization is applied to the test side only, when `anonymized=1`.
- IG is computed on the model's log‑Mel feature representation for faithful attributions.
- Expect some negative attributions in voiced regions (due to normalization/attention coupling).
- EER will vary slightly across runs because speaker cohorts differ by design.
- SpeechBrain — Mirco Ravanelli et al., SpeechBrain: A General-Purpose Speech Toolkit.
- ECAPA‑TDNN — Desplanques et al., Emphasized Channel Attention, Propagation and Aggregation in TDNN based Speaker Verification.
- Captum — Kokhlikyan et al., Captum: A Model Interpretability Library for PyTorch.