This repository presents a custom GNN-based framework for audio anti-spoofing. The work is inspired by the original paper, AASIST: Audio Anti-Spoofing using Integrated Spectro-Temporal Graph Attention Networks, and is partially built upon the authors' implementation.
The core of this repository is a re-implementation of the proposed architectures using torch_geometric. We leverage its highly optimized dense graph modules (such as DenseGATConv), which serve as efficient counterparts to the graph-attention mechanisms from the original work. This approach not only allows for significantly faster training but also achieves better performance metrics within the same number of training epochs.
This repository provides an optimized implementation of the AASIST architecture. While adhering to the original design's principles for spectro-temporal analysis, this version leverages torch_geometric and custom modules for improved efficiency. The architecture comprises four key stages.
A RawNet2-based encoder processes the raw audio waveform to extract a high-level feature map, F. It consists of an initial sinc-convolution layer followed by six residual blocks with pre-activation. Our implementation directly models this, with the CONV and Residual_block classes corresponding to these components.
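The sinc-convolution layer at the front of a RawNet2-style encoder builds its kernels as parameterized band-pass filters rather than learning them freely. The sketch below shows the standard construction of one such band-pass kernel (the difference of two low-pass sinc filters, windowed); it uses NumPy for illustration and a hypothetical helper name, not the repository's actual CONV class.

```python
import numpy as np

def sinc_bandpass(f_low, f_high, kernel_size, sr=16000):
    """Band-pass FIR kernel as the difference of two low-pass sinc filters,
    the construction underlying a sinc-convolution layer.
    (Hypothetical helper for illustration, not the repository's CONV class.)"""
    # Time axis centered at zero, in seconds
    t = (np.arange(kernel_size) - (kernel_size - 1) / 2) / sr
    # Ideal low-pass filter with cutoff f (np.sinc is sin(pi*x)/(pi*x))
    lowpass = lambda f: 2 * f * np.sinc(2 * f * t)
    kernel = lowpass(f_high) - lowpass(f_low)
    # Window to reduce ripple from truncation
    return kernel * np.hamming(kernel_size)

k = sinc_bandpass(300.0, 3000.0, kernel_size=129)
print(k.shape)  # (129,)
```

In a sinc-convolution layer only the cutoff frequencies f_low and f_high are trainable, which is what keeps the first layer interpretable and parameter-efficient.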
From the encoder's feature map, two graphs are derived to model the spectral (G_s) and temporal (G_t) domains independently. This is achieved by applying a max operation across the temporal and spectral axes, respectively. The resulting node features, e_s and e_t in the code, are then processed by the subsequent graph modules.
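The graph-construction step above can be sketched as follows. This is a minimal NumPy stand-in (the repository itself uses torch tensors), with hypothetical shapes: a feature map F of shape (channels, spectral bins, time frames).

```python
import numpy as np

# Hypothetical feature map F: (channels, spectral bins, time frames)
F = np.arange(24, dtype=float).reshape(2, 3, 4)

# Spectral graph nodes: max over the temporal axis -> one node per spectral bin
e_s = F.max(axis=2).T   # shape (3, 2): 3 nodes with 2-dim features

# Temporal graph nodes: max over the spectral axis -> one node per time frame
e_t = F.max(axis=1).T   # shape (4, 2): 4 nodes with 2-dim features

print(e_s.shape, e_t.shape)  # (3, 2) (4, 2)
```

Max pooling (rather than averaging) keeps the most salient activation per frequency bin or time frame, which is the behavior the original paper relies on for artefact detection.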
The core of the architecture is the Max Graph Operation (MGO), a two-branch competitive mechanism designed to detect diverse spoofing artefacts in parallel.
- Paper's Design: Each MGO branch utilizes a novel Heterogeneous Stacking Graph Attention Layer (HS-GAL). The HS-GAL is designed to process the combined spectro-temporal graph using a specialized attention mechanism for heterogeneous node types and a "stack node" to aggregate cross-domain information.
- Implementation Nuances: Our implementation uses optimized modules from torch_geometric, such as DenseGATConv, for initial graph attention. The HS-GAL is realized as a custom DenseHeteroGAT module, which emulates the original design by employing distinct attention weights for intra- and inter-graph connections. It also models the stack node as an explicit master node parameter. The two-branch MGO structure is maintained, with torch.max combining the outputs to preserve the competitive learning aspect.
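The competitive combination of the two MGO branches reduces to an element-wise maximum over their node features. The following is a minimal NumPy stand-in for the torch.max call (node counts and feature sizes here are hypothetical):

```python
import numpy as np

# Hypothetical outputs of the two parallel MGO branches: 3 nodes, 4 features each
branch_a = np.array([[1., 5., 2., 0.],
                     [4., 1., 3., 2.],
                     [0., 0., 9., 1.]])
branch_b = np.array([[2., 3., 1., 7.],
                     [1., 6., 2., 2.],
                     [5., 0., 0., 4.]])

# Element-wise maximum: for each node feature, only the stronger branch's
# activation survives, which is what makes the two branches compete.
combined = np.maximum(branch_a, branch_b)
print(combined[0])  # [2. 5. 2. 7.]
```

Because gradients flow only through the winning branch at each position, each branch is pushed to specialize in the artefact types it detects best.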
The final stage aggregates features for classification. As described in the paper, this involves concatenating the node-wise maximum and average from both the spectral and temporal graphs, along with the final state of the master (stack) node. This combined embedding is then passed to a linear layer for the final spoofing detection decision.
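The readout described above can be sketched as follows, again in NumPy with hypothetical node counts and feature dimension (the repository performs the equivalent operations on torch tensors):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                      # hypothetical feature dimension
g_s = rng.random((3, d))   # spectral graph: 3 nodes
g_t = rng.random((5, d))   # temporal graph: 5 nodes
master = rng.random(d)     # final state of the master (stack) node

# Concatenate node-wise max and mean from each graph, plus the master node
embedding = np.concatenate([
    g_s.max(axis=0), g_s.mean(axis=0),
    g_t.max(axis=0), g_t.mean(axis=0),
    master,
])                         # shape: (5 * d,)

# Final linear layer producing two logits (bona fide vs. spoof)
W = rng.random((2, embedding.size))
b = np.zeros(2)
logits = W @ embedding + b
print(embedding.shape, logits.shape)  # (20,) (2,)
```

Combining max and mean pooling keeps both the strongest single activation and the overall activity level of each graph in the final embedding.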
A brief overview of the key files and directories in this repository.
.
├── LA/ # ASVspoof 2019 dataset (to be created by user)
├── config/ # Configuration files for the model and training
│ └── AASIST_GNN.conf
├── eval_results_gnn/ # Saved evaluation scores and reports
│ ├── scores_...txt
│ └── t-DCF_EER_...txt
├── models/ # Model architecture definitions
│ └── AASIST_GNN.py
├── model_weights/ # Stored model checkpoints (.pth files)
│ └── model_...pth
├── training_logs/ # Contains training history and visualizations
│ ├── training_history.csv
│ └── traing_graph.png
├── data_utils.py # Data loading and preprocessing utilities
├── evaluate_gnn.py # Script for evaluating a trained model
├── evaluation.py # Helper functions for calculating EER and t-DCF
├── train.py # The main script to run model training
├── utils.py # Utility functions (e.g., seed, optimizer)
├── requirements.txt # Project dependencies
└── readme.md # This file
To get started with this project, first ensure you have the necessary dependencies installed. The following instructions will guide you through setting up your environment and preparing the dataset.
It is recommended to create and activate a virtual environment to manage the project's dependencies, then install the required packages. Training was performed on a machine with 2x NVIDIA RTX 2080Ti GPUs (11GB of VRAM each), running NVIDIA driver 575. The training pipeline requires approximately 19GB of VRAM for efficient processing; a full training run of 30 epochs takes about 2 hours on this setup.
python -m venv env
source env/bin/activate
pip install -r requirements.txt
This project uses the ASVspoof 2019 Logical Access (LA) dataset, created for the third Automatic Speaker Verification Spoofing and Countermeasures Challenge. This repository focuses on the Logical Access (LA) partition, which covers text-to-speech (TTS) and voice conversion (VC) attacks; the 2019 challenge was the first edition to cover all three major attack types, with replay attacks handled in the separate Physical Access partition. The data is derived from the VCTK corpus and is split into training, development, and evaluation sets with no speaker overlap between them. A key feature of the evaluation set is the inclusion of "unknown attacks"—spoofing techniques not present in the training or development data—to rigorously test a model's ability to generalize. The challenge also introduced the tandem Detection Cost Function (t-DCF) as a new primary metric, aligning the assessment more closely with real-world automatic speaker verification performance. The project structure expects the data to be in a directory named LA/ in the project root.
The preferred method is to download and unzip it directly from your terminal:
curl -o ./LA.zip -# https://datashare.ed.ac.uk/bitstream/handle/10283/3336/LA.zip\?sequence\=3\&isAllowed\=y
unzip LA.zip
Alternatively, you can download the dataset from:
- Original Source: ASVspoof 2019 dataset page
- Kaggle Mirror: ASVpoof 2019 Dataset on Kaggle
To begin training the AASIST_GNN model, you can run the train.py script. This script will start a new training session from scratch.
python train.py
If you wish to resume training from a specific checkpoint, you can use the --model_path argument:
python train.py --model_path /path/to/your/model.pth
To evaluate a trained model checkpoint, use the evaluate_gnn.py script and provide the path to your model file.
python evaluate_gnn.py --model_path /path/to/your/model.pth
This will produce an evaluation file with the scores and calculate the Equal Error Rate (EER) and the minimum tandem Detection Cost Function (t-DCF).
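As a rough illustration of how the EER is derived from the score file, the sketch below finds the threshold at which the false acceptance and false rejection rates cross. This is a simplified NumPy version with made-up scores, not the exact routine in evaluation.py:

```python
import numpy as np

def compute_eer(bonafide_scores, spoof_scores):
    """EER: the error rate at the threshold where false acceptance
    (spoof scored as bona fide) equals false rejection (bona fide rejected)."""
    thresholds = np.sort(np.concatenate([bonafide_scores, spoof_scores]))
    far = np.array([(spoof_scores >= t).mean() for t in thresholds])
    frr = np.array([(bonafide_scores < t).mean() for t in thresholds])
    idx = np.argmin(np.abs(far - frr))      # closest crossing point
    return (far[idx] + frr[idx]) / 2

# Hypothetical scores: higher means "more likely bona fide"
bona = np.array([0.9, 0.8, 0.7, 0.6])
spoof = np.array([0.4, 0.3, 0.65, 0.1])
print(f"EER: {compute_eer(bona, spoof):.2%}")  # EER: 25.00%
```

The min t-DCF additionally weights the two error types by ASV-derived costs and priors, which is why it is reported alongside the EER.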
Here is a summary of the evaluation results for a sample model checkpoint.
- CM SYSTEM
  - EER: 2.20% (Equal Error Rate for the countermeasure)
- TANDEM
  - min t-DCF: 0.0713
- BREAKDOWN CM SYSTEM (EER by Attack Type)
  - A07: 2.23%
  - A08: 0.49%
  - A09: 0.02%
  - A10: 2.69%
  - A11: 0.65%
  - A12: 2.44%
  - A13: 0.43%
  - A14: 0.55%
  - A15: 1.47%
  - A16: 1.22%
  - A17: 2.38%
  - A18: 5.92%
  - A19: 1.85%
You can find the detailed training logs in training_logs/training_history.csv, along with the visualization script used to generate the graph below.
The graph shows the sliding-average training loss alongside the EER evaluated at several checkpoints during training.
A pre-trained model checkpoint, saved after 32 epochs of training, is available at:
model_weights/model_epoch_031_batch_0900.pth
You can find the corresponding evaluation results in the eval_results_gnn/ directory:
- eval_results_gnn/scores_model_epoch_031_batch_0900.txt - This file contains the raw model output scores, which you can use to compute any other custom metrics. The same results can be reproduced by running evaluate_gnn.py with the model_weights/model_epoch_031_batch_0900.pth checkpoint.
- eval_results_gnn/t-DCF_EER_model_epoch_031_batch_0900.txt - This file contains the detailed evaluation summary shown above.

