
This repository contains a torch-geometric implementation of the AASIST framework for audio anti-spoofing, featuring optimized GNN modules for significantly faster training and enhanced performance.


This repository presents a custom GNN-based framework for audio anti-spoofing. The work is inspired by the original paper, AASIST: Audio Anti-Spoofing using Integrated Spectro-Temporal Graph Attention Networks, and is partially built upon the authors' implementation.

The core of this repository is the re-implementation of the proposed architectures using torch-geometric. We leverage its optimized dense graph modules (such as DenseGATConv), which provide explicit, efficient implementations of the graph-attention mechanisms from the original work. This not only allows for significantly faster training but also achieves better performance metrics within the same number of training epochs.


Model Architecture


This repository provides an optimized implementation of the AASIST architecture. While adhering to the original design's principles for spectro-temporal analysis, this version leverages torch_geometric and custom modules for improved efficiency. The architecture comprises four key stages.

1. Waveform Encoder

A RawNet2-based encoder processes the raw audio waveform to extract a high-level feature map, F. It consists of an initial sinc-convolution layer followed by six residual blocks with pre-activation. Our implementation directly models this, with the CONV and Residual_block classes corresponding to these components.
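For intuition, here is a minimal pre-activation residual block in the spirit of the Residual_block class; the kernel sizes, activation, and channel counts are illustrative assumptions, not the exact configuration in models/AASIST_GNN.py:

import torch
import torch.nn as nn

class PreActResBlock(nn.Module):
    # Pre-activation: normalization and activation come before each convolution.
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(in_ch), nn.SELU(),
            nn.Conv2d(in_ch, out_ch, kernel_size=(2, 3), padding=(1, 1)),
            nn.BatchNorm2d(out_ch), nn.SELU(),
            nn.Conv2d(out_ch, out_ch, kernel_size=(2, 3), padding=(0, 1)),
        )
        # 1x1 projection so the skip connection matches the body's channel count
        self.skip = nn.Conv2d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()
        self.pool = nn.MaxPool2d((1, 3))  # downsample along the temporal axis

    def forward(self, x):  # x: (batch, channels, freq, time)
        return self.pool(self.body(x) + self.skip(x))

out = PreActResBlock(1, 32)(torch.randn(8, 1, 64, 300))  # -> (8, 32, 64, 100)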

2. Dual Graph Formulation

From the encoder's feature map, two graphs are derived to model the spectral (G_s) and temporal (G_t) domains independently. This is achieved by applying a max operation across the temporal and spectral axes, respectively. The resulting node features, e_s and e_t in the code, are then processed by the subsequent graph modules.
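In tensor terms, the formulation reduces to two max-pooling calls over the encoder output; a minimal sketch with illustrative shapes:

import torch

F_map = torch.randn(8, 64, 23, 29)  # encoder feature map F: (batch, channels, spectral, temporal)
e_s = F_map.max(dim=3).values       # max over the temporal axis -> spectral nodes: (8, 64, 23)
e_t = F_map.max(dim=2).values       # max over the spectral axis -> temporal nodes: (8, 64, 29)
e_s = e_s.transpose(1, 2)           # (batch, num_nodes, features) for the graph layers
e_t = e_t.transpose(1, 2)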

3. Graph Processing

The core of the architecture is the Max Graph Operation (MGO), a two-branch competitive mechanism designed to detect diverse spoofing artefacts in parallel.

  • Paper's Design: Each MGO branch utilizes a novel Heterogeneous Stacking Graph Attention Layer (HS-GAL). The HS-GAL is designed to process the combined spectro-temporal graph using a specialized attention mechanism for heterogeneous node types and a "stack node" to aggregate cross-domain information.
  • Implementation Nuances: Our implementation uses optimized modules from torch_geometric, such as DenseGATConv, for initial graph attention. The HS-GAL is realized as a custom DenseHeteroGAT module, which emulates the original design by employing distinct attention weights for intra- and inter-graph connections. It also models the stack node as an explicit master-node parameter. The two-branch MGO structure is maintained, with torch.max combining the outputs to preserve the competitive learning aspect, as sketched below.
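A minimal sketch of the two-branch competition, using plain DenseGATConv for both branches (the repository's DenseHeteroGAT additionally implements the heterogeneous attention and master node; the graph sizes and fully connected adjacency below are illustrative assumptions):

import torch
from torch_geometric.nn import DenseGATConv

class TwoBranchMGO(torch.nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.branch1 = DenseGATConv(in_ch, out_ch)
        self.branch2 = DenseGATConv(in_ch, out_ch)

    def forward(self, x, adj):  # x: (batch, nodes, features), adj: (batch, nodes, nodes)
        h1 = self.branch1(x, adj)
        h2 = self.branch2(x, adj)
        return torch.max(h1, h2)  # element-wise max: the branches compete per feature

x = torch.randn(8, 52, 64)           # e.g. 23 spectral + 29 temporal nodes stacked
adj = torch.ones(8, 52, 52)          # fully connected adjacency (an assumption)
out = TwoBranchMGO(64, 64)(x, adj)   # -> (8, 52, 64)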

4. Readout and Classification

The final stage aggregates features for classification. As described in the paper, this involves concatenating the node-wise maximum and average from both the spectral and temporal graphs, along with the final state of the master (stack) node. This combined embedding is then passed to a linear layer for the final spoofing detection decision.
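A sketch of this readout under assumed feature sizes (the five concatenated pieces mirror the description above; names and dimensions are illustrative):

import torch
import torch.nn as nn

def readout(h_s, h_t, master):  # h_s: (B, Ns, D), h_t: (B, Nt, D), master: (B, D)
    return torch.cat([
        h_s.max(dim=1).values, h_s.mean(dim=1),  # spectral graph: max and average
        h_t.max(dim=1).values, h_t.mean(dim=1),  # temporal graph: max and average
        master,                                  # final state of the stack node
    ], dim=1)                                    # -> (B, 5 * D)

classifier = nn.Linear(5 * 64, 2)  # bona fide vs. spoof
logits = classifier(readout(torch.randn(8, 23, 64),
                            torch.randn(8, 29, 64),
                            torch.randn(8, 64)))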


Project Structure

A brief overview of the key files and directories in this repository.

.
├── LA/                   # ASVspoof 2019 dataset (to be created by user)
├── config/               # Configuration files for the model and training
│   └── AASIST_GNN.conf
├── eval_results_gnn/     # Saved evaluation scores and reports
│   ├── scores_...txt
│   └── t-DCF_EER_...txt
├── models/               # Model architecture definitions
│   └── AASIST_GNN.py
├── model_weights/        # Stored model checkpoints (.pth files)
│   └── model_...pth
├── training_logs/        # Contains training history and visualizations
│   ├── training_history.csv
│   └── traing_graph.png
├── data_utils.py         # Data loading and preprocessing utilities
├── evaluate_gnn.py       # Script for evaluating a trained model
├── evaluation.py         # Helper functions for calculating EER and t-DCF
├── train.py              # The main script to run model training
├── utils.py              # Utility functions (e.g., seed, optimizer)
├── requirements.txt      # Project dependencies
└── readme.md             # This file

Getting started

To get started with this project, first ensure you have the necessary dependencies installed. The following instructions will guide you through setting up your environment and preparing the dataset.

1. Installing Dependencies & Environment Setup

It is recommended to create and activate a virtual environment and then install the required packages. For reference, training was performed on a machine with 2x NVIDIA RTX 2080 Ti GPUs (11 GB of VRAM each) using NVIDIA driver 575. The training pipeline requires approximately 19 GB of VRAM, and a full 30-epoch run takes about 2 hours on this setup.

python -m venv env
source env/bin/activate
pip install -r requirements.txt

2. Data Preparation

This project uses the ASVspoof 2019 Logical Access (LA) dataset, created for the third Automatic Speaker Verification Spoofing and Countermeasures Challenge. ASVspoof 2019 was the first edition of the challenge to cover all three major attack types: text-to-speech (TTS), voice conversion (VC), and replay. The LA partition used here contains the TTS and VC attacks (replay attacks belong to the separate Physical Access partition). The data is derived from the VCTK corpus and is split into training, development, and evaluation sets with no speaker overlap between them.

A key feature of the evaluation set is the inclusion of "unknown attacks", spoofing techniques not present in the training or development data, to rigorously test a model's ability to generalize. The challenge also introduced the tandem detection cost function (t-DCF) as its new primary metric, aligning the assessment more closely with real-world automatic speaker verification performance.

The project structure expects the data to be in a directory named LA/ in the project root.

The preferred method is to download and unzip it directly from your terminal:

curl -o ./LA.zip -# https://datashare.ed.ac.uk/bitstream/handle/10283/3336/LA.zip\?sequence\=3\&isAllowed\=y
unzip LA.zip

Alternatively, you can download the dataset manually from the Edinburgh DataShare page: https://datashare.ed.ac.uk/handle/10283/3336


Usage

Training the Model

To begin training the AASIST_GNN model, you can run the train.py script. This script will start a new training session from scratch.

python train.py

If you wish to resume training from a specific checkpoint, you can use the --model_path argument:

python train.py --model_path /path/to/your/model.pth

Evaluating the Model

To evaluate a trained model checkpoint, use the evaluate_gnn.py script and provide the path to your model file.

python evaluate_gnn.py --model_path /path/to/your/model.pth

This will produce a score file and report the Equal Error Rate (EER) and the minimum tandem detection cost function (min t-DCF).
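For intuition, the EER is the operating point where the false acceptance and false rejection rates coincide. A minimal NumPy sketch of the computation (the repository's evaluation.py implements the official metrics, including min t-DCF):

import numpy as np

def compute_eer(bonafide_scores, spoof_scores):
    scores = np.concatenate([bonafide_scores, spoof_scores])
    labels = np.concatenate([np.ones_like(bonafide_scores),
                             np.zeros_like(spoof_scores)])
    labels = labels[np.argsort(scores)]  # sweep the threshold from low to high
    frr = np.cumsum(labels) / labels.sum()                 # bona fide rejected below threshold
    far = 1 - np.cumsum(1 - labels) / (1 - labels).sum()   # spoofs accepted above threshold
    idx = np.argmin(np.abs(frr - far))
    return (frr[idx] + far[idx]) / 2

eer = compute_eer(np.random.normal(2, 1, 1000), np.random.normal(-2, 1, 9000))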


Performance & Results

Results Summary

Here is a summary of the evaluation results for a sample model checkpoint.

  • CM system EER: 2.20% (equal error rate of the countermeasure)
  • Tandem system min t-DCF: 0.0713
  • CM system EER breakdown by attack type:

    • A07: 2.23%
    • A08: 0.49%
    • A09: 0.02%
    • A10: 2.69%
    • A11: 0.65%
    • A12: 2.44%
    • A13: 0.43%
    • A14: 0.55%
    • A15: 1.47%
    • A16: 1.22%
    • A17: 2.38%
    • A18: 5.92%
    • A19: 1.85%

Training History & Visualization

You can find the detailed training logs in training_logs/training_history.csv, along with the visualization script used to generate the graph below.

(Figure: training loss & metrics)

The graph shows the sliding-average loss together with the EER evaluated at several checkpoints during training.
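To re-plot the curve yourself, something along these lines should work (the "loss" column name and window size are assumptions; check the CSV header first):

import pandas as pd
import matplotlib.pyplot as plt

hist = pd.read_csv("training_logs/training_history.csv")
hist["loss"].rolling(window=50).mean().plot()  # sliding-average training loss
plt.xlabel("training step")
plt.ylabel("sliding-average loss")
plt.savefig("my_training_graph.png")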


Pre-trained Model & Result Files

A pre-trained model checkpoint, saved after 32 epochs of training, is available at:

  • model_weights/model_epoch_031_batch_0900.pth

You can find the corresponding evaluation results in the eval_results_gnn/ directory:

  • eval_results_gnn/scores_model_epoch_031_batch_0900.txt - This file contains the raw model output scores. You can use these to compute any other custom metrics. These same results can be achieved by running evaluate_gnn.py using the model_weights/model_epoch_031_batch_0900.pth model.

  • eval_results_gnn/t-DCF_EER_model_epoch_031_batch_0900.txt - This file contains the detailed evaluation summary shown above.
