A state-of-the-art deep learning system for detecting AI-generated voice content across multiple Indian languages. Built using Facebook's Wav2Vec2-XLSR-53 model with fine-tuning for multilingual voice authentication.
This project addresses the growing challenge of deepfake audio detection by providing a robust solution capable of distinguishing between human and AI-generated speech in five major Indian languages: Tamil, English, Hindi, Malayalam, and Telugu.
- Multilingual Support: Supports 5 Indian languages (Tamil, English, Hindi, Malayalam, Telugu)
- High Accuracy: Fine-tuned Wav2Vec2-XLSR-53 model with advanced training strategies
- Real-time Inference: FastAPI-based REST API for production deployment
- Docker Ready: Containerized deployment with Docker support
- Audio Augmentation: Robust training with noise injection, pitch shifting, and time stretching
- Chunked Processing: Handles variable-length audio files efficiently
- Confidence Scoring: Provides detailed confidence metrics and explanations
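The chunked-processing feature above can be sketched as follows. The 7-second window at 16 kHz matches the training configuration listed later in this README, but the function names, zero-padding strategy, and score averaging are illustrative assumptions, not the repository's actual implementation.

```python
# Illustrative sketch of chunked audio processing (not the repo's actual code).
# Splits a waveform into fixed-length chunks so arbitrarily long files can be
# scored piece by piece; per-chunk scores are then averaged into one result.

SAMPLE_RATE = 16_000           # model's expected sampling rate
CHUNK_SECONDS = 7              # matches the training-time audio length
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS

def split_into_chunks(waveform, chunk_samples=CHUNK_SAMPLES):
    """Split a 1-D list of samples into chunks, zero-padding the last one."""
    chunks = []
    for start in range(0, len(waveform), chunk_samples):
        chunk = waveform[start:start + chunk_samples]
        if len(chunk) < chunk_samples:
            chunk = chunk + [0.0] * (chunk_samples - len(chunk))
        chunks.append(chunk)
    return chunks

def aggregate_scores(per_chunk_scores):
    """Combine per-chunk AI-probabilities into one file-level score."""
    return sum(per_chunk_scores) / len(per_chunk_scores)
```

Averaging is one simple aggregation; taking the maximum chunk score is a stricter alternative when a single synthetic segment should flag the whole file.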
AI Voice Detection System
├── Training Pipeline
│ ├── train_wav2vec2.py # Main training script
│ └── encode.py # Audio encoding utilities
└── Production API
├── ai-voice-container/
│ ├── app.py # FastAPI inference server
│ ├── Dockerfile # Container configuration
│ └── requirements.txt # Dependencies
- Python 3.10+
- PyTorch 1.12+
- CUDA (optional, for GPU acceleration)
- Docker (for containerized deployment)
- Clone the repository

  ```bash
  git clone <repository-url>
  cd AI-voice-detection
  ```

- Install dependencies

  ```bash
  pip install -r ai-voice-container/requirements.txt
  ```

- Download the pre-trained model

  ```bash
  # The model will be automatically downloaded from Hugging Face Hub
  # Model: kripasree/ai-voice-detector
  ```

Run the API locally:

```bash
cd ai-voice-container
uvicorn app:app --host 0.0.0.0 --port 8000
```

Or build and run with Docker:

```bash
cd ai-voice-container
docker build -t ai-voice-detector .
docker run -p 8000:8000 ai-voice-detector
```

- Base Model: facebook/wav2vec2-large-xlsr-53
- Training Strategy: Two-phase fine-tuning (frozen + unfrozen)
- Batch Size: 1 (with gradient accumulation of 8)
- Learning Rate: 2e-6 (initial), 1e-6 (fine-tuning)
- Epochs: 7 (initial) + 2 (fine-tuning)
- Audio Length: Up to 7 seconds per sample
- Sampling Rate: 16kHz
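The two-phase strategy above (train on a frozen backbone first, then unfreeze and fine-tune at a lower learning rate) can be sketched in plain PyTorch. The tiny stand-in model and helper names below are illustrative assumptions, not the actual train_wav2vec2.py code; the epoch counts and learning rates come from the configuration listed above.

```python
import torch
import torch.nn as nn

# Tiny stand-in for the Wav2Vec2 encoder + classifier head (illustrative only).
class VoiceClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(64, 32), nn.ReLU())  # pretend encoder
        self.head = nn.Linear(32, 2)                                 # HUMAN vs AI_GENERATED

    def forward(self, x):
        return self.head(self.backbone(x))

def set_backbone_frozen(model, frozen):
    """Toggle gradient flow through the pretrained backbone."""
    for p in model.backbone.parameters():
        p.requires_grad = not frozen

model = VoiceClassifier()

# Phase 1: freeze the pretrained backbone, train only the head (7 epochs, lr=2e-6).
set_backbone_frozen(model, True)
opt1 = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=2e-6)

# Phase 2: unfreeze everything, fine-tune end to end (2 epochs, lr=1e-6).
set_backbone_frozen(model, False)
opt2 = torch.optim.AdamW(model.parameters(), lr=1e-6)
```

Freezing first lets the randomly initialized head settle without disturbing pretrained features; the second phase then adapts the whole network gently at a reduced learning rate.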
- Random gain adjustment (0.8x - 1.2x)
- Noise injection (Gaussian noise, σ=0.003)
- Pitch shifting (±2 semitones)
- Time stretching (0.9x - 1.1x speed)
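A dependency-free sketch of the augmentations above; the function names and pure-Python style are assumptions (the project itself likely uses numpy/librosa). Pitch shifting is omitted here because doing it properly requires a phase vocoder, e.g. `librosa.effects.pitch_shift`.

```python
import random

def random_gain(samples, low=0.8, high=1.2, rng=random):
    """Scale the waveform by a random factor in [0.8, 1.2]."""
    g = rng.uniform(low, high)
    return [s * g for s in samples]

def add_gaussian_noise(samples, sigma=0.003, rng=random):
    """Inject zero-mean Gaussian noise with sigma = 0.003."""
    return [s + rng.gauss(0.0, sigma) for s in samples]

def time_stretch(samples, rate):
    """Naive linear-interpolation resampling: rate > 1 shortens (speeds up)."""
    n_out = int(len(samples) / rate)
    out = []
    for i in range(n_out):
        pos = i * rate
        j = int(pos)
        frac = pos - j
        a = samples[min(j, len(samples) - 1)]
        b = samples[min(j + 1, len(samples) - 1)]
        out.append(a * (1 - frac) + b * frac)
    return out
```

Note that naive resampling also shifts pitch; `librosa.effects.time_stretch` preserves pitch via a phase vocoder, which is the usual choice for speech augmentation.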
POST /
Content-Type: application/json

Request body:

```json
{
  "language": "english",
  "audioBase64": "base64_encoded_audio_data"
}
```

Response:

```json
{
  "status": "success",
  "language": "english",
  "classification": "AI_GENERATED",
  "confidenceScore": 0.8945,
  "explanation": "Synthetic patterns detected"
}
```

- AI Detection Threshold: 0.75
- Output Classes: `HUMAN` or `AI_GENERATED`
- Confidence Range: 0.0 - 1.0
HCL-DATA/
├── human/
│ ├── tamil/clips/
│ ├── english/clips/
│ ├── hindi/clips/
│ ├── malayalam/clips/
│ └── telugu/clips/
└── ai_generated/
├── tamil/
├── english/
├── hindi/
├── malayalam/
└── telugu/
- Data Collection: Balanced sampling (1500 samples per language per class)
- Preprocessing: Audio loading, cropping, and augmentation
- Feature Extraction: Wav2Vec2 feature extraction at 16kHz
- Model Training: Two-phase fine-tuning approach
- Evaluation: 90/10 train-test split with epoch-wise evaluation
```bash
python train_wav2vec2.py
```

Note: Update the BASE path in train_wav2vec2.py to point to your dataset location.
- train_wav2vec2.py: Complete training pipeline with data augmentation
- encode.py: Audio encoding utilities for base64 conversion
- ai-voice-container/app.py: Production-ready FastAPI inference server
- ai-voice-container/Dockerfile: Container configuration for deployment
- torch: Deep learning framework
- transformers: Hugging Face transformers library
- librosa: Audio processing and analysis
- fastapi: Modern web framework for APIs
- uvicorn: ASGI server for FastAPI
- numpy: Numerical computing
- datasets: Hugging Face datasets library
The model achieves competitive performance across all supported languages with:
- High Precision: Minimal false positives in AI detection
- Robust Generalization: Handles various audio qualities and conditions
- Real-time Processing: Sub-second inference for typical audio clips
- Scalable Architecture: Designed for production workloads
- Privacy: No audio data is stored permanently
- Bias Mitigation: Balanced dataset across languages and demographics
- Responsible AI: Designed for legitimate content verification purposes
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Facebook AI for the Wav2Vec2-XLSR-53 model
- Hugging Face for the transformers library and model hub
- HCL for providing the multilingual voice dataset
For questions, suggestions, or collaborations, please reach out through the project issues or contact channels.
Disclaimer: This tool is designed for legitimate content verification and research purposes. Users are responsible for complying with applicable laws and regulations when using this system.