
AI Voice Detection System

A state-of-the-art deep learning system for detecting AI-generated voice content across multiple Indian languages. Built using Facebook's Wav2Vec2-XLSR-53 model with fine-tuning for multilingual voice authentication.

Overview

This project addresses the growing challenge of deepfake audio detection by providing a robust solution capable of distinguishing between human and AI-generated speech in five major Indian languages: Tamil, English, Hindi, Malayalam, and Telugu.

Key Features

  • Multilingual Support: Tamil, English, Hindi, Malayalam, and Telugu
  • High Accuracy: Fine-tuned Wav2Vec2-XLSR-53 model with advanced training strategies
  • Real-time Inference: FastAPI-based REST API for production deployment
  • Docker Ready: Containerized deployment with Docker support
  • Audio Augmentation: Robust training with noise injection, pitch shifting, and time stretching
  • Chunked Processing: Handles variable-length audio files efficiently
  • Confidence Scoring: Provides detailed confidence metrics and explanations

Architecture

AI Voice Detection System
├── Training Pipeline
│   ├── train_wav2vec2.py    # Main training script
│   └── encode.py            # Audio encoding utilities
└── Production API
    ├── ai-voice-container/
    │   ├── app.py           # FastAPI inference server
    │   ├── Dockerfile       # Container configuration
    │   └── requirements.txt # Dependencies

Quick Start

Prerequisites

  • Python 3.10+
  • PyTorch 1.12+
  • CUDA (optional, for GPU acceleration)
  • Docker (for containerized deployment)

Installation

  1. Clone the repository
git clone <repository-url>
cd AI-voice-detection
  2. Install dependencies
pip install -r ai-voice-container/requirements.txt
  3. Download the pre-trained model
# The model is downloaded automatically from the Hugging Face Hub
# Model: kripasree/ai-voice-detector

Running the API Server

Option 1: Direct Python Execution

cd ai-voice-container
uvicorn app:app --host 0.0.0.0 --port 8000

Option 2: Docker Deployment

cd ai-voice-container
docker build -t ai-voice-detector .
docker run -p 8000:8000 ai-voice-detector

Model Performance

Training Configuration

  • Base Model: facebook/wav2vec2-large-xlsr-53
  • Training Strategy: Two-phase fine-tuning (frozen + unfrozen)
  • Batch Size: 1 (with gradient accumulation of 8)
  • Learning Rate: 2e-6 (initial), 1e-6 (fine-tuning)
  • Epochs: 7 (initial) + 2 (fine-tuning)
  • Audio Length: Up to 7 seconds per sample
  • Sampling Rate: 16kHz

Data Augmentation Techniques

  • Random gain adjustment (0.8x - 1.2x)
  • Noise injection (Gaussian noise, σ=0.003)
  • Pitch shifting (±2 semitones)
  • Time stretching (0.9x - 1.1x speed)
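
The gain, noise, and stretch augmentations above can be sketched with NumPy alone (a dependency-light approximation; the actual pipeline in `train_wav2vec2.py` presumably uses librosa for pitch shifting and time stretching, which this sketch omits or approximates):

```python
import numpy as np

def augment(wav: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Apply random gain, Gaussian noise, and a crude time stretch."""
    # Random gain adjustment (0.8x - 1.2x)
    wav = wav * rng.uniform(0.8, 1.2)
    # Gaussian noise injection (sigma = 0.003)
    wav = wav + rng.normal(0.0, 0.003, size=wav.shape)
    # Crude time stretch (0.9x - 1.1x speed) via linear resampling
    rate = rng.uniform(0.9, 1.1)
    n_out = int(len(wav) / rate)
    wav = np.interp(np.linspace(0.0, len(wav) - 1, n_out),
                    np.arange(len(wav)), wav)
    return wav.astype(np.float32)
```

Applying augmentations on the fly during training, rather than precomputing them, lets every epoch see a slightly different version of each clip.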

API Usage

Endpoint

POST /
Content-Type: application/json

Request Format

{
  "language": "english",
  "audioBase64": "base64_encoded_audio_data"
}

Response Format

{
  "status": "success",
  "language": "english",
  "classification": "AI_GENERATED",
  "confidenceScore": 0.8945,
  "explanation": "Synthetic patterns detected"
}
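
A minimal client for this endpoint can be sketched with only the standard library (the URL and audio path below are placeholders; `encode.py` in this repo provides similar base64 helpers):

```python
import base64
import json
from urllib import request

def build_payload(audio_path: str, language: str) -> dict:
    """Base64-encode an audio file into the request format above."""
    with open(audio_path, "rb") as f:
        audio_b64 = base64.b64encode(f.read()).decode("ascii")
    return {"language": language, "audioBase64": audio_b64}

def detect(audio_path: str, language: str,
           url: str = "http://localhost:8000/") -> dict:
    """POST the payload and return the parsed JSON response."""
    body = json.dumps(build_payload(audio_path, language)).encode("utf-8")
    req = request.Request(url, data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)
```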

Classification Threshold

  • AI Detection Threshold: 0.75
  • Output Classes: HUMAN or AI_GENERATED
  • Confidence Range: 0.0 - 1.0
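
The threshold logic reduces to a simple comparison (a sketch; the server's actual response assembly lives in `app.py`):

```python
AI_THRESHOLD = 0.75

def classify(ai_probability: float, threshold: float = AI_THRESHOLD) -> str:
    """Map the model's AI-class probability to an output label."""
    return "AI_GENERATED" if ai_probability >= threshold else "HUMAN"
```

A threshold above 0.5 trades some recall for precision, which suits a verification setting where false AI accusations are costlier than misses.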

Training Pipeline

Dataset Structure

HCL-DATA/
├── human/
│   ├── tamil/clips/
│   ├── english/clips/
│   ├── hindi/clips/
│   ├── malayalam/clips/
│   └── telugu/clips/
└── ai_generated/
    ├── tamil/
    ├── english/
    ├── hindi/
    ├── malayalam/
    └── telugu/

Training Process

  1. Data Collection: Balanced sampling (1500 samples per language per class)
  2. Preprocessing: Audio loading, cropping, and augmentation
  3. Feature Extraction: Wav2Vec2 feature extraction at 16kHz
  4. Model Training: Two-phase fine-tuning approach
  5. Evaluation: 90/10 train-test split with epoch-wise evaluation
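
The two-phase fine-tuning schedule can be sketched with a stand-in model (illustrative only; `train_wav2vec2.py` applies this pattern to the Wav2Vec2 encoder and its classification head):

```python
import torch
from torch import nn

def set_trainable(module: nn.Module, trainable: bool) -> None:
    """Freeze or unfreeze every parameter in a module."""
    for p in module.parameters():
        p.requires_grad = trainable

# Stand-in for the wav2vec2 encoder plus classification head.
encoder = nn.Linear(512, 256)
head = nn.Linear(256, 2)
model = nn.Sequential(encoder, nn.ReLU(), head)

# Phase 1: frozen encoder, train only the head (7 epochs, lr=2e-6).
set_trainable(encoder, False)
opt = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=2e-6)

# Phase 2: unfreeze everything and fine-tune end to end (2 epochs, lr=1e-6).
set_trainable(encoder, True)
opt = torch.optim.AdamW(model.parameters(), lr=1e-6)
```

With a batch size of 1 and gradient accumulation over 8 steps, the effective batch size during both phases is 8.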

Running Training

python train_wav2vec2.py

Note: Update the BASE path in train_wav2vec2.py to point to your dataset location.

Development

Project Structure

  • train_wav2vec2.py: Complete training pipeline with data augmentation
  • encode.py: Audio encoding utilities for base64 conversion
  • ai-voice-container/app.py: Production-ready FastAPI inference server
  • ai-voice-container/Dockerfile: Container configuration for deployment

Key Dependencies

  • torch: Deep learning framework
  • transformers: Hugging Face transformers library
  • librosa: Audio processing and analysis
  • fastapi: Modern web framework for APIs
  • uvicorn: ASGI server for FastAPI
  • numpy: Numerical computing
  • datasets: Hugging Face datasets library

Performance Metrics

The model achieves competitive performance across all supported languages with:

  • High Precision: Minimal false positives in AI detection
  • Robust Generalization: Handles various audio qualities and conditions
  • Real-time Processing: Sub-second inference for typical audio clips
  • Scalable Architecture: Designed for production workloads

Security & Ethics

  • Privacy: No audio data is stored permanently
  • Bias Mitigation: Balanced dataset across languages and demographics
  • Responsible AI: Designed for legitimate content verification purposes

Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • Facebook AI for the Wav2Vec2-XLSR-53 model
  • Hugging Face for the transformers library and model hub
  • HCL for providing the multilingual voice dataset

Contact

For questions, suggestions, or collaborations, please reach out through the project issues or contact channels.


Disclaimer: This tool is designed for legitimate content verification and research purposes. Users are responsible for complying with applicable laws and regulations when using this system.