# AI Voice Detection System

A deep learning system for detecting AI-generated voice content across multiple Indian languages, built by fine-tuning Facebook's Wav2Vec2-XLSR-53 model for multilingual voice authentication.

## Overview

This project addresses the growing challenge of deepfake audio detection by providing a robust solution capable of distinguishing between human and AI-generated speech in five major Indian languages: Tamil, English, Hindi, Malayalam, and Telugu.

## Key Features

- **Multilingual Support**: Five supported languages (Tamil, English, Hindi, Malayalam, Telugu)
- **High Accuracy**: Fine-tuned Wav2Vec2-XLSR-53 model with advanced training strategies
- **Real-time Inference**: FastAPI-based REST API for production deployment
- **Docker Ready**: Containerized deployment with Docker support
- **Audio Augmentation**: Robust training with noise injection, pitch shifting, and time stretching
- **Chunked Processing**: Handles variable-length audio files efficiently
- **Confidence Scoring**: Provides detailed confidence metrics and explanations

## Architecture

```
AI Voice Detection System
├── Training Pipeline
│   ├── train_wav2vec2.py    # Main training script
│   └── encode.py            # Audio encoding utilities
└── Production API
    └── ai-voice-container/
        ├── app.py           # FastAPI inference server
        ├── Dockerfile       # Container configuration
        └── requirements.txt # Dependencies
```

## Quick Start

### Prerequisites

- Python 3.10+
- PyTorch 1.12+
- CUDA (optional, for GPU acceleration)
- Docker (for containerized deployment)

### Installation

1. Clone the repository:

   ```bash
   git clone <repository-url>
   cd AI-voice-detection
   ```

2. Install the dependencies:

   ```bash
   pip install -r ai-voice-container/requirements.txt
   ```

3. Download the pre-trained model; no manual step is needed, as the `kripasree/ai-voice-detector` weights are downloaded automatically from the Hugging Face Hub on first run.

### Running the API Server

**Option 1: Direct Python execution**

```bash
cd ai-voice-container
uvicorn app:app --host 0.0.0.0 --port 8000
```

**Option 2: Docker deployment**

```bash
cd ai-voice-container
docker build -t ai-voice-detector .
docker run -p 8000:8000 ai-voice-detector
```

## Model Performance

### Training Configuration

- **Base Model**: `facebook/wav2vec2-large-xlsr-53`
- **Training Strategy**: Two-phase fine-tuning (frozen, then unfrozen)
- **Batch Size**: 1 (with gradient accumulation over 8 steps)
- **Learning Rate**: 2e-6 (initial phase), 1e-6 (fine-tuning phase)
- **Epochs**: 7 (initial) + 2 (fine-tuning)
- **Audio Length**: up to 7 seconds per sample
- **Sampling Rate**: 16 kHz
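The two-phase strategy above can be sketched in plain PyTorch. The `TinyClassifier` stand-in and the `set_backbone_trainable` helper are illustrative names only, not the project's actual code; the real model is a Wav2Vec2 encoder with a classification head.

```python
import torch.nn as nn

def set_backbone_trainable(model: nn.Module, backbone_name: str, trainable: bool) -> None:
    """Freeze or unfreeze every parameter of the named backbone submodule."""
    backbone = getattr(model, backbone_name)
    for param in backbone.parameters():
        param.requires_grad = trainable

class TinyClassifier(nn.Module):
    """Stand-in for the real Wav2Vec2 classifier: an encoder plus a head."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(16, 8)  # plays the role of the Wav2Vec2 encoder
        self.head = nn.Linear(8, 2)       # HUMAN vs AI_GENERATED logits

model = TinyClassifier()

# Phase 1: train only the head (lr = 2e-6) with the backbone frozen.
set_backbone_trainable(model, "backbone", False)
phase1_params = [p for p in model.parameters() if p.requires_grad]

# Phase 2: unfreeze everything and continue training at the lower lr = 1e-6.
set_backbone_trainable(model, "backbone", True)
phase2_params = [p for p in model.parameters() if p.requires_grad]
```

In each phase, only the parameters with `requires_grad=True` would be handed to the optimizer.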

### Data Augmentation Techniques

- Random gain adjustment (0.8x - 1.2x)
- Noise injection (Gaussian noise, σ = 0.003)
- Pitch shifting (±2 semitones)
- Time stretching (0.9x - 1.1x speed)
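The first two techniques can be sketched with NumPy alone; the `augment` helper below is hypothetical, and in a real pipeline pitch shifting and time stretching would typically use `librosa.effects`, which is omitted here to keep the sketch dependency-light.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(wave: np.ndarray) -> np.ndarray:
    """Apply random gain and additive Gaussian noise, per the list above."""
    gain = rng.uniform(0.8, 1.2)                 # random gain, 0.8x - 1.2x
    noise = rng.normal(0.0, 0.003, wave.shape)   # Gaussian noise, sigma = 0.003
    return wave * gain + noise

wave = np.zeros(16_000, dtype=np.float32)  # 1 second of silence at 16 kHz
out = augment(wave)
```

Augmentations like these are applied on the fly during training so the model never sees exactly the same waveform twice.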

## API Usage

### Endpoint

```http
POST /
Content-Type: application/json
```

### Request Format

```json
{
  "language": "english",
  "audioBase64": "base64_encoded_audio_data"
}
```

### Response Format

```json
{
  "status": "success",
  "language": "english",
  "classification": "AI_GENERATED",
  "confidenceScore": 0.8945,
  "explanation": "Synthetic patterns detected"
}
```
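A client can be sketched with the Python standard library alone. The `API_URL`, `build_payload`, and `classify` names are illustrative, and the URL assumes the local deployment from the Quick Start section.

```python
import base64
import json
import urllib.request

API_URL = "http://localhost:8000/"  # assumed local deployment

def build_payload(audio_path: str, language: str) -> dict:
    """Read an audio file and wrap it in the request format shown above."""
    with open(audio_path, "rb") as f:
        audio_b64 = base64.b64encode(f.read()).decode("ascii")
    return {"language": language, "audioBase64": audio_b64}

def classify(audio_path: str, language: str = "english") -> dict:
    """POST the payload and return the decoded JSON response."""
    body = json.dumps(build_payload(audio_path, language)).encode("utf-8")
    req = urllib.request.Request(
        API_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

With the server running, `classify("sample.wav", "hindi")` would return a dictionary shaped like the response above.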

### Classification Threshold

- **AI Detection Threshold**: 0.75
- **Output Classes**: `HUMAN` or `AI_GENERATED`
- **Confidence Range**: 0.0 - 1.0
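Applying the threshold amounts to a single comparison. The `classify_score` helper below is a sketch of that mapping; in particular, treating a score of exactly 0.75 as AI-generated is an assumption, not something the source specifies.

```python
def classify_score(confidence: float, threshold: float = 0.75) -> str:
    """Map a model confidence in [0.0, 1.0] to one of the two output classes."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0.0, 1.0]")
    # Assumption: the boundary value itself counts as AI_GENERATED.
    return "AI_GENERATED" if confidence >= threshold else "HUMAN"
```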

## Training Pipeline

### Dataset Structure

```
HCL-DATA/
├── human/
│   ├── tamil/clips/
│   ├── english/clips/
│   ├── hindi/clips/
│   ├── malayalam/clips/
│   └── telugu/clips/
└── ai_generated/
    ├── tamil/
    ├── english/
    ├── hindi/
    ├── malayalam/
    └── telugu/
```

### Training Process

1. **Data Collection**: Balanced sampling (1500 samples per language per class)
2. **Preprocessing**: Audio loading, cropping, and augmentation
3. **Feature Extraction**: Wav2Vec2 feature extraction at 16 kHz
4. **Model Training**: Two-phase fine-tuning approach
5. **Evaluation**: 90/10 train-test split with epoch-wise evaluation
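Steps 1 and 5 can be sketched together. The `balanced_split` helper and its parameters are illustrative, not the actual training code; it caps each (class, language) group at a fixed size, then shuffles and splits 90/10.

```python
import random

def balanced_split(paths_by_group, per_group=1500, test_frac=0.1, seed=42):
    """Cap each (class, language) group at `per_group` samples,
    then shuffle and split into train/test sets."""
    rng = random.Random(seed)
    pool = []
    for group, paths in paths_by_group.items():
        picked = rng.sample(paths, min(per_group, len(paths)))
        pool.extend((path, group) for path in picked)
    rng.shuffle(pool)
    n_test = int(len(pool) * test_frac)
    return pool[n_test:], pool[:n_test]  # (train, test)
```

Capping every group at the same size keeps classes and languages balanced even when the raw clip counts differ.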

### Running Training

```bash
python train_wav2vec2.py
```

**Note**: Update the `BASE` path in `train_wav2vec2.py` to point to your dataset location.

## Development

### Project Structure

- `train_wav2vec2.py`: Complete training pipeline with data augmentation
- `encode.py`: Audio encoding utilities for base64 conversion
- `ai-voice-container/app.py`: Production-ready FastAPI inference server
- `ai-voice-container/Dockerfile`: Container configuration for deployment

### Key Dependencies

- `torch`: Deep learning framework
- `transformers`: Hugging Face Transformers library
- `librosa`: Audio processing and analysis
- `fastapi`: Modern web framework for APIs
- `uvicorn`: ASGI server for FastAPI
- `numpy`: Numerical computing
- `datasets`: Hugging Face Datasets library

## Performance Metrics

The model achieves competitive performance across all supported languages, with:

- **High Precision**: Minimal false positives in AI detection
- **Robust Generalization**: Handles varied audio quality and recording conditions
- **Real-time Processing**: Sub-second inference for typical audio clips
- **Scalable Architecture**: Designed for production workloads

## Security & Ethics

- **Privacy**: No audio data is stored permanently
- **Bias Mitigation**: Balanced dataset across languages and demographics
- **Responsible AI**: Designed for legitimate content verification purposes

## Contributing

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

## License

This project is licensed under the MIT License; see the LICENSE file for details.

## Acknowledgments

- Facebook AI for the Wav2Vec2-XLSR-53 model
- Hugging Face for the Transformers library and model hub
- HCL for providing the multilingual voice dataset

## Contact

For questions, suggestions, or collaborations, please reach out through the project's issue tracker or the listed contact channels.

**Disclaimer**: This tool is designed for legitimate content verification and research purposes. Users are responsible for complying with applicable laws and regulations when using this system.
