A state-of-the-art deep learning system for detecting AI-generated voice content across multiple Indian languages. Built using Facebook's Wav2Vec2-XLSR-53 model with fine-tuning for multilingual voice authentication.
This project addresses the growing challenge of deepfake audio detection by providing a robust solution capable of distinguishing between human and AI-generated speech in five major Indian languages: Tamil, English, Hindi, Malayalam, and Telugu.
- Multilingual Support: Supports 5 Indian languages (Tamil, English, Hindi, Malayalam, Telugu)
- High Accuracy: Fine-tuned Wav2Vec2-XLSR-53 model with advanced training strategies
- Real-time Inference: FastAPI-based REST API for production deployment
- Docker Ready: Containerized deployment with Docker support
- Audio Augmentation: Robust training with noise injection, pitch shifting, and time stretching
- Chunked Processing: Handles variable-length audio files efficiently
- Confidence Scoring: Provides detailed confidence metrics and explanations
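The chunked-processing feature above can be sketched as follows. The 7-second window at 16 kHz matches the training configuration listed later in this README, but the function names, zero-padding strategy, and score averaging are illustrative assumptions, not the repository's actual implementation.

```python
# Illustrative sketch of chunked audio processing (not the repo's actual code).
# Splits a waveform into fixed-length chunks so arbitrarily long files can be
# scored piece by piece; per-chunk scores are then averaged into one result.

SAMPLE_RATE = 16_000           # model's expected sampling rate
CHUNK_SECONDS = 7              # matches the training-time audio length
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS

def split_into_chunks(waveform, chunk_samples=CHUNK_SAMPLES):
    """Split a 1-D list of samples into chunks, zero-padding the last one."""
    chunks = []
    for start in range(0, len(waveform), chunk_samples):
        chunk = waveform[start:start + chunk_samples]
        if len(chunk) < chunk_samples:
            chunk = chunk + [0.0] * (chunk_samples - len(chunk))
        chunks.append(chunk)
    return chunks

def aggregate_scores(per_chunk_scores):
    """Combine per-chunk AI-probabilities into one file-level score."""
    return sum(per_chunk_scores) / len(per_chunk_scores)
```

Averaging is one simple aggregation; taking the maximum chunk score is a stricter alternative when a single synthetic segment should flag the whole file.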
AI Voice Detection System
├── Training Pipeline
│ ├── train_wav2vec2.py # Main training script
│ └── encode.py # Audio encoding utilities
└── Production API
├── ai-voice-container/
│ ├── app.py # FastAPI inference server
│ ├── Dockerfile # Container configuration
│ └── requirements.txt # Dependencies
- Python 3.10+
- PyTorch 1.12+
- CUDA (optional, for GPU acceleration)
- Docker (for containerized deployment)
- Clone the repository

  ```bash
  git clone <repository-url>
  cd AI-voice-detection
  ```

- Install dependencies

  ```bash
  pip install -r ai-voice-container/requirements.txt
  ```

- Download the pre-trained model

  ```bash
  # The model will be automatically downloaded from Hugging Face Hub
  # Model: kripasree/ai-voice-detector
  ```

Run the API locally:

```bash
cd ai-voice-container
uvicorn app:app --host 0.0.0.0 --port 8000
```

Or build and run with Docker:

```bash
cd ai-voice-container
docker build -t ai-voice-detector .
docker run -p 8000:8000 ai-voice-detector
```

- Base Model: facebook/wav2vec2-large-xlsr-53
- Training Strategy: Two-phase fine-tuning (frozen + unfrozen)
- Batch Size: 1 (with gradient accumulation of 8)
- Learning Rate: 2e-6 (initial), 1e-6 (fine-tuning)
- Epochs: 7 (initial) + 2 (fine-tuning)
- Audio Length: Up to 7 seconds per sample
- Sampling Rate: 16kHz
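The two-phase strategy above (train on a frozen backbone first, then unfreeze and fine-tune at a lower learning rate) can be sketched in plain PyTorch. The tiny stand-in model and helper names below are illustrative assumptions, not the actual train_wav2vec2.py code; the epoch counts and learning rates come from the configuration listed above.

```python
import torch
import torch.nn as nn

# Tiny stand-in for the Wav2Vec2 encoder + classifier head (illustrative only).
class VoiceClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(64, 32), nn.ReLU())  # pretend encoder
        self.head = nn.Linear(32, 2)                                 # HUMAN vs AI_GENERATED

    def forward(self, x):
        return self.head(self.backbone(x))

def set_backbone_frozen(model, frozen):
    """Toggle gradient flow through the pretrained backbone."""
    for p in model.backbone.parameters():
        p.requires_grad = not frozen

model = VoiceClassifier()

# Phase 1: freeze the pretrained backbone, train only the head (7 epochs, lr=2e-6).
set_backbone_frozen(model, True)
opt1 = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=2e-6)

# Phase 2: unfreeze everything, fine-tune end to end (2 epochs, lr=1e-6).
set_backbone_frozen(model, False)
opt2 = torch.optim.AdamW(model.parameters(), lr=1e-6)
```

Freezing first lets the randomly initialized head settle without disturbing pretrained features; the second phase then adapts the whole network gently at a reduced learning rate.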
- Random gain adjustment (0.8x - 1.2x)
- Noise injection (Gaussian noise, σ=0.003)
- Pitch shifting (±2 semitones)
- Time stretching (0.9x - 1.1x speed)
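A dependency-free sketch of the augmentations above; the function names and pure-Python style are assumptions (the project itself likely uses numpy/librosa). Pitch shifting is omitted here because doing it properly requires a phase vocoder, e.g. `librosa.effects.pitch_shift`.

```python
import random

def random_gain(samples, low=0.8, high=1.2, rng=random):
    """Scale the waveform by a random factor in [0.8, 1.2]."""
    g = rng.uniform(low, high)
    return [s * g for s in samples]

def add_gaussian_noise(samples, sigma=0.003, rng=random):
    """Inject zero-mean Gaussian noise with sigma = 0.003."""
    return [s + rng.gauss(0.0, sigma) for s in samples]

def time_stretch(samples, rate):
    """Naive linear-interpolation resampling: rate > 1 shortens (speeds up)."""
    n_out = int(len(samples) / rate)
    out = []
    for i in range(n_out):
        pos = i * rate
        j = int(pos)
        frac = pos - j
        a = samples[min(j, len(samples) - 1)]
        b = samples[min(j + 1, len(samples) - 1)]
        out.append(a * (1 - frac) + b * frac)
    return out
```

Note that naive resampling also shifts pitch; `librosa.effects.time_stretch` preserves pitch via a phase vocoder, which is the usual choice for speech augmentation.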
POST /
Content-Type: application/json

Request body:

```json
{
  "language": "english",
  "audioBase64": "base64_encoded_audio_data"
}
```

Response:

```json
{
  "status": "success",
  "language": "english",
  "classification": "AI_GENERATED",
  "confidenceScore": 0.8945,
  "explanation": "Synthetic patterns detected"
}
```

- AI Detection Threshold: 0.75
- Output Classes: `HUMAN` or `AI_GENERATED`
- Confidence Range: 0.0 - 1.0
HCL-DATA/
├── human/
│ ├── tamil/clips/
│ ├── english/clips/
│ ├── hindi/clips/
│ ├── malayalam/clips/
│ └── telugu/clips/
└── ai_generated/
├── tamil/
├── english/
├── hindi/
├── malayalam/
└── telugu/
- Data Collection: Balanced sampling (1500 samples per language per class)
- Preprocessing: Audio loading, cropping, and augmentation
- Feature Extraction: Wav2Vec2 feature extraction at 16kHz
- Model Training: Two-phase fine-tuning approach
- Evaluation: 90/10 train-test split with epoch-wise evaluation
```bash
python train_wav2vec2.py
```

Note: Update the BASE path in train_wav2vec2.py to point to your dataset location.
- train_wav2vec2.py: Complete training pipeline with data augmentation
- encode.py: Audio encoding utilities for base64 conversion
- ai-voice-container/app.py: Production-ready FastAPI inference server
- ai-voice-container/Dockerfile: Container configuration for deployment
- torch: Deep learning framework
- transformers: Hugging Face transformers library
- librosa: Audio processing and analysis
- fastapi: Modern web framework for APIs
- uvicorn: ASGI server for FastAPI
- numpy: Numerical computing
- datasets: Hugging Face datasets library
The model achieves competitive performance across all supported languages with:
- High Precision: Minimal false positives in AI detection
- Robust Generalization: Handles various audio qualities and conditions
- Real-time Processing: Sub-second inference for typical audio clips
- Scalable Architecture: Designed for production workloads
- Privacy: No audio data is stored permanently
- Bias Mitigation: Balanced dataset across languages and demographics
- Responsible AI: Designed for legitimate content verification purposes
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Facebook AI for the Wav2Vec2-XLSR-53 model
- Hugging Face for the transformers library and model hub
- HCL for providing the multilingual voice dataset
For questions, suggestions, or collaborations, please reach out through the project issues or contact channels.
Disclaimer: This tool is designed for legitimate content verification and research purposes. Users are responsible for complying with applicable laws and regulations when using this system.