cbitosc/HTF25-Team-415

🎬 AI-Powered Video Caption Generator

[Screenshot: AI Caption Generator web interface]

Transform your videos with AI-powered captions in multiple languages and styles

Python 3.10+ · Flask · faster-whisper · Google Gemini · SQLite · CUDA


📝 Project Overview

An intelligent video captioning application built for the HTF25 hackathon. It combines faster-whisper for automatic transcription with Google Gemini 2.5 Flash for caption enhancement and multilingual translation. Upload a video, pick a caption style and language, and download the captioned video along with its subtitle file.

🎯 Key Capabilities

  • 🎤 Speech-to-Text: Powered by faster-whisper (4-8x faster than OpenAI Whisper)
  • ✨ AI Enhancement: Google Gemini 2.5 Flash removes filler words and polishes captions
  • 🌍 Multilingual: Translate captions to 10+ languages
  • 🎨 Style Options: Casual, Formal, Funny, Dramatic, Minimal, Educational
  • 👥 User Authentication: Login system with SQLite database
  • 📊 Video History: Track all processed videos (for logged-in users)
  • ⚡ GPU Acceleration: CUDA support for 4-10x faster processing

🚀 Complete Tech Stack

Backend

Technology              | Purpose                           | Version
------------------------|-----------------------------------|---------
Flask                   | Web framework                     | Latest
Python                  | Programming language              | 3.10+
SQLite3                 | Database (user auth & history)    | Built-in
faster-whisper          | Speech-to-text (4-8x faster)      | Latest
Google Gemini 2.5 Flash | Caption enhancement & translation | API
MoviePy                 | Video processing & overlay        | 1.0.3
FFmpeg                  | Video encoding/decoding           | Latest
PIL/Pillow              | Text rendering on videos          | Latest

AI Models

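Model                                   | Task                         | Performance
----------------------------------------|------------------------------|----------------------------
faster-whisper (tiny/base/small/medium) | Audio → Text transcription   | 0.5-30s per minute of video
Gemini 2.5 Flash                        | Text polishing & translation | 1-2s per segment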

Model Options:

  • tiny (39M params): Fastest, 32x realtime
  • base (74M params): Balanced, 16x realtime ✨ Recommended
  • small (244M params): Quality, 6x realtime
  • medium (769M params): Best accuracy, 2x realtime
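The realtime factors above translate into a rough processing-time estimate: the audio duration divided by the factor. A minimal sketch, assuming the listed factors hold on your hardware; `estimate_transcription_seconds` is a hypothetical helper for illustration, not part of the project.

```python
# Realtime factors copied from the model list above; real speed depends on hardware.
REALTIME_FACTOR = {"tiny": 32, "base": 16, "small": 6, "medium": 2}


def estimate_transcription_seconds(audio_seconds: float, model: str = "base") -> float:
    """Rough estimate: audio length divided by the model's realtime factor."""
    if model not in REALTIME_FACTOR:
        raise ValueError(f"unknown model size: {model!r}")
    return audio_seconds / REALTIME_FACTOR[model]


if __name__ == "__main__":
    # Compare all four model sizes for a 10-minute video.
    for name in REALTIME_FACTOR:
        seconds = estimate_transcription_seconds(600, name)
        print(f"{name}: about {seconds:.1f}s")
```

By this estimate a one-minute clip needs about 3.75 seconds on base and about 30 seconds on medium, which matches the upper end of the range in the table above.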

Frontend

Technology             | Purpose
-----------------------|------------------------------------------
HTML5                  | Structure & semantic markup
CSS3                   | Custom styling, gradients, animations
JavaScript             | Interactivity, video controls, validation
Font Awesome 6.4.0     | Icons (CDN)
Google Fonts (Poppins) | Typography

Responsive Design:

  • Desktop: >1024px
  • Tablet: 768px-1024px
  • Mobile: <768px

Database Schema

-- Users table
CREATE TABLE users (
    id INTEGER PRIMARY KEY,
    username TEXT UNIQUE,
    email TEXT UNIQUE,
    password_hash TEXT,
    created_at TIMESTAMP
);

-- Videos table
CREATE TABLE videos (
    id INTEGER PRIMARY KEY,
    user_id INTEGER,
    original_filename TEXT,
    video_file TEXT,
    srt_file TEXT,
    style TEXT,
    language TEXT,
    processed_at TIMESTAMP,
    FOREIGN KEY (user_id) REFERENCES users(id)
);
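To see how this schema behaves in practice, the sketch below creates both tables in an in-memory SQLite database and records a processed video with parameterized queries. The helper names (`open_db`, `add_video`, `history_for`) are hypothetical and only illustrate the pattern; the project's own database code is not shown here.

```python
import sqlite3

SCHEMA = """
CREATE TABLE users (
    id INTEGER PRIMARY KEY,
    username TEXT UNIQUE,
    email TEXT UNIQUE,
    password_hash TEXT,
    created_at TIMESTAMP
);
CREATE TABLE videos (
    id INTEGER PRIMARY KEY,
    user_id INTEGER,
    original_filename TEXT,
    video_file TEXT,
    srt_file TEXT,
    style TEXT,
    language TEXT,
    processed_at TIMESTAMP,
    FOREIGN KEY (user_id) REFERENCES users(id)
);
"""


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open a database and create the two tables."""
    conn = sqlite3.connect(path)
    # SQLite only enforces FOREIGN KEY constraints when this pragma is on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def add_video(conn, user_id, original_filename, video_file, srt_file, style, language):
    """Insert one processed video; '?' placeholders keep user input out of the SQL text."""
    cursor = conn.execute(
        "INSERT INTO videos (user_id, original_filename, video_file, srt_file,"
        " style, language, processed_at)"
        " VALUES (?, ?, ?, ?, ?, ?, CURRENT_TIMESTAMP)",
        (user_id, original_filename, video_file, srt_file, style, language),
    )
    conn.commit()
    return cursor.lastrowid


def history_for(conn, user_id):
    """Return (filename, style, language) rows for one user, oldest first."""
    return conn.execute(
        "SELECT original_filename, style, language FROM videos"
        " WHERE user_id = ? ORDER BY id",
        (user_id,),
    ).fetchall()
```

With foreign keys enabled, inserting a video for a user id that does not exist raises `sqlite3.IntegrityError` instead of silently storing an orphaned row.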

Authentication & Security

  • Password Hashing: SHA-256
  • SQL Injection Protection: Parameterized queries
  • Cookie Protection: HttpOnly session cookies (not readable from JavaScript, which limits session theft via XSS)
  • Session Management: Flask sessions (2-hour timeout)
  • File Validation: Extension & size checks (max 500MB)
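The hashing and upload checks listed above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the project's code: the extension allow-list and both function names are hypothetical, and only the SHA-256 choice and the 500MB limit come from the list above.

```python
import hashlib
from pathlib import Path

# Hypothetical allow-list; the accepted extensions are not stated in this README.
ALLOWED_EXTENSIONS = {".mp4", ".mov", ".avi", ".mkv", ".webm"}
MAX_UPLOAD_BYTES = 500 * 1024 * 1024  # 500MB limit from the list above


def hash_password(password: str) -> str:
    """Return the SHA-256 hex digest of the password (unsalted, as a plain sketch)."""
    return hashlib.sha256(password.encode("utf-8")).hexdigest()


def is_allowed_upload(filename: str, size_bytes: int) -> bool:
    """Accept only known video extensions within the size limit."""
    extension = Path(filename).suffix.lower()
    return extension in ALLOWED_EXTENSIONS and 0 < size_bytes <= MAX_UPLOAD_BYTES
```

If stronger password storage is wanted later, `hashlib.pbkdf2_hmac` and `hashlib.scrypt` in the same module provide salted key derivation and could replace the single hash call.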

GPU Acceleration

  • CUDA Version: 12.0
  • GPU: NVIDIA GTX 1050 (4GB VRAM)
  • Optimization: FP16 precision on GPU, INT8 on CPU
  • Speedup: 4-10x faster than CPU
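The precision rule above (FP16 on GPU, INT8 on CPU) maps onto the `device` and `compute_type` arguments of faster-whisper's `WhisperModel`. A minimal sketch: the two function names are hypothetical, and how CUDA availability is detected is left to the caller (for example `torch.cuda.is_available()`, since torch is among the listed dependencies).

```python
def pick_device_and_precision(cuda_available: bool) -> tuple[str, str]:
    """Return (device, compute_type): float16 on GPU, int8 on CPU."""
    return ("cuda", "float16") if cuda_available else ("cpu", "int8")


def load_model(model_size: str = "base", cuda_available: bool = False):
    """Build a WhisperModel with the settings chosen above."""
    # Imported lazily so the selection logic can run without faster-whisper installed.
    from faster_whisper import WhisperModel

    device, compute_type = pick_device_and_precision(cuda_available)
    return WhisperModel(model_size, device=device, compute_type=compute_type)
```

Keeping the selection in its own function makes it easy to fall back to CPU settings when a GPU is missing or runs out of memory.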

API Integration

  • Gemini API: 28 keys with automatic rotation
  • Rate Limiting: 500 requests/day per key
  • Fallback: Auto-retry with exponential backoff
  • Tracking: Usage counts and disabled keys logging
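Key rotation with a per-key daily limit and exponential backoff can be sketched as follows. This is an illustration of the general technique, not the project's implementation: `KeyRotator` and `call_with_backoff` are hypothetical names, and `request` stands in for whatever function actually calls the Gemini API with a given key.

```python
import time


class KeyRotator:
    """Hand out API keys in order, skipping disabled or exhausted ones."""

    def __init__(self, keys, daily_limit=500):
        if not keys:
            raise ValueError("at least one API key is required")
        self.keys = list(keys)
        self.daily_limit = daily_limit
        self.usage = {key: 0 for key in self.keys}  # usage counts per key
        self.disabled = set()                        # keys taken out of rotation

    def next_key(self):
        """Return the first enabled key still under its limit and count the use."""
        for key in self.keys:
            if key not in self.disabled and self.usage[key] < self.daily_limit:
                self.usage[key] += 1
                return key
        raise RuntimeError("all API keys are exhausted or disabled")

    def disable(self, key):
        self.disabled.add(key)


def call_with_backoff(rotator, request, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call request(key); on failure wait base_delay * 2**attempt and retry."""
    last_error = None
    for attempt in range(max_attempts):
        key = rotator.next_key()
        try:
            return request(key)
        except Exception as error:  # a real client would catch its specific error types
            last_error = error
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise last_error
```

Passing `sleep` as a parameter keeps the retry logic testable without real delays; with the defaults, the waits between attempts are 1, 2, and 4 seconds.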

🌟 Features

  • Automatic Transcription: Uses faster-whisper for high-speed video audio transcription (4-8x faster than OpenAI Whisper)
  • AI Caption Rewriting: Leverages Google Gemini 2.5 Flash to enhance and translate captions in different styles
  • Multi-language Support: Translate and generate captions in 10+ languages (English, Hindi, Spanish, French, German, etc.)
  • Video Overlay: Automatically overlays captions on your video using PIL and MoviePy
  • Beautiful Web Interface: Modern, responsive Flask web interface with drag & drop support
  • Multiple Caption Styles: Choose from 6 styles - Casual, Formal, Funny, Dramatic, Minimal, Educational
  • Model Selection: Choose from 4 Whisper model variants (tiny/base/small/medium) for speed vs accuracy tradeoff
  • GPU Acceleration: CUDA-optimized faster-whisper with FP16 precision for maximum performance
  • Unique File Management: All outputs saved with timestamps in organized outputs/ folder
  • Dual Download: Get both captioned video (.mp4) and subtitle file (.srt)
  • User Authentication: Secure login system with password hashing and session management
  • Result Page: Beautiful success page with confetti animation and download options
  • Comprehensive Logging: Detailed console output tracking each processing stage with timing metrics
  • Secure: File validation, size limits (500MB), and automatic temp file cleanup
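As an illustration of the caption rewriting feature, the sketch below builds a text prompt from a transcribed segment, a style, and a target language. The prompt wording and the `build_rewrite_prompt` name are assumptions made for this example; the actual prompt used in scripts/rewrite_captions_gemini.py is not shown in this README.

```python
# The six caption styles listed above.
STYLES = ("casual", "formal", "funny", "dramatic", "minimal", "educational")


def build_rewrite_prompt(text: str, style: str, language: str) -> str:
    """Build a hypothetical rewrite-and-translate prompt for one caption segment."""
    style = style.lower()
    if style not in STYLES:
        raise ValueError(f"unsupported style: {style!r}")
    return (
        f"Rewrite the following caption in a {style} style and translate it to {language}. "
        "Remove filler words, keep the meaning, and reply with the caption text only.\n\n"
        f"Caption: {text.strip()}"
    )
```

Validating the style before building the prompt means an unsupported option fails fast, before any API quota is spent.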

Screenshots

Web Application Interface

[Screenshot: AI Caption Generator UI] Modern, responsive web interface with drag & drop support, multiple style options, and language selection

Output Sample - Video with AI-Generated Captions

[Screenshot: Captioned Video Output] Example of a processed video with AI-generated captions overlaid in the selected style and language

📋 Prerequisites

Before you begin, ensure you have the following installed:

  • Python 3.10+
  • Conda (Anaconda or Miniconda)
  • Git

🚀 Installation

Step 1: Clone the Repository

git clone https://github.com/chiluverugirish/HTF25-Team-415.git
cd HTF25-Team-415

Step 2: Create Conda Environment

conda create -n htf25 python=3.10 -y

Step 3: Activate the Environment

conda activate htf25

Step 4: Install System Dependencies

Install FFmpeg and ImageMagick (required for video processing):

conda install -c conda-forge ffmpeg imagemagick -y

Step 5: Install Python Dependencies

pip install -r requirements.txt

Step 6: Set Up Environment Variables

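Copy the example environment file and add your Gemini API keys (the commands below are for Windows PowerShell; on macOS/Linux, use cp .env.example .env and any text editor):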

# Copy the example file
Copy-Item .env.example .env

# Edit .env and add your API keys
notepad .env

Configure your .env file with at least one Gemini API key (28 keys recommended for high-volume use):

GEMINI_API_KEY_1=your_first_gemini_api_key_here
GEMINI_API_KEY_2=your_second_gemini_api_key_here
GEMINI_API_KEY_3=your_third_gemini_api_key_here
# ... add up to GEMINI_API_KEY_28
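Numbered variables like these can be collected into a list at startup. A minimal sketch, assuming the variables are already in the process environment (for example after calling `load_dotenv()` from python-dotenv); `load_gemini_keys` is a hypothetical helper, and skipping values that start with `your_` is an assumption based on the placeholder text in the template above.

```python
import os


def load_gemini_keys(env=None, max_keys: int = 28) -> list[str]:
    """Collect GEMINI_API_KEY_1..GEMINI_API_KEY_<max_keys>, skipping gaps and placeholders."""
    env = os.environ if env is None else env
    keys = []
    for index in range(1, max_keys + 1):
        value = env.get(f"GEMINI_API_KEY_{index}", "").strip()
        # Ignore unset entries and untouched template placeholders.
        if value and not value.startswith("your_"):
            keys.append(value)
    return keys
```

Because missing numbers are simply skipped, the app can run with a single key or with any subset of the 28 slots filled.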

To get Gemini API keys:

  1. Visit Google AI Studio
  2. Sign in with your Google account
  3. Create new API keys (recommended: 28 keys for automatic rotation)
  4. Copy and paste them into your .env file

⚠️ SECURITY NOTE:

  • Never commit the .env file to Git (it's in .gitignore)
  • Never share your API keys publicly
  • The .env.example file is a template without real keys

Step 7: Verify Installation

python -c "from faster_whisper import WhisperModel; import moviepy; from google import generativeai; print('✅ All packages installed successfully!')"

🎯 Usage

Running the Application

Quick Start (Recommended):

# Using startup script
.\start.ps1

Or manually:

# Activate the environment created in Step 2
conda activate htf25

# Start the app
python app.py

The application will automatically open in your default browser at http://127.0.0.1:5000/

Using the Application

Web Interface

  1. Upload Video: Click the upload area or drag & drop your video file
  2. Choose Style: Select from 6 caption styles (casual, formal, funny, dramatic, minimal, educational)
  3. Select Language: Choose output language from 10+ supported languages (Gemini will translate)
  4. Choose Speed: Select Whisper model variant (tiny/base/small/medium) for speed vs accuracy tradeoff
  5. Generate: Click "Generate Captions" and watch real-time processing logs
  6. Download: Get both the captioned video and SRT subtitle file from the success page
  7. Access Files: All outputs are saved in the outputs/ folder with unique timestamped names
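Step 7 mentions unique timestamped names in the outputs/ folder. One way to derive such names is sketched below; the exact naming pattern (`<stem>_<YYYYMMDD_HHMMSS>`) and the `output_paths` helper are assumptions for illustration, since the project's real pattern is not documented here.

```python
from datetime import datetime
from pathlib import Path


def output_paths(original_filename, outputs_dir="outputs", now=None):
    """Return (video_path, srt_path) sharing one timestamped base name."""
    now = now or datetime.now()
    stem = Path(original_filename).stem          # file name without directory or extension
    stamp = now.strftime("%Y%m%d_%H%M%S")        # e.g. 20250102_030405
    base = f"{stem}_{stamp}"
    folder = Path(outputs_dir)
    return folder / f"{base}.mp4", folder / f"{base}.srt"
```

Sharing one base name between the .mp4 and the .srt keeps each video paired with its subtitle file.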

Sample Output

Captioned Video

The above image shows an example of the final output - a video with AI-generated captions overlaid in your selected style and language.

🎬 How It Works

┌─────────────────────┐
│   Upload Video      │  ← User uploads video via web interface
└──────────┬──────────┘
           ↓
┌─────────────────────┐
│  faster-whisper     │  ← Speech-to-text transcription (4-8x faster)
│  Transcription      │     with GPU acceleration (CUDA FP16)
└──────────┬──────────┘
           ↓
┌─────────────────────┐
│  Gemini 2.5 Flash   │  ← AI enhancement: translate + rewrite
│  Caption Rewriting  │     in selected style (28 API keys)
└──────────┬──────────┘
           ↓
┌─────────────────────┐
│  SRT Generation     │  ← Generate standard subtitle file format
└──────────┬──────────┘
           ↓
┌─────────────────────┐
│  Caption Overlay    │  ← Overlay captions on video using MoviePy + PIL
└──────────┬──────────┘
           ↓
┌─────────────────────┐
│  Download Results   │  ← Get captioned video (.mp4) + SRT file (.srt)
└─────────────────────┘
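The SRT generation stage in the pipeline above can be illustrated with a few lines of standard-library Python. This is a sketch of the SRT format itself, not the contents of scripts/generate_srt.py; the function names and the `(start, end, text)` segment shape are assumptions.

```python
def format_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments) -> str:
    """Turn (start_seconds, end_seconds, text) tuples into SRT text."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        # Each cue: sequence number, time range, caption text, then a blank line.
        blocks.append(
            f"{index}\n{format_timestamp(start)} --> {format_timestamp(end)}\n{text.strip()}\n"
        )
    return "\n".join(blocks)


if __name__ == "__main__":
    print(segments_to_srt([(0.0, 1.5, "Hello"), (1.5, 3.0, "World")]))
```

For example, `format_timestamp(3661.5)` returns `01:01:01,500`. Working in whole milliseconds avoids rounding a value such as 59.9996 seconds into an invalid `60` in the seconds field.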

📁 Project Structure

HTF25-Team-415/
├── app.py                          # Main Flask application
├── requirements.txt                # Python dependencies
├── packages.txt                    # System dependencies
├── disabled_keys.json              # Configuration file
├── usage_counts.json               # Usage tracking
├── scripts/
│   ├── transcribe.py               # Video transcription module
│   ├── generate_srt.py             # SRT subtitle generation
│   ├── rewrite_captions_gemini.py  # AI caption rewriting
│   ├── overlay.py                  # Video caption overlay
│   └── runall.py                   # Batch processing script
├── templates/
│   └── index.html                  # Web interface template
└── examples/                       # Example videos/outputs

🛠️ Dependencies

Python Packages

  • Flask: Web framework and routing
  • faster-whisper: High-performance audio transcription (4-8x faster than OpenAI Whisper)
  • moviepy: Video processing and manipulation
  • google-generativeai: Gemini AI integration for caption enhancement
  • python-dotenv: Environment variable management
  • pysrt: SRT subtitle file handling
  • Pillow (PIL): Text rendering and image processing
  • torch: PyTorch for deep learning inference
  • numpy: Numerical computing

System Packages

  • FFmpeg: Video encoding/decoding
  • CUDA Toolkit 12.0: GPU acceleration (NVIDIA only)

🔧 Troubleshooting

Common Issues

Issue: FFmpeg not found

# Solution: Reinstall FFmpeg
conda install -c conda-forge ffmpeg -y

Issue: faster-whisper model download fails

# Solution: Manually download the model
python -c "from faster_whisper import WhisperModel; model = WhisperModel('base', device='cpu')"

Issue: CUDA out of memory error

# Solution: Use smaller Whisper model or switch to CPU
# Edit scripts/transcribe.py and change model size from 'medium' to 'base' or 'tiny'

Issue: ImportError for moviepy

# Solution: Reinstall moviepy
pip uninstall moviepy -y
pip install moviepy==1.0.3

Issue: Gemini API error

  • Verify your API key is correct in the .env file
  • Check your API quota at Google AI Studio

🌐 Environment Management

Activate Environment

conda activate htf25

Deactivate Environment

conda deactivate

Remove Environment (if needed)

conda deactivate
conda remove -n htf25 --all -y

🤝 Contributing

This project was created for HTF25 (Hackathon). To contribute:

  1. Fork the repository
  2. Create a new branch (git checkout -b feature-name)
  3. Make your changes
  4. Commit your changes (git add . && git commit -m "Add feature")
  5. Push to your fork (git push origin feature-name)
  6. Create a Pull Request

📝 License

This project is part of the HTF25 hackathon.

👥 Team

Team 415 - HTF25 Hackathon Participants

🎨 Results Showcase

Application Interface

Our modern web interface with gradient design and intuitive controls:

AI Caption Generator Interface

Beautiful web interface with drag & drop support, multiple style options, and responsive design

Sample Output

Example of AI-generated captions overlaid on video with selected style and language:

Video with AI Captions

Professional caption overlay showing AI-enhanced text in the selected style

Key Visual Features

  • 🎨 Modern UI Design: Purple gradient theme with smooth animations
  • 🖱️ Drag & Drop: Intuitive file upload with visual feedback
  • 📱 Responsive Layout: Works seamlessly on all devices
  • 🎬 Professional Output: High-quality caption overlay with customizable styles
  • 📊 Success Page: Confetti animation with download options
  • 📁 Organized Storage: Timestamped files in dedicated outputs folder

🙏 Acknowledgments

  • faster-whisper by Systran for high-performance speech-to-text transcription
  • Google Gemini 2.5 Flash for AI-powered caption enhancement and translation
  • OpenAI for the original Whisper architecture
  • The open-source community for amazing libraries (MoviePy, Flask, PIL, PyTorch)

📞 Support

If you encounter any issues or have questions:

  1. Check the Troubleshooting section
  2. Open an issue on GitHub
  3. Contact the team maintainers

Happy Captioning! 🎬✨
