An intelligent video captioning application built for the HTF25 hackathon. It combines faster-whisper transcription with Google Gemini 2.5 Flash for caption enhancement and multilingual translation. Upload a video, select your preferred style and language, and download the captioned video together with its subtitle file.
- Speech-to-Text: Powered by faster-whisper (4-8x faster than OpenAI Whisper)
- AI Enhancement: Google Gemini 2.5 Flash removes filler words and polishes captions
- Multilingual: Translate captions to 12+ languages
- Style Options: Casual, Professional, Educational, Humorous
- User Authentication: Login system with SQLite database
- Video History: Track all processed videos (for logged-in users)
- GPU Acceleration: CUDA support for 4-10x faster processing
| Technology | Purpose | Version |
|---|---|---|
| Flask | Web framework | Latest |
| Python | Programming language | 3.10+ |
| SQLite3 | Database (user auth & history) | Built-in |
| faster-whisper | Speech-to-text (4-8x faster) | Latest |
| Google Gemini 2.5 Flash | Caption enhancement & translation | API |
| MoviePy | Video processing & overlay | 1.0.3 |
| FFmpeg | Video encoding/decoding | Latest |
| PIL/Pillow | Text rendering on videos | Latest |
| Model | Task | Performance |
|---|---|---|
| faster-whisper (tiny/base/small/medium) | Audio → Text transcription | 0.5-30s per minute |
| Gemini 2.5 Flash | Text polishing & translation | 1-2s per segment |
Model Options:
- tiny (39M params): Fastest, 32x realtime
- base (74M params): Balanced, 16x realtime (recommended)
- small (244M params): Quality, 6x realtime
- medium (769M params): Best accuracy, 2x realtime
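As a rough illustration of the speed versus accuracy tradeoff, the sketch below encodes the model sizes and realtime factors listed above and picks a compute type per device. The helper names (`MODELS`, `estimate_seconds`, `pick_compute_type`, `load_model`) are hypothetical and are not taken from the project's `scripts/transcribe.py`. The realtime factors are the nominal figures quoted in this README, so actual timings will vary with hardware.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelInfo:
    params_millions: int
    realtime_factor: int  # nominal speed relative to audio length (README figures)


# Values copied from the model options listed above.
MODELS = {
    "tiny": ModelInfo(39, 32),
    "base": ModelInfo(74, 16),
    "small": ModelInfo(244, 6),
    "medium": ModelInfo(769, 2),
}


def estimate_seconds(model_size: str, audio_seconds: float) -> float:
    """Rough transcription time: audio length divided by the realtime factor."""
    return audio_seconds / MODELS[model_size].realtime_factor


def pick_compute_type(device: str) -> str:
    """FP16 on GPU, INT8 on CPU, as described in this README."""
    return "float16" if device == "cuda" else "int8"


def load_model(model_size: str = "base", device: str = "cpu"):
    """Hypothetical loader; requires the faster-whisper package."""
    from faster_whisper import WhisperModel  # imported lazily so the helpers above work without it

    return WhisperModel(model_size, device=device, compute_type=pick_compute_type(device))
```

With faster-whisper installed, `segments, info = load_model("base", "cuda").transcribe("video.mp4")` would then yield timed text segments; the file name here is only a placeholder.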
| Technology | Purpose |
|---|---|
| HTML5 | Structure & semantic markup |
| CSS3 | Custom styling, gradients, animations |
| JavaScript | Interactivity, video controls, validation |
| Font Awesome 6.4.0 | Icons (CDN) |
| Google Fonts (Poppins) | Typography |
Responsive Design:
- Desktop: >1024px
- Tablet: 768px-1024px
- Mobile: <768px
-- Users table
CREATE TABLE users (
id INTEGER PRIMARY KEY,
username TEXT UNIQUE,
email TEXT UNIQUE,
password_hash TEXT,
created_at TIMESTAMP
);
-- Videos table
CREATE TABLE videos (
id INTEGER PRIMARY KEY,
user_id INTEGER,
original_filename TEXT,
video_file TEXT,
srt_file TEXT,
style TEXT,
language TEXT,
processed_at TIMESTAMP,
FOREIGN KEY (user_id) REFERENCES users(id)
);
- Password Hashing: SHA-256
- SQL Injection Protection: Parameterized queries
- XSS Mitigation: HttpOnly session cookies (not readable from JavaScript)
- Session Management: Flask sessions (2-hour timeout)
- File Validation: Extension & size checks (max 500MB)
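The password hashing and parameterized-query points can be sketched against the `users` table shown earlier. This is a minimal illustration using only the standard library; the function names `create_user` and `find_user` are hypothetical and not necessarily what `app.py` uses.

```python
import hashlib
import sqlite3


def create_user(conn, username, email, password):
    """Insert a user; the password is stored as a SHA-256 hex digest."""
    password_hash = hashlib.sha256(password.encode("utf-8")).hexdigest()
    # The ? placeholders let sqlite3 bind values safely instead of
    # splicing user input into the SQL string.
    conn.execute(
        "INSERT INTO users (username, email, password_hash, created_at) "
        "VALUES (?, ?, ?, CURRENT_TIMESTAMP)",
        (username, email, password_hash),
    )
    conn.commit()


def find_user(conn, username):
    """Look up a user by name with a bound parameter."""
    return conn.execute(
        "SELECT id, username, email FROM users WHERE username = ?",
        (username,),
    ).fetchone()


# In-memory database for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users ("
    "id INTEGER PRIMARY KEY, username TEXT UNIQUE, email TEXT UNIQUE, "
    "password_hash TEXT, created_at TIMESTAMP)"
)
create_user(conn, "alice", "alice@example.com", "s3cret")
```

Because the username is bound as a value, an input such as `alice' OR '1'='1` is treated as a literal string and matches no row. Plain SHA-256 mirrors what this README states; a salted scheme such as Werkzeug's `generate_password_hash` is a common alternative in Flask applications.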
- CUDA Version: 12.0
- GPU: NVIDIA GTX 1050 (4GB VRAM)
- Optimization: FP16 precision on GPU, INT8 on CPU
- Speedup: 4-10x faster than CPU
- Gemini API: 28 keys with automatic rotation
- Rate Limiting: 500 requests/day per key
- Fallback: Auto-retry with exponential backoff
- Tracking: Usage counts and disabled keys logging
- Automatic Transcription: Uses faster-whisper for high-speed video audio transcription (4-8x faster than OpenAI Whisper)
- AI Caption Rewriting: Leverages Google Gemini 2.5 Flash to enhance and translate captions in different styles
- Multi-language Support: Translate and generate captions in 10+ languages (English, Hindi, Spanish, French, German, etc.)
- Video Overlay: Automatically overlays captions on your video using PIL and MoviePy
- Beautiful Web Interface: Modern, responsive Flask web interface with drag & drop support
- Multiple Caption Styles: Choose from 6 styles - Casual, Formal, Funny, Dramatic, Minimal, Educational
- Model Selection: Choose from 4 Whisper model variants (tiny/base/small/medium) for speed vs accuracy tradeoff
- GPU Acceleration: CUDA-optimized faster-whisper with FP16 precision for maximum performance
- Unique File Management: All outputs saved with timestamps in an organized outputs/ folder
- Dual Download: Get both captioned video (.mp4) and subtitle file (.srt)
- User Authentication: Secure login system with password hashing and session management
- Result Page: Beautiful success page with confetti animation and download options
- Comprehensive Logging: Detailed console output tracking each processing stage with timing metrics
- Secure: File validation, size limits (500MB), and automatic temp file cleanup
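The extension and size checks mentioned above might look roughly like the following. The 500 MB limit comes from this README; the set of allowed extensions and the name `validate_upload` are assumptions made for illustration.

```python
from pathlib import Path

MAX_BYTES = 500 * 1024 * 1024  # 500 MB limit stated in this README
# Assumed list of video extensions; the project's actual list may differ.
ALLOWED_EXTENSIONS = {".mp4", ".mov", ".avi", ".mkv", ".webm"}


def validate_upload(filename: str, size_bytes: int) -> tuple[bool, str]:
    """Return (ok, message) for an uploaded file's name and size."""
    ext = Path(filename).suffix.lower()
    if ext not in ALLOWED_EXTENSIONS:
        return False, f"unsupported file type: {ext or 'none'}"
    if size_bytes <= 0:
        return False, "empty file"
    if size_bytes > MAX_BYTES:
        return False, "file exceeds the 500 MB limit"
    return True, "ok"
```

Checking the lowercased suffix means `clip.MP4` and `clip.mp4` are treated the same way.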
Modern, responsive web interface with drag & drop support, multiple style options, and language selection
Example of processed video with AI-generated captions overlaid in selected style and language
Before you begin, ensure you have the following installed:
- Python 3.10+
- Conda (Anaconda or Miniconda)
- Git
git clone https://github.com/chiluverugirish/HTF25-Team-415.git
cd HTF25-Team-415
conda create -n htf25 python=3.10 -y
conda activate htf25
Install FFmpeg and ImageMagick (required for video processing):
conda install -c conda-forge ffmpeg imagemagick -y
pip install -r requirements.txt
Copy the example environment file and add your Gemini API keys:
# Copy the example file
Copy-Item .env.example .env
# Edit .env and add your API keys
notepad .env
Configure your .env file with at least one Gemini API key (28 keys recommended for high-volume use):
GEMINI_API_KEY_1=your_first_gemini_api_key_here
GEMINI_API_KEY_2=your_second_gemini_api_key_here
GEMINI_API_KEY_3=your_third_gemini_api_key_here
# ... add up to GEMINI_API_KEY_28
To get Gemini API keys:
- Visit Google AI Studio
- Sign in with your Google account
- Create new API keys (recommended: 28 keys for automatic rotation)
- Copy and paste them into your .env file
- Never commit the .env file to Git (it's in .gitignore)
- Never share your API keys publicly
- The .env.example file is a template without real keys
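To show how numbered keys like these could be consumed, here is a minimal sketch of loading `GEMINI_API_KEY_1` through `GEMINI_API_KEY_28` and rotating between them with exponential backoff. The functions `load_gemini_keys` and `call_with_rotation` are hypothetical, and `fake_request` stands in for a real Gemini call; the project's own rotation logic may differ.

```python
import os
import time


def load_gemini_keys(env=None, max_keys=28):
    """Collect GEMINI_API_KEY_1..N from the environment, skipping unset slots."""
    env = os.environ if env is None else env
    keys = []
    for i in range(1, max_keys + 1):
        value = env.get(f"GEMINI_API_KEY_{i}", "").strip()
        if value:
            keys.append(value)
    return keys


def call_with_rotation(keys, request, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Try request(key) with successive keys, doubling the wait after each failure."""
    if not keys:
        raise RuntimeError("no Gemini API keys configured")
    last_error = None
    for attempt in range(max_attempts):
        key = keys[attempt % len(keys)]  # rotate through the available keys
        try:
            return request(key)
        except Exception as exc:  # a real client would catch narrower errors
            last_error = exc
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise RuntimeError("all attempts failed") from last_error


def fake_request(key):
    """Stand-in for an API call: the key named 'bad' always fails."""
    if key == "bad":
        raise ValueError("quota exceeded")
    return f"ok:{key}"


delays = []  # record waits instead of actually sleeping
result = call_with_rotation(["bad", "good"], fake_request, sleep=delays.append)
```

Passing `sleep` as a parameter keeps the backoff testable without real waiting. In the application, `python-dotenv`'s `load_dotenv()` would typically populate the environment from `.env` before `load_gemini_keys()` is called.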
python -c "from faster_whisper import WhisperModel; import moviepy; from google import generativeai; print('All packages installed successfully!')"
# Using startup script
.\start.ps1
Or manually:
# Activate environment
conda activate htf25
# Start the app
python app.py
The application will automatically open in your default browser at http://127.0.0.1:5000/
- Upload Video: Click the upload area or drag & drop your video file
- Choose Style: Select from 6 caption styles (casual, formal, funny, dramatic, minimal, educational)
- Select Language: Choose output language from 10+ supported languages (Gemini will translate)
- Choose Speed: Select Whisper model variant (tiny/base/small/medium) for speed vs accuracy tradeoff
- Generate: Click "Generate Captions" and watch real-time processing logs
- Download: Get both the captioned video and SRT subtitle file from the success page
- Access Files: All outputs are saved in the outputs/ folder with unique timestamped names
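One way to produce unique timestamped names for the video and subtitle pair is sketched below. The naming pattern (`<stem>_<YYYYMMDD_HHMMSS>`) and the function `output_paths` are assumptions for illustration; the project's actual naming scheme may differ.

```python
from datetime import datetime
from pathlib import Path


def output_paths(original_filename, now=None, out_dir="outputs"):
    """Return (video_path, srt_path) sharing one timestamped base name."""
    now = now or datetime.now()
    stem = Path(original_filename).stem  # file name without its extension
    stamp = now.strftime("%Y%m%d_%H%M%S")
    base = f"{stem}_{stamp}"
    folder = Path(out_dir)
    return folder / f"{base}.mp4", folder / f"{base}.srt"


video_path, srt_path = output_paths("talk.mov", datetime(2025, 10, 26, 14, 30, 5))
```

Sharing a single timestamp between the two files keeps each captioned video next to its matching subtitle file when the folder is sorted by name.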
The above image shows an example of the final output - a video with AI-generated captions overlaid in your selected style and language.
┌─────────────────────┐
│    Upload Video     │ ← User uploads video via web interface
└──────────┬──────────┘
           ↓
┌─────────────────────┐
│   faster-whisper    │ ← Speech-to-text transcription (4-8x faster)
│    Transcription    │   with GPU acceleration (CUDA FP16)
└──────────┬──────────┘
           ↓
┌─────────────────────┐
│  Gemini 2.5 Flash   │ ← AI enhancement: translate + rewrite
│  Caption Rewriting  │   in selected style (28 API keys)
└──────────┬──────────┘
           ↓
┌─────────────────────┐
│   SRT Generation    │ ← Generate standard subtitle file format
└──────────┬──────────┘
           ↓
┌─────────────────────┐
│   Caption Overlay   │ ← Overlay captions on video using MoviePy + PIL
└──────────┬──────────┘
           ↓
┌─────────────────────┐
│  Download Results   │ ← Get captioned video (.mp4) + SRT file (.srt)
└─────────────────────┘
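The SRT generation stage in the pipeline above can be illustrated with a small dependency-free sketch. The project lists `pysrt` among its dependencies, so its `scripts/generate_srt.py` likely works differently; the functions `format_timestamp` and `segments_to_srt` and the `(start, end, text)` tuple layout are assumptions made here.

```python
def format_timestamp(seconds: float) -> str:
    """Convert seconds to the SRT time format HH:MM:SS,mmm."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments) -> str:
    """Build SRT text from (start_seconds, end_seconds, text) tuples."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{format_timestamp(start)} --> {format_timestamp(end)}\n"
            f"{text.strip()}\n"
        )
    # A blank line separates consecutive subtitle blocks.
    return "\n".join(blocks)


srt_text = segments_to_srt([(0.0, 1.5, " Hello "), (1.5, 3.0, "world")])
```

Each block carries a 1-based index, a time range, and the caption text, which is the layout that video players expect from an `.srt` file.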
HTF25-Team-415/
├── app.py                          # Main Flask application
├── requirements.txt                # Python dependencies
├── packages.txt                    # System dependencies
├── disabled_keys.json              # Configuration file
├── usage_counts.json               # Usage tracking
├── scripts/
│   ├── transcribe.py               # Video transcription module
│   ├── generate_srt.py             # SRT subtitle generation
│   ├── rewrite_captions_gemini.py  # AI caption rewriting
│   ├── overlay.py                  # Video caption overlay
│   └── runall.py                   # Batch processing script
├── templates/
│   └── index.html                  # Web interface template
└── examples/                       # Example videos/outputs
- Flask: Web framework and routing
- faster-whisper: High-performance audio transcription (4-8x faster than OpenAI Whisper)
- moviepy: Video processing and manipulation
- google-generativeai: Gemini AI integration for caption enhancement
- python-dotenv: Environment variable management
- pysrt: SRT subtitle file handling
- Pillow (PIL): Text rendering and image processing
- torch: PyTorch for deep learning inference
- numpy: Numerical computing
- FFmpeg: Video encoding/decoding
- CUDA Toolkit 12.0: GPU acceleration (NVIDIA only)
Issue: FFmpeg not found
# Solution: Reinstall FFmpeg
conda install -c conda-forge ffmpeg -y
Issue: faster-whisper model download fails
# Solution: Manually download the model
python -c "from faster_whisper import WhisperModel; model = WhisperModel('base', device='cpu')"
Issue: CUDA out of memory error
# Solution: Use smaller Whisper model or switch to CPU
# Edit scripts/transcribe.py and change model size from 'medium' to 'base' or 'tiny'
Issue: ImportError for moviepy
# Solution: Reinstall moviepy
pip uninstall moviepy -y
pip install moviepy==1.0.3
Issue: Gemini API error
- Verify your API key is correct in the .env file
- Check your API quota at Google AI Studio
# Activate the environment
conda activate htf25
# Deactivate the environment
conda deactivate
# Remove the environment entirely
conda deactivate
conda remove -n htf25 --all -y
This project was created for the HTF25 hackathon. To contribute:
- Fork the repository
- Create a new branch (git checkout -b feature-name)
- Make your changes
- Commit your changes (git add . && git commit -m "Add feature")
- Push to your fork (git push origin feature-name)
- Create a Pull Request
This project is part of the HTF25 hackathon.
Team 415 - HTF25 Hackathon Participants
Our modern web interface with gradient design and intuitive controls:
Example of AI-generated captions overlaid on video with selected style and language:
- Modern UI Design: Purple gradient theme with smooth animations
- Drag & Drop: Intuitive file upload with visual feedback
- Responsive Layout: Works seamlessly on all devices
- Professional Output: High-quality caption overlay with customizable styles
- Success Page: Confetti animation with download options
- Organized Storage: Timestamped files in dedicated outputs folder
- faster-whisper by Systran for high-performance speech-to-text transcription
- Google Gemini 2.5 Flash for AI-powered caption enhancement and translation
- OpenAI for the original Whisper architecture
- The open-source community for amazing libraries (MoviePy, Flask, PIL, PyTorch)
If you encounter any issues or have questions:
- Check the Troubleshooting section
- Open an issue on GitHub
- Contact the team maintainers
Happy Captioning!
