Elevate your public speaking and presentation skills through intelligent multimodal analysis
Features • Quick Start • Docker Deployment • Technologies • Roadmap
Vocalyst is an advanced AI-driven communication coaching platform that revolutionizes how individuals improve their public speaking and presentation skills. Unlike traditional tools that focus solely on text analysis, Vocalyst provides comprehensive, real-time multimodal feedback by analyzing:
- Voice Analysis - Speech fluency, pacing, tone, and delivery
- Facial Expressions - Emotional engagement and confidence levels
- Eye Contact - Gaze tracking and audience engagement
- Content Structure - Logical coherence, vocabulary, and engagement
Communication is more than just words: it's about how you sound, how you look, and how you structure your message. Vocalyst bridges the gap left by conventional tools by offering a unified, intelligent solution for holistic communication improvement.
- Real-time Feedback during presentations
- Multiple Practice Modes: General, Persuasive, Emotive, Debate, Storytelling
- Camera & Audio Integration for comprehensive analysis
- Live Metrics Display with instant feedback
- Speech Transcription using OpenAI Whisper
- Filler Word Detection (um, uh, like, etc.) with frequency tracking
- Words Per Minute (WPM) measurement
- Clarity Scoring based on pronunciation and enunciation
- Vocabulary Sophistication tracking
- Eye Contact Tracking via MediaPipe
- Facial Expression Analysis using DeepFace
- Emotion Detection (neutral, happy, sad, angry, fear, surprise)
- Engagement Estimation through facial cues
- Real-time Visual Feedback during sessions
- Dynamic Session Insights generated by Google Gemini AI
- Personalized Recommendations based on performance
- Trend Analysis (improving/declining/stable metrics)
- Strengths & Weaknesses identification
- Gamification with level/XP system
- Performance Dashboard with historical trends
- Session History with detailed breakdowns
- Progress Tracking over time
- Practice Mode Analytics with distribution charts
- Emotional Expression Patterns
- Reset Functionality with data archiving
- Multiple Voice Options (8 high-quality AI voices)
- Speed Control for customized playback
- ElevenLabs & Neuphonic integration
- Practice Prompts generation
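Several of the speech metrics above (words per minute, filler-word frequency) are straightforward to compute once a transcript and its duration are available. A minimal sketch, assuming Whisper-style transcript text as input; the filler-word list and field names here are illustrative, not Vocalyst's actual implementation:

```python
import re

# Illustrative filler-word list; the real detector may track more variants.
FILLER_WORDS = {"um", "uh", "like", "basically", "actually"}

def speech_metrics(transcript: str, duration_seconds: float) -> dict:
    """Compute WPM and filler-word frequency from a transcript."""
    words = re.findall(r"[a-z']+", transcript.lower())
    filler_count = sum(1 for w in words if w in FILLER_WORDS)
    wpm = len(words) / (duration_seconds / 60) if duration_seconds else 0.0
    return {
        "word_count": len(words),
        "wpm": round(wpm, 1),
        "filler_count": filler_count,
        "filler_pct": round(100 * filler_count / max(len(words), 1), 1),
    }
```

For example, a 9-word utterance spoken in 5 seconds yields a (very fast) 108 WPM, with each "um" and "uh" counted toward the filler percentage.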
- Docker & Docker Compose (recommended), or
- Node.js (v18+) and Python (v3.11+)
1. Clone the repository

   ```bash
   git clone https://github.com/Shreyyy07/Vocalyst-Main.git
   cd Vocalyst-Main
   ```

2. Set up environment variables

   ```bash
   cp .env.example .env
   # Edit .env with your API keys
   ```

3. Start with Docker Compose

   ```bash
   docker-compose up
   ```

4. Access the application

   - Frontend: http://localhost:3000
   - Backend API: http://localhost:5328
1. Clone and install dependencies

   ```bash
   git clone https://github.com/Shreyyy07/Vocalyst-Main.git
   cd Vocalyst-Main

   # Install Python dependencies
   pip install -r requirements.txt

   # Install Node.js dependencies
   npm install
   ```

2. Set up environment variables

   ```bash
   cp .env.example .env
   # Add your API keys to .env
   ```

3. Run both servers

   ```bash
   npm run dev
   ```

   Or run separately:

   ```bash
   # Terminal 1 - Frontend
   npm run next-dev

   # Terminal 2 - Backend
   npm run flask-dev
   ```
Vocalyst uses a multi-container Docker setup:
- Frontend Container: Next.js production build (Port 3000)
- Backend Container: Flask API with ML models (Port 5328)
- Shared Network: Bridge network for inter-container communication
- Persistent Volumes: Session data and uploads
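A minimal `docker-compose.yml` consistent with this layout might look like the following sketch. The service names, build contexts, network name, and in-container mount paths are assumptions for illustration; the repository's own compose file is authoritative.

```yaml
services:
  frontend:
    build: ./app          # Next.js production build
    ports:
      - "3000:3000"
    depends_on:
      - backend
    networks:
      - vocalyst-net
  backend:
    build: ./api          # Flask API with ML models
    ports:
      - "5328:5328"
    env_file: .env        # API keys stay out of the image
    volumes:
      - ./api/data:/app/data         # persistent session analytics
      - ./api/uploads:/app/uploads   # recordings and emotion data
    networks:
      - vocalyst-net

networks:
  vocalyst-net:
    driver: bridge        # shared network for inter-container REST calls
```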
```bash
# Build containers
docker-compose build

# Start services
docker-compose up

# Start in detached mode
docker-compose up -d

# Stop services
docker-compose down

# View logs
docker-compose logs -f

# Rebuild and restart
docker-compose down && docker-compose build && docker-compose up
```

- Session Data: `./api/data` - Stores practice session analytics
- Uploads: `./api/uploads` - Stores recordings and emotion data
- Archives: `./api/data/archive` - Archived session data after reset
1. Navigate to Practice

   - Click "Practice" in the navigation menu
   - Select a practice mode (General, Persuasive, Emotive, etc.)

2. Record Your Presentation

   - Allow camera and microphone permissions
   - Click "Start Recording"
   - Speak naturally while the system analyzes

3. Receive Real-time Feedback

   - Monitor live WPM, clarity, and filler word metrics
   - View eye contact and emotion tracking
   - Get instant visual feedback

4. Review Detailed Analysis

   - View the comprehensive post-session breakdown
   - Get AI-generated personalized insights
   - See scores for fluency, coherence, and engagement
   - Receive actionable recommendations
Analytics Dashboard (/analytics):
- View aggregated performance metrics
- Track practice mode distribution
- Monitor emotional expression patterns
- Review recent session history
- Reset analytics with data archiving
AI-Powered Insights (/get-insights):
- Gamified progress tracking (Level/XP system)
- Skill breakdown radar chart
- Dynamic strengths and weaknesses
- Performance trends (improving/declining/stable)
- Personalized recommendations
- Reset progress functionality
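The level/XP mechanic can be sketched in a few lines. The XP formula and the 500-XP level size below are assumptions for illustration, not Vocalyst's actual tuning:

```python
def xp_for_session(duration_minutes: float, avg_score: float) -> int:
    """Award XP proportional to practice time, boosted by session quality."""
    base = int(duration_minutes * 10)      # 10 XP per minute practiced
    bonus = int(base * avg_score / 100)    # up to +100% for a perfect score
    return base + bonus

def level_from_xp(total_xp: int, xp_per_level: int = 500) -> tuple[int, int]:
    """Return (level, xp_into_current_level) on a simple linear curve."""
    return total_xp // xp_per_level + 1, total_xp % xp_per_level
```

Under these assumed numbers, a 10-minute session scored 80/100 earns 180 XP, and 1200 total XP places the user partway through level 3.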
| Technology | Purpose |
|---|---|
| Next.js 14 | React framework with SSR and production optimization |
| React 18 | UI component library with hooks |
| TypeScript | Type-safe JavaScript development |
| Tailwind CSS | Utility-first styling framework |
| Framer Motion | Smooth animations and transitions |
| Recharts | Data visualization and charts |
| Lucide React | Modern icon library |
| Technology | Purpose |
|---|---|
| Flask | Python web framework for API |
| Flask-CORS | Cross-origin resource sharing |
| Google Gemini AI | Dynamic insights generation |
| Neuphonic | Enhanced TTS and speech processing |
| ElevenLabs | High-quality text-to-speech |
| Model | Purpose | Performance |
|---|---|---|
| OpenAI Whisper | Speech-to-text transcription | State-of-the-art accuracy |
| RoBERTa (large) | Logical coherence detection | High performance |
| XGBoost | Speech fluency classification | 93% F1 Score |
| Google Gemini Pro | AI insights generation | Real-time analysis |
- Librosa - Audio feature extraction (MFCC, ZCR, energy)
- Neuphonic - Enhanced speech signal processing
- OpenAI Whisper - Accurate speech transcription
- SoundDevice - Real-time audio capture
- MediaPipe - Face landmark detection (468-point face mesh)
- OpenCV - Video processing and frame analysis
- DeepFace - Facial expression and emotion recognition
- GazeTracking - Eye contact estimation
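Eye-contact estimation from these libraries boils down to checking whether the pupil sits roughly centered between the eye corners. A simplified, hypothetical sketch of that idea; the 0.15 "centered" tolerance is an assumption, and the real pipeline works from GazeTracking/MediaPipe landmark output rather than raw x-coordinates:

```python
def horizontal_gaze_ratio(pupil_x: float, eye_left_x: float, eye_right_x: float) -> float:
    """0.0 = pupil at the left eye corner, 1.0 = at the right corner."""
    return (pupil_x - eye_left_x) / (eye_right_x - eye_left_x)

def is_eye_contact(pupil_x: float, eye_left_x: float, eye_right_x: float,
                   tolerance: float = 0.15) -> bool:
    """Treat a roughly centered pupil as looking toward the camera."""
    return abs(horizontal_gaze_ratio(pupil_x, eye_left_x, eye_right_x) - 0.5) <= tolerance
```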
- Docker - Containerization platform
- Docker Compose - Multi-container orchestration
- Gunicorn - Production WSGI server
- Next.js Production Build - Optimized frontend
```
┌─────────────────────────────────────────────────────────────────┐
│                        VOCALYST PLATFORM                        │
└─────────────────────────────────────────────────────────────────┘
                                 │
                    ┌────────────┴────────────┐
                    │                         │
           ┌────────▼────────┐       ┌────────▼────────┐
           │    Frontend     │       │     Backend     │
           │    Container    │──────►│    Container    │
           │    (Next.js)    │ REST  │     (Flask)     │
           │   Port: 3000    │  API  │   Port: 5328    │
           └─────────────────┘       └────────┬────────┘
                                              │
         ┌────────────────────────────────────┼────────────────────────────────────┐
         │                                    │                                    │
┌────────▼────────┐                  ┌────────▼────────┐                  ┌────────▼────────┐
│  SPEECH MODULE  │                  │  VISUAL MODULE  │                  │   AI INSIGHTS   │
│    (Whisper)    │                  │   (MediaPipe)   │                  │    (Gemini)     │
└────────┬────────┘                  └────────┬────────┘                  └────────┬────────┘
         │                                    │                                    │
   ┌─────┴─────┐                         ┌────▼────┐                          ┌────▼────┐
   │           │                         │DeepFace │                          │Dynamic  │
┌──▼──────┐ ┌──▼──────┐                  │Emotions │                          │Insights │
│Filler   │ │WPM      │                  └─────────┘                          └─────────┘
│Detection│ │Tracking │
└─────────┘ └─────────┘
                                              │
                               ┌──────────────▼──────────────┐
                               │     ANALYTICS DASHBOARD     │
                               │  - Session History          │
                               │  - Trends & Progress        │
                               │  - Recommendations          │
                               └─────────────────────────────┘
```
```
Vocalyst-Main/
├── api/                    # Backend Flask API
│   ├── index.py            # Main API endpoints
│   ├── simple_tts.py       # TTS subprocess handler
│   ├── tonality.py         # Tonality analysis
│   ├── data/               # Session data storage
│   │   ├── sessions.json   # Practice sessions
│   │   └── archive/        # Archived data
│   ├── uploads/            # User recordings
│   ├── Dockerfile          # Backend container config
│   └── requirements.txt    # Python dependencies
│
├── app/                    # Next.js frontend
│   ├── analytics/          # Analytics dashboard
│   ├── get-insights/       # AI insights page
│   ├── practice/           # Practice session interface
│   ├── camera/             # Camera capture
│   ├── tts/                # Text-to-speech lab
│   ├── Dockerfile          # Frontend container config
│   └── page.tsx            # Landing page
│
├── components/             # Reusable React components
│   └── ui/                 # UI component library
│
├── docker-compose.yml      # Multi-container orchestration
├── .env.example            # Environment variables template
├── .dockerignore           # Docker ignore rules
├── .gitignore              # Git ignore rules
├── package.json            # Node.js dependencies
├── requirements.txt        # Python dependencies
└── README.md               # This file
```
- Real-time Metrics: Live WPM, clarity, filler word tracking
- Post-Session Breakdown: Comprehensive analysis with scores
- AI Insights: Unique, personalized feedback per session
- Historical Tracking: Progress monitoring over time
- Aggregated Metrics: Average WPM, filler %, clarity, duration
- Practice Mode Distribution: Visual breakdown by category
- Emotional Patterns: Emotion distribution across sessions
- Recent Sessions: Quick access to session history
- Reset Functionality: Archive and clear data
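The aggregation behind these dashboard numbers can be sketched as a simple averaging pass over stored sessions. The field names (`wpm`, `filler_pct`, `clarity`, `duration_min`) are assumed for illustration; the real records live in `api/data/sessions.json` and may differ:

```python
def aggregate_sessions(sessions: list[dict]) -> dict:
    """Average per-session metrics for the analytics dashboard."""
    if not sessions:
        return {}
    n = len(sessions)
    return {
        "avg_wpm": round(sum(s["wpm"] for s in sessions) / n, 1),
        "avg_filler_pct": round(sum(s["filler_pct"] for s in sessions) / n, 1),
        "avg_clarity": round(sum(s["clarity"] for s in sessions) / n, 1),
        "total_minutes": round(sum(s["duration_min"] for s in sessions), 1),
    }
```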
- Dynamic Analysis: Real-time strengths/weaknesses calculation
- Trend Detection: Improving/declining/stable metrics
- Personalized Recommendations: Actionable improvement tips
- Gamification: Level/XP system for motivation
- Skill Visualization: Radar chart for skill breakdown
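Improving/declining/stable trend detection amounts to comparing recent sessions against earlier ones. A sketch of that idea; the 3-session window and 5% threshold below are assumptions, not Vocalyst's actual values:

```python
from statistics import mean

def detect_trend(scores: list[float], window: int = 3, threshold: float = 0.05) -> str:
    """Classify a metric's trajectory from a chronological list of scores."""
    if len(scores) < 2 * window:
        return "stable"                      # not enough history to judge
    earlier = mean(scores[-2 * window:-window])
    recent = mean(scores[-window:])
    change = (recent - earlier) / earlier if earlier else 0.0
    if change > threshold:
        return "improving"
    if change < -threshold:
        return "declining"
    return "stable"
```

For example, clarity scores of [60, 62, 61, 70, 72, 74] average 61 early and 72 recently, an 18% rise, so the metric is classified as improving.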
- Docker containerization and deployment
- Dynamic AI insights with Gemini API
- Analytics reset with data archiving
- Enhanced insights page with gamification
- Skill breakdown radar chart
- Performance trend analysis
- Multi-voice TTS integration
- Multilingual Support - 20+ languages for global accessibility
- Mobile Application - iOS and Android native apps
- Real-Time Coaching - Live suggestions during presentations
- Team Collaboration - Multi-user sessions and peer feedback
- Custom Training Modules - Industry-specific templates
- Integration APIs - Zoom, Teams, Meet connectivity
- Advanced Emotion AI - Context-aware sentiment analysis
- Voice Cloning - Personalized TTS with user's voice
- Presentation Templates - Pre-built scenarios and scripts
- Export Reports - PDF/PowerPoint presentation reports
- Local Processing: All ML models run locally in Docker containers
- No Data Sharing: Session data stays on your machine
- Environment Variables: Secure API key management
- Data Archiving: Safe reset with backup functionality
- CORS Protection: Configured cross-origin policies
We welcome contributions! Here's how:
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
- Follow PEP 8 for Python, ESLint for TypeScript
- Write meaningful commit messages
- Add comments for complex logic
- Update documentation as needed
This project is licensed under the MIT License - see the LICENSE file for details.
- Shreyyy07 - GitHub Profile
Star this repository if you find it helpful!

Found a bug? Open an issue
Have a feature idea? Start a discussion

Made with ❤️ by Shreyyy07