Hard Word Extractor

A web application that processes audio and video files to extract and classify vocabulary words by CEFR language levels (A1-C2), providing transcription and vocabulary analysis for language learners.

🎯 Features

Manual Step-by-Step Workflow: Full control over each processing step
- Upload audio/video files
- Manually trigger transcription with Whisper AI
- Extract and review words
- Classify with Groq AI (see raw API responses)
- Save and analyze results
CEFR Classification: Groq AI assigns each word a difficulty level (A1-C2) when you run the classification step
Word Context: See each word in context with timestamps
Full Transcription: View complete transcription of your audio
Vocabulary Statistics: Get insights about word frequency and difficulty distribution
Transparency: See exactly what's happening at each step
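
As a rough illustration of the Word Context and Vocabulary Statistics features, the sketch below pulls words out of timestamped transcript segments, then counts word frequency and the spread of levels. The segment shape (start, end, text) follows what Whisper returns for each segment; the function names, the sample data, and the tiny hard-coded level table are assumptions made for this example, not the project's actual code or a real CEFR dataset.

```python
import re
from collections import Counter, defaultdict

# Hypothetical sample data shaped like Whisper segments (times in seconds).
SEGMENTS = [
    {"start": 0.0, "end": 2.5, "text": "The weather is nice today."},
    {"start": 2.5, "end": 6.0, "text": "Nevertheless, the forecast is ambiguous."},
]

# Placeholder level table for illustration only; the app asks Groq for levels.
SAMPLE_LEVELS = {
    "weather": "A1", "nice": "A1", "today": "A1",
    "nevertheless": "B2", "forecast": "B1", "ambiguous": "C1",
}

WORD_RE = re.compile(r"[a-z']+")


def extract_words(segments):
    """Map each lowercase word to the segments (with timestamps) it occurs in."""
    contexts = defaultdict(list)
    for seg in segments:
        for word in WORD_RE.findall(seg["text"].lower()):
            contexts[word].append(
                {"start": seg["start"], "end": seg["end"], "context": seg["text"]}
            )
    return dict(contexts)


def vocabulary_stats(contexts, levels):
    """Return per-word frequencies and the number of distinct words per level."""
    frequency = Counter({word: len(hits) for word, hits in contexts.items()})
    distribution = Counter(levels.get(word, "unknown") for word in contexts)
    return frequency, distribution


if __name__ == "__main__":
    contexts = extract_words(SEGMENTS)
    frequency, distribution = vocabulary_stats(contexts, SAMPLE_LEVELS)
    print(contexts["ambiguous"][0])
    print(frequency.most_common(3))
    print(dict(distribution))
```

Words missing from the level table are counted as "unknown" rather than dropped, so the distribution always accounts for every distinct word in the transcript.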

New in v2.0: We've redesigned the workflow to give you complete control! Instead of automatic processing, you now manually trigger each step and can review intermediate results. See MANUAL_WORKFLOW.md for details.
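
One way to picture the manual workflow is as an ordered list of steps, where each step can only be triggered after the previous one has finished and its result is kept for review. The sketch below is a minimal model of that idea; the `Step` and `ManualWorkflow` names are hypothetical and are not taken from the codebase or from MANUAL_WORKFLOW.md.

```python
from enum import IntEnum


class Step(IntEnum):
    """Processing steps in the order the feature list names them."""
    UPLOADED = 1
    TRANSCRIBED = 2
    WORDS_EXTRACTED = 3
    CLASSIFIED = 4
    SAVED = 5


class ManualWorkflow:
    """Tracks the last finished step and rejects out-of-order triggers."""

    def __init__(self):
        self.completed = None  # no step has run yet
        self.results = {}      # intermediate output, kept so it can be reviewed

    def next_step(self):
        """Return the step that may run next, or None when everything is done."""
        if self.completed is None:
            return Step.UPLOADED
        if self.completed is Step.SAVED:
            return None
        return Step(self.completed + 1)

    def trigger(self, step, result):
        """Record the result of `step`, but only if it is the expected one."""
        expected = self.next_step()
        if step is not expected:
            raise ValueError(f"expected {expected!r}, got {step!r}")
        self.results[step] = result
        self.completed = step


if __name__ == "__main__":
    workflow = ManualWorkflow()
    workflow.trigger(Step.UPLOADED, "talk.mp3")
    workflow.trigger(Step.TRANSCRIBED, "The weather is nice today.")
    print(workflow.next_step())
```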

🛠️ Tech Stack

Backend

Django 4.2+
Django REST Framework
Celery + Redis for async processing
PostgreSQL database
Whisper AI for transcription
Groq API for LLM processing

Frontend

React 18+ with TypeScript
Material-UI (MUI)
Axios for API calls
React Router

DevOps

Docker & Docker Compose
Gunicorn + Nginx
Let's Encrypt SSL (optional)

🚀 Quick Start

Docker Deployment (Recommended)

Prerequisites:

Docker and Docker Compose installed
Groq API key (get one at groq.com)

3 Simple Steps:

1. Clone and configure

git clone https://github.com/yourusername/HardWordExtractor.git
cd HardWordExtractor
# Set your GROQ_API_KEY in docker-compose.dev.yml

2. Start all services

docker compose -f docker-compose.dev.yml up --build

3. Access the application
- Frontend: http://localhost:3000
- Backend API: http://localhost:8000/api/
- Admin Panel: http://localhost:8000/admin/

See docs/DOCKER-QUICKSTART.md for detailed Docker deployment guide.
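
Once the containers are running, a short script can confirm that the three URLs listed above respond. This is only a convenience sketch using the Python standard library; the function names are made up for the example, and it treats any HTTP response (including an error status) as a sign that the service is up.

```python
import urllib.error
import urllib.request

# URLs taken from the Quick Start steps.
ENDPOINTS = {
    "frontend": "http://localhost:3000",
    "backend_api": "http://localhost:8000/api/",
    "admin": "http://localhost:8000/admin/",
}


def is_reachable(url, timeout=3.0):
    """Return True if the server answers at all, even with an HTTP error status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        return True   # the server replied (e.g. 401 or 404), so it is running
    except (urllib.error.URLError, OSError):
        return False  # connection refused, timeout, or similar


def check_services(endpoints=None, probe=is_reachable):
    """Probe every endpoint and return a name -> bool status map."""
    endpoints = ENDPOINTS if endpoints is None else endpoints
    return {name: probe(url) for name, url in endpoints.items()}


if __name__ == "__main__":
    for name, ok in check_services().items():
        print(f"{name:12} {'up' if ok else 'DOWN'}")
```

The `probe` parameter exists so the check can be exercised without real network access, for instance by passing a stub function.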

Manual Development Setup

For development without Docker, see QUICKSTART.md for running services manually.

📖 Documentation

Deployment & Setup

Docker Quick Start - 🚀 Start here! 3 steps to run (173 lines)
Docker Reference - Detailed Docker configuration reference (600+ lines)
Manual Setup - Run services manually (without Docker)
Setup Guide - Detailed development setup

API & Architecture

API Documentation - Complete API endpoints and usage
Architecture Guide - System architecture and design patterns
Manual Workflow - Step-by-step API workflow guide

Configuration

Groq Setup - How to get your Groq API key

🏗️ Architecture Highlights

Backend (Refactored)

backend/transcription/
├── models/              # Data models (6 files)
│   ├── audio.py, transcription.py, word.py
│   ├── statistics.py, processing.py
├── serializers/         # API serialization (6 files)
├── views/               # API endpoints (6 files)
├── services/            # Business logic (organized by domain)
│   ├── audio/          # Audio processing
│   ├── transcription/  # Whisper & processing
│   ├── words/          # Extraction & context
│   └── ai/             # Groq & classification
├── utils/               # Shared utilities (6 files)
│   ├── constants.py, exceptions.py
│   ├── validators.py, responses.py, pagination.py
└── tests/               # Comprehensive test suite
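
To show what keeping business logic in a services layer can look like, here is a minimal sketch of a classification service that receives its LLM client as a plain callable. Every name in it (`ClassificationService`, `ask_llm`, `ClassifiedWord`) and the prompt wording are assumptions for illustration; the real code under services/ai/ and the actual Groq request format may look quite different.

```python
import json
from dataclasses import dataclass
from typing import Callable

CEFR_LEVELS = ("A1", "A2", "B1", "B2", "C1", "C2")


@dataclass
class ClassifiedWord:
    word: str
    level: str


class ClassificationService:
    """Hypothetical service that turns a word list into CEFR-labelled words.

    `ask_llm` is any callable that takes a prompt string and returns the
    model's raw text reply; in the app, a Groq client would fill this role.
    """

    def __init__(self, ask_llm: Callable[[str], str]):
        self._ask_llm = ask_llm

    def build_prompt(self, words):
        return (
            "Classify each word by CEFR level (A1-C2). "
            "Reply with a JSON object mapping each word to its level.\n"
            "Words: " + ", ".join(words)
        )

    def classify(self, words):
        raw = self._ask_llm(self.build_prompt(words))
        levels = json.loads(raw)
        if not isinstance(levels, dict):
            raise ValueError("expected a JSON object in the model reply")
        # Keep only requested words that came back with a valid level.
        return [
            ClassifiedWord(word, levels[word])
            for word in words
            if levels.get(word) in CEFR_LEVELS
        ]


if __name__ == "__main__":
    def fake_llm(prompt):
        # Stand-in for a real API call; the last entry is deliberately invalid.
        return '{"weather": "A1", "ambiguous": "C1", "forecast": "??"}'

    service = ClassificationService(fake_llm)
    print(service.classify(["weather", "ambiguous", "forecast"]))
```

Because the client is injected, views and Celery tasks can share the same service, and tests can swap in a stub instead of calling the API.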

Frontend (Refactored)

frontend/src/
├── components/          # UI components (organized by feature)
│   ├── audio/, transcription/, words/
│   ├── layout/, common/
├── features/            # Feature modules with hooks
│   ├── audio/hooks/    # useAudioUpload, useAudioStatus
│   ├── transcription/hooks/  # useTranscription
│   └── words/hooks/    # useWords
├── hooks/               # Global hooks
│   ├── useDebounce, useLocalStorage
│   ├── useCache, useCachedApi
├── services/            # API communication
└── pages/               # Route components (lazy-loaded)

Performance Features:

✅ Code splitting (44% bundle size reduction)
✅ React.memo on expensive components
✅ API caching with custom hooks
✅ Debounced search inputs

See docs/ARCHITECTURE.md for detailed design patterns and data flow.

🧪 Development

Local Development Setup

See docs/SETUP.md for detailed instructions.

Running Tests

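# Run each command from the repository root; the subshells keep your working directory unchanged.

# Backend tests
(cd backend && python manage.py test)

# Frontend tests
(cd frontend && npm test)

# Test coverage (pytest --cov needs the pytest-cov plugin installed)
(cd backend && pytest --cov)
(cd frontend && npm test -- --coverage)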
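
For contributors adding backend tests, the sketch below shows the general shape of a small unit test. Django's test runner builds on the standard `unittest` module, so a test case like this follows the same pattern. The `is_supported_media` helper and its extension whitelist are hypothetical stand-ins for whatever the project's validators actually check.

```python
import unittest
from pathlib import Path

# Assumed whitelist for illustration; the project's real list may differ.
ALLOWED_EXTENSIONS = {".mp3", ".wav", ".m4a", ".mp4", ".webm"}


def is_supported_media(filename):
    """Return True if the file extension is in the assumed whitelist."""
    return Path(filename).suffix.lower() in ALLOWED_EXTENSIONS


class SupportedMediaTests(unittest.TestCase):
    def test_accepts_audio_and_video(self):
        self.assertTrue(is_supported_media("lecture.MP3"))
        self.assertTrue(is_supported_media("clip.mp4"))

    def test_rejects_other_files(self):
        self.assertFalse(is_supported_media("notes.txt"))
        self.assertFalse(is_supported_media("no_extension"))


if __name__ == "__main__":
    unittest.main()
```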

📦 Project Structure

HardWordExtractor/
├── backend/
│   ├── config/              # Django settings & Celery
│   └── transcription/       # Main app (refactored)
│       ├── models/          # 6 model files
│       ├── views/           # 6 view files
│       ├── serializers/     # 6 serializer files
│       ├── services/        # Business logic (4 domains)
│       ├── utils/           # Shared utilities (6 files)
│       └── tests/           # Test suite (organized by layer)
├── frontend/
│   └── src/
│       ├── components/      # UI components (5 domains)
│       ├── features/        # Feature hooks (3 domains)
│       ├── hooks/           # Global hooks (4 files)
│       ├── services/        # API services
│       ├── pages/           # Route components
│       └── types/           # TypeScript types
├── docker/                  # Docker configurations
├── docs/                    # Documentation
│   ├── ARCHITECTURE.md     # System architecture
│   ├── API.md              # API documentation
│   ├── SETUP.md            # Setup guide
│   └── GROQ_SETUP.md       # Groq API guide
├── scripts/                 # Utility scripts
├── docker-compose.yml       # Docker orchestration
└── README.md

🗺️ Roadmap

Current Status: Phase 1 MVP is complete and production-ready! The Docker deployment has been tested and verified, and it is documented in detail.

See PROJECT_OUTLINE.md and PROJECT_STATUS.md for detailed progress tracking.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

📧 Contact

For questions or support, please open an issue on GitHub.

🙏 Acknowledgments

OpenAI Whisper - Speech recognition
Groq - Fast LLM inference
Django - Web framework
React - Frontend framework
