Full-stack text-to-speech web application powered by Kokoro-82M — React frontend, FastAPI backend, Docker containerized. Includes an Upload & Clean pipeline for processing transcripts and PDFs before generating speech. Runs locally via start.bat or deployed on Render (frontend) + Hugging Face Spaces (backend).
| Layer | URL |
|---|---|
| Frontend | https://vocably.onrender.com |
| Backend API | https://gilfoyle99213-vocably-backend.hf.space |
| API Health | https://gilfoyle99213-vocably-backend.hf.space/health |
| API Docs | https://gilfoyle99213-vocably-backend.hf.space/docs |
- Frontend: React 19, Tailwind CSS v4, Vite — deployed on Render
- Backend: FastAPI + Uvicorn, Kokoro-82M (PyTorch) — deployed on Hugging Face Spaces
- Upload & Clean: Ollama (qwen3.5:4b) — local AI for transcript and PDF cleanup before TTS
- YouTube Transcript: `youtube-transcript-api` — scrapes closed-caption data directly from YouTube, no API key required
- Infra: Docker (`python:3.11-slim`, non-root user, layer-cached build)
```
Browser (Render)
│
├── POST /api/tts ──► Kokoro-82M ──► WAV
│
├── POST /api/clean ──► detect format ──► parser + Ollama ──► clean text
│
├── POST /api/extract-pdf ──► pymupdf / Tesseract OCR ──► Ollama ──► clean text
│
└── POST /api/youtube-transcript ──► extract video ID ──► youtube-transcript-api ──► Ollama ──► clean text
```
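The `detect format` step of `/api/clean` has to distinguish SRT, VTT, Markdown, and plain text before picking a parser. A minimal sketch of that decision — extension first, content sniffing as a fallback; the function name and heuristics here are assumptions, not the actual backend code:

```python
import re

def detect_format(filename: str, text: str) -> str:
    """Guess the upload's format: trust the extension when it is
    unambiguous, otherwise sniff the content. Illustrative only."""
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    if ext in {"srt", "vtt", "md"}:
        return ext
    # VTT files must begin with a WEBVTT header line
    if text.lstrip().startswith("WEBVTT"):
        return "vtt"
    # SRT cues: a numeric index line followed by "HH:MM:SS,mmm --> ..."
    if re.search(r"^\d+\s*\n\d{2}:\d{2}:\d{2},\d{3} -->", text, re.MULTILINE):
        return "srt"
    return "txt"
```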
Prerequisites: Node.js 18+, Python 3.10+, 4 GB RAM (model downloads ~500 MB on first run)
```
git clone https://github.com/AbdulGani11/Vocably.git
cd Vocably
```

Environment files (already in the repo — no secrets):
| File | Used when | Contains |
|---|---|---|
| `.env.development` | `npm run dev` | `localhost:8000` — local backend |
| `.env.production` | `npm run build` | HF Spaces URL — cloud backend |
These files are committed because they contain no secrets — only public URLs. Vite automatically picks the correct file based on the command.
Windows:

```
.\start.bat   # starts backend + frontend together
```

The app accepts `.txt`, `.md`, `.srt`, `.vtt`, and `.pdf` files. Before the text reaches the TTS engine, a two-stage cleaning pipeline runs:
- Format parser (deterministic) — strips timestamps, cue IDs, HTML tags from SRT/VTT; uses Tesseract OCR for scanned PDFs
- Ollama LLM (qwen3.5:4b) — removes spoken filler words and cleans prose structure
Ollama is optional. If it is not running, the parsed text is loaded as-is.
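The deterministic stage-1 parser for SRT/VTT described above amounts to dropping cue indices, timestamp lines, and inline tags, keeping only the spoken lines. A sketch of that step — the helper name and the duplicate-line heuristic are assumptions, not the backend's exact implementation:

```python
import re

def strip_captions(text: str) -> str:
    """Stage-1 deterministic cleanup for SRT/VTT: remove the WEBVTT
    header, numeric cue IDs, timestamp lines, and inline HTML-ish
    tags like <i> or <c>. Illustrative sketch only."""
    kept = []
    for line in text.splitlines():
        s = line.strip()
        if not s or s.upper().startswith("WEBVTT"):
            continue
        if s.isdigit():          # SRT cue index
            continue
        if "-->" in s:           # timestamp / cue timing line
            continue
        kept.append(re.sub(r"<[^>]+>", "", s))
    # collapse consecutive duplicate lines, common in auto-captions
    out = []
    for s in kept:
        if not out or out[-1] != s:
            out.append(s)
    return " ".join(out)
```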
Paste any YouTube URL into the YouTube input in the card header. The backend extracts the video ID, fetches closed-caption data directly from YouTube's servers (no API key required), and runs the same Ollama cleaning pipeline before loading the transcript into the textarea. Works with auto-generated and manual captions.
Supports URL formats: `watch?v=`, `youtu.be/`, `/shorts/`, `/embed/`, `/live/`
Beginners: You do not need Docker for daily development. Use `.\start.bat` — it handles everything. Docker is only needed to test the containerized backend locally.
```
docker-compose up --build   # local container (port 8000)
docker-compose down
```

See Documents/documentation.md — full technical reference covering setup, architecture, the Upload & Clean pipeline, AI concepts, Docker, deployment, and troubleshooting.
MIT