fully functional Agentic RAG system that implements all the requested features:
- Implementation: OpenAI Whisper model for high-accuracy transcription
- User Experience: Click microphone button → speak for up to 10 seconds → automatic processing
- Technical: Real-time audio capture via MediaRecorder API → Whisper processing → text query
- Files:
backend/stt/streaming_stt.py
,frontend/src/components/VoiceMic.tsx
- Text Extraction: PyMuPDF for clean text extraction from PDFs
- Image Processing: Automatic page screenshots saved as high-quality PNGs
- OCR Integration: Tesseract OCR extracts text from charts, graphs, and images
- Vector Storage: ChromaDB stores text embeddings for semantic search
- Files:
backend/rag/pdf_processor.py
,backend/rag/chroma_store.py
- RAG Component: Semantic search through uploaded PDF content
- Web Search Agent: DuckDuckGo integration for recent/external information
- Google Drive MCP: Model Context Protocol for searching user's Google Drive
- Smart Routing: System intelligently decides which sources to query based on the question
- Files:
backend/rag/query_engine.py
,backend/agents/web_search_agent.py
,backend/mcp/google_drive_client.py
- Smart Citations: Responses include [1], [2], [3] style citations
- Source Tracking: Each citation shows exactly where information came from
- Multi-Source Display: Clear indication of PDF pages, Google Drive docs, web results
- Source Summary: Count of documents consulted from each source type
- Files: Frontend citation display in
ChatBox.tsx
- PDF Page Images: Click citation images to view full-size page screenshots
- Image Modal: Beautiful fullscreen overlay with smooth transitions
- External Links: Clickable links to Google Drive documents and web sources
- Preview System: Thumbnail images show relevant PDF pages in citations
- Files: Image modal system in
frontend/src/components/ChatBox.tsx
🌐 Frontend (React + TypeScript + Tailwind CSS)
├── 💬 ChatBox.tsx # Main chat interface with citations & modals
├── 📄 UploadPDF.tsx # PDF upload with progress feedback
├── 🎙️ VoiceMic.tsx # Voice recording and submission
└── 🚀 App.tsx # Main application with feature showcase
🔧 Backend (FastAPI + Python)
├── 🌍 main.py # REST API server with CORS support
├── 📚 rag/
│ ├── 📄 pdf_processor.py # PDF text + image extraction + OCR
│ ├── 🗃️ chroma_store.py # Vector database operations
│ ├── 🧠 query_engine.py # Multi-source query orchestration
│ └── 🔤 embedder.py # Text embedding utilities
├── 🎙️ stt/
│ └── 🗣️ streaming_stt.py # Whisper voice transcription
├── 🤖 agents/
│ └── 🌐 web_search_agent.py # DuckDuckGo search integration
└── ☁️ mcp/
└── 📁 google_drive_client.py # Google Drive Model Context Protocol
- Upload: User uploads PDF with charts/graphs →
POST /upload-pdf/
- Extract: PyMuPDF extracts text + renders page images →
images/page_X.png
- OCR: Tesseract processes images for additional text extraction
- Combine: Text from PDF + OCR text from images merged by page
- Vectorize: Combined text chunked and embedded in ChromaDB
- Ready: System confirms processing complete with stats
- Record: User clicks mic → MediaRecorder captures 10s audio
- Upload: Audio blob sent to →
POST /voice-query/
- Transcribe: Whisper converts speech to text
- Process: Text query goes through full RAG pipeline
- Respond: Returns transcription + answer + citations + images
- Analyze: System analyzes query for keywords ("latest", "current", etc.)
- RAG Search: ChromaDB semantic search on uploaded PDFs
- Web Search: DuckDuckGo search for recent/external information
- Drive Search: Google Drive MCP for user's cloud documents
- Synthesize: Gemini LLM combines all sources with proper citations
- Format: Response with clickable citations and preview images
- FastAPI: Modern async web framework with automatic OpenAPI docs
- LangChain: RAG pipeline orchestration and LLM integration
- ChromaDB: High-performance vector database for embeddings
- OpenAI Whisper: State-of-the-art speech recognition
- PyMuPDF: Fast PDF processing without external dependencies
- Tesseract OCR: Optical character recognition for images
- Google Gemini: Advanced LLM for response generation
- React 18: Modern component-based UI framework
- TypeScript: Type safety and better development experience
- Tailwind CSS: Utility-first styling for rapid UI development
- Vite: Fast build tool with HMR for development
- Axios: HTTP client for API communication
- CORS Support: Proper cross-origin resource sharing setup
- File Upload: Multipart form handling for PDFs and audio
- Static File Serving: Efficient image serving for citations
- Error Handling: Comprehensive error handling throughout
- Progress Feedback: Real-time upload and processing status
- Responsive Design: Works on desktop and mobile devices
📦 chatbot-query/
├── 📄 requirements.txt # Python dependencies
├── 📄 QUICK_START.md # Quick setup guide
├── 📄 setup_instructions.md # Detailed setup instructions
├── 🔧 start.bat # Windows startup script
├── 🧪 test_system.py # System validation script
├── 🔧 backend/
│ ├── 📄 .env # Environment variables
│ ├── 🌍 main.py # FastAPI application
│ ├── 📚 rag/ # RAG system components
│ ├── 🎙️ stt/ # Speech-to-text system
│ ├── 🤖 agents/ # Web search agents
│ ├── ☁️ mcp/ # Google Drive integration
│ ├── 📁 temp/ # Temporary file storage
│ ├── 🖼️ images/ # PDF page images and screenshots
│ └── 📁 uploads/ # Uploaded file storage
└── 🌐 frontend/
├── 📄 package.json # Node.js dependencies
├── ⚙️ vite.config.ts # Vite configuration with proxy
├── 🎨 tailwind.config.js # Tailwind CSS configuration
└── 📁 src/
├── 🚀 App.tsx # Main application component
├── 📁 components/ # React components
└── 🎨 assets/ # Static assets
# Edit backend/.env and set:
GOOGLE_API_KEY=your_actual_google_api_key_from_makersuite
Option A - Use Startup Script:
# Double-click start.bat (Windows)
Option B - Manual Start:
# Terminal 1 - Backend
cd backend
python main.py
# Terminal 2 - Frontend
cd frontend
npm run dev
- Frontend: http://localhost:5173
- Backend API: http://localhost:8001
- API Docs: http://localhost:8001/docs
Your system now successfully implements:
- ✅ Advanced Voice Interface: Click-to-record voice queries with Whisper STT
- ✅ MultiModal Document Processing: PDFs with text, images, and charts
- ✅ Intelligent Information Retrieval: RAG + Web Search + Google Drive
- ✅ Transparent Source Attribution: Smart citations with click-to-view
- ✅ Modern User Experience: Responsive design with real-time feedback
- ✅ Production-Ready Architecture: Scalable FastAPI + React stack
Your Agentic RAG Chatbot is complete and ready for use! The system provides enterprise-level functionality with a modern, intuitive interface. All requested features have been implemented with production-quality code and comprehensive documentation.
Enjoy your new AI-powered document analysis and voice query system! 🚀