🧠 Agentic RAG Chatbot - Complete Implementation

🎯 Project Overview

An agentic, multimodal RAG assistant with voice queries, PDF/image understanding, web search, and grounded answers. This fully functional system implements all of the requested features:

Feature 1: Streaming Speech-to-Text (STT)

  • Implementation: OpenAI Whisper model for high-accuracy transcription
  • User Experience: Click microphone button → speak for up to 10 seconds → automatic processing
  • Technical: Real-time audio capture via MediaRecorder API → Whisper processing → text query
  • Files: backend/stt/streaming_stt.py, frontend/src/components/VoiceMic.tsx
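
For reference, a minimal sketch of the transcription step using the openai-whisper package (the model size and file handling are assumptions; the actual logic lives in backend/stt/streaming_stt.py):

import whisper

# Load once at startup; "base" trades a little accuracy for speed (assumption).
_model = whisper.load_model("base")

def transcribe_audio(audio_path: str) -> str:
    """Convert a recorded audio file into a text query."""
    result = _model.transcribe(audio_path)
    return result["text"].strip()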

Feature 2: MultiModal RAG with PDF Images & Graphs

  • Text Extraction: PyMuPDF for clean text extraction from PDFs
  • Image Processing: Automatic page screenshots saved as high-quality PNGs
  • OCR Integration: Tesseract OCR extracts text from charts, graphs, and images
  • Vector Storage: ChromaDB stores text embeddings for semantic search
  • Files: backend/rag/pdf_processor.py, backend/rag/chroma_store.py
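
A rough sketch of the per-page extraction described above, combining PyMuPDF rendering with Tesseract OCR (file paths, DPI, and the return shape are illustrative assumptions; see backend/rag/pdf_processor.py for the real implementation):

import fitz  # PyMuPDF
import pytesseract
from PIL import Image

def extract_page(pdf_path: str, page_number: int, image_dir: str = "images") -> dict:
    doc = fitz.open(pdf_path)
    page = doc[page_number]
    text = page.get_text()                      # native PDF text layer
    pix = page.get_pixmap(dpi=150)              # render the page as a PNG screenshot
    image_path = f"{image_dir}/page_{page_number + 1}.png"
    pix.save(image_path)
    ocr_text = pytesseract.image_to_string(Image.open(image_path))  # text in charts/graphs
    doc.close()
    return {"text": text, "ocr_text": ocr_text, "image": image_path}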

Feature 3a: Agentic Query Processing (RAG + Web Search + MCP)

  • RAG Component: Semantic search through uploaded PDF content
  • Web Search Agent: DuckDuckGo integration for recent/external information
  • Google Drive MCP: Model Context Protocol for searching user's Google Drive
  • Smart Routing: System intelligently decides which sources to query based on the question
  • Files: backend/rag/query_engine.py, backend/agents/web_search_agent.py, backend/mcp/google_drive_client.py
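
The routing decision can be pictured as a small heuristic like the one below (a hypothetical sketch; the keyword lists and return shape are assumptions, and the real orchestration lives in backend/rag/query_engine.py):

RECENCY_HINTS = ("latest", "current", "today", "recent", "news")
DRIVE_HINTS = ("drive", "my documents", "shared with me")

def choose_sources(query: str) -> dict:
    """Decide which sources to query for a given question."""
    q = query.lower()
    return {
        "rag": True,                                     # always search uploaded PDFs
        "web": any(hint in q for hint in RECENCY_HINTS),
        "drive": any(hint in q for hint in DRIVE_HINTS),
    }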

Feature 3b: Citation & Grounding System

  • Smart Citations: Responses include [1], [2], [3] style citations
  • Source Tracking: Each citation shows exactly where information came from
  • Multi-Source Display: Clear indication of PDF pages, Google Drive docs, web results
  • Source Summary: Count of documents consulted from each source type
  • Files: Frontend citation display in ChatBox.tsx
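
Conceptually, the grounding step numbers each retrieved source so the LLM can cite it inline. A hedged sketch (the field names are assumptions, not the backend's exact schema):

def build_citations(chunks: list[dict]) -> tuple[str, list[dict]]:
    """Number retrieved chunks for [1], [2] style citations."""
    citations, context_lines = [], []
    for i, chunk in enumerate(chunks, start=1):
        citations.append({"id": i, "source": chunk["source"], "page": chunk.get("page")})
        context_lines.append(f"[{i}] {chunk['text']}")
    return "\n".join(context_lines), citations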

Feature 3c: Click-to-View Images & Content

  • PDF Page Images: Click citation images to view full-size page screenshots
  • Image Modal: Beautiful fullscreen overlay with smooth transitions
  • External Links: Clickable links to Google Drive documents and web sources
  • Preview System: Thumbnail images show relevant PDF pages in citations
  • Files: Image modal system in frontend/src/components/ChatBox.tsx

🏗️ System Architecture

🌐 Frontend (React + TypeScript + Tailwind CSS)
├── 💬 ChatBox.tsx      # Main chat interface with citations & modals
├── 📄 UploadPDF.tsx    # PDF upload with progress feedback
├── 🎙️ VoiceMic.tsx     # Voice recording and submission
└── 🚀 App.tsx          # Main application with feature showcase

🔧 Backend (FastAPI + Python)
├── 🌍 main.py                    # REST API server with CORS support
├── 📚 rag/
│   ├── 📄 pdf_processor.py       # PDF text + image extraction + OCR
│   ├── 🗃️ chroma_store.py        # Vector database operations
│   ├── 🧠 query_engine.py        # Multi-source query orchestration
│   └── 🔤 embedder.py            # Text embedding utilities
├── 🎙️ stt/
│   └── 🗣️ streaming_stt.py       # Whisper voice transcription
├── 🤖 agents/
│   └── 🌐 web_search_agent.py    # DuckDuckGo search integration
└── ☁️ mcp/
    └── 📁 google_drive_client.py # Google Drive Model Context Protocol

🔄 Complete Workflow Examples

📄 PDF Processing Workflow:

  1. Upload: User uploads PDF with charts/graphs → POST /upload-pdf/
  2. Extract: PyMuPDF extracts text + renders page images → images/page_X.png
  3. OCR: Tesseract processes images for additional text extraction
  4. Combine: Text from PDF + OCR text from images merged by page
  5. Vectorize: Combined text chunked and embedded in ChromaDB
  6. Ready: System confirms processing complete with stats
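
Step 5 can be approximated with ChromaDB's persistent client as below (the collection name, chunk size, and reliance on ChromaDB's default embedding function are assumptions; the project's own embedding utilities live in backend/rag/embedder.py and backend/rag/chroma_store.py):

import chromadb

client = chromadb.PersistentClient(path="chroma_db")
collection = client.get_or_create_collection("pdf_pages")

def index_page(page_text: str, page_number: int, chunk_size: int = 1000) -> None:
    """Chunk a page's combined text and store it for semantic search."""
    chunks = [page_text[i:i + chunk_size] for i in range(0, len(page_text), chunk_size)]
    collection.add(
        documents=chunks,
        ids=[f"page{page_number}-chunk{j}" for j in range(len(chunks))],
        metadatas=[{"page": page_number} for _ in chunks],
    )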

🎙️ Voice Query Workflow:

  1. Record: User clicks mic → MediaRecorder captures 10s audio
  2. Upload: Audio blob sent → POST /voice-query/
  3. Transcribe: Whisper converts speech to text
  4. Process: Text query goes through full RAG pipeline
  5. Respond: Returns transcription + answer + citations + images
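
A simplified sketch of the endpoint behind this workflow, reusing the transcribe_audio helper sketched under Feature 1 (run_rag_pipeline is a hypothetical stand-in for the query engine):

from fastapi import FastAPI, File, UploadFile

app = FastAPI()

@app.post("/voice-query/")
async def voice_query(audio: UploadFile = File(...)):
    # Persist the uploaded audio blob, transcribe it, then run the normal RAG pipeline.
    temp_path = f"temp/{audio.filename}"
    with open(temp_path, "wb") as f:
        f.write(await audio.read())
    transcription = transcribe_audio(temp_path)   # Whisper STT (see Feature 1 sketch)
    answer = run_rag_pipeline(transcription)      # RAG + citations (hypothetical helper)
    return {"transcription": transcription, **answer}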

🧠 Agentic Query Workflow:

  1. Analyze: System analyzes query for keywords ("latest", "current", etc.)
  2. RAG Search: ChromaDB semantic search on uploaded PDFs
  3. Web Search: DuckDuckGo search for recent/external information
  4. Drive Search: Google Drive MCP for user's cloud documents
  5. Synthesize: Gemini LLM combines all sources with proper citations
  6. Format: Response with clickable citations and preview images
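
The synthesis step (5) can be sketched with the Gemini SDK directly; the project routes this through LangChain, so treat the model name and prompt wording below as assumptions:

import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-flash")  # model name is an assumption

def synthesize_answer(question: str, numbered_context: str) -> str:
    """Combine numbered sources into a grounded, citation-bearing answer."""
    prompt = (
        "Answer the question using only the numbered sources below and "
        "cite them inline as [1], [2], ...\n\n"
        f"Sources:\n{numbered_context}\n\nQuestion: {question}"
    )
    return model.generate_content(prompt).text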

🛠️ Technical Implementation Details

Backend Technologies:

  • FastAPI: Modern async web framework with automatic OpenAPI docs
  • LangChain: RAG pipeline orchestration and LLM integration
  • ChromaDB: High-performance vector database for embeddings
  • OpenAI Whisper: State-of-the-art speech recognition
  • PyMuPDF: Fast PDF processing without external dependencies
  • Tesseract OCR: Optical character recognition for images
  • Google Gemini: Advanced LLM for response generation

Frontend Technologies:

  • React 18: Modern component-based UI framework
  • TypeScript: Type safety and better development experience
  • Tailwind CSS: Utility-first styling for rapid UI development
  • Vite: Fast build tool with HMR for development
  • Axios: HTTP client for API communication

Key Features:

  • CORS Support: Proper cross-origin resource sharing setup
  • File Upload: Multipart form handling for PDFs and audio
  • Static File Serving: Efficient image serving for citations
  • Error Handling: Comprehensive error handling throughout
  • Progress Feedback: Real-time upload and processing status
  • Responsive Design: Works on desktop and mobile devices
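
The CORS and static-file features typically reduce to a few lines of FastAPI configuration, roughly as below (the allowed origin and directory names are assumptions):

from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
from fastapi.staticfiles import StaticFiles

app = FastAPI()
app.add_middleware(
    CORSMiddleware,
    allow_origins=["http://localhost:5173"],  # Vite dev server (assumption)
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)
app.mount("/images", StaticFiles(directory="images"), name="images")  # citation previews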

📁 File Structure

📦 chatbot-query/
├── 📄 requirements.txt          # Python dependencies
├── 📄 QUICK_START.md           # Quick setup guide
├── 📄 setup_instructions.md    # Detailed setup instructions
├── 🔧 start.bat               # Windows startup script
├── 🧪 test_system.py          # System validation script
├── 🔧 backend/
│   ├── 📄 .env                # Environment variables
│   ├── 🌍 main.py             # FastAPI application
│   ├── 📚 rag/                # RAG system components
│   ├── 🎙️ stt/               # Speech-to-text system
│   ├── 🤖 agents/             # Web search agents
│   ├── ☁️ mcp/                # Google Drive integration
│   ├── 📁 temp/               # Temporary file storage
│   ├── 🖼️ images/             # PDF page images and screenshots
│   └── 📁 uploads/            # Uploaded file storage
└── 🌐 frontend/
    ├── 📄 package.json        # Node.js dependencies
    ├── ⚙️ vite.config.ts      # Vite configuration with proxy
    ├── 🎨 tailwind.config.js  # Tailwind CSS configuration
    └── 📁 src/
        ├── 🚀 App.tsx         # Main application component
        ├── 📁 components/     # React components
        └── 🎨 assets/         # Static assets

🚀 Quick Start (Final Steps)

1. Set Google API Key (Required)

# Edit backend/.env and set:
GOOGLE_API_KEY=your_actual_google_api_key_from_makersuite

2. Start the System

Option A - Use Startup Script:

# Double-click start.bat (Windows)

Option B - Manual Start:

# Terminal 1 - Backend
cd backend
python main.py

# Terminal 2 - Frontend
cd frontend
npm run dev

3. Access the Application

Once both servers are running, open the frontend in your browser: Vite's dev server defaults to http://localhost:5173, and the FastAPI backend typically listens on http://localhost:8000 unless configured otherwise.

🎯 Success Metrics

Your system now successfully implements:

  • Advanced Voice Interface: Click-to-record voice queries with Whisper STT
  • MultiModal Document Processing: PDFs with text, images, and charts
  • Intelligent Information Retrieval: RAG + Web Search + Google Drive
  • Transparent Source Attribution: Smart citations with click-to-view
  • Modern User Experience: Responsive design with real-time feedback
  • Production-Ready Architecture: Scalable FastAPI + React stack

🎉 You're Done!

Your Agentic RAG Chatbot is complete and ready for use! The system provides enterprise-level functionality with a modern, intuitive interface. All requested features have been implemented with production-quality code and comprehensive documentation.

Enjoy your new AI-powered document analysis and voice query system! 🚀
