
RAG VOICE ASSISTANT πŸŽ™οΈ

Transforming Document Interactions with Voice Intelligence


Built with cutting-edge AI technologies:

FastAPI Streamlit OpenAI LangChain Python FAISS


Table of Contents

  • Overview
  • Features
  • Getting Started
  • Usage
  • API Documentation
  • Architecture
  • Contributing
  • Support

Overview

RAG Voice Assistant is an advanced AI-powered application that combines Retrieval-Augmented Generation (RAG) with voice interaction capabilities. The system allows users to upload PDF documents and interact with them through both text and voice interfaces, providing intelligent responses based on document content.

Key Capabilities

πŸ” Document Intelligence: Upload and process PDF documents with advanced text chunking and embedding
πŸŽ™οΈ Voice Interaction: Real-time speech-to-text and text-to-speech capabilities
πŸ’¬ Intelligent Chat: Context-aware responses using OpenAI's GPT models
πŸ”Š Audio Processing: Support for multiple voice models and audio formats
⚑ Real-time Processing: Live audio transcription and instant responses


Features

🎯 Core Features

  • PDF Document Processing: Advanced text extraction and chunking using PyPDF
  • Vector Search: FAISS-powered similarity search for relevant document retrieval
  • Multi-modal Interaction: Text, voice, and audio file input support
  • Real-time Transcription: Live speech-to-text using OpenAI Whisper
  • Text-to-Speech: Multiple voice options with OpenAI TTS
  • Context-aware Responses: RAG-based intelligent document querying
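The chunking step above can be sketched in a few lines. This is a simplified pure-Python illustration of overlapping fixed-size chunks; the project itself uses LangChain's RecursiveCharacterTextSplitter, and the sizes below are illustrative assumptions, not the project's actual settings:

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into fixed-size chunks that overlap by `overlap` characters,
    so context spanning a chunk boundary is not lost at retrieval time."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks = []
    step = chunk_size - overlap  # advance less than chunk_size to create overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks

sample = "word " * 300  # stand-in for text extracted from a PDF
chunks = chunk_text(sample, chunk_size=500, overlap=50)
```

Each chunk shares its last 50 characters with the start of the next one, which is what lets the retriever surface a passage even when the relevant sentence straddles a boundary.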

πŸ› οΈ Technical Features

  • FastAPI Backend: High-performance async API with automatic documentation
  • Streamlit Frontend: Interactive web interface with multiple tabs
  • WebRTC Integration: Real-time audio streaming capabilities
  • Modular Architecture: Separate backend and frontend for scalability
  • Error Handling: Comprehensive logging and error management
  • File Management: Automatic cleanup and temporary file handling

Getting Started

Prerequisites

This project requires the following dependencies:

  • Programming Language: Python 3.8+
  • Package Manager: pip
  • API Keys: OpenAI API key (required)
  • Audio Support: System audio drivers for voice features

Installation

Clone the repository and install its dependencies:

  1. Clone the repository:

    git clone https://github.com/hparreao/rag-voice-assistant.git
  2. Navigate to the project directory:

    cd rag-voice-assistant
  3. Install the dependencies:

    Using pip:

    pip install -r requirements.txt

Configuration

  1. Set up environment variables: Create a .env file in the root directory:

    OPENAI_API_KEY=your_openai_api_key_here
  2. Verify installation:

    python -c "import openai; print('OpenAI installed successfully')"
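A minimal sketch of how the backend can pick up the key at runtime. The helper name below is illustrative, not part of the project; the project may also load the .env file automatically via python-dotenv:

```python
import os

def get_openai_key():
    """Return the OpenAI API key from the environment, failing loudly if unset."""
    key = os.environ.get("OPENAI_API_KEY", "")
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set - create a .env file or export it"
        )
    return key

# Demo only: provide a placeholder so the snippet runs without a real key.
os.environ.setdefault("OPENAI_API_KEY", "sk-placeholder")
key = get_openai_key()
```

Failing fast on a missing key at startup gives a clearer error than a mid-request authentication failure from the OpenAI API.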

Usage

Running the Backend

Start the FastAPI server with hot reload:

uvicorn main:app --reload

The API will be available at http://localhost:8000 by default, with interactive documentation at http://localhost:8000/docs.

Running the Frontend

Launch the Streamlit interface:

streamlit run frontend.py

The web interface will open at: http://localhost:8501

Using the Application

1. Document Upload

  • Navigate to the sidebar "Gerenciamento de Documentos" (Document Management)
  • Upload one or more PDF files
  • Click "Processar" (Process) to index the documents

2. Chat Interface

  • Use the "Chat" tab for text-based questions
  • Ask questions about your uploaded documents
  • Receive both text and audio responses

3. Voice Input

  • Switch to the "Entrada por Voz" (Voice Input) tab
  • Grant microphone permissions
  • Speak your questions naturally
  • View real-time transcription

4. Audio Features

  • Text-to-Speech: Convert any text to audio with voice selection
  • Audio-to-Text: Upload MP3 files for transcription
  • Voice Models: Choose from 6 different voice options
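A hypothetical helper for preparing a /text-to-audio call. The field names ("text", "voice") are assumptions, since the README does not document the form schema; the voice list matches OpenAI's six standard TTS voices, which the backend most likely passes through:

```python
# OpenAI's TTS API ships six standard voices; the README's "6 different
# voice options" presumably maps onto these.
OPENAI_TTS_VOICES = ("alloy", "echo", "fable", "onyx", "nova", "shimmer")

def build_tts_form(text, voice="alloy"):
    """Validate inputs and return the multipart form fields for /text-to-audio.
    Field names are assumptions about the backend's schema."""
    if voice not in OPENAI_TTS_VOICES:
        raise ValueError(f"unknown voice {voice!r}; choose one of {OPENAI_TTS_VOICES}")
    return {"text": text, "voice": voice}

form = build_tts_form("Hello from the RAG Voice Assistant", voice="nova")
```

Validating the voice name client-side turns a server-side 500 into an immediate, readable error.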

API Documentation

Core Endpoints

Document Management

POST /upload
Content-Type: multipart/form-data

Upload and process PDF documents for indexing.

Query Processing

POST /query
Content-Type: application/json

{
  "question": "Your question about the documents"
}
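A stdlib-only client sketch for the query endpoint. The base URL assumes uvicorn's default port; the request is built but not sent, so the backend does not need to be running:

```python
import json
from urllib import request

API_URL = "http://localhost:8000"  # uvicorn's default address (assumption)

def build_query_request(question):
    """Build (but do not send) the POST /query request."""
    body = json.dumps({"question": question}).encode("utf-8")
    return request.Request(
        f"{API_URL}/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_query_request("What are the key findings in the report?")
# With the backend running, send it with:
#   answer = json.loads(request.urlopen(req).read())
```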

Audio Processing

POST /text-to-audio
Content-Type: multipart/form-data

Convert text to speech with voice selection.
POST /audio-to-text
Content-Type: multipart/form-data

Transcribe audio files to text using Whisper.

Response Formats

Query Response:

{
  "response": "AI-generated answer based on document content"
}

Error Response:

{
  "detail": "Error description"
}
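A small sketch of handling the two documented response shapes: a successful query returns a "response" key, while FastAPI reports errors under "detail":

```python
def parse_query_response(payload):
    """Extract the answer from a /query response, raising on API errors."""
    if "response" in payload:
        return payload["response"]
    if "detail" in payload:
        raise RuntimeError(f"API error: {payload['detail']}")
    raise ValueError(f"unexpected payload: {payload!r}")

answer = parse_query_response({"response": "The report covers Q3 revenue."})
```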

Architecture

System Design

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Streamlit     β”‚    β”‚    FastAPI      β”‚    β”‚    OpenAI       β”‚
β”‚   Frontend      │◄──►│    Backend      │◄──►│    Services     β”‚
β”‚                 β”‚    β”‚                 β”‚    β”‚                 β”‚
β”‚ β€’ Chat UI       β”‚    β”‚ β€’ RAG Pipeline  β”‚    β”‚ β€’ GPT Models    β”‚
β”‚ β€’ Voice Input   β”‚    β”‚ β€’ Audio Proc.   β”‚    β”‚ β€’ Whisper       β”‚
β”‚ β€’ File Upload   β”‚    β”‚ β€’ Vector Store  β”‚    β”‚ β€’ TTS           β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                              β”‚
                              β–Ό
                       β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                       β”‚    FAISS        β”‚
                       β”‚  Vector Store   β”‚
                       β”‚                 β”‚
                       β”‚ β€’ Embeddings    β”‚
                       β”‚ β€’ Similarity    β”‚
                       β”‚ β€’ Search        β”‚
                       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Data Flow

  1. Document Processing: PDFs β†’ Text Chunks β†’ Embeddings β†’ Vector Store
  2. Query Processing: User Input β†’ Similarity Search β†’ Context Retrieval β†’ LLM β†’ Response
  3. Audio Processing: Voice Input β†’ Whisper β†’ Text β†’ Query Pipeline β†’ TTS β†’ Audio Output
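The query pipeline above can be sketched end to end with toy components. Word-overlap scoring stands in for OpenAI embeddings plus FAISS, and the "LLM" is a stub that echoes the retrieved context; the real system embeds chunks and calls GPT-3.5-turbo:

```python
def score(query, chunk):
    """Toy relevance score: shared lowercase words between query and chunk."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))

def retrieve(query, chunks, k=2):
    """Return the k chunks most relevant to the query (stands in for FAISS)."""
    return sorted(chunks, key=lambda c: score(query, c), reverse=True)[:k]

def answer(query, chunks):
    """Assemble retrieved context into a response (stub in place of the LLM)."""
    context = " ".join(retrieve(query, chunks))
    return f"Answer based on: {context}"

docs = [
    "The invoice total is 420 euros.",
    "Shipping takes five business days.",
    "Returns are accepted within 30 days.",
]
result = answer("How long does shipping take?", docs)
```

The structure mirrors the real flow: score every chunk against the query, keep the top k, and hand that context to the generator.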

Key Components

  • Document Loader: PyPDFLoader for PDF text extraction
  • Text Splitter: RecursiveCharacterTextSplitter for intelligent chunking
  • Embeddings: OpenAI embeddings for semantic search
  • Vector Store: FAISS for efficient similarity search
  • LLM: OpenAI GPT-3.5-turbo for response generation
  • Audio Processing: OpenAI Whisper (STT) and TTS models

Contributing

We welcome contributions to improve the RAG Voice Assistant! Here's how you can help:

Development Setup

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/your-feature
  3. Make your changes and test thoroughly
  4. Commit your changes: git commit -m "Add your feature"
  5. Push to the branch: git push origin feature/your-feature
  6. Open a Pull Request

Areas for Contribution

  • 🌍 Internationalization: Add support for more languages
  • 🎨 UI/UX: Improve the frontend interface
  • πŸ”§ Performance: Optimize vector search and processing
  • πŸ“± Mobile: Add mobile-responsive design
  • πŸ§ͺ Testing: Add comprehensive test coverage
  • πŸ“š Documentation: Improve docs and examples

Support

If you encounter any issues or have questions, please open an issue on the repository's issue tracker.


Made with ❀️ by Hugo Parreão

⭐ Star this project β€’ 🍴 Fork it β€’ πŸ“’ Report Issues
