Transforming Document Interactions with Voice Intelligence
RAG Voice Assistant is an advanced AI-powered application that combines Retrieval-Augmented Generation (RAG) with voice interaction capabilities. The system allows users to upload PDF documents and interact with them through both text and voice interfaces, providing intelligent responses based on document content.
- 📄 Document Intelligence: Upload and process PDF documents with advanced text chunking and embedding
- 🎙️ Voice Interaction: Real-time speech-to-text and text-to-speech capabilities
- 💬 Intelligent Chat: Context-aware responses using OpenAI's GPT models
- 🎵 Audio Processing: Support for multiple voice models and audio formats
- ⚡ Real-time Processing: Live audio transcription and instant responses
- PDF Document Processing: Advanced text extraction and chunking using PyPDF
- Vector Search: FAISS-powered similarity search for relevant document retrieval
- Multi-modal Interaction: Text, voice, and audio file input support
- Real-time Transcription: Live speech-to-text using OpenAI Whisper
- Text-to-Speech: Multiple voice options with OpenAI TTS
- Context-aware Responses: RAG-based intelligent document querying
- FastAPI Backend: High-performance async API with automatic documentation
- Streamlit Frontend: Interactive web interface with multiple tabs
- WebRTC Integration: Real-time audio streaming capabilities
- Modular Architecture: Separate backend and frontend for scalability
- Error Handling: Comprehensive logging and error management
- File Management: Automatic cleanup and temporary file handling
This project requires the following dependencies:
- Programming Language: Python 3.8+
- Package Manager: pip
- API Keys: OpenAI API key (required)
- Audio Support: System audio drivers for voice features
Build the RAG Voice Assistant from source and install dependencies:

1. Clone the repository:

```bash
git clone https://github.com/yourusername/rag-voice-assistant.git
```

2. Navigate to the project directory:

```bash
cd rag-voice-assistant
```

3. Install the dependencies using pip:

```bash
pip install -r requirements.txt
```

4. Set up environment variables: create a `.env` file in the root directory:

```bash
OPENAI_API_KEY=your_openai_api_key_here
```

5. Verify the installation:

```bash
python -c "import openai; print('OpenAI installed successfully')"
```
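Beyond the import check, you can confirm the key is actually visible to the process. This is a minimal sketch; the helper name `check_api_key` is illustrative and not part of the project:

```python
import os

def check_api_key(env=os.environ):
    """Return a masked confirmation string, or raise if the key is missing.

    Reads OPENAI_API_KEY from the given mapping (defaults to the real
    environment, where python-dotenv or your shell should have placed it).
    """
    key = env.get("OPENAI_API_KEY", "")
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set - check your .env file")
    return "OPENAI_API_KEY found, ends in ..." + key[-4:]

if __name__ == "__main__":
    print(check_api_key())
```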
Start the FastAPI server with hot reload:

```bash
uvicorn main:app --reload
```

The API will be available at:
- API: http://localhost:8000
- Documentation: http://localhost:8000/docs
- Alternative Docs: http://localhost:8000/redoc
Launch the Streamlit interface:

```bash
streamlit run frontend.py
```

The web interface will open at: http://localhost:8501
- Navigate to the sidebar "Gerenciamento de Documentos" (Document Management)
- Upload one or more PDF files
- Click "Processar" (Process) to index the documents
- Use the "Chat" tab for text-based questions
- Ask questions about your uploaded documents
- Receive both text and audio responses
- Switch to the "Entrada por Voz" (Voice Input) tab
- Grant microphone permissions
- Speak your questions naturally
- View real-time transcription
- Text-to-Speech: Convert any text to audio with voice selection
- Audio-to-Text: Upload MP3 files for transcription
- Voice Models: Choose from 6 different voice options
```
POST /upload
Content-Type: multipart/form-data
```

Upload and process PDF documents for indexing.

```
POST /query
Content-Type: application/json

{
  "question": "Your question about the documents"
}
```

```
POST /text-to-audio
Content-Type: multipart/form-data
```

Convert text to speech with voice selection.

```
POST /audio-to-text
Content-Type: multipart/form-data
```

Transcribe audio files to text using Whisper.

Query Response:

```json
{
  "response": "AI-generated answer based on document content"
}
```

Error Response:

```json
{
  "detail": "Error description"
}
```

```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│    Streamlit    │    │     FastAPI     │    │     OpenAI      │
│    Frontend     │───►│     Backend     │───►│    Services     │
│                 │    │                 │    │                 │
│ • Chat UI       │    │ • RAG Pipeline  │    │ • GPT Models    │
│ • Voice Input   │    │ • Audio Proc.   │    │ • Whisper       │
│ • File Upload   │    │ • Vector Store  │    │ • TTS           │
└─────────────────┘    └─────────────────┘    └─────────────────┘
                                │
                                ▼
                       ┌─────────────────┐
                       │      FAISS      │
                       │  Vector Store   │
                       │                 │
                       │ • Embeddings    │
                       │ • Similarity    │
                       │ • Search        │
                       └─────────────────┘
```
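The `/query` endpoint can be exercised with a small stdlib-only client. This is a sketch: the `question` field and the `response` key come from the documented request and response schemas above; everything else (helper name, default URL) is an assumption:

```python
import json
import urllib.request

def build_query_request(question, base_url="http://localhost:8000"):
    """Build the POST /query request documented in the API section.

    The JSON body uses the 'question' field from the request schema.
    """
    body = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        base_url + "/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

if __name__ == "__main__":
    req = build_query_request("What is this document about?")
    # Requires the FastAPI server to be running on localhost:8000
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["response"])
```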
- Document Processing: PDFs → Text Chunks → Embeddings → Vector Store
- Query Processing: User Input → Similarity Search → Context Retrieval → LLM → Response
- Audio Processing: Voice Input → Whisper → Text → Query Pipeline → TTS → Audio Output
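The similarity-search step of the query flow can be sketched with a tiny in-memory stand-in for the FAISS index. Cosine similarity over toy vectors; in the real pipeline the vectors come from OpenAI embeddings and FAISS does the ranking:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query_vec, chunks, k=2):
    """Return the k chunk texts most similar to the query vector.

    `chunks` is a list of (text, vector) pairs - a stand-in for the
    FAISS index built from the document embeddings.
    """
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [text for text, _ in ranked[:k]]
```

The retrieved texts are then concatenated into the prompt context before the LLM call.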
- Document Loader: PyPDFLoader for PDF text extraction
- Text Splitter: RecursiveCharacterTextSplitter for intelligent chunking
- Embeddings: OpenAI embeddings for semantic search
- Vector Store: FAISS for efficient similarity search
- LLM: OpenAI GPT-3.5-turbo for response generation
- Audio Processing: OpenAI Whisper (STT) and TTS models
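A simplified stand-in for what `RecursiveCharacterTextSplitter` does (try coarse separators first, fall back to a hard cut); the chunk size and separator list here are illustrative, not the project's actual settings:

```python
def split_text(text, chunk_size=200, separators=("\n\n", "\n", " ")):
    """Greedy recursive splitter: coarser separators first, hard cut last."""
    if len(text) <= chunk_size:
        return [text] if text else []
    for sep in separators:
        parts = text.split(sep)
        if len(parts) > 1:
            chunks, current = [], ""
            for part in parts:
                candidate = current + sep + part if current else part
                if len(candidate) <= chunk_size:
                    current = candidate
                else:
                    if current:
                        chunks.append(current)
                    current = part
            if current:
                chunks.append(current)
            # Recurse into any piece still over the limit
            return [c for chunk in chunks for c in split_text(chunk, chunk_size, separators)]
    # No separator produced a split: hard cut
    return [text[:chunk_size]] + split_text(text[chunk_size:], chunk_size, separators)
```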
We welcome contributions to improve the RAG Voice Assistant! Here's how you can help:
1. Fork the repository
2. Create a feature branch:

```bash
git checkout -b feature/your-feature
```

3. Make your changes and test thoroughly
4. Commit your changes:

```bash
git commit -m "Add your feature"
```

5. Push to the branch:

```bash
git push origin feature/your-feature
```

6. Open a Pull Request
- 🌍 Internationalization: Add support for more languages
- 🎨 UI/UX: Improve the frontend interface
- 🔧 Performance: Optimize vector search and processing
- 📱 Mobile: Add mobile-responsive design
- 🧪 Testing: Add comprehensive test coverage
- 📚 Documentation: Improve docs and examples
If you encounter any issues or have questions:
- 🐛 Issues: GitHub Issues
- 💬 Discussions: GitHub Discussions

Made with ❤️ by [Hugo Parreão]

⭐ Star this project • 🍴 Fork it • 📢 Report Issues