| 1 | +# Pebblo Open Source SafeRAG Demo |
| 2 | + |
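| 3 | +A Retrieval-Augmented Generation (RAG) pipeline that loads documents from Google Drive through PebbloSafeLoader, indexes them in Qdrant using local HuggingFace embeddings, and answers questions with Llama 3.3 served by Groq, with semantic filtering and identity-based access control applied to the documents. |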
| 4 | + |
| 5 | +## Core Components |
| 6 | + |
| 7 | +- **Document Source**: Google Drive integration for document ingestion |
| 8 | +- **Security**: PebbloSafeLoader for semantic filtering and access control |
| 9 | +- **Vector Store**: Qdrant for efficient document retrieval |
| 10 | +- **Embeddings**: Local HuggingFace embeddings for semantic search |
| 11 | +- **LLM**: Llama 3.3 served through Groq for response generation |
| 12 | + |
| 13 | +## Prerequisites |
| 14 | + |
| 15 | +1. Google Service Account with access to target Drive folder |
| 16 | +2. Qdrant vector database instance |
| 17 | +3. Required Python packages (see `requirements.txt`) |
| 18 | +4. Groq API key (set in the `.env` file) |
| 19 | + |
| 20 | +## Setup Instructions |
| 21 | + |
| 22 | +### 1. Configure Environment |
| 23 | + |
| 24 | +- Set up Google Drive authentication: |
| 25 | + - Follow the guide at: https://python.langchain.com/docs/integrations/document_loaders/google_drive/ |
| 26 | + - Create a service account and download credentials |
| 27 | + - Share your target Google Drive folder with the service account email |
| 28 | + |
| 29 | +- Configure API Keys: |
| 30 | + - Create a `.env` file in the project root |
| 31 | + - Add your Groq API key: `GROQ_API_KEY=your_api_key_here` |
| 32 | + |
| 33 | +- Update `constant.py` with your configuration: |
| 34 | + - Set `SERVICE_ACCOUNT_PATH` to your Google service account credentials |
| 35 | + - Set `INPUT_FOLDER_ID` to your Google Drive folder ID |
| 36 | + - Configure other settings as needed |
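Before starting the services, it can help to confirm that the `.env` file actually contains a real key rather than the placeholder. The following is a minimal sketch using only the standard library; `parse_env_file` and `missing_settings` are hypothetical helper names for illustration and are not part of this project, which may load `.env` differently.

```python
from pathlib import Path


def parse_env_file(path):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in Path(path).read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_settings(env, required=("GROQ_API_KEY",)):
    """Return required names that are absent, empty, or still the placeholder."""
    placeholder = "your_api_key_here"
    return [name for name in required
            if not env.get(name) or env[name] == placeholder]


# Example usage (assumes a .env file in the current directory):
#   problems = missing_settings(parse_env_file(".env"))
#   if problems:
#       raise SystemExit(f"Missing settings: {', '.join(problems)}")
```

The check only covers the `.env` file; values in `constant.py` such as `SERVICE_ACCOUNT_PATH` and `INPUT_FOLDER_ID` would need a similar check of their own.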
| 37 | + |
| 38 | +### 2. Start Required Services |
| 39 | + |
| 40 | +#### Qdrant Vector Database |
| 41 | +```bash |
| 42 | +docker run -p 6333:6333 -p 6334:6334 qdrant/qdrant |
| 43 | +``` |
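Once the container is running, a quick reachability check can save debugging time later. This sketch assumes the default mapping from the command above (REST API on port 6333) and an instance without API-key protection; `qdrant_is_up` is a hypothetical helper name, not part of this project. It calls Qdrant's `GET /collections` REST endpoint using only the standard library.

```python
import urllib.error
import urllib.request


def qdrant_is_up(base_url="http://localhost:6333", timeout=2.0):
    """Return True if GET {base_url}/collections answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/collections",
                                    timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, non-2xx response, or malformed URL.
        return False


# Example usage:
#   if not qdrant_is_up():
#       raise SystemExit("Qdrant is not reachable on localhost:6333")
```

If the instance requires an API key, the request is rejected and the function returns `False`, so the check would need an extra header in that setup.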
| 44 | + |
| 45 | +#### Pebblo Server |
| 46 | +```bash |
| 47 | +pip install pebblo |
| 48 | +pebblo --config-file config.yaml |
| 49 | +``` |
| 50 | + |
| 51 | +### 3. Run the Application |
| 52 | + |
| 53 | +```bash |
| 54 | +python pebblo_opensource_saferag.py |
| 55 | +``` |
| 56 | + |
| 57 | +## Features |
| 58 | + |
| 59 | +- **Secure Document Ingestion**: Semantic filtering during document loading |
| 60 | +- **Identity-Based Access Control**: User-level permissions and authentication |
| 61 | +- **Content Filtering**: Topic and entity-based content filtering |
| 62 | +- **Interactive Interface**: User-friendly query interface |
| 63 | +- **Real-time Search**: Efficient semantic search and retrieval |
| 64 | +- **Privacy-Focused**: Local embeddings and secure data handling |
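To illustrate how identity-based access control and topic/entity filtering can combine, here is a simplified sketch in plain Python. It is not Pebblo's implementation. The metadata keys (`authorized_identities`, `pebblo_semantic_topics`, `pebblo_semantic_entities`) are assumptions modeled on LangChain's Pebblo integration, and `is_visible` and `filter_documents` are hypothetical names.

```python
def is_visible(metadata, user_groups, denied_topics=(), denied_entities=()):
    """Decide whether one document's metadata passes both checks."""
    # Identity check: the user must share at least one identity with the
    # document. Documents without any listed identity are denied by default.
    allowed = set(metadata.get("authorized_identities", []))
    if not allowed & set(user_groups):
        return False
    # Semantic check: reject documents tagged with a denied topic or entity.
    if set(metadata.get("pebblo_semantic_topics", [])) & set(denied_topics):
        return False
    if set(metadata.get("pebblo_semantic_entities", [])) & set(denied_entities):
        return False
    return True


def filter_documents(docs, user_groups, denied_topics=(), denied_entities=()):
    """Keep only the documents (dicts with a 'metadata' key) the user may see."""
    return [d for d in docs
            if is_visible(d["metadata"], user_groups,
                          denied_topics, denied_entities)]


docs = [
    {"id": "a", "metadata": {"authorized_identities": ["hr@example.com"],
                             "pebblo_semantic_topics": ["medical-advice"]}},
    {"id": "b", "metadata": {"authorized_identities": ["eng@example.com"],
                             "pebblo_semantic_entities": ["us-ssn"]}},
    {"id": "c", "metadata": {"authorized_identities": ["eng@example.com"]}},
]
visible = filter_documents(docs, ["eng@example.com"], denied_entities=["us-ssn"])
print([d["id"] for d in visible])  # prints ['c']
```

In a real deployment the filter would typically be pushed down into the vector store query instead of being applied to results in Python, so that unauthorized chunks are never retrieved at all.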
| 65 | + |
| 66 | +## Project Structure |
| 67 | + |
| 68 | +``` |
| 69 | +pebblo_google_drive_opensource/ |
| 70 | +├── pebblo_opensource_saferag.py # Main application file |
| 71 | +├── constant.py # Configuration settings |
| 72 | +├── utils.py # Utility functions |
| 73 | +├── google_auth.py # Google authentication utilities |
| 74 | +├── .env # Environment variables |
| 75 | +└── README.md # This file |
| 76 | +``` |
| 77 | + |
| 78 | +## Security and Privacy |
| 79 | + |
| 80 | +This implementation applies the following security and privacy measures: |
| 81 | + |
| 82 | +- Semantic filtering of sensitive content |
| 83 | +- Identity-based access control |
| 84 | +- Local embedding generation |
| 85 | +- Secure API key management |
| 86 | +- Privacy-preserving document processing |
| 87 | + |
| 88 | +## Contributing |
| 89 | + |
| 90 | +Contributions are welcome! Please feel free to submit a Pull Request. |