Commit dd92db4

Added Pebblo SafeRAG demo using Google Drive, Qdrant, Hugging Face, and Groq API (#599)
* Added demo with all open-source components
1 parent 3a962db commit dd92db4

7 files changed

+513
-0
lines changed

Lines changed: 90 additions & 0 deletions
@@ -0,0 +1,90 @@
# Pebblo Open Source SafeRAG Demo

A Retrieval-Augmented Generation (RAG) pipeline for privacy-focused document retrieval and question answering. It ingests documents from Google Drive, applies semantic filtering and access control through Pebblo, and answers queries with a Groq-hosted LLM.

## Core Components

- **Document Source**: Google Drive integration for document ingestion
- **Security**: PebbloSafeLoader for semantic filtering and access control
- **Vector Store**: Qdrant for efficient document retrieval
- **Embeddings**: Local HuggingFace embeddings for semantic search
- **LLM**: Llama 3.3 (`llama-3.3-70b-versatile`) served through the Groq API

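To make the retrieval step concrete, here is a minimal sketch of what "semantic search" over embeddings amounts to: rank stored vectors by similarity to a query vector. The three-dimensional vectors and document names below are invented for illustration, cosine similarity is an assumed metric, and the demo itself delegates this work to Qdrant and a HuggingFace embedding model rather than to code like this.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float], index: Dict[str, Sequence[float]], k: int = 2) -> List[str]:
    """Return the names of the k stored vectors most similar to the query."""
    ranked = sorted(
        index,
        key=lambda name: cosine_similarity(query, index[name]),
        reverse=True,
    )
    return ranked[:k]


# Hand-made toy vectors; a real embedding model produces much longer ones.
toy_index = {
    "vacation-policy": [0.9, 0.1, 0.0],
    "expense-report": [0.1, 0.9, 0.1],
    "holiday-calendar": [0.8, 0.2, 0.1],
}

print(top_k([1.0, 0.0, 0.0], toy_index))  # ['vacation-policy', 'holiday-calendar']
```

A vector database performs the same ranking with approximate indexes so that it scales beyond a linear scan.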
## Prerequisites

1. Google Service Account with access to the target Drive folder
2. Qdrant vector database instance
3. Required Python packages (see `requirements.txt`)
4. Groq API key (set in the `.env` file)

## Setup Instructions

### 1. Configure Environment

- Set up Google Drive authentication:
  - Follow the guide at: https://python.langchain.com/docs/integrations/document_loaders/google_drive/
  - Create a service account and download credentials
  - Share your target Google Drive folder with the service account email

- Configure API Keys:
  - Create a `.env` file in the project root
  - Add your Groq API key: `GROQ_API_KEY=your_api_key_here`

- Update `constant.py` with your configuration:
  - Set `SERVICE_ACCOUNT_PATH` to the path of your Google service account credentials file
  - Set `INPUT_FOLDER_ID` to your Google Drive folder ID
  - Configure other settings as needed

### 2. Start Required Services

#### Qdrant Vector Database
```bash
docker run -p 6333:6333 -p 6334:6334 qdrant/qdrant
```

#### Pebblo Server
```bash
pip install pebblo
pebblo --config-file config.yaml
```

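Before running the application, it can help to confirm that Qdrant is actually listening. The following is a minimal sketch using only the standard library; `service_is_up` is a hypothetical helper, not part of this demo, and the URL assumes the port mapping from the `docker run` command above.

```python
import urllib.error
import urllib.request


def service_is_up(url: str, timeout: float = 2.0) -> bool:
    """Return True if an HTTP GET to `url` succeeds with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error status, or malformed URL.
        return False


if __name__ == "__main__":
    # Port 6333 is the HTTP port published by the docker command above.
    print("Qdrant reachable:", service_is_up("http://localhost:6333"))
```

The same helper could be pointed at the Pebblo server once you know the host and port from your `config.yaml`.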
### 3. Run the Application

```bash
python pebblo_opensource_saferag.py
```

## Features

- **Secure Document Ingestion**: Semantic filtering during document loading
- **Identity-Based Access Control**: User-level permissions and authentication
- **Content Filtering**: Topic and entity-based content filtering
- **Interactive Interface**: User-friendly query interface
- **Real-time Search**: Efficient semantic search and retrieval
- **Privacy-Focused**: Local embeddings and secure data handling

## Project Structure

```
pebblo_google_drive_opensource/
├── pebblo_opensource_saferag.py  # Main application file
├── constant.py                   # Configuration settings
├── utils.py                      # Utility functions
├── google_auth.py                # Google authentication utilities
├── .env                          # Environment variables
└── README.md                     # This file
```

## Security and Privacy

This implementation prioritizes security and privacy while maintaining high-quality retrieval and generation capabilities. Key security features include:

- Semantic filtering of sensitive content
- Identity-based access control
- Local embedding generation
- Secure API key management
- Privacy-preserving document processing

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
import os

from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

# Model, application, and vector store settings
LLM_NAME = "llama-3.3-70b-versatile"
LOADER_APP_NAME = "py_data_demo_loader"
RETRIEVAL_APP_NAME = "py_data_demo_retriever"
COLLECTION_NAME = "py_data_demo_collection"
EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
VECTOR_DB_URL = "http://localhost:6333"
GROQ_API_KEY = os.getenv("GROQ_API_KEY")  # None if not set in .env

# Fill in the values below before running the demo
SERVICE_ACCOUNT_PATH = ""  # Google service account credentials
KEY_PATH = ""
INPUT_FOLDER_ID = ""  # Google Drive folder ID
INGESTION_USER_EMAIL_ADDRESS = ""
TOKEN_PATH = ""
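
Because several of these settings default to empty strings and `GROQ_API_KEY` is `None` when the `.env` entry is absent, a startup check can catch an incomplete configuration early. The sketch below is an assumption about how such a check might look; `find_missing_settings` is a hypothetical helper that does not exist in this commit, and the example values are placeholders.

```python
from typing import Dict, List


def find_missing_settings(settings: Dict[str, object]) -> List[str]:
    """Return the names of settings that are None or blank strings."""
    missing = []
    for name, value in settings.items():
        if value is None or (isinstance(value, str) and not value.strip()):
            missing.append(name)
    return missing


# Example values only; in the demo these would come from constant.py.
example_settings = {
    "GROQ_API_KEY": None,
    "SERVICE_ACCOUNT_PATH": "",
    "INPUT_FOLDER_ID": "example-folder-id",
    "VECTOR_DB_URL": "http://localhost:6333",
}

print(find_missing_settings(example_settings))  # ['GROQ_API_KEY', 'SERVICE_ACCOUNT_PATH']
```

Calling this once at the top of the main script and exiting when the list is non-empty would give a clearer error than a failure deep inside a loader or API client.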
Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
from google_auth_oauthlib.flow import InstalledAppFlow

# Define the API scopes you need:
SCOPES = ["https://www.googleapis.com/auth/drive.readonly"]  # Example

CLIENT_SECRETS_FILE = "<Enter file name>"  # Replace with your credentials file
TOKEN_FILE = "<Enter output file name>"  # Replace with the file to save the token to


def main():
    flow = InstalledAppFlow.from_client_secrets_file(CLIENT_SECRETS_FILE, SCOPES)
    creds = flow.run_local_server(port=0)  # Opens a browser for auth
    # Save the credentials to a file
    with open(TOKEN_FILE, "w") as token:
        token.write(creds.to_json())
    print(f"Token saved to {TOKEN_FILE}")


if __name__ == "__main__":
    main()
Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
from typing import List

from google.oauth2 import service_account
from googleapiclient.discovery import build


def get_authorized_identities(
    admin_user_email_address: str, credentials_file_path: str, user_email: str
) -> List[str]:
    """
    Get authorized identities from Google Directory API.

    Returns the user's email followed by the email of each group the user
    belongs to. If the group lookup fails, only the user's email is returned.
    """
    _authorized_identities = [user_email]
    print(
        f"User: {user_email}, \nAdmin user: {admin_user_email_address}\n"
        f"Credentials file: {credentials_file_path}"
    )
    # Act on behalf of the admin user when querying the Directory API
    credentials = service_account.Credentials.from_service_account_file(
        credentials_file_path,
        scopes=[
            "https://www.googleapis.com/auth/admin.directory.group.readonly",
            "https://www.googleapis.com/auth/admin.directory.group",
        ],
        subject=admin_user_email_address,
    )
    directory_service = build("admin", "directory_v1", credentials=credentials)

    try:
        # Only the first page of results is read
        groups = directory_service.groups().list(userKey=user_email).execute()
        for group in groups.get("groups", []):
            group_email = group["email"]
            _authorized_identities.append(group_email)
    except Exception as e:
        print(f"Error fetching groups for {user_email}: {e}")
    print(f"User: {user_email}, \nAuthorized Identities: {_authorized_identities}\n")
    return _authorized_identities
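
The list returned above (the user's email plus group emails) is what identity-based access control compares against each document's permissions. As a rough illustration of that comparison, the sketch below keeps only documents whose allowed identities overlap the user's. `filter_by_identity`, the dictionary layout, and the `authorized_identities` key are assumptions made for this example; in the demo the enforcement is handled by Pebblo, not by code like this.

```python
from typing import Dict, Iterable, List


def filter_by_identity(
    documents: Iterable[Dict], user_identities: Iterable[str]
) -> List[Dict]:
    """Keep documents whose allowed identities overlap the user's identities."""
    allowed = {identity.lower() for identity in user_identities}
    visible = []
    for doc in documents:
        # "authorized_identities" is a hypothetical metadata key for this sketch.
        doc_identities = {i.lower() for i in doc.get("authorized_identities", [])}
        if allowed & doc_identities:
            visible.append(doc)
    return visible


docs = [
    {"id": "a", "authorized_identities": ["eng@example.com"]},
    {"id": "b", "authorized_identities": ["hr@example.com"]},
    {"id": "c", "authorized_identities": ["alice@example.com"]},
]
# Same shape as the list from get_authorized_identities: user first, then groups.
identities = ["alice@example.com", "eng@example.com"]

print([d["id"] for d in filter_by_identity(docs, identities)])  # ['a', 'c']
```

Documents with no identity metadata are excluded here, which is the conservative choice for an access check.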
