Commit 922e4b7

adds Google Drive backend for storing datasets as sheets (#2138)

This pull request introduces a new Google Drive backend for the Ragas framework, enabling cloud-based storage and collaboration for datasets and experiments. It includes documentation, examples, and integration into the package's backend registry.

Signed-off-by: Derek Anderson <[email protected]>

1 parent 677808e commit 922e4b7

File tree

9 files changed: +1359 -1 lines changed

Lines changed: 235 additions & 0 deletions

@@ -0,0 +1,235 @@
# Google Drive Backend for Ragas

The Google Drive backend allows you to store Ragas datasets and experiments in Google Sheets within your Google Drive. This provides a cloud-based, collaborative storage solution that is familiar to many users.

## Features

- **Cloud Storage**: Store your datasets and experiments in Google Drive
- **Collaborative**: Share and collaborate on datasets using Google Drive's sharing features
- **Google Sheets Format**: Data is stored in Google Sheets for easy viewing and editing
- **Automatic Structure**: Creates an organized folder structure (`datasets/` and `experiments/`)
- **Type Preservation**: Attempts to preserve basic data types (strings, numbers)
- **Multiple Authentication**: Supports both OAuth and Service Account authentication
## Installation

```bash
# Install with Google Drive dependencies
pip install "ragas_experimental[gdrive]"
```
## Setup

### 1. Google Cloud Project Setup

1. Go to the [Google Cloud Console](https://console.cloud.google.com/)
2. Create a new project or select an existing one
3. Enable the following APIs:
   - Google Drive API
   - Google Sheets API
### 2. Authentication Setup

Choose one of two authentication methods:

#### Option A: Service Account (Recommended)

1. In Google Cloud Console, go to "Credentials"
2. Click "Create Credentials" → "Service account"
3. Create the service account and download the JSON key file
4. Share your Google Drive folder with the service account email (the snippet below shows how to read that email from the key file)

*This is the preferred method, as it works well for both scripts and production environments without requiring user interaction.*
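For step 4 you need the service account's email address. Rather than hunting for it in the console, you can read it straight out of the downloaded key file, which contains a `client_email` field (the file path here is an example):

```python
import json

# Print the service account email so you can share your Drive folder with it.
with open("service_account.json") as f:
    key = json.load(f)

print(key["client_email"])  # looks like <name>@<project>.iam.gserviceaccount.com
```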
#### Option B: OAuth 2.0 (Alternative for Interactive Use)

1. In Google Cloud Console, go to "Credentials"
2. Click "Create Credentials" → "OAuth client ID"
3. Choose "Desktop application"
4. Download the JSON file (save as `credentials.json`)
### 3. Google Drive Folder Setup

1. Create a folder in Google Drive for your Ragas data
2. Get the folder ID from the URL: `https://drive.google.com/drive/folders/FOLDER_ID_HERE` (a helper for extracting it is sketched below)
3. If using a Service Account, share the folder with the service account email
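If you would rather not copy the ID by hand, a helper like the one below can pull it out of the folder URL. This is a standard-library sketch; `extract_folder_id` is not part of the package:

```python
import re


def extract_folder_id(url: str) -> str:
    """Extract the folder ID from a URL like
    https://drive.google.com/drive/folders/FOLDER_ID_HERE."""
    match = re.search(r"/folders/([A-Za-z0-9_-]+)", url)
    if match is None:
        raise ValueError(f"No folder ID found in URL: {url}")
    return match.group(1)


folder_id = extract_folder_id(
    "https://drive.google.com/drive/folders/1abc123XYZ"
)
print(folder_id)  # 1abc123XYZ
```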
## Usage

### Basic Usage

```python
from ragas_experimental.dataset import Dataset
from pydantic import BaseModel


# Define your data model
class EvaluationRecord(BaseModel):
    question: str
    answer: str
    score: float


# Create dataset with Google Drive backend
dataset = Dataset(
    name="my_evaluation",
    backend="gdrive",
    data_model=EvaluationRecord,
    folder_id="your_google_drive_folder_id",
    credentials_path="path/to/credentials.json"
)

# Add data
record = EvaluationRecord(
    question="What is AI?",
    answer="Artificial Intelligence",
    score=0.95
)
dataset.append(record)

# Save to Google Drive
dataset.save()

# Load from Google Drive
dataset.load()
```
### Authentication Options

#### Using Environment Variables

```bash
export GDRIVE_FOLDER_ID="your_folder_id"
export GDRIVE_CREDENTIALS_PATH="path/to/credentials.json"
# OR for service account:
export GDRIVE_SERVICE_ACCOUNT_PATH="path/to/service_account.json"
```

```python
import os

# Environment variables will be used automatically
dataset = Dataset(
    name="my_evaluation",
    backend="gdrive",
    data_model=EvaluationRecord,
    folder_id=os.getenv("GDRIVE_FOLDER_ID")
)
```
#### Using Service Account

```python
dataset = Dataset(
    name="my_evaluation",
    backend="gdrive",
    data_model=EvaluationRecord,
    folder_id="your_folder_id",
    service_account_path="path/to/service_account.json"
)
```

#### Custom Token Path

```python
dataset = Dataset(
    name="my_evaluation",
    backend="gdrive",
    data_model=EvaluationRecord,
    folder_id="your_folder_id",
    credentials_path="path/to/credentials.json",
    token_path="custom_token.json"
)
```
## File Structure

The backend creates the following structure in your Google Drive folder:

```text
Your Google Drive Folder/
├── datasets/
│   ├── dataset1.gsheet
│   ├── dataset2.gsheet
│   └── ...
└── experiments/
    ├── experiment1.gsheet
    ├── experiment2.gsheet
    └── ...
```

Each dataset/experiment is stored as a separate Google Sheet with:

- Column headers matching your data model fields
- Automatic type conversion for basic types (int, float, string)
- JSON serialization for complex objects (illustrated below)
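As a rough illustration of that serialization (a sketch of the data shapes, not of backend internals; `RecordWithMetadata` is an invented example model), a non-scalar field would land in its cell as a JSON string:

```python
import json

from pydantic import BaseModel


class RecordWithMetadata(BaseModel):
    question: str
    score: float
    metadata: dict  # complex field: stored as a JSON string in the sheet


record = RecordWithMetadata(
    question="What is AI?",
    score=0.95,
    metadata={"model": "gpt-4", "run": 3},
)

# "question" and "score" map to cells directly; "metadata" would be stored
# as the JSON string printed below.
print(json.dumps(record.metadata))  # {"model": "gpt-4", "run": 3}
```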
## Environment Variables

| Variable | Description | Example |
|----------|-------------|---------|
| `GDRIVE_FOLDER_ID` | Google Drive folder ID | `1abc123...` |
| `GDRIVE_CREDENTIALS_PATH` | Path to OAuth credentials JSON | `./credentials.json` |
| `GDRIVE_SERVICE_ACCOUNT_PATH` | Path to service account JSON | `./service_account.json` |
| `GDRIVE_TOKEN_PATH` | Path to store OAuth token | `./token.json` |
## Best Practices

### Security

- Never commit credential files to version control
- Use environment variables for sensitive information
- Regularly rotate service account keys
- Use OAuth for development, service accounts for production

### Performance

- The Google Sheets API is rate-limited, so avoid frequent saves with large datasets
- Consider batching operations when possible (see the sketch after this list)
- Use appropriate folder organization for large numbers of datasets
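Because `dataset.save()` rewrites the whole sheet, the simplest batching pattern is to append records in memory and make a single save at the end. A minimal sketch, reusing the `EvaluationRecord` model and `dataset` from the Basic Usage example, and assuming `append` only stages records locally (as the append example in this PR suggests):

```python
# Batch pattern: only the final save() touches the Sheets API.
# Calling save() inside the loop would make one API write per record.
records = [
    EvaluationRecord(question=f"Q{i}?", answer=f"A{i}", score=1.0)
    for i in range(100)
]

for record in records:
    dataset.append(record)  # staged locally

dataset.save()  # single write to Google Sheets
```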
### Collaboration

- Share folders with appropriate permissions (view/edit)
- Use descriptive dataset names
- Document your data models clearly
## Troubleshooting

### Common Issues

1. **"Folder not found" error**
   - Verify the folder ID is correct
   - Ensure the folder is shared with your service account (if using one)
   - Check that the folder exists and is accessible

2. **Authentication errors**
   - Verify credential file paths are correct
   - Check that required APIs are enabled in Google Cloud Console
   - For OAuth: delete the token file and re-authenticate (see the snippet after this list)
   - For Service Account: verify the JSON file is valid

3. **Permission errors**
   - Ensure your account has edit access to the folder
   - For service accounts: share the folder with the service account email
   - Check Google Drive sharing settings

4. **Import errors**
   - Install dependencies: `pip install "ragas_experimental[gdrive]"`
   - Verify all required packages are installed
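For the OAuth case above, clearing the cached token to force re-authentication is a one-liner (assuming the default `token.json` path):

```python
from pathlib import Path

# Delete the cached OAuth token; the next run will open the consent flow again.
Path("token.json").unlink(missing_ok=True)
```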
### Getting Help

If you encounter issues:

1. Check error messages carefully for specific details
2. Verify your Google Cloud project setup
3. Test with a simple example first
4. Check the Google Drive API documentation for rate limits
## Limitations

- Google Sheets has a limit of 10 million cells per spreadsheet
- Complex nested objects are JSON-serialized as strings
- API rate limits may affect performance with large datasets
- Requires an internet connection for all operations
## Examples

See `examples/gdrive_backend_example.py` for a complete working example.
Lines changed: 163 additions & 0 deletions
@@ -0,0 +1,163 @@
"""Example showing how to append data to an existing Google Drive dataset.

This demonstrates the proper pattern for adding data to existing datasets
while preserving the existing records.
"""

from pydantic import BaseModel

from ragas_experimental.dataset import Dataset


# Example data model
class EvaluationRecord(BaseModel):
    question: str
    answer: str
    context: str
    score: float
    feedback: str


def append_to_existing_dataset():
    """Example of appending to an existing dataset."""

    folder_id = "folder_id_here"  # Replace with your actual Google Drive folder ID

    # Option 1: Load existing dataset and add more data
    print("=== Appending to Existing Dataset ===")

    try:
        # Try to load the existing dataset
        dataset = Dataset.load(
            name="evaluation_results",
            backend="gdrive",
            data_model=EvaluationRecord,
            folder_id=folder_id,
            credentials_path="credentials.json",
            token_path="token.json",
        )
        print(f"Loaded existing dataset with {len(dataset)} records")

    except FileNotFoundError:
        # Dataset doesn't exist, create a new one
        print("Dataset doesn't exist, creating new one")
        dataset = Dataset(
            name="evaluation_results",
            backend="gdrive",
            data_model=EvaluationRecord,
            folder_id=folder_id,
            credentials_path="credentials.json",
            token_path="token.json",
        )

    # Show existing records
    print("Existing records:")
    for i, record in enumerate(dataset):
        print(f"  {i + 1}. {record.question}")

    # Add new records
    new_records = [
        EvaluationRecord(
            question="What is the largest planet in our solar system?",
            answer="Jupiter",
            context="Solar system knowledge question.",
            score=0.9,
            feedback="Correct answer",
        ),
        EvaluationRecord(
            question="Who painted the Mona Lisa?",
            answer="Leonardo da Vinci",
            context="Art history question.",
            score=1.0,
            feedback="Perfect answer",
        ),
    ]

    # Append new records
    for record in new_records:
        dataset.append(record)

    print(f"\nAdded {len(new_records)} new records")

    # Save the updated dataset (this replaces the sheet with all records)
    dataset.save()
    print(f"Saved updated dataset with {len(dataset)} total records")

    # Verify by listing all records
    print("\nAll records in dataset:")
    for i, record in enumerate(dataset):
        print(f"  {i + 1}. {record.question} -> {record.answer}")

    return dataset


def create_multiple_datasets():
    """Example of creating separate datasets instead of appending."""

    folder_id = "folder_id_here"  # Replace with your actual Google Drive folder ID

    print("\n=== Creating Multiple Datasets ===")

    # Create different datasets for different evaluation runs
    datasets = {}

    for run_name, data in [
        (
            "basic_qa",
            [
                EvaluationRecord(
                    question="What is 1+1?",
                    answer="Two",
                    context="Basic math",
                    score=1.0,
                    feedback="Correct",
                )
            ],
        ),
        (
            "advanced_qa",
            [
                EvaluationRecord(
                    question="Explain quantum entanglement",
                    answer="Quantum entanglement is a phenomenon...",
                    context="Advanced physics",
                    score=0.8,
                    feedback="Good explanation",
                )
            ],
        ),
    ]:
        dataset = Dataset(
            name=f"evaluation_{run_name}",
            backend="gdrive",
            data_model=EvaluationRecord,
            folder_id=folder_id,
            credentials_path="credentials.json",
            token_path="token.json",
        )

        for record in data:
            dataset.append(record)

        dataset.save()
        datasets[run_name] = dataset
        print(f"Created dataset '{run_name}' with {len(dataset)} records")

    # List all datasets via the backend of any dataset instance
    available_datasets = list(datasets.values())[0].backend.list_datasets()
    print(f"\nAll available datasets: {available_datasets}")

    return datasets


if __name__ == "__main__":
    try:
        # Method 1: Append to an existing dataset
        dataset = append_to_existing_dataset()

        # Method 2: Create separate datasets
        datasets = create_multiple_datasets()

        print("\n✅ Append operations completed successfully!")
        print("\nKey points:")
        print("- dataset.save() replaces the entire sheet (this is the intended behavior)")
        print("- To append: load existing data, add new records, then save")
        print("- For different evaluation runs, consider separate datasets")

    except Exception as e:
        print(f"Error: {e}")
        import traceback

        traceback.print_exc()
