Commit 41e9b33

maxpill, Copilot, pocucan-ds, and mhordynski authored
feat: add PPTX document parser (#693)
Co-authored-by: Copilot <[email protected]> Co-authored-by: pocucan-ds <[email protected]> Co-authored-by: Mateusz Hordyński <[email protected]>
1 parent a80813a commit 41e9b33

18 files changed: +750 -40 lines changed

.github/workflows/shared-packages.yml

Lines changed: 1 addition & 0 deletions
@@ -184,6 +184,7 @@ jobs:
         env:
           GOOGLE_DRIVE_CLIENTID_JSON: ${{ secrets.GOOGLE_DRIVE_CLIENTID_JSON }}
           GOOGLE_SOURCE_UNIT_TEST_FOLDER: ${{ secrets.GOOGLE_SOURCE_UNIT_TEST_FOLDER }}
+          GOOGLE_DRIVE_TARGET_EMAIL: ${{ secrets.GOOGLE_DRIVE_TARGET_EMAIL }}

       - name: Test Report
         uses: mikepenz/action-junit-report@v4

docs/how-to/sources/google-drive.md

Lines changed: 37 additions & 0 deletions
@@ -187,6 +187,43 @@ async def process_drive_documents():
 asyncio.run(process_drive_documents())
 ```

+## Impersonating Google Accounts
+
+You can configure your Google service account to impersonate other users in your Google Workspace domain. This is useful when you need to access files or perform actions on behalf of specific users.
+
+### Step 1: Enable Domain-Wide Delegation
+
+1. **Sign in to the [Google Admin Console](https://admin.google.com/) as a Super Admin.**
+2. Navigate to:
+   **Security > Access and data control > API controls > MANAGE DOMAIN WIDE DELEGATION**
+3. Add a new API client or edit an existing one, and include the following OAuth scopes:
+   - `https://www.googleapis.com/auth/cloud-platform`
+   - `https://www.googleapis.com/auth/drive`
+4. Click **Authorize** or **Save** to apply the changes.
+
+### Step 2: Impersonate a User in Your Code
+
+After configuring domain-wide delegation, you can specify a target user to impersonate when using the `GoogleDriveSource` in your code.
+
+```python
+from ragbits.core.sources.google_drive import GoogleDriveSource
+
+target_email = "[email protected]"
+credentials_file = "service-account-key.json"
+
+# Set the path to your service account key file
+GoogleDriveSource.set_credentials_file_path(credentials_file)
+
+# Set the email address of the user to impersonate
+GoogleDriveSource.set_impersonation_target(target_email)
+```
+
+**Note:**
+- The `target_email` must be a valid user in your Google Workspace domain.
+- Ensure your service account has been granted domain-wide delegation as described above.
+
+This setup allows your service account to act on behalf of the specified user, enabling access to their Google Drive files and resources as permitted by the assigned scopes.
+
 ## Troubleshooting

 ### Common Issues
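
Reviewer sketch (not part of the commit): the documented impersonation flow comes down to adding a `subject` and a wider scope list to the keyword arguments that `google_drive.py` passes to `service_account.Credentials.from_service_account_file(**cred_kwargs)`. The helper name `build_credential_kwargs` and the address `alice@example.com` are hypothetical, introduced only for illustration. The scope lists and the minimal email check are copied from this diff.

```python
from __future__ import annotations

from typing import Any

# Scope lists as they appear in this commit's google_drive.py.
SCOPES = ["https://www.googleapis.com/auth/drive"]
IMPERSONATION_SCOPES = [
    "https://www.googleapis.com/auth/cloud-platform",
    "https://www.googleapis.com/auth/drive",
]


def build_credential_kwargs(credentials_path: str, target_email: str | None = None) -> dict[str, Any]:
    """Hypothetical helper: assemble the kwargs for service-account credentials."""
    kwargs: dict[str, Any] = {"filename": credentials_path, "scopes": SCOPES}
    if target_email is not None:
        # Same minimal check as the commit: non-empty and containing "@".
        if not target_email or "@" not in target_email:
            raise ValueError("Invalid email address provided for impersonation.")
        # "subject" names the Workspace user the service account acts as.
        kwargs["subject"] = target_email
        kwargs["scopes"] = IMPERSONATION_SCOPES
    return kwargs


if __name__ == "__main__":
    print(build_credential_kwargs("service-account-key.json", "alice@example.com"))
```

Without a target email the dictionary carries only the Drive scope, which matches the behavior before this commit.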

packages/ragbits-core/CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -28,6 +28,7 @@
 - Add LLM Usage to LLMResponseWithMetadata (#700)
 - Split usage per model type (#715)
 - Add support for batch generation (#608)
+- Added Google Drive support for impersonation and presentation-to-pdf (#724)
 - Introduce new API for attachments in prompts (#711)
 - Fix issue with trying to store duplicated entries in Vector Stores (#762)

packages/ragbits-core/src/ragbits/core/sources/google_drive.py

Lines changed: 104 additions & 26 deletions
@@ -1,6 +1,7 @@
-import os  # Import os for path joining and potential directory checks
+import os
 from collections.abc import Iterable
 from contextlib import suppress
+from enum import Enum
 from pathlib import Path
 from typing import Any, ClassVar

@@ -20,38 +21,61 @@

 _SCOPES = ["https://www.googleapis.com/auth/drive"]

+# Scopes that the service account is delegated for in the Google Workspace Admin Console.
+_IMPERSONATION_SCOPES = [
+    "https://www.googleapis.com/auth/cloud-platform",  # General Cloud access (if needed)
+    "https://www.googleapis.com/auth/drive",  # Example: For Google Drive API
+]
+
 # HTTP status codes
 _HTTP_NOT_FOUND = 404
 _HTTP_FORBIDDEN = 403

+
+class GoogleDriveExportFormat(str, Enum):
+    """Supported export MIME types for Google Drive downloads."""
+
+    DOCX = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
+    XLSX = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
+    PPTX = "application/vnd.openxmlformats-officedocument.presentationml.presentation"
+    PDF = "application/pdf"
+    PNG = "image/png"
+    HTML = "text/html"
+    TXT = "text/plain"
+    JSON = "application/json"
+
+
 # Maps Google-native Drive MIME types → export MIME types
-_GOOGLE_EXPORT_MIME_MAP = {
-    "application/vnd.google-apps.document": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",  # noqa: E501
-    "application/vnd.google-apps.spreadsheet": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",  # noqa: E501
-    "application/vnd.google-apps.presentation": "application/vnd.openxmlformats-officedocument.presentationml.presentation",  # noqa: E501
-    "application/vnd.google-apps.drawing": "image/png",
-    "application/vnd.google-apps.script": "application/vnd.google-apps.script+json",
-    "application/vnd.google-apps.site": "text/html",
-    "application/vnd.google-apps.map": "application/json",
-    "application/vnd.google-apps.form": "application/pdf",
+_GOOGLE_EXPORT_MIME_MAP: dict[str, GoogleDriveExportFormat] = {
+    "application/vnd.google-apps.document": GoogleDriveExportFormat.DOCX,
+    "application/vnd.google-apps.spreadsheet": GoogleDriveExportFormat.XLSX,
+    "application/vnd.google-apps.presentation": GoogleDriveExportFormat.PDF,
+    "application/vnd.google-apps.drawing": GoogleDriveExportFormat.PNG,
+    "application/vnd.google-apps.script": GoogleDriveExportFormat.JSON,
+    "application/vnd.google-apps.site": GoogleDriveExportFormat.HTML,
+    "application/vnd.google-apps.map": GoogleDriveExportFormat.JSON,
+    "application/vnd.google-apps.form": GoogleDriveExportFormat.PDF,
 }

 # Maps export MIME types → file extensions
-_EXPORT_EXTENSION_MAP = {
-    "application/vnd.openxmlformats-officedocument.wordprocessingml.document": ".docx",
-    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": ".xlsx",
-    "application/vnd.openxmlformats-officedocument.presentationml.presentation": ".pptx",
-    "image/png": ".png",
-    "application/pdf": ".pdf",
-    "text/html": ".html",
-    "text/plain": ".txt",
-    "application/json": ".json",
+_EXPORT_EXTENSION_MAP: dict[GoogleDriveExportFormat, str] = {
+    GoogleDriveExportFormat.DOCX: ".docx",
+    GoogleDriveExportFormat.XLSX: ".xlsx",
+    GoogleDriveExportFormat.PPTX: ".pptx",
+    GoogleDriveExportFormat.PNG: ".png",
+    GoogleDriveExportFormat.PDF: ".pdf",
+    GoogleDriveExportFormat.HTML: ".html",
+    GoogleDriveExportFormat.TXT: ".txt",
+    GoogleDriveExportFormat.JSON: ".json",
 }


 class GoogleDriveSource(Source):
     """
     Handles source connection for Google Drive and provides methods to fetch files.
+
+    NOTE(Do not define variables at class level that you pass to google client, define them at instance level, or else
+    google client will complain.):
     """

     file_id: str
@@ -62,12 +86,31 @@ class GoogleDriveSource(Source):

     _google_drive_client: ClassVar["GoogleAPIResource | None"] = None
     _credentials_file_path: ClassVar[str | None] = None
+    impersonate: ClassVar[bool | None] = None
+    impersonate_target_email: ClassVar[str | None] = None

     @classmethod
     def set_credentials_file_path(cls, path: str) -> None:
         """Set the path to the service account credentials file."""
         cls._credentials_file_path = path

+    @classmethod
+    def set_impersonation_target(cls, target_mail: str) -> None:
+        """
+        Sets the email address to impersonate when accessing Google Drive resources.
+
+        Args:
+            target_mail (str): The email address to impersonate.
+
+        Raises:
+            ValueError: If the provided email address is invalid (empty or missing '@').
+        """
+        # check if email is a valid email.
+        if not target_mail or "@" not in target_mail:
+            raise ValueError("Invalid email address provided for impersonation.")
+        cls.impersonate = True
+        cls.impersonate_target_email = target_mail
+
     @classmethod
     def _initialize_client_from_creds(cls) -> None:
         """
@@ -82,7 +125,20 @@ def _initialize_client_from_creds(cls) -> None:
             HttpError: If the Google Drive API is not enabled or accessible.
             Exception: If any other error occurs during client initialization.
         """
-        creds = service_account.Credentials.from_service_account_file(cls._credentials_file_path, scopes=_SCOPES)
+        cred_kwargs = {
+            "filename": cls._credentials_file_path,
+            "scopes": _SCOPES,
+        }
+
+        # handle impersonation
+        if cls.impersonate is not None and cls.impersonate:
+            if not cls.impersonate_target_email:
+                raise ValueError("Impersonation target email must be set when impersonation is enabled.")
+            cred_kwargs["subject"] = cls.impersonate_target_email
+            cred_kwargs["scopes"] = _IMPERSONATION_SCOPES
+
+        creds = service_account.Credentials.from_service_account_file(**cred_kwargs)
+
         cls._google_drive_client = build("drive", "v3", credentials=creds)
         cls._google_drive_client.files().list(
             pageSize=1, fields="files(id)", supportsAllDrives=True, includeItemsFromAllDrives=True
@@ -162,7 +218,11 @@ def verify_drive_api_enabled(cls) -> None:

     @traceable
     @requires_dependencies(["googleapiclient"], "google_drive")
-    async def fetch(self) -> Path:
+    async def fetch(
+        self,
+        *,
+        export_format: "GoogleDriveExportFormat | None" = None,
+    ) -> Path:
         """
         Fetch the file from Google Drive and store it locally.

@@ -171,6 +231,9 @@ async def fetch(self) -> Path:
         The local directory is determined by the environment variable `LOCAL_STORAGE_DIR`. If this environment
         variable is not set, a temporary directory is used.

+        Args:
+            export_format: Optional override for the export MIME type when downloading Google-native documents.
+
         Returns:
             The local path to the downloaded file.

@@ -186,7 +249,8 @@ async def fetch(self) -> Path:
         file_local_dir = local_dir / self.file_id
         file_local_dir.mkdir(parents=True, exist_ok=True)

-        export_mime_type, file_extension = self._determine_file_extension()
+        override_mime = export_format.value if export_format else None
+        export_mime_type, file_extension = self._determine_file_extension(override_mime=override_mime)
         local_file_name = f"{self.file_name}{file_extension}"
         path = file_local_dir / local_file_name

@@ -496,22 +560,36 @@ async def from_uri(cls, path: str) -> Iterable[Self]:
         else:
             raise ValueError(f"Unsupported Google Drive URI pattern: {path}")

-    def _determine_file_extension(self) -> tuple[str, str]:
+    def _determine_file_extension(self, override_mime: str | None = None) -> tuple[str, str]:
         """
         Determine the appropriate file extension and export MIME type for the file.

         Returns:
             A tuple of (export_mime_type, file_extension)
         """
+        if override_mime is not None:
+            export_mime_type = override_mime
+            try:
+                export_format = GoogleDriveExportFormat(override_mime)
+                file_extension = _EXPORT_EXTENSION_MAP.get(export_format, ".bin")
+            except ValueError:
+                file_extension = Path(self.file_name).suffix if "." in self.file_name else ".bin"
+            return export_mime_type, file_extension
+
         export_mime_type = self.mime_type
         file_extension = ""

         if self.mime_type.startswith("application/vnd.google-apps"):
-            export_mime_type = _GOOGLE_EXPORT_MIME_MAP.get(self.mime_type, "application/pdf")
-            file_extension = _EXPORT_EXTENSION_MAP.get(export_mime_type, ".bin")
+            export_format = _GOOGLE_EXPORT_MIME_MAP.get(self.mime_type, GoogleDriveExportFormat.PDF)
+            export_mime_type = export_format.value
+            file_extension = _EXPORT_EXTENSION_MAP.get(export_format, ".bin")
         elif "." in self.file_name:
             file_extension = Path(self.file_name).suffix
         else:
-            file_extension = _EXPORT_EXTENSION_MAP.get(self.mime_type, ".bin")
+            try:
+                export_format = GoogleDriveExportFormat(self.mime_type)
+                file_extension = _EXPORT_EXTENSION_MAP.get(export_format, ".bin")
+            except ValueError:
+                file_extension = ".bin"

         return export_mime_type, file_extension
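
Reviewer sketch (not part of the commit): the resolution order in `_determine_file_extension` can be tried in isolation with a trimmed stand-in. `ExportFormat`, `EXTENSIONS`, `NATIVE_EXPORTS`, and `resolve` are hypothetical names, and the enum below has only three members, so it is a reduced model of the logic above rather than the library's own code. In this reduced model a Google Slides file reaches PDF through the default, whereas the commit maps it to PDF explicitly.

```python
from __future__ import annotations

from enum import Enum
from pathlib import Path


class ExportFormat(str, Enum):
    """Trimmed stand-in for GoogleDriveExportFormat (three members only)."""

    DOCX = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
    PDF = "application/pdf"
    TXT = "text/plain"


EXTENSIONS = {ExportFormat.DOCX: ".docx", ExportFormat.PDF: ".pdf", ExportFormat.TXT: ".txt"}
NATIVE_EXPORTS = {"application/vnd.google-apps.document": ExportFormat.DOCX}


def resolve(file_name: str, mime_type: str, override_mime: str | None = None) -> tuple[str, str]:
    """Return (export_mime_type, file_extension) in the order the diff uses."""
    # 1. An explicit override wins over everything else.
    if override_mime is not None:
        try:
            return override_mime, EXTENSIONS.get(ExportFormat(override_mime), ".bin")
        except ValueError:
            # Unknown override: fall back to the file name's suffix, else ".bin".
            return override_mime, Path(file_name).suffix if "." in file_name else ".bin"
    # 2. Google-native types are exported; unmapped ones default to PDF.
    if mime_type.startswith("application/vnd.google-apps"):
        fmt = NATIVE_EXPORTS.get(mime_type, ExportFormat.PDF)
        return fmt.value, EXTENSIONS.get(fmt, ".bin")
    # 3. Regular files keep the suffix from their name when they have one.
    if "." in file_name:
        return mime_type, Path(file_name).suffix
    # 4. Otherwise derive the extension from the MIME type, else ".bin".
    try:
        return mime_type, EXTENSIONS.get(ExportFormat(mime_type), ".bin")
    except ValueError:
        return mime_type, ".bin"


if __name__ == "__main__":
    print(resolve("MyDoc", "application/vnd.google-apps.document"))
    print(resolve("MyDoc", "application/vnd.google-apps.document", override_mime="application/pdf"))
```

Note that an override outside the enum keeps the override MIME type but borrows the extension from the file name, so the two can disagree. That mirrors the diff and may be worth a follow-up check.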

packages/ragbits-core/tests/unit/sources/test_google_drive.py

Lines changed: 71 additions & 2 deletions
@@ -1,11 +1,11 @@
-import json  # Import json for potential validation or pretty printing
+import json
 import os
 from pathlib import Path

 import pytest
 from googleapiclient.errors import HttpError

-from ragbits.core.sources.google_drive import GoogleDriveSource
+from ragbits.core.sources.google_drive import GoogleDriveExportFormat, GoogleDriveSource


 @pytest.fixture(autouse=True)
@@ -50,6 +50,58 @@ def setup_local_storage_dir(tmp_path: Path):
         del os.environ["LOCAL_STORAGE_DIR"]


+@pytest.mark.asyncio
+async def test_google_drive_impersonate():
+    """Test service account impersonation with better error handling."""
+    target_email = os.environ.get("GOOGLE_DRIVE_TARGET_EMAIL")
+    credentials_file = "test_clientid.json"
+
+    GoogleDriveSource.set_credentials_file_path(credentials_file)
+
+    if target_email is None:
+        pytest.skip("GOOGLE_DRIVE_TARGET_EMAIL environment variable not set")
+
+    GoogleDriveSource.set_impersonation_target(target_email)
+
+    unit_test_folder_id = os.environ.get("GOOGLE_SOURCE_UNIT_TEST_FOLDER")
+
+    if unit_test_folder_id is None:
+        pytest.skip("GOOGLE_SOURCE_UNIT_TEST_FOLDER environment variable not set")
+
+    sources_to_download = await GoogleDriveSource.from_uri(f"{unit_test_folder_id}/**")
+    downloaded_count = 0
+
+    try:
+        # Iterate through each source (file or folder) found
+        for source in sources_to_download:
+            # Only attempt to fetch files, as folders cannot be "downloaded" in the same way
+            if not source.is_folder:
+                try:
+                    # Attempt to fetch (download) the file.
+                    local_path = await source.fetch()
+                    print(f" Downloaded: '{source.file_name}' (ID: {source.file_id}) to '{local_path}'")
+                    downloaded_count += 1
+                except HttpError as e:
+                    # Catch Google API specific HTTP errors (e.g., permission denied, file not found)
+                    print(f" Google API Error downloading '{source.file_name}' (ID: {source.file_id}): {e}")
+                except Exception as e:
+                    # Catch any other general exceptions during the download process
+                    print(f" Failed to download '{source.file_name}' (ID: {source.file_id}): {e}")
+            else:
+                print(f" Skipping folder: '{source.file_name}' (ID: {source.file_id})")
+
+    except Exception as e:
+        # Catch any exceptions that occur during the initial setup or `from_uri` call
+        print(f"An error occurred during test setup or source retrieval: {e}")
+
+    finally:
+        # This block ensures the final summary is printed regardless of errors
+        print(f"\n--- Successfully downloaded {downloaded_count} files from '{unit_test_folder_id}' ---")
+        # Assert that at least one file was downloaded if that's an expectation for the test
+        # If no files are expected, or it's acceptable for 0 files to be downloaded, remove or adjust this assertion.
+        assert downloaded_count > 0, "Expected to download at least one file, but downloaded 0."
+
+
 @pytest.mark.asyncio
 async def test_google_drive_source_fetch_file_not_found():
     """Test fetching a non-existent file."""
@@ -99,6 +151,9 @@ async def test_google_drive_source_fetch_file():
     """
     unit_test_folder_id = os.environ.get("GOOGLE_SOURCE_UNIT_TEST_FOLDER")

+    if unit_test_folder_id is None:
+        pytest.skip("GOOGLE_SOURCE_UNIT_TEST_FOLDER environment variable not set")
+
     # Initialize a counter for successfully downloaded files
     downloaded_count = 0

@@ -141,3 +196,17 @@ async def test_google_drive_source_fetch_file():
         # Assert that at least one file was downloaded if that's an expectation for the test
         # If no files are expected, or it's acceptable for 0 files to be downloaded, remove or adjust this assertion.
         assert downloaded_count > 0, "Expected to download at least one file, but downloaded 0."
+
+
+def test_determine_file_extension_override():
+    """Ensure overriding export MIME type yields expected extension."""
+    src = GoogleDriveSource(
+        file_id="dummy",
+        file_name="MyDoc",
+        mime_type="application/vnd.google-apps.document",
+    )
+
+    export_mime, extension = src._determine_file_extension(override_mime=GoogleDriveExportFormat.PDF.value)
+
+    assert export_mime == GoogleDriveExportFormat.PDF.value
+    assert extension == ".pdf"
