Merged
Changes from 33 commits
40 commits
2596a48
chore: update pyproject.toml and ingestion parsers
maxpill Jul 11, 2025
6514cfa
Merge branch 'main' into 687-feat-pptx-parser
maxpill Jul 15, 2025
30b0806
Merge branch '687-feat-pptx-parser' of https://github.com/deepsense-a…
maxpill Jul 15, 2025
00ad9dc
feat(pptx): add temporary testing script and enhance PPTX parser
maxpill Jul 15, 2025
55fee9c
fix(test): cast shapes in PPTX parser test for correct type handling
maxpill Jul 15, 2025
09f56cf
Merge branch 'main' into 687-feat-pptx-parser
maxpill Jul 15, 2025
2c4c5ee
feat: add pptx document parser to changelog
maxpill Jul 15, 2025
fd5a150
Merge branch '687-feat-pptx-parser' of https://github.com/deepsense-a…
maxpill Jul 15, 2025
d5a3e28
Apply suggestions from code review
maxpill Jul 15, 2025
c7fa386
Merge branch '687-feat-pptx-parser' of https://github.com/deepsense-a…
maxpill Jul 16, 2025
da462ae
application/pdf export to mime map
pocucan-ds Jul 16, 2025
5f6b8a6
impersonation feature for client.
pocucan-ds Jul 16, 2025
310968d
feat(pptx): implement error handling and logging in PPTX parser
maxpill Jul 17, 2025
426a6ba
chore: remove temporary PPTX parser test script
maxpill Jul 17, 2025
ea41ed3
refactor(pptx): convert extractor methods to static methods
maxpill Jul 17, 2025
38ffe8b
remove impersonation at class level and define it at instance level.
pocucan-ds Jul 18, 2025
466fe37
updated tests
pocucan-ds Jul 18, 2025
5dd7fe4
impersonation same loop
pocucan-ds Jul 18, 2025
238f8fd
updated environment variables.
pocucan-ds Jul 18, 2025
4524989
impersonation support changelog
pocucan-ds Jul 18, 2025
f3e5255
Merge branch 'main' into extension/source/googledrive/impersonator
pocucan-ds Jul 18, 2025
6a65354
updated how to and formatted.
pocucan-ds Jul 18, 2025
4041158
Merge branch 'extension/source/googledrive/impersonator' of https://g…
pocucan-ds Jul 18, 2025
20df7d2
updated for ruff
pocucan-ds Jul 18, 2025
54da82b
updated signature
pocucan-ds Jul 18, 2025
a42909e
Merge branch 'main' of https://github.com/deepsense-ai/ragbits into 6…
maxpill Jul 21, 2025
da272a8
Add impersonation attributes to GoogleDriveSource and update tests fo…
maxpill Jul 21, 2025
7f095c1
Merge branch 'main' into feat/gdrive-impersonator
maxpill Jul 21, 2025
b6efc38
Merge branch 'feat/gdrive-impersonator' of https://github.com/deepsen…
maxpill Jul 21, 2025
55c3f34
Add GoogleDriveExportFormat enum and update MIME type handling in Goo…
maxpill Jul 24, 2025
6932ccf
Merge branch 'feat/gdrive-impersonator' of https://github.com/deepsen…
maxpill Jul 24, 2025
3cc602a
Merge branch 'main' into 687-feat-pptx-parser
pocucan-ds Jul 25, 2025
7ab8b52
Merge branch '687-feat-pptx-parser' of https://github.com/deepsense-a…
maxpill Jul 25, 2025
1ca84ef
refactor(pptx): streamline extraction process and enhance element cre…
maxpill Aug 1, 2025
18cd706
Merge branch '687-feat-pptx-parser' of https://github.com/deepsense-a…
maxpill Aug 7, 2025
aa16967
Merge branch 'develop' of https://github.com/deepsense-ai/ragbits int…
maxpill Aug 11, 2025
797a26e
refactor: update PPTX extractor classes to use properties for extract…
maxpill Aug 12, 2025
7b730e6
feat(pptx): implement callback architecture for PPTX document parsing
maxpill Aug 12, 2025
aec57c4
test(pptx): add integration tests for PPTX document parser
maxpill Aug 12, 2025
a1ed241
Merge branch 'develop' into 687-feat-pptx-parser
mhordynski Sep 4, 2025
1 change: 1 addition & 0 deletions .github/workflows/shared-packages.yml
@@ -200,6 +200,7 @@ jobs:
env:
GOOGLE_DRIVE_CLIENTID_JSON: ${{ secrets.GOOGLE_DRIVE_CLIENTID_JSON }}
GOOGLE_SOURCE_UNIT_TEST_FOLDER: ${{ secrets.GOOGLE_SOURCE_UNIT_TEST_FOLDER }}
GOOGLE_DRIVE_TARGET_EMAIL: ${{ secrets.GOOGLE_DRIVE_TARGET_EMAIL }}

- name: Test Report
uses: mikepenz/action-junit-report@v4
37 changes: 37 additions & 0 deletions docs/how-to/sources/google-drive.md
@@ -187,6 +187,43 @@ async def process_drive_documents():
asyncio.run(process_drive_documents())
```

## Impersonating Google Accounts

You can configure your Google service account to impersonate other users in your Google Workspace domain. This is useful when you need to access files or perform actions on behalf of specific users.

### Step 1: Enable Domain-Wide Delegation

1. **Sign in to the [Google Admin Console](https://admin.google.com/) as a Super Admin.**
2. Navigate to:
   **Security > Access and data control > API controls > MANAGE DOMAIN WIDE DELEGATION**
3. Add a new API client or edit an existing one, and include the following OAuth scopes:
   - `https://www.googleapis.com/auth/cloud-platform`
   - `https://www.googleapis.com/auth/drive`
4. Click **Authorize** or **Save** to apply the changes.

### Step 2: Impersonate a User in Your Code

After configuring domain-wide delegation, you can specify a target user to impersonate when using the `GoogleDriveSource` in your code.

```python
from ragbits.core.sources.google_drive import GoogleDriveSource

target_email = "user@example.com"  # placeholder: a user in your Workspace domain
credentials_file = "service-account-key.json"

# Set the path to your service account key file
GoogleDriveSource.set_credentials_file_path(credentials_file)

# Set the email address of the user to impersonate
GoogleDriveSource.set_impersonation_target(target_email)
```

**Note:**
- The `target_email` must be a valid user in your Google Workspace domain.
- Ensure your service account has been granted domain-wide delegation as described above.

This setup allows your service account to act on behalf of the specified user, enabling access to their Google Drive files and resources as permitted by the assigned scopes.
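
As a rough illustration of what impersonation changes: the source passes the target address as the `subject` of the service-account credentials and requests the delegated scopes instead of the default Drive scope. The helper below is a dependency-free sketch of that step. The name `build_credential_kwargs` and the two constants are assumptions made for this example and are not part of ragbits; the resulting dictionary mirrors keyword arguments accepted by `service_account.Credentials.from_service_account_file`.

```python
from __future__ import annotations

from typing import Any

# Scopes delegated to the service account in the Admin Console (Step 1).
IMPERSONATION_SCOPES = [
    "https://www.googleapis.com/auth/cloud-platform",
    "https://www.googleapis.com/auth/drive",
]
# Scope used when no impersonation target is configured.
DEFAULT_SCOPES = ["https://www.googleapis.com/auth/drive"]


def build_credential_kwargs(credentials_file: str, target_email: str | None = None) -> dict[str, Any]:
    """Hypothetical helper: assemble keyword arguments for the credentials loader."""
    kwargs: dict[str, Any] = {"filename": credentials_file, "scopes": DEFAULT_SCOPES}
    if target_email is not None:
        # Minimal check only: non-empty and contains "@".
        if not target_email or "@" not in target_email:
            raise ValueError("Invalid email address provided for impersonation.")
        kwargs["subject"] = target_email  # the user to act on behalf of
        kwargs["scopes"] = IMPERSONATION_SCOPES
    return kwargs


kwargs = build_credential_kwargs("service-account-key.json", "user@example.com")
print(kwargs["subject"], kwargs["scopes"])
```

Without a target email the dictionary carries no `subject`, so the service account acts as itself and sees only files shared with it directly.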

## Troubleshooting

### Common Issues
1 change: 1 addition & 0 deletions packages/ragbits-core/CHANGELOG.md
@@ -7,6 +7,7 @@
- Add LLM Usage to LLMResponseWithMetadata (#700)
- Split usage per model type (#715)
- Add support for batch generation (#608)
- Added Google Drive support for impersonation and presentation-to-pdf (#724)

## 1.1.0 (2025-07-09)

130 changes: 104 additions & 26 deletions packages/ragbits-core/src/ragbits/core/sources/google_drive.py
@@ -1,6 +1,7 @@
-import os  # Import os for path joining and potential directory checks
+import os
 from collections.abc import Iterable
 from contextlib import suppress
+from enum import Enum
 from pathlib import Path
 from typing import Any, ClassVar

@@ -20,38 +21,61 @@

 _SCOPES = ["https://www.googleapis.com/auth/drive"]

+# Scopes that the service account is delegated for in the Google Workspace Admin Console.
+_IMPERSONATION_SCOPES = [
+    "https://www.googleapis.com/auth/cloud-platform",  # General Cloud access (if needed)
+    "https://www.googleapis.com/auth/drive",  # Example: For Google Drive API
+]
+
 # HTTP status codes
 _HTTP_NOT_FOUND = 404
 _HTTP_FORBIDDEN = 403

+
+class GoogleDriveExportFormat(str, Enum):
+    """Supported export MIME types for Google Drive downloads."""
+
+    DOCX = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
+    XLSX = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
+    PPTX = "application/vnd.openxmlformats-officedocument.presentationml.presentation"
+    PDF = "application/pdf"
+    PNG = "image/png"
+    HTML = "text/html"
+    TXT = "text/plain"
+    JSON = "application/json"
+
 # Maps Google-native Drive MIME types → export MIME types
-_GOOGLE_EXPORT_MIME_MAP = {
-    "application/vnd.google-apps.document": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",  # noqa: E501
-    "application/vnd.google-apps.spreadsheet": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",  # noqa: E501
-    "application/vnd.google-apps.presentation": "application/vnd.openxmlformats-officedocument.presentationml.presentation",  # noqa: E501
-    "application/vnd.google-apps.drawing": "image/png",
-    "application/vnd.google-apps.script": "application/vnd.google-apps.script+json",
-    "application/vnd.google-apps.site": "text/html",
-    "application/vnd.google-apps.map": "application/json",
-    "application/vnd.google-apps.form": "application/pdf",
+_GOOGLE_EXPORT_MIME_MAP: dict[str, GoogleDriveExportFormat] = {
+    "application/vnd.google-apps.document": GoogleDriveExportFormat.DOCX,
+    "application/vnd.google-apps.spreadsheet": GoogleDriveExportFormat.XLSX,
+    "application/vnd.google-apps.presentation": GoogleDriveExportFormat.PDF,
+    "application/vnd.google-apps.drawing": GoogleDriveExportFormat.PNG,
+    "application/vnd.google-apps.script": GoogleDriveExportFormat.JSON,
+    "application/vnd.google-apps.site": GoogleDriveExportFormat.HTML,
+    "application/vnd.google-apps.map": GoogleDriveExportFormat.JSON,
+    "application/vnd.google-apps.form": GoogleDriveExportFormat.PDF,
 }

 # Maps export MIME types → file extensions
-_EXPORT_EXTENSION_MAP = {
-    "application/vnd.openxmlformats-officedocument.wordprocessingml.document": ".docx",
-    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": ".xlsx",
-    "application/vnd.openxmlformats-officedocument.presentationml.presentation": ".pptx",
-    "image/png": ".png",
-    "application/pdf": ".pdf",
-    "text/html": ".html",
-    "text/plain": ".txt",
-    "application/json": ".json",
+_EXPORT_EXTENSION_MAP: dict[GoogleDriveExportFormat, str] = {
+    GoogleDriveExportFormat.DOCX: ".docx",
+    GoogleDriveExportFormat.XLSX: ".xlsx",
+    GoogleDriveExportFormat.PPTX: ".pptx",
+    GoogleDriveExportFormat.PNG: ".png",
+    GoogleDriveExportFormat.PDF: ".pdf",
+    GoogleDriveExportFormat.HTML: ".html",
+    GoogleDriveExportFormat.TXT: ".txt",
+    GoogleDriveExportFormat.JSON: ".json",
 }


 class GoogleDriveSource(Source):
     """
     Handles source connection for Google Drive and provides methods to fetch files.
+
+    NOTE(Do not define variables at class level that you pass to google client, define them at instance level, or else
+    google client will complain.):
     """

     file_id: str
@@ -62,12 +86,31 @@ class GoogleDriveSource(Source):

     _google_drive_client: ClassVar["GoogleAPIResource | None"] = None
     _credentials_file_path: ClassVar[str | None] = None
+    impersonate: ClassVar[bool | None] = None
+    impersonate_target_email: ClassVar[str | None] = None

     @classmethod
     def set_credentials_file_path(cls, path: str) -> None:
         """Set the path to the service account credentials file."""
         cls._credentials_file_path = path

+    @classmethod
+    def set_impersonation_target(cls, target_mail: str) -> None:
+        """
+        Sets the email address to impersonate when accessing Google Drive resources.
+
+        Args:
+            target_mail (str): The email address to impersonate.
+
+        Raises:
+            ValueError: If the provided email address is invalid (empty or missing '@').
+        """
+        # check if email is a valid email.
+        if not target_mail or "@" not in target_mail:
+            raise ValueError("Invalid email address provided for impersonation.")
+        cls.impersonate = True
+        cls.impersonate_target_email = target_mail
+
     @classmethod
     def _initialize_client_from_creds(cls) -> None:
         """
@@ -82,7 +125,20 @@ def _initialize_client_from_creds(cls) -> None:
             HttpError: If the Google Drive API is not enabled or accessible.
             Exception: If any other error occurs during client initialization.
         """
-        creds = service_account.Credentials.from_service_account_file(cls._credentials_file_path, scopes=_SCOPES)
+        cred_kwargs = {
+            "filename": cls._credentials_file_path,
+            "scopes": _SCOPES,
+        }
+
+        # handle impersonation
+        if cls.impersonate is not None and cls.impersonate:
+            if not cls.impersonate_target_email:
+                raise ValueError("Impersonation target email must be set when impersonation is enabled.")
+            cred_kwargs["subject"] = cls.impersonate_target_email
+            cred_kwargs["scopes"] = _IMPERSONATION_SCOPES
+
+        creds = service_account.Credentials.from_service_account_file(**cred_kwargs)
+
         cls._google_drive_client = build("drive", "v3", credentials=creds)
         cls._google_drive_client.files().list(
             pageSize=1, fields="files(id)", supportsAllDrives=True, includeItemsFromAllDrives=True
@@ -162,7 +218,11 @@ def verify_drive_api_enabled(cls) -> None:

     @traceable
     @requires_dependencies(["googleapiclient"], "google_drive")
-    async def fetch(self) -> Path:
+    async def fetch(
+        self,
+        *,
+        export_format: "GoogleDriveExportFormat | None" = None,
+    ) -> Path:
         """
         Fetch the file from Google Drive and store it locally.

@@ -171,6 +231,9 @@ async def fetch(self) -> Path:
         The local directory is determined by the environment variable `LOCAL_STORAGE_DIR`. If this environment
         variable is not set, a temporary directory is used.

+        Args:
+            export_format: Optional override for the export MIME type when downloading Google-native documents.
+
         Returns:
             The local path to the downloaded file.

@@ -186,7 +249,8 @@ async def fetch(self) -> Path:
         file_local_dir = local_dir / self.file_id
         file_local_dir.mkdir(parents=True, exist_ok=True)

-        export_mime_type, file_extension = self._determine_file_extension()
+        override_mime = export_format.value if export_format else None
+        export_mime_type, file_extension = self._determine_file_extension(override_mime=override_mime)
         local_file_name = f"{self.file_name}{file_extension}"
         path = file_local_dir / local_file_name

@@ -496,22 +560,36 @@ async def from_uri(cls, path: str) -> Iterable[Self]:
         else:
             raise ValueError(f"Unsupported Google Drive URI pattern: {path}")

-    def _determine_file_extension(self) -> tuple[str, str]:
+    def _determine_file_extension(self, override_mime: str | None = None) -> tuple[str, str]:
         """
         Determine the appropriate file extension and export MIME type for the file.

         Returns:
             A tuple of (export_mime_type, file_extension)
         """
+        if override_mime is not None:
+            export_mime_type = override_mime
+            try:
+                export_format = GoogleDriveExportFormat(override_mime)
+                file_extension = _EXPORT_EXTENSION_MAP.get(export_format, ".bin")
+            except ValueError:
+                file_extension = Path(self.file_name).suffix if "." in self.file_name else ".bin"
+            return export_mime_type, file_extension
+
         export_mime_type = self.mime_type
         file_extension = ""

         if self.mime_type.startswith("application/vnd.google-apps"):
-            export_mime_type = _GOOGLE_EXPORT_MIME_MAP.get(self.mime_type, "application/pdf")
-            file_extension = _EXPORT_EXTENSION_MAP.get(export_mime_type, ".bin")
+            export_format = _GOOGLE_EXPORT_MIME_MAP.get(self.mime_type, GoogleDriveExportFormat.PDF)
+            export_mime_type = export_format.value
+            file_extension = _EXPORT_EXTENSION_MAP.get(export_format, ".bin")
         elif "." in self.file_name:
             file_extension = Path(self.file_name).suffix
         else:
-            file_extension = _EXPORT_EXTENSION_MAP.get(self.mime_type, ".bin")
+            try:
+                export_format = GoogleDriveExportFormat(self.mime_type)
+                file_extension = _EXPORT_EXTENSION_MAP.get(export_format, ".bin")
+            except ValueError:
+                file_extension = ".bin"

         return export_mime_type, file_extension
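
The resolution order in `_determine_file_extension` (explicit override first, then the Google-native default, then the file name's own suffix, then a lookup by MIME type) can be tried in isolation. The following is a minimal standalone sketch, not the ragbits implementation: `ExportFormat`, `NATIVE_TO_EXPORT`, `EXTENSIONS`, and `resolve_export` are hypothetical names, and the maps are reduced to a few entries.

```python
from __future__ import annotations

from enum import Enum
from pathlib import Path


class ExportFormat(str, Enum):
    """Reduced, illustrative copy of the export formats in the diff."""

    DOCX = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
    PDF = "application/pdf"
    TXT = "text/plain"


# Google-native MIME type -> default export format (subset only).
NATIVE_TO_EXPORT = {
    "application/vnd.google-apps.document": ExportFormat.DOCX,
    "application/vnd.google-apps.presentation": ExportFormat.PDF,
}
EXTENSIONS = {ExportFormat.DOCX: ".docx", ExportFormat.PDF: ".pdf", ExportFormat.TXT: ".txt"}


def _extension_for(mime: str, fallback: str) -> str:
    """Look up the extension for a known export MIME type, else use the fallback."""
    try:
        return EXTENSIONS.get(ExportFormat(mime), ".bin")
    except ValueError:  # MIME type is not one of the known export formats
        return fallback


def resolve_export(mime_type: str, file_name: str, override_mime: str | None = None) -> tuple[str, str]:
    """Return (export_mime_type, file_extension) following the order described above."""
    if override_mime is not None:
        name_suffix = Path(file_name).suffix if "." in file_name else ".bin"
        return override_mime, _extension_for(override_mime, fallback=name_suffix)
    if mime_type.startswith("application/vnd.google-apps"):
        export_format = NATIVE_TO_EXPORT.get(mime_type, ExportFormat.PDF)
        return export_format.value, EXTENSIONS.get(export_format, ".bin")
    if "." in file_name:
        return mime_type, Path(file_name).suffix
    return mime_type, _extension_for(mime_type, fallback=".bin")


print(resolve_export("application/vnd.google-apps.document", "MyDoc"))
print(resolve_export("application/vnd.google-apps.document", "MyDoc", override_mime="application/pdf"))
```

Keying the extension map by enum member rather than by raw string means an unknown MIME type surfaces as a `ValueError` at the conversion step, which the function turns into an explicit fallback.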
73 changes: 71 additions & 2 deletions packages/ragbits-core/tests/unit/sources/test_google_drive.py
@@ -1,11 +1,11 @@
-import json  # Import json for potential validation or pretty printing
+import json
 import os
 from pathlib import Path

 import pytest
 from googleapiclient.errors import HttpError

-from ragbits.core.sources.google_drive import GoogleDriveSource
+from ragbits.core.sources.google_drive import GoogleDriveExportFormat, GoogleDriveSource


 @pytest.fixture(autouse=True)
@@ -50,6 +50,58 @@ def setup_local_storage_dir(tmp_path: Path):
         del os.environ["LOCAL_STORAGE_DIR"]


+@pytest.mark.asyncio
+async def test_google_drive_impersonate():
+    """Test service account impersonation with better error handling."""
+    target_email = os.environ.get("GOOGLE_DRIVE_TARGET_EMAIL")
+    credentials_file = "test_clientid.json"
+
+    GoogleDriveSource.set_credentials_file_path(credentials_file)
+
+    if target_email is None:
+        pytest.skip("GOOGLE_DRIVE_TARGET_EMAIL environment variable not set")
+
+    GoogleDriveSource.set_impersonation_target(target_email)
+
+    unit_test_folder_id = os.environ.get("GOOGLE_SOURCE_UNIT_TEST_FOLDER")
+
+    if unit_test_folder_id is None:
+        pytest.skip("GOOGLE_SOURCE_UNIT_TEST_FOLDER environment variable not set")
+
+    sources_to_download = await GoogleDriveSource.from_uri(f"{unit_test_folder_id}/**")
+    downloaded_count = 0
+
+    try:
+        # Iterate through each source (file or folder) found
+        for source in sources_to_download:
+            # Only attempt to fetch files, as folders cannot be "downloaded" in the same way
+            if not source.is_folder:
+                try:
+                    # Attempt to fetch (download) the file.
+                    local_path = await source.fetch()
+                    print(f" Downloaded: '{source.file_name}' (ID: {source.file_id}) to '{local_path}'")
+                    downloaded_count += 1
+                except HttpError as e:
+                    # Catch Google API specific HTTP errors (e.g., permission denied, file not found)
+                    print(f" Google API Error downloading '{source.file_name}' (ID: {source.file_id}): {e}")
+                except Exception as e:
+                    # Catch any other general exceptions during the download process
+                    print(f" Failed to download '{source.file_name}' (ID: {source.file_id}): {e}")
+            else:
+                print(f" Skipping folder: '{source.file_name}' (ID: {source.file_id})")
+
+    except Exception as e:
+        # Catch any exceptions that occur during the initial setup or `from_uri` call
+        print(f"An error occurred during test setup or source retrieval: {e}")
+
+    finally:
+        # This block ensures the final summary is printed regardless of errors
+        print(f"\n--- Successfully downloaded {downloaded_count} files from '{unit_test_folder_id}' ---")
+        # Assert that at least one file was downloaded if that's an expectation for the test
+        # If no files are expected, or it's acceptable for 0 files to be downloaded, remove or adjust this assertion.
+        assert downloaded_count > 0, "Expected to download at least one file, but downloaded 0."
+
+
 @pytest.mark.asyncio
 async def test_google_drive_source_fetch_file_not_found():
     """Test fetching a non-existent file."""
@@ -99,6 +151,9 @@ async def test_google_drive_source_fetch_file():
     """
     unit_test_folder_id = os.environ.get("GOOGLE_SOURCE_UNIT_TEST_FOLDER")

+    if unit_test_folder_id is None:
+        pytest.skip("GOOGLE_SOURCE_UNIT_TEST_FOLDER environment variable not set")
+
     # Initialize a counter for successfully downloaded files
     downloaded_count = 0

@@ -141,3 +196,17 @@ async def test_google_drive_source_fetch_file():
         # Assert that at least one file was downloaded if that's an expectation for the test
         # If no files are expected, or it's acceptable for 0 files to be downloaded, remove or adjust this assertion.
         assert downloaded_count > 0, "Expected to download at least one file, but downloaded 0."
+
+
+def test_determine_file_extension_override():
+    """Ensure overriding export MIME type yields expected extension."""
+    src = GoogleDriveSource(
+        file_id="dummy",
+        file_name="MyDoc",
+        mime_type="application/vnd.google-apps.document",
+    )
+
+    export_mime, extension = src._determine_file_extension(override_mime=GoogleDriveExportFormat.PDF.value)
+
+    assert export_mime == GoogleDriveExportFormat.PDF.value
+    assert extension == ".pdf"
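
The email check in `set_impersonation_target` can also be exercised without credentials or network access. The sketch below uses a stand-in class rather than `GoogleDriveSource` itself: `FakeDriveSource`, `reset_impersonation`, and `collect_rejected` are hypothetical and exist only for this illustration. It also resets the class-level attributes after each attempt, since settings stored on the class would otherwise carry over into later tests.

```python
from __future__ import annotations


class FakeDriveSource:
    """Hypothetical stand-in that mirrors only the impersonation attributes from the diff."""

    impersonate: bool | None = None
    impersonate_target_email: str | None = None

    @classmethod
    def set_impersonation_target(cls, target_mail: str) -> None:
        # Same minimal validation as in the diff: non-empty and contains "@".
        if not target_mail or "@" not in target_mail:
            raise ValueError("Invalid email address provided for impersonation.")
        cls.impersonate = True
        cls.impersonate_target_email = target_mail

    @classmethod
    def reset_impersonation(cls) -> None:
        # Assumed helper (not in the diff): clear class-level state between tests.
        cls.impersonate = None
        cls.impersonate_target_email = None


def collect_rejected(candidates: list[str]) -> list[str]:
    """Return the candidates that the validation refuses."""
    rejected = []
    for candidate in candidates:
        try:
            FakeDriveSource.set_impersonation_target(candidate)
        except ValueError:
            rejected.append(candidate)
        finally:
            FakeDriveSource.reset_impersonation()
    return rejected


rejected = collect_rejected(["", "no-at-sign", "user@example.com"])
print(rejected)
```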