@@ -1,5 +1,62 @@
# CHANGELOG

## [0.7.0] - 2025-01-15

### 🎉 Major Release - Enhanced SharePoint Integration

#### ✨ New Features

- **📄 SharePoint Page Reading**: Complete support for loading SharePoint site pages as documents

  - Use `sharepoint_type=SharePointType.PAGE` to load pages instead of files
  - Support for both all pages and specific page loading via `page_name`
  - Full HTML content extraction with metadata

- **🔧 Custom File Parsers**: Advanced file parsing system

  - Support for specialized parsers: PDF, DOCX, PPTX, HTML, CSV, Excel, Images, JSON, TXT
  - `CustomParserManager` for efficient parser management
  - Automatic file type detection and parser selection
  - Complete file parser implementations in `file_parsers.py`

- **📊 Event System**: Real-time processing monitoring

  - Comprehensive event classes: `PageDataFetchStartedEvent`, `PageDataFetchCompletedEvent`, `PageSkippedEvent`, `PageFailedEvent`, `TotalPagesToProcessEvent`
  - Integration with LlamaIndex instrumentation system
  - Event dispatching for monitoring document processing progress

- **🎯 Document Callbacks**: Advanced filtering and processing

  - `process_document_callback` for custom document filtering logic
  - `process_attachment_callback` for attachment handling
  - Flexible callback system for custom processing workflows

- **⚙️ Enhanced Error Handling**: Configurable error behavior

  - `fail_on_error` parameter for controlling error handling strategy
  - Option to continue processing when individual files fail
  - Improved error reporting and logging

#### 🛠️ Technical Improvements

- **Type Safety**: Complete FileType enum with all supported formats
- **Code Organization**: Modular architecture with separate event and parser modules
- **Test Coverage**: Comprehensive test suite with 27+ test scenarios
- **Documentation**: Extensive README with examples and configuration options
- **Performance**: Optimized file processing and memory management

#### 🔧 Breaking Changes

- Constructor signature updated to support new parameters
- `sharepoint_type` parameter added (defaults to `SharePointType.DRIVE` for backward compatibility)
- `custom_parsers` requires `custom_folder` parameter when used
- Event system integration may require dispatcher setup for monitoring

#### 📦 Dependencies

- Added optional `[file_parsers]` extra for enhanced file processing capabilities
- Updated core dependencies for better compatibility
- Support for Python 3.9+

## [0.5.1] - 2025-04-02

- Fix issue with folder path encoding when a file path contains special characters
@@ -4,32 +4,55 @@
pip install llama-index-readers-microsoft-sharepoint
```

The loader loads the files from a folder in sharepoint site.
The loader loads files from a folder in a SharePoint site.

It also supports traversing recursively through the sub-folders.
It also supports traversing recursively through sub-folders.

## Prequsites
## ✨ New Features

### App Authentication using Microsoft Entra ID(formerly Azure AD)
- **📄 SharePoint Page Reading**: Load SharePoint site pages as documents
- **🔧 Custom File Parsers**: Use specialized parsers for different file types (PDF, DOCX, HTML, etc.)
- **📊 Event System**: Monitor document processing with real-time events
- **🎯 Document Callbacks**: Filter and process documents with custom logic
- **⚙️ Error Handling**: Configurable error handling behavior
- **🚀 Enhanced Performance**: Optimized loading with parallel processing support

---

## Prerequisites

### App Authentication using Microsoft Entra ID (formerly Azure AD)

1. Create an App Registration in Microsoft Entra ID. Refer to [this guide](https://learn.microsoft.com/en-us/azure/healthcare-apis/register-application).
2. API Permissions for the created app.
1. Microsoft Graph --> Application Permissions --> Sites.ReadAll (**Grant Admin Consent**)
2. Microsoft Graph --> Application Permissions --> Files.ReadAll (**Grant Admin Consent**)
3. Microsoft Graph --> Application Permissions --> BrowserSiteLists.Read.All (**Grant Admin Consent**)
2. API Permissions for the created app:
   - Microsoft Graph → Application Permissions → **Sites.Read.All** (**Grant Admin Consent**)
     _(Allows access to all sites in the tenant)_
   - **OR**
     Microsoft Graph → Application Permissions → **Sites.Selected** (**Grant Admin Consent**)
     _(Allows access only to specific sites you select and grant permissions for)_
   - Microsoft Graph → Application Permissions → Files.Read.All (**Grant Admin Consent**)
   - Microsoft Graph → Application Permissions → BrowserSiteLists.Read.All (**Grant Admin Consent**)

> **Note:**
> If you use `Sites.Selected`, you must grant your app access to the specific SharePoint site(s) via the SharePoint admin center.
> See [Grant access to a specific site](https://learn.microsoft.com/en-us/sharepoint/dev/solution-guidance/security-apponly-azuread#grant-access-to-a-specific-site) for details.

For more information on Microsoft Graph permissions, see the [permissions reference](https://learn.microsoft.com/en-us/graph/permissions-reference).

---

## Usage

To use this loader `client_id`, `client_secret` and `tenant_id` of the registered app in Microsoft Azure Portal is required.
To use this loader, you need the `client_id`, `client_secret`, and `tenant_id` of the registered app in Microsoft Azure Portal.

This loader loads the files present in a specific folder in sharepoint.
This loader loads the files present in a specific folder in SharePoint.

If the files are present in the `Test` folder in SharePoint Site under `root` directory, then the input for the loader for `file_path` is `Test`
If the files are present in the `Test` folder in a SharePoint Site under the `root` directory, then the input for the loader for `sharepoint_folder_path` is `Test`.

![FilePath](file_path_info.png)

### Example: Using `sharepoint_site_name`

```python
from llama_index.readers.microsoft_sharepoint import SharePointReader

@@ -46,4 +69,215 @@ documents = loader.load_data(
)
```
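
The three credentials are secrets, so you may prefer to read them from the environment rather than hard-coding them. The following is a minimal sketch using only the standard library. The helper `credentials_from_env` and the `SHAREPOINT_*` variable names are hypothetical and not part of this package.

```python
import os
from typing import Dict, Mapping, Optional

# Hypothetical environment variable names mapped to the constructor arguments
# documented above. Rename them to match your deployment.
ENV_TO_ARG = {
    "SHAREPOINT_CLIENT_ID": "client_id",
    "SHAREPOINT_CLIENT_SECRET": "client_secret",
    "SHAREPOINT_TENANT_ID": "tenant_id",
}


def credentials_from_env(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Collect the app-registration values, failing early if any is missing."""
    env = os.environ if env is None else env
    missing = [name for name in ENV_TO_ARG if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {arg: env[name] for name, arg in ENV_TO_ARG.items()}
```

The returned dictionary can then be unpacked into the constructor, for example `SharePointReader(**credentials_from_env())`.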

The loader doesn't access other components of the `SharePoint Site`.
### Example: Using `sharepoint_host_name` and `sharepoint_relative_url`

If you have only been granted access to a specific site (using `Sites.Selected`), you can use the site host name and relative URL:

```python
loader = SharePointReader(
    client_id="<Client ID of the app>",
    client_secret="<Client Secret of the app>",
    tenant_id="<Tenant ID of the Microsoft Azure Directory>",
    sharepoint_host_name="contoso.sharepoint.com",
    sharepoint_relative_url="sites/YourSiteName",
)

documents = loader.load_data(
    sharepoint_folder_path="<Folder Path>",
    recursive=True,
)
```
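
If you only have the full site URL, both arguments can be derived from it. This is a minimal sketch using only the standard library. `split_site_url` is a hypothetical helper, not part of the reader, and it assumes the relative URL is simply the URL path without its surrounding slashes, as in the example above.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_site_url(site_url: str) -> Tuple[str, str]:
    """Split an absolute site URL into (host name, relative URL)."""
    parsed = urlparse(site_url)
    if not parsed.netloc:
        raise ValueError(f"Expected an absolute URL, got {site_url!r}")
    # Drop leading and trailing slashes so the result matches "sites/YourSiteName".
    return parsed.netloc, parsed.path.strip("/")


host_name, relative_url = split_site_url(
    "https://contoso.sharepoint.com/sites/YourSiteName/"
)
```

The two values can then be passed as `sharepoint_host_name` and `sharepoint_relative_url`.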

---

## Advanced Features

### 🔧 Custom File Parsers

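You can supply your own readers for specific file types (for example PDF, DOCX, or HTML) by passing the `custom_parsers` argument, a dictionary that maps `FileType` members to reader instances. This lets you control how each file type is parsed. When you pass `custom_parsers`, you must also pass `custom_folder`, a directory for temporary files.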

```python
from llama_index.readers.microsoft_sharepoint.file_parsers import (
    PDFReader,
    HTMLReader,
    DocxReader,
    PptxReader,
    CSVReader,
    ExcelReader,
    ImageReader,
)
from llama_index.readers.microsoft_sharepoint.event import FileType

custom_parsers = {
    FileType.PDF: PDFReader(),
    FileType.HTML: HTMLReader(),
    FileType.DOCUMENT: DocxReader(),
    FileType.PRESENTATION: PptxReader(),
    FileType.CSV: CSVReader(),
    FileType.SPREADSHEET: ExcelReader(),
    FileType.IMAGE: ImageReader(),
}

loader = SharePointReader(
    client_id="...",
    client_secret="...",
    tenant_id="...",
    custom_parsers=custom_parsers,
    custom_folder="/tmp",  # Directory for temporary files
)
```
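
Rather than pointing `custom_folder` at a shared location such as `/tmp`, you could give each run its own scratch directory and remove it afterwards. The sketch below uses only the standard library. `scratch_folder` is a hypothetical helper, and it assumes the reader needs the folder only while loading is in progress, which this document does not confirm.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator


@contextmanager
def scratch_folder(prefix: str = "sharepoint_parsers_") -> Iterator[str]:
    """Create a private temporary directory and delete it on exit."""
    path = Path(tempfile.mkdtemp(prefix=prefix))
    try:
        yield str(path)
    finally:
        # Remove the directory and anything written into it.
        shutil.rmtree(path, ignore_errors=True)


# Intended use (assumption: load inside the `with` block):
# with scratch_folder() as folder:
#     loader = SharePointReader(..., custom_parsers=custom_parsers, custom_folder=folder)
#     documents = loader.load_data(...)
```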

### 📄 SharePoint Page Reading

You can load SharePoint pages (not just files) by setting `sharepoint_type=SharePointType.PAGE`. To load a single page, also provide its `page_name`.

```python
from llama_index.readers.microsoft_sharepoint.base import SharePointType

# Load all pages from a site
loader = SharePointReader(
    client_id="...",
    client_secret="...",
    tenant_id="...",
    sharepoint_type=SharePointType.PAGE,
)

documents = loader.load_data(
    sharepoint_site_name="<Sharepoint Site Name>",
    download_dir="/tmp/pages",  # Required for page content processing
)

# Load a specific page
loader = SharePointReader(
    client_id="...",
    client_secret="...",
    tenant_id="...",
    sharepoint_type=SharePointType.PAGE,
    page_name="<Page Name>",
)
```

### 🎯 Document Filtering with Callbacks

Use callbacks to filter or modify documents during processing:

```python
def should_process_document(file_name: str) -> bool:
    """Filter out certain files based on name patterns."""
    return not file_name.startswith("temp_") and not file_name.endswith(".tmp")


loader = SharePointReader(
    client_id="...",
    client_secret="...",
    tenant_id="...",
    process_document_callback=should_process_document,
)
```
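
When the exclusion rules grow beyond a couple of prefixes, a small factory can build the callback from a list of glob patterns. This is a sketch under the assumption, taken from the example above, that the callback receives a file name and returns `True` to process the file. `make_name_filter` and the pattern list are illustrative, not part of the package.

```python
from fnmatch import fnmatch
from typing import Callable, Iterable


def make_name_filter(exclude_patterns: Iterable[str]) -> Callable[[str], bool]:
    """Return a callback that rejects file names matching any glob pattern."""
    patterns = tuple(p.lower() for p in exclude_patterns)

    def should_process(file_name: str) -> bool:
        # Compare in lower case so "Draft.TMP" is treated like "draft.tmp".
        name = file_name.lower()
        return not any(fnmatch(name, pattern) for pattern in patterns)

    return should_process


# Skip scratch files and Office lock files (names starting with "~$").
skip_scratch_files = make_name_filter(["temp_*", "*.tmp", "~$*"])
```

The resulting function can be passed as `process_document_callback=skip_scratch_files`.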

### 📊 Event System for Monitoring

Monitor document processing with real-time events:

```python
from llama_index.core.instrumentation import get_dispatcher
from llama_index.core.instrumentation.event_handlers import BaseEventHandler
from llama_index.readers.microsoft_sharepoint.event import (
    PageDataFetchStartedEvent,
    PageDataFetchCompletedEvent,
    PageSkippedEvent,
    PageFailedEvent,
)


class SharePointEventHandler(BaseEventHandler):
    def handle(self, event):
        if isinstance(event, PageDataFetchStartedEvent):
            print(f"Started processing: {event.page_id}")
        elif isinstance(event, PageDataFetchCompletedEvent):
            print(f"Completed processing: {event.page_id}")
        elif isinstance(event, PageSkippedEvent):
            print(f"Skipped: {event.page_id}")
        elif isinstance(event, PageFailedEvent):
            print(f"Failed: {event.page_id} - {event.error}")


# Register event handler
dispatcher = get_dispatcher("llama_index.readers.microsoft_sharepoint.base")
dispatcher.add_event_handler(SharePointEventHandler())

# Now load data with event monitoring
documents = loader.load_data(sharepoint_site_name="YourSite")
```
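
For long runs, a summary of how many pages were completed, skipped, or failed is often more useful than one printed line per event. The sketch below keeps such a tally with the standard library only. `EventTally` is a hypothetical helper; calling `tally.record(event)` from inside a handler's `handle` method is one possible way to wire it up, not something the package prescribes.

```python
from collections import Counter


class EventTally:
    """Counts events by class name, e.g. how many PageSkippedEvent were seen."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def record(self, event: object) -> None:
        # Use the event's class name as the key so no event imports are needed.
        self.counts[type(event).__name__] += 1

    def summary(self) -> str:
        """Return the counts as 'Name=count' pairs in alphabetical order."""
        return ", ".join(f"{name}={n}" for name, n in sorted(self.counts.items()))
```

After `load_data` returns, `tally.summary()` gives a one-line overview of the run.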

### ⚙️ Error Handling

Configure how the reader handles errors:

```python
# Fail immediately on any error (default)
loader = SharePointReader(
    client_id="...",
    client_secret="...",
    tenant_id="...",
    fail_on_error=True,
)

# Continue processing even if some files fail
loader = SharePointReader(
    client_id="...",
    client_secret="...",
    tenant_id="...",
    fail_on_error=False,  # Skip failed files and continue
)
```
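
To make the two modes concrete, here is a generic illustration of the pattern that a flag like `fail_on_error` typically controls. It is not the reader's actual implementation. `process_all` and its arguments are hypothetical names used only for this sketch.

```python
import logging
from typing import Callable, Iterable, List, Tuple, TypeVar

T = TypeVar("T")
R = TypeVar("R")
logger = logging.getLogger(__name__)


def process_all(
    items: Iterable[T],
    handler: Callable[[T], R],
    fail_on_error: bool = True,
) -> Tuple[List[R], List[Tuple[T, Exception]]]:
    """Apply handler to each item, either stopping or skipping on failure."""
    results: List[R] = []
    failures: List[Tuple[T, Exception]] = []
    for item in items:
        try:
            results.append(handler(item))
        except Exception as exc:
            if fail_on_error:
                raise  # Stop at the first failure.
            # Otherwise record the failure and move on to the next item.
            logger.warning("Skipping %r: %s", item, exc)
            failures.append((item, exc))
    return results, failures
```

With `fail_on_error=False`, one unreadable file does not abort a large load, at the cost of a result set that may be incomplete, so it is worth checking the logs afterwards.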

---

## 📋 Installation Options

### Basic Installation

```bash
pip install llama-index-readers-microsoft-sharepoint
```

### With File Parser Support

For enhanced file parsing capabilities (PDF, DOCX, images, etc.):

```bash
pip install "llama-index-readers-microsoft-sharepoint[file_parsers]"
```

This includes additional dependencies:

- `pytesseract` - For OCR in images
- `pdf2image` - For PDF processing
- `python-pptx` - For PowerPoint files
- `docx2txt` - For Word documents
- `pandas` - For Excel/CSV files
- `beautifulsoup4` - For HTML parsing
- `Pillow` - For image processing

---

## 🔧 Configuration Options

| Parameter | Type | Description | Default |
| --------------------------- | --------------------- | ------------------------------------------------------------ | ------- |
| `sharepoint_type` | `SharePointType` | Type of SharePoint content (`DRIVE` or `PAGE`) | `DRIVE` |
| `custom_parsers` | `Dict[FileType, Any]` | Custom parsers for specific file types | `{}` |
| `custom_folder` | `str` | Directory for temporary files (required with custom_parsers) | `None` |
| `process_document_callback` | `Callable` | Function to filter/process documents | `None` |
| `fail_on_error` | `bool` | Whether to stop on first error or continue | `True` |

---

## Notes

- The loader does not access other components of the SharePoint Site.
- If you use `custom_parsers`, you must also provide `custom_folder` (a directory for temporary files).
- SharePoint page reading requires a download directory for content processing.
- Event monitoring is optional but provides valuable insights into processing status.
- For more advanced usage, see the docstrings in the code and the test files for examples.