A Fess Data Store plugin that enables enterprise search across Microsoft 365 services, including OneDrive, OneNote, Teams, SharePoint Document Libraries, and SharePoint Lists, using the Microsoft Graph SDK v6.
This plugin extends Fess enterprise search capabilities to comprehensively index Microsoft 365 content, providing unified search across your organization's cloud documents, conversations, and data with role-based access control integration.
- OneDrive: User and group files, folders with metadata extraction
- OneNote: Complete notebooks with aggregated content from all sections and pages, supporting site, user, and group notebooks
- Teams: Channels, messages, chats with conversation context
- SharePoint Document Libraries: Sites and document libraries with enhanced content aggregation
- SharePoint Lists: Custom lists and list items with dynamic field mapping
- Role-based Access Control: Seamless integration with Fess security model
- Azure AD Authentication: Client credentials flow with automatic token refresh
- Permission Inheritance: Preserves Microsoft 365 access permissions in search results
- Microsoft Graph SDK v6: Latest API with efficient pagination and caching
- Multi-threaded Processing: Configurable thread pools for optimal performance
- Smart Caching: Drive ID, user type, and group ID caching to reduce API calls
- Robust Error Handling: Comprehensive error tracking with configurable failure recovery
- Content Filtering: Advanced include/exclude patterns with system content filtering
- Maven Integration: Clean build process with dependency shading
- Extensive Testing: UTFlute-based test framework with mock Graph API responses
- Configurable Field Mapping: Customizable data extraction scripts for each service
- Java: 21 or higher
- Fess: 15.2.0 or higher
- Azure AD: App registration with Microsoft Graph API permissions
- Download the latest fess-ds-microsoft365-X.X.X.jar from Maven Central
- Copy the JAR file to your Fess installation:
  # For standard installation
  cp fess-ds-microsoft365-X.X.X.jar $FESS_HOME/app/WEB-INF/lib/
  # For system-wide installation
  sudo cp fess-ds-microsoft365-X.X.X.jar /usr/share/fess/app/WEB-INF/lib/
- Restart Fess service
# Clone the repository
git clone https://github.com/codelibs/fess-ds-microsoft365.git
cd fess-ds-microsoft365
# Build the plugin
mvn clean package
# Copy to Fess installation
cp target/fess-ds-microsoft365-*.jar $FESS_HOME/app/WEB-INF/lib/
Before using this plugin, create an Azure App registration with the required permissions:
- Register Application in Azure Portal
- Add API Permissions (Microsoft Graph):
  - Files.Read.All - OneDrive file access
  - Sites.Read.All - SharePoint sites and lists
  - Notes.Read.All - OneNote notebooks
  - Chat.Read.All - Teams chat messages
  - ChannelMessage.Read.All - Teams channel messages
  - Team.ReadBasic.All - Teams basic information
  - User.Read.All - User directory access
  - Group.Read.All - Group information
- Grant Admin Consent for the permissions
- Create Client Secret and note the values
Configure the following authentication parameters in Fess:
# Required Azure AD credentials
tenant=********-****-****-****-************
client_id=********-****-****-****-************
client_secret=***********************
# Optional performance settings
number_of_threads=1
ignore_error=false
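As a rough illustration of the parameter format, the sketch below parses `key=value` lines and reports which of the three required credentials are missing. The class and method names (`Ms365ParamCheck`, `parse`, `missingRequired`) are hypothetical and not part of the plugin; Fess handles parameter parsing itself.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the plugin: illustrates the parameter format only.
final class Ms365ParamCheck {
    static final List<String> REQUIRED = List.of("tenant", "client_id", "client_secret");

    // Parses key=value lines, skipping blank lines and lines that start with '#'.
    static Map<String, String> parse(String text) {
        Map<String, String> params = new LinkedHashMap<>();
        for (String line : text.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            int eq = trimmed.indexOf('=');
            if (eq <= 0) {
                continue; // no key before '=', ignore the line
            }
            params.put(trimmed.substring(0, eq).trim(), trimmed.substring(eq + 1).trim());
        }
        return params;
    }

    // Returns the required keys that are absent or empty.
    static List<String> missingRequired(Map<String, String> params) {
        List<String> missing = new ArrayList<>();
        for (String key : REQUIRED) {
            String value = params.get(key);
            if (value == null || value.isEmpty()) {
                missing.add(key);
            }
        }
        return missing;
    }
}
```

Running `missingRequired` before starting a crawl is one way to catch an incomplete configuration early instead of waiting for an authentication failure.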
The plugin provides five specialized data store types, each optimized for different Microsoft 365 services:
Data Store | Service | Content Types | Use Cases |
---|---|---|---|
oneDriveDataStore | OneDrive | Files, Folders, Metadata | Document search, file discovery |
oneNoteDataStore | OneNote | Notebooks (with sections & pages content) | Knowledge base search, note finding, documentation search |
teamsDataStore | Teams | Channels, Messages, Chats | Conversation search, team communication |
sharePointDocLibDataStore | SharePoint | Document Libraries, Files | Document management, content discovery |
sharePointListDataStore | SharePoint | Lists, List Items, Custom Fields | Structured data search, business process content |
- Navigate to System > Data Store
- Click Create New
- Select the desired data store type from the dropdown
- Configure authentication and crawling parameters
- Save and start crawling
title=file.name
content=file.description + "\n" + file.contents
mimetype=file.mimetype
created=file.created
last_modified=file.last_modified
url=file.web_url
role=file.roles
Key | Value |
---|---|
file.name | The name of the file. |
file.description | A short description of the file. |
file.contents | The text contents of the file (extracted using Tika). |
file.mimetype | The MIME type of the file. |
file.filetype | The file type category determined by MIME type. |
file.created | The time at which the file was created. |
file.last_modified | The last time the file was modified by anyone. |
file.size | The size of the file in bytes. |
file.web_url | A link for opening the file in an editor or viewer in a browser. |
file.url | The processed URL for the file (may differ from web_url for certain locations). |
file.roles | Users/groups who can access the file. |
file.ctag | Change tag for the file (used for change tracking). |
file.etag | Entity tag for the file (used for caching). |
file.id | The unique identifier of the file in OneDrive. |
file.webdav_url | WebDAV URL for the file (if available). |
file.location | Geographic location metadata (if available). |
file.createdby_application | Application that created the file. |
file.createdby_device | Device that created the file. |
file.createdby_user | User who created the file. |
file.deleted | Deletion information (if file was deleted). |
file.hashes | File hash values for integrity checking. |
file.last_modifiedby_application | Application that last modified the file. |
file.last_modifiedby_device | Device that last modified the file. |
file.last_modifiedby_user | User who last modified the file. |
file.image | Image metadata (for image files). |
file.parent | Parent reference information. |
file.parent_id | ID of the parent folder. |
file.parent_name | Name of the parent folder. |
file.parent_path | Path to the parent folder. |
file.photo | Photo metadata (for photo files). |
file.publication | Publication information (if applicable). |
file.search_result | Search result metadata (if file was found via search). |
file.special_folder | Special folder name (if file is in a special folder). |
file.video | Video metadata (for video files). |
title=notebook.name
content=notebook.contents
created=notebook.created
last_modified=notebook.last_modified
url=notebook.web_url
role=notebook.roles
size=notebook.size
Key | Value |
---|---|
notebook.name | The name of the notebook. |
notebook.contents | The extracted text contents from all sections and pages within the notebook. |
notebook.size | The size of the notebook content in characters. |
notebook.created | The time at which the notebook was created. |
notebook.last_modified | The last time the notebook was modified by anyone. |
notebook.web_url | A link for opening the notebook in OneNote web or desktop app. |
notebook.roles | Users/groups who can access the notebook. |
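The relationship between `notebook.contents` and `notebook.size` can be sketched as follows. This is a minimal illustration, assuming a notebook is modeled as a map from section name to the plain text of its pages; the `NotebookAggregator` class, the newline separator, and the skipping of blank pages are assumptions, not the plugin's actual implementation.

```java
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical model: section name -> plain text of each page in that section.
final class NotebookAggregator {

    // Concatenates the text of every page in every section, one page per line block.
    static String aggregate(Map<String, List<String>> sections) {
        StringJoiner joiner = new StringJoiner("\n");
        for (List<String> pages : sections.values()) {
            for (String page : pages) {
                if (page != null && !page.trim().isEmpty()) {
                    joiner.add(page.trim());
                }
            }
        }
        return joiner.toString();
    }

    // Size measured in characters, following the description of notebook.size.
    static long sizeOf(String contents) {
        return contents.length();
    }
}
```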
title=message.title
content=message.content
created=message.created_date_time
last_modified=message.last_modified_date_time
url=message.web_url
role=message.roles
Key | Value |
---|---|
message.title | The message title (sender name and timestamp). |
message.content | The text contents of the message including attachments if configured. |
message.created_date_time | The time at which the message was created. |
message.last_modified_date_time | The last time the message was modified. |
message.web_url | A link for opening the message in Teams. |
message.roles | Users/groups who can access the team/channel/chat. |
message.id | The unique identifier of the message. |
message.from | The sender information. |
message.subject | The subject of the message. |
message.body | The body content with type information. |
message.attachments | File attachments associated with the message. |
message.mentions | Users mentioned in the message. |
team | The team object containing team information (when applicable). |
channel | The channel object containing channel information (when applicable). |
parent | The parent message for replies (when applicable). |
title=doclib.name
content=doclib.content
created=doclib.created
last_modified=doclib.modified
url=doclib.url
role=doclib.roles
Key | Value |
---|---|
doclib.id | The unique identifier of the document library (Drive ID). |
doclib.name | The name of the document library. |
doclib.description | The description of the document library. |
doclib.content | Rich content combining document library name, description, and site name for enhanced search. |
doclib.url | The Microsoft Graph API web URL for the document library. |
doclib.canonical_url | The standardized SharePoint URL for accessing the document library. |
doclib.created | The time at which the document library was created. |
doclib.modified | The last time the document library was modified. |
doclib.type | The type of the drive (e.g., "documentLibrary"). |
doclib.site_name | The display name of the SharePoint site containing this document library. |
doclib.site_url | The web URL of the SharePoint site. |
doclib.roles | Users/groups who can access the document library. |
Note: SharePointDocLibDataStore indexes document libraries as individual searchable entities, combining library metadata with site information to provide comprehensive search content. The doclib.content field aggregates the library name, description, and parent site name for enhanced discoverability.
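The aggregation described in the note can be pictured with a small sketch. The `DocLibContent` class is hypothetical, and the join order, the newline separator, and the handling of missing values are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper showing one way to combine the three values into a single text.
final class DocLibContent {

    // Joins the non-empty parts with newlines; order and separator are assumptions.
    static String build(String name, String description, String siteName) {
        List<String> parts = new ArrayList<>();
        for (String part : new String[] { name, description, siteName }) {
            if (part != null && !part.trim().isEmpty()) {
                parts.add(part.trim());
            }
        }
        return String.join("\n", parts);
    }
}
```

With this approach, a library without a description still yields searchable text made of its name and its site name.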
title=item.title
content=item.content
created=item.created
last_modified=item.modified
url=item.url
role=item.roles
Key | Value |
---|---|
item.title | The title of the list item (extracted from Title, LinkTitle, or FileLeafRef fields). |
item.content | The text contents of the list item (extracted from Body, Description, Comments, or Notes fields). |
item.id | The unique identifier of the list item. |
item.created | The time at which the list item was created. |
item.modified | The last time the list item was modified. |
item.url | A link for opening the list item in SharePoint. |
item.fields | All fields and values from the SharePoint list item as a map. |
item.attachments | File attachments associated with the list item (if any). |
item.roles | Users/groups who can access the list item. |
item.site | Site information containing id, name, and url. |
item.list | List information containing name, description, url, and template_type. |
Data Structure: The item object contains nested structures:
- item.site - Contains site metadata (site.id, site.name, site.url)
- item.list - Contains list metadata (list.name, list.description, list.url, list.template_type)
- item.fields - Dynamic map of all SharePoint list fields and their values
Note: The plugin automatically expands SharePoint list item fields to ensure content extraction. If fields are not initially available, it performs an individual API call with $expand=fields to retrieve the complete field data.
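Once the field map is available, the title and content extraction described in the table above amounts to picking a value from a list of candidate fields. The sketch below assumes the candidates are tried in the order listed and that the first non-blank value wins; that priority order and the `ListItemFields` class are assumptions for illustration.

```java
import java.util.List;
import java.util.Map;

// Hypothetical helper: first-non-blank lookup over candidate SharePoint field names.
final class ListItemFields {
    static final List<String> TITLE_FIELDS = List.of("Title", "LinkTitle", "FileLeafRef");
    static final List<String> CONTENT_FIELDS = List.of("Body", "Description", "Comments", "Notes");

    // Returns the first non-blank value among the candidates, or "" if none is present.
    static String firstNonBlank(Map<String, Object> fields, List<String> candidates) {
        for (String key : candidates) {
            Object value = fields.get(key);
            if (value != null && !value.toString().trim().isEmpty()) {
                return value.toString().trim();
            }
        }
        return "";
    }
}
```

For example, an item that has only `LinkTitle` and `Notes` populated would get its title from `LinkTitle` and its content from `Notes`.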
Parameter | Description | Example |
---|---|---|
tenant | Azure AD tenant ID | contoso.onmicrosoft.com or GUID |
client_id | App registration client ID | 12345678-1234-1234-1234-123456789abc |
client_secret | App registration client secret | abcdefghijk... |
Parameter | Description | Default | Example |
---|---|---|---|
number_of_threads | Concurrent crawling threads | 1 | 3 |
ignore_error | Continue on errors | true | false |
include_pattern | Regex pattern for inclusion | - | .*\.pdf$ |
exclude_pattern | Regex pattern for exclusion | - | .*temp.* |
default_permissions | Default role assignments | - | {role}admin |
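The interaction of `include_pattern` and `exclude_pattern` can be sketched with `java.util.regex`. This is an illustration only: the `UrlFilter` class is hypothetical, and both the full-string matching and the precedence of exclusion over inclusion are assumptions about behavior that the table does not spell out.

```java
import java.util.regex.Pattern;

// Hypothetical filter combining an include and an exclude regex.
final class UrlFilter {
    private final Pattern include; // null means "include everything"
    private final Pattern exclude; // null means "exclude nothing"

    UrlFilter(String includeRegex, String excludeRegex) {
        this.include = includeRegex == null ? null : Pattern.compile(includeRegex);
        this.exclude = excludeRegex == null ? null : Pattern.compile(excludeRegex);
    }

    // Assumption: full-string matching, with exclusion checked before inclusion.
    boolean accepts(String target) {
        if (exclude != null && exclude.matcher(target).matches()) {
            return false;
        }
        return include == null || include.matcher(target).matches();
    }
}
```

With the example values from the table, `new UrlFilter(".*\\.pdf$", ".*temp.*")` would accept a PDF URL unless the URL also contains `temp`.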
Parameter | Description | Default | Notes |
---|---|---|---|
team_id | Specific team ID to crawl | All teams | Microsoft 365 group ID |
exclude_team_ids | Comma-separated team IDs to exclude | - | Multiple teams to skip |
include_visibility | Team visibility levels to include | All | public, private |
channel_id | Specific channel ID to crawl | All channels | Within specified team |
chat_id | Specific chat ID to crawl | - | For 1:1 or group chats |
ignore_replies | Skip reply messages | false | Process only root messages |
append_attachment | Include attachments in content | true | Append attachment text to message body |
ignore_system_events | Skip system event messages | true | Filter out system notifications |
title_dateformat | Date format for message titles | yyyy/MM/dd'T'HH:mm:ss | Java date pattern |
title_timezone_offset | Timezone offset for titles | Z | e.g., +09:00, -05:00 |
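To show how `title_dateformat` and `title_timezone_offset` work together, the sketch below formats a message timestamp with `java.time`. The `MessageTitle` class and the `"<sender> <timestamp>"` layout are assumptions; the plugin may use a different date API or title layout.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical helper: builds a title from a sender name and a formatted timestamp.
final class MessageTitle {

    // pattern is a Java date pattern; offset is a value such as "Z", "+09:00", or "-05:00".
    static String format(String sender, Instant created, String pattern, String offset) {
        DateTimeFormatter formatter = DateTimeFormatter.ofPattern(pattern, Locale.ROOT);
        String timestamp = formatter.format(created.atOffset(ZoneOffset.of(offset)));
        return sender + " " + timestamp;
    }
}
```

With the default pattern, a message created at 03:04:05 UTC renders as `12:04:05` on the same date when the offset is `+09:00`.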
Crawling Modes:
- All Teams: Leave team_id empty to crawl all accessible teams
- Specific Team: Set team_id to crawl only that team's channels
- Specific Channel: Set both team_id and channel_id
- Chat Messages: Set chat_id to crawl a specific chat (messages are aggregated)
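The four crawling modes can be read as a small decision table over `team_id`, `channel_id`, and `chat_id`. The sketch below is one possible reading: the `TeamsCrawlMode` class is hypothetical, and checking `chat_id` before `team_id` is an assumption, since the precedence when both are set is not stated here.

```java
import java.util.Map;

// Hypothetical resolver mapping the Teams parameters to a crawling mode.
final class TeamsCrawlMode {
    enum Mode { ALL_TEAMS, SPECIFIC_TEAM, SPECIFIC_CHANNEL, CHAT }

    // Assumption: chat_id is checked first; the plugin's actual precedence may differ.
    static Mode resolve(Map<String, String> params) {
        if (isSet(params, "chat_id")) {
            return Mode.CHAT;
        }
        if (isSet(params, "team_id")) {
            return isSet(params, "channel_id") ? Mode.SPECIFIC_CHANNEL : Mode.SPECIFIC_TEAM;
        }
        return Mode.ALL_TEAMS;
    }

    private static boolean isSet(Map<String, String> params, String key) {
        String value = params.get(key);
        return value != null && !value.trim().isEmpty();
    }
}
```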
Parameter | Description | Default | Notes |
---|---|---|---|
site_note_crawler | Enable crawling of site notebooks | true | Crawls notebooks at the root SharePoint site |
user_note_crawler | Enable crawling of user notebooks | true | Crawls personal OneNote notebooks for licensed users |
group_note_crawler | Enable crawling of group notebooks | true | Crawls shared notebooks in Microsoft 365 groups |
number_of_threads | Number of processing threads | 1 | Controls concurrent notebook processing |
Parameter | Description | Default | Notes |
---|---|---|---|
max_content_length | Maximum content length in bytes | -1 (unlimited) | Set size limit for file content |
ignore_folder | Skip folder documents | true | Process files only, ignore folders |
supported_mimetypes | Supported MIME types pattern | .* | Regex pattern for supported file types |
drive_id | Specific drive ID to crawl | - | If specified, only crawls this drive |
shared_documents_drive_crawler | Enable shared documents crawling | true | Crawl default user's OneDrive |
user_drive_crawler | Enable user drives crawling | true | Crawl all licensed users' drives |
group_drive_crawler | Enable group drives crawling | true | Crawl Microsoft 365 group drives |
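A minimal sketch of how `max_content_length` and `supported_mimetypes` could gate content extraction follows. The `FileContentPolicy` class is hypothetical, and treating an oversized file as "skip extraction" rather than "truncate" is an assumption, as is full-string regex matching on the MIME type.

```java
import java.util.regex.Pattern;

// Hypothetical policy deciding whether a file's content should be extracted.
final class FileContentPolicy {
    private final long maxContentLength;      // -1 means unlimited
    private final Pattern supportedMimetypes; // regex applied to the MIME type

    FileContentPolicy(long maxContentLength, String supportedMimetypesRegex) {
        this.maxContentLength = maxContentLength;
        this.supportedMimetypes = Pattern.compile(supportedMimetypesRegex);
    }

    // Assumption: files above the limit are skipped entirely rather than truncated.
    boolean shouldExtract(String mimetype, long sizeInBytes) {
        if (maxContentLength >= 0 && sizeInBytes > maxContentLength) {
            return false;
        }
        return supportedMimetypes.matcher(mimetype).matches();
    }
}
```

The defaults in the table (`-1` and `.*`) correspond to a policy that accepts every file regardless of size or type.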
Parameter | Description | Default | Notes |
---|---|---|---|
site_id | Specific site ID to crawl | All sites | Can be site URL or GUID |
exclude_site_id | Site IDs to exclude | - | See format guide below |
site_type_filter | Filter by type | - | root, subsite |
ignore_system_libraries | Skip system libraries | true | Excludes Form Templates, etc. |
ignore_folder | Skip folder documents | true | Index folder structure |
SharePoint site IDs contain commas as part of their format (hostname,siteCollectionId,siteId). To properly exclude sites:
- Single SharePoint site: Use the full site ID as-is
  exclude_site_id=site1.sharepoint.com,686d3f1a-a383-4367-b5f5-93b99baabcf3,12048306-4e53-420e-bd7c-31af611f6d8a
- Multiple SharePoint sites: Separate with semicolons (;)
  exclude_site_id=site1.sharepoint.com,guid1,guid1;site2.sharepoint.com,guid2,guid2
- Legacy simple IDs: Comma-separated (for backward compatibility)
  exclude_site_id=site1,site2,site3
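The three formats above can be told apart with a short parsing routine. The sketch below is an assumption about how such parsing might work, not the plugin's code: the `ExcludeSiteIds` class is hypothetical, and recognizing a single full site ID by checking for two trailing GUIDs is a heuristic chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical parser for the exclude_site_id value.
final class ExcludeSiteIds {
    private static final Pattern GUID =
            Pattern.compile("[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}");

    static List<String> parse(String value) {
        List<String> ids = new ArrayList<>();
        if (value == null || value.trim().isEmpty()) {
            return ids;
        }
        if (value.contains(";")) {
            addAll(ids, value.split(";")); // semicolons separate full site IDs
            return ids;
        }
        String[] parts = value.split(",");
        boolean singleFullId = parts.length == 3
                && GUID.matcher(parts[1].trim()).matches()
                && GUID.matcher(parts[2].trim()).matches();
        if (singleFullId) {
            ids.add(value.trim()); // hostname,siteCollectionId,siteId kept as one ID
        } else {
            addAll(ids, parts);    // legacy comma-separated simple IDs
        }
        return ids;
    }

    private static void addAll(List<String> ids, String[] parts) {
        for (String part : parts) {
            if (!part.trim().isEmpty()) {
                ids.add(part.trim());
            }
        }
    }
}
```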
Parameter | Description | Default | Notes |
---|---|---|---|
site_id | SharePoint site ID containing lists | Required | Full site ID format: hostname,siteCollectionId,siteId |
list_id | Specific list ID to crawl | All lists | If specified, only this list will be crawled |
exclude_list_id | Comma-separated list IDs to exclude | - | Multiple list IDs separated by commas |
list_template_filter | Filter by list template types | - | Comma-separated template IDs (e.g., 100,101) |
ignore_system_lists | Skip system lists | true | Excludes lists like User Information, Workflow Tasks |
ignore_error | Continue crawling on errors | false | Set to true to skip failed items |
include_pattern | Regex pattern for item titles to include | - | Filter items by title matching |
exclude_pattern | Regex pattern for item titles to exclude | - | Skip items with matching titles |
number_of_threads | Number of processing threads | 1 | Concurrent list processing |
default_permissions | Default role assignments | - | Additional permissions for all items |
Recent Improvements: SharePoint List crawling now includes enhanced statistical tracking, improved error handling with configurable failure recording, comprehensive URL filtering support, and robust permission processing to ensure secure and efficient list item indexing.
- Language: Java 21
- Build Tool: Maven 3.8+
- Framework: Fess Data Store (LastaFlute/DBFlute)
- API Client: Microsoft Graph SDK v6
- Authentication: Azure Identity SDK
- Testing: UTFlute with JUnit 4
- Dependency Management: Maven Shade Plugin with relocation
src/
├── main/java/org/codelibs/fess/ds/ms365/
│   ├── Microsoft365DataStore.java          # Abstract base class
│   ├── OneDriveDataStore.java              # OneDrive implementation
│   ├── OneNoteDataStore.java               # OneNote implementation
│   ├── TeamsDataStore.java                 # Teams implementation
│   ├── SharePointDocLibDataStore.java      # SharePoint doc libs
│   ├── SharePointListDataStore.java        # SharePoint lists
│   └── client/
│       └── Microsoft365Client.java         # Graph API wrapper
├── main/resources/
│   └── fess_ds++.xml                       # DI configuration
└── test/java/org/codelibs/fess/ds/ms365/   # Test classes
# Clean build
mvn clean package
# Run tests
mvn test
# Run specific test
mvn test -Dtest=OneDriveDataStoreTest
# Format code
mvn formatter:format
# Build without tests (faster)
mvn clean package -DskipTests
- Prerequisites:
  # Check Java version
  java -version  # Should be 21+
  # Check Maven version
  mvn -version   # Should be 3.8+
- Clone and Setup:
  git clone https://github.com/codelibs/fess-ds-microsoft365.git
  cd fess-ds-microsoft365
  # Install parent POM
  git clone https://github.com/codelibs/fess-parent.git
  cd fess-parent && mvn install -Dgpg.skip=true && cd ..
  # Build project
  mvn clean compile
- IDE Setup:
  - Import as Maven project
  - Set Java 21 as project SDK
  - Enable annotation processing
  - Use the Eclipse formatter config in src/config/eclipse/formatter/
The project uses UTFlute framework with mock Microsoft Graph API responses:
# Run all tests
mvn test
# Test specific data store
mvn test -Dtest=OneDriveDataStoreTest
mvn test -Dtest=Microsoft365ClientTest
# Test with debug output
mvn test -X -Dtest=SharePointDocLibDataStoreTest
- Code Style: Use the provided Eclipse formatter configuration
- Testing: Write tests for new functionality using existing patterns
- Documentation: Update README and JavaDocs for API changes
- Versioning: Follow semantic versioning for releases
- Pull Requests: Ensure CI passes before submitting
The project uses Maven Shade Plugin to bundle Microsoft Graph SDK dependencies with package relocation to avoid conflicts:
- Relocated Packages: io.netty.* → org.codelibs.fess.ds.ms365.netty.*
- Bundled Libraries: Azure SDK, Microsoft Graph SDK v6, Reactor Netty
- Provided Dependencies: Fess framework, OpenSearch, Jakarta APIs
// Fess search script mapping for OneDrive
title=file.name
content=file.description + "\n" + file.contents
mimetype=file.mimetype
created=file.created
last_modified=file.last_modified
url=file.web_url
role=file.roles
# SharePoint list crawling with filtering
site_id=contoso.sharepoint.com,686d3f1a-a383-4367-b5f5-93b99baabcf3,12048306-4e53-420e-bd7c-31af611f6d8a
# Generic lists (100) and document libraries (101)
list_template_filter=100,101
ignore_system_lists=true
include_pattern=.*Important.*
exclude_pattern=.*Draft.*
ignore_error=false
number_of_threads=2
default_permissions={role}sharepoint-users
# Crawl specific list only
# list_id=12345678-1234-1234-1234-123456789abc
# Exclude multiple lists
# exclude_list_id=list1-id,list2-id,list3-id
// Teams message indexing script
title=message.title
content=message.content
created=message.created_date_time
last_modified=message.last_modified_date_time
url=message.web_url
role=message.roles
// Access additional fields
team_name=team.displayName
channel_name=channel.displayName
sender=message.from.user.displayName
# Crawl specific team with filters
team_id=12345678-1234-1234-1234-123456789abc
ignore_replies=true
ignore_system_events=true
append_attachment=true
number_of_threads=2
# Exclude multiple teams
exclude_team_ids=team1-id,team2-id,team3-id
include_visibility=public
Authentication Errors
Solution: Verify Azure AD app permissions and admin consent
- Check tenant ID format
- Ensure client secret hasn't expired
- Verify API permissions are granted
Rate Limiting
Solution: Adjust threading and implement backoff
- Reduce number_of_threads parameter
- Enable ignore_error to continue on throttling
- Monitor Microsoft Graph API limits
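For custom tooling that calls Microsoft Graph alongside the crawler, backoff can be sketched as below. This is a generic exponential-backoff illustration, not something the plugin is documented to do: the `Backoff` class and its parameters are hypothetical, and a real client would normally retry only on throttling responses (HTTP 429) and honor the `Retry-After` header when present.

```java
import java.util.function.Supplier;

// Hypothetical helper: retries a call with exponentially growing, capped delays.
final class Backoff {

    // Delay for a 0-based attempt: baseMillis * 2^attempt, capped at maxMillis.
    static long delayMillis(int attempt, long baseMillis, long maxMillis) {
        long delay = baseMillis << Math.min(attempt, 20); // bound the shift to limit growth
        return Math.min(delay, maxMillis);
    }

    // Retries on any RuntimeException; rethrows the last failure once attempts run out.
    static <T> T retry(Supplier<T> call, int maxAttempts, long baseMillis, long maxMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        RuntimeException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return call.get();
            } catch (RuntimeException e) {
                last = e;
                if (attempt < maxAttempts - 1) {
                    sleep(delayMillis(attempt, baseMillis, maxMillis));
                }
            }
        }
        throw last;
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            throw new IllegalStateException("interrupted while backing off", e);
        }
    }
}
```

Within the plugin itself, lowering `number_of_threads` remains the simplest lever for staying under the Graph API limits.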
Large Content Issues
Solution: Configure content handling
- Use exclude_pattern to skip large files
- Set max_content_length in the OneDrive data store to cap the size of extracted file content
Enable debug logging in Fess to troubleshoot issues:
<!-- Add to log4j2.xml -->
<Logger name="org.codelibs.fess.ds.ms365" level="DEBUG"/>
- Fess Documentation: https://fess.codelibs.org/
- Microsoft Graph API: https://docs.microsoft.com/en-us/graph/
- Azure AD App Registration: https://docs.microsoft.com/en-us/azure/active-directory/develop/
- Issue Tracker: https://github.com/codelibs/fess-ds-microsoft365/issues
Licensed under the Apache License 2.0. See LICENSE for details.