Commit 216d352

Merge pull request #170 from Azure/sharepoint-lists
Sharepoint lists
2 parents b03ca7c + 16b011d commit 216d352

32 files changed: +3433 −1140 lines changed

.devcontainer/devcontainer.json

Lines changed: 1 addition & 1 deletion
```diff
@@ -5,7 +5,7 @@
   },
   "features": {
     "ghcr.io/devcontainers/features/git:1": {},
-    "ghcr.io/devcontainers/features/azure-cli:1.2.7": {}
+    "ghcr.io/devcontainers/features/azure-cli:1.2.9": {}
   },
   "appPort": [80],
   "customizations": {
```

CHANGELOG.md

Lines changed: 4 additions & 0 deletions
```diff
@@ -3,6 +3,10 @@
 All notable changes to this project will be documented in this file.
 This format follows [Keep a Changelog](https://keepachangelog.com/) and adheres to [Semantic Versioning](https://semver.org/).
 
+## [TBD] – TBD
+### Added
+- Support for SharePoint Lists
+
 ## [v2.0.5] – 2025-10-02
 ### Fixed
 - Fixed SharePoint ingestion re-indexing unchanged files
```

README.md

Lines changed: 3 additions & 128 deletions
````diff
@@ -19,136 +19,11 @@ Part of the [GPT-RAG](https://github.com/Azure/gpt-rag) solution.
 
 The **GPT-RAG Data Ingestion** service automates the processing of diverse document types—such as PDFs, images, spreadsheets, transcripts, and SharePoint files—preparing them for indexing in Azure AI Search. It uses intelligent chunking strategies tailored to each format, generates text and image embeddings, and enables rich, multimodal retrieval experiences for agent-based RAG applications.
 
-## How data ingestion works
+For full documentation, visit the **[GPT-RAG documentation site](https://azure.github.io/GPT-RAG/)**.
 
-The service performs the following steps:
+## Contributing
 
-* **Scan sources**: Detects new or updated content in configured sources
-* **Process content**: Chunk and enrich data for retrieval
-* **Index documents**: Writes processed chunks into Azure AI Search
-* **Schedule execution**: Runs on a CRON-based scheduler defined by environment variables
-
-## Supported data sources
-
-- [Blob Storage](docs/blob_data_source.md)
-- [NL2SQL Metadata](docs/nl2sql_data_source.md)
-- SharePoint
-
-## Supported formats and chunkers
-
-The ingestion service selects a chunker based on the file extension, ensuring each document is processed with the most suitable method.
-
-* **`.pdf` files** — Processed by the [DocAnalysisChunker](chunking/chunkers/doc_analysis_chunker.py) using the Document Intelligence API. Structured elements such as tables and sections are extracted and converted into Markdown, then segmented with LangChain splitters. When Document Intelligence API 4.0 is enabled, `.docx` and `.pptx` files are handled the same way.
-
-* **Image files** (`.bmp`, `.png`, `.jpeg`, `.tiff`) — The [DocAnalysisChunker](chunking/chunkers/doc_analysis_chunker.py) applies OCR to extract text before chunking.
-
-* **Text-based files** (`.txt`, `.md`, `.json`, `.csv`) — Processed by the [LangChainChunker](chunking/chunkers/langchain_chunker.py), which splits content into paragraphs or sections.
-
-* **Specialized formats**:
-
-  * `.vtt` (video transcripts) — Handled by the [TranscriptionChunker](chunking/chunkers/transcription_chunker.py), which splits content by time codes.
-  * `.xlsx` (spreadsheets) — Processed by the [SpreadsheetChunker](chunking/chunkers/spreadsheet_chunker.py), chunked by rows or sheets.
-
-## How to deploy the data ingestion service
-
-### Prerequisites
-
-Before deploying the application, you must provision the infrastructure as described in the [GPT-RAG](https://github.com/azure/gpt-rag) repo. This includes creating all necessary Azure resources required to support the application runtime.
-
-<details markdown="block">
-<summary>Click to view <strong>software</strong> prerequisites</summary>
-<br>
-The machine used to customize and/or deploy the service should have:
-
-* Azure CLI: [Install Azure CLI](https://learn.microsoft.com/cli/azure/install-azure-cli)
-* Azure Developer CLI (optional, if using azd): [Install azd](https://learn.microsoft.com/en-us/azure/developer/azure-developer-cli/install-azd)
-* Git: [Download Git](https://git-scm.com/downloads)
-* Python 3.12: [Download Python 3.12](https://www.python.org/downloads/release/python-3120/)
-* Docker CLI: [Install Docker](https://docs.docker.com/get-docker/)
-* VS Code (recommended): [Download VS Code](https://code.visualstudio.com/download)
-</details>
-
-
-<details markdown="block">
-<summary>Click to view <strong>permissions</strong> requirements</summary>
-<br>
-To customize the service, your user should have the following roles:
-
-| Resource                | Role                                | Description                              |
-| :---------------------- | :---------------------------------- | :--------------------------------------- |
-| App Configuration Store | App Configuration Data Owner        | Full control over configuration settings |
-| Container Registry      | AcrPush                             | Push and pull container images           |
-| AI Search Service       | Search Index Data Contributor       | Read and write index data                |
-| Storage Account         | Storage Blob Data Contributor       | Read and write blob data                 |
-| Cosmos DB               | Cosmos DB Built-in Data Contributor | Read and write documents in Cosmos DB    |
-
-To deploy the service, assign these roles to your user or service principal:
-
-| Resource                | Role                             | Description           |
-| :---------------------- | :------------------------------- | :-------------------- |
-| App Configuration Store | App Configuration Data Reader    | Read config           |
-| Container Registry      | AcrPush                          | Push images           |
-| Azure Container App     | Azure Container Apps Contributor | Manage Container Apps |
-
-Ensure the deployment identity has these roles at the correct scope (subscription or resource group).
-
-</details>
-
-## Deployment steps
-
-Make sure you're logged in to Azure before anything else:
-
-```bash
-az login
-```
-
-### Deploying the app with azd (recommended)
-
-Initialize the template:
-```shell
-azd init -t azure/gpt-rag-ingestion
-```
-> [!IMPORTANT]
-> Use the **same environment name** with `azd init` as in the infrastructure deployment to keep components consistent.
-
-Update env variables then deploy:
-```shell
-azd env refresh
-azd deploy
-```
-> [!IMPORTANT]
-> Run `azd env refresh` with the **same subscription** and **resource group** used in the infrastructure deployment.
-
-### Deploying the app with a shell script
-
-To deploy using a script, first clone the repository, set the App Configuration endpoint, and then run the deployment script.
-
-##### PowerShell (Windows)
-
-```powershell
-git clone https://github.com/Azure/gpt-rag-ingestion.git
-$env:APP_CONFIG_ENDPOINT = "https://<your-app-config-name>.azconfig.io"
-cd gpt-rag-ingestion
-.\scripts\deploy.ps1
-```
-
-##### Bash (Linux/macOS)
-```bash
-git clone https://github.com/Azure/gpt-rag-ingestion.git
-export APP_CONFIG_ENDPOINT="https://<your-app-config-name>.azconfig.io"
-cd gpt-rag-ingestion
-./scripts/deploy.sh
-```
-
-## Previous Releases
-
-> [!NOTE]
-> For earlier versions, use the corresponding release in the GitHub repository (e.g., v1.0.0 for the initial version).
-
-
-## 🤝 Contributing
-
-We appreciate contributions! See [CONTRIBUTING](https://github.com/Azure/gpt-rag/blob/main/CONTRIBUTING.md) for guidelines on submitting pull requests.
+We welcome contributions! See the [contribution guidelines](https://azure.github.io/GPT-RAG/contributing/) for details on how to contribute.
 
 ## Trademarks
 
````

chunking/chunkers/doc_analysis_chunker.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -144,13 +144,13 @@ def _process_document_chunks(self, document):
             current_page = self._update_page(text_chunk, current_page)
             chunk_page = self._determine_chunk_page(text_chunk, current_page)
             if num_tokens >= self.minimum_chunk_size:
-                chunk_id += 1
                 chunk = self._create_chunk(
                     chunk_id=chunk_id,
                     content=text_chunk,
                     page=chunk_page
                 )
                 chunks.append(chunk)
+                chunk_id += 1
             else:
                 skipped_chunks += 1
 
```
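This same one-line reorder appears across several chunkers in this PR: `chunk_id += 1` moves from before `_create_chunk` to after the append, so the first emitted chunk now gets id 0 instead of 1. A minimal sketch of the post-PR ordering (an illustrative stand-in for the chunker loops, not the repo's code):

```python
def assign_chunk_ids(text_chunks, minimum_chunk_size=10):
    """Mimic the reordered loop: the counter advances only after a
    chunk is appended, so ids are zero-based (0, 1, 2, ...)."""
    chunks = []
    chunk_id = 0
    for text, num_tokens in text_chunks:
        if num_tokens >= minimum_chunk_size:
            chunks.append({"chunk_id": chunk_id, "content": text})
            chunk_id += 1  # increment after append (post-PR ordering)
        # chunks below the minimum size are skipped and consume no id
    return chunks

print(assign_chunk_ids([("long chunk one", 40), ("tiny", 3), ("long chunk two", 25)]))
```

With the pre-PR ordering (increment before `_create_chunk`), the same input would yield ids 1 and 2.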

chunking/chunkers/json_chunker.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -64,9 +64,9 @@ def get_chunks(self):
                 )
                 # Optionally, you might decide to leave such chunks as is,
                 # or further process them with a string splitter.
-            chunk_id += 1
             chunk_dict = self._create_chunk(chunk_id, chunk_text)
             chunk_dicts.append(chunk_dict)
+            chunk_id += 1
 
         logging.info(f"[json_chunker][{self.filename}] Created {len(chunk_dicts)} chunk(s).")
         return chunk_dicts
```

chunking/chunkers/langchain_chunker.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -81,13 +81,13 @@ def get_chunks(self):
         chunk_id = 0
         for text_chunk, num_tokens in text_chunks:
             if num_tokens >= self.minimum_chunk_size:
-                chunk_id += 1
                 chunk_size = self.token_estimator.estimate_tokens(text_chunk)
                 if chunk_size > self.max_chunk_size:
                     logging.info(f"[langchain_chunker][{self.filename}] truncating {chunk_size} size chunk to fit within {self.max_chunk_size} tokens")
                     text_chunk = self._truncate_chunk(text_chunk)
                 chunk_dict = self._create_chunk(chunk_id, text_chunk)
                 chunks.append(chunk_dict)
+                chunk_id += 1
             else:
                 skipped_chunks += 1
         logging.debug(f"[langchain_chunker][{self.filename}] {len(chunks)} chunk(s) created")
```

chunking/chunkers/multimodal_chunker.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -143,14 +143,14 @@ def _create_text_chunks(self, document):
             chunk_page = self._determine_chunk_page(text_chunk, current_page)
 
             if num_tokens >= self.minimum_chunk_size:
-                chunk_id += 1
                 chunk = self._create_chunk(
                     chunk_id=chunk_id,
                     content=text_chunk,
                     page=chunk_page,
                     offset=chunk_offset,
                 )
                 chunks.append(chunk)
+                chunk_id += 1
             else:
                 skipped_chunks += 1
 
```

chunking/chunkers/nl2sql_chunker.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -59,7 +59,6 @@ def get_chunks(self):
 
         chunk_id = 0
         for query_id, data in json_data.items():
-            chunk_id += 1
             content = json.dumps(data, indent=4, ensure_ascii=False)
             chunk_size = self.token_estimator.estimate_tokens(content)
             if chunk_size > self.max_chunk_size:
@@ -74,5 +73,6 @@ def get_chunks(self):
                 summary=None
             )
             chunks.append(chunk_dict)
+            chunk_id += 1
 
         return chunks
```
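In `nl2sql_chunker.py` the reorder spans two hunks, but the net effect on numbering is the same as elsewhere and is easy to see side by side (illustrative helpers, not the repo's code):

```python
def ids_increment_first(n):
    """Pre-PR ordering: chunk_id += 1 at the top of the loop -> ids 1..n."""
    ids, chunk_id = [], 0
    for _ in range(n):
        chunk_id += 1
        ids.append(chunk_id)
    return ids

def ids_increment_last(n):
    """Post-PR ordering: increment after append -> ids 0..n-1."""
    ids, chunk_id = [], 0
    for _ in range(n):
        ids.append(chunk_id)
        chunk_id += 1
    return ids

print(ids_increment_first(3))  # [1, 2, 3]
print(ids_increment_last(3))   # [0, 1, 2]
```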

chunking/chunkers/spreadsheet_chunker.py

Lines changed: 10 additions & 8 deletions
```diff
@@ -89,8 +89,8 @@ def get_chunks(self):
             if not self.chunking_by_row:
                 # Original behavior: Chunk per sheet
                 start_time = time.time()
-                chunk_id += 1
-                logging.debug(f"[spreadsheet_chunker][{self.filename}][get_chunks][{sheet['name']}] Starting processing chunk {chunk_id} (sheet).")
+                current_chunk_id = chunk_id
+                logging.debug(f"[spreadsheet_chunker][{self.filename}][get_chunks][{sheet['name']}] Starting processing chunk {current_chunk_id} (sheet).")
                 table_content = sheet["table"]
 
                 table_content = self._clean_markdown_table(table_content)
@@ -101,15 +101,16 @@ def get_chunks(self):
                     table_content = sheet["summary"]
 
                 chunk_dict = self._create_chunk(
-                    chunk_id=chunk_id,
+                    chunk_id=current_chunk_id,
                     content=table_content,
                     summary=sheet["summary"] if not self.chunking_by_row else "",
                     embedding_text=sheet["summary"] if (sheet["summary"] and not self.chunking_by_row) else table_content,
                     title=sheet["name"]
                 )
                 chunks.append(chunk_dict)
+                chunk_id += 1
                 elapsed_time = time.time() - start_time
-                logging.debug(f"[spreadsheet_chunker][{self.filename}][get_chunks][{sheet['name']}] Processed chunk {chunk_id} in {elapsed_time:.2f} seconds.")
+                logging.debug(f"[spreadsheet_chunker][{self.filename}][get_chunks][{sheet['name']}] Processed chunk {current_chunk_id} in {elapsed_time:.2f} seconds.")
             else:
                 # New behavior: Chunk per row
                 logging.info(f"[spreadsheet_chunker][{self.filename}][get_chunks][{sheet['name']}] Starting row-wise chunking.")
@@ -118,9 +119,9 @@ def get_chunks(self):
                 for row_index, row in enumerate(rows, start=1):
                     if not any(cell.strip() for cell in row):
                         continue
-                    chunk_id += 1
                     start_time = time.time()
-                    logging.debug(f"[spreadsheet_chunker][{self.filename}][get_chunks][{sheet['name']}] Processing chunk {chunk_id} for row {row_index}.")
+                    current_chunk_id = chunk_id
+                    logging.debug(f"[spreadsheet_chunker][{self.filename}][get_chunks][{sheet['name']}] Processing chunk {current_chunk_id} for row {row_index}.")
 
                     if self.include_header_in_chunks:
                         table = tabulate([headers, row], headers="firstrow", tablefmt="github")
@@ -140,15 +141,16 @@ def get_chunks(self):
                         embedding_text = table
 
                     chunk_dict = self._create_chunk(
-                        chunk_id=chunk_id,
+                        chunk_id=current_chunk_id,
                         content=content,
                         summary=summary,
                         embedding_text=embedding_text,
                         title=f"{sheet['name']} - Row {row_index}"
                     )
                     chunks.append(chunk_dict)
+                    chunk_id += 1
                     elapsed_time = time.time() - start_time
-                    logging.debug(f"[spreadsheet_chunker][{self.filename}][get_chunks][{sheet['name']}] Processed chunk {chunk_id} in {elapsed_time:.2f} seconds.")
+                    logging.debug(f"[spreadsheet_chunker][{self.filename}][get_chunks][{sheet['name']}] Processed chunk {current_chunk_id} in {elapsed_time:.2f} seconds.")
 
         total_elapsed_time = time.time() - total_start_time
         logging.debug(f"[spreadsheet_chunker][{self.filename}][get_chunks] Finished get_chunks. Created {len(chunks)} chunks in {total_elapsed_time:.2f} seconds.")
```
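The spreadsheet chunker adds a wrinkle on top of the reorder: the chunk's id is also referenced in a log line emitted after the counter has moved on, so it is snapshotted into `current_chunk_id` first. A minimal sketch of that pattern (hypothetical row data, not the repo's code):

```python
import logging

def chunk_rows(rows):
    """Snapshot the id before incrementing so the chunk dict and the
    'Processed chunk N' log line agree on the same number."""
    chunks = []
    chunk_id = 0
    for row_index, row in enumerate(rows, start=1):
        if not any(cell.strip() for cell in row):
            continue  # blank rows consume no id
        current_chunk_id = chunk_id
        chunks.append({"chunk_id": current_chunk_id, "row": row})
        chunk_id += 1
        # current_chunk_id still names the id this chunk actually received
        logging.debug("Processed chunk %d for row %d", current_chunk_id, row_index)
    return chunks
```

Without the snapshot, logging `chunk_id` after the increment would report a number one higher than the id stored in the chunk.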

chunking/chunkers/transcription_chunker.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -71,13 +71,13 @@ def get_chunks(self):
         text_chunks = self._chunk_document_content(text)
         chunk_id = 0
         for text_chunk in text_chunks:
-            chunk_id += 1
             chunk_size = self.token_estimator.estimate_tokens(text_chunk)
             if chunk_size > self.max_chunk_size:
                 logging.debug(f"[transcription_chunker][{self.filename}] truncating {chunk_size} size chunk to fit within {self.max_chunk_size} tokens")
                 text_chunk = self._truncate_chunk(text_chunk)
             chunk_dict = self._create_chunk(chunk_id=chunk_id, content=text_chunk, embedding_text=summary, summary=summary)
             chunks.append(chunk_dict)
+            chunk_id += 1
         return chunks
 
     def _vtt_process(self):
```
