Commit 216d352

Merge pull request #170 from Azure/sharepoint-lists
Sharepoint lists
2 parents b03ca7c + 16b011d commit 216d352

32 files changed: +3433 −1140 lines changed

.devcontainer/devcontainer.json

Lines changed: 1 addition & 1 deletion
```diff
@@ -5,7 +5,7 @@
   },
   "features": {
     "ghcr.io/devcontainers/features/git:1": {},
-    "ghcr.io/devcontainers/features/azure-cli:1.2.7": {}
+    "ghcr.io/devcontainers/features/azure-cli:1.2.9": {}
   },
   "appPort": [80],
   "customizations": {
```

CHANGELOG.md

Lines changed: 4 additions & 0 deletions
```diff
@@ -3,6 +3,10 @@
 All notable changes to this project will be documented in this file.
 This format follows [Keep a Changelog](https://keepachangelog.com/) and adheres to [Semantic Versioning](https://semver.org/).
 
+## [TBD] – TBD
+### Added
+- Support for SharePoint Lists
+
 ## [v2.0.5] – 2025-10-02
 ### Fixed
 - Fixed SharePoint ingestion re-indexing unchanged files
```

README.md

Lines changed: 3 additions & 128 deletions
````diff
@@ -19,136 +19,11 @@ Part of the [GPT-RAG](https://github.com/Azure/gpt-rag) solution.
 
 The **GPT-RAG Data Ingestion** service automates the processing of diverse document types—such as PDFs, images, spreadsheets, transcripts, and SharePoint files—preparing them for indexing in Azure AI Search. It uses intelligent chunking strategies tailored to each format, generates text and image embeddings, and enables rich, multimodal retrieval experiences for agent-based RAG applications.
 
-## How data ingestion works
+For full documentation, visit the **[GPT-RAG documentation site](https://azure.github.io/GPT-RAG/)**.
 
-The service performs the following steps:
+## Contributing
 
-* **Scan sources**: Detects new or updated content in configured sources
-* **Process content**: Chunk and enrich data for retrieval
-* **Index documents**: Writes processed chunks into Azure AI Search
-* **Schedule execution**: Runs on a CRON-based scheduler defined by environment variables
-
-## Supported data sources
-
-- [Blob Storage](docs/blob_data_source.md)
-- [NL2SQL Metadata](docs/nl2sql_data_source.md)
-- SharePoint
-
-## Supported formats and chunkers
-
-The ingestion service selects a chunker based on the file extension, ensuring each document is processed with the most suitable method.
-
-* **`.pdf` files** — Processed by the [DocAnalysisChunker](chunking/chunkers/doc_analysis_chunker.py) using the Document Intelligence API. Structured elements such as tables and sections are extracted and converted into Markdown, then segmented with LangChain splitters. When Document Intelligence API 4.0 is enabled, `.docx` and `.pptx` files are handled the same way.
-
-* **Image files** (`.bmp`, `.png`, `.jpeg`, `.tiff`) — The [DocAnalysisChunker](chunking/chunkers/doc_analysis_chunker.py) applies OCR to extract text before chunking.
-
-* **Text-based files** (`.txt`, `.md`, `.json`, `.csv`) — Processed by the [LangChainChunker](chunking/chunkers/langchain_chunker.py), which splits content into paragraphs or sections.
-
-* **Specialized formats**:
-
-  * `.vtt` (video transcripts) — Handled by the [TranscriptionChunker](chunking/chunkers/transcription_chunker.py), which splits content by time codes.
-  * `.xlsx` (spreadsheets) — Processed by the [SpreadsheetChunker](chunking/chunkers/spreadsheet_chunker.py), chunked by rows or sheets.
-
-## How to deploy the data ingestion service
-
-### Prerequisites
-
-Before deploying the application, you must provision the infrastructure as described in the [GPT-RAG](https://github.com/azure/gpt-rag) repo. This includes creating all necessary Azure resources required to support the application runtime.
-
-<details markdown="block">
-<summary>Click to view <strong>software</strong> prerequisites</summary>
-<br>
-The machine used to customize and/or deploy the service should have:
-
-* Azure CLI: [Install Azure CLI](https://learn.microsoft.com/cli/azure/install-azure-cli)
-* Azure Developer CLI (optional, if using azd): [Install azd](https://learn.microsoft.com/en-us/azure/developer/azure-developer-cli/install-azd)
-* Git: [Download Git](https://git-scm.com/downloads)
-* Python 3.12: [Download Python 3.12](https://www.python.org/downloads/release/python-3120/)
-* Docker CLI: [Install Docker](https://docs.docker.com/get-docker/)
-* VS Code (recommended): [Download VS Code](https://code.visualstudio.com/download)
-</details>
-
-
-<details markdown="block">
-<summary>Click to view <strong>permissions</strong> requirements</summary>
-<br>
-To customize the service, your user should have the following roles:
-
-| Resource                | Role                                | Description                              |
-| :---------------------- | :---------------------------------- | :--------------------------------------- |
-| App Configuration Store | App Configuration Data Owner        | Full control over configuration settings |
-| Container Registry      | AcrPush                             | Push and pull container images           |
-| AI Search Service       | Search Index Data Contributor       | Read and write index data                |
-| Storage Account         | Storage Blob Data Contributor       | Read and write blob data                 |
-| Cosmos DB               | Cosmos DB Built-in Data Contributor | Read and write documents in Cosmos DB    |
-
-To deploy the service, assign these roles to your user or service principal:
-
-| Resource                | Role                             | Description           |
-| :---------------------- | :------------------------------- | :-------------------- |
-| App Configuration Store | App Configuration Data Reader    | Read config           |
-| Container Registry      | AcrPush                          | Push images           |
-| Azure Container App     | Azure Container Apps Contributor | Manage Container Apps |
-
-Ensure the deployment identity has these roles at the correct scope (subscription or resource group).
-
-</details>
-
-## Deployment steps
-
-Make sure you're logged in to Azure before anything else:
-
-```bash
-az login
-```
-
-### Deploying the app with azd (recommended)
-
-Initialize the template:
-```shell
-azd init -t azure/gpt-rag-ingestion
-```
-> [!IMPORTANT]
-> Use the **same environment name** with `azd init` as in the infrastructure deployment to keep components consistent.
-
-Update env variables then deploy:
-```shell
-azd env refresh
-azd deploy
-```
-> [!IMPORTANT]
-> Run `azd env refresh` with the **same subscription** and **resource group** used in the infrastructure deployment.
-
-### Deploying the app with a shell script
-
-To deploy using a script, first clone the repository, set the App Configuration endpoint, and then run the deployment script.
-
-##### PowerShell (Windows)
-
-```powershell
-git clone https://github.com/Azure/gpt-rag-ingestion.git
-$env:APP_CONFIG_ENDPOINT = "https://<your-app-config-name>.azconfig.io"
-cd gpt-rag-ingestion
-.\scripts\deploy.ps1
-```
-
-##### Bash (Linux/macOS)
-```bash
-git clone https://github.com/Azure/gpt-rag-ingestion.git
-export APP_CONFIG_ENDPOINT="https://<your-app-config-name>.azconfig.io"
-cd gpt-rag-ingestion
-./scripts/deploy.sh
-```
-
-## Previous Releases
-
-> [!NOTE]
-> For earlier versions, use the corresponding release in the GitHub repository (e.g., v1.0.0 for the initial version).
-
-
-## 🤝 Contributing
-
-We appreciate contributions! See [CONTRIBUTING](https://github.com/Azure/gpt-rag/blob/main/CONTRIBUTING.md) for guidelines on submitting pull requests.
+We welcome contributions! See the [contribution guidelines](https://azure.github.io/GPT-RAG/contributing/) for details on how to contribute.
 
 ## Trademarks
 
````

chunking/chunkers/doc_analysis_chunker.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -144,13 +144,13 @@ def _process_document_chunks(self, document):
             current_page = self._update_page(text_chunk, current_page)
             chunk_page = self._determine_chunk_page(text_chunk, current_page)
             if num_tokens >= self.minimum_chunk_size:
-                chunk_id += 1
                 chunk = self._create_chunk(
                     chunk_id=chunk_id,
                     content=text_chunk,
                     page=chunk_page
                 )
                 chunks.append(chunk)
+                chunk_id += 1
             else:
                 skipped_chunks += 1
 
```
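This same one-line reorder appears across several chunkers in this PR: `chunk_id += 1` moves from before `_create_chunk` to after the append, so the first emitted chunk now gets id 0 instead of 1. A minimal sketch of the post-PR ordering (an illustrative stand-in for the chunker loops, not the repo's code):

```python
def assign_chunk_ids(text_chunks, minimum_chunk_size=10):
    """Mimic the reordered loop: the counter advances only after a
    chunk is appended, so ids are zero-based (0, 1, 2, ...)."""
    chunks = []
    chunk_id = 0
    for text, num_tokens in text_chunks:
        if num_tokens >= minimum_chunk_size:
            chunks.append({"chunk_id": chunk_id, "content": text})
            chunk_id += 1  # increment after append (post-PR ordering)
        # chunks below the minimum size are skipped and consume no id
    return chunks

print(assign_chunk_ids([("long chunk one", 40), ("tiny", 3), ("long chunk two", 25)]))
```

With the pre-PR ordering (increment before `_create_chunk`), the same input would yield ids 1 and 2.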

chunking/chunkers/json_chunker.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -64,9 +64,9 @@ def get_chunks(self):
                 )
                 # Optionally, you might decide to leave such chunks as is,
                 # or further process them with a string splitter.
-            chunk_id += 1
             chunk_dict = self._create_chunk(chunk_id, chunk_text)
             chunk_dicts.append(chunk_dict)
+            chunk_id += 1
 
         logging.info(f"[json_chunker][{self.filename}] Created {len(chunk_dicts)} chunk(s).")
         return chunk_dicts
```

chunking/chunkers/langchain_chunker.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -81,13 +81,13 @@ def get_chunks(self):
         chunk_id = 0
         for text_chunk, num_tokens in text_chunks:
             if num_tokens >= self.minimum_chunk_size:
-                chunk_id += 1
                 chunk_size = self.token_estimator.estimate_tokens(text_chunk)
                 if chunk_size > self.max_chunk_size:
                     logging.info(f"[langchain_chunker][{self.filename}] truncating {chunk_size} size chunk to fit within {self.max_chunk_size} tokens")
                     text_chunk = self._truncate_chunk(text_chunk)
                 chunk_dict = self._create_chunk(chunk_id, text_chunk)
                 chunks.append(chunk_dict)
+                chunk_id += 1
             else:
                 skipped_chunks += 1
         logging.debug(f"[langchain_chunker][{self.filename}] {len(chunks)} chunk(s) created")
```

chunking/chunkers/multimodal_chunker.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -143,14 +143,14 @@ def _create_text_chunks(self, document):
             chunk_page = self._determine_chunk_page(text_chunk, current_page)
 
             if num_tokens >= self.minimum_chunk_size:
-                chunk_id += 1
                 chunk = self._create_chunk(
                     chunk_id=chunk_id,
                     content=text_chunk,
                     page=chunk_page,
                     offset=chunk_offset,
                 )
                 chunks.append(chunk)
+                chunk_id += 1
             else:
                 skipped_chunks += 1
 
```

chunking/chunkers/nl2sql_chunker.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -59,7 +59,6 @@ def get_chunks(self):
 
         chunk_id = 0
         for query_id, data in json_data.items():
-            chunk_id += 1
             content = json.dumps(data, indent=4, ensure_ascii=False)
             chunk_size = self.token_estimator.estimate_tokens(content)
             if chunk_size > self.max_chunk_size:
@@ -74,5 +73,6 @@ def get_chunks(self):
                 summary=None
             )
             chunks.append(chunk_dict)
+            chunk_id += 1
 
         return chunks
```
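In `nl2sql_chunker.py` the reorder spans two hunks, but the net effect on numbering is the same as elsewhere and is easy to see side by side (illustrative helpers, not the repo's code):

```python
def ids_increment_first(n):
    """Pre-PR ordering: chunk_id += 1 at the top of the loop -> ids 1..n."""
    ids, chunk_id = [], 0
    for _ in range(n):
        chunk_id += 1
        ids.append(chunk_id)
    return ids

def ids_increment_last(n):
    """Post-PR ordering: increment after append -> ids 0..n-1."""
    ids, chunk_id = [], 0
    for _ in range(n):
        ids.append(chunk_id)
        chunk_id += 1
    return ids

print(ids_increment_first(3))  # [1, 2, 3]
print(ids_increment_last(3))   # [0, 1, 2]
```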

chunking/chunkers/spreadsheet_chunker.py

Lines changed: 10 additions & 8 deletions
```diff
@@ -89,8 +89,8 @@ def get_chunks(self):
             if not self.chunking_by_row:
                 # Original behavior: Chunk per sheet
                 start_time = time.time()
-                chunk_id += 1
-                logging.debug(f"[spreadsheet_chunker][{self.filename}][get_chunks][{sheet['name']}] Starting processing chunk {chunk_id} (sheet).")
+                current_chunk_id = chunk_id
+                logging.debug(f"[spreadsheet_chunker][{self.filename}][get_chunks][{sheet['name']}] Starting processing chunk {current_chunk_id} (sheet).")
                 table_content = sheet["table"]
 
                 table_content = self._clean_markdown_table(table_content)
@@ -101,15 +101,16 @@ def get_chunks(self):
                     table_content = sheet["summary"]
 
                 chunk_dict = self._create_chunk(
-                    chunk_id=chunk_id,
+                    chunk_id=current_chunk_id,
                     content=table_content,
                     summary=sheet["summary"] if not self.chunking_by_row else "",
                     embedding_text=sheet["summary"] if (sheet["summary"] and not self.chunking_by_row) else table_content,
                     title=sheet["name"]
                 )
                 chunks.append(chunk_dict)
+                chunk_id += 1
                 elapsed_time = time.time() - start_time
-                logging.debug(f"[spreadsheet_chunker][{self.filename}][get_chunks][{sheet['name']}] Processed chunk {chunk_id} in {elapsed_time:.2f} seconds.")
+                logging.debug(f"[spreadsheet_chunker][{self.filename}][get_chunks][{sheet['name']}] Processed chunk {current_chunk_id} in {elapsed_time:.2f} seconds.")
             else:
                 # New behavior: Chunk per row
                 logging.info(f"[spreadsheet_chunker][{self.filename}][get_chunks][{sheet['name']}] Starting row-wise chunking.")
@@ -118,9 +119,9 @@ def get_chunks(self):
                 for row_index, row in enumerate(rows, start=1):
                     if not any(cell.strip() for cell in row):
                         continue
-                    chunk_id += 1
                     start_time = time.time()
-                    logging.debug(f"[spreadsheet_chunker][{self.filename}][get_chunks][{sheet['name']}] Processing chunk {chunk_id} for row {row_index}.")
+                    current_chunk_id = chunk_id
+                    logging.debug(f"[spreadsheet_chunker][{self.filename}][get_chunks][{sheet['name']}] Processing chunk {current_chunk_id} for row {row_index}.")
 
                     if self.include_header_in_chunks:
                         table = tabulate([headers, row], headers="firstrow", tablefmt="github")
@@ -140,15 +141,16 @@ def get_chunks(self):
                         embedding_text = table
 
                     chunk_dict = self._create_chunk(
-                        chunk_id=chunk_id,
+                        chunk_id=current_chunk_id,
                         content=content,
                         summary=summary,
                         embedding_text=embedding_text,
                         title=f"{sheet['name']} - Row {row_index}"
                     )
                     chunks.append(chunk_dict)
+                    chunk_id += 1
                     elapsed_time = time.time() - start_time
-                    logging.debug(f"[spreadsheet_chunker][{self.filename}][get_chunks][{sheet['name']}] Processed chunk {chunk_id} in {elapsed_time:.2f} seconds.")
+                    logging.debug(f"[spreadsheet_chunker][{self.filename}][get_chunks][{sheet['name']}] Processed chunk {current_chunk_id} in {elapsed_time:.2f} seconds.")
 
         total_elapsed_time = time.time() - total_start_time
         logging.debug(f"[spreadsheet_chunker][{self.filename}][get_chunks] Finished get_chunks. Created {len(chunks)} chunks in {total_elapsed_time:.2f} seconds.")
```
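The spreadsheet chunker adds a wrinkle on top of the reorder: the chunk's id is also referenced in a log line emitted after the counter has moved on, so it is snapshotted into `current_chunk_id` first. A minimal sketch of that pattern (hypothetical row data, not the repo's code):

```python
import logging

def chunk_rows(rows):
    """Snapshot the id before incrementing so the chunk dict and the
    'Processed chunk N' log line agree on the same number."""
    chunks = []
    chunk_id = 0
    for row_index, row in enumerate(rows, start=1):
        if not any(cell.strip() for cell in row):
            continue  # blank rows consume no id
        current_chunk_id = chunk_id
        chunks.append({"chunk_id": current_chunk_id, "row": row})
        chunk_id += 1
        # current_chunk_id still names the id this chunk actually received
        logging.debug("Processed chunk %d for row %d", current_chunk_id, row_index)
    return chunks
```

Without the snapshot, logging `chunk_id` after the increment would report a number one higher than the id stored in the chunk.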

chunking/chunkers/transcription_chunker.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -71,13 +71,13 @@ def get_chunks(self):
         text_chunks = self._chunk_document_content(text)
         chunk_id = 0
         for text_chunk in text_chunks:
-            chunk_id += 1
             chunk_size = self.token_estimator.estimate_tokens(text_chunk)
             if chunk_size > self.max_chunk_size:
                 logging.debug(f"[transcription_chunker][{self.filename}] truncating {chunk_size} size chunk to fit within {self.max_chunk_size} tokens")
                 text_chunk = self._truncate_chunk(text_chunk)
             chunk_dict = self._create_chunk(chunk_id=chunk_id, content=text_chunk, embedding_text=summary, summary=summary)
             chunks.append(chunk_dict)
+            chunk_id += 1
         return chunks
 
     def _vtt_process(self):
```
